Remembering The Fallen
Those who know me are aware that I have always had great respect for people who have served or are serving in the military. I thought it would be a nice project to create a Twitter bot that would tweet, on any given day, all the US servicemembers killed in action on that date in prior years. You can view it on Twitter or follow @thedailyfallen directly.
I’ve had questions before about simple Data Science, Machine Learning, and software development projects, so here is a good example of just such a project.
Some people might say there isn’t much Data Science going on here. FALSE! Any experienced data scientist will tell you that the majority of the work is acquiring, cleaning, and transforming the data. Often there is little or no actual machine learning happening in a successful Data Science project; projects succeed because they make the business more successful, and sometimes that requires zero machine learning. In this project, probably 80% to 90% of the time was spent finding, cleaning, and transforming the data, which in my experience is par for the course in ANY Data Science project.
Excuse the digression.
Acquire data
This wasn’t as quick as I thought it would be, but it never is.
The Department of Defense website wasn’t very helpful, but after some searching I found some suitable data from the National Archives.
The data is in Defense Casualty Analysis System (DCAS) format, and includes all the service members who have died from 1950 through 2005, with another file containing deaths for 2006. The files are pipe-delimited and the fields are pretty well documented in the associated PDF.
They are also very clear on the contents of the file(s):
This file contains the records of U.S. military personnel casualties for deaths between the years 1950 and 2005, and two records of deaths related to Vietnam that occurred in 2006. The casualties occurred worldwide and resulted from both hostile and non-hostile action. The war or conflict for each casualty is identified as occurring during the Korean War, Vietnam War, Gulf War, War on Terrorism, or Peacetime. Because of the broad nature of the category of Peacetime, several fields further describe those casualties in terms of location, circumstances, category, and reason. Each record includes such information as: the service member’s name, service number, service branch, rank, pay grade, occupation, birth date, gender, home of record (city, county, state or province, country), marital status, religion, race, ethnicity, casualty circumstances, casualty location (city, state or province, country or over water), unit, duty, process date, death date, war or conflict, operation incident type, aircraft type, hostile or non-hostile death indicator, casualty type, and casualty category. Since this is the public use version of the file, the service number field is masked.
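Before doing any cleaning, it’s worth eyeballing a few raw records to confirm the delimiter and field count against the PDF. A minimal sketch, assuming the file is saved as allpuf.dat (the name used in the preprocessing script below):

import pandas as pd

# Peek at the first few pipe-delimited records; the raw file has no header row
sample = pd.read_csv('allpuf.dat', sep='|', header=None, low_memory=False, nrows=5)
print(sample.shape)    # (rows read, fields per record)
print(sample.head())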
Clean and transform data
There were some records with errors in the file, as is nearly always the case. After taking only the records listed as KILLED IN ACTION, I further filtered out the ones where the date of death wasn’t 8 characters long, since the file represents death date as an 8-character string in yyyymmdd format.
After that it was a matter of constructing the tweet string I wanted, which was a concatenation of rank, name, KILLED IN ACTION, location of death, and year of death. The month and day are always the same as the current date and only the year differs, so I didn’t include them in the tweet text. Additionally, I split out the day and month of death so that I could match on them later to get all the records for a particular month and day.
After these steps I simply output the tweet text, day of death, and month of death columns in a new pipe-delimited file.
The preprocessing code is pretty standard fare. I’m sure it can be made cleaner and/or more efficient, but it’s not needed for this data set or project.
import pandas as pd

# Silence SettingWithCopyWarning; we are intentionally working on a slice
pd.options.mode.chained_assignment = None

# The raw DCAS file is pipe-delimited with no header row
data = pd.read_csv('allpuf.dat', sep='|', low_memory=False, header=None)

# Keep only the records listed as KILLED IN ACTION
kia = data[(data[44] == 'KILLED IN ACTION')]
kia.rename(columns={7: 'rank', 34: 'dod', 35: 'yod', 30: 'countryOfDeath'}, inplace=True)

# Drop records whose date of death is not an 8-character yyyymmdd string
kia['dod'] = kia['dod'].astype(str)
kia['dodlen'] = kia['dod'].apply(lambda x: len(str(x)))
kia = kia[kia['dodlen'] == 8]

# Reformat the name as "LAST, FIRST MIDDLE"
kia['name'] = kia[4].apply(lambda x: x.split(' ')[0] + ', ' + ' '.join(x.split(' ')[1:]))

# Split out year, day, and month of death for matching later
kia['yod'] = kia['yod'].astype(str)
kia['dayDeath'] = kia['dod'].apply(lambda x: int(str(x)[6:]))
kia['monthDeath'] = kia['dod'].apply(lambda x: int(str(x)[4:6]))

# Build the tweet text: rank, name, KILLED IN ACTION, location, (year)
kia['tweet'] = kia['rank'] + " " + kia['name'] + ", " + "KILLED IN ACTION, " + kia['countryOfDeath'] + ". " + "(" + kia['yod'] + ")"

# Write out only what the bot needs; keep the header row so the bot's
# read_csv picks up the column names
toTweet = kia[['tweet', 'dayDeath', 'monthDeath']]
toTweet.to_csv('kia.tweets.dat', sep='|', index=False)
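As a quick sanity check on the output, you can read the file back and count the records for a given day; the date here is arbitrary:

import pandas as pd

check = pd.read_csv('kia.tweets.dat', sep='|')
# How many records would be tweeted on, say, June 6?
print(len(check[(check['dayDeath'] == 6) & (check['monthDeath'] == 6)]))
# And what one of the tweet strings looks like
print(check['tweet'].iloc[0])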
Prototype the bot
Before going through all the Twitter account setup and API access steps, it’s easier to prototype things by printing the tweets to STDOUT. The main idea is to load in the file, get the current day and month, take all the deaths matching the current day and month, and then calculate the delay between them so that all can be tweeted before the end of the day. I also put in a buffer of two minutes, so that the script would finish no later than 2358. I do not account for things like rate limiting, restarting the service should it fail, exception handling, and so on. Out of scope for this project.
import sys
import time
import math
import pandas as pd
from datetime import datetime, timedelta

def main():
    # Tweet file produced by the preprocessing step, passed on the command line
    kia = pd.read_csv(sys.argv[1], sep='|')
    now = datetime.now()

    # Seconds left until midnight, minus a two-minute buffer so we finish by 2358.
    # Using timedelta avoids an invalid date on the last day of a month.
    buffer_seconds = 120
    midnight = datetime(now.year, now.month, now.day) + timedelta(days=1)
    remaining_secs = (midnight - datetime.now()).total_seconds() - buffer_seconds

    # All the records matching today's day and month
    tweets = kia[(kia['dayDeath'] == now.day) & (kia['monthDeath'] == now.month)]['tweet'].values
    if len(tweets) == 0:
        return

    # Space the tweets evenly over the rest of the day
    delay = math.floor(remaining_secs / len(tweets))
    for tweet in tweets:
        print(tweet)
        time.sleep(delay)

if __name__ == '__main__':
    main()
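Running the prototype is then just a matter of passing the data file on the command line, something like python your_script.py kia.tweets.dat (the script name is whatever you saved it as), and the tweets should start printing to the terminal.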
Get Twitter API credentials
Now it’s time to set up a new Twitter account and get API access. The hardest part is finding an available handle on Twitter. Really. It will probably take you longer to settle on an available name than it will to get your API keys.
Note that you must provide a phone number in order for Twitter to generate all the tokens you need.
Set up a new Twitter account, making sure to include a phone number
Go to https://dev.twitter.com and set up a new application
After the application is set up, copy the Consumer Key and Consumer Secret strings
Click the button to generate an access token
Copy the Token and Token Secret strings
Get the bot tweeting
The first step is to pip install twython, a Python library for Twitter which allows tweeting (amongst other things) from any Python script. It’s straightforward to use and flexible as well, as you can see from the Basic Usage page.
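Before wiring the credentials into the bot, a quick way to confirm they work is to ask Twitter who you are. A minimal sketch, with the four placeholder strings standing in for your actual keys:

from twython import Twython

# Paste in the four strings generated on dev.twitter.com
twitter = Twython('YOUR CONSUMER KEY HERE', 'YOUR CONSUMER SECRET HERE',
                  'YOUR TOKEN HERE', 'YOUR TOKEN SECRET HERE')

# verify_credentials() returns the authenticated account's profile;
# a screen_name coming back means the keys are good
print(twitter.verify_credentials()['screen_name'])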
Then, simply add in the necessary credentials, set up the Twitter connection object, and post an actual tweet instead of the print we had before.
import sys
import time
import math
import pandas as pd
from twython import Twython
from datetime import datetime, timedelta

consumer_key = 'YOUR CONSUMER KEY HERE'
consumer_secret = 'YOUR CONSUMER SECRET HERE'
token = 'YOUR TOKEN HERE'
token_secret = 'YOUR TOKEN SECRET HERE'

twitter = Twython(consumer_key, consumer_secret, token, token_secret)

def main():
    # Tweet file produced by the preprocessing step, passed on the command line
    kia = pd.read_csv(sys.argv[1], sep='|')
    now = datetime.now()

    # Seconds left until midnight, minus a two-minute buffer so we finish by 2358.
    # Using timedelta avoids an invalid date on the last day of a month.
    buffer_seconds = 120
    midnight = datetime(now.year, now.month, now.day) + timedelta(days=1)
    remaining_secs = (midnight - datetime.now()).total_seconds() - buffer_seconds

    # All the records matching today's day and month
    tweets = kia[(kia['dayDeath'] == now.day) & (kia['monthDeath'] == now.month)]['tweet'].values
    if len(tweets) == 0:
        return

    # Space the tweets evenly over the rest of the day
    delay = math.floor(remaining_secs / len(tweets))
    for tweet in tweets:
        twitter.update_status(status=tweet)
        time.sleep(delay)

if __name__ == '__main__':
    main()
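One way to keep the keys out of the script itself, which matters if the code ever ends up on GitHub (see below), is to read them from environment variables instead of hardcoding them. A minimal sketch; the variable names here are just examples, use whatever you exported on the machine running the bot:

import os

# Replace the hardcoded strings above with lookups like these
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
token = os.environ['TWITTER_TOKEN']
token_secret = os.environ['TWITTER_TOKEN_SECRET']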
This will have our bot tweeting all the required death records, and it should finish no later than 2358 on any given day.
Deploy to a virtual machine somewhere
Since I do my work on a laptop which isn’t always running, I keep the bot going on a virtual machine with Amazon Web Services. I won’t go into details here of getting an EC2 instance up and running on AWS, but once you have that you can just SFTP or SCP the bot and the data file up to the machine.
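For example, something like scp bot.py kia.tweets.dat user@your-ec2-host:~/ copies both files up in one shot (the file names and host here are placeholders for whatever yours are called).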
Additionally, since the bot uses Pandas, rather than bothering with pip and dependencies and so on, I did what I always do for Python and just installed Anaconda from Continuum Analytics. If you aren’t using Anaconda for your Python work, life is probably harder for you than it needs to be.
Also, if you are still using Python 2 instead of Python 3…
Add scheduling
For something like this, the easiest way to schedule it is to just add a cron job to run the bot every day at midnight, and it will then finish before 2358 on the same day.
So crontab -e and we just add an entry like 00 00 * * * /path/to/your/python /path/to/your/script /path/to/your/input/file
Now the bot will start every day at midnight, and finish by 2358, only to be restarted for a new day a few minutes later.
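One optional tweak, not something the entry above does: redirect the script’s output to a log file, e.g. 00 00 * * * /path/to/your/python /path/to/your/script /path/to/your/input/file >> /path/to/bot.log 2>&1, so there is something to look at if a run dies partway through the day.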
Fin
If you want to do some quick Data Science projects, I’d say this is a pretty representative and straightforward example. Additionally, though I haven’t gotten around to it yet, putting the code on GitHub is usually a good idea. If you decide to do that, DO NOT FORGET to remove your API credentials before you commit and push the code.
So that’s it. Gathering data and building Twitter bots isn’t any kind of magic, although bots which interact with others are a different matter.
Now when I pull up my Twitter feed I get a nice reminder of those who gave their lives in service of the interests of the United States.
Remember.