Data Processing Workflows


Chicago School of Data Days - Sept. 20, 2014

David Eads @eads | NPR Visuals @nprviz

Based on an NPR Visuals blog post


Slides: http://recoveredfactory.net/data-workflow

Presentation source: github.com/eads/data-workflow | Made with Tarbell

Case study

Defense Logistics Agency "1033" program provides military gear to local law enforcement agencies

End product

MRAPs And Bayonets: What We Know About The Pentagon's 1033 Program
by Arezou Rezvani, Jessica Pupovac, David Eads and Tyler Fisher

Automate everything

Make a workflow

  • Preserve your source data
  • Create scripts to carry out repeatable transformations
  • Import into a relational database
  • Join with other data sets
  • Manage processing code with version control
  • Create scripts to generate and export analysis and summarization

Control script


#!/bin/bash

echo 'IMPORT DATA'
echo '-----------'
./import.sh

echo 'CREATE SUMMARY FILES'
echo '--------------------'
./summarize.sh

echo 'EXPORT PROCESSED DATA'
echo '---------------------'
./export.sh

Use version control

Work in public

Fields

  • State
  • County
  • National Supply Number: a standardized categorization system for equipment
  • Quantity
  • Units: A description of the unit to use for the item (e.g. "each" or "square feet")
  • Acquisition cost: The per-unit cost of the item when purchased by the military
  • Ship date: When the item was shipped to a law enforcement agency

Clean the data:
Convert Excel data to CSV with Python

  • Parse the date field, which represents dates in two different formats
  • Strip out extra spaces from any strings (of which there are many)
  • Split the National Stock Number into two additional fields:
    • First two digits represent the "federal supply group" (e.g. "WEAPONS").
    • First four digits represent the "federal supply class" (e.g. "Guns, through 30 mm").


# NSN parsing example
if header == 'nsn':
    nsn_parts = cell_value.split('-')
    row_dict['federal_supply_class'] = nsn_parts[0]

Import data to PostgreSQL

  • Import the cleaned CSVs generated from the source data
  • Import a "FIPS crosswalk" CSV: Maps county name and state to the Federal Information Processing Standard identifier to identify counties.
  • Import a CSV file with Federal Supply Codes
  • Import 5 year county population estimates from the US Census Bureau's American Community Survey


# Example import statement
echo "Import FIPS crosswalk"
psql leso -c "CREATE TABLE fips (
  county varchar,
  state varchar,
  fips varchar
);"
psql leso -c "COPY fips FROM '`pwd`/src/fips_crosswalk.csv' DELIMITER ',' CSV HEADER;"

Agency import: csvkit magic

  • Use csvkit's in2csv command to extract each sheet
  • Use csvkit's csvstack command to combine the sheets and add a grouping column
  • Use csvkit's csvcut command to remove a pointless "row number" column
  • Import final output into Postgres database


echo "Import agency.csv"
in2csv --sheet "State Agencies" data.xlsx > state_agencies.csv
in2csv --sheet "Federal Agencies" data.xlsx > federal_agencies.csv
in2csv --sheet "Tribal Agencies" data.xlsx > tribal_agencies.csv

csvstack -n "agency_type" -g "state,federal,tribal"
  state_agencies.csv
  federal_agencies.csv
  tribal_agencies.csv | csvcut -c "1,3,4" > agencies.csv

Merge data with lookup tables

  • Census data:
    • County-name to FIPS code
    • FIPS code to census data
  • Federal supply class and federal supply group to Department of Defense lookup table

Summarize: Total cost by category


select
  c.full_name,
  c.code as federal_supply_class,
  sum((d.quantity * d.acquisition_cost)) as total_cost

from data as d

join codes as c on d.federal_supply_class = c.code

group by c.full_name, c.code

order by c.full_name

Summarize: Time series by category


select
  c.name,
  sum(quantity * acquisition_cost) as total_cost,
  extract(year from ship_date) as year

from data as d

join codes as c on d.federal_supply_category = c.code
group by c.name, year
order by year desc

Dump the data

Make corrections


Impact! (In Congress)

Rand Paul: "Mr. Estevez, in the NPR investigation of 1033 program they list that 12,000 bayonets have been given out. What purpose are bayonets being given out for?"

Impact! (For member stations)