Introduction & Background
I’m currently enrolled in the “Communicating Data Science Results” course on Coursera. Our first exercise involves examining a crime dataset and teasing out a finding (backed by 2-3 visualizations and accompanying descriptions that help illustrate our finding).
After walking through the San Francisco Summer 2014 crime dataset that was provided, I’ve decided to examine how incident resolution rates vary by hour of incident. I also examine how resolution rates are impacted by violence/non-violence associated with the incident.
Without further ado…
Finding: Crimes committed in San Francisco between 7-8am are more likely to be resolved by police than crimes committed during any other hour of the day; similarly, crimes from the 9-10pm window are least likely to be resolved.
To begin, I’ll walk through the process of how this data was collected, cleaned, analyzed and presented.
Collection: The dataset is made available here on Github. After downloading the data, I imported the CSV into Google Sheets for cleaning.
Cleaning: The original CSV data contains a column for the date on which the crime was committed, and a column for the time at which the crime was committed. Since I wanted to be able to easily analyze and query time series data in Google’s BigQuery tool, I created a calculated field for the complete datetime by doing
datetime = date + time inside a new column in Google Sheets. Beyond this step, not much cleaning needed to be done.
Analysis: The cleaned data was exported from Google Sheets and uploaded as a new table in BigQuery:
Once the data was in a table in BigQuery, I was able to write SQL queries over the data to explore and tease out insights.
Presentation: After a query was written that produced the data I wanted to chart, I piped the results over to Google’s new dashboarding product called Data Studio. Data Studio uses this connector to make it very easy to connect BigQuery results to a Data Studio dashboard. While this tool is usually (and most powerfully) used to create automated reports & dashboards, it can be effective for quick and simple analyses.
Overall Resolution Rates
To determine which hours of crime corresponded with the best/worst resolution rates, I used the following query:
select hour(datetime) as hour_of_incident, count(incident_num) as num_incidents, count(case when resolution != "NONE" then incident_num end) as num_resolved from coursera.san_francisco_summer_2014_crime group by hour_of_incident order by hour_of_incident
Inside Data Studio, I created a calculated field titled “Overall Resolution Rate” that was equal to
num_resolved / num_incidents.
Overall Resolution Rate by
hour_of_incident shows that incidents created between 7-8am are resolved more frequently than incidents created in any other hour:
We can also see that incidents created between 9-10pm are resolved less frequently than incidents created during any other hour.
Resolution Rates by Violent/Non-Violent Offense
How might this result change if we dig a little deeper into the data? One common way of slicing crime data is to separate violent offenses from non-violent offenses. The FBI’s Violent Crime page from 2011 states the following:
violent crime is composed of four offenses: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault.
In looking at San Francisco’s category coding, there are 34 unique categories of crime, and only two categories from Summer 2014 data match the FBI’s descriptions: “ASSAULT” and “ROBBERY”.
Therefore, we can modify our query to use the following CASE statement to separate violent crimes from non-violent crimes in our totals:
case when category = "ASSAULT" or category = "ROBBERY" then "violent" else "non-violent" end as violence,
Once the new query is piped into Data Studio, we can see the following:
We quickly notice that the green non-violent line closely matches the overall gray line, but the red violent line doesn’t show the same pattern; this suggests that non-violent crime made up the large majority of crime in San Francisco in the summer of 2014. A quick query confirms this to be true:
select *, ratio_to_report(num_incidents) over() as percent_of_total from ( select case when category = "ASSAULT" or category = "ROBBERY" then "violent" else "non-violent" end as violence, count(incident_num) as num_incidents, from coursera.san_francisco_summer_2014_crime group by violence )
Even though violent crime made up a small fraction of the total crime in San Francisco in the summer of 2014, this type of crime appears to be most effectively resolved when incidents take place between 5-7am. Similarly, violent crimes committed between 1-2am are less likely to be resolved than violent crimes during any other hour.
In addition, the range of incident resolution for violent crimes (22% min – 52% max) is in sharp contrast to the series of non-violent data, which shows resolution rates in a narrower band (26% min – 41% max).