During the past 3 weeks, protests have taken place across the country and the world in response to the murder of George Floyd by the Minneapolis Police. Since the advent of cell phone video recording, many instances of police brutality against Black people have been captured on camera. The relationship between law enforcement and the Black community has always been tenuous, and it has now been brought to the attention of the rest of the world. In the landmark Supreme Court case Terry v. Ohio (1968), the Court ruled that police may stop, question, and frisk a person if they have reasonable suspicion that the person has committed a crime (Brandes, S.A. et al., 2019).
From the 2000s to the early 2010s, the New York City (NYC) stop-and-frisk policy garnered national attention due to the high number of stops and the profiling of Black people. At the height of the policy, in 2011, there were 658,724 recorded stops, with over 50% of them targeting Black people (NYCLU 2019). Since then, the number of stops per year has decreased substantially, to 13,459 stops in 2019. Opponents of the policy argue this is still too many stops, especially since about 66% of the people stopped in 2019 were innocent. It has also been shown that stops of white people were more likely to lead to an arrest than stops of Black and Hispanic people, implying the police may be targeting minorities while being more mindful about stopping white people (Gelman, A., et al., 2007). Substantial research has documented the psychological distress that a stop-and-frisk policy inflicts on communities of color in NYC (Sewell, A. et al., 2016).
In this project I analyze the Stop, Question and Frisk data from the New York Police Department (NYPD) for the three most recent years: 2017, 2018, and 2019 (NYC Stop and Frisk Data). I chose these years for the following reasons. The years 2018-2019 were not included in the most recent NYCLU report. In 2017 the NYPD moved to an electronic form, as opposed to the handwritten forms used before 2017, on which a response to each question was written down manually. Lastly, 2017 was the first year of the Trump presidency, and I was curious to investigate whether his rhetoric on race may have affected law enforcement's behavior toward minorities.
! pip install plotly==4.8.1
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import plotly.express as px
from IPython.display import HTML
from nltk.sentiment.vader import SentimentIntensityAnalyzer
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)
Data was downloaded from the NYPD website (https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page). In each dataset each row is a stop of a specific person, and each column is a variable. There are a total of 83 different variables in each dataset.
# read in excel files from https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page
sf_17 = pd.read_excel("sqf_2017.xlsx") # 2017
sf_18 = pd.read_excel("sqf_2018.xlsx") # 2018
sf_19 = pd.read_excel("sqf_2019.xlsx") # 2019
After looking over the dataset, I chose to keep 39 of the 83 variables for further analyses.
# the 39 variables kept for analysis; the same subset is taken from each year's file
cols = ["STOP_FRISK_DATE", "STOP_FRISK_TIME", "YEAR2", "MONTH2", "DAY2",
        "ISSUING_OFFICER_RANK", "SUSPECTED_CRIME_DESCRIPTION", "SUSPECT_ARRESTED_FLAG",
        "SUSPECT_ARREST_OFFENSE", "OFFICER_IN_UNIFORM_FLAG", "FRISKED_FLAG", "SEARCHED_FLAG",
        "OTHER_CONTRABAND_FLAG", "FIREARM_FLAG", "KNIFE_CUTTER_FLAG", "OTHER_WEAPON_FLAG",
        "WEAPON_FOUND_FLAG", "PHYSICAL_FORCE_CEW_FLAG", "PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG",
        "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG", "PHYSICAL_FORCE_OC_SPRAY_USED_FLAG",
        "PHYSICAL_FORCE_OTHER_FLAG", "PHYSICAL_FORCE_RESTRAINT_USED_FLAG",
        "PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG", "PHYSICAL_FORCE_WEAPON_IMPACT_FLAG",
        "SUSPECTS_ACTIONS_CASING_FLAG", "SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG",
        "DEMEANOR_OF_PERSON_STOPPED", "SUSPECT_REPORTED_AGE", "SUSPECT_SEX",
        "SUSPECT_RACE_DESCRIPTION", "SUSPECT_BODY_BUILD_TYPE", "SUSPECT_OTHER_DESCRIPTION",
        "STOP_LOCATION_PRECINCT", "STOP_LOCATION_FULL_ADDRESS", "STOP_LOCATION_STREET_NAME",
        "STOP_LOCATION_PATROL_BORO_NAME", "STOP_LOCATION_BORO_NAME",
        "SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG"]
sf_17_sub = sf_17[cols]
sf_18_sub = sf_18[cols]
sf_19_sub = sf_19[cols]
After subsetting the data from the 3 years, I appended the datasets to make 1 dataset of the stops from all 3 years.
# concatenate the 3 years into one dataset (DataFrame.append is deprecated, so use pd.concat)
sf_sub = pd.concat([sf_17_sub, sf_18_sub, sf_19_sub], ignore_index=True)
len(sf_sub) # length of data
In total there were 36,096 stops in the years 2017-2019.
sf_sub.info() # columns chosen
sf_sub.head()
Above we see the names of the 39 variables in the sf_sub dataset, followed by a table of the first 5 observations (stops).
race_table_percent = sf_sub.groupby("SUSPECT_RACE_DESCRIPTION").size().agg({'Percent Total': lambda x: x/len(sf_sub)})
race_table_percent*100
In the above table we see that the people stopped were mostly described as having one of the following four races: Black, White Hispanic, White, or Black Hispanic.
# Race: (null) and a stray "MALE" entry become UNKNOWN; rare categories collapse into OTHER
sf_sub["SUSPECT_RACE_DESCRIPTION"] = sf_sub["SUSPECT_RACE_DESCRIPTION"].replace({
    '(null)': 'UNKNOWN', 'MALE': 'UNKNOWN',
    'AMER IND': 'OTHER', 'AMERICAN INDIAN/ALASKAN N': 'OTHER',
    'AMERICAN INDIAN/ALASKAN NATIVE': 'OTHER',
    'ASIAN / PACIFIC ISLANDER': 'OTHER', 'ASIAN/PAC.ISL': 'OTHER'})

# Sex: (null) and stray numeric entries become UNKNOWN
sf_sub["SUSPECT_SEX"] = sf_sub["SUSPECT_SEX"].replace(
    {'(null)': 'UNKNOWN', '19': 'UNKNOWN', '23': 'UNKNOWN', '24': 'UNKNOWN', '39': 'UNKNOWN'})

# Flag columns: (null) and truncated '(' entries become N (No); a stray 'V' in FRISKED_FLAG becomes Y
null_flags = ["FIREARM_FLAG", "KNIFE_CUTTER_FLAG", "WEAPON_FOUND_FLAG",
              "PHYSICAL_FORCE_CEW_FLAG", "PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG",
              "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG", "PHYSICAL_FORCE_OC_SPRAY_USED_FLAG",
              "PHYSICAL_FORCE_OTHER_FLAG", "PHYSICAL_FORCE_RESTRAINT_USED_FLAG",
              "SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG", "SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG"]
sf_sub[null_flags] = sf_sub[null_flags].replace('(null)', 'N')
paren_flags = ["OFFICER_IN_UNIFORM_FLAG", "FRISKED_FLAG", "SEARCHED_FLAG",
               "OTHER_CONTRABAND_FLAG", "WEAPON_FOUND_FLAG"]
sf_sub[paren_flags] = sf_sub[paren_flags].replace('(', 'N')
sf_sub.loc[sf_sub["FRISKED_FLAG"] == 'V', "FRISKED_FLAG"] = 'Y'

# Other description: collapse the many spellings of "no description" into one level, fix typos
no_desc = ['(null)', 'NONE', 'UNK', 'UNKNOWN', 'UNKNOW', 'NO', 'UKNOWN', 'UNKOWN', 'NA', 'N/A', '0', 0]
sf_sub["SUSPECT_OTHER_DESCRIPTION"] = (sf_sub["SUSPECT_OTHER_DESCRIPTION"]
                                       .replace(no_desc, 'NA/NONE/UNKNOWN')
                                       .fillna('NA/NONE/UNKNOWN')
                                       .replace({'TATOOS': 'TATTOOS', 'TATTOS': 'TATTOOS', 'TATOO': 'TATTOO'}))

# Demeanor: fix misspellings; '1'/1 and missing values become NONE
sf_sub["DEMEANOR_OF_PERSON_STOPPED"] = (sf_sub["DEMEANOR_OF_PERSON_STOPPED"]
                                        .replace({'IRRATE': 'IRATE', 'NEVEVOUS': 'NERVOUS',
                                                  '1': 'NONE', 1: 'NONE'})
                                        .fillna('NONE'))

# Borough: expand patrol-borough codes, standardize Staten Island, mark numeric codes/(null) as UNKNOWN
sf_sub["STOP_LOCATION_BORO_NAME"] = sf_sub["STOP_LOCATION_BORO_NAME"].replace({
    'STATEN IS': 'STATEN ISLAND', 'PBSI': 'STATEN ISLAND',
    'PBBX': 'BRONX', 'PBBN': 'BROOKLYN', 'PBBS': 'BROOKLYN',
    'PBMN': 'MANHATTAN', 'PBMS': 'MANHATTAN',
    '(null)': 'UNKNOWN', '0208760': 'UNKNOWN', '0190241': 'UNKNOWN', '0208169': 'UNKNOWN',
    '0986759': 'UNKNOWN', '0210334': 'UNKNOWN', '0237177': 'UNKNOWN', '0155070': 'UNKNOWN'})

# Arrest offense: (null) means the stop did not end in an arrest
sf_sub.loc[sf_sub["SUSPECT_ARREST_OFFENSE"] == '(null)', "SUSPECT_ARREST_OFFENSE"] = 'NO ARREST'

# Precinct: fill missing values with 999 (real precincts run 1-123) and store as integers
sf_sub["STOP_LOCATION_PRECINCT"] = sf_sub["STOP_LOCATION_PRECINCT"].fillna(999).round().astype(int)
Due to the nature of the dataset, some data cleaning was needed to organize several of the columns.
In the SUSPECT_RACE_DESCRIPTION column, values marked as (null) or "MALE" were converted to unknown; people not described as Black, White, Black Hispanic, or White Hispanic were converted to other.
In the SUSPECT_SEX column, values other than male or female were changed to "unknown".
For most of the flag columns, (null) or other stray values were converted to N (No).
In the SUSPECT_OTHER_DESCRIPTION column, entries that were obviously null values were classified as 'NA/NONE/UNKNOWN'.
In the DEMEANOR_OF_PERSON_STOPPED column, some misspellings were corrected, and observations were classified as 'NONE' if no demeanor was noted.
The STOP_LOCATION_BORO_NAME column was organized so that it represents one of the 5 boroughs of NYC, or "unknown" if it was not clear how to classify the stop.
The (null) values in the SUSPECT_ARREST_OFFENSE column were changed to "No Arrest".
Finally, missing values in the STOP_LOCATION_PRECINCT column were converted to 999, since the precincts are numbered from 1 to 123.
# Coerce reported ages to numbers; anything non-numeric or outside the
# two-digit range 10-99 becomes 0, which marks an unknown age
age = pd.to_numeric(sf_sub["SUSPECT_REPORTED_AGE"], errors="coerce")
sf_sub["SUSPECT_REPORTED_AGE"] = age.where(age.between(10, 99), 0).astype(int)
Above, the age column was converted to integer: two-digit values (ages 10-99) were kept, and all other values were classified as 0, i.e. unknown, since the expectation is that the police would not stop anyone under age 10 or over 99.
sf_sub = sf_sub.query('SUSPECT_RACE_DESCRIPTION not in ("UNKNOWN", "OTHER")')
Due to the low percentage of unknown or other races, I decided to keep only the stops of people described as one of the following races: Black, White Hispanic, White, or Black Hispanic.
len(sf_sub)
The cleaned and updated dataset includes 34,908 observations, omitting the stops conducted on people who were not described as Black, White, Black Hispanic, or White Hispanic.
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="SUSPECT_RACE_DESCRIPTION", order=sf_sub["SUSPECT_RACE_DESCRIPTION"].value_counts(ascending=True).index);
ax.set_ylabel("Race")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Race", fontsize=16);
In the above barplot, we see that Black people are stopped substantially more than any other group. They represent over 50% of the total stops, even though they make up about 25% of the population of NYC (NYC census data 2019).
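One way to quantify this gap is a disparity ratio: each group's share of stops divided by its share of the city's population. The sketch below uses made-up stop counts and hypothetical population shares purely to illustrate the calculation; the real figures would come from sf_sub and census data.

```python
import pandas as pd

# hypothetical stop counts and population shares (NOT the real figures)
stops = pd.Series({"BLACK": 580, "WHITE": 100, "WHITE HISPANIC": 230, "BLACK HISPANIC": 90})
pop_share = pd.Series({"BLACK": 0.25, "WHITE": 0.43, "WHITE HISPANIC": 0.23, "BLACK HISPANIC": 0.09})

# ratio > 1 means a group is stopped more often than its population share would predict
disparity = (stops / stops.sum()) / pop_share
print(disparity.round(2))  # BLACK → 2.32 under these toy numbers
```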
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="SUSPECT_SEX", order=sf_sub["SUSPECT_SEX"].value_counts(ascending=True).index);
ax.set_ylabel("Sex")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Sex", fontsize=16);
The majority of stops are of males.
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="DAY2", order=sf_sub["DAY2"].value_counts(ascending=True).index);
ax.set_ylabel("Weekday")
ax.set_xlabel("Number of Stops")
ax.set_xticks(range(0,6001,500))
ax.set_title("Number of Stops By Weekday", fontsize=16);
Over 5500 of the stops occurred on Saturdays, while fewer than 4000 stops occurred on Mondays.
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="MONTH2", order=sf_sub["MONTH2"].value_counts(ascending=True).index);
ax.set_ylabel("Month")
ax.set_xlabel("Number of Stops")
ax.set_xticks(range(0,3501,250))
ax.set_title("Number of Stops By Month", fontsize=16);
The fewest stops were made in December, while the spring months of March, April, and May had the highest numbers of stops.
sf_sub_age_no_NA = sf_sub.query('SUSPECT_REPORTED_AGE > 0 ') # did not include unknown (null) values
sns.set(style="ticks", color_codes=True)
sns.despine(left=True)
fig, ax = plt.subplots(1, 1, figsize=(10,8))
sns.histplot(sf_sub_age_no_NA["SUSPECT_REPORTED_AGE"], color="teal", bins=range(10, 80, 5), ax=ax)  # histplot replaces the deprecated distplot
ax.set_xlim(10,80)
ax.set_xticks(range(10,80,5))
ax.set_yticks(range(0,8001,1000))
ax.set_ylabel("Number of Stops")
ax.set_xlabel("Age")
ax.set_title("Number of Stops By Age", fontsize=16);
Above is a histogram of stops organized by age. We see a right-skewed distribution, where the majority of people stopped were between the ages of 15 and 30. I omitted people aged 80 or older from this histogram since there were so few observations.
len(sf_sub.query('SUSPECT_REPORTED_AGE > 80 ')) # only 7 people stopped were above 80 years old.
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="STOP_LOCATION_BORO_NAME", order=sf_sub["STOP_LOCATION_BORO_NAME"].value_counts(ascending=True).index);
ax.set_ylabel("Borough")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Borough", fontsize=16);
We see that the largest number of stops was made in Brooklyn. This is expected, since Brooklyn has the highest population of all the boroughs (NYC census data). It is interesting that the second-largest number of stops was made in Manhattan, even though Queens is the second most populous borough, with about 600,000 more residents. Many NYC residents commute to and work in Manhattan, which could explain this discrepancy between the number of stops and the population of each borough.
sf_sub_age_no_NA_U = sf_sub_age_no_NA.query('SUSPECT_SEX != "UNKNOWN"')
fig = px.density_heatmap(sf_sub_age_no_NA_U, x="SUSPECT_RACE_DESCRIPTION", y="SUSPECT_REPORTED_AGE", facet_col="SUSPECT_SEX", color_continuous_scale="cividis")
fig.show()
In the above interactive visualization we see that the largest subgroup of people stopped was Black males aged 16-17 (1978 stops). The largest subgroup of White males stopped was aged 30-31 (210 stops).
# Code from https://mode.com/example-gallery/python_dataframe_styling/
# Set CSS properties for th elements in dataframe
th_props = [
('font-size', '12px'),
('text-align', 'center'),
('font-weight', 'bold'),
('color', '#6d6d6d'),
('background-color', '#f7f7f9')
]
# Set CSS properties for td elements in dataframe
td_props = [
('font-size', '12px'),
('font-weight', 'bold')
]
# Set table styles
styles = [
dict(selector="th", props=th_props),
dict(selector="td", props=td_props)
]
# Set colormap equal to seaborns light green color palette
cm = sns.light_palette("limegreen", as_cmap=True)
race_crime = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECTED_CRIME_DESCRIPTION"]).size().unstack().T
(race_crime.style
.background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
.set_caption('Race and Alleged Crime Description')
.format("{:,.0f}", na_rep="0")
.set_table_styles(styles))
In the above table we see, for each race, the breakdown of the crime each stopped person was alleged to have committed. For all four races, criminal possession of a weapon (CPW) was the number one reason for stopping someone.
CPW_only = sf_sub.query('SUSPECTED_CRIME_DESCRIPTION == "CPW"') # subset only CPW alleged suspects
CPW_weapon = CPW_only.groupby(["SUSPECT_RACE_DESCRIPTION", "WEAPON_FOUND_FLAG"]).size().unstack().T
(CPW_weapon.style
.background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
.set_caption('Race and Weapon Found for CPW')
.format("{:,.0f}", na_rep="0")
.set_table_styles(styles))
I decided to subset the dataset to further analyze the stops involving alleged CPW suspects, grouping by whether the police actually found a weapon on the suspect. In the above table, we see that for all four races, the police did not find a weapon in the majority of CPW stops. Furthermore, we see that the "hit rate" (the proportion of stops that yield a positive result (Gelman, A., et al., 2007)) is higher for White suspects than for Black suspects. This could be due to implicit or explicit bias of the police towards Black people.
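The hit rate itself is easy to compute from a counts table shaped like CPW_weapon above (rows = WEAPON_FOUND_FLAG, columns = race). A minimal sketch with made-up counts:

```python
import pandas as pd

# toy counts table: rows are WEAPON_FOUND_FLAG values, columns are races (illustrative numbers)
counts = pd.DataFrame({"BLACK": [700, 300], "WHITE": [50, 50]}, index=["N", "Y"])

# hit rate = weapons found / total CPW stops, per group
hit_rate = counts.loc["Y"] / counts.sum()
print(hit_rate)  # BLACK 0.3, WHITE 0.5 for these toy counts
```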
race_frisk = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","FRISKED_FLAG"]).size().unstack().T
(race_frisk.style
.background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
.set_caption('Race and Frisk')
.format("{:,.0f}", na_rep="0")
.set_table_styles(styles))
Once an officer stops a person, they can frisk the person for weapons or contraband if they deem it necessary. The above table shows how often an officer decided to frisk a person based on their race. Black and Hispanic people were much more likely to be frisked than White people: about 60% of Black suspects were frisked, as opposed to about 45% of White suspects.
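Frisk percentages like these can be read off a row-normalized crosstab. This sketch shows the mechanics on toy records, with the same column names as sf_sub assumed:

```python
import pandas as pd

# toy stop records (illustrative only)
toy = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": ["BLACK"] * 5 + ["WHITE"] * 4,
    "FRISKED_FLAG": ["Y", "Y", "Y", "N", "N", "Y", "N", "N", "N"],
})

# normalize="index" turns each row of counts into proportions, so the Y column is the frisk rate
rates = pd.crosstab(toy["SUSPECT_RACE_DESCRIPTION"], toy["FRISKED_FLAG"], normalize="index")
print(rates["Y"])  # BLACK 0.60, WHITE 0.25 for these toy records
```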
sf_sub_M_or_F = sf_sub.query('SUSPECT_SEX != "UNKNOWN"') # subset only Male or Females
race_sex_frisk = sf_sub_M_or_F.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECT_SEX" ,"FRISKED_FLAG"]).size().unstack().T
(race_sex_frisk.style
.background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
.set_caption('Race, Sex, Frisk')
.format("{:,.0f}", na_rep="0")
.set_table_styles(styles))
The above table further breaks down whether a suspect was frisked, based on their race and sex. Overall, females were more likely not to be frisked, and White females were much more likely not to be frisked than Hispanic or Black females. This table also shows the stark disparity in frisks between White suspects and Hispanic or Black suspects.
race_arrest = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECT_ARRESTED_FLAG"]).size().unstack().T
race_arrest
(race_arrest.style
.background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
.set_caption('Race and Arrest')
.format("{:,.0f}", na_rep="0")
.set_table_styles(styles))
Here we see what proportion of stops led to arrests. About 30% of stops involving Black or Black Hispanic people led to an arrest, while about 32% of stops involving White or White Hispanic people led to an arrest. Overall, about 70% of the people stopped were found to be innocent.
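The arrest-rate comparison is a one-liner with a boolean groupby; again a sketch on toy records, with the same column names as sf_sub assumed:

```python
import pandas as pd

# toy stop records (illustrative only)
toy = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": ["BLACK", "BLACK", "BLACK", "WHITE", "WHITE"],
    "SUSPECT_ARRESTED_FLAG": ["Y", "N", "N", "Y", "N"],
})

arrested = toy["SUSPECT_ARRESTED_FLAG"].eq("Y")
print(arrested.groupby(toy["SUSPECT_RACE_DESCRIPTION"]).mean())  # per-race arrest rate
print(1 - arrested.mean())  # overall share of stops ending with no arrest → 0.6 here
```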
fig = px.treemap(sf_sub, path=["STOP_LOCATION_BORO_NAME","STOP_LOCATION_PRECINCT", "SUSPECT_RACE_DESCRIPTION" ])
fig.show()
In the above visualization, I created an interactive treemap. First the data is broken down into the 5 boroughs (with stops for which no borough was indicated marked as 'UNKNOWN'). Then, within each borough, we can see how many stops there were in each precinct. Finally, within each precinct, we can see the number of people of each race who were stopped. Of the 77 precincts, only 13 stopped non-Black people most often. Considerable work has been done to analyze precinct-level stops and the associated crime rates within each precinct (Levchak P.J. 2017).
In these electronic forms there are 2 columns on which we can perform sentiment analysis. Sentiment analysis can be used to understand the emotional component of a text. Here I used VADER, a model that measures the negative, positive, neutral, and overall sentiment intensity of a text (Hutto, C.J. et al., 2014). The first column I chose to run sentiment analysis on was DEMEANOR_OF_PERSON_STOPPED. Below is a list of the top 10 responses for the demeanor of a person stopped.
sf_sub["DEMEANOR_OF_PERSON_STOPPED"].value_counts().head(10)
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
sid.polarity_scores("CALM")
sid.polarity_scores("NERVOUS")
sid.polarity_scores("NONE")
Above we see the VADER scores for the top 3 responses for DEMEANOR_OF_PERSON_STOPPED. The compound (overall) score is a value between -1 and 1: the closer to -1, the more negative, and the closer to 1, the more positive the sentiment. "CALM" had an overall score of 0.3183, "NERVOUS" had an overall score of -0.2732, and "NONE" (indicating no response was written for that stop) had an overall score of 0.0. These scores make sense, since calm is usually more positive, nervous is usually more negative, and none is fairly neutral.
sf_sub["SUSPECT_OTHER_DESCRIPTION"].value_counts().head(10)
Above are the top responses for the variable SUSPECT_OTHER_DESCRIPTION. Below I make new columns for each score of each variable on which I will run sentiment analysis.
sf_sub['DEMEANOR_NEGATIVE']=float(0)
sf_sub['DEMEANOR_NEUTRAL']=float(0)
sf_sub['DEMEANOR_POSITIVE']=float(0)
sf_sub['DEMEANOR_COMPOUND']=float(0)
sf_sub["SUSPECT_DESCR_NEGATIVE"]=float(0)
sf_sub["SUSPECT_DESCR_NEUTRAL"]=float(0)
sf_sub["SUSPECT_DESCR_POSITIVE"]=float(0)
sf_sub["SUSPECT_DESCR_COMPOUND"]=float(0)
# Score the demeanor text for each stop
for i in sf_sub.index:
    ss = sid.polarity_scores(str(sf_sub.loc[i, "DEMEANOR_OF_PERSON_STOPPED"]))
    sf_sub.loc[i, 'DEMEANOR_NEGATIVE'] = ss['neg']
    sf_sub.loc[i, 'DEMEANOR_NEUTRAL'] = ss['neu']
    sf_sub.loc[i, 'DEMEANOR_POSITIVE'] = ss['pos']
    sf_sub.loc[i, 'DEMEANOR_COMPOUND'] = ss['compound']
# Score the free-text suspect description for each stop
for i in sf_sub.index:
    ss = sid.polarity_scores(str(sf_sub.loc[i, "SUSPECT_OTHER_DESCRIPTION"]))
    sf_sub.loc[i, "SUSPECT_DESCR_NEGATIVE"] = ss['neg']
    sf_sub.loc[i, "SUSPECT_DESCR_NEUTRAL"] = ss['neu']
    sf_sub.loc[i, "SUSPECT_DESCR_POSITIVE"] = ss['pos']
    sf_sub.loc[i, "SUSPECT_DESCR_COMPOUND"] = ss['compound']
sf_sub.describe()
Here we see the descriptive statistics for each column created using the VADER model for sentiment analysis.
fig = px.box(sf_sub, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="FRISKED_FLAG")
fig.show()
The above boxplots are organized by race on the x axis, with the overall score for the demeanor of the person stopped on the y axis. In red is the score of a person who was frisked, and in blue the score of a person who was not frisked. An interesting pattern to note: the distribution of scores is nearly identical across the 4 races for people who were not frisked, but differs for people who were frisked. A White person who was frisked had about the same score as a White person who was not frisked, but that is not the case for the other 3 races. Black, White Hispanic, and Black Hispanic people who were stopped AND frisked had lower overall VADER scores than people of the same race who were not frisked.
fig = px.box(sf_sub, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="SUSPECT_ARRESTED_FLAG")
fig.show()
The above boxplots are organized by race on the x axis, with the overall score for the demeanor of the person stopped on the y axis. In red is the score of a person who was arrested, and in blue the score of a person who was not arrested. We see a pattern similar to the previous boxplots: the distribution of scores is nearly identical across the 4 races for people who were not arrested, but differs for people who were arrested. A White person who was arrested had about the same score as a White person who was not arrested. On the other hand, Black, White Hispanic, and Black Hispanic people who were stopped AND arrested had lower overall VADER scores than people of the same race who were not arrested.
sf_sub_M_or_F = sf_sub.query('SUSPECT_SEX != "UNKNOWN"')
fig = px.box(sf_sub_M_or_F, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="SUSPECT_SEX")
fig.show()
The above boxplots are organized by race on the x axis, with the overall score for the demeanor of the person stopped on the y axis. In red is the score of male suspects, and in blue the score of female suspects. We notice that the score is similar across the 4 races for male suspects. Compared to the male suspects, the female suspects of every race received lower VADER scores.
In the past 3 years the NYPD stopped young Black males most often. Over 50% of the total stops were of Black people, and over 90% of the stops were of males. Geographically, the most stops were made in Brooklyn. Looking at the borough and precinct treemap, we can see that in Brooklyn, 19 of the 23 precincts predominantly stopped Black people. It would be interesting to see the racial make-up of those precinct neighborhoods in order to compare the racial population of the community with this dataset. For example, in precinct 67, when excluding unknown or other races, about 95% of the stops were of Black people. What is the racial population of that neighborhood? Could tourists or people commuting to that neighborhood for work during the day change that racial population?
The sentiment analysis using the VADER model provided some insightful results. Why is there a difference between the boxplots of frisked and not-frisked people for Black and Hispanic people, but no difference for White people? The same observation was made for arrests. Further analysis of the text from the DEMEANOR_OF_PERSON_STOPPED variable would be of interest. The difference between the sexes across all racial groups is also evident: females are described as having a more negative demeanor than males. This could be due to the small sample size of females who are stopped.
There are numerous avenues that could be researched for further analysis from this dataset. Some possible questions: When are people of different races stopped during the day? For each race, what proportion of stops lead to physical force? In the precincts that do not stop Black people the most, what is the racial makeup of that precinct neighborhood?
There are a few caveats to this dataset. First, these are only the stops recorded and reported by the NYPD; numerous stops may go unrecorded by officers for various reasons, which could alter the data. Also, the data is entered by the officers, so they are essentially the data collectors. It would be important to somehow validate some of these stops, possibly by looking through police body-cam footage.
This analysis further shows the explicit and implicit bias police have towards Black people. It is imperative during this time that we all evaluate our biases toward people of different races. Police should be held to the highest standards, since they are the ones tasked with keeping our communities safe and are in positions of authority. Unfortunately, research has shown that the sometimes invasive and frequent stops have had detrimental health effects on minorities in racially diverse communities (Sewell A.A. et al., 2016). Research to uncover biases towards minorities is extremely important, and we as statisticians should do our part to help fight the power.