Stop and Frisk in New York City from 2017 - 2019

Roupen Khanjian

Final Project PSTAT 234

Spring 2020

Introduction

During the past three weeks, protests have taken place across the country and the world in response to the murder of George Floyd by the Minneapolis police. Since the advent of cell-phone video recording, many instances of police brutality against Black people have been captured on camera. The relationship between law enforcement and the Black community has always been tenuous, and it has now been brought to the attention of the rest of the world. In the landmark Supreme Court case Terry v. Ohio (1968), it was ruled that police may stop, question, and frisk a person if they have reasonable suspicion that the person has committed a crime (Brandes, S.A. et al., 2019).

From the 2000s to the early 2010s, the New York City (NYC) stop-and-frisk policy garnered national attention due to the high number of stops and the profiling of Black people. At the height of the policy, in 2011, there were 658,724 recorded stops, with over 50% of them targeting Black people (NYCLU 2019). Since then, the number of stops per year has decreased substantially, to 13,459 stops in 2019. Opponents of the policy argue this is still too many, especially since in 2019 about 66% of the people stopped were innocent. It has also been shown that stops of white people were more likely to lead to an arrest than stops of Black and Hispanic people, implying the police may be targeting minorities while being more selective about stopping white people (Gelman, A., et al., 2007). Substantial research has documented the psychological distress a stop-and-frisk policy inflicts on communities of color in NYC (Sewell, A. et al., 2016).

In this project I analyze the Stop, Question and Frisk data from the New York Police Department (NYPD) for the three most recent years: 2017, 2018, and 2019 (NYC Stop and Frisk Data). I chose these years for the following reasons. The years 2018 and 2019 were not included in the most recent NYCLU report. In 2017 the NYPD moved to an electronic form, as opposed to the handwritten forms used before 2017, on which officers manually wrote down a response to each question. Lastly, 2017 was the first year of the Trump presidency, and I was curious to investigate whether his rhetoric on race may have affected law enforcement's behavior toward minorities.

Data Cleaning and Wrangling

In [1]:
! pip install plotly==4.8.1
Requirement already satisfied: plotly==4.8.1 in /opt/conda/lib/python3.7/site-packages (4.8.1)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.7/site-packages (from plotly==4.8.1) (1.3.3)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from plotly==4.8.1) (1.14.0)
In [2]:
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
import plotly.express as px
from IPython.display import HTML
from nltk.sentiment.vader import SentimentIntensityAnalyzer

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

Data was downloaded from the NYPD website (https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page). In each dataset each row is a stop of a specific person, and each column is a variable. There are a total of 83 different variables in each dataset.

In [3]:
# read in excel files from https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page
sf_17 = pd.read_excel("sqf_2017.xlsx") # 2017
sf_18 = pd.read_excel("sqf_2018.xlsx") # 2018
sf_19 = pd.read_excel("sqf_2019.xlsx") # 2019

After looking over the dataset, I chose to keep 39 of the 83 variables for further analyses.

In [4]:
# keep the same 39 columns for each year
cols = ["STOP_FRISK_DATE", "STOP_FRISK_TIME", "YEAR2", "MONTH2", "DAY2",
        "ISSUING_OFFICER_RANK", "SUSPECTED_CRIME_DESCRIPTION",
        "SUSPECT_ARRESTED_FLAG", "SUSPECT_ARREST_OFFENSE",
        "OFFICER_IN_UNIFORM_FLAG", "FRISKED_FLAG", "SEARCHED_FLAG",
        "OTHER_CONTRABAND_FLAG", "FIREARM_FLAG", "KNIFE_CUTTER_FLAG",
        "OTHER_WEAPON_FLAG", "WEAPON_FOUND_FLAG", "PHYSICAL_FORCE_CEW_FLAG",
        "PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG",
        "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG",
        "PHYSICAL_FORCE_OC_SPRAY_USED_FLAG", "PHYSICAL_FORCE_OTHER_FLAG",
        "PHYSICAL_FORCE_RESTRAINT_USED_FLAG",
        "PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG",
        "PHYSICAL_FORCE_WEAPON_IMPACT_FLAG", "SUSPECTS_ACTIONS_CASING_FLAG",
        "SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG",
        "DEMEANOR_OF_PERSON_STOPPED", "SUSPECT_REPORTED_AGE", "SUSPECT_SEX",
        "SUSPECT_RACE_DESCRIPTION", "SUSPECT_BODY_BUILD_TYPE",
        "SUSPECT_OTHER_DESCRIPTION", "STOP_LOCATION_PRECINCT",
        "STOP_LOCATION_FULL_ADDRESS", "STOP_LOCATION_STREET_NAME",
        "STOP_LOCATION_PATROL_BORO_NAME", "STOP_LOCATION_BORO_NAME",
        "SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG"]

sf_17_sub = sf_17[cols]
sf_18_sub = sf_18[cols]
sf_19_sub = sf_19[cols]

After subsetting the data from the three years, I appended the datasets into a single dataset containing the stops from all three years.

In [5]:
sf_sub = sf_17_sub.append(sf_18_sub, ignore_index= True)
sf_sub = sf_sub.append(sf_19_sub, ignore_index= True)
In [6]:
len(sf_sub) # length of data
Out[6]:
36096

In total there were 36,096 stops in the years 2017-2019.
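As an aside, `DataFrame.append` has since been deprecated in pandas (and removed in 2.0); `pd.concat` achieves the same result in one call. A minimal sketch with toy frames standing in for the three yearly subsets:

```python
import pandas as pd

# Toy stand-ins for the three yearly subsets.
a = pd.DataFrame({"YEAR2": [2017, 2017]})
b = pd.DataFrame({"YEAR2": [2018]})
c = pd.DataFrame({"YEAR2": [2019]})

# One call replaces the chained append; ignore_index renumbers rows 0..n-1.
sf_all = pd.concat([a, b, c], ignore_index=True)
print(len(sf_all))  # 4
```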

In [7]:
sf_sub.info() # columns chosen
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36096 entries, 0 to 36095
Data columns (total 39 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   STOP_FRISK_DATE                           36096 non-null  object 
 1   STOP_FRISK_TIME                           36089 non-null  object 
 2   YEAR2                                     36096 non-null  int64  
 3   MONTH2                                    36096 non-null  object 
 4   DAY2                                      36096 non-null  object 
 5   ISSUING_OFFICER_RANK                      36096 non-null  object 
 6   SUSPECTED_CRIME_DESCRIPTION               36096 non-null  object 
 7   SUSPECT_ARRESTED_FLAG                     36096 non-null  object 
 8   SUSPECT_ARREST_OFFENSE                    36096 non-null  object 
 9   OFFICER_IN_UNIFORM_FLAG                   36096 non-null  object 
 10  FRISKED_FLAG                              36096 non-null  object 
 11  SEARCHED_FLAG                             36096 non-null  object 
 12  OTHER_CONTRABAND_FLAG                     36096 non-null  object 
 13  FIREARM_FLAG                              36096 non-null  object 
 14  KNIFE_CUTTER_FLAG                         36096 non-null  object 
 15  OTHER_WEAPON_FLAG                         36096 non-null  object 
 16  WEAPON_FOUND_FLAG                         36096 non-null  object 
 17  PHYSICAL_FORCE_CEW_FLAG                   36096 non-null  object 
 18  PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG    36096 non-null  object 
 19  PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG      36096 non-null  object 
 20  PHYSICAL_FORCE_OC_SPRAY_USED_FLAG         36096 non-null  object 
 21  PHYSICAL_FORCE_OTHER_FLAG                 36096 non-null  object 
 22  PHYSICAL_FORCE_RESTRAINT_USED_FLAG        36096 non-null  object 
 23  PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG    36096 non-null  object 
 24  PHYSICAL_FORCE_WEAPON_IMPACT_FLAG         36096 non-null  object 
 25  SUSPECTS_ACTIONS_CASING_FLAG              36096 non-null  object 
 26  SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG  36096 non-null  object 
 27  DEMEANOR_OF_PERSON_STOPPED                33750 non-null  object 
 28  SUSPECT_REPORTED_AGE                      36096 non-null  object 
 29  SUSPECT_SEX                               36096 non-null  object 
 30  SUSPECT_RACE_DESCRIPTION                  36096 non-null  object 
 31  SUSPECT_BODY_BUILD_TYPE                   36096 non-null  object 
 32  SUSPECT_OTHER_DESCRIPTION                 33388 non-null  object 
 33  STOP_LOCATION_PRECINCT                    35665 non-null  float64
 34  STOP_LOCATION_FULL_ADDRESS                36091 non-null  object 
 35  STOP_LOCATION_STREET_NAME                 36091 non-null  object 
 36  STOP_LOCATION_PATROL_BORO_NAME            36096 non-null  object 
 37  STOP_LOCATION_BORO_NAME                   36096 non-null  object 
 38  SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG   36096 non-null  object 
dtypes: float64(1), int64(1), object(37)
memory usage: 10.7+ MB
In [8]:
sf_sub.head()
Out[8]:
STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 ISSUING_OFFICER_RANK SUSPECTED_CRIME_DESCRIPTION SUSPECT_ARRESTED_FLAG SUSPECT_ARREST_OFFENSE OFFICER_IN_UNIFORM_FLAG ... SUSPECT_SEX SUSPECT_RACE_DESCRIPTION SUSPECT_BODY_BUILD_TYPE SUSPECT_OTHER_DESCRIPTION STOP_LOCATION_PRECINCT STOP_LOCATION_FULL_ADDRESS STOP_LOCATION_STREET_NAME STOP_LOCATION_PATROL_BORO_NAME STOP_LOCATION_BORO_NAME SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG
0 2017-01-16 14:25:59.999000 2017 January Monday SGT TERRORISM N (null) Y ... MALE (null) THN (null) 1.0 180 GREENWICH STREET GREENWICH STREET PBMS MANHATTAN (null)
1 2017-01-16 14:25:59.999000 2017 January Monday SGT TERRORISM N (null) Y ... MALE (null) MED (null) 1.0 180 GREENWICH STREET GREENWICH STREET PBMS MANHATTAN (null)
2 2017-02-08 11:09:59.999000 2017 February Wednesday POM OTHER N (null) N ... FEMALE WHITE THN NaN 1.0 WALL STREET && BROADWAY WALL STREET PBMS MANHATTAN (null)
3 2017-02-20 11:34:59.999000 2017 February Monday POM GRAND LARCENY AUTO N (null) Y ... MALE BLACK HISPANIC U UNK 1.0 75 GREENE STREET GREENE STREET PBMS MANHATTAN (null)
4 2017-02-21 13:20:00 2017 February Tuesday POM BURGLARY N (null) Y ... FEMALE BLACK THN (null) 1.0 429 WEST BROADWAY WEST BROADWAY PBMS MANHATTAN (null)

5 rows × 39 columns

Above we see the names of the 39 variables in the dataset sf_sub, followed by a table of the first 5 observations (stops).

In [9]:
race_table_percent = sf_sub.groupby("SUSPECT_RACE_DESCRIPTION").size().agg({'Percent Total': lambda x: x/len(sf_sub)})
race_table_percent*100
Out[9]:
               SUSPECT_RACE_DESCRIPTION      
Percent Total  (null)                             1.163564
               AMER IND                           0.024934
               AMERICAN INDIAN/ALASKAN N          0.022163
               AMERICAN INDIAN/ALASKAN NATIVE     0.044326
               ASIAN / PACIFIC ISLANDER           1.446144
               ASIAN/PAC.ISL                      0.570700
               BLACK                             57.671210
               BLACK HISPANIC                     8.593750
               MALE                               0.019393
               WHITE                              9.048094
               WHITE HISPANIC                    21.395723
dtype: float64

In the above table we see that the people stopped were mostly described as having one of the following four races: Black, White Hispanic, White, or Black Hispanic.
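The same percentage breakdown can be computed more directly with `value_counts(normalize=True)`, avoiding the groupby/lambda; a sketch on a toy race column:

```python
import pandas as pd

races = pd.Series(["BLACK", "BLACK", "WHITE", "WHITE HISPANIC"])

# normalize=True returns proportions; multiply by 100 for percentages.
pct = races.value_counts(normalize=True) * 100
print(pct["BLACK"])  # 50.0
```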

In [10]:
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == '(null)', 'SUSPECT_RACE_DESCRIPTION'] = 'UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'MALE', 'SUSPECT_RACE_DESCRIPTION'] = 'UNKNOWN'

sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'AMER IND', 'SUSPECT_RACE_DESCRIPTION'] = 'OTHER'
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'AMERICAN INDIAN/ALASKAN N', 'SUSPECT_RACE_DESCRIPTION'] = 'OTHER'
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'AMERICAN INDIAN/ALASKAN NATIVE', 'SUSPECT_RACE_DESCRIPTION'] = 'OTHER'
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'ASIAN / PACIFIC ISLANDER', 'SUSPECT_RACE_DESCRIPTION'] = 'OTHER'
sf_sub.loc[sf_sub["SUSPECT_RACE_DESCRIPTION"] == 'ASIAN/PAC.ISL', 'SUSPECT_RACE_DESCRIPTION'] = 'OTHER'

sf_sub.loc[sf_sub["SUSPECT_SEX"] == '(null)', "SUSPECT_SEX"] = 'UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_SEX"] == '19', "SUSPECT_SEX"] = 'UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_SEX"] == '23', "SUSPECT_SEX"] = 'UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_SEX"] == '24', "SUSPECT_SEX"] = 'UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_SEX"] == '39', "SUSPECT_SEX"] = 'UNKNOWN'

sf_sub.loc[sf_sub["PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG"] == '(null)', "PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG"] = 'N'
sf_sub.loc[sf_sub["OFFICER_IN_UNIFORM_FLAG"] == '(', "OFFICER_IN_UNIFORM_FLAG"] = 'N'
sf_sub.loc[sf_sub["FRISKED_FLAG"] == '(', "FRISKED_FLAG"] = 'N'
sf_sub.loc[sf_sub["FRISKED_FLAG"] == 'V', "FRISKED_FLAG"] = 'Y'
sf_sub.loc[sf_sub["SEARCHED_FLAG"] == '(', "SEARCHED_FLAG"] = 'N'
sf_sub.loc[sf_sub["OTHER_CONTRABAND_FLAG"] == '(', "OTHER_CONTRABAND_FLAG"] = 'N'
sf_sub.loc[sf_sub["FIREARM_FLAG"] == '(null)', "FIREARM_FLAG"] = 'N'
sf_sub.loc[sf_sub["KNIFE_CUTTER_FLAG"] == '(null)', "KNIFE_CUTTER_FLAG"] = 'N'
sf_sub.loc[sf_sub["WEAPON_FOUND_FLAG"] == '(null)', "WEAPON_FOUND_FLAG"] = 'N'
sf_sub.loc[sf_sub["WEAPON_FOUND_FLAG"] == '(', "WEAPON_FOUND_FLAG"] = 'N'
sf_sub.loc[sf_sub["PHYSICAL_FORCE_CEW_FLAG"] == '(null)', "PHYSICAL_FORCE_CEW_FLAG"] = 'N'
sf_sub.loc[sf_sub["PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG"] == '(null)', "PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG"] = 'N'
sf_sub.loc[sf_sub["PHYSICAL_FORCE_OC_SPRAY_USED_FLAG"] == '(null)', "PHYSICAL_FORCE_OC_SPRAY_USED_FLAG"] = 'N'
sf_sub.loc[sf_sub["PHYSICAL_FORCE_OTHER_FLAG"] == '(null)', "PHYSICAL_FORCE_OTHER_FLAG"] = 'N'
sf_sub.loc[sf_sub["PHYSICAL_FORCE_RESTRAINT_USED_FLAG"] == '(null)', "PHYSICAL_FORCE_RESTRAINT_USED_FLAG"] = 'N'
sf_sub.loc[sf_sub["SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG"] == '(null)', "SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG"] = 'N'
sf_sub.loc[sf_sub["SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG"] == '(null)', "SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG"] = 'N'

sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == '(null)', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'NONE', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'UNK', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'UNKNOWN', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'UNKNOW', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'NO', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'UKNOWN', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'UNKOWN', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'NA', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'N/A', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == '0', "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 0, "SUSPECT_OTHER_DESCRIPTION"] = 'NA/NONE/UNKNOWN'
sf_sub["SUSPECT_OTHER_DESCRIPTION"] = sf_sub["SUSPECT_OTHER_DESCRIPTION"].fillna('NA/NONE/UNKNOWN')

sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'TATOOS', "SUSPECT_OTHER_DESCRIPTION"] = 'TATTOOS'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'TATTOS', "SUSPECT_OTHER_DESCRIPTION"] = 'TATTOOS'
sf_sub.loc[sf_sub["SUSPECT_OTHER_DESCRIPTION"] == 'TATOO', "SUSPECT_OTHER_DESCRIPTION"] = 'TATTOO'

sf_sub.loc[sf_sub["DEMEANOR_OF_PERSON_STOPPED"] == 'IRRATE', "DEMEANOR_OF_PERSON_STOPPED"] = 'IRATE'
sf_sub.loc[sf_sub["DEMEANOR_OF_PERSON_STOPPED"] == '1', "DEMEANOR_OF_PERSON_STOPPED"] = 'NONE'
sf_sub.loc[sf_sub["DEMEANOR_OF_PERSON_STOPPED"] == 1, "DEMEANOR_OF_PERSON_STOPPED"] = 'NONE'
sf_sub.loc[sf_sub["DEMEANOR_OF_PERSON_STOPPED"] == 'NEVEVOUS', "DEMEANOR_OF_PERSON_STOPPED"] = 'NERVOUS'
sf_sub["DEMEANOR_OF_PERSON_STOPPED"] = sf_sub["DEMEANOR_OF_PERSON_STOPPED"].fillna('NONE')

sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'STATEN IS', "STOP_LOCATION_BORO_NAME"] = 'STATEN ISLAND'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '(null)', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBBX', "STOP_LOCATION_BORO_NAME"] = 'BRONX'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBBN', "STOP_LOCATION_BORO_NAME"] = 'BROOKLYN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBMN', "STOP_LOCATION_BORO_NAME"] = 'MANHATTAN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0208760', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0190241', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0208169', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0986759', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBMS', "STOP_LOCATION_BORO_NAME"] = 'MANHATTAN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0210334', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBSI', "STOP_LOCATION_BORO_NAME"] = 'STATEN ISLAND'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0237177', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == '0155070', "STOP_LOCATION_BORO_NAME"] = 'UNKNOWN'
sf_sub.loc[sf_sub["STOP_LOCATION_BORO_NAME"] == 'PBBS', "STOP_LOCATION_BORO_NAME"] = 'BROOKLYN'

sf_sub.loc[sf_sub["SUSPECT_ARREST_OFFENSE"] == '(null)', "SUSPECT_ARREST_OFFENSE"] = 'NO ARREST'

sf_sub["STOP_LOCATION_PRECINCT"] = sf_sub["STOP_LOCATION_PRECINCT"].fillna(999)
for i in range(len(sf_sub)):
    sf_sub.loc[i,"STOP_LOCATION_PRECINCT"] = int(round(sf_sub.loc[i, "STOP_LOCATION_PRECINCT"], 0))
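# The row-by-row precinct loop above can also be vectorized: pandas can
# fill, round, and cast the whole column at once. A minimal sketch with
# toy values (999 is the same missing-precinct sentinel):

```python
import pandas as pd

df = pd.DataFrame({"STOP_LOCATION_PRECINCT": [1.0, 40.0, None]})

# Fill missing precincts with the 999 sentinel, then round and cast in one pass.
df["STOP_LOCATION_PRECINCT"] = (
    df["STOP_LOCATION_PRECINCT"].fillna(999).round().astype(int)
)
print(df["STOP_LOCATION_PRECINCT"].tolist())  # [1, 40, 999]
```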

Due to the nature of the dataset, some data cleaning was required to organize several of the columns.

  • For the SUSPECT_RACE_DESCRIPTION column, converted values marked as (null) or "MALE" to unknown. Also, if someone was described as other than Black, White, Black Hispanic, or White Hispanic, converted the value to other.

  • Changed non-male or non-female values to "unknown" in the SUSPECT_SEX column.

  • For most of the flag columns, converted (null) or other stray values to N (No).

  • In SUSPECT_OTHER_DESCRIPTION, recoded values that were obviously missing to 'NA/NONE/UNKNOWN'.

  • In DEMEANOR_OF_PERSON_STOPPED, corrected some misspellings and recoded observations to 'NONE' if no demeanor was noted.

  • Organized STOP_LOCATION_BORO_NAME so that it represents one of the five boroughs of NYC, or "unknown" if it was not clear how to classify the stop.

  • Changed the (null) values in SUSPECT_ARREST_OFFENSE to "No Arrest".

  • Finally, converted missing values in STOP_LOCATION_PRECINCT to 999, since the precincts are numbered from 1 to 123.
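The long runs of `.loc` recodes above can be consolidated with `Series.replace` and a mapping dict; a sketch for the race column, using values from the table above:

```python
import pandas as pd

race = pd.Series(["(null)", "MALE", "AMER IND", "ASIAN/PAC.ISL", "BLACK"])

race_map = {
    "(null)": "UNKNOWN",
    "MALE": "UNKNOWN",
    "AMER IND": "OTHER",
    "AMERICAN INDIAN/ALASKAN N": "OTHER",
    "AMERICAN INDIAN/ALASKAN NATIVE": "OTHER",
    "ASIAN / PACIFIC ISLANDER": "OTHER",
    "ASIAN/PAC.ISL": "OTHER",
}

# Values absent from the dict (e.g. BLACK) pass through unchanged.
race = race.replace(race_map)
print(race.tolist())  # ['UNKNOWN', 'UNKNOWN', 'OTHER', 'OTHER', 'BLACK']
```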

In [11]:
pattern = r'(\d{2})'

for i in range(len(sf_sub)):
    result = re.match(pattern, str(sf_sub["SUSPECT_REPORTED_AGE"][i]))
    if result:
        sf_sub.loc[i, "SUSPECT_REPORTED_AGE"] = int(sf_sub.loc[i, "SUSPECT_REPORTED_AGE"])
    else:
        sf_sub.loc[i, "SUSPECT_REPORTED_AGE"] = int(0)

Above, I converted the age column to integers. First I matched values that begin with two digits, then classified all other values as 0 (unknown), since the expectation is that the police would not stop anyone younger than 10 or older than 99.
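The same cleaning can be vectorized with pandas string methods instead of a Python loop; a sketch on toy age values (the two-digit rule mirrors the loop above):

```python
import pandas as pd

age = pd.Series(["25", "7", "(null)", "41"])

# Keep values that start with two digits; everything else becomes the 0 sentinel.
extracted = age.str.extract(r"^(\d{2})")[0]
age_int = pd.to_numeric(extracted, errors="coerce").fillna(0).astype(int)
print(age_int.tolist())  # [25, 0, 0, 41]
```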

In [12]:
sf_sub = sf_sub.query('SUSPECT_RACE_DESCRIPTION != ("UNKNOWN", "OTHER")')

Due to the low percentage of stops labeled with unknown or other races, I decided to keep only stops in which the person was described as Black, White Hispanic, White, or Black Hispanic.

Exploratory Data Analysis (EDA)

In [13]:
len(sf_sub)
Out[13]:
34908

The cleaned and updated dataset includes 34,908 observations, omitting the stops conducted on people who were not described as Black, White, Black Hispanic, or White Hispanic.

In [14]:
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="SUSPECT_RACE_DESCRIPTION", order=sf_sub["SUSPECT_RACE_DESCRIPTION"].value_counts(ascending=True).index);
ax.set_ylabel("Race")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Race", fontsize=16);

In the above barplot, we see that Black people are stopped substantially more than any other group. They represent over 50% of the total stops, even though they make up 25% of the population in NYC (NYC census data 2019).

In [15]:
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="SUSPECT_SEX", order=sf_sub["SUSPECT_SEX"].value_counts(ascending=True).index);
ax.set_ylabel("Sex")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Sex", fontsize=16);

The majority of stops are of males.

In [16]:
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="DAY2", order=sf_sub["DAY2"].value_counts(ascending=True).index);
ax.set_ylabel("Weekday")
ax.set_xlabel("Number of Stops")
ax.set_xticks(range(0,6001,500))
ax.set_title("Number of Stops By Weekday", fontsize=16);

Over 5,500 of the stops occurred on Saturdays, while fewer than 4,000 stops occurred on Mondays.

In [17]:
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="MONTH2", order=sf_sub["MONTH2"].value_counts(ascending=True).index);
ax.set_ylabel("Month")
ax.set_xlabel("Number of Stops")
ax.set_xticks(range(0,3501,250))
ax.set_title("Number of Stops By Month", fontsize=16);

The fewest stops were made in December, while the spring months of March, April, and May had the highest numbers of stops.

In [18]:
sf_sub_age_no_NA = sf_sub.query('SUSPECT_REPORTED_AGE > 0 ') # did not include unknown (null) values

sns.set(style="ticks", color_codes=True)
sns.despine(left=True)

fig, ax = plt.subplots(1, 1, figsize=(10,8))
sns.distplot(sf_sub_age_no_NA["SUSPECT_REPORTED_AGE"], kde=False, color="teal", bins = range(10,80,5), ax = ax)

ax.set_xlim(10,80)
ax.set_xticks(range(10,80,5))
ax.set_yticks(range(0,8001,1000))
ax.set_ylabel("Number of Stops")
ax.set_xlabel("Age")
ax.set_title("Number of Stops By Age", fontsize=16);
<Figure size 864x648 with 0 Axes>

Above is a histogram of stops organized by age. We see a right-skewed distribution, where the majority of people stopped were between the ages of 15 and 30. I omitted people aged 80 or more from this histogram since there were so few observations.

In [19]:
len(sf_sub.query('SUSPECT_REPORTED_AGE > 80 ')) # only 7 people stopped were above 80 years old.
Out[19]:
7
In [20]:
sns.set(style="darkgrid", color_codes=True)
ax = sns.countplot(data=sf_sub, y="STOP_LOCATION_BORO_NAME", order=sf_sub["STOP_LOCATION_BORO_NAME"].value_counts(ascending=True).index);
ax.set_ylabel("Borough")
ax.set_xlabel("Number of Stops")
ax.set_title("Number of Stops By Borough", fontsize=16);

We see that the largest number of stops was made in Brooklyn. This is expected, since Brooklyn has the highest population of all the boroughs (NYC census data). It is interesting that the second-largest number of stops was made in Manhattan, even though Queens is the second most populous borough and has about 600,000 more residents. Many NYC residents commute to and work in Manhattan, which could explain this discrepancy between the number of stops and the population of each borough.

In [21]:
sf_sub_age_no_NA_U = sf_sub_age_no_NA.query('SUSPECT_SEX != "UNKNOWN"')
fig = px.density_heatmap(sf_sub_age_no_NA_U, x="SUSPECT_RACE_DESCRIPTION", y="SUSPECT_REPORTED_AGE", facet_col="SUSPECT_SEX", color_continuous_scale="cividis")
fig.show()

In the above interactive visualization we see that the largest subgroup of people stopped were Black males aged 16-17 (1,978 stops). The largest subgroup of White males stopped were aged 30-31 (210 stops).

In [22]:
# Code from https://mode.com/example-gallery/python_dataframe_styling/

# Set CSS properties for th elements in dataframe
th_props = [
  ('font-size', '12px'),
  ('text-align', 'center'),
  ('font-weight', 'bold'),
  ('color', '#6d6d6d'),
  ('background-color', '#f7f7f9')
  ]

# Set CSS properties for td elements in dataframe
td_props = [
  ('font-size', '12px'),
  ('font-weight', 'bold')
  ]

# Set table styles
styles = [
  dict(selector="th", props=th_props),
  dict(selector="td", props=td_props)
  ]

# Set colormap equal to seaborns light green color palette
cm = sns.light_palette("limegreen", as_cmap=True)
In [23]:
race_crime = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECTED_CRIME_DESCRIPTION"]).size().unstack().T
(race_crime.style
  .background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
  .set_caption('Race and Alleged Crime Description')
  .format("{:,.0f}", na_rep="0")
  .set_table_styles(styles))
/opt/conda/lib/python3.7/site-packages/matplotlib/colors.py:527: RuntimeWarning:

invalid value encountered in less

Out[23]:
Race and Alleged Crime Description
SUSPECT_RACE_DESCRIPTION BLACK BLACK HISPANIC WHITE WHITE HISPANIC
SUSPECTED_CRIME_DESCRIPTION
ASSAULT 2,592 417 342 1,021
AUTO STRIPPIG 67 19 17 47
BURGLARY 1,213 215 458 631
CPSP 96 19 26 54
CPW 6,205 872 551 1,948
CRIMINAL MISCHIEF 342 45 82 150
CRIMINAL POSSESSION OF CONTROLLED SUBSTANCE 124 10 69 62
CRIMINAL POSSESSION OF FORGED INSTRUMENT 34 4 3 6
CRIMINAL POSSESSION OF MARIHUANA 365 61 24 166
CRIMINAL SALE OF CONTROLLED SUBSTANCE 115 25 47 80
CRIMINAL SALE OF MARIHUANA 57 9 6 23
CRIMINAL TRESPASS 1,020 160 202 482
FELONY 0 0 1 0
FORCIBLE TOUCHING 59 7 16 35
GRAND LARCENY 895 152 115 273
GRAND LARCENY AUTO 577 94 136 271
MAKING GRAFFITI 36 15 54 68
MENACING 401 67 56 163
MISD 1 2 0 1
MISDEMEANOR 1 2 0 1
MURDER 46 7 4 21
OTHER 871 130 152 321
PETIT LARCENY 1,870 216 449 607
PROSTITUTION 13 1 3 8
RAPE 43 8 8 15
RECKLESS ENDANGERMENT 142 18 7 44
ROBBERY 3,415 500 290 1,104
TERRORISM 7 1 17 3
THEFT OF SERVICES 96 8 25 30
UNAUTHORIZED USE OF A VEHICLE 114 18 106 88

In the above table we see the breakdown of the crime each stopped person was alleged to have committed, by race. For all four races, criminal possession of a weapon (CPW) is the number one reason for stopping someone.

In [24]:
CPW_only = sf_sub.query('SUSPECTED_CRIME_DESCRIPTION == "CPW"') # subset only CPW alleged suspects
CPW_weapon = CPW_only.groupby(["SUSPECT_RACE_DESCRIPTION", "WEAPON_FOUND_FLAG"]).size().unstack().T

(CPW_weapon.style
  .background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
  .set_caption('Race and Weapon Found for CPW')
  .format("{:,.0f}", na_rep="0")
  .set_table_styles(styles))
Out[24]:
Race and Weapon Found for CPW
SUSPECT_RACE_DESCRIPTION BLACK BLACK HISPANIC WHITE WHITE HISPANIC
WEAPON_FOUND_FLAG
N 5,046 701 425 1,506
Y 1,159 171 126 442

I decided to subset the dataset to further analyze the stops involving alleged CPW suspects, grouping by whether the police actually found a weapon on the suspect. In the above table, we see that for all four races, the police did not find a weapon in the majority of CPW stops. Furthermore, we see that the "hit rate" (the proportion of stops that yield a positive result (Gelman, A., et al., 2007)) is higher for White suspects than for Black suspects. This could be due to implicit or explicit bias toward Black people on the part of the police.
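The per-race hit rate can be read off directly by normalizing a crosstab; a sketch on toy stop records (the counts are illustrative, not from the data):

```python
import pandas as pd

# Toy stop records: one row per (race, weapon_found) pair.
df = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": ["BLACK", "BLACK", "BLACK", "WHITE", "WHITE"],
    "WEAPON_FOUND_FLAG":        ["N",     "N",     "Y",     "N",     "Y"],
})

# normalize="index" turns each race's row into proportions summing to 1;
# the "Y" column is then the hit rate for that race.
hit_rate = pd.crosstab(
    df["SUSPECT_RACE_DESCRIPTION"], df["WEAPON_FOUND_FLAG"], normalize="index"
)["Y"]
print(hit_rate["WHITE"])  # 0.5
```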

In [25]:
race_frisk = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","FRISKED_FLAG"]).size().unstack().T

(race_frisk.style
  .background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
  .set_caption('Race and Frisk')
  .format("{:,.0f}", na_rep="0")
  .set_table_styles(styles))
Out[25]:
Race and Frisk
SUSPECT_RACE_DESCRIPTION BLACK BLACK HISPANIC WHITE WHITE HISPANIC
FRISKED_FLAG
N 8,231 1,167 1,808 3,137
Y 12,586 1,935 1,458 4,586

Once an officer stops a person, they can frisk the person for weapons or contraband if they deem it necessary. In the above table we see how often an officer decided to frisk a person, broken down by race. Black and Hispanic people were much more likely to be frisked than White people: about 60% of Black suspects were frisked, as opposed to about 45% of White suspects.

In [26]:
sf_sub_M_or_F = sf_sub.query('SUSPECT_SEX != "UNKNOWN"') # subset only Male or Females
race_sex_frisk = sf_sub_M_or_F.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECT_SEX" ,"FRISKED_FLAG"]).size().unstack().T

(race_sex_frisk.style
  .background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
  .set_caption('Race, Sex, Frisk')
  .format("{:,.0f}", na_rep="0")
  .set_table_styles(styles))
Out[26]:
Race, Sex, Frisk
SUSPECT_RACE_DESCRIPTION BLACK BLACK HISPANIC WHITE WHITE HISPANIC
SUSPECT_SEX FEMALE MALE FEMALE MALE FEMALE MALE FEMALE MALE
FRISKED_FLAG
N 1,246 6,954 141 1,019 303 1,499 453 2,679
Y 550 11,992 71 1,856 110 1,342 249 4,320

The above table further breaks down whether a suspect was frisked, based on their race and sex. Overall, females were less likely to be frisked, and White females were much less likely to be frisked than Hispanic or Black females. This table also shows the stark disparity in frisks between White suspects and Hispanic or Black suspects.

In [27]:
race_arrest = sf_sub.groupby(["SUSPECT_RACE_DESCRIPTION","SUSPECT_ARRESTED_FLAG"]).size().unstack().T
race_arrest

(race_arrest.style
  .background_gradient(cmap=cm, subset=['BLACK', 'BLACK HISPANIC', 'WHITE', 'WHITE HISPANIC'])
  .set_caption('Race and Arrest')
  .format("{:,.0f}", na_rep="0")
  .set_table_styles(styles))
Out[27]:
Race and Arrest
SUSPECT_RACE_DESCRIPTION BLACK BLACK HISPANIC WHITE WHITE HISPANIC
SUSPECT_ARRESTED_FLAG
N 14,748 2,165 2,209 5,179
Y 6,069 937 1,057 2,544

Here we see what proportion of stops led to arrests. About 30% of stops involving Black or Black Hispanic people led to an arrest, while about 32% of stops involving White or White Hispanic people did. Overall, about 70% of people stopped were found to be innocent.

In [28]:
fig = px.treemap(sf_sub, path=["STOP_LOCATION_BORO_NAME","STOP_LOCATION_PRECINCT", "SUSPECT_RACE_DESCRIPTION" ])
fig.show()

In the above visualization, I created an interactive treemap. First the data is broken down into the five boroughs (with values for which no borough was indicated marked as 'UNKNOWN'). Within each borough we can see how many stops there were in each precinct, and within each precinct, the number of people of each race who were stopped. Of the 77 precincts, only 13 most often stop people who are not Black. There has been considerable work analyzing precinct-level stops and the associated crime rates within each precinct (Levchak P.J. 2017).
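A count like the "13 of 77 precincts" figure can be computed with a groupby plus `idxmax`; a sketch on toy records (precinct numbers and races are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "STOP_LOCATION_PRECINCT": [1, 1, 1, 2, 2, 2],
    "SUSPECT_RACE_DESCRIPTION": ["BLACK", "BLACK", "WHITE",
                                 "WHITE", "WHITE", "BLACK"],
})

# Most common race per precinct, then count precincts where it is not BLACK.
top_race = (
    df.groupby("STOP_LOCATION_PRECINCT")["SUSPECT_RACE_DESCRIPTION"]
      .agg(lambda s: s.value_counts().idxmax())
)
n_non_black = (top_race != "BLACK").sum()
print(n_non_black)  # 1
```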

Sentiment Analysis

In these electronic forms there are 2 columns on which we can perform sentiment analysis. Sentiment analysis can be used to quantify the emotional content of a text. Here I used VADER, a model that measures the negative, positive, neutral, and overall (compound) sentiment intensity of a text (Hutto, C.J. et al., 2014). The first column I chose to run sentiment analysis on was DEMEANOR_OF_PERSON_STOPPED. Below is a list of the top 10 responses for the demeanor of a person stopped.

In [29]:
sf_sub["DEMEANOR_OF_PERSON_STOPPED"].value_counts().head(10)
Out[29]:
CALM             7280
NERVOUS          3384
NONE             2274
CPW              1817
UPSET            1525
ROBBERY          1480
NORMAL           1453
COOPERATIVE       793
PETIT LARCENY     727
COMPLIANT         648
Name: DEMEANOR_OF_PERSON_STOPPED, dtype: int64
In [30]:
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
In [31]:
sid.polarity_scores("CALM")
Out[31]:
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.3182}
In [32]:
sid.polarity_scores("NERVOUS")
Out[32]:
{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.2732}
In [33]:
sid.polarity_scores("NONE")
Out[33]:
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Above we see the VADER scores for the top 3 responses for DEMEANOR_OF_PERSON_STOPPED. The compound (overall) score ranges from -1 to 1: the closer to -1, the more negative the sentiment, and the closer to 1, the more positive. "CALM" had a compound score of 0.3182, "NERVOUS" had a compound score of -0.2732, and "NONE" (indicating no response was recorded for that stop) had a compound score of 0.0. These scores make sense, since calm is usually positive, nervous is usually negative, and none is fairly neutral.
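For downstream grouping, compound scores can be binned into coarse labels. The cutoffs of +/-0.05 below are the ones commonly cited in the vaderSentiment documentation, not something fixed by this dataset; a small pure-Python sketch:

```python
def classify_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label,
    using the commonly cited +/-0.05 cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores reported above for the top three demeanor responses.
for word, score in [("CALM", 0.3182), ("NERVOUS", -0.2732), ("NONE", 0.0)]:
    print(word, classify_compound(score))
```

Under this convention "CALM" is labeled positive, "NERVOUS" negative, and "NONE" neutral, matching the interpretation above.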

In [34]:
sf_sub["SUSPECT_OTHER_DESCRIPTION"].value_counts().head(10)
Out[34]:
NA/NONE/UNKNOWN    25632
BLACK                476
TATTOO               276
BEARD                188
TATTOOS              121
BLACK JACKET         102
WHITE                 93
GOATEE                59
BLUE J                56
GLASSES               55
Name: SUSPECT_OTHER_DESCRIPTION, dtype: int64

Above are the top responses for the variable SUSPECT_OTHER_DESCRIPTION. Below I create new columns to hold each score for the two variables I will run sentiment analysis on.

In [35]:
sf_sub['DEMEANOR_NEGATIVE'] = 0.0
sf_sub['DEMEANOR_NEUTRAL'] = 0.0
sf_sub['DEMEANOR_POSITIVE'] = 0.0
sf_sub['DEMEANOR_COMPOUND'] = 0.0
In [36]:
sf_sub["SUSPECT_DESCR_NEGATIVE"] = 0.0
sf_sub["SUSPECT_DESCR_NEUTRAL"] = 0.0
sf_sub["SUSPECT_DESCR_POSITIVE"] = 0.0
sf_sub["SUSPECT_DESCR_COMPOUND"] = 0.0
In [37]:
# Score each recorded demeanor and store the four VADER components.
for i in sf_sub.index:
    ss = sid.polarity_scores(sf_sub.loc[i, "DEMEANOR_OF_PERSON_STOPPED"])
    sf_sub.loc[i, 'DEMEANOR_NEGATIVE'] = ss['neg']
    sf_sub.loc[i, 'DEMEANOR_NEUTRAL'] = ss['neu']
    sf_sub.loc[i, 'DEMEANOR_POSITIVE'] = ss['pos']
    sf_sub.loc[i, 'DEMEANOR_COMPOUND'] = ss['compound']
In [38]:
# Score each suspect description and store the four VADER components.
for i in sf_sub.index:
    ss = sid.polarity_scores(sf_sub.loc[i, "SUSPECT_OTHER_DESCRIPTION"])
    sf_sub.loc[i, "SUSPECT_DESCR_NEGATIVE"] = ss['neg']
    sf_sub.loc[i, "SUSPECT_DESCR_NEUTRAL"] = ss['neu']
    sf_sub.loc[i, "SUSPECT_DESCR_POSITIVE"] = ss['pos']
    sf_sub.loc[i, "SUSPECT_DESCR_COMPOUND"] = ss['compound']
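The row-by-row loops above work, but pandas' `apply` offers a more idiomatic route: score each text once and expand the returned dict into columns. A sketch using a stand-in scorer (so it runs without the NLTK lexicon download); the real code would pass `sid.polarity_scores` instead:

```python
import pandas as pd

# Stand-in for sid.polarity_scores with the same dict shape,
# so the apply pattern can be shown without the NLTK download.
def fake_polarity_scores(text):
    lexicon = {"CALM": 0.3182, "NERVOUS": -0.2732}  # made-up mini lexicon
    c = lexicon.get(text, 0.0)
    return {"neg": max(-c, 0.0), "neu": 1.0 if c == 0 else 0.0,
            "pos": max(c, 0.0), "compound": c}

df = pd.DataFrame({"DEMEANOR_OF_PERSON_STOPPED": ["CALM", "NERVOUS", "NONE"]})

# apply returns a Series of dicts; apply(pd.Series) expands each dict
# into one column per score, which we rename and join back onto df.
scores = (df["DEMEANOR_OF_PERSON_STOPPED"]
          .apply(fake_polarity_scores)
          .apply(pd.Series)
          .rename(columns={"neg": "DEMEANOR_NEGATIVE",
                           "neu": "DEMEANOR_NEUTRAL",
                           "pos": "DEMEANOR_POSITIVE",
                           "compound": "DEMEANOR_COMPOUND"}))
df = df.join(scores)
```

This avoids the per-row `.loc` writes and the hard-coded `range(36096)` bound, and scores only rows that actually exist in the index.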
In [39]:
sf_sub.describe()
Out[39]:
YEAR2 STOP_LOCATION_PRECINCT DEMEANOR_NEGATIVE DEMEANOR_NEUTRAL DEMEANOR_POSITIVE DEMEANOR_COMPOUND SUSPECT_DESCR_NEGATIVE SUSPECT_DESCR_NEUTRAL SUSPECT_DESCR_POSITIVE SUSPECT_DESCR_COMPOUND
count 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000 34908.000000
mean 2018.055174 78.569296 0.295825 0.445012 0.259163 -0.033712 0.004796 0.986874 0.008302 0.001581
std 0.830868 1122.062035 0.438461 0.471766 0.426276 0.291886 0.056814 0.089735 0.068995 0.051607
min 2017.000000 1.000000 0.000000 0.000000 0.000000 -0.927000 0.000000 0.000000 0.000000 -0.848100
25% 2017.000000 34.000000 0.000000 0.000000 0.000000 -0.273200 0.000000 1.000000 0.000000 0.000000
50% 2018.000000 62.000000 0.000000 0.227000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
75% 2019.000000 84.000000 0.861000 1.000000 0.697000 0.318200 0.000000 1.000000 0.000000 0.000000
max 2019.000000 208760.000000 1.000000 1.000000 1.000000 0.838400 1.000000 1.000000 1.000000 0.815900

Here we see the descriptive statistics for each column created using the VADER model for sentiment analysis.

In [40]:
fig = px.box(sf_sub, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="FRISKED_FLAG")
fig.show()

The above boxplots are organized by race on the x axis; the y axis shows the compound score for the demeanor of the person stopped. In red is the score of a person who was frisked, and in blue the score of a person who was not frisked. An interesting pattern to note is that the distribution of scores is nearly identical across the 4 races for people who were not frisked, but differs for people who were frisked. A White person who was frisked had about the same score as a White person who was not frisked, but that is not the case for the other 3 races. Black, White Hispanic, and Black Hispanic people who were stopped AND frisked had lower compound VADER scores than people of the same race who were not frisked.

In [41]:
fig = px.box(sf_sub, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="SUSPECT_ARRESTED_FLAG")
fig.show()

The above boxplots are again organized by race on the x axis, with the compound demeanor score on the y axis. In red is the score of a person who was arrested, and in blue the score of a person who was not arrested. We see a similar pattern to the previous boxplots: the distribution of scores is nearly identical across the 4 races for people who were not arrested, but differs for people who were arrested. A White person who was arrested had about the same score as a White person who was not arrested. On the other hand, Black, White Hispanic, and Black Hispanic people who were stopped AND arrested had lower compound VADER scores than people of the same race who were not arrested.

In [42]:
sf_sub_M_or_F = sf_sub.query('SUSPECT_SEX != "UNKNOWN"')
fig = px.box(sf_sub_M_or_F, x="SUSPECT_RACE_DESCRIPTION", y='DEMEANOR_COMPOUND', color="SUSPECT_SEX")
fig.show()

The above boxplots are organized by race on the x axis, with the compound demeanor score on the y axis. In red is the score of male suspects, and in blue the score of female suspects. The scores are similar across the 4 races for male suspects. Compared to the male suspects, female suspects of every race received lower VADER scores.
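The boxplot centers are medians, so the sex gap can be quantified directly with a grouped median. A minimal sketch on toy rows (column names follow the dataset; the real call would use `sf_sub_M_or_F`):

```python
import pandas as pd

# Toy rows standing in for sf_sub_M_or_F.
toy = pd.DataFrame({
    "SUSPECT_RACE_DESCRIPTION": ["BLACK", "BLACK", "WHITE", "WHITE"],
    "SUSPECT_SEX": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "DEMEANOR_COMPOUND": [0.32, -0.27, 0.32, 0.0],
})

# Median compound score per race-sex cell: the summary statistic
# the boxplot centers display.
medians = (toy.groupby(["SUSPECT_RACE_DESCRIPTION", "SUSPECT_SEX"])
              ["DEMEANOR_COMPOUND"].median())
print(medians)
```

Tabulating these medians alongside group sizes would also make it easy to check whether the female gap rests on small samples, as suggested in the discussion below.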

Discussion and Conclusions

In the past 3 years the NYPD stopped young Black males most often. Over 50% of the total stops were of Black people and over 90% of the stops were of males. Geographically, the most stops were made in Brooklyn. Looking at the borough and precinct treemap, we can see that in Brooklyn 19 of the 23 precincts predominantly stopped Black people. It would be interesting to see the racial make-up of those precinct neighborhoods in order to compare the racial population of the community with this dataset. For example, in precinct 67, excluding unknown or other races, about 95% of the stops were of Black people. What is the racial population of that neighborhood? Could tourists or people commuting to that neighborhood for work during the day change that racial population?

The sentiment analysis using the VADER model provided some insightful results. Why is there a difference between the boxplots of frisked and not-frisked people for Black and Hispanic people but no difference for White people? The same observation was made for arrests. Further analysis of the text from the DEMEANOR_OF_PERSON_STOPPED variable would be of interest. The difference by sex across all racial groups is also evident: females are recorded as having a more negative demeanor than males. This could be due to the small sample size of females who are stopped.

There are numerous avenues that could be researched for further analysis from this dataset. Some possible questions: When are people of different races stopped during the day? For each race, what proportion of stops lead to physical force? In the precincts that do not stop Black people the most, what is the racial makeup of that precinct neighborhood?

There are a few caveats to address with this dataset. First, these are only the stops recorded and reported by the NYPD. Numerous stops may go unrecorded by officers for various reasons, which could alter the data. Also, this data is entered by the officers themselves, so they are essentially the data collectors. It would be important to somehow validate some of these stops, possibly by reviewing police body-cam footage.

This analysis further shows the explicit and implicit bias police have towards Black people. It is imperative during this time that we all evaluate our biases toward people of different races. Police should be held to the highest standards, since they are tasked with keeping our communities safe and are in positions of authority. Unfortunately, research has shown that these sometimes invasive and frequent stops have had detrimental health effects on minorities in racially diverse communities (Sewell A.A. et. al., 2016). Research to uncover biases towards minorities is extremely important, and we as statisticians should do our part to help fight the power.

References

  • Bandes, SA, Pryor, M, Kerrison, EM, Goff, PA. (2019). The mismeasure of Terry stops: Assessing the psychological and emotional harms of stop and frisk to individuals and communities. Behav Sci Law; 37: 176– 194.
  • Gelman, A., Fagan, J., & Kiss, A. (2007). An analysis of the New York City Police Department's “stop‐and‐frisk” policy in the context of claims of racial bias. Journal of the American Statistical Association, 102(479), 813–823.
  • Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
  • Levchak P.J., (2017) Do Precinct Characteristics Influence Stop-and-Frisk in New York City? A Multi-Level Analysis of Post-Stop Outcomes, Justice Quarterly, 34:3, 377-406.
  • NYC census data (2019). Available at https://www.census.gov/quickfacts/fact/table/newyorkcitynewyork,NY/PST045219
  • NYC crime map (2019). Available at https://maps.nyc.gov/crime/
  • NYC Stop and Frisk Data. Available at https://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page
  • NYCLU Stop‐and‐Frisk Data (2019). Available at https://www.nyclu.org/en/publications/stop-and-frisk-de-blasio-era-2019
  • Rengifo, A., & Fowler, K. (2016). Stop, question, and complain: Citizen grievances against the NYPD and the opacity of police stops across New York City precincts, 2007–2013. Journal of Urban Health, 93, 32–41.
  • Sewell, A. A., Jefferson, K. A., & Lee, H. (2016). Living under surveillance: Gender, psychological distress, and stop‐question‐ and‐frisk policing in New York City. Social Science and Medicine, 159, 1–13.
  • Sewell, A. A., & Jefferson, K. A. (2016). Collateral Damage: The Health Effects of Invasive Police Encounters in New York City. Journal of urban health : bulletin of the New York Academy of Medicine, 93 Suppl 1(Suppl 1), 42–67.