Understanding Traffic Crashes in New York City

Authors

Ali Irtaza, Tarun Kumanduri

Published

December 8, 2024

Introduction

Motor vehicle collisions are a serious concern for urban traffic safety, and understanding them is key to reducing injuries and fatalities. In a crowded city like New York, traffic accidents happen frequently, affecting not only drivers but also pedestrians, cyclists, and passengers. Thousands of crashes occur each year, making collisions a major public safety issue. This project analyzes and visualizes data from the Motor Vehicle Collisions dataset, which includes detailed information on reported accidents.

By exploring this data, the project uncovers the time frames where deaths and injuries are highest, identifies the top contributing factors to collisions, and examines how collision trends have changed over time. These insights aim to provide a deeper understanding of traffic safety challenges and inform strategies to reduce accidents and save lives.

Research Question

What are the key contributing factors to motor vehicle collisions in NYC, and how do these factors vary across boroughs? Is there a relationship between the time of day and the number of injuries and deaths? How have collision rates and patterns in New York City evolved over the years? Which types of vehicles are involved in the highest number of collisions?

Data Sources

The dataset contains motor vehicle collision records that are updated daily, starting from August 1, 2012, with continuous data from April 1, 2016, to September 18, 2024. Provided by the New York Police Department (NYPD), it includes detailed information on crash times, dates, demographics, injuries, and fatalities (categorized by persons, pedestrians, cyclists, and motorists), contributing factors, and vehicle types involved. Location details such as latitude, longitude, borough, ZIP codes, and specific crash sites are also available. The dataset covers both single and multiple-vehicle crashes, offering insights into various collision factors. While comprehensive, it has incomplete values in fields such as location, contributing factors, and vehicle types, owing to the nature of incident reporting.

Data File

Motor_Vehicle_Collisions_-_Crashes_20240918.csv

Date Downloaded

September 18, 2024

Downloaded Source of Data

The data was downloaded from the NYC Open Data website, which is officially maintained by the government of New York City. Its aim is to make public data from New York City agencies and organizations available through one web portal, promote the use of open data, encourage engagement, and empower agencies. The website from which we collected this data is here.

Original Source of Data

The original data was collected by the NYPD and made available on the NYC Open Data portal. The data is derived from collision reports, likely including paper forms like MV-104AN.

Missing Data

Some data fields, particularly location, contributing factors, and vehicle types, may have missing values. Missing data could result from incomplete incident reporting or errors during data entry or transcription.
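The extent of missingness can be checked directly before any analysis. The following is a minimal sketch on a small invented data frame: the values are made up and only the column names come from the dataset. It also counts the literal value "Unspecified" separately, on the assumption that it is a recorded category rather than a true missing value.

```r
# Toy stand-in for the collisions table; all values are invented for illustration.
toy <- data.frame(
  borough = c("BROOKLYN", NA, "BRONX", NA),
  zip_code = c(11208, NA, 10475, NA),
  contributing_factor_vehicle_1 = c("Unsafe Speed", "Unspecified", NA, "Unspecified"),
  stringsAsFactors = FALSE
)

# Share of NA values in each column.
missing_share <- colMeans(is.na(toy))

# "Unspecified" is stored as text, not NA, so is.na() does not catch it.
# %in% treats NA as "not a match", so the denominator is all rows.
unspecified_share <- mean(toy$contributing_factor_vehicle_1 %in% "Unspecified")

print(missing_share)
print(unspecified_share)
```

Running the same two calculations on the full table would show which fields are affected most and how much of the contributing-factor field is effectively uninformative.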

Potential Biases

The data may have inherent biases due to how it was collected. These biases could include underreporting of certain types of collisions (e.g., minor collisions not reported to the police), differences in reporting accuracy based on the time of day or police staffing levels, and potential socioeconomic or geographic disparities in reporting.

Data Preprocessing

Code
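# Packages assumed from the functions used in this report
# (read_parquet() is taken to be arrow's; adjust if another reader was used)
library(tidyverse)
library(lubridate)
library(arrow)
library(here)

# Load the dataset.
# The raw CSV file was converted to Parquet and saved in the data_processed folder;
# a drive link for downloading the raw dataset is shared separately.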

data <- read_parquet(here('data_processed','motor_vehicle.parquet'))

data <- data %>%
  mutate(crash_date = as.Date(crash_date, format = '%m/%d/%Y'))

data <- data %>%
  mutate(crash_hour = hour(hms(crash_time)))

data
#> # A tibble: 2,119,946 × 30
#>    crash_date crash_time borough   zip_code latitude longitude location         
#>    <date>     <time>     <chr>        <dbl>    <dbl>     <dbl> <chr>            
#>  1 2021-09-11 02:39      <NA>            NA     NA        NA   <NA>             
#>  2 2022-03-26 11:45      <NA>            NA     NA        NA   <NA>             
#>  3 2022-06-29 06:55      <NA>            NA     NA        NA   <NA>             
#>  4 2021-09-11 09:35      BROOKLYN     11208     40.7     -73.9 (40.667202, -73.…
#>  5 2021-12-14 08:13      BROOKLYN     11233     40.7     -73.9 (40.683304, -73.…
#>  6 2021-04-14 12:47      <NA>            NA     NA        NA   <NA>             
#>  7 2021-12-14 17:05      <NA>            NA     40.7     -74.0 (40.709183, -73.…
#>  8 2021-12-14 08:17      BRONX        10475     40.9     -73.8 (40.86816, -73.8…
#>  9 2021-12-14 21:10      BROOKLYN     11207     40.7     -73.9 (40.67172, -73.8…
#> 10 2021-12-14 14:58      MANHATTAN    10017     40.8     -74.0 (40.75144, -73.9…
#> # ℹ 2,119,936 more rows
#> # ℹ 23 more variables: on_street_name <chr>, cross_street_name <chr>,
#> #   off_street_name <chr>, number_of_persons_injured <dbl>,
#> #   number_of_persons_killed <dbl>, number_of_pedestrians_injured <dbl>,
#> #   number_of_pedestrians_killed <dbl>, number_of_cyclist_injured <dbl>,
#> #   number_of_cyclist_killed <dbl>, number_of_motorist_injured <dbl>,
#> #   number_of_motorist_killed <dbl>, contributing_factor_vehicle_1 <chr>, …

Analysis

Deaths caused in New York City

Code
# Fold any hour recorded as 24 back to 0 (midnight); values 0-23 and NA are unchanged
data$crash_hour <- data$crash_hour %% 24


killed_hour <- data %>%
  filter(!is.na(crash_hour)) %>%
  group_by(crash_hour) %>%
  summarise(deaths = sum(number_of_persons_killed, na.rm = TRUE)) %>%
  arrange(crash_hour)

max_deaths <- killed_hour %>%
  filter(deaths == max(deaths))

deaths_chart <- ggplot(killed_hour, aes(x = crash_hour, y = deaths)) +
  geom_line(color = "#0073C2FF", size = 1.1) +
  geom_point(color = "#0073C2FF", size = 2) +
  geom_point(data = max_deaths, aes(x = crash_hour, y = deaths), 
             color = "#FF5733", size = 5, shape = 21, fill = "#FF5733") + 
  geom_label(data = max_deaths, aes(x = crash_hour, y = deaths, label = paste("Peak:", deaths)),
             nudge_y = 8,nudge_x =1.05 , fill = "#FFF9C4", color = "black", size = 4, label.size = 0.2) +
  geom_vline(data = max_deaths, aes(xintercept = crash_hour), linetype = "dotted", color = "#FF5733", size = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  theme_minimal(base_size = 14) + 
  labs(
    title = "NYC Crash Deaths Reach Their Highest at 4 AM",
    subtitle = "Highlighting peak times for deaths during a 24-hour cycle in NYC",
    x = "Time of the Day",
    y = "Number of Fatalities"
  ) +
  scale_x_continuous(
    breaks = 0:23, 
    labels = c("12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", 
               "6 AM", "7 AM", "8 AM", "9 AM", "10 AM", "11 AM", 
               "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", 
               "6 PM", "7 PM", "8 PM", "9 PM", "10 PM", "11 PM")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1, size = 13),
    axis.text.y = element_text(size = 13),
    axis.title = element_text(size = 11),
    plot.title = element_text(size = 20, hjust = 0.5, color = "#0073C2FF",face = "bold"),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "black"),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(color = "lightgray", linetype = "dotted"),
    panel.grid.minor = element_blank(),
    plot.margin = margin(1, 1, 1, 1, "cm")
  )

deaths_chart

This chart illustrates the number of fatalities from motor vehicle collisions in New York City across a 24-hour period. The data reveals a clear peak at 4:00 a.m., with 176 deaths recorded during this hour, making it the most dangerous time for fatal crashes.

The spike at 4:00 a.m. may be attributed to factors such as driver fatigue, reduced visibility at night, or an increased likelihood of impaired driving after late-night activities. Additionally, lower traffic volumes during these hours might lead to higher vehicle speeds, increasing the severity of crashes. Understanding these time-specific patterns is essential for informing targeted safety measures, such as increased law enforcement presence or awareness campaigns during high-risk hours.
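One way to probe the severity explanation is to normalize deaths by the number of crashes in each hour, since raw counts alone cannot separate "more crashes" from "deadlier crashes". The sketch below does this on invented numbers. It is not part of the original analysis, and `deaths_per_1000` is a hypothetical name introduced here.

```r
# Toy crash records; hours and death counts are invented for illustration.
toy <- data.frame(
  crash_hour = c(4, 4, 17, 17, 17, 17),
  number_of_persons_killed = c(1, 0, 0, 0, 1, 0)
)

# Total deaths and number of crashes per hour.
deaths  <- tapply(toy$number_of_persons_killed, toy$crash_hour, sum)
crashes <- tapply(toy$number_of_persons_killed, toy$crash_hour, length)

# Deaths per 1,000 crashes: a rough severity measure for each hour.
deaths_per_1000 <- 1000 * deaths / crashes

print(deaths_per_1000)
```

In this toy example both hours have one death, but the 4 a.m. hour has half as many crashes, so its severity measure is twice as high. Whether the real data shows the same pattern would need to be checked against the full table.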

With fatalities peaking at 4 a.m., understanding the contributing factors to these crashes is critical. The next analysis reveals which factors play the most significant role in fatal traffic incidents across NYC.

Code
crash_factors_killed <- data %>%
  mutate(contributing_factor = coalesce(contributing_factor_vehicle_1, contributing_factor_vehicle_2, 
                                      contributing_factor_vehicle_3, contributing_factor_vehicle_4, 
                                      contributing_factor_vehicle_5)) %>%
    mutate(contributing_factor = recode(contributing_factor,
    "Pedestrian/Bicyclist/Other Pedestrian Error/Confusion" = "Pedestrian/Cyclist Confusion"
  )) %>% 
  group_by(contributing_factor) %>%
  summarise(total_killed = sum(number_of_persons_killed, na.rm = TRUE)) %>%
  filter(!is.na(contributing_factor), total_killed > 0) %>% 
  filter(contributing_factor != 'Unspecified') %>%
  arrange(desc(total_killed)) %>%
  slice_head(n = 12)

max_killed <- max(crash_factors_killed$total_killed)
plot_limit <- max_killed * 1.15

crash_factors_killed <- crash_factors_killed %>%
  mutate(highlight = case_when(
    row_number() == 1 ~ "Highest",
    row_number() == 2 ~ "Second Highest",
    TRUE ~ "Others"
  ))

custom_colors <- c("Highest" = "#D32F2F",    # Dark Red
                   "Second Highest" = "#FF7043",  # Bright Orange
                   "Others" = "lightblue")    # Light Blue


ggplot(crash_factors_killed, 
       aes(x = reorder(contributing_factor, total_killed), 
           y = total_killed, 
           fill = highlight)) +
  geom_bar(stat = "identity") +
  coord_flip(clip = "off") +
  scale_y_continuous(
    limits = c(0, plot_limit),
    expand = expansion(mult = c(0, 0.1))
  ) +
  scale_fill_manual(values = custom_colors, guide = "none") +
  geom_text(
    aes(label = total_killed),
    hjust = -0.2,
    size = 3.5,
    color = "black"
  ) +
  labs(
    title = "Speeding and Inattention Lead to the Most Deaths in Traffic Crashes in NYC",
    x = NULL,
    y = "Total Fatalities"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16, margin = margin(b = 10)),
    axis.text.y = element_text(size = 9),
    axis.text.x = element_text(size = 9),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(r = 50, l = 10, t = 20, b = 20)
  )

This chart highlights the top contributing factors to fatal crashes in New York City, with “Unsafe Speed” and “Driver Inattention/Distraction” emerging as the leading causes, accounting for 420 and 382 fatalities, respectively. Other significant factors include “Traffic Control Disregarded” and “Failure to Yield Right-of-Way,” each contributing to 279 fatalities. These findings underscore the critical role of speeding and driver distraction in fatal crashes, suggesting a need for stricter enforcement of speed limits and campaigns to reduce distracted driving behaviors. Addressing these key factors could significantly enhance road safety in NYC.
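The factor assigned to each crash in the code above comes from `coalesce()`, which returns the first non-missing value across the five vehicle columns. The sketch below, on invented rows, shows a consequence worth keeping in mind: "Unspecified" is text rather than `NA`, so when it appears for vehicle 1 it is selected and the row is later filtered out, even if vehicle 2 carries a specific factor. The `first_specified` column is a possible alternative offered as an assumption, not the method used in this report.

```r
library(dplyr)

# Toy rows; values are invented, column names follow the dataset.
toy <- tibble(
  contributing_factor_vehicle_1 = c("Unspecified", NA, "Unsafe Speed"),
  contributing_factor_vehicle_2 = c("Driver Inattention/Distraction",
                                    "Backing Unsafely", "Unspecified")
)

result <- toy %>%
  mutate(
    # As in the report: first non-NA value, so "Unspecified" for vehicle 1 is kept.
    first_non_na = coalesce(contributing_factor_vehicle_1,
                            contributing_factor_vehicle_2),
    # Alternative (assumption): convert "Unspecified" to NA before coalescing.
    first_specified = coalesce(
      na_if(contributing_factor_vehicle_1, "Unspecified"),
      na_if(contributing_factor_vehicle_2, "Unspecified")
    )
  )

print(result)
```

For the first toy row, `first_non_na` is "Unspecified" while `first_specified` is "Driver Inattention/Distraction". The two approaches can therefore give different totals per factor.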

To understand how these contributing factors have evolved over time, the next chart compares the impact of these factors in 2013 and 2023, providing insights into shifting traffic safety challenges and priorities over the past decade.

Code
factors_data <- data %>%
    mutate(contributing_factor = coalesce(contributing_factor_vehicle_1, contributing_factor_vehicle_2, 
                                      contributing_factor_vehicle_3, contributing_factor_vehicle_4, 
                                      contributing_factor_vehicle_5)) %>%
    mutate(contributing_factor = recode(contributing_factor,
    "Pedestrian/Bicyclist/Other Pedestrian Error/Confusion" = "Pedestrian/Cyclist Confusion")) %>% 
  filter(contributing_factor != "Unspecified") 

crash_factors_2013 <- factors_data %>%
  filter(format(crash_date, "%Y") == "2013") %>%
  mutate(
    total_killed = number_of_persons_killed
  ) %>%
  group_by(contributing_factor) %>%
  summarise(total_killed_2013 = sum(total_killed, na.rm = TRUE))

crash_factors_2023 <- factors_data %>%
  filter(format(crash_date, "%Y") == "2023") %>%
  mutate(
    total_killed = number_of_persons_killed
  ) %>%
  group_by(contributing_factor) %>%
  summarise(total_killed_2023 = sum(total_killed, na.rm = TRUE))

crash_factors_summary <- crash_factors_2013 %>%
  full_join(crash_factors_2023, by = "contributing_factor") %>%
  filter(!is.na(contributing_factor)) %>%
  arrange(desc(total_killed_2023))

crash_factors_summary %>% 
  slice_head(n = 20) %>% 
  ggplot(aes(y = fct_reorder(contributing_factor, total_killed_2023))) +
  geom_segment(aes(x = total_killed_2013, xend = total_killed_2023, yend = contributing_factor), 
               color = 'lightblue', size = 1) + 
  geom_point(aes(x = total_killed_2013, color = "Deaths in 2013"), size = 3) + 
  geom_point(aes(x = total_killed_2023, color = "Deaths in 2023"), size = 3) + 
  scale_color_manual(
    values = c("Deaths in 2013" = "steelblue", 
               "Deaths in 2023" = "red"),
    name = "Year"
  ) + 
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "How NYC Traffic Deaths Have Shifted Over a Decade (2013–2023)",
    x = "Number of Deaths",
    y = "Contributing Factor"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold",hjust = 0.5),
    legend.position = "right"
  )

This chart illustrates how the contributing factors to traffic fatalities in New York City have shifted over a decade (2013–2023). Key observations include a sharp rise in fatalities caused by “Unsafe Speed,” which emerged as the leading factor in 2023 compared to its minimal impact in 2013. Similarly, “Driver Inattention/Distraction” remained a consistent and significant contributor to deaths over the years. Other factors like “Traffic Control Disregarded” and “Failure to Yield Right-of-Way” also show persistent impacts, albeit with less dramatic changes.

The marked increase in fatalities linked to unsafe speed highlights evolving road safety challenges, possibly influenced by changing traffic conditions or behavior. This shift underscores the need for stricter speed enforcement and expanded public safety initiatives targeting driver awareness. Understanding these changes over time can guide more effective policymaking to address the most pressing risks on NYC roads.
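The size of each shift can also be computed as a number rather than read off the chart. The sketch below joins two invented yearly summaries and takes the difference. Because a full join leaves `NA` where a factor appears in only one year, the sketch replaces those with zero first. That replacement is an assumption about how absent factors should be treated, and `change` is a hypothetical column name.

```r
library(dplyr)

# Invented yearly totals; only the column names mirror the report's code.
y2013 <- tibble(
  contributing_factor = c("Unsafe Speed", "Fatigued/Drowsy"),
  total_killed_2013 = c(5, 2)
)
y2023 <- tibble(
  contributing_factor = c("Unsafe Speed", "Alcohol Involvement"),
  total_killed_2023 = c(40, 7)
)

comparison <- full_join(y2013, y2023, by = "contributing_factor") %>%
  mutate(
    # A factor absent in one year is counted as zero deaths for that year.
    across(starts_with("total_killed_"), ~ coalesce(.x, 0)),
    change = total_killed_2023 - total_killed_2013
  )

print(comparison)
```

Without the zero fill, factors present in only one year have no segment to draw and would be dropped from a dumbbell-style chart with a warning.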

While unsafe speed and driver distraction dominate as causes of fatalities, it’s important to explore how these factors compare to those contributing to injuries, which we analyze next.

Injuries caused in New York City

Code
injuries_hour <- data %>%
  filter(!is.na(crash_hour)) %>%
  group_by(crash_hour) %>%
  summarise(injuries = sum(number_of_persons_injured, na.rm = TRUE)) %>%
  arrange(crash_hour)

max_injuries <- injuries_hour %>%
  filter(injuries == max(injuries))

injuries_chart <- ggplot(injuries_hour, aes(x = crash_hour, y = injuries)) +
  geom_line(color = "#0073C2FF", size = 1.1) + 
  geom_point(color = "#0073C2FF", size = 2) +
  geom_point(data = max_injuries, aes(x = crash_hour, y = injuries), 
             color = "#FF5733", size = 5, shape = 21, fill = "#FF5733") + 
 geom_label(data = max_injuries, aes(x = crash_hour, y = injuries, label = paste("Peak:", injuries)),
           nudge_y = 2500,nudge_x = 1.5, fill = "#FFF9C4", color = "black", size = 4, label.size = 0.3, fontface = "bold") +
    geom_vline(data = max_injuries, aes(xintercept = crash_hour), linetype = "dotted", color = "#FF5733", size = 0.8)+

  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  theme_minimal(base_size = 14) + 
  labs(
    title = "NYC Crash Injuries Reach Their Highest at 5 PM",
    subtitle = "Highlighting peak times for injuries during a 24-Hour cycle in NYC",
    x = "Time of the Day",
    y = "Number of Injuries"
  ) +
  scale_x_continuous(
    breaks = 0:23, 
    labels = c("12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", 
               "6 AM", "7 AM", "8 AM", "9 AM", "10 AM", "11 AM", 
               "12 PM", "1 PM", "2 PM", "3 PM", "4 PM", "5 PM", 
               "6 PM", "7 PM", "8 PM", "9 PM", "10 PM", "11 PM")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  theme(
    axis.text.x = element_text(angle = 30, hjust = 1, size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 11),
    plot.title = element_text(size = 16, hjust = 0.5, color = "#0073C2FF"),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "black"),
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(color = "lightgray", linetype = "dotted"),
    panel.grid.minor = element_blank(),
    plot.margin = margin(1, 1, 1, 1, "cm")
  )

injuries_chart

Injuries are unevenly distributed across the 24-hour day, rising between 4:00 p.m. and 6:00 p.m. and peaking at 5:00 p.m. This window likely corresponds to the evening rush hour, when traffic density is highest and incidents with injuries become more frequent. The pattern suggests that time of day is a significant factor in injury occurrences, possibly because of the greater volume of vehicles on the road during commuting hours. The peak may also reflect other factors, such as visibility conditions, driver fatigue, or behavioral shifts at specific times like early morning or late afternoon. This trend supports our research question, highlighting how time of day affects injury counts on the road.

Building on this, the next chart examines the contributing factors for injuries, shedding light on the primary causes behind these incidents.

Code
crash_factors_injured <- data %>%
  mutate(contributing_factor = coalesce(contributing_factor_vehicle_1, contributing_factor_vehicle_2, 
                                      contributing_factor_vehicle_3, contributing_factor_vehicle_4, 
                                      contributing_factor_vehicle_5)) %>%
    mutate(contributing_factor = recode(contributing_factor,
    "Pedestrian/Bicyclist/Other Pedestrian Error/Confusion" = "Pedestrian/Cyclist Confusion"
  )) %>% 
  group_by(contributing_factor) %>%
  summarise(total_injured = sum(number_of_persons_injured, na.rm = TRUE)) %>%
  filter(!is.na(contributing_factor), total_injured > 0) %>% 
  filter(contributing_factor != 'Unspecified') %>%
  arrange(desc(total_injured)) %>%
  slice_head(n = 10)

crash_factors_injured <- crash_factors_injured %>%
  mutate(highlight = case_when(
    row_number() == 1 ~ "Highest",
    row_number() == 2 ~ "Second Highest",
    TRUE ~ "Others"
  ))

custom_colors <- c("Highest" = "#D32F2F",    # Dark Red
                   "Second Highest" = "#FF7043",  # Bright Orange
                   "Others" = "#B0BEC5")    # Grayish Blue

max_injured <- max(crash_factors_injured$total_injured)
plot_limit_injured <- max_injured * 1.15

ggplot(crash_factors_injured, 
       aes(x = reorder(contributing_factor, total_injured), 
           y = total_injured, 
           fill = highlight)) +
  geom_bar(stat = "identity") +
  coord_flip(clip = "off") +
  scale_y_continuous(
    limits = c(0, plot_limit_injured),
    expand = expansion(mult = c(0, 0.1))
  ) +
  scale_fill_manual(values = custom_colors, guide = "none") +
  geom_text(
    aes(label = total_injured),
    hjust = -0.2,
    size = 3.5,
    color = "black"
  ) +
  labs(
    title = "Driver Inattention and Failure to Yield Right-of-Way \nLead to the Most Injuries in Traffic Crashes in NYC",
    x = NULL,
    y = "Total Injuries"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 18, margin = margin(b = 10)),
    plot.subtitle = element_text(color = "grey40", size = 12, margin = margin(b = 20)),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(r = 50, l = 10, t = 20, b = 20)
  )

This chart highlights the top contributing factors to injuries in traffic crashes in New York City. “Driver Inattention/Distraction” is the leading cause, responsible for 144,234 injuries, followed by “Failure to Yield Right-of-Way” with 65,344 injuries. Other notable factors include “Following Too Closely” (44,998 injuries) and “Traffic Control Disregarded” (27,248 injuries).

These findings emphasize the dangers of distracted driving and failure to yield, the two largest contributors to injuries among the specified factors. Addressing these issues through driver education programs, stricter traffic law enforcement, and advanced vehicle technologies like collision detection systems could significantly reduce injuries on NYC roads. This data highlights critical areas for intervention to improve overall traffic safety.

To further analyze the evolution of these contributing factors, the next chart compares how the causes of traffic injuries in NYC have shifted between 2013 and 2023, providing insights into changes in driver behavior and urban mobility trends over the past decade.

Code
factors_data_injuries <- data %>%
  filter(!is.na(contributing_factor_vehicle_1)) %>%
  filter(contributing_factor_vehicle_1 != "Unspecified") %>% 
    mutate(contributing_factor_vehicle_1 = recode(contributing_factor_vehicle_1,
    "Pedestrian/Bicyclist/Other Pedestrian Error/Confusion" = "Pedestrian/Cyclist Confusion"))

crash_factors_injuries_2013 <- factors_data_injuries %>%
  filter(format(crash_date, "%Y") == "2013") %>%
  mutate(
    total_injured = number_of_persons_injured
  ) %>%
  group_by(contributing_factor_vehicle_1) %>%
  summarise(total_injured_2013 = sum(total_injured, na.rm = TRUE)) %>%
  rename(contributing_factor = contributing_factor_vehicle_1)

crash_factors_injuries_2023 <- factors_data_injuries %>%
  filter(format(crash_date, "%Y") == "2023") %>%
  mutate(
    total_injured = number_of_persons_injured
  ) %>%
  group_by(contributing_factor_vehicle_1) %>%
  summarise(total_injured_2023 = sum(total_injured, na.rm = TRUE)) %>%
  rename(contributing_factor = contributing_factor_vehicle_1)

crash_factors_injuries_summary <- crash_factors_injuries_2013 %>%
  full_join(crash_factors_injuries_2023, by = "contributing_factor") %>%
  filter(!is.na(contributing_factor)) %>% 
  arrange(desc(total_injured_2023))

crash_factors_injuries_summary %>%
  slice_head(n = 20) %>%
  ggplot(aes(y = fct_reorder(contributing_factor, total_injured_2023))) +
  geom_segment(aes(x = total_injured_2013, xend = total_injured_2023, yend = contributing_factor), 
               color = 'lightblue', size = 1) +
  geom_point(aes(x = total_injured_2013, color = "Injuries in 2013"), size = 3) +
  geom_point(aes(x = total_injured_2023, color = "Injuries in 2023"), size = 3) +
  scale_color_manual(
    values = c("Injuries in 2013" = "orange", 
               "Injuries in 2023" = "purple"),
    name = "Year"
  ) +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "How NYC Traffic Injuries Have Shifted Over a Decade (2013–2023)",
    x = "Number of Injuries",
    y = "Contributing Factor"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.position = "right"
  )

This chart shows how the contributing factors to traffic injuries in New York City have shifted over a decade (2013–2023). “Driver Inattention/Distraction” consistently remains the top contributor, with injuries from this factor increasing significantly in 2023. “Failure to Yield Right-of-Way” also shows a steady rise, maintaining its position as a major contributor to injuries. Other factors such as “Following Too Closely,” “Unsafe Speed,” and “Traffic Control Disregarded” demonstrate varying trends, though their impacts remain significant across the years.

The increasing role of driver distraction and failure to yield highlights persistent road safety challenges. The rise in injuries over the years could reflect evolving traffic conditions, changes in driver behavior, or higher traffic volumes. To further understand the spatial dynamics of these incidents, the next chart examines the top contributing factors to collisions across different boroughs, revealing how these factors vary geographically and offering insight into borough-specific traffic safety priorities.

Collisions caused in NYC

Code
top_factors_borough <- data %>%
  mutate(contributing_factor = coalesce(contributing_factor_vehicle_1, 
                                        contributing_factor_vehicle_2,
                                        contributing_factor_vehicle_3, 
                                        contributing_factor_vehicle_4, 
                                        contributing_factor_vehicle_5)) %>% 
  filter(contributing_factor != "Unspecified") %>%
  group_by(borough, contributing_factor) %>%
  summarise(total_collisions = n(), .groups = 'drop') %>%
  filter(!is.na(borough), !is.na(contributing_factor)) %>%
  arrange(desc(total_collisions)) %>%
  group_by(borough) %>%
  slice_max(order_by = total_collisions, n = 5) %>%
  ungroup() %>%
  arrange(borough, desc(total_collisions)) %>%
  mutate(contributing_factor = fct_rev(fct_inorder(contributing_factor)))

borough_totals <- top_factors_borough %>%
  group_by(borough) %>%
  summarise(total_incidents = sum(total_collisions)) %>%
  arrange(desc(total_incidents))

top_factors_borough <- top_factors_borough %>%
  mutate(borough = factor(borough, levels = borough_totals$borough))

ggplot(top_factors_borough, 
       aes(x = contributing_factor, y = total_collisions, fill = total_collisions)) +
  geom_bar(stat = "identity", width = 0.7) + 
  coord_flip() +
  facet_wrap(~ borough, nrow = 1, scales = "free_x") +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.05))) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(
    title = "Top Contributing Factors of Traffic Incidents by Borough in NYC",
    subtitle = "Driver Inattention and Failure to Yield Are Leading Causes in Most Boroughs",
    x = NULL, 
    y = "Number of Incidents",
    fill = "Total\nIncidents"
  ) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 14, face = "bold", hjust = 0.5),
    panel.border = element_rect(color = "black", fill = NA, size = 1),
    axis.text.x = element_text(size = 15, angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(size = 15),
    plot.title = element_text(face = "bold", size = 23, hjust = 0.5, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 15, color = "grey40", hjust = 0.5, margin = margin(b = 20)),
    legend.position = "right",
    legend.title = element_text(size = 10, face = "bold"),
    legend.text = element_text(size = 11),
    panel.grid.major.y = element_blank(), 
    panel.grid.minor = element_blank(),
    plot.margin = margin(20, 30, 20, 20)
  )

This chart breaks down the top contributing factors to traffic incidents by borough in New York City, revealing borough-specific trends. Across most boroughs, “Driver Inattention/Distraction” is the leading cause of incidents, with significant numbers in Queens, Brooklyn, and Manhattan. “Failure to Yield Right-of-Way” also emerges as a prominent factor, particularly in Staten Island and the Bronx. Other contributing factors like “Backing Unsafely” and “Following Too Closely” vary in significance depending on the borough.

The variation in contributing factors highlights the need for borough-specific traffic safety initiatives. For example, addressing driver distraction may require citywide campaigns, while targeting “Failure to Yield Right-of-Way” could involve enhanced enforcement at intersections in Staten Island and the Bronx. These insights provide a foundation for tailored safety measures to address unique traffic patterns and behaviors in each borough.
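The borough ranking rests on a "top N per group" step: count crashes for each borough and factor, then keep the largest counts within each borough. A minimal sketch on invented rows is shown below with N = 1. Note that `slice_max()` keeps ties by default, so a group can return more than N rows unless `with_ties = FALSE` is set. Setting it here is a choice made for the sketch, not something the report's code does.

```r
library(dplyr)

# Invented crash rows; one row per crash, as in the dataset.
toy <- tibble(
  borough = c("BRONX", "BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS"),
  contributing_factor = c("Driver Inattention/Distraction",
                          "Driver Inattention/Distraction",
                          "Backing Unsafely",
                          "Failure to Yield Right-of-Way",
                          "Failure to Yield Right-of-Way",
                          "Driver Inattention/Distraction")
)

top_factor <- toy %>%
  count(borough, contributing_factor, name = "total_collisions") %>%
  group_by(borough) %>%
  # Keep exactly one row per borough, even if two factors are tied.
  slice_max(order_by = total_collisions, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_factor)
```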

To better understand how these incidents have evolved over time, the next chart analyzes collision trends using a 30-day moving average. This provides a clear view of how the frequency of collisions has shifted, particularly in response to events like the COVID-19 pandemic and changing traffic patterns in NYC.
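A moving average replaces each daily count with the mean of the counts in a window around it, which smooths day-to-day noise. The sketch below illustrates the idea on an invented seven-day series with a window of 3 rather than 30, using base R's `stats::filter()` as a stand-in so that it runs without extra packages. It is an illustration of the technique, not the exact call used in this report.

```r
# Invented daily collision counts for one borough.
daily_counts <- c(10, 12, 11, 15, 14, 13, 20)

# Window width; 3 keeps the arithmetic easy to follow.
k <- 3

# Centered moving average: each value is the mean of itself and its neighbours.
# The first and last positions have no full window and come back as NA.
moving_avg <- as.numeric(stats::filter(daily_counts, rep(1 / k, k), sides = 2))

print(round(moving_avg, 2))
```

One caveat applies to any row-based window: it spans the last k observed rows, not k calendar days. If a borough has dates with no recorded crashes, those dates are absent from the series, and the window then covers a longer period than its width suggests.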

Code
collisions_over_time <- data %>%
  mutate(crash_date = as.Date(crash_date, format = "%m/%d/%Y")) %>%
  group_by(borough, crash_date) %>%
  summarise(total_collisions = n(), .groups = 'drop') %>%
  filter(!is.na(borough))

collisions_smoothed <- collisions_over_time %>%
  group_by(borough) %>%
  arrange(crash_date) %>%
  mutate(moving_avg = zoo::rollmean(total_collisions, k = 30, fill = NA))

custom_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "grey40"),
    axis.title = element_text(size = 11, face = "bold"),
    axis.text = element_text(size = 10),
    legend.position = "none", 
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
  )

label_positions <- collisions_smoothed %>%
  group_by(borough) %>%
  filter(crash_date == max(crash_date)) 


trend <- ggplot(collisions_smoothed, aes(x = crash_date, y = moving_avg, color = borough)) +
  geom_line(size = 1.1) + 
  geom_text(data = data.frame(x = as.Date("2025-02-17"), y = 57.6004265491554, label = "Brooklyn"),
  mapping = aes(x = x, y = y, label = label), size = 6.47, colour = "#FF5733" , fontface = 2, inherit.aes = FALSE)+ 
  geom_text(data = data.frame(x = as.Date("2024-12-20"), y = 46.6004265491554, label = "Queens"),
  mapping = aes(x = x, y = y, label = label), size = 6.47, colour = "#BE95BE" , fontface = 2, inherit.aes = FALSE) +
  geom_text(data = data.frame(x = as.Date("2025-02-17"), y = 30.6004265491554, label = "Manhattan"),
  mapping = aes(x = x, y = y, label = label), size = 6.47, colour = "#71A3c1" , fontface = 2, inherit.aes = FALSE) +
  geom_text(data = data.frame(x = as.Date("2024-12-17"), y = 25.6004265491554, label = "Bronx"),
  mapping = aes(x = x, y = y, label = label), size = 6.47, colour = "#18b36f" , fontface = 2, inherit.aes = FALSE) +
  geom_text(data = data.frame(x = as.Date("2025-02-17"), y = 8.6004265491554, label = "Staten Island"),
  mapping = aes(x = x, y = y, label = label), size = 6.47, colour = "#97d00d" , fontface = 2, inherit.aes = FALSE) +
  geom_vline(xintercept = as.Date("2020-04-17"), color = "red", linetype = "dashed", size = 1.0) + 
  geom_text(data = data.frame(x = as.Date("2020-05-17"), y = 125, label = "COVID Lockdown"),
  mapping = aes(x = x, y = y, label = label), size = 8.47, colour = "red" , fontface = 2, inherit.aes = FALSE, hjust = 0)+
  scale_color_brewer(palette = "Set2") +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  labs(
    title = "NYC Traffic Collisions Declined During COVID-19 and Remain Lower Across Boroughs",
    subtitle = "30-day moving average highlighting borough trends",
    x = NULL, 
    y = NULL 
  ) +
  custom_theme +
  theme(
    strip.text = element_text(size = 14, face = "bold", hjust = 0.5),
    panel.border = element_rect(color = "black", fill = NA, size = 1),
    axis.text.x = element_text(size = 15, angle = 45, hjust = 1, vjust = 1),
    axis.text.y = element_text(size = 15),
    plot.title = element_text(face = "bold", size = 20, hjust = 0.5, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 17, color = "grey40", hjust = 0.5, margin = margin(b = 20)),
    panel.grid.major.y = element_blank(), 
    panel.grid.minor = element_blank()
  )

trend

This chart shows the trends in traffic collisions across New York City’s boroughs from 2012 to 2023, highlighting a significant decline during the COVID-19 lockdown in early 2020. Using a 30-day moving average, the chart reveals a sharp and consistent drop in collision rates across all boroughs, particularly in Brooklyn and Queens, which historically had the highest incident counts. Despite a gradual rebound post-lockdown, collision rates have not returned to pre-pandemic levels, indicating lasting changes in traffic patterns.

The sharp decline during the lockdown is likely due to reduced mobility, fewer vehicles on the road, and work-from-home policies. The sustained lower rates post-lockdown could reflect shifts in commuting behavior, such as increased remote work, or changes in urban mobility trends. These findings emphasize how external events like the pandemic can profoundly impact traffic patterns and highlight opportunities for urban planners to build on these trends to sustain safer streets in NYC.
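For readers who want to see how a 30-day moving average like the one in the chart can be computed, here is a minimal sketch in base R. It assumes a trailing window, where each value averages the current day and the 29 days before it; the original chart may use a different window alignment. The `daily` data frame, its values, and the `moving_average()` helper are hypothetical and exist only for illustration.

```r
# Hypothetical daily collision counts for one borough (illustrative values only)
daily <- data.frame(
  crash_date = seq(as.Date("2020-01-01"), by = "day", length.out = 60),
  crashes = 1:60
)

# Trailing moving average: mean of the current day and the k - 1 days before it.
# stats::filter() with sides = 1 applies the weights to past values only.
moving_average <- function(x, k = 30) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

daily$ma_30 <- moving_average(daily$crashes, k = 30)
```

The first 29 values are `NA` because a full 30-day window is not yet available. Applied separately to each borough, this kind of smoothing removes day-to-day noise so that longer-term shifts, such as the lockdown drop, stand out.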

To gain further insights into traffic safety, the next chart examines which types of vehicles are most frequently involved in crashes, shedding light on the role vehicle categories play in contributing to collision rates across the city.

Code
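# Standardize vehicle 1 labels (uppercase, trimmed) and merge spelling
# variants into broader categories such as TAXI, SUV, SEDAN, VAN, and TRUCK.
cleaned_vehicle_data <- data %>%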
  mutate(vehicle_type_code_1 = str_to_upper(vehicle_type_code_1), 
         vehicle_type_code_1 = str_trim(vehicle_type_code_1),
         vehicle_type_code_1 = case_when(
           vehicle_type_code_1 %in% c("TAXI", "TAXI CAB") ~ "TAXI",
           vehicle_type_code_1 %in% c("STATION WAGON/SPORT UTILITY VEHICLE", 
                                      "SPORT UTILITY / STATION WAGON") ~ "SUV",
           vehicle_type_code_1 %in% c("4 DR SEDAN", "SEDAN", "PASSENGER VEHICLE") ~ "SEDAN",
           vehicle_type_code_1 %in% c("VAN", "MINIVAN") ~ "VAN",
           vehicle_type_code_1 %in% c("BOX TRUCK", "TRUCK") ~ "TRUCK",
           TRUE ~ vehicle_type_code_1
         ))

# Count crashes by vehicle 1 type and keep the ten most frequent categories.
refined_vehicle_summary <- cleaned_vehicle_data %>%
  filter(!is.na(vehicle_type_code_1)) %>%  
  group_by(vehicle_type_code_1) %>%
  summarise(crash_count = n()) %>%
  arrange(desc(crash_count)) %>%
  slice_head(n = 10) 

# Horizontal bar chart of the top ten vehicle types, with sedans highlighted.
refined_vehicle_chart <- ggplot(refined_vehicle_summary, aes(x = reorder(vehicle_type_code_1, crash_count), y = crash_count)) +
  geom_col(aes(fill = vehicle_type_code_1 == "SEDAN"), width = 0.8) + 
  scale_fill_manual(values = c("TRUE" = "#800000", "FALSE" = "#115f9a"), guide = "none") +
  coord_flip() +
  geom_text(aes(label = scales::comma(crash_count)), hjust = -0.2, color = "black", size = 3) +
  labs(
    title = "Sedans and SUVs Dominate NYC Crash Reports",
    subtitle = "Over 1 Million Sedans Involved in Crashes, Followed Closely by SUVs",
    x = "Vehicle Type",
    y = "Number of Crashes"
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal(base_size = 15) +
  theme(
    plot.title.position = "plot",
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 13, margin = margin(b = 10), hjust = 0.5),
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(margin = margin(t = 10)),
    panel.grid = element_blank()
  )

refined_vehicle_chart

This chart highlights the types of vehicles involved in NYC traffic collisions, with sedans and SUVs dominating crash reports. Counting the first vehicle listed in each crash report, sedans appear in over 1 million crashes, followed by SUVs with 649,171. Other vehicle types, such as taxis (84,735), pick-up trucks (47,258), and buses (36,330), appear far less often.

The prominence of sedans and SUVs in collision data reflects their widespread use in the city. This finding underscores the importance of focusing road safety interventions, such as advanced driver assistance systems or public awareness campaigns, on these vehicle categories. Addressing safety issues with these dominant vehicle types could significantly reduce overall crash numbers in NYC.
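The counts in the chart come from `vehicle_type_code_1` alone, so each crash contributes only its first listed vehicle. As a possible robustness check, which is not part of the original analysis, the sketch below tallies vehicle types across every `vehicle_type_code_*` column using base R. The `crashes` data frame and its values are made up for illustration.

```r
# Hypothetical crash records (illustrative values only, not real data)
crashes <- data.frame(
  vehicle_type_code_1 = c("Sedan", "TAXI", "Station Wagon/Sport Utility Vehicle", NA),
  vehicle_type_code_2 = c("sedan ", NA, "Sedan", "Bus"),
  stringsAsFactors = FALSE
)

# Select every vehicle type column by name pattern
vehicle_cols <- grep("^vehicle_type_code_", names(crashes), value = TRUE)

# Stack the columns into one character vector and drop missing entries
all_types <- unlist(crashes[vehicle_cols], use.names = FALSE)
all_types <- all_types[!is.na(all_types)]

# Normalize case and whitespace so spelling variants are counted together
all_types <- toupper(trimws(all_types))

# Count vehicles by type, most frequent first
vehicle_counts <- sort(table(all_types), decreasing = TRUE)
```

Note that this counts vehicles rather than crashes: a collision between two sedans adds two to the sedan total. Whether that is the right unit depends on the question being asked.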

Conclusion

From the analysis, we observe several critical patterns and insights about motor vehicle collisions in New York City. Fatalities peak at 4:00 a.m., likely due to factors such as reduced visibility, driver fatigue, or impaired driving, while injuries peak during the evening rush hour (4:00 p.m. to 6:00 p.m., highest at 5:00 p.m.), correlating with increased traffic density. Driver inattention/distraction consistently emerges as the leading contributing factor for both fatalities and injuries, followed by unsafe speed for fatalities and failure to yield right-of-way for injuries. Borough-specific analysis highlights variations in contributing factors, with driver distraction being a citywide issue and failure to yield being particularly significant in Staten Island and the Bronx. Sedans and SUVs dominate crash reports, reflecting their widespread use in the city. Collision rates sharply declined during the COVID-19 lockdown and have remained below pre-pandemic levels, suggesting lasting changes in traffic behavior.

To extend this analysis, additional data sources such as pedestrian and cyclist movement data, detailed weather conditions during crashes, or socioeconomic information could provide deeper insights into collision patterns. Understanding the impact of external factors like urban infrastructure changes, public transit usage, and emerging mobility trends (e.g., e-scooters, ride-sharing) could also enhance the study. Future analyses could focus on designing predictive models for high-risk times and locations or exploring the effectiveness of specific interventions like speed limit reductions or public awareness campaigns. These steps would help develop more targeted strategies to improve traffic safety in NYC.

Attribution

We both contributed equally.

Data Dictionary

| Variable Name | Description |
|---|---|
| crash_date | Date of the crash (YYYY-MM-DD format). |
| crash_time | Time of the crash (HH:MM:SS format). |
| borough | Borough where the crash occurred (e.g., BROOKLYN, BRONX). |
| zip_code | Zip code of the crash location. |
| latitude | Latitude coordinate of the crash location. |
| longitude | Longitude coordinate of the crash location. |
| location | Combined latitude and longitude of the crash location. |
| on_street_name | Name of the street where the crash occurred. |
| cross_street_name | Name of the nearest cross street. |
| off_street_name | Off-street location of the crash, if applicable. |
| number_of_persons_injured | Number of persons injured in the crash. |
| number_of_persons_killed | Number of persons killed in the crash. |
| number_of_pedestrians_injured | Number of pedestrians injured in the crash. |
| number_of_pedestrians_killed | Number of pedestrians killed in the crash. |
| number_of_cyclist_injured | Number of cyclists injured in the crash. |
| number_of_cyclist_killed | Number of cyclists killed in the crash. |
| number_of_motorist_injured | Number of motorists injured in the crash. |
| number_of_motorist_killed | Number of motorists killed in the crash. |
| contributing_factor_vehicle_1 | Primary contributing factor for vehicle 1 in the crash. |
| contributing_factor_vehicle_2 | Secondary contributing factor for vehicle 2 in the crash. |
| contributing_factor_vehicle_3 | Additional contributing factor for vehicle 3 in the crash (if applicable). |
| contributing_factor_vehicle_4 | Additional contributing factor for vehicle 4 in the crash (if applicable). |
| contributing_factor_vehicle_5 | Additional contributing factor for vehicle 5 in the crash (if applicable). |
| collision_id | Unique identifier for each crash. |
| vehicle_type_code_1 | Type of vehicle 1 involved in the crash (e.g., Sedan, SUV, Truck). |
| vehicle_type_code_2 | Type of vehicle 2 involved in the crash (if applicable). |
| vehicle_type_code_3 | Type of vehicle 3 involved in the crash (if applicable). |
| vehicle_type_code_4 | Type of vehicle 4 involved in the crash (if applicable). |
| vehicle_type_code_5 | Type of vehicle 5 involved in the crash (if applicable). |
| crash_hour | Hour of the day when the crash occurred (24-hour format). |
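
The `crash_hour` variable is derived from `crash_time` rather than supplied directly. One way this derivation could look is sketched below in base R. It assumes times are stored as text in the HH:MM:SS format listed in the dictionary, and the sample values are hypothetical. The report's own derivation may differ.

```r
# Hypothetical crash times in the HH:MM:SS format given in the data dictionary
crash_time <- c("04:00:00", "17:30:00", "00:15:00", "23:59:59")

# Keep the text before the first colon and convert it to an integer hour (0-23)
crash_hour <- as.integer(sub(":.*$", "", crash_time))

# Count crashes per hour, keeping all 24 hours even when a count is zero
crashes_by_hour <- table(factor(crash_hour, levels = 0:23))
```

Keeping all 24 levels in the tally means hours with no crashes still appear with a count of zero, which avoids gaps when the hourly counts are plotted.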