Research question

How might measures of time, such as day of week or time of day, correlate with the rate of vehicle collisions in the DC area?

Data sources

District Department of Transportation (DDOT) Crashes_in_DC.csv 106 MB https://opendata.dc.gov/datasets/DCGIS::crashes-in-dc/about

The data originates in COBALT, the data management system of the Metropolitan Police Department (MPD), and is published through Open Data DC; the version we use is maintained by the District Department of Transportation (DDOT). DDOT processes new crash reports each night, provided MPD has supplied sufficient information, such as good-quality coordinates. Because the source is MPD, crashes reported by the US Park Police or other agencies in DC are not recorded. According to Open Data DC, DDOT summarizes the original data to include a location reference for each crash; as a consequence, crashes whose location data was missing or incomplete were not published in this data set. Additionally, according to the DDOT Wiki, this data set may have “significant gaps and quality issues” as not all crashes are mapped. The information in the Open Data DC data set may therefore differ slightly from MPD’s own data, and there is no guarantee that it is complete and accurate. Beyond the exclusion of federal crash reports, the “Crashes in DC” data set will not completely match MPD’s reported crash statistics due to changing crash severity conditions, delayed reporting, and data entry errors. However, we do not expect personal bias to be a concern: the data set simply records measured attributes of each crash across multiple variables, and according to DDOT its main limitation is incompleteness due to poor X/Y coordinate data.

Data cleaning

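library(tidyverse)
library(lubridate)
library(janitor)
library(here)

# Study window: 2015 through 2020 (the end bound covers all of December 31)
year_interval <- interval(ymd_hms("2015-01-01 00:00:00"), ymd_hms("2020-12-31 23:59:59"))
dc_data <- read_csv(here("data_raw", "Crashes_in_DC.csv"))

dc_data_clean <- dc_data %>%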
    clean_names() %>%
    mutate(crash_date = ymd_hms(reportdate)) %>%
    filter(crash_date %within% year_interval) %>%
    mutate(
        fatal = (
            fatalpassenger |
                fatal_bicyclist | fatal_pedestrian | fatal_driver
        ),
        injury = (
            majorinjuries_pedestrian |
                minorinjuries_pedestrian |
                unknowninjuries_pedestrian |
                majorinjuries_bicyclist |
                minorinjuries_bicyclist  |
                unknowninjuries_bicyclist |
                majorinjuries_driver |
                minorinjuries_driver  | unknowninjuries_driver |
                majorinjuriespassenger |
                minorinjuriespassenger  | unknowninjuriespassenger
        )
    ) %>%
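    # case_when() uses the first matching condition, so a crash with both
    # injuries and fatalities is labelled 'injury'; the fatal flag is kept separately
    mutate(type = case_when(injury ~ 'injury', fatal ~ 'fatal', TRUE ~ 'property')) %>%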
    select(crash_date, type, fatal, injury, latitude, longitude) %>%
    mutate(
        year = year(crash_date),
        week = week(crash_date),
        month = month(crash_date, label = TRUE),
        day = day(crash_date),
        hour = hour(crash_date),
        day_of_week = wday(crash_date, label = TRUE, abbr = TRUE)
    )

Vehicle-involved crashes in DC, 2015-2020

variable      description
crash_date    ymd-hms of crash
type          fatal, injury, or property
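
To make the meaning of the type labels concrete, here is a minimal sketch on a made-up four-row tibble. The rows and the name toy_typed are hypothetical and not taken from the DC data; only the case_when() call mirrors the cleaning step. It shows that, with the conditions in this order, a crash flagged as both fatal and injury ends up labelled "injury".

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows covering every combination of the two flags
toy <- tibble(
    fatal  = c(TRUE, TRUE, FALSE, FALSE),
    injury = c(TRUE, FALSE, TRUE, FALSE)
)

# Same condition order as the cleaning step: injury is checked before fatal
toy_typed <- toy %>%
    mutate(type = case_when(injury ~ "injury", fatal ~ "fatal", TRUE ~ "property"))

print(toy_typed)
```

For this reason, counts of fatal crashes are probably safer to compute from the fatal column than from type == "fatal".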

Analysis and figures

To explore our research question, we first looked at how the hour of the day relates to fatal crashes in DC. This plot shows an uptick in fatal crashes at 11:00. Next, we wanted to see how this compared with all crashes.
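
The hourly comparison can be sketched as a grouped summary. The example below is an illustration only: toy_crashes is a hypothetical stand-in for dc_data_clean with made-up rows, and fatal_share is a name we introduce here. It counts all crashes and fatal crashes per hour, so an uptick in fatal crashes can be read against the overall volume for that hour.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for dc_data_clean with only the columns used here
toy_crashes <- tibble(
    hour  = c(8, 8, 11, 11, 11, 17, 17, 23),
    fatal = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Per hour: total crashes, fatal crashes, and the fraction that were fatal
hourly <- toy_crashes %>%
    group_by(hour) %>%
    summarise(
        all_crashes   = n(),
        fatal_crashes = sum(fatal),
        fatal_share   = mean(fatal),
        .groups = "drop"
    )

print(hourly)
```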

In addition, we wanted to look at how the day of the week affected both fatal crashes and all other crashes. From these graphs, Thursday appears to be the “safest day” of the week, with the fewest fatal crashes, whereas the number of crashes overall shows no significant change across the week.
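
One way to check whether crash counts really are flat across the week is a chi-squared goodness-of-fit test against a uniform distribution. This is a technique we swap in for illustration, not something the graphs themselves compute, and the counts below are invented rather than taken from the DC data.

```r
# Hypothetical crash counts per weekday; not taken from the DC data
weekday_counts <- c(Sun = 90, Mon = 105, Tue = 100, Wed = 102,
                    Thu = 98, Fri = 110, Sat = 95)

# With a single vector, chisq.test() tests against equal expected counts
test_result <- chisq.test(weekday_counts)

print(test_result$statistic)
print(test_result$p.value)
```

A large p-value is consistent with no weekday effect. With counts as large as those in the full data set, even small differences can come out as statistically significant, so the size of the differences is worth reading alongside the test.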

Lastly, we wanted to look at how the month affected both fatal and all other crashes. From these graphs, we can see that there is no discernible correlation between all crashes and the month in which they occur; however, the summer months are deadlier, with what looks like a seasonal distribution centered in the month of July.

This animation visualizes all crashes for every day in May 2019, the month in our data that had the most total crashes as well as the highest number of fatal crashes.

This interactive heatmap allows you to explore the average number of crashes on any given day of the year. It is worth noting that February 29th occurs only in leap years (2016 and 2020 in our data) and will have far fewer crashes in comparison as a result.
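
One possible way to make February 29th comparable with other days is to divide each calendar day's crash count by the number of times that day occurs in the 2015-2020 window. The sketch below assumes that approach. The crash dates are made up, and day_occurrences, n_years, and avg_crashes are names we introduce for illustration.

```r
library(dplyr)
library(tibble)
library(lubridate)

# How many times each calendar day (month, day) occurs in 2015-2020
day_occurrences <- tibble(
    date = seq(ymd("2015-01-01"), ymd("2020-12-31"), by = "day")
) %>%
    mutate(month = month(date), day = day(date)) %>%
    count(month, day, name = "n_years")

# Hypothetical crash dates; not taken from the DC data
toy_crashes <- tibble(
    crash_date = ymd(c("2016-02-29", "2020-02-29", "2020-02-29",
                       "2015-03-01", "2016-03-01", "2017-03-01"))
)

# Average crashes per occurrence of each calendar day
avg_by_day <- toy_crashes %>%
    mutate(month = month(crash_date), day = day(crash_date)) %>%
    count(month, day, name = "n_crashes") %>%
    left_join(day_occurrences, by = c("month", "day")) %>%
    mutate(avg_crashes = n_crashes / n_years)

print(avg_by_day)
```

Here February 29th is divided by 2 and March 1st by 6, so the leap day is no longer penalized for appearing in only two of the six years.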

This animation highlights how few crashes occurred in the weeks following the first COVID death in DC.

Addendum

Our data, while very well formatted coming from DDOT, did have some issues. In 2015, a large number of reports had the time of their crash_date set to 05:00, which distorted some visualizations.
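
A minimal sketch of one way to handle this is to flag the suspicious timestamps and exclude them from time-of-day summaries. It assumes the affected reports sit at exactly 05:00:00, which we have not confirmed for every row. The timestamps and the placeholder_time column are hypothetical.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical timestamps; two of them sit at exactly 05:00:00
toy_crashes <- tibble(
    crash_date = ymd_hms(c("2015-03-01 05:00:00", "2015-03-01 05:00:00",
                           "2015-03-01 14:32:10", "2016-06-10 05:17:45"))
)

# Assumption: a time of exactly 05:00:00 marks a placeholder, not a real time
flagged <- toy_crashes %>%
    mutate(
        placeholder_time = hour(crash_date) == 5 &
            minute(crash_date) == 0 &
            second(crash_date) == 0
    )

# Hour-of-day counts with the flagged rows left out
hourly_clean <- flagged %>%
    filter(!placeholder_time) %>%
    mutate(hour = hour(crash_date)) %>%
    count(hour)

print(hourly_clean)
```

Flagging rather than deleting keeps those rows available for analyses that only need the date, such as the day-of-week and monthly comparisons.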