WMATA Ridership

Author

Abdullah Alghamdi, Ziqian Yang, Arman Naseh

Published

December 10, 2023

Research Question

How have extenuating environmental circumstances such as the COVID-19 pandemic, different weather conditions, and 2023’s record high temperature affect the DC Metrorail’s ridership?

Sub-questions

COVID-19

  • Did the corona virus significantly impact the amount of metrorail riders? If yes, is it possible for WMATA’s metrorail riders to return to there pre-covid peaks?

Weather

  • Does metrorail ridership downtrend in bad weather such as rain or snow and by how much?

Ridership

  • What are the day to day ridership patterns for each day of the week?
  • How does ridership fluctuate when it comes to holidays?

Dataset 1: Ridership

URL: https://www.wmata.com/initiatives/ridership-portal/Metrorail-Ridership-Summary.cfm

Summary: The database named Weekday Ridership.csv, displays weekday ridership in the WMATA Metrorail from 2012 to 2023. The data has been pre-filtered on the WMATA website to meet those specific requirements on the database before importing into R. Holidays are excluded and almost every single day is recorded with entries and exits for each date. WMATA records all entries and exits every time someone passes through the gates but they do not account for multiple boardings in one entry they only record them as one entry.

Validity: The data comes directly from WMATA who are running the Metrorail and they have showcased reliable methods of obtaining data.

Variable Name Description
Date Dates from 2012 to 2023 excluding holidays and some random dates
Entries Total entries on a date
Exits Total exits on a date
Code
ridership_data_clean <- ridership_data_raw %>%
    rename(Date = `...1`) %>%  
    mutate(Date = mdy(Date))   

ridership_2018 <- ridership_data_clean %>%
    filter(year(Date) == 2018) %>%
    mutate(DayOfWeek = wday(Date, label = TRUE, week_start = 1))

ggplot(ridership_2018, aes(x = DayOfWeek, y = `Total Entries`)) +
    geom_boxplot() +
    labs(title = "Ridership by Day of the Week in 2018",
         x = "Day of the Week",
         y = "Ridership") +
    theme_minimal(base_family = "Georgia") +
    theme(plot.title = element_text(size = 18, face = "bold")) +
    scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma)

We can see that the weekly ridership of the DC Metrorail is quite predictable. During the weekdays, more people ride the train to work or school back and forth, but in the weekends, we see a huge dip as people don’t go to work and school on those days. Ridership peaks around Wednesday and Thursday but there are some significant outliers that are higher than the average on Tuesday and Saturday especially. We were able to pinpoint that the outlier on Saturday was a result of the March for Our Lives Protest under the Trump administration by people who were going to Washington DC via metro to participate in it. The highest recorded ridership in 2018, which was on Tuesday, was most likely the result of the Washington Capitals’ parade to celebrate the Stanley Cup Championship.

Dataset 2: Weather

URL: https://www.weather.gov/wrh/Climate?wfo=lwx

Summary: The data is obtained directly from the National Weather Service website, and it is raw weather data on snowfall and precipitation. The website leads us to NOWData - NOAA Online Weather Data finder. However, it may require some data cleaning and processing to handle missing values or inconsistencies in the data. Weather data can sometimes have missing values, especially for remote or less-monitored locations.

Validity: The data is obtained directly from the National Weather Service, which is considered a reliable and authoritative source for weather data.

Variable Name Description
Date Date of the weather summary
Max Temperature Maximum temperature for the day (°F)
Min Temperature Minimum temperature for the day (°F)
Precipitation Total precipitation for the day (inches)
Snowfall Total snowfall for the day (inches)
Snow Depth Snow depth at the end of the day (inches)
Wind Speed Maximum wind speed during the day (mph)
Wind Direction Predominant wind direction for the day
Weather Condition Description of the weather condition (e.g., clear, rainy, snow)
Code
combined_data <- merge(ridership_data_clean, weather_data_combined, by="Date")

# Creating a new column 'WeatherCondition' to classify the weather based on the precipitation and snowfall values
combined_data$WeatherCondition <- ifelse(combined_data$Value.x > 0 & combined_data$Value.y == 0, 'Rain',
                                         ifelse(combined_data$Value.y > 0, 'Snow', 'Good'))

combined_data_filtered <- combined_data %>% filter(!is.na(WeatherCondition))


ggplot(combined_data_filtered, aes(x = WeatherCondition, y = `Total Entries`, fill = WeatherCondition)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.2) + # Add points with a slight transparency
  scale_fill_brewer(palette = "Set2") + # Add color
  labs(
    title = "Box Plot of Total Entries by Weather Condition",
    subtitle = "Data from 2016-2019",
    x = "Weather Condition",
    y = "Ridership",
  ) +
  theme_minimal(base_family = "Georgia") +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 14),
  )

The provided box plot visualizes the relationship between WMATA ridership (measured by total entries) and weather conditions from 2016 to 2019. Three distinct weather categories are considered: Good, Rain, and Snow. The analysis of this chart will explore the central tendencies, dispersion, and outliers within each category, offering insights into how weather influences public transportation usage.

The box plot presents the following characteristics for each weather condition:

Good Weather: Good Weather indicates there is no rain or snow. The median ridership is around the 600,000 entries mark, with a relatively narrow interquartile range (IQR), suggesting consistent ridership behavior during favorable weather conditions. The concentration of data points within the IQR indicates that the majority of the days had a ridership level close to the median, showing less variability. However, the presence of outliers both above and below the box indicates that there were exceptional days with significantly higher or lower ridership than usual.

Rain: The median ridership during rainy days is slightly lower than on good weather days, which could imply a deterrent effect of rain on the willingness or necessity to use public transit. The IQR is broader compared to good weather days, indicating more variability in ridership on rainy days. This variability could be due to factors such as rain intensity, duration, or timing relative to commuting hours. Outliers are present here as well, especially on the lower end, suggesting that particularly rainy days may have significantly reduced transit use.

Snow: The median ridership on snowy days is lower than on both good and rainy days, with an IQR similar in width to rainy days. This trend points to a more substantial impact of snow on ridership, which aligns with common challenges associated with commuting during snow, such as service disruptions, delays, and safety concerns. The spread of outliers is smaller for snowy days, which might indicate that extreme variations in ridership are less frequent under snowy conditions, potentially due to a general avoidance of travel during heavy snowfall.

Central Tendencies: Good weather is associated with the highest median ridership, suggesting it is the most favorable condition for transit use. Rain and snow show a decrease in median ridership, with snow having the most substantial impact.

Dispersion: The variability of ridership under different weather conditions is an important aspect to consider. Rain shows the greatest variability, which could be influenced by the unpredictability of rain events and their varying impact on commuter decisions. Snow, while having a lower median ridership, shows a similar level of dispersion to rain, reflecting that snowfall can also lead to inconsistent ridership patterns.

Outliers: The presence of outliers in each category reveals days with atypical ridership. Under good weather, both high and low outliers suggest occasional deviations from typical commuting patterns, which could be due to special events or other non-weather-related factors. Rain and snow outliers are predominantly on the lower side, underscoring adverse weather’s potential to significantly deter ridership.

Code
combined_data_filtered <- combined_data %>%
  filter(Value.x > 0 | Value.y > 0) %>%
  mutate(WeatherType = ifelse(Value.x > 0 & Value.y == 0, "Rain", ifelse(Value.y > 0, "Snow", NA))) %>%
  # Rename columns to 'Rain' and 'Snow' for the gather function
  rename(Rain = Value.x, Snow = Value.y) %>%
  gather(key = "Condition", value = "Measurement", Rain, Snow, factor_key=TRUE) %>%
  filter(!is.na(Measurement) & Measurement > 0)

# Creating a faceted scatter plot for Total Entries vs Weather (excluding zero values)
p <- ggplot(combined_data_filtered, aes(x = Measurement, y = `Total Entries`)) +
  geom_point(aes(color = Condition), alpha = 0.6) +
  geom_smooth(method = "lm", aes(color = Condition), se = FALSE) + # Add a linear regression line without confidence interval
  facet_wrap(~Condition, scales = "free_x") + # Create facets for Rain and Snow with independent x scales
  labs(
    title = "Ridership vs. Weather",
    subtitle = "Ridership downtrends more significantly in snowy conditions",
    x = "Precipitation (Rain in mm, Snow in inches)",
    y = "Ridership"
  ) +
  scale_color_manual(values = c("Rain" = "blue", "Snow" = "red")) +
  theme_minimal(base_family = "Georgia") +
  theme(
    legend.position = "bottom",
    strip.background = element_rect(fill = "lightblue"), # Customizing facet label background
    strip.text = element_text(size = 15, face = "bold"), # Customizing facet label text
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)
  ) +
  guides(color = guide_legend(title = "Weather Condition")) +
  scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma)

# Print the plot
print(p)

The scatter plot provided illustrates the relationship between WMATA ridership, as gauged by total entries, and various weather conditions—specifically rain (measured in millimeters) and snow (measured in inches). By examining the distribution of data points and the trend lines, we gain insights into how precipitation affects public transit use.

The scatter plot depicts two distinct clusters of points, each representing a different weather condition on the x-axis, and corresponding ridership numbers on the y-axis. Here’s a deeper analysis of the observed patterns:

Rain Cluster: A dense congregation of points at the lower end of the rainfall spectrum suggests that light to moderate rain has a minimal deterrent effect on ridership. This indicates a certain level of tolerance or preparedness among commuters for these conditions. The negative slope of the trend line, however, indicates a general decrease in ridership as rainfall increases. This suggests that beyond a certain threshold, rain may discourage WMATA use, possibly due to concerns over delays, safety, or the discomfort of traveling in wet conditions.

Snow Cluster: The snow-related data points are more spread out, with a steep negative trend line that is more pronounced than that of the rain. This implies a stronger correlation between snowfall and a decrease in ridership. This can be attributed to the disruptive impact of snow on transportation infrastructure, including road closures, reduced visibility, and potential delays or cancellations of transit services. The impact of even a small amount of snow appears significant, likely due to the cumulative effects of snow on the city’s ability to maintain normal transit operations.

Linear Regression Trends: The trend lines for both rain and snow reveal valuable information. For rain, the more gradual slope suggests a more linear relationship between the increase in precipitation and a decrease in ridership. For snow, the steeper slope implies that ridership is more sensitive to changes in snowfall amounts, with even minor increases potentially resulting in significant drops in ridership.

Data Point Density and Spread: The density of data points for rain indicates that the most common weather events involve less precipitation, whereas snow events are more variable in terms of their occurrence and impact. The spread of the snow data points might also reflect the less frequent but more impactful nature of snow events compared to rain.

Outliers: There are outliers present in both clusters. In the rain cluster, outliers on the higher end of the ridership scale could represent days when rain was forecasted but did not significantly materialize, or when other factors (such as events or service changes) influenced ridership. Outliers in the snow cluster, particularly those showing high ridership despite significant snowfall, may indicate days when the snow did not impact commuting hours or when the WMATA system was able to maintain effective service despite the weather.

COVID-19

Code
#| fig.width: 12
#| fig.height: 8

ggplot(ridership_data_clean, aes(x = Date, y = `Total Entries`)) +
    geom_line(color = "darkred") +
    labs(title = "How has the COVID-19 Pandemic Affected DC Metro Ridership?",
         subtitle = "2018-2023",
         x = "Date",
         y = "Ridership") +
    scale_x_date(
        date_breaks = "1 year", 
        date_labels = "%Y",
        limits = as.Date(c("2015-01-01", "2023-12-31"))
    ) +
    scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma
    ) +
    theme_classic(base_family = "Georgia",
                  base_size = 16) +
    theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14))

Public transportation in the US has been on the downtrend slowly over the years and the DC Metrorail is no exception to this. It can be seen in the pre-covid era, ridership is very slowly decreasing or just static even though more stations were opening up and trains being made. The immediate spikes and plummets all over the line graph in each year are the result of weekends having much lower amounts of riders. In 2020, as a result of quarantine in the COVID-19 pandemic, many stations were probably closed because there was no one to operate them which results in averages as low as around 15,000 riders daily. However, in post-covid it seems that ridership is on the track to a healthy recovery but will most likely not be able to achieve the same ridership as pre-covid as a result of work from home and company policies changing to not need as many employees to be physically present on a day-to-day basis.

Code
average_ridership_by_day <- ridership_data_clean %>%
    filter(year(Date) %in% c(2018, 2020, 2022)) %>%
    mutate(Year = as.factor(year(Date)),
           DayOfWeek = wday(Date, label = TRUE, week_start = 1)) %>%
    group_by(Year, DayOfWeek) %>%
    summarise(AverageRidership = mean(`Total Entries`, na.rm = TRUE)) %>%
    ungroup()  

# Create the bar chart
ggplot(average_ridership_by_day, aes(x = DayOfWeek, y = AverageRidership, fill = Year)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
    labs(title = "Average Ridership by Day of the Week",
         subtitle = "Comparison between 2018, 2020, and 2022",
         x = "Day of the Week",
         y = "Average Ridership") +
    scale_fill_brewer(palette = "Set1") +  # Use a distinguishable color palette
    theme_minimal() +
    
    theme_classic(base_family = "Georgia",
                  base_size = 16) +
    theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14)) +
    
     scale_y_continuous(
        limits = c(0, 650000),
        labels = scales::comma)

Code
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Improve readability of x labels
#> List of 1
#>  $ axis.text.x:List of 11
#>   ..$ family       : NULL
#>   ..$ face         : NULL
#>   ..$ colour       : NULL
#>   ..$ size         : NULL
#>   ..$ hjust        : num 1
#>   ..$ vjust        : NULL
#>   ..$ angle        : num 45
#>   ..$ lineheight   : NULL
#>   ..$ margin       : NULL
#>   ..$ debug        : NULL
#>   ..$ inherit.blank: logi FALSE
#>   ..- attr(*, "class")= chr [1:2] "element_text" "element"
#>  - attr(*, "class")= chr [1:2] "theme" "gg"
#>  - attr(*, "complete")= logi FALSE
#>  - attr(*, "validate")= logi TRUE

Here we have a more direct comparison to see how far the Metrorail has fallen during peak covid compared to pre-covid activity and it is most evidently shown on weekdays. 2018 is significantly higher than the rest while 2022 doesn’t even recover enough to make up for 50% of pre-covid ridership.

Attribution:

All of the final report has been done by Abdullah and Ziqian and Arman will be doing the final presentation. Abdullah was responsible for the ridership and COVID-19 sections, while Ziqian was responsible for the weather section

Appendix

Code
spelling::spell_check_files("report.qmd")

# Load libraries and settings here
library(tidyverse)
library(readr)
library(readxl)
library(dplyr)
library(tidyr)
library(lubridate)
library(here)
library(ggplot2)
library(here)
library(extrafont)


knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  comment = "#>",
  fig.path = "figs/", # Folder where rendered plots are saved
  fig.width = 7.252, # Default plot width
  fig.height = 4, # Default plot height
  fig.retina = 3 # For better plot resolution
)

# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_bw(base_size = 20))

# Write code below here to load any data used in project

ridership_data_raw <- read_excel(here("data_raw", "Total Entries and Exits.xlsx"))

ridership_data_clean <- ridership_data_raw %>%
  rename(Date = `...1`) %>%  
  mutate(Date = mdy(Date))  

# Reading Rainfall Data for 2016-2019

rainfall_2016 <- read_excel(here("data_raw", "Mean_Rain_Fall_By_Day_2016.xls"))
rainfall_2017 <- read_excel(here("data_raw", "Mean_Rain_Fall_By_Day_2017.xlsx"))
rainfall_2018 <- read_excel(here("data_raw", "Mean_Rain_Fall_By_Day_2018.xlsx"))
rainfall_2019 <- read_excel(here("data_raw", "Mean_Rain_Fall_By_Day_2019.xlsx"))

# Reading Snowfall Data for 2016-2019

snowfall_2016 <- read_excel(here("data_raw", "Mean Snowfall by Day 2016.xlsx"))
snowfall_2017 <- read_excel(here("data_raw", "Mean Snowfall by Day 2017.xlsx"))
snowfall_2018 <- read_excel(here("data_raw", "Mean Snowfall by Day 2018.xlsx"))
snowfall_2019 <- read_excel(here("data_raw", "Mean Snowfall by Day 2019.xlsx"))

#Weather data cleaning

#CleanRainfall_2016
rainfall_2016 <- rainfall_2016 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
rainfall_long_2016 <- rainfall_2016 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2016, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
rainfall_long_2016 <- rainfall_long_2016 %>%
  drop_na(Date)
#CleanRainfall_2017
rainfall_2017 <- rainfall_2017 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
rainfall_long_2017 <- rainfall_2017 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2017, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
rainfall_long_2017 <- rainfall_long_2017 %>%
  drop_na(Date)


#CleanRainfall_2018
rainfall_2018 <- rainfall_2018 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
rainfall_long_2018 <- rainfall_2018 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2018, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
rainfall_long_2018 <- rainfall_long_2018 %>%
  drop_na(Date)
    
#CleanRainfall_2019
rainfall_2019 <- rainfall_2019 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
rainfall_long_2019 <- rainfall_2019 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2019, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
rainfall_long_2019 <- rainfall_long_2019 %>%
  drop_na(Date)
    
#Cleansnowfall_2016
snowfall_2016 <- snowfall_2016 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
snowfall_long_2016 <- snowfall_2016 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2016, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
snowfall_long_2016 <- snowfall_long_2016 %>%
  drop_na(Date)

#Cleansnowfall_2017
snowfall_2017 <- snowfall_2017 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
snowfall_long_2017 <- snowfall_2017 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2017, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
snowfall_long_2017 <- snowfall_long_2017 %>%
  drop_na(Date)

#Cleansnowfall_2018
snowfall_2018 <- snowfall_2018 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
snowfall_long_2018 <- snowfall_2018 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2018, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
snowfall_long_2018 <- snowfall_long_2018 %>%
  drop_na(Date)

#Cleansnowfall_2019
snowfall_2019 <- snowfall_2019 %>%
  mutate(across(-1, ~ case_when(
    . %in% c("T", "M") ~ NA_real_,  # Replace "T" and "M" with NA
    TRUE ~ as.numeric(as.character(.))
  ))) %>%
  # If the first column is not named 'Day', rename it
  rename(Day = 1)
snowfall_long_2019 <- snowfall_2019 %>%
  pivot_longer(
    cols = -Day, # This selects all columns except for 'Day' to be pivoted
    names_to = "Month",
    values_to = "Value"
  ) %>%
  mutate(
    # Create a date column assuming the year is 2016
    Date = make_date(year = 2019, month = match(Month, month.abb), day = Day)
  ) %>%
  select(Date, Month, Value)

# Now remove rows where the Date is NA
snowfall_long_2019 <- snowfall_long_2019 %>%
  drop_na(Date)

snowfall_long_combined <- bind_rows(snowfall_long_2016, snowfall_long_2017, 
                                    snowfall_long_2018, snowfall_long_2019)
rainfall_long_combined <- bind_rows(rainfall_long_2016, rainfall_long_2017, 
                                    rainfall_long_2018, rainfall_long_2019)

weather_data_combined <- full_join(rainfall_long_combined, snowfall_long_combined, by = "Date")

ridership_data_clean <- ridership_data_raw %>%
    rename(Date = `...1`) %>%  
    mutate(Date = mdy(Date))   

ridership_2018 <- ridership_data_clean %>%
    filter(year(Date) == 2018) %>%
    mutate(DayOfWeek = wday(Date, label = TRUE, week_start = 1))

ggplot(ridership_2018, aes(x = DayOfWeek, y = `Total Entries`)) +
    geom_boxplot() +
    labs(title = "Ridership by Day of the Week in 2018",
         x = "Day of the Week",
         y = "Ridership") +
    theme_minimal(base_family = "Georgia") +
    theme(plot.title = element_text(size = 18, face = "bold")) +
    scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma)
combined_data <- merge(ridership_data_clean, weather_data_combined, by="Date")

# Creating a new column 'WeatherCondition' to classify the weather based on the precipitation and snowfall values
combined_data$WeatherCondition <- ifelse(combined_data$Value.x > 0 & combined_data$Value.y == 0, 'Rain',
                                         ifelse(combined_data$Value.y > 0, 'Snow', 'Good'))

combined_data_filtered <- combined_data %>% filter(!is.na(WeatherCondition))


ggplot(combined_data_filtered, aes(x = WeatherCondition, y = `Total Entries`, fill = WeatherCondition)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.2) + # Add points with a slight transparency
  scale_fill_brewer(palette = "Set2") + # Add color
  labs(
    title = "Box Plot of Total Entries by Weather Condition",
    subtitle = "Data from 2016-2019",
    x = "Weather Condition",
    y = "Ridership",
  ) +
  theme_minimal(base_family = "Georgia") +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 14),
  )
combined_data_filtered <- combined_data %>%
  filter(Value.x > 0 | Value.y > 0) %>%
  mutate(WeatherType = ifelse(Value.x > 0 & Value.y == 0, "Rain", ifelse(Value.y > 0, "Snow", NA))) %>%
  # Rename columns to 'Rain' and 'Snow' for the gather function
  rename(Rain = Value.x, Snow = Value.y) %>%
  gather(key = "Condition", value = "Measurement", Rain, Snow, factor_key=TRUE) %>%
  filter(!is.na(Measurement) & Measurement > 0)

# Creating a faceted scatter plot for Total Entries vs Weather (excluding zero values)
p <- ggplot(combined_data_filtered, aes(x = Measurement, y = `Total Entries`)) +
  geom_point(aes(color = Condition), alpha = 0.6) +
  geom_smooth(method = "lm", aes(color = Condition), se = FALSE) + # Add a linear regression line without confidence interval
  facet_wrap(~Condition, scales = "free_x") + # Create facets for Rain and Snow with independent x scales
  labs(
    title = "Ridership vs. Weather",
    subtitle = "Ridership downtrends more significantly in snowy conditions",
    x = "Precipitation (Rain in mm, Snow in inches)",
    y = "Ridership"
  ) +
  scale_color_manual(values = c("Rain" = "blue", "Snow" = "red")) +
  theme_minimal(base_family = "Georgia") +
  theme(
    legend.position = "bottom",
    strip.background = element_rect(fill = "lightblue"), # Customizing facet label background
    strip.text = element_text(size = 15, face = "bold"), # Customizing facet label text
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 14),
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)
  ) +
  guides(color = guide_legend(title = "Weather Condition")) +
  scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma)

# Print the plot
print(p)
#| fig.width: 12
#| fig.height: 8

ggplot(ridership_data_clean, aes(x = Date, y = `Total Entries`)) +
    geom_line(color = "darkred") +
    labs(title = "How has the COVID-19 Pandemic Affected DC Metro Ridership?",
         subtitle = "2018-2023",
         x = "Date",
         y = "Ridership") +
    scale_x_date(
        date_breaks = "1 year", 
        date_labels = "%Y",
        limits = as.Date(c("2015-01-01", "2023-12-31"))
    ) +
    scale_y_continuous(
        limits = c(0, 800000),
        labels = scales::comma
    ) +
    theme_classic(base_family = "Georgia",
                  base_size = 16) +
    theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14))
average_ridership_by_day <- ridership_data_clean %>%
    filter(year(Date) %in% c(2018, 2020, 2022)) %>%
    mutate(Year = as.factor(year(Date)),
           DayOfWeek = wday(Date, label = TRUE, week_start = 1)) %>%
    group_by(Year, DayOfWeek) %>%
    summarise(AverageRidership = mean(`Total Entries`, na.rm = TRUE)) %>%
    ungroup()  

# Create the bar chart
ggplot(average_ridership_by_day, aes(x = DayOfWeek, y = AverageRidership, fill = Year)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
    labs(title = "Average Ridership by Day of the Week",
         subtitle = "Comparison between 2018, 2020, and 2022",
         x = "Day of the Week",
         y = "Average Ridership") +
    scale_fill_brewer(palette = "Set1") +  # Use a distinguishable color palette
    theme_minimal() +
    
    theme_classic(base_family = "Georgia",
                  base_size = 16) +
    theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14)) +
    
     scale_y_continuous(
        limits = c(0, 650000),
        labels = scales::comma)
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Improve readability of x labels