combined_plot
Cracking The NCAA Team Rankings Code
The Attributes of the Best NCAA Football Playoff Teams Over a Decade
Introduction
College football is a highly valued sport in universities and colleges in the United States. Many college football stars go on to make careers in larger football leagues such as the NFL. Due to the competitive nature of the sport, some individuals have great personal connections with their favorite college football teams. For this reason, when the highly-anticipated AP News rankings of NCAA (division for college football) are released, many football fans have questions about how the top four teams were selected by the NCAA Selection Committee. The committee’s decisions happen in a closed-door meeting between members, and the committee’s decision is final. Some predict that the committee bases their decision on the difficulty of the team’s season, but what does that entail? Defense ability? Number of interceptions? This project aims to crack the code.
In this research project, we will observe data from the top 25 teams in the 2023 season of NCAA college football and analyze how the attributes of the top four teams in 2023 compare to those of the other teams in the past ten years. Conducting an analysis of the rankings for the top four teams based on attribute will allow us to determine which attributes were likely most important in the AP News ranking based on the most recent results (2023).
After determining which categories were most important in the year 2023 – which we did in Figure 1 below – we will apply these factors to the past ten years to examine whether they were also important in the selection of the number one team over an extended period.
Figure 1 - Most Important Factors 2023
The factors we decided were most important in the year 2023 were average offensive yards/pass, number of touchdowns, average defensive yards/game, total offensive yards/game.
In this analysis, we want to answer the question “Which attributes are the most important to the College Football Playoff Selection Committee when selecting the top 4 teams in NCAA College Football?”.
Some may assume that the most important attribute of the top four teams is their ability to score many points in a game, but in the recent 2023 season, the number of points scored does not have a direct correlation with the top four teams.
While many statistical data points are available on each team in the NCAA, it is still difficult to conclude which attributes the committee finds the most important in arriving at their decision. Therefore, our goal with this research project is to conclude which attributes led the committee’s decision to choose the University of Michigan, The University of Washington, The University of Texas, and the University of Alabama as the top four football teams for the 2023 season. After comparing these results to those of previous years, we plan to predict which factors will be most important to the selection committee in choosing the top four teams for this 2024 season.
Key Findings
- There is a correlation between Average Touchdowns and chances of being chosen for the top teams in the past 10 years. The more touchdowns a team makes in a season, the higher the chance of that team being chosen as a top team.
- There is not a correlation between Average Offensive Yards/Pass and chances of being chosen for the top teams in the past 10 years, more research is needed to determine if there is another reason why there appeared to be a relationship in 2023.
- There is a correlation between Average Defensive Yards/Game and chances of being chosen for the top teams in the past 10 years. The lower the average defensive yards/game, the higher the chance of that team being chosen as a top team.
- There is a slight correlation between Average Offensive Yards/Game and chances of being chosen for the top teams in the past 10 years. All number-one teams had an average or higher number of offensive yards per game, so having a higher-than-average number of offensive yards/game led to greater chances of being chosen as a top team.
Research Question
Which attributes are the most important to the College Football Playoff Selection Committee when selecting the top 4 teams in NCAA College Football?
- In the year 2023, which factors were most important?
- Were these factors important over the past ten years?
Answering these questions will allow us to gain insight into how the NCAA Selection Committee made its final decisions in the AP News Rankings.
Data Sources
The Associated Press
https://apnews.com/
The sports section of The Associated Press, specifically its AP top 25 section, is curated by knowledgeable sports broadcasters and analysts. It is a trusted global news platform that covers topics beyond sports, from business to technology, economics, and more. The Associated Press is the original source for the AP Top 25 college football rankings, as they conduct the voting and publish the results. The results of the Associated Press poll are then released and often referred to by other sports broadcasters. There is not likely to be bias in information from the Associated Press because the Associated Press is a reputable news organization that adheres to strict standards, including objectivity and fairness.
College Football Playoff
https://collegefootballplayoff.com/sports/selection-committee/roster/2023-24
The official site of the College Football Playoff provides large amounts of information on the playoff system, including team rankings, selection committee rosters, and detailed explanations of the selection process for determining national champions in college football. There is not likely to be bias in information because the organization operates independently of any specific team or conference.
We intended to use this source to research the individuals that make up the selection committee. The data regarding the committee members is the original source of data, as it provides verified data directly from the organization itself.
NCAA
https://www.ncaa.com/rankings/football/fbs/college-football-playoff
The NCAA website offers updated rankings, scores, and news related to college football. It also provides insights into the playoff landscape and championship selection. There is no bias in the information because the original source for most of the data on this website is the colleges and programs that make up the NCAA.
We intended to use this source for its information on rankings, standings, and statistics. The original source for a majority of the data on this website is the colleges and programs that make up the NCAA. Participating universities provide their stats and standings to the NCAA for reporting.
ESPN
https://www.espn.com/college-football/standings/_/season/2023
ESPN’s college football section delivers coverage of the sport, featuring real-time standings, analysis, highlights, and news updates. There is no bias in the information because game statistics are obtained directly from the NCAA, and rankings are from recognized polls such as the AP Poll.
This is our main source of data, as it contains information on all of our factors. The data comes from both internal and external sources such as the NCAA, and rankings information is gathered from unique poll results for each division, such as NCAAF, AP Poll, and AFCS Coaches Poll.
Analysis and Findings
Factor 1 - Number of Touchdowns
This chart illustrates the correlation between the top-ranked team in each year of NCAA football and how that team ranked with the number of Touchdowns scored that year. Overall, the top teams from each year were not always ranked first in touchdowns (or even in the top 10). But that does not necessarily mean that this attribute was not important. The graph below shows a much more detailed picture of where these teams fell in terms of scoring touchdowns.
td_rank
This chart details how each top ranked team compared in number of touchdowns scored with all other teams in NCAA football. As you can see, for most all years, the top team scored high above the average number of Touchdowns scored. This means that while the top team may not have always been the top scorer, they were always in the top percent
everyone
Factor 2 - Average Defensive Yards/Game
This chart analyzes the relationship between overall top-ranked teams and leading teams in defensive yards. Overall, number one-ranked teams also ranked relatively high in defensive yards ranking. The defensive yards data ranking was determined by which teams allowed the least amount of yards to be gained on defense compared to all other teams in the NCAAF. Unfortunately, there was a lot of missing data for this factor and only 3 out of the 9 years provided data. We chose to omit those years with missing data and dedicate our analysis for this factor to only the years 2014, 2021, and 2023. However, despite the lack of complete information, it’s clear that these top-ranked teams were correlated to positive performance in defensive yards.
defensive_dot_plot
In a more detailed analysis, we can examine how each team performed in comparison to all other teams in the NCAAF, with the average performance represented by a black dotted line. Teams that ranked highly overall, indicated by the blue dot, also had a relatively low number of yards gained while on defense.
defensive
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
Factor 3 - Average Offensive Yards Per Game
This chart illustrates the relationship between the number 1 ranked teams and their performance in gaining offensive yards per game over the past nine years. The hypothesis was to identify a positive correlation between offensive yards rank and overall rank. However, it was observed that over the years, the importance of accumulating a large number of offensive yards was not directly related to how teams ranked overall. While this relationship wasn’t as pronounced as expected, it still shows that top-ranked teams accumulate more offensive yards than 85% of all other teams.
graph1
To gain a deeper understanding of the significance of offensive yards, this chart presents a comprehensive analysis of the total yards gained by the top-ranked team in comparison to the performance of all other teams within the same year. The black line on the graph represents the average offensive yards amassed by teams across the league for that season. By examining this data closely, we can clearly see that all of the teams that ranked at the top of the standings achieved offensive yard totals that were either at or above the league’s average. This insight highlights the correlation between offensive productivity and overall team success in the league.
offensive
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).
Factor 4 - Average Offensive Yards Per Pass
The graph below shows the rankings of the top teams over ten years based on the team’s average offensive yards per pass. From a first glance of the graph, we can see that there is not much consistency with the number of average yards per pass ranking, as it fluctuates from mid-to-high ranks, mostly staying in the 8 to 10 range. While it is true that most of these teams stay in the top 10 range, we cannot definitively say that the ranking of average yards per pass has a correlation with the selection of the top teams for each year. By simply observing the graph, we do not see any trends, but let us look a bit deeper.
avg_ypp_line_and_dot
To look further into the data, we created a faceted dot plot where we compared the average yards per pass of each team to that of the number one top-ranked team for each year, which are indicated by the red dots in the graph below. As shown in the figure, the top-ranked teams for each year are often not consistently in one area above or below the average yards per pass. Therefore, we cannot say that this factor matters for all years.
average_YPP
Conclusions
Our analysis indicates that factors such as total touchdowns (TDs), average defensive yards allowed per game, and average offensive yards gained per game play a significant role in determining team selection for the NCAA College Football Playoff. Teams that excel in scoring, gaining yards, and limiting opponents’ yardage are more likely to be considered for the Final Four, underscoring the importance of offensive strength not just for winning games but for achieving playoff contention.
However, our findings suggest that excelling in these metrics alone does not guarantee playoff selection. The College Football Playoff Committee’s decisions often appear inconsistent, with some choices seeming arbitrary or lacking clear data-driven justification. While top-performing teams generally rank above average in key categories, they are not always the best in every metric. Additional factors, such as the strength of a team’s schedule, conference performance, and subjective considerations like “good losses” (close games against strong opponents), often influence the committee’s decisions in ways that are difficult to quantify.
This analysis underscores the dynamic nature of the playoff selection process, which varies significantly from season to season. Although the committee consistently considers certain performance metrics, its decisions are also shaped by unpredictable variables such as injuries, upsets, and unforeseen circumstances. Consequently, there is no definitive attribute or strategy that guarantees playoff selection, mirroring the complexity and unpredictability of the game itself.
Attribution
Introduction - Alana Lee
Figure 1 and Description - Maddie Wilson
Key Findings - Everyone
Research Question - Everyone
Data Sources - Alana Lee
Factor 1 and Analyses - Maddie Wilson
Factor 2 and Analyses - Kyla Rounsoville
Factor 3 and Analyses - Kyla Rounsoville
Factor 4 and Analyses - Alana Lee
Conclusion - Maddie Wilson and Kyla Rounsoville
Appendix
library(rvest)
library(dplyr)
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23#View(data23)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data23)
#View(filtered_data23)
library(ggplot2)
<- c(
team_colors "Michigan Wolverines" = "blue",
"Alabama Crimson Tide" = "red",
"Texas Longhorns" = "darkorange",
"Washington Huskies" = "purple")
ggplot(filtered_data23, aes(x = Team, y = YDS, fill = Team)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
theme_minimal() +
labs(title = "YDS by Team", x = "Team", y = "Score")
# Load necessary packages
library(ggplot2)
library(dplyr)
$YDS <- gsub("[^0-9.-]", "", filtered_data23$YDS)
filtered_data23
$YDS <- as.numeric(filtered_data23$YDS)
filtered_data23#View(filtered_data23)
<- ggplot(filtered_data23, aes(x = reorder(Team, YDS), y = YDS, fill = Team)) + # Reorder Team by yards on x-axis
yds_by_team geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Team", y = "Total Offensive Yards", title = "Total Offensive Yards by Team in 2023 Season") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
theme(legend.position = "none")# Rotate x-axis labels for readability
yds_by_team
<- ggplot(filtered_data23, aes(x = reorder(Team, `YDS/G`), y = `YDS/G`, fill = Team)) + # Reorder Team by YDS/G
yds_by_game geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Offensive Yards Per Game", y = "Team", title = "Offensive Yards Per Game in 2023 Season") + # Adjust labels for flipped axes
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for readability on flipped axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
yds_by_game
<- ggplot(filtered_data23, aes(x = reorder(Team, TD), y = TD, fill = Team)) + # Reorder Team by TD
TotalTD geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "# of Touchdowns", y = "Team", title = "Number of Total Touchdowns") + # Adjust labels for flipped axes
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for readability on flipped axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
TotalTD
%>% arrange(SACK)
filtered_data23
<- ggplot(filtered_data23, aes(x = reorder(Team, `CMP%`), y = `CMP%`, fill = Team)) + # Reorder Team by CMP% on x-axis
completion_percentage geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Team", y = "Completion Percentage", title = "Completion Percentage by Team") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
legend.position = "none") # Remove legend
completion_percentage
<- ggplot(filtered_data23, aes(x = reorder(Team, AVG), y = AVG, fill = Team)) + # Reorder Team by AVG
AVG_Yards_per_play geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Average Yards per Gain", y = "Team", title = "Average Yards Per Gain") + # Adjust labels after flipping
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for vertical axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
AVG_Yards_per_play
library(rvest)
library(ggplot2)
library(tidyr)
library(dplyr)
library(stringdist)
library(janitor)
library(stringr)
#install.packages("stringdist")
<- read_html(here::here('data_raw', 'defense_23.html')) %>%
football_defense_list html_table()
<- cbind(football_defense_list[[1]], football_defense_list[[2]])
defense2023
<- defense2023[1, ] # Extract the first row as new headers
new_headers <- defense2023[-1, ] # Remove the first row from the dataset
defense2023 names(defense2023) <- new_headers
rownames(defense2023) <- NULL
colnames(defense2023)[1:10] <- c(
'Team',
'Games Played',
'Total YDS',
'Total YDS/Game',
'Passing YDS',
'Passing YDS/Game',
'Rushing YDS',
'Rushing YDS/Game',
'Points',
'Points/Game'
)
<- defense2023 %>%
long_defense_data pivot_longer(cols = c('Games Played':'Points/Game'), # Specify the range using column names
names_to = "Metric",
values_to = "Value")
<- function(metric_name) {
create_bar_chart ggplot(filter(long_defense_data, Metric == metric_name), aes(x = Team, y = as.numeric(Value), fill = Team)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = paste("Comparison of", metric_name, "across Teams"),
x = "Teams",
y = metric_name) +
guides(fill = "none") # Use "none" instead of FALSE to remove the fill legend
}
# List of metrics for which we want to create bar charts
<- c('Games Played', 'Total YDS', 'Total YDS/Game', 'Passing YDS',
metrics 'Passing YDS/Game', 'Rushing YDS', 'Rushing YDS/Game', 'Points', 'Points/Game')
# Generate the bar charts for each metric
#for (metric in metrics) {
#print(create_bar_chart(metric))
#}
# Identify rows that can't be converted to numeric
<- long_defense_data %>%
problematic_rows filter(is.na(as.numeric(Value))) %>%
distinct()
# Display the problematic rows
#print(problematic_rows)
<- read_html(here::here('data_raw', 'defense_23.html')) %>%
football_defense_list html_table()
<- cbind(football_defense_list[[1]], football_defense_list[[2]])
defense2023
<- defense2023[1, ] # Extract the first row as new headers
new_headers <- defense2023[-1, ] # Remove the first row from the dataset
defense2023 names(defense2023) <- new_headers
rownames(defense2023) <- NULL
colnames(defense2023)[1:10] <- c(
'Team',
'Games Played',
'Total YDS',
'Total YDS/Game',
'Passing YDS',
'Passing YDS/Game',
'Rushing YDS',
'Rushing YDS/Game',
'Points',
'Points/Game'
)
<- defense2023 %>%
long_defense_data pivot_longer(cols = c('Games Played':'Points/Game'), # Specify the range using column names
names_to = "Metric",
values_to = "Value")
# Identify rows that can't be converted to numeric
<- long_defense_data %>%
problematic_rows filter(is.na(suppressWarnings(as.numeric(Value)))) %>% # This will filter rows that contain non-numeric values
distinct()
# View the problematic rows
#print(problematic_rows)
# Display the problematic rows
#print(problematic_rows)
# Remove commas from the Value column and convert to numeric
<- long_defense_data %>%
long_defense_data mutate(Value = as.numeric(gsub(",", "", Value)))
# Check for any remaining NA values
<- long_defense_data %>%
problematic_rows filter(is.na(Value)) %>%
distinct()
## View the problematic rows
#print(problematic_rows)
<- 2016:2023
years
<- data.frame()
all_data
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url
<- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})
if (!is.null(page)) {
<- page %>% html_node("table")
table
if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
<- all_data %>%
all_data select("RK",'Team', "REC","PTS","Year")
<- all_data %>%
top_teams group_by(Team) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 25)
write.csv(top_teams, "top_teams.csv", row.names = FALSE)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies"
)
<- long_defense_data %>%
long_defense_data mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
# Bar Charts
$Color <- ifelse(long_defense_data$Team %in%
long_defense_datac("Michigan Wolverines", "Washington Huskies", "Texas Longhorns", "Alabama Crimson Tide"),
"lightblue", "grey")
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = reorder(Team, -Value), y = Value, fill = Team)) + # Reorder Team by Value
geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) + # Use custom team colors
labs(
x = "Rushing YDS/Game",
y = "Team",
title = "Comparison of Rushing YDS/Game across Teams"
+
) theme(
axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for flipped axes
legend.position = "none" # Remove the legend
+
) coord_flip() # Flip x and y axes
rushing_yds
AVG_Yards_per_play
TotalTD
yds_by_game
library(cowplot)
# Combine the four plots into a grid
<- plot_grid(
combined_plot # The plots to combine
rushing_yds, AVG_Yards_per_play, TotalTD, yds_by_game, ncol = 2, # Arrange plots in 2 columns
align = "v" # Align vertically within columns
)
# Display the combined plot
combined_plot
library(cowplot)
# Reduce text size for each plot
<- rushing_yds + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
rushing_yds <- AVG_Yards_per_play + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
AVG_Yards_per_play <- TotalTD + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
TotalTD <- yds_by_game + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
yds_by_game
# Combine all plots in a single row
<- plot_grid(
combined_plot
rushing_yds,
AVG_Yards_per_play,
TotalTD,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
# Define a consistent order for teams (e.g., alphabetical)
library(dplyr)
# Arrange teams by most TD and create consistent order
<- filtered_data23 %>%
consistent_team_order arrange(desc(TD)) %>% # Sort teams by TD in descending order
pull(Team) %>% # Extract the sorted team names
unique() # Ensure unique team names
# Apply this order to the 'Team' column
$Team <- factor(filtered_data23$Team, levels = consistent_team_order)
filtered_data23
# If using another dataset, apply the same factor levels
$Team <- factor(long_defense_data$Team, levels = consistent_team_order)
long_defense_data
# Apply this order to all datasets
$Team <- factor(filtered_data23$Team, levels = consistent_team_order)
filtered_data23$Team <- factor(long_defense_data$Team, levels = consistent_team_order)
long_defense_data
<- mean(filter(long_defense_data, Metric == 'Rushing YDS/Game')$Value, na.rm = TRUE)
avg_rushing
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = Value, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_rushing, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) + # Use custom team colors
labs(
x = "Rushing YDS/Game",
y = "Team",
title = "Comparison of Rushing YDS/Game across Teams"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none" # Remove the legend
)<- mean(filtered_data23$AVG, na.rm = TRUE)
avg_yards
<- ggplot(filtered_data23,
AVG_Yards_per_play aes(x = AVG, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yards, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Average Yards per Play",
y = "Team",
title = "Average Yards Per Play by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)<- mean(filtered_data23$TD, na.rm = TRUE)
avg_td
<- ggplot(filtered_data23,
TotalTD aes(x = TD, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_td, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Total Touchdowns",
y = "Team",
title = "Total Touchdowns by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)
<- mean(filtered_data23$`YDS/G`, na.rm = TRUE)
avg_yds_game
<- ggplot(filtered_data23,
yds_by_game aes(x = `YDS/G`, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yds_game, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Offensive Yards per Game",
y = "Team",
title = "Offensive Yards per Game by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = Value, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_rushing, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Rushing YDS/Game",
y = NULL, # Remove y-axis title
title = "Rushing YDS/Game Per Teams"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
AVG_Yards_per_play aes(x = AVG, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yards, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Average Yards per Play",
y = NULL, # Remove y-axis title
title = "Average Yards/Pass by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
TotalTD aes(x = TD, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_td, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Total Touchdowns",
y = "Team", # Retain y-axis title and team names
title = "Total Touchdowns by Team"
+
) theme(
axis.text.y = element_text(size = 10), # Retain team names
legend.position = "none"
)
<- ggplot(filtered_data23,
yds_by_game aes(x = `YDS/G`, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yds_game, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Offensive Yards per Game",
y = NULL, # Remove y-axis title
title = "Offensive Yards/Game by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4, # Arrange plots in a single row
align = "h", # Align horizontally and vertically
axis = "tb", # Align axes
rel_widths = c(5, 2, 2, 2) # Equal widths for all plots
)
# Display the combined plot
combined_plot
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(tidyverse)
library(rvest)
library(here)
library(cowplot)
<- c(14,15,16,17,18,19,20,21,22,23)
years
<- function(start_year, end_year) {
scrape_clean # Initialize an empty list to store yearly data
<- list()
all_data
# Loop over the range of years
for (year in start_year:end_year) {
# Generate the file path
<- here("data_raw", paste0("defense_yards_", year, ".html"))
file_path
# Read the HTML file and extract tables
<- read_html(file_path) %>%
defense_yards html_table(fill = TRUE)
# Combine the first two tables (assuming they are in the desired format)
<- cbind(defense_yards[[1]], defense_yards[[2]])
combined_table
# Add a column for the year
$Year <- year
combined_table
# Append to the list
as.character(year)]] <- combined_table
all_data[[
}
# Combine all the yearly tables into one data frame
<- bind_rows(all_data)
combined_data_def
return(combined_data_def)
}
# Example usage
<- 14
start_year <- 23
end_year
<- scrape_clean(start_year, end_year)
combined_defense_data
# View the combined data
<- combined_defense_data[1, ]
new_headers <- combined_defense_data[-1, ]
combined_offense_data names(combined_defense_data) <- new_headers
<- combined_defense_data[, c(1,8,11)] %>%
cleaned_defense rename(Team = 1, Yards_Game = 2, Year = 3) %>%
mutate(Yards_Game = as.numeric(gsub(",", "", Yards_Game)))
#or (1 in years) {
$Year <- cleaned_defense$Year + 2000
cleaned_defenseglimpse(cleaned_defense)
cleaned_defense #unique(all_data$Team)
<- data.frame()
all_data
<- 2014:2023
years # Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
} View(all_data)
# all data is a smaller data frame consisted of
# rankings of top 25 teams every year
# edit offense data frame to also include RK information form all_data.
# Create a mapping of old names to new names
<- c(
name_mapping "OSU Ohio State" = "OSU",
"OSU Ohio State " = "OSU",
"UCF UCF " = "UCF",
"ORE Oregon" = "ORE",
"TCU TCU" = "TCU",
"ALA Alabama" = "ALA",
"ALA Alabama " = "ALA",
"FSU Florida State" = "FSU",
"MSU Michigan State" = "MSU",
"BAY Baylor" = "BAY",
"GT Georgia Tech" = "GT",
"UGA Georgia" = "UGA",
"UGA Georgia " = "UGA",
"UCLA UCLA" = "UCLA",
"MSST Mississippi State" = "MSST",
"ASU Arizona State" = "ASU",
"WIS Wisconsin" = "WIS",
"MIZ Missouri" = "MIZ",
"CLEM Clemson" = "CLEM",
"CLEM Clemson " = "CLEM",
"BSU Boise State" = "BSU",
"MISS Ole Miss" = "MISS",
"KSU Kansas State" = "KSU",
"ARIZ Arizona" = "ARIZ",
"USC USC" = "USC",
"UTAH Utah" = "UTAH",
"AUB Auburn" = "AUB",
"MRSH Marshall" = "MRSH",
"LOU Louisville" = "LOU",
"MEM Memphis" = "MEM",
"STAN Stanford" = "STAN",
"OKLA Oklahoma" = "OKLA",
"HOU Houston" = "HOU",
"IOWA Iowa" = "IOWA",
"ND Notre Dame" = "ND",
"MICH Michigan" = "MICH",
"MICH Michigan " = "MICH",
"UNC North Carolina" = "UNC",
"LSU LSU" = "LSU",
"LSU LSU " = "LSU",
"NAVY Navy" = "NAVY",
"OKST Oklahoma State" = "OKST",
"TENN Tennessee" = "TENN",
"NW Northwestern" = "NW",
"WKU Western Kentucky" = "WKU",
"FLA Florida" = "FLA",
"WASH Washington" = "WASH",
"PSU Penn State" = "PSU",
"VT Virginia Tech" = "VT",
"COLO Colorado" = "COLO",
"WVU West Virginia" = "WVU",
"USF South Florida" = "USF",
"MIAMI Miami" = "MIAMI",
"SDSU San Diego State" = "SDSU",
"NCST NC State" = "NCST",
"TEX Texas" = "TEX",
"WSU Washington State" = "WSU",
"UK Kentucky" = "UK",
"SYR Syracuse" = "SYR",
"TA&M Texas A&M" = "TA&M",
"FRES Fresno State" = "FRES",
"ARMY Army" = "ARMY",
"USU Utah State" = "USU",
"CIN Cincinnati" = "CIN",
"MINN Minnesota" = "MINN",
"APP Appalachian State" = "APP",
"AFA Air Force" = "AFA",
"OU Oklahoma" = "OU",
"ISU Iowa State" = "ISU",
"BYU BYU" = "BYU",
"IU Indiana" = "IU",
"CCU Coastal Carolina" = "CCU",
"UL Louisiana" = "UL",
"LIB Liberty" = "LIB",
"BALL Ball State" = "BALL",
"SJSU San José State" = "SJSU",
"BUFF Buffalo" = "BUFF",
"PITT Pittsburgh" = "PITT",
"WAKE Wake Forest" = "WAKE",
"ARK Arkansas" = "ARK",
"TULN Tulane" = "TULN",
"ORST Oregon State" = "ORST",
"TROY Troy" = "TROY",
"SC South Carolina" = "SC",
"SMU SMU" = "SMU",
"KU Kansas" = "KU"
)
# Update the 'Team' column using the mapping
# Remove numbers and parentheses
<- all_data %>%
all_data mutate(
Team = gsub("[0-9()]", "", Team)
)
<- all_data %>%
all_data mutate(Team = if_else(Team %in% names(name_mapping),
name_mapping[Team], Team))
<- all_data %>%
combined_data_def inner_join( cleaned_defense , by = c("Team", "Year"))
# final combined datat frame for the past 9 yeard
glimpse(combined_data_def)
<- combined_data_def[,c(1,2,8,9)]
combined_data_def
<- combined_data_def %>%
combined_data_def group_by(Year) %>% # Group by Year so that ranking is done per year
mutate(yards_rank = rank(Yards_Game, ties.method = "min")) %>% # Rank within each year
ungroup()
head(combined_data_def)
<- combined_data_def %>%
combined_data_deffilter(Year %in% c(2014, 2021, 2023))
<- ggplot(combined_data_def %>%
defensive_dot_plot filter(RK == 1), # Filter for top-ranked teams
aes(x = Year, y = yards_rank)) + # Mapping Year to x and yards_rank to y
geom_point(color = "purple") + # Uniform color for points
geom_line(color = "black", size = 0.25) + # Uniform color for lines
scale_y_reverse() + # Reverse y-axis
labs(
title = "Do #1 Teams Excel in Defensive Yards/Game?",
x = "Year",
y = "Defensive Yards Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
defensive_dot_plot
defensive_dot_plot
## plot two
<- cleaned_defense %>%
combined_data_def_2 left_join(all_data, by = c("Team", "Year"))
glimpse(combined_data_def_2)
#View(combined_data)
<- combined_data_def_2[,c(1,2,3,4)]
combined_data_def_2 glimpse(combined_data_def_2)
<- combined_data_def_2 %>%
combined_data_def_2filter(Year %in% c(2014, 2021, 2023))
<- ggplot(data = combined_data_def_2,
defensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_data_def_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
geom_hline(data = combined_data_def_2 %>%
group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE)),
aes(yintercept = avg_yards),
linetype = "dashed", color = "black",size = 0.75) + # Add dotted line for the average Yards_Game
labs(
title = "Defensive Yards Per Game by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "purple"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
defensive
library(rvest)
library(dplyr)
####### 2023 ###################################################################
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23View(data23)
#iew(data23)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data23)
#View(filtered_data23)
library(ggplot2)
<- c(
team_colors "Michigan Wolverines" = "blue",
"Alabama Crimson Tide" = "red",
"Texas Longhorns" = "orange",
"Washington Huskies" = "purple")
library(dplyr)
# Rank teams in df23 based on the number of TDs
<- data23 %>%
data23 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2023)
<- data23 %>%
average_TD_23 group_by(Year) %>%
summarize(Average_Touchdowns = mean(TD, na.rm = TRUE))
#View(data23)
####### 2022 ###################################################################
<- read_html(here::here('data_raw', '2022_passing_offense.html')) %>%
data_list22 html_table()
<- cbind(data_list22[[1]], data_list22[[2]])
data22#View(data22)
<- c(
top25 "OHIO",
"ND",
"ALA",
"CLEM",
"IOWA",
"LSU",
"MICH",
"OKST",
"TEX",
"UGA",
"WASH",
"CIN",
"FLA",
"OU",
"ORE",
"USC",
"UTAH",
"AUB",
"BOIS",
"FSU",
"MIA",
"NU",
"TENN",
"PSU",
"TA&M")
<- data22 %>%
filtered_data22 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data22)
<- c(
team_lookup "OHIO" = "Ohio State Buckeyes",
"ND" = "Notre Dame Fighting Irish",
"ALA" = "Alabama Crimson Tide",
"CLEM" = "Clemson Tigers",
"IOWA" = "Iowa Hawkeyes",
"LSU" = "LSU Tigers",
"MICH" = "Michigan Wolverines",
"OKST" = "Oklahoma State Cowboys",
"TEX" = "Texas Longhorns",
"UGA" = "Georgia Bulldogs",
"WASH" = "Washington Huskies",
"CIN" = "Cincinnati Bearcats",
"FLA" = "Florida Gators",
"OU" = "Oklahoma Sooners",
"ORE" = "Oregon Ducks",
"USC" = "USC Trojans",
"UTAH" = "Utah Utes",
"AUB" = "Auburn Tigers",
"BOIS" = "Boise State Broncos",
"FSU" = "Florida State Seminoles",
"MIA" = "Miami Hurricanes",
"NU" = "Northwestern Wildcats",
"TENN" = "Tennessee Volunteers",
"PSU" = "Penn State Nittany Lions",
"TA&M" = "Texas A&M Aggies"
)
# Rename teams in the 'Team' column of the dataset
<- filtered_data22 %>%
filtered_data22 mutate(Team = recode(Team, !!!team_lookup))
#View(filtered_data22)
<- data22 %>%
data22 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2022)
#View(df22)
####### 2021 ###################################################################
<- read_html(here::here('data_raw', '2021_passing_offense.html')) %>%
data_list21 html_table()
<- cbind(data_list21[[1]], data_list21[[2]])
data21#View(data21)
<- c(
top25 "OHIO",
"ND",
"ALA",
"CLEM",
"IOWA",
"LSU",
"MICH",
"OKST",
"TEX",
"UGA",
"WASH",
"CIN",
"FLA",
"OU",
"ORE",
"USC",
"UTAH",
"AUB",
"BOIS",
"FSU",
"MIA",
"NU",
"TENN",
"PSU",
"TA&M")
<- data21 %>%
filtered_data21 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data21)
<- c(
team_lookup "OHIO" = "Ohio State Buckeyes",
"ND" = "Notre Dame Fighting Irish",
"ALA" = "Alabama Crimson Tide",
"CLEM" = "Clemson Tigers",
"IOWA" = "Iowa Hawkeyes",
"LSU" = "LSU Tigers",
"MICH" = "Michigan Wolverines",
"OKST" = "Oklahoma State Cowboys",
"TEX" = "Texas Longhorns",
"UGA" = "Georgia Bulldogs",
"WASH" = "Washington Huskies",
"CIN" = "Cincinnati Bearcats",
"FLA" = "Florida Gators",
"OU" = "Oklahoma Sooners",
"ORE" = "Oregon Ducks",
"USC" = "USC Trojans",
"UTAH" = "Utah Utes",
"AUB" = "Auburn Tigers",
"BOIS" = "Boise State Broncos",
"FSU" = "Florida State Seminoles",
"MIA" = "Miami Hurricanes",
"NU" = "Northwestern Wildcats",
"TENN" = "Tennessee Volunteers",
"PSU" = "Penn State Nittany Lions",
"TA&M" = "Texas A&M Aggies"
)
# Rename teams in the 'Team' column of the dataset
<- filtered_data21 %>%
filtered_data21 mutate(Team = recode(Team, !!!team_lookup))
#View(filtered_data21)
<- data21 %>%
data21 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2021)
####### 2020 ###################################################################
<- read_html(here::here('data_raw', '2020_passing_offense.html')) %>%
data_list20 html_table()
<- cbind(data_list20[[1]], data_list20[[2]])
data20#View(data20)
<- data20 %>%
data20 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD)))%>%
mutate(Year = 2020)
####### 2019 ###################################################################
<- read_html(here::here('data_raw', '2019_passing_offense.html')) %>%
data_list19 html_table()
<- cbind(data_list19[[1]], data_list19[[2]])
data19#View(data19)
<- data19 %>%
data19 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2019)
####### 2018 ###################################################################
<- read_html(here::here('data_raw', '2018_passing_offense.html')) %>%
data_list18 html_table()
<- cbind(data_list18[[1]], data_list18[[2]])
data18#View(data18)
<- data18 %>%
data18 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD)))%>%
mutate(Year = 2018)
####### 2017 ###################################################################
<- read_html(here::here('data_raw', '2017_passing_offense.html')) %>%
data_list17 html_table()
<- cbind(data_list17[[1]], data_list17[[2]])
data17#View(data17)
<- data17 %>%
data17 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2017)
#View(data17)
####### 2016 ###################################################################
<- read_html(here::here('data_raw', '2016_passing_offense.html')) %>%
data_list16 html_table()
<- cbind(data_list16[[1]], data_list16[[2]])
data16#View(data16)
<- data16 %>%
data16 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2016)
####### 2015 ###################################################################
<- read_html(here::here('data_raw', '2015_passing_offense.html')) %>%
data_list15 html_table()
<- cbind(data_list15[[1]], data_list15[[2]])
data15#View(data15)
<- data15 %>%
data15 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2015)
####### 2014 ###################################################################
<- read_html(here::here('data_raw', '2014_passing_offense.html')) %>%
data_list14 html_table()
<- cbind(data_list14[[1]], data_list14[[2]])
data14#View(data14)
<- data14 %>%
data14 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2014)
####### BUILDING THE DATA SET ###################################################################
<- data23 %>%
df23 filter(Team == "Michigan Wolverines")
<- data22 %>%
df22 filter(Team == "UGA")
<- data21 %>%
df21 filter(Team == "UGA")
<- data20 %>%
df20 filter(Team == "ALA")
<- data19 %>%
df19 filter(Team == "LSU")
<- data18 %>%
df18 filter(Team == "CLEM")
<- data17 %>%
df17 filter(Team == "ALA")
<- data16 %>%
df16 filter(Team == "CLEM")
<- data15 %>%
df15 filter(Team == "ALA")
<- data14 %>%
df14 filter(Team == "ALA")
# Combine all data sets into one
<- bind_rows(df23, df22, df21, df20, df19, df18, df17, df16, df15, df14)
combined_data
#View(combined_data)
library(ggplot2)
<- ggplot(combined_data, aes(x = Year, y = Rank)) +
td_rank # Black line connecting all points
geom_line(color = "black", linewidth = 0.25) +
# Uniform points for each year
geom_point(color = "red", size = 2) +
# Reverse y-axis to show Rank 1 at the top
scale_y_reverse() +
labs(
title = "Do #1 Teams Excel in Total Number of Touchdowns?",
x = "Year",
y = "Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
td_rank
<- ggplot(combined_data, aes(x = Year, y = TD)) +
touchdowns geom_line(color = "black", linewidth = 0.25) + # Black line connecting all points
geom_point(aes(color = Team), size = 3) + # Points for each year and team with color
# Reverse y-axis to show Rank 1 at the top
labs(
title = "#1 Team Number of Touchdowns Over the Years",
x = "Year",
y = "Touchdowns",
color = "Team"
+
) theme_minimal() +
theme(
legend.position = "bottom"
)
td_rank
touchdowns
<- bind_rows(data23, data22, data21, data20, data19, data18, data17, data16, data15, data14)
combined_whole_data #View(combined_whole_data)
library(dplyr)
<- combined_whole_data %>%
combined_whole_data mutate(row_key = do.call(paste, c(., sep = "_")))
#View(combined_whole_data)
<- combined_data %>%
combined_data mutate(row_key = do.call(paste, c(., sep = "_")))
# Check presence and assign color code
<- combined_whole_data %>%
combined_whole_data mutate(
present_in_combined_data = row_key %in% combined_data$row_key,
color_code = ifelse(present_in_combined_data, "red", "lightgrey")
)
# Reorder the data so red points are plotted last
<- combined_whole_data %>%
combined_whole_data arrange(color_code == "red")
#View(combined_whole_data)
# Calculate the overall average touchdowns
<- mean(combined_whole_data$TD, na.rm = TRUE)
average_TD
# Create the plot with an average line and label
<- ggplot(combined_whole_data, aes(x = Team, y = TD, color = color_code)) +
everyone geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) + # Add average line
annotate(
"text", x = "ALA", y = 15 + 0.5,
label = "Average TDs", color = "black", size = 2.5, hjust = 0
+ # Add label for the average line
) scale_color_manual(values = c("red" = "red", "lightgrey" = "lightgrey")) +
theme_minimal() +
labs(
title = "Total Touchdowns by Team (Split by Year)",
x = "Team",
y = "Total Touchdowns",
color = "Presence in Combined Data"
+
) theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Remove the legend
+
) facet_wrap(~ Year) # Use fixed y-axis for all facets
#
everyone
touchdowns
td_rank
<- ggplot(combined_whole_data, aes(x = Team, y = TD, color = color_code)) +
everyone # Plot points with slight dodge for clarity
geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for the average
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = "ALA", y = 15 + 0.5,
label = "Average TDs", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to only include red in the legend
scale_color_manual(
values = c("red" = "red"),
labels = c("red" = "Top Teams"), # Only keep red in the legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Total Touchdowns by Team (Split by Year)",
x = NULL, # Remove x-axis label
y = "Total Touchdowns",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
everyone
<- ggplot(combined_whole_data, aes(x = Team, y = TD)) +
everyone # Plot all points as light grey
geom_point(color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Overlay red points for "Top Teams"
geom_point(
data = combined_whole_data %>% filter(color_code == "red"),
aes(color = "Top Teams"), size = 1.5, position = position_dodge(width = 0.5)
+
)
# Add dashed line for the average
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = "LSU", y = 15 + 0.5,
label = "Average", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to show only "Top Teams" in the legend
scale_color_manual(
values = c("Top Teams" = "red"), # Red for "Top Teams"
labels = c("Top Teams"), # Only "Top Teams" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Total Touchdowns by Year",
x = NULL, # Remove x-axis label
y = "Total Touchdowns",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
everyone
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(tidyverse)
library(rvest)
library(here)
library(cowplot)
<- c(14,15,16,17,18,19,20,21,22,23)
years
<- function(start_year, end_year) {
scrape_clean # Initialize an empty list to store yearly data
<- list()
all_data
# Loop over the range of years
for (year in start_year:end_year) {
# Generate the file path
<- here("data_raw", paste0("offense_yards_", year, ".html"))
file_path
# Read the HTML file and extract tables
<- read_html(file_path) %>%
offense_yards html_table(fill = TRUE)
# Combine the first two tables (assuming they are in the desired format)
<- cbind(offense_yards[[1]], offense_yards[[2]])
combined_table
# Add a column for the year
$Year <- year
combined_table
# Append to the list
as.character(year)]] <- combined_table
all_data[[
}
# Combine all the yearly tables into one data frame
<- bind_rows(all_data)
combined_data_def
return(combined_data_def)
}
# Example usage
<- 14
start_year <- 23
end_year
<- scrape_clean(start_year, end_year)
combined_offense_data
# View the combined data
<- combined_offense_data[1, ]
new_headers <- combined_offense_data[-1, ]
combined_offense_data names(combined_offense_data) <- new_headers
<- combined_offense_data[, c(1,8,11)] %>%
cleaned_offense rename(Team = 1, Yards_Game = 2, Year = 3) %>%
mutate(Yards_Game = as.numeric(gsub(",", "", Yards_Game)))
#or (1 in years) {
$Year <- cleaned_offense$Year + 2000
cleaned_offenseglimpse(cleaned_offense)
<- data.frame()
all_data
<- 2014:2023
years # Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
# all data is a smaller data frame consisted of
# rankings of top 25 teams every year
# edit offense data frame to also include RK information form all_data.
# Create a mapping of old names to new names
<- c(
name_mapping "OSU Ohio State" = "OSU",
"OSU Ohio State " = "OSU",
"UCF UCF " = "UCF",
"ORE Oregon" = "ORE",
"TCU TCU" = "TCU",
"ALA Alabama" = "ALA",
"ALA Alabama " = "ALA",
"FSU Florida State" = "FSU",
"MSU Michigan State" = "MSU",
"BAY Baylor" = "BAY",
"GT Georgia Tech" = "GT",
"UGA Georgia" = "UGA",
"UGA Georgia " = "UGA",
"UCLA UCLA" = "UCLA",
"MSST Mississippi State" = "MSST",
"ASU Arizona State" = "ASU",
"WIS Wisconsin" = "WIS",
"MIZ Missouri" = "MIZ",
"CLEM Clemson" = "CLEM",
"CLEM Clemson " = "CLEM",
"BSU Boise State" = "BSU",
"MISS Ole Miss" = "MISS",
"KSU Kansas State" = "KSU",
"ARIZ Arizona" = "ARIZ",
"USC USC" = "USC",
"UTAH Utah" = "UTAH",
"AUB Auburn" = "AUB",
"MRSH Marshall" = "MRSH",
"LOU Louisville" = "LOU",
"MEM Memphis" = "MEM",
"STAN Stanford" = "STAN",
"OKLA Oklahoma" = "OKLA",
"HOU Houston" = "HOU",
"IOWA Iowa" = "IOWA",
"ND Notre Dame" = "ND",
"MICH Michigan" = "MICH",
"MICH Michigan " = "MICH",
"UNC North Carolina" = "UNC",
"LSU LSU" = "LSU",
"LSU LSU " = "LSU",
"NAVY Navy" = "NAVY",
"OKST Oklahoma State" = "OKST",
"TENN Tennessee" = "TENN",
"NW Northwestern" = "NW",
"WKU Western Kentucky" = "WKU",
"FLA Florida" = "FLA",
"WASH Washington" = "WASH",
"PSU Penn State" = "PSU",
"VT Virginia Tech" = "VT",
"COLO Colorado" = "COLO",
"WVU West Virginia" = "WVU",
"USF South Florida" = "USF",
"MIAMI Miami" = "MIAMI",
"SDSU San Diego State" = "SDSU",
"NCST NC State" = "NCST",
"TEX Texas" = "TEX",
"WSU Washington State" = "WSU",
"UK Kentucky" = "UK",
"SYR Syracuse" = "SYR",
"TA&M Texas A&M" = "TA&M",
"FRES Fresno State" = "FRES",
"ARMY Army" = "ARMY",
"USU Utah State" = "USU",
"CIN Cincinnati" = "CIN",
"MINN Minnesota" = "MINN",
"APP Appalachian State" = "APP",
"AFA Air Force" = "AFA",
"OU Oklahoma" = "OU",
"ISU Iowa State" = "ISU",
"BYU BYU" = "BYU",
"IU Indiana" = "IU",
"CCU Coastal Carolina" = "CCU",
"UL Louisiana" = "UL",
"LIB Liberty" = "LIB",
"BALL Ball State" = "BALL",
"SJSU San José State" = "SJSU",
"BUFF Buffalo" = "BUFF",
"PITT Pittsburgh" = "PITT",
"WAKE Wake Forest" = "WAKE",
"ARK Arkansas" = "ARK",
"TULN Tulane" = "TULN",
"ORST Oregon State" = "ORST",
"TROY Troy" = "TROY",
"SC South Carolina" = "SC",
"SMU SMU" = "SMU",
"KU Kansas" = "KU"
)
# Update the 'Team' column using the mapping
# Remove numbers and parentheses
<- all_data %>%
all_data mutate(
Team = gsub("[0-9()]", "", Team)
)
<- all_data %>%
all_data mutate(Team = if_else(Team %in% names(name_mapping),
name_mapping[Team], Team))
<- all_data %>%
combined_offense_data inner_join( cleaned_offense , by = c("Team", "Year"))
# final combined datat frame for the past 9 yeard
glimpse(combined_offense_data)
<- combined_offense_data[,c(1,2,8,9)]
combined_offense_data
<- combined_offense_data %>%
combined_offense_data_2 group_by(Year) %>% # Group by Year so that ranking is done per year
mutate(yards_rank = rank(Yards_Game, ties.method = "min")) %>% # Rank within each year
ungroup()
## plot 1
<- ggplot(combined_offense_data_2 %>%
graph1 filter(RK == 1), # Filter for top-ranked teams
aes(x = Year, y = yards_rank)) + # Mapping Year to x and yards_rank to y
geom_point(color = "blue", size = 2) + # Fixed color for points
geom_line(color = "black", size = 0.25) + # Fixed color for lines
scale_y_reverse() + # Reverse y-axis
labs(
title = "Do #1 Teams Excel in Offensive Yards per Game?",
x = "Year",
y = "Offensive Yards Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
graph1
## plot two
<- cleaned_offense %>%
combined_offense_data_2 left_join(all_data, by = c("Team", "Year"))
<- combined_offense_data_2[,c(1,2,3,4)]
combined_offense_data_2 ## graph 2
<- ggplot(data = combined_offense_data_2,
offensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
geom_hline(data = combined_offense_data_2 %>%
group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE)),
aes(yintercept = avg_yards),
linetype = "dashed", color = "black") + # Add dotted line for the average Yards_Game
labs(
title = "Offensive Yards Per Game by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "blue"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
library(ggplot2)
library(dplyr)
# Compute yearly averages
<- combined_offense_data_2 %>%
yearly_avg group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE))
# Create the plot
<- ggplot(data = combined_offense_data_2,
offensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
# Add dotted line for the average Yards_Game
geom_hline(data = yearly_avg,
aes(yintercept = avg_yards),
linetype = "dashed", color = "black") +
# Add "Average" text label
geom_text(data = yearly_avg,
aes(x = Inf, y = 150, label = "Average"),
inherit.aes = FALSE,
hjust = 1.1, # Adjust horizontal alignment
color = "black",
size = 3) +
labs(
title = "Offensive Yards Per Game by Team, Faceted by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "blue"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
# Display the plot
offensive
<- ggplot(combined_offense_data_2, aes(x = Team, y = Yards_Game, color = case_when(
offensive == 1 ~ "Top Rank", # Highlight "Top Rank" teams
RK TRUE ~ "Other Teams" # All other teams
+
))) # Plot all points
geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for average Yards/Game
geom_hline(data = yearly_avg, aes(yintercept = avg_yards),
linetype = "dashed", color = "black", size = 0.5) +
# Add text annotation for the average line
geom_text(data = yearly_avg, aes(x = -Inf, y = avg_yards, label = "Average"),
inherit.aes = FALSE, hjust = -0.1, vjust = -0.5,
color = "black", size = 2.5) +
# Manual color scale
scale_color_manual(
values = c("Top Rank" = "blue", "Other Teams" = "lightgrey")
+
)
# Facet by Year
facet_wrap(~ Year, scales = "free_x") +
# Labels and theme adjustments
labs(
title = "Offensive Yards Per Game by Team (Split by Year)",
x = "Team",
y = "Yards Per Game",
color = "Rank"
+
) theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
panel.grid.major.x = element_blank(), # Remove major gridlines
panel.grid.minor.x = element_blank(), # Remove minor gridlines
legend.position = "none", # Remove legend
strip.text = element_text(size = 10) # Adjust facet label text size
)
# Display the plot
offensive
<- ggplot() +
offensive # Plot all points (background points with light grey color)
geom_point(data = combined_offense_data_2,
aes(x = Team, y = Yards_Game),
color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Plot blue points for "Top Rank" on top
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(x = Team, y = Yards_Game, color = "Top Rank"),
size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for average Yards/Game
geom_hline(data = yearly_avg, aes(yintercept = avg_yards),
linetype = "dashed", color = "black", size = 0.5) +
# Add text annotation for the average line
geom_text(data = yearly_avg, aes(x = 35, y = avg_yards, label = "Average"),
inherit.aes = FALSE, hjust = -0.1, vjust = -0.5,
color = "black", size = 2.5) +
# Manual color scale to include only "Top Rank" in the legend
scale_color_manual(
values = c("Top Rank" = "blue"), # Blue for "Top Rank"
labels = c("Top Rank"), # Only keep "Top Rank" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Facet by Year
facet_wrap(~ Year, scales = "free_x") +
# Labels and theme adjustments
labs(
title = "Offensive Yards Per Game by Team",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
color = "Rank"
+
) theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
panel.grid.major.x = element_blank(), # Remove major gridlines
panel.grid.minor.x = element_blank(), # Remove minor gridlines
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10) # Adjust facet label text size
)
# Display the plot
offensive
library(rvest)
library(tidyr)
library(dplyr)
library(stringdist)
library(janitor)
library(stringr)
library(ggplot2)
library(gganimate)
library(magick)
library(here)
###### GATHERING THE DATA #######
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
<- filtered_data23 %>%
df2 arrange(YDS)
<- 2014:2023
years <- data.frame()
all_data
# Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
# Select only the columns we need
<- all_data %>%
all_data select("RK",'Team', "REC","PTS","Year")
<- all_data %>% filter(RK == 1)
ranked_1
###### DOT AND LINE GRAPH #######
<- function(year) {
extract_avg_ypp # Construct the file name for the HTML file
<- paste0(year, "_avg_yds_per_play.html")
file_name
# Load the data
<- read_html(here::here('data_raw', file_name)) %>%
avg_ypp html_table()
<- as.data.frame(avg_ypp) %>%
avg_ypp select("Team", "AVG") %>%
arrange(desc(AVG))
# Save the data frame globally
assign("avg_ypp", avg_ypp, envir = .GlobalEnv)
# Return the data frame as well
return(avg_ypp)
}
# Loop through years 2014 to 2023
for (year in 2014:2023) {
assign(paste0("avg_ypp_", year), extract_avg_ypp(year))
}
# Assuming avg_ypp_2014, avg_ypp_2015, ..., avg_ypp_2023 are defined
<- list(
avg_ypp_list "2014" = avg_ypp_2014,
"2015" = avg_ypp_2015,
"2016" = avg_ypp_2016,
"2017" = avg_ypp_2017,
"2018" = avg_ypp_2018,
"2019" = avg_ypp_2019,
"2020" = avg_ypp_2020,
"2021" = avg_ypp_2021,
"2022" = avg_ypp_2022,
"2023" = avg_ypp_2023
)
<- all_data %>%
combined_data filter(RK <= 1) %>% # Filter for top 10 teams
mutate(Year = as.numeric(Year)) # Ensure Year is numeric
<- combined_data %>%
combined_data rowwise() %>%
mutate(
# Extract the abbreviated team name (e.g., "OSU" from "OSU Ohio State (59)")
AbbrevTeam = str_extract(Team, "^[A-Z]+"),
# Dynamically look up the corresponding AVG value
AVG = avg_ypp_list[[as.character(Year)]] %>%
filter(Team == AbbrevTeam) %>%
pull(AVG) %>%
first() # Return the first match, if any
%>%
) ungroup()
# Plot
<- ggplot(combined_data, aes(x = Year, y = AVG)) +
avg_ypp_line_and_dot # Black line connecting all points
geom_line(color = "black", linewidth = 0.25, aes(group = 1)) +
# Uniform points for each year
geom_point(color = "orange", size = 2) +
labs(
title = "Do #1 Teams Excel in Average Offensive Yards/Pass?",
x = "Year",
y = "Average Yards/Pass Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
avg_ypp_line_and_dot
###### FACETED DOT PLOT #########
# Function to prepare data for plotting
<- function(year, rank_index) {
prepare_avg_ypp_data # Construct the file name for the HTML file
<- paste0(year, "_avg_yds_per_play.html")
file_name
# Load the data
<- read_html(here::here('data_raw', file_name)) %>%
avg_ypp html_table()
<- as.data.frame(avg_ypp)
avg_ypp <- avg_ypp %>%
avg_ypp select('Team', "AVG") %>%
arrange(-AVG)
# Dynamically extract the abbreviation from the 'ranked_1' data based on rank_index
<- str_extract(ranked_1$Team[rank_index], "^[A-Za-z]+")
top_team_abbreviation
# Create a new column to indicate the highlight color
<- avg_ypp[1:50, ]
top_50_teams <- top_50_teams %>%
top_50_teams mutate(
is_highlight = Team == top_team_abbreviation,
fill_color = ifelse(is_highlight, "red", "lightgrey"), # Use regular red for highlight
display_name = ifelse(is_highlight, Team, ""),
Year = year # Add the year column for faceting
)
return(top_50_teams)
}
# Combine data for all years
<- bind_rows(lapply(2014:2023, function(year) {
all_years_data <- year - 2013 # Map year to rank index
rank_index prepare_avg_ypp_data(year, rank_index)
}))# Calculate the average AVG for each year
<- all_years_data %>%
yearly_averages group_by(Year) %>%
summarize(avg_yds = mean(AVG))
# Merge averages back into the data for labeling
<- all_years_data %>%
all_years_data left_join(yearly_averages, by = "Year")
# Create the faceted dot plot with average line
ggplot(all_years_data) +
# Plot all grey dots first
geom_point(data = filter(all_years_data, !is_highlight),
aes(x = reorder(Team, AVG), y = AVG),
color = "lightgrey", size = 1.5) +
# Plot the red highlighted dot on top with a legend
geom_point(data = filter(all_years_data, is_highlight),
aes(x = reorder(Team, AVG), y = AVG, color = "Top Rank"),
size = 1.5) +
# Add an average line for each year
geom_hline(data = yearly_averages, aes(yintercept = avg_yds), linetype = "dashed", color = "black") +
# Add a label for the average line
geom_text(data = yearly_averages,
aes(x = 25, y = avg_yds, label = "Average Yds/Pass"), # Move text slightly left
color = "black", hjust = 0, vjust = -0.5, size = 2.5) + # Black text, smaller size
labs(
x = "Team",
y = "Average Yards per Pass",
title = "Top 50 Teams - Average Yards Per Pass by Year",
color = "" # Legend title for red dot
+
) facet_wrap(~Year, ncol = 2) + # Facet by year
theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text labels
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Add legend at the bottom
+
) scale_color_manual(values = c("Top Rank" = "red")) # Legend for red dot
# Create the faceted dot plot with average line styled like "everyone" graph
<- ggplot(all_years_data, aes(x = Team, y = AVG)) +
average_YPP # Plot all points as light grey
geom_point(color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Overlay orange points for "Top Rank"
geom_point(
data = filter(all_years_data, is_highlight),
aes(color = "Top Rank"), size = 1.5, position = position_dodge(width = 0.5)
+
)
# Add dashed line for the average
geom_hline(data = yearly_averages, aes(yintercept = avg_yds),
linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = 45, y = 7.25 + 0.5,
label = "Average", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to show only "Top Rank" in the legend
scale_color_manual(
values = c("Top Rank" = "orange"), # Orange for "Top Rank"
labels = c("Top Rank"), # Only "Top Rank" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Offensive Yards Per Pass by Year",
x = NULL, # Remove x-axis label
y = "Average Yards per Pass",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
average_YPP
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = Value, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_rushing, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Rushing YDS/Game",
y = NULL, # Remove y-axis title
title = "Rushing YDS/Game Per Teams"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
AVG_Yards_per_play aes(x = AVG, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yards, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Average Yards per Play",
y = NULL, # Remove y-axis title
title = "Average Yards/Pass by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
TotalTD aes(x = TD, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_td, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Total Touchdowns",
y = "Team", # Retain y-axis title and team names
title = "Total Touchdowns by Team"
+
) theme(
axis.text.y = element_text(size = 7), # Retain team names
legend.position = "none"
)
<- ggplot(filtered_data23,
yds_by_game aes(x = `YDS/G`, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yds_game, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Offensive Yards per Game",
y = NULL, # Remove y-axis title
title = "Offensive Yards/Game by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4, # Arrange plots in a single row
align = "h", # Align horizontally and vertically
axis = "tb", # Align axes
rel_widths = c(3.5, 2, 2, 2) # Equal widths for all plots
)
combined_plot
td_rank
everyone
defensive_dot_plot
defensive
graph1
offensive
avg_ypp_line_and_dot
average_YPP
library(rvest)
library(dplyr)
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23#View(data23)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data23)
#View(filtered_data23)
library(ggplot2)
<- c(
team_colors "Michigan Wolverines" = "blue",
"Alabama Crimson Tide" = "red",
"Texas Longhorns" = "darkorange",
"Washington Huskies" = "purple")
ggplot(filtered_data23, aes(x = Team, y = YDS, fill = Team)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
theme_minimal() +
labs(title = "YDS by Team", x = "Team", y = "Score")
# Load necessary packages
library(ggplot2)
library(dplyr)
$YDS <- gsub("[^0-9.-]", "", filtered_data23$YDS)
filtered_data23
$YDS <- as.numeric(filtered_data23$YDS)
filtered_data23#View(filtered_data23)
<- ggplot(filtered_data23, aes(x = reorder(Team, YDS), y = YDS, fill = Team)) + # Reorder Team by yards on x-axis
yds_by_team geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Team", y = "Total Offensive Yards", title = "Total Offensive Yards by Team in 2023 Season") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
theme(legend.position = "none")# Rotate x-axis labels for readability
yds_by_team
<- ggplot(filtered_data23, aes(x = reorder(Team, `YDS/G`), y = `YDS/G`, fill = Team)) + # Reorder Team by YDS/G
yds_by_game geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Offensive Yards Per Game", y = "Team", title = "Offensive Yards Per Game in 2023 Season") + # Adjust labels for flipped axes
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for readability on flipped axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
yds_by_game
<- ggplot(filtered_data23, aes(x = reorder(Team, TD), y = TD, fill = Team)) + # Reorder Team by TD
TotalTD geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "# of Touchdowns", y = "Team", title = "Number of Total Touchdowns") + # Adjust labels for flipped axes
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for readability on flipped axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
TotalTD
%>% arrange(SACK)
filtered_data23
<- ggplot(filtered_data23, aes(x = reorder(Team, `CMP%`), y = `CMP%`, fill = Team)) + # Reorder Team by CMP% on x-axis
completion_percentage geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Team", y = "Completion Percentage", title = "Completion Percentage by Team") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
legend.position = "none") # Remove legend
completion_percentage
<- ggplot(filtered_data23, aes(x = reorder(Team, AVG), y = AVG, fill = Team)) + # Reorder Team by AVG
AVG_Yards_per_play geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) +
labs(x = "Average Yards per Gain", y = "Team", title = "Average Yards Per Gain") + # Adjust labels after flipping
theme(axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for vertical axis
legend.position = "none") + # Remove legend
coord_flip() # Flip x and y axes
AVG_Yards_per_play
library(rvest)
library(ggplot2)
library(tidyr)
library(dplyr)
library(stringdist)
library(janitor)
library(stringr)
#install.packages("stringdist")
<- read_html(here::here('data_raw', 'defense_23.html')) %>%
football_defense_list html_table()
<- cbind(football_defense_list[[1]], football_defense_list[[2]])
defense2023
<- defense2023[1, ] # Extract the first row as new headers
new_headers <- defense2023[-1, ] # Remove the first row from the dataset
defense2023 names(defense2023) <- new_headers
rownames(defense2023) <- NULL
colnames(defense2023)[1:10] <- c(
'Team',
'Games Played',
'Total YDS',
'Total YDS/Game',
'Passing YDS',
'Passing YDS/Game',
'Rushing YDS',
'Rushing YDS/Game',
'Points',
'Points/Game'
)
<- defense2023 %>%
long_defense_data pivot_longer(cols = c('Games Played':'Points/Game'), # Specify the range using column names
names_to = "Metric",
values_to = "Value")
<- function(metric_name) {
create_bar_chart ggplot(filter(long_defense_data, Metric == metric_name), aes(x = Team, y = as.numeric(Value), fill = Team)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = paste("Comparison of", metric_name, "across Teams"),
x = "Teams",
y = metric_name) +
guides(fill = "none") # Use "none" instead of FALSE to remove the fill legend
}
# List of metrics for which we want to create bar charts
<- c('Games Played', 'Total YDS', 'Total YDS/Game', 'Passing YDS',
metrics 'Passing YDS/Game', 'Rushing YDS', 'Rushing YDS/Game', 'Points', 'Points/Game')
# Generate the bar charts for each metric
#for (metric in metrics) {
#print(create_bar_chart(metric))
#}
# Identify rows that can't be converted to numeric
<- long_defense_data %>%
problematic_rows filter(is.na(as.numeric(Value))) %>%
distinct()
# Display the problematic rows
#print(problematic_rows)
<- read_html(here::here('data_raw', 'defense_23.html')) %>%
football_defense_list html_table()
<- cbind(football_defense_list[[1]], football_defense_list[[2]])
defense2023
<- defense2023[1, ] # Extract the first row as new headers
new_headers <- defense2023[-1, ] # Remove the first row from the dataset
defense2023 names(defense2023) <- new_headers
rownames(defense2023) <- NULL
colnames(defense2023)[1:10] <- c(
'Team',
'Games Played',
'Total YDS',
'Total YDS/Game',
'Passing YDS',
'Passing YDS/Game',
'Rushing YDS',
'Rushing YDS/Game',
'Points',
'Points/Game'
)
<- defense2023 %>%
long_defense_data pivot_longer(cols = c('Games Played':'Points/Game'), # Specify the range using column names
names_to = "Metric",
values_to = "Value")
# Identify rows that can't be converted to numeric
<- long_defense_data %>%
problematic_rows filter(is.na(suppressWarnings(as.numeric(Value)))) %>% # This will filter rows that contain non-numeric values
distinct()
# View the problematic rows
#print(problematic_rows)
# Display the problematic rows
#print(problematic_rows)
# Remove commas from the Value column and convert to numeric
<- long_defense_data %>%
long_defense_data mutate(Value = as.numeric(gsub(",", "", Value)))
# Check for any remaining NA values
<- long_defense_data %>%
problematic_rows filter(is.na(Value)) %>%
distinct()
## View the problematic rows
#print(problematic_rows)
<- 2016:2023
years
<- data.frame()
all_data
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url
<- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})
if (!is.null(page)) {
<- page %>% html_node("table")
table
if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
<- all_data %>%
all_data select("RK",'Team', "REC","PTS","Year")
<- all_data %>%
top_teams group_by(Team) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 25)
write.csv(top_teams, "top_teams.csv", row.names = FALSE)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies"
)
<- long_defense_data %>%
long_defense_data mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
# Bar Charts
$Color <- ifelse(long_defense_data$Team %in%
long_defense_datac("Michigan Wolverines", "Washington Huskies", "Texas Longhorns", "Alabama Crimson Tide"),
"lightblue", "grey")
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = reorder(Team, -Value), y = Value, fill = Team)) + # Reorder Team by Value
geom_bar(stat = "identity") +
scale_fill_manual(values = team_colors) + # Use custom team colors
labs(
x = "Rushing YDS/Game",
y = "Team",
title = "Comparison of Rushing YDS/Game across Teams"
+
) theme(
axis.text.y = element_text(angle = 0, hjust = 1), # Adjust for flipped axes
legend.position = "none" # Remove the legend
+
) coord_flip() # Flip x and y axes
rushing_yds
AVG_Yards_per_play
TotalTD
yds_by_game
library(cowplot)
# Combine the four plots into a grid
<- plot_grid(
combined_plot # The plots to combine
rushing_yds, AVG_Yards_per_play, TotalTD, yds_by_game, ncol = 2, # Arrange plots in 2 columns
align = "v" # Align vertically within columns
)
# Display the combined plot
combined_plot
library(cowplot)
# Reduce text size for each plot
<- rushing_yds + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
rushing_yds <- AVG_Yards_per_play + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
AVG_Yards_per_play <- TotalTD + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
TotalTD <- yds_by_game + theme(plot.title = element_text(size = 10), axis.text = element_text(size = 5))
yds_by_game
# Combine all plots in a single row
<- plot_grid(
combined_plot
rushing_yds,
AVG_Yards_per_play,
TotalTD,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
# Define a consistent order for teams (e.g., alphabetical)
library(dplyr)
# Arrange teams by most TD and create consistent order
<- filtered_data23 %>%
consistent_team_order arrange(desc(TD)) %>% # Sort teams by TD in descending order
pull(Team) %>% # Extract the sorted team names
unique() # Ensure unique team names
# Apply this order to the 'Team' column
$Team <- factor(filtered_data23$Team, levels = consistent_team_order)
filtered_data23
# If using another dataset, apply the same factor levels
$Team <- factor(long_defense_data$Team, levels = consistent_team_order)
long_defense_data
# Apply this order to all datasets
$Team <- factor(filtered_data23$Team, levels = consistent_team_order)
filtered_data23$Team <- factor(long_defense_data$Team, levels = consistent_team_order)
long_defense_data
<- mean(filter(long_defense_data, Metric == 'Rushing YDS/Game')$Value, na.rm = TRUE)
avg_rushing
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = Value, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_rushing, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) + # Use custom team colors
labs(
x = "Rushing YDS/Game",
y = "Team",
title = "Comparison of Rushing YDS/Game across Teams"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none" # Remove the legend
)<- mean(filtered_data23$AVG, na.rm = TRUE)
avg_yards
<- ggplot(filtered_data23,
AVG_Yards_per_play aes(x = AVG, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yards, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Average Yards per Play",
y = "Team",
title = "Average Yards Per Play by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)<- mean(filtered_data23$TD, na.rm = TRUE)
avg_td
<- ggplot(filtered_data23,
TotalTD aes(x = TD, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_td, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Total Touchdowns",
y = "Team",
title = "Total Touchdowns by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)
<- mean(filtered_data23$`YDS/G`, na.rm = TRUE)
avg_yds_game
<- ggplot(filtered_data23,
yds_by_game aes(x = `YDS/G`, y = Team, fill = Team)) + # Teams aligned on y-axis
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yds_game, linetype = "dashed", color = "black", size = 0.75) + # Add average line
scale_fill_manual(values = team_colors) +
labs(
x = "Offensive Yards per Game",
y = "Team",
title = "Offensive Yards per Game by Team"
+
) theme(
axis.text.y = element_text(size = 8),
legend.position = "none"
)
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
<- ggplot(filter(long_defense_data, Metric == 'Rushing YDS/Game'),
rushing_yds aes(x = Value, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_rushing, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Rushing YDS/Game",
y = NULL, # Remove y-axis title
title = "Rushing YDS/Game Per Teams"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
AVG_Yards_per_play aes(x = AVG, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yards, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Average Yards per Play",
y = NULL, # Remove y-axis title
title = "Average Yards/Pass by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- ggplot(filtered_data23,
TotalTD aes(x = TD, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_td, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Total Touchdowns",
y = "Team", # Retain y-axis title and team names
title = "Total Touchdowns by Team"
+
) theme(
axis.text.y = element_text(size = 10), # Retain team names
legend.position = "none"
)
<- ggplot(filtered_data23,
yds_by_game aes(x = `YDS/G`, y = Team, fill = Team)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = avg_yds_game, linetype = "dashed", color = "black", size = 0.75) +
scale_fill_manual(values = team_colors) +
labs(
x = "Offensive Yards per Game",
y = NULL, # Remove y-axis title
title = "Offensive Yards/Game by Team"
+
) theme(
axis.text.y = element_blank(), # Remove team names
legend.position = "none"
)
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4 # Arrange plots in a single row
)
# Display the combined plot
combined_plot
<- plot_grid(
combined_plot
TotalTD,
rushing_yds,
AVG_Yards_per_play,
yds_by_game,ncol = 4, # Arrange plots in a single row
align = "h", # Align horizontally and vertically
axis = "tb", # Align axes
rel_widths = c(5, 2, 2, 2) # Equal widths for all plots
)
# Display the combined plot
combined_plot
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(tidyverse)
library(rvest)
library(here)
library(cowplot)
<- c(14,15,16,17,18,19,20,21,22,23)
years
<- function(start_year, end_year) {
scrape_clean # Initialize an empty list to store yearly data
<- list()
all_data
# Loop over the range of years
for (year in start_year:end_year) {
# Generate the file path
<- here("data_raw", paste0("defense_yards_", year, ".html"))
file_path
# Read the HTML file and extract tables
<- read_html(file_path) %>%
defense_yards html_table(fill = TRUE)
# Combine the first two tables (assuming they are in the desired format)
<- cbind(defense_yards[[1]], defense_yards[[2]])
combined_table
# Add a column for the year
$Year <- year
combined_table
# Append to the list
as.character(year)]] <- combined_table
all_data[[
}
# Combine all the yearly tables into one data frame
<- bind_rows(all_data)
combined_data_def
return(combined_data_def)
}
# Example usage
<- 14
start_year <- 23
end_year
<- scrape_clean(start_year, end_year)
combined_defense_data
# View the combined data
<- combined_defense_data[1, ]
new_headers <- combined_defense_data[-1, ]
combined_offense_data names(combined_defense_data) <- new_headers
<- combined_defense_data[, c(1,8,11)] %>%
cleaned_defense rename(Team = 1, Yards_Game = 2, Year = 3) %>%
mutate(Yards_Game = as.numeric(gsub(",", "", Yards_Game)))
#or (1 in years) {
$Year <- cleaned_defense$Year + 2000
cleaned_defenseglimpse(cleaned_defense)
cleaned_defense #unique(all_data$Team)
<- data.frame()
all_data
<- 2014:2023
years # Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
} View(all_data)
# all data is a smaller data frame consisted of
# rankings of top 25 teams every year
# edit offense data frame to also include RK information form all_data.
# Create a mapping of old names to new names
<- c(
name_mapping "OSU Ohio State" = "OSU",
"OSU Ohio State " = "OSU",
"UCF UCF " = "UCF",
"ORE Oregon" = "ORE",
"TCU TCU" = "TCU",
"ALA Alabama" = "ALA",
"ALA Alabama " = "ALA",
"FSU Florida State" = "FSU",
"MSU Michigan State" = "MSU",
"BAY Baylor" = "BAY",
"GT Georgia Tech" = "GT",
"UGA Georgia" = "UGA",
"UGA Georgia " = "UGA",
"UCLA UCLA" = "UCLA",
"MSST Mississippi State" = "MSST",
"ASU Arizona State" = "ASU",
"WIS Wisconsin" = "WIS",
"MIZ Missouri" = "MIZ",
"CLEM Clemson" = "CLEM",
"CLEM Clemson " = "CLEM",
"BSU Boise State" = "BSU",
"MISS Ole Miss" = "MISS",
"KSU Kansas State" = "KSU",
"ARIZ Arizona" = "ARIZ",
"USC USC" = "USC",
"UTAH Utah" = "UTAH",
"AUB Auburn" = "AUB",
"MRSH Marshall" = "MRSH",
"LOU Louisville" = "LOU",
"MEM Memphis" = "MEM",
"STAN Stanford" = "STAN",
"OKLA Oklahoma" = "OKLA",
"HOU Houston" = "HOU",
"IOWA Iowa" = "IOWA",
"ND Notre Dame" = "ND",
"MICH Michigan" = "MICH",
"MICH Michigan " = "MICH",
"UNC North Carolina" = "UNC",
"LSU LSU" = "LSU",
"LSU LSU " = "LSU",
"NAVY Navy" = "NAVY",
"OKST Oklahoma State" = "OKST",
"TENN Tennessee" = "TENN",
"NW Northwestern" = "NW",
"WKU Western Kentucky" = "WKU",
"FLA Florida" = "FLA",
"WASH Washington" = "WASH",
"PSU Penn State" = "PSU",
"VT Virginia Tech" = "VT",
"COLO Colorado" = "COLO",
"WVU West Virginia" = "WVU",
"USF South Florida" = "USF",
"MIAMI Miami" = "MIAMI",
"SDSU San Diego State" = "SDSU",
"NCST NC State" = "NCST",
"TEX Texas" = "TEX",
"WSU Washington State" = "WSU",
"UK Kentucky" = "UK",
"SYR Syracuse" = "SYR",
"TA&M Texas A&M" = "TA&M",
"FRES Fresno State" = "FRES",
"ARMY Army" = "ARMY",
"USU Utah State" = "USU",
"CIN Cincinnati" = "CIN",
"MINN Minnesota" = "MINN",
"APP Appalachian State" = "APP",
"AFA Air Force" = "AFA",
"OU Oklahoma" = "OU",
"ISU Iowa State" = "ISU",
"BYU BYU" = "BYU",
"IU Indiana" = "IU",
"CCU Coastal Carolina" = "CCU",
"UL Louisiana" = "UL",
"LIB Liberty" = "LIB",
"BALL Ball State" = "BALL",
"SJSU San José State" = "SJSU",
"BUFF Buffalo" = "BUFF",
"PITT Pittsburgh" = "PITT",
"WAKE Wake Forest" = "WAKE",
"ARK Arkansas" = "ARK",
"TULN Tulane" = "TULN",
"ORST Oregon State" = "ORST",
"TROY Troy" = "TROY",
"SC South Carolina" = "SC",
"SMU SMU" = "SMU",
"KU Kansas" = "KU"
)
# Update the 'Team' column using the mapping
# Remove numbers and parentheses
<- all_data %>%
all_data mutate(
Team = gsub("[0-9()]", "", Team)
)
<- all_data %>%
all_data mutate(Team = if_else(Team %in% names(name_mapping),
name_mapping[Team], Team))
<- all_data %>%
combined_data_def inner_join( cleaned_defense , by = c("Team", "Year"))
# final combined datat frame for the past 9 yeard
glimpse(combined_data_def)
<- combined_data_def[,c(1,2,8,9)]
combined_data_def
<- combined_data_def %>%
combined_data_def group_by(Year) %>% # Group by Year so that ranking is done per year
mutate(yards_rank = rank(Yards_Game, ties.method = "min")) %>% # Rank within each year
ungroup()
head(combined_data_def)
<- combined_data_def %>%
combined_data_deffilter(Year %in% c(2014, 2021, 2023))
<- ggplot(combined_data_def %>%
defensive_dot_plot filter(RK == 1), # Filter for top-ranked teams
aes(x = Year, y = yards_rank)) + # Mapping Year to x and yards_rank to y
geom_point(color = "purple") + # Uniform color for points
geom_line(color = "black", size = 0.25) + # Uniform color for lines
scale_y_reverse() + # Reverse y-axis
labs(
title = "Do #1 Teams Excel in Defensive Yards/Game?",
x = "Year",
y = "Defensive Yards Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
defensive_dot_plot
defensive_dot_plot
## plot two
<- cleaned_defense %>%
combined_data_def_2 left_join(all_data, by = c("Team", "Year"))
glimpse(combined_data_def_2)
#View(combined_data)
<- combined_data_def_2[,c(1,2,3,4)]
combined_data_def_2 glimpse(combined_data_def_2)
<- combined_data_def_2 %>%
combined_data_def_2filter(Year %in% c(2014, 2021, 2023))
<- ggplot(data = combined_data_def_2,
defensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_data_def_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
geom_hline(data = combined_data_def_2 %>%
group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE)),
aes(yintercept = avg_yards),
linetype = "dashed", color = "black",size = 0.75) + # Add dotted line for the average Yards_Game
labs(
title = "Defensive Yards Per Game by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "purple"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
defensive
library(rvest)
library(dplyr)
####### 2023 ###################################################################
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23View(data23)
#iew(data23)
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data23)
#View(filtered_data23)
library(ggplot2)
<- c(
team_colors "Michigan Wolverines" = "blue",
"Alabama Crimson Tide" = "red",
"Texas Longhorns" = "orange",
"Washington Huskies" = "purple")
library(dplyr)
# Rank teams in df23 based on the number of TDs
<- data23 %>%
data23 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2023)
<- data23 %>%
average_TD_23 group_by(Year) %>%
summarize(Average_Touchdowns = mean(TD, na.rm = TRUE))
#View(data23)
####### 2022 ###################################################################
<- read_html(here::here('data_raw', '2022_passing_offense.html')) %>%
data_list22 html_table()
<- cbind(data_list22[[1]], data_list22[[2]])
data22#View(data22)
<- c(
top25 "OHIO",
"ND",
"ALA",
"CLEM",
"IOWA",
"LSU",
"MICH",
"OKST",
"TEX",
"UGA",
"WASH",
"CIN",
"FLA",
"OU",
"ORE",
"USC",
"UTAH",
"AUB",
"BOIS",
"FSU",
"MIA",
"NU",
"TENN",
"PSU",
"TA&M")
<- data22 %>%
filtered_data22 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data22)
<- c(
team_lookup "OHIO" = "Ohio State Buckeyes",
"ND" = "Notre Dame Fighting Irish",
"ALA" = "Alabama Crimson Tide",
"CLEM" = "Clemson Tigers",
"IOWA" = "Iowa Hawkeyes",
"LSU" = "LSU Tigers",
"MICH" = "Michigan Wolverines",
"OKST" = "Oklahoma State Cowboys",
"TEX" = "Texas Longhorns",
"UGA" = "Georgia Bulldogs",
"WASH" = "Washington Huskies",
"CIN" = "Cincinnati Bearcats",
"FLA" = "Florida Gators",
"OU" = "Oklahoma Sooners",
"ORE" = "Oregon Ducks",
"USC" = "USC Trojans",
"UTAH" = "Utah Utes",
"AUB" = "Auburn Tigers",
"BOIS" = "Boise State Broncos",
"FSU" = "Florida State Seminoles",
"MIA" = "Miami Hurricanes",
"NU" = "Northwestern Wildcats",
"TENN" = "Tennessee Volunteers",
"PSU" = "Penn State Nittany Lions",
"TA&M" = "Texas A&M Aggies"
)
# Rename teams in the 'Team' column of the dataset
<- filtered_data22 %>%
filtered_data22 mutate(Team = recode(Team, !!!team_lookup))
#View(filtered_data22)
<- data22 %>%
data22 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2022)
#View(df22)
####### 2021 ###################################################################
<- read_html(here::here('data_raw', '2021_passing_offense.html')) %>%
data_list21 html_table()
<- cbind(data_list21[[1]], data_list21[[2]])
data21#View(data21)
<- c(
top25 "OHIO",
"ND",
"ALA",
"CLEM",
"IOWA",
"LSU",
"MICH",
"OKST",
"TEX",
"UGA",
"WASH",
"CIN",
"FLA",
"OU",
"ORE",
"USC",
"UTAH",
"AUB",
"BOIS",
"FSU",
"MIA",
"NU",
"TENN",
"PSU",
"TA&M")
<- data21 %>%
filtered_data21 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
#View(filtered_data21)
<- c(
team_lookup "OHIO" = "Ohio State Buckeyes",
"ND" = "Notre Dame Fighting Irish",
"ALA" = "Alabama Crimson Tide",
"CLEM" = "Clemson Tigers",
"IOWA" = "Iowa Hawkeyes",
"LSU" = "LSU Tigers",
"MICH" = "Michigan Wolverines",
"OKST" = "Oklahoma State Cowboys",
"TEX" = "Texas Longhorns",
"UGA" = "Georgia Bulldogs",
"WASH" = "Washington Huskies",
"CIN" = "Cincinnati Bearcats",
"FLA" = "Florida Gators",
"OU" = "Oklahoma Sooners",
"ORE" = "Oregon Ducks",
"USC" = "USC Trojans",
"UTAH" = "Utah Utes",
"AUB" = "Auburn Tigers",
"BOIS" = "Boise State Broncos",
"FSU" = "Florida State Seminoles",
"MIA" = "Miami Hurricanes",
"NU" = "Northwestern Wildcats",
"TENN" = "Tennessee Volunteers",
"PSU" = "Penn State Nittany Lions",
"TA&M" = "Texas A&M Aggies"
)
# Rename teams in the 'Team' column of the dataset
<- filtered_data21 %>%
filtered_data21 mutate(Team = recode(Team, !!!team_lookup))
#View(filtered_data21)
<- data21 %>%
data21 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2021)
####### 2020 ###################################################################
<- read_html(here::here('data_raw', '2020_passing_offense.html')) %>%
data_list20 html_table()
<- cbind(data_list20[[1]], data_list20[[2]])
data20#View(data20)
<- data20 %>%
data20 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD)))%>%
mutate(Year = 2020)
####### 2019 ###################################################################
<- read_html(here::here('data_raw', '2019_passing_offense.html')) %>%
data_list19 html_table()
<- cbind(data_list19[[1]], data_list19[[2]])
data19#View(data19)
<- data19 %>%
data19 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2019)
####### 2018 ###################################################################
<- read_html(here::here('data_raw', '2018_passing_offense.html')) %>%
data_list18 html_table()
<- cbind(data_list18[[1]], data_list18[[2]])
data18#View(data18)
<- data18 %>%
data18 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD)))%>%
mutate(Year = 2018)
####### 2017 ###################################################################
<- read_html(here::here('data_raw', '2017_passing_offense.html')) %>%
data_list17 html_table()
<- cbind(data_list17[[1]], data_list17[[2]])
data17#View(data17)
<- data17 %>%
data17 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2017)
#View(data17)
####### 2016 ###################################################################
<- read_html(here::here('data_raw', '2016_passing_offense.html')) %>%
data_list16 html_table()
<- cbind(data_list16[[1]], data_list16[[2]])
data16#View(data16)
<- data16 %>%
data16 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2016)
####### 2015 ###################################################################
<- read_html(here::here('data_raw', '2015_passing_offense.html')) %>%
data_list15 html_table()
<- cbind(data_list15[[1]], data_list15[[2]])
data15#View(data15)
<- data15 %>%
data15 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2015)
####### 2014 ###################################################################
<- read_html(here::here('data_raw', '2014_passing_offense.html')) %>%
data_list14 html_table()
<- cbind(data_list14[[1]], data_list14[[2]])
data14#View(data14)
<- data14 %>%
data14 arrange(desc(TD)) %>% # Arrange teams in descending order of TDs
mutate(Rank = dense_rank(desc(TD))) %>%
mutate(Year = 2014)
####### BUILDING THE DATA SET ###################################################################
<- data23 %>%
df23 filter(Team == "Michigan Wolverines")
<- data22 %>%
df22 filter(Team == "UGA")
<- data21 %>%
df21 filter(Team == "UGA")
<- data20 %>%
df20 filter(Team == "ALA")
<- data19 %>%
df19 filter(Team == "LSU")
<- data18 %>%
df18 filter(Team == "CLEM")
<- data17 %>%
df17 filter(Team == "ALA")
<- data16 %>%
df16 filter(Team == "CLEM")
<- data15 %>%
df15 filter(Team == "ALA")
<- data14 %>%
df14 filter(Team == "ALA")
# Combine all data sets into one
<- bind_rows(df23, df22, df21, df20, df19, df18, df17, df16, df15, df14)
combined_data
#View(combined_data)
library(ggplot2)
<- ggplot(combined_data, aes(x = Year, y = Rank)) +
td_rank # Black line connecting all points
geom_line(color = "black", linewidth = 0.25) +
# Uniform points for each year
geom_point(color = "red", size = 2) +
# Reverse y-axis to show Rank 1 at the top
scale_y_reverse() +
labs(
title = "Do #1 Teams Excel in Total Number of Touchdowns?",
x = "Year",
y = "Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
td_rank
<- ggplot(combined_data, aes(x = Year, y = TD)) +
touchdowns geom_line(color = "black", linewidth = 0.25) + # Black line connecting all points
geom_point(aes(color = Team), size = 3) + # Points for each year and team with color
# Reverse y-axis to show Rank 1 at the top
labs(
title = "#1 Team Number of Touchdowns Over the Years",
x = "Year",
y = "Touchdowns",
color = "Team"
+
) theme_minimal() +
theme(
legend.position = "bottom"
)
td_rank
touchdowns
<- bind_rows(data23, data22, data21, data20, data19, data18, data17, data16, data15, data14)
combined_whole_data #View(combined_whole_data)
library(dplyr)
<- combined_whole_data %>%
combined_whole_data mutate(row_key = do.call(paste, c(., sep = "_")))
#View(combined_whole_data)
<- combined_data %>%
combined_data mutate(row_key = do.call(paste, c(., sep = "_")))
# Check presence and assign color code
<- combined_whole_data %>%
combined_whole_data mutate(
present_in_combined_data = row_key %in% combined_data$row_key,
color_code = ifelse(present_in_combined_data, "red", "lightgrey")
)
# Reorder the data so red points are plotted last
<- combined_whole_data %>%
combined_whole_data arrange(color_code == "red")
#View(combined_whole_data)
# Calculate the overall average touchdowns
<- mean(combined_whole_data$TD, na.rm = TRUE)
average_TD
# Create the plot with an average line and label
<- ggplot(combined_whole_data, aes(x = Team, y = TD, color = color_code)) +
everyone geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) + # Add average line
annotate(
"text", x = "ALA", y = 15 + 0.5,
label = "Average TDs", color = "black", size = 2.5, hjust = 0
+ # Add label for the average line
) scale_color_manual(values = c("red" = "red", "lightgrey" = "lightgrey")) +
theme_minimal() +
labs(
title = "Total Touchdowns by Team (Split by Year)",
x = "Team",
y = "Total Touchdowns",
color = "Presence in Combined Data"
+
) theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Remove the legend
+
) facet_wrap(~ Year) # Use fixed y-axis for all facets
#
everyone
touchdowns
td_rank
<- ggplot(combined_whole_data, aes(x = Team, y = TD, color = color_code)) +
everyone # Plot points with slight dodge for clarity
geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for the average
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = "ALA", y = 15 + 0.5,
label = "Average TDs", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to only include red in the legend
scale_color_manual(
values = c("red" = "red"),
labels = c("red" = "Top Teams"), # Only keep red in the legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Total Touchdowns by Team (Split by Year)",
x = NULL, # Remove x-axis label
y = "Total Touchdowns",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
everyone
<- ggplot(combined_whole_data, aes(x = Team, y = TD)) +
everyone # Plot all points as light grey
geom_point(color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Overlay red points for "Top Teams"
geom_point(
data = combined_whole_data %>% filter(color_code == "red"),
aes(color = "Top Teams"), size = 1.5, position = position_dodge(width = 0.5)
+
)
# Add dashed line for the average
geom_hline(yintercept = average_TD, linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = "LSU", y = 15 + 0.5,
label = "Average", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to show only "Top Teams" in the legend
scale_color_manual(
values = c("Top Teams" = "red"), # Red for "Top Teams"
labels = c("Top Teams"), # Only "Top Teams" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Total Touchdowns by Year",
x = NULL, # Remove x-axis label
y = "Total Touchdowns",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
everyone
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(tidyverse)
library(rvest)
library(here)
library(cowplot)
<- c(14,15,16,17,18,19,20,21,22,23)
years
<- function(start_year, end_year) {
scrape_clean # Initialize an empty list to store yearly data
<- list()
all_data
# Loop over the range of years
for (year in start_year:end_year) {
# Generate the file path
<- here("data_raw", paste0("offense_yards_", year, ".html"))
file_path
# Read the HTML file and extract tables
<- read_html(file_path) %>%
offense_yards html_table(fill = TRUE)
# Combine the first two tables (assuming they are in the desired format)
<- cbind(offense_yards[[1]], offense_yards[[2]])
combined_table
# Add a column for the year
$Year <- year
combined_table
# Append to the list
as.character(year)]] <- combined_table
all_data[[
}
# Combine all the yearly tables into one data frame
<- bind_rows(all_data)
combined_data_def
return(combined_data_def)
}
# Example usage
<- 14
start_year <- 23
end_year
<- scrape_clean(start_year, end_year)
combined_offense_data
# View the combined data
<- combined_offense_data[1, ]
new_headers <- combined_offense_data[-1, ]
combined_offense_data names(combined_offense_data) <- new_headers
<- combined_offense_data[, c(1,8,11)] %>%
cleaned_offense rename(Team = 1, Yards_Game = 2, Year = 3) %>%
mutate(Yards_Game = as.numeric(gsub(",", "", Yards_Game)))
#or (1 in years) {
$Year <- cleaned_offense$Year + 2000
cleaned_offenseglimpse(cleaned_offense)
<- data.frame()
all_data
<- 2014:2023
years # Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
# all data is a smaller data frame consisted of
# rankings of top 25 teams every year
# edit offense data frame to also include RK information form all_data.
# Create a mapping of old names to new names
<- c(
name_mapping "OSU Ohio State" = "OSU",
"OSU Ohio State " = "OSU",
"UCF UCF " = "UCF",
"ORE Oregon" = "ORE",
"TCU TCU" = "TCU",
"ALA Alabama" = "ALA",
"ALA Alabama " = "ALA",
"FSU Florida State" = "FSU",
"MSU Michigan State" = "MSU",
"BAY Baylor" = "BAY",
"GT Georgia Tech" = "GT",
"UGA Georgia" = "UGA",
"UGA Georgia " = "UGA",
"UCLA UCLA" = "UCLA",
"MSST Mississippi State" = "MSST",
"ASU Arizona State" = "ASU",
"WIS Wisconsin" = "WIS",
"MIZ Missouri" = "MIZ",
"CLEM Clemson" = "CLEM",
"CLEM Clemson " = "CLEM",
"BSU Boise State" = "BSU",
"MISS Ole Miss" = "MISS",
"KSU Kansas State" = "KSU",
"ARIZ Arizona" = "ARIZ",
"USC USC" = "USC",
"UTAH Utah" = "UTAH",
"AUB Auburn" = "AUB",
"MRSH Marshall" = "MRSH",
"LOU Louisville" = "LOU",
"MEM Memphis" = "MEM",
"STAN Stanford" = "STAN",
"OKLA Oklahoma" = "OKLA",
"HOU Houston" = "HOU",
"IOWA Iowa" = "IOWA",
"ND Notre Dame" = "ND",
"MICH Michigan" = "MICH",
"MICH Michigan " = "MICH",
"UNC North Carolina" = "UNC",
"LSU LSU" = "LSU",
"LSU LSU " = "LSU",
"NAVY Navy" = "NAVY",
"OKST Oklahoma State" = "OKST",
"TENN Tennessee" = "TENN",
"NW Northwestern" = "NW",
"WKU Western Kentucky" = "WKU",
"FLA Florida" = "FLA",
"WASH Washington" = "WASH",
"PSU Penn State" = "PSU",
"VT Virginia Tech" = "VT",
"COLO Colorado" = "COLO",
"WVU West Virginia" = "WVU",
"USF South Florida" = "USF",
"MIAMI Miami" = "MIAMI",
"SDSU San Diego State" = "SDSU",
"NCST NC State" = "NCST",
"TEX Texas" = "TEX",
"WSU Washington State" = "WSU",
"UK Kentucky" = "UK",
"SYR Syracuse" = "SYR",
"TA&M Texas A&M" = "TA&M",
"FRES Fresno State" = "FRES",
"ARMY Army" = "ARMY",
"USU Utah State" = "USU",
"CIN Cincinnati" = "CIN",
"MINN Minnesota" = "MINN",
"APP Appalachian State" = "APP",
"AFA Air Force" = "AFA",
"OU Oklahoma" = "OU",
"ISU Iowa State" = "ISU",
"BYU BYU" = "BYU",
"IU Indiana" = "IU",
"CCU Coastal Carolina" = "CCU",
"UL Louisiana" = "UL",
"LIB Liberty" = "LIB",
"BALL Ball State" = "BALL",
"SJSU San José State" = "SJSU",
"BUFF Buffalo" = "BUFF",
"PITT Pittsburgh" = "PITT",
"WAKE Wake Forest" = "WAKE",
"ARK Arkansas" = "ARK",
"TULN Tulane" = "TULN",
"ORST Oregon State" = "ORST",
"TROY Troy" = "TROY",
"SC South Carolina" = "SC",
"SMU SMU" = "SMU",
"KU Kansas" = "KU"
)
# Update the 'Team' column using the mapping
# Remove numbers and parentheses
<- all_data %>%
all_data mutate(
Team = gsub("[0-9()]", "", Team)
)
<- all_data %>%
all_data mutate(Team = if_else(Team %in% names(name_mapping),
name_mapping[Team], Team))
<- all_data %>%
combined_offense_data inner_join( cleaned_offense , by = c("Team", "Year"))
# final combined datat frame for the past 9 yeard
glimpse(combined_offense_data)
<- combined_offense_data[,c(1,2,8,9)]
combined_offense_data
<- combined_offense_data %>%
combined_offense_data_2 group_by(Year) %>% # Group by Year so that ranking is done per year
mutate(yards_rank = rank(Yards_Game, ties.method = "min")) %>% # Rank within each year
ungroup()
## plot 1
<- ggplot(combined_offense_data_2 %>%
graph1 filter(RK == 1), # Filter for top-ranked teams
aes(x = Year, y = yards_rank)) + # Mapping Year to x and yards_rank to y
geom_point(color = "blue", size = 2) + # Fixed color for points
geom_line(color = "black", size = 0.25) + # Fixed color for lines
scale_y_reverse() + # Reverse y-axis
labs(
title = "Do #1 Teams Excel in Offensive Yards per Game?",
x = "Year",
y = "Offensive Yards Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
graph1
## plot two
<- cleaned_offense %>%
combined_offense_data_2 left_join(all_data, by = c("Team", "Year"))
<- combined_offense_data_2[,c(1,2,3,4)]
combined_offense_data_2 ## graph 2
<- ggplot(data = combined_offense_data_2,
offensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
geom_hline(data = combined_offense_data_2 %>%
group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE)),
aes(yintercept = avg_yards),
linetype = "dashed", color = "black") + # Add dotted line for the average Yards_Game
labs(
title = "Offensive Yards Per Game by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "blue"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
library(ggplot2)
library(dplyr)
# Compute yearly averages
<- combined_offense_data_2 %>%
yearly_avg group_by(Year) %>%
summarise(avg_yards = mean(Yards_Game, na.rm = TRUE))
# Create the plot
<- ggplot(data = combined_offense_data_2,
offensive aes(x = Team, y = Yards_Game,
fill = case_when(
== 1 ~ "Top Rank", # Only color "Top Rank" for RK == 1
RK TRUE ~ NA_character_ # Don't assign fill for other ranks
+
))) # Plot all points (background points with light grey color)
geom_point(size = 1, shape = 21, color = "lightgrey", fill = "lightgrey") +
# Plot the blue points for RK == 1 (on the front layer, larger size)
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(fill = "Top Rank"), size = 2, shape = 21) + # Blue dots, larger size
facet_wrap(~ Year, scales = "free_x") +
# Add dotted line for the average Yards_Game
geom_hline(data = yearly_avg,
aes(yintercept = avg_yards),
linetype = "dashed", color = "black") +
# Add "Average" text label
geom_text(data = yearly_avg,
aes(x = Inf, y = 150, label = "Average"),
inherit.aes = FALSE,
hjust = 1.1, # Adjust horizontal alignment
color = "black",
size = 3) +
labs(
title = "Offensive Yards Per Game by Team, Faceted by Year",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
fill = "Legend" # Legend title
+
) scale_fill_manual(
values = c("Top Rank" = "blue"), # Only show blue for "Top Rank"
na.value = "transparent" # Make the other fill values invisible in the legend
+
) theme_minimal() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_blank(),
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10),
aspect.ratio = .7
)
# Display the plot
offensive
<- ggplot(combined_offense_data_2, aes(x = Team, y = Yards_Game, color = case_when(
offensive == 1 ~ "Top Rank", # Highlight "Top Rank" teams
RK TRUE ~ "Other Teams" # All other teams
+
))) # Plot all points
geom_point(size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for average Yards/Game
geom_hline(data = yearly_avg, aes(yintercept = avg_yards),
linetype = "dashed", color = "black", size = 0.5) +
# Add text annotation for the average line
geom_text(data = yearly_avg, aes(x = -Inf, y = avg_yards, label = "Average"),
inherit.aes = FALSE, hjust = -0.1, vjust = -0.5,
color = "black", size = 2.5) +
# Manual color scale
scale_color_manual(
values = c("Top Rank" = "blue", "Other Teams" = "lightgrey")
+
)
# Facet by Year
facet_wrap(~ Year, scales = "free_x") +
# Labels and theme adjustments
labs(
title = "Offensive Yards Per Game by Team (Split by Year)",
x = "Team",
y = "Yards Per Game",
color = "Rank"
+
) theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
panel.grid.major.x = element_blank(), # Remove major gridlines
panel.grid.minor.x = element_blank(), # Remove minor gridlines
legend.position = "none", # Remove legend
strip.text = element_text(size = 10) # Adjust facet label text size
)
# Display the plot
offensive
<- ggplot() +
offensive # Plot all points (background points with light grey color)
geom_point(data = combined_offense_data_2,
aes(x = Team, y = Yards_Game),
color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Plot blue points for "Top Rank" on top
geom_point(data = combined_offense_data_2 %>% filter(RK == 1),
aes(x = Team, y = Yards_Game, color = "Top Rank"),
size = 1.5, position = position_dodge(width = 0.5)) +
# Add dashed line for average Yards/Game
geom_hline(data = yearly_avg, aes(yintercept = avg_yards),
linetype = "dashed", color = "black", size = 0.5) +
# Add text annotation for the average line
geom_text(data = yearly_avg, aes(x = 35, y = avg_yards, label = "Average"),
inherit.aes = FALSE, hjust = -0.1, vjust = -0.5,
color = "black", size = 2.5) +
# Manual color scale to include only "Top Rank" in the legend
scale_color_manual(
values = c("Top Rank" = "blue"), # Blue for "Top Rank"
labels = c("Top Rank"), # Only keep "Top Rank" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Facet by Year
facet_wrap(~ Year, scales = "free_x") +
# Labels and theme adjustments
labs(
title = "Offensive Yards Per Game by Team",
x = NULL, # Remove x-axis label
y = "Yards Per Game",
color = "Rank"
+
) theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
panel.grid.major.x = element_blank(), # Remove major gridlines
panel.grid.minor.x = element_blank(), # Remove minor gridlines
legend.position = "bottom", # Place legend at the bottom
strip.text = element_text(size = 10) # Adjust facet label text size
)
# Display the plot
offensive
library(rvest)
library(tidyr)
library(dplyr)
library(stringdist)
library(janitor)
library(stringr)
library(ggplot2)
library(gganimate)
library(magick)
library(here)
###### GATHERING THE DATA #######
<- read_html(here::here('data_raw', '2023_passing_offense.html')) %>%
data_list23 html_table()
<- cbind(data_list23[[1]], data_list23[[2]])
data23
<- c(
top25 "Ohio State Buckeyes",
"Notre Dame Fighting Irish",
"Alabama Crimson Tide",
"Clemson Tigers",
"Iowa Hawkeyes",
"LSU Tigers",
"Michigan Wolverines",
"Oklahoma State Cowboys",
"Texas Longhorns",
"Georgia Bulldogs",
"Washington Huskies",
"Cincinnati Bearcats",
"Florida Gators",
"Oklahoma Sooners",
"Oregon Ducks",
"USC Trojans",
"Utah Utes",
"Auburn Tigers",
"Boise State Broncos",
"Florida State Seminoles",
"Miami Hurricanes",
"Northwestern Wildcats",
"Oklahoma Sooners",
"Tennessee Volunteers",
"Penn State Nittany Lions",
"Texas A&M Aggies")
<- data23 %>%
filtered_data23 mutate(InTop25 = Team %in% top25) %>%
filter(InTop25 == TRUE)
<- filtered_data23 %>%
df2 arrange(YDS)
<- 2014:2023
years <- data.frame()
all_data
# Read in data for each year
for (year in years) {
<- paste0("https://www.espn.com/college-football/rankings/_/poll/1/week/1/year/", year, "/seasontype/3")
url <- tryCatch({
page read_html(url)
error = function(e) {
}, message(paste("Error reading page for year:", year))
return(NULL)
})if (!is.null(page)) {
<- page %>% html_node("table")
table if (!is.null(table)) {
<- table %>%
data html_table(fill = TRUE)
$Year <- year
data<- bind_rows(all_data, data)
all_data else {
} message(paste("No table found for year:", year))
}
}
}
# Select only the columns we need
<- all_data %>%
all_data select("RK",'Team', "REC","PTS","Year")
<- all_data %>% filter(RK == 1)
ranked_1
###### DOT AND LINE GRAPH #######
<- function(year) {
extract_avg_ypp # Construct the file name for the HTML file
<- paste0(year, "_avg_yds_per_play.html")
file_name
# Load the data
<- read_html(here::here('data_raw', file_name)) %>%
avg_ypp html_table()
<- as.data.frame(avg_ypp) %>%
avg_ypp select("Team", "AVG") %>%
arrange(desc(AVG))
# Save the data frame globally
assign("avg_ypp", avg_ypp, envir = .GlobalEnv)
# Return the data frame as well
return(avg_ypp)
}
# Loop through years 2014 to 2023
for (year in 2014:2023) {
assign(paste0("avg_ypp_", year), extract_avg_ypp(year))
}
# Assuming avg_ypp_2014, avg_ypp_2015, ..., avg_ypp_2023 are defined
<- list(
avg_ypp_list "2014" = avg_ypp_2014,
"2015" = avg_ypp_2015,
"2016" = avg_ypp_2016,
"2017" = avg_ypp_2017,
"2018" = avg_ypp_2018,
"2019" = avg_ypp_2019,
"2020" = avg_ypp_2020,
"2021" = avg_ypp_2021,
"2022" = avg_ypp_2022,
"2023" = avg_ypp_2023
)
<- all_data %>%
combined_data filter(RK <= 1) %>% # Filter for top 10 teams
mutate(Year = as.numeric(Year)) # Ensure Year is numeric
<- combined_data %>%
combined_data rowwise() %>%
mutate(
# Extract the abbreviated team name (e.g., "OSU" from "OSU Ohio State (59)")
AbbrevTeam = str_extract(Team, "^[A-Z]+"),
# Dynamically look up the corresponding AVG value
AVG = avg_ypp_list[[as.character(Year)]] %>%
filter(Team == AbbrevTeam) %>%
pull(AVG) %>%
first() # Return the first match, if any
%>%
) ungroup()
# Plot
<- ggplot(combined_data, aes(x = Year, y = AVG)) +
avg_ypp_line_and_dot # Black line connecting all points
geom_line(color = "black", linewidth = 0.25, aes(group = 1)) +
# Uniform points for each year
geom_point(color = "orange", size = 2) +
labs(
title = "Do #1 Teams Excel in Average Offensive Yards/Pass?",
x = "Year",
y = "Average Yards/Pass Rank"
+
) theme_minimal() +
theme(
legend.position = "none" # Remove the legend
)
# Display the plot
avg_ypp_line_and_dot
###### FACETED DOT PLOT #########
# Function to prepare data for plotting
<- function(year, rank_index) {
prepare_avg_ypp_data # Construct the file name for the HTML file
<- paste0(year, "_avg_yds_per_play.html")
file_name
# Load the data
<- read_html(here::here('data_raw', file_name)) %>%
avg_ypp html_table()
<- as.data.frame(avg_ypp)
avg_ypp <- avg_ypp %>%
avg_ypp select('Team', "AVG") %>%
arrange(-AVG)
# Dynamically extract the abbreviation from the 'ranked_1' data based on rank_index
<- str_extract(ranked_1$Team[rank_index], "^[A-Za-z]+")
top_team_abbreviation
# Create a new column to indicate the highlight color
<- avg_ypp[1:50, ]
top_50_teams <- top_50_teams %>%
top_50_teams mutate(
is_highlight = Team == top_team_abbreviation,
fill_color = ifelse(is_highlight, "red", "lightgrey"), # Use regular red for highlight
display_name = ifelse(is_highlight, Team, ""),
Year = year # Add the year column for faceting
)
return(top_50_teams)
}
# Combine data for all years
<- bind_rows(lapply(2014:2023, function(year) {
all_years_data <- year - 2013 # Map year to rank index
rank_index prepare_avg_ypp_data(year, rank_index)
}))# Calculate the average AVG for each year
<- all_years_data %>%
yearly_averages group_by(Year) %>%
summarize(avg_yds = mean(AVG))
# Merge averages back into the data for labeling
<- all_years_data %>%
all_years_data left_join(yearly_averages, by = "Year")
# Create the faceted dot plot with average line
ggplot(all_years_data) +
# Plot all grey dots first
geom_point(data = filter(all_years_data, !is_highlight),
aes(x = reorder(Team, AVG), y = AVG),
color = "lightgrey", size = 1.5) +
# Plot the red highlighted dot on top with a legend
geom_point(data = filter(all_years_data, is_highlight),
aes(x = reorder(Team, AVG), y = AVG, color = "Top Rank"),
size = 1.5) +
# Add an average line for each year
geom_hline(data = yearly_averages, aes(yintercept = avg_yds), linetype = "dashed", color = "black") +
# Add a label for the average line
geom_text(data = yearly_averages,
aes(x = 25, y = avg_yds, label = "Average Yds/Pass"), # Move text slightly left
color = "black", hjust = 0, vjust = -0.5, size = 2.5) + # Black text, smaller size
labs(
x = "Team",
y = "Average Yards per Pass",
title = "Top 50 Teams - Average Yards Per Pass by Year",
color = "" # Legend title for red dot
+
) facet_wrap(~Year, ncol = 2) + # Facet by year
theme_minimal() +
theme(
axis.text.x = element_blank(), # Remove x-axis text labels
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Add legend at the bottom
+
) scale_color_manual(values = c("Top Rank" = "red")) # Legend for red dot
# Create the faceted dot plot with average line styled like "everyone" graph
<- ggplot(all_years_data, aes(x = Team, y = AVG)) +
average_YPP # Plot all points as light grey
geom_point(color = "lightgrey", size = 1.5, position = position_dodge(width = 0.5)) +
# Overlay orange points for "Top Rank"
geom_point(
data = filter(all_years_data, is_highlight),
aes(color = "Top Rank"), size = 1.5, position = position_dodge(width = 0.5)
+
)
# Add dashed line for the average
geom_hline(data = yearly_averages, aes(yintercept = avg_yds),
linetype = "dashed", color = "black", size = 0.5) +
# Add label for the average line
annotate(
"text", x = 45, y = 7.25 + 0.5,
label = "Average", color = "black", size = 2.5, hjust = 0
+
)
# Customize color scale to show only "Top Rank" in the legend
scale_color_manual(
values = c("Top Rank" = "orange"), # Orange for "Top Rank"
labels = c("Top Rank"), # Only "Top Rank" in legend
guide = guide_legend(override.aes = list(size = 3)) # Adjust legend point size
+
)
# Apply minimal theme
theme_minimal() +
# Add labels and adjust layout
labs(
title = "Offensive Yards Per Pass by Year",
x = NULL, # Remove x-axis label
y = "Average Yards per Pass",
color = NULL # Remove legend title
+
)
# Adjust theme
theme(
axis.text.x = element_blank(), # Remove x-axis text
axis.ticks.x = element_blank(), # Remove x-axis ticks
legend.position = "bottom" # Place legend at the bottom
+
)
# Add facets
facet_wrap(~ Year) # Use fixed y-axis for all facets
# Display the plot
average_YPP
Data Dictionary
Offensive Data Variables
Variable | Data Type | Description |
---|---|---|
Team | character | The name of the college football team. |
YDS | numeric | Total offensive yards gained by a team in the season. |
YDS/G |
numeric | Offensive yards per game. |
TD | numeric | Total touchdowns scored by the team. |
AVG | numeric | Average yards per play (or pass). |
CMP% | numeric | Completion percentage for the team’s passes. |
Rank | numeric | Ranking of the team based on touchdowns or other performance metrics. |
Year | numeric | The year of the season. |
InTop25 | logical | Indicates whether the team is ranked in the Top 25 (TRUE/FALSE). |
Color | character | Color assigned to the team for plotting purposes. |
Defensive Data Variables
Variable | Data Type | Description |
---|---|---|
Team | character | The name of the college football team. |
Games Played | numeric | Total games played by the team in the season. |
Total YDS | numeric | Total defensive yards allowed by the team in the season. |
Total YDS/Game |
numeric | Defensive yards allowed per game. |
Passing YDS | numeric | Total passing yards allowed by the team. |
Passing YDS/Game |
numeric | Passing yards allowed per game. |
Rushing YDS | numeric | Total rushing yards allowed by the team. |
Rushing YDS/Game |
numeric | Rushing yards allowed per game. |
Points | numeric | Total points allowed by the team. |
Points/Game |
numeric | Points allowed per game. |
Metric | character | The type of metric (e.g., total yards, passing yards) for defensive stats. |
Value | numeric | The value of the selected defensive metric. |
Rankings and Combined Data
Variable | Data Type | Description |
---|---|---|
RK | numeric | Ranking of the team in overall standings or specific metrics. |
REC | character | Win-loss record of the team. |
PTS | numeric | Points accumulated by the team in the ranking system. |
row_key | character | Unique identifier combining key variables for each row in the dataset. |
yards_rank | numeric | Rank of the team based on yards per game within a specific year. |
Faceted and Yearly Analysis
Variable | Data Type | Description |
---|---|---|
avg_yds | numeric | Average yards per play across all teams in a specific year. |
is_highlight | logical | Indicates whether a team is highlighted in the plot (TRUE/FALSE). |
fill_color | character | Color assigned to teams for highlighting in faceted plots. |
This dictionary documents the variables in the datasets and visualizations used throughout your code. Each table categorizes variables based on their usage (offensive, defensive, rankings, and faceted/yearly analysis). Include it in your Quarto file as a reference for readers or collaborators.