We modified our research question in two ways. First, we no longer focus on one specific time period. This allows us to go further back in time and see how Lewis Hamilton has progressed through the different levels of racing, not just F1, so that we can see the relationship (or lack thereof) between his results and the types of cars he has driven. Second, we no longer automatically pair Lewis Hamilton’s success with Mercedes. Yes, he is a Mercedes driver, but that may or may not be the sole reason for his success in F1 racing.
The data on FIA F1, also known as Formula 1 racing, comes from the website Kaggle. The URL to the source can be found here: F1 Data. The data is a collection of different databases and CSV files that capture information on race wins, qualifying times, constructor/team and driver championships, and fastest lap times. The data ranges from 1950 to 2020 and was last updated on December 14, 2020. Our project focuses on two specific data frames: fastest_laps_all_drivers_all_race_1950 and constructors_championship_1958. These data frames cover the fastest laps and the F1 constructors that have won the championship.
The data was collected using web scraping methods that searched publicly available data from a variety of F1 websites. Unfortunately, the original sources of the data were not documented, so the data cannot be validated against FIA data.
The F1 data is nevertheless very usable because it has already been pre-processed and updated over 25 times. However, we would like to emphasize that the data does not come from the original data sets and is instead a collection of data from unknown F1 sources. The collector and owner of the data set is the Kaggle user Aadil Tajani. Our team has reached out to the author to try to learn more about the source of the original data. We will of course provide updates when we receive them.
# Load the race winners and parse the day-month-year text into a Date
wins <- read_csv("data_raw/race_wins_1950-2020.csv")
#head(wins)
wins_cleaning <- wins %>%
  clean_names() %>%
  mutate(date = dmy(date))

# Top 10 drivers by Grand Prix wins since Hamilton's 2007 debut
wins_filtering <- wins_cleaning %>%
  filter(year(date) >= 2007) %>%   # include 2007 so the data matches the title
  group_by(name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10) %>%
  mutate(is_Lewis = name == 'Lewis Hamilton') %>%   # flag used to highlight his bar
  ggplot() +
  geom_col(aes(x = count,
               y = reorder(name, count),
               fill = is_Lewis),
           width = 0.7, alpha = 0.8) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.05))) +
  scale_fill_manual(values = c('grey', 'lightgreen')) +
  theme_minimal_vgrid() +
  theme(legend.position = 'none') +
  labs(title = "Grand Prix Wins by Driver between 2007-2020",
       x = "Grand Prix Wins",
       y = "Racer")
wins_filtering
As you can see, Lewis Hamilton is quite dominant in the sport of F1. He has led the field in Grand Prix wins by a large margin. Another interesting thing to note is that his teammates, who also drive Mercedes cars, are among the top 10 drivers in number of Grand Prix wins. One question we felt was important to dive into was what really lies behind Lewis Hamilton’s dominance in F1. Was the car he was driving far superior to his competitors’ cars because of the money invested in the Mercedes racing team? Or is Lewis Hamilton so much more skilled than his competitors that, as long as his car was comparable to his rivals’, he would come out on top of the leaderboard no matter what he was driving? The number of Grand Prix wins is no doubt impressive; however, it does not help us answer why Lewis Hamilton has dominated.
# Top 10 drivers by Grand Prix wins during Hamilton's McLaren years (2007-2012)
wins_filtering_McLaren <- wins_cleaning %>%
  filter(year(date) >= 2007) %>%   # his debut season with McLaren
  filter(year(date) < 2013) %>%    # he moved to Mercedes for 2013
  group_by(name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10) %>%
  mutate(is_Lewis = name == 'Lewis Hamilton') %>%
  ggplot() +
  geom_col(aes(x = count,
               y = reorder(name, count),
               fill = is_Lewis),
           width = 0.7, alpha = 0.8) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.05))) +
  scale_fill_manual(values = c('grey', 'orange')) +
  theme_minimal_vgrid() +
  theme(legend.position = 'none') +
  labs(title = "Grand Prix Wins by Driver between 2007-2012",
       x = "Grand Prix Wins",
       y = "Racer")
wins_filtering_McLaren
Although Hamilton is not the outright leader, he is in contention for the most Grand Prix wins throughout his early career with McLaren, second only to Sebastian Vettel. This time frame was early in Lewis Hamilton’s career, so a couple of factors could explain the result above: his inexperience in F1, the constructor McLaren, and the car could all have played a major role.
To make sure we properly answer the research question, we must look further into his skill as an F1 driver and see how he has performed in qualifying, while also examining variables such as fastest lap speed.
# Keep both Mercedes drivers. %in% tests every row against both names,
# whereas == would recycle the two names and silently drop rows.
qualifying_times_filtered <- qualifying_times %>%
  filter(Name %in% c("Lewis Hamilton", "Valtteri Bottas")) %>%
  clean_names()

# Convert the "minutes:seconds" lap-time text to seconds
qualifying_times_filtered$q1 <- period_to_seconds(ms(qualifying_times_filtered$q1))
qualifying_times_filtered$q2 <- period_to_seconds(ms(qualifying_times_filtered$q2))
qualifying_times_filtered$q3 <- period_to_seconds(ms(qualifying_times_filtered$q3))

qualifying_times_filtered <- qualifying_times_filtered %>%
  # Keep only sessions in which the driver set a time in Q1, Q2 and Q3
  filter(!is.na(q1), !is.na(q2), !is.na(q3)) %>%
  filter(x < 6040) %>%
  filter(year > 2016) %>%   # Hamilton and Bottas were teammates from 2017
  group_by(year, venue, name) %>%
  summarise(fastest_qualifying_lap = min(q1, q2, q3), .groups = "drop") %>%
  # One row per race with one column per driver
  spread(key = name, value = fastest_qualifying_lap) %>%
  clean_names() %>%
  # Keep only races in which both drivers have a time
  filter(!is.na(lewis_hamilton), !is.na(valtteri_bottas)) %>%
  # A positive difference means Hamilton's lap was shorter, i.e. faster
  mutate(
    fastest_lap_difference = valtteri_bottas - lewis_hamilton,
    Driver = if_else(
      fastest_lap_difference > 0, "Lewis Hamilton", "Valtteri Bottas"))
# Bar per venue showing the teammate gap, one panel per season
fastest_qualifying_lap_graph <- qualifying_times_filtered %>%
  ggplot(aes(x = factor(venue), y = fastest_lap_difference, fill = Driver)) +
  geom_col() +
  theme(
    axis.text.x = element_text(size = 15, angle = 45),
    strip.text.x = element_text(size = 15)) +
  facet_wrap(~year, ncol = 2) +
  coord_flip() +
  guides(color = "none") +   # "none" replaces the deprecated FALSE
  theme(
    text = element_text(family = "Georgia"),
    axis.text = element_text(size = 20),
    plot.title = element_text(size = 30, margin = margin(b = 10)),
    plot.subtitle = element_text(size = 20, color = "darkslategrey", margin = margin(b = 25))) +
  labs(title = 'Difference Between Fastest Laps of Lewis Hamilton and Valtteri Bottas',
       subtitle = 'Time in Seconds', x = "Venue", y = "Fastest Lap Difference (Seconds)", fill = "Who Was Faster?")
fastest_qualifying_lap_graph
To evaluate Lewis Hamilton’s performance at Mercedes, we compare his lap times with those of his Mercedes teammate Valtteri Bottas, because both of them have identical cars. Any comparison with his teammate should therefore be a fair evaluation of Lewis Hamilton’s skill. To get bias-free results, we compared each driver’s fastest qualifying lap in the races that they both participated in, because in qualifying no one has the advantage of pole position. Hamilton and Bottas raced together at Mercedes from 2017 to 2020, so the chart shows the difference between their fastest qualifying laps at every Grand Prix in those years. A red bar indicates that Hamilton set the faster qualifying lap, while a blue bar indicates that Bottas did. We can clearly see that Hamilton was very dominant in 2018, and slightly better than Bottas in 2019. The panels for 2017 and 2020 do not contain enough races that both participated in, so those results do not make as strong an argument.
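The teammate comparison described above can be sketched on a small made-up table. The lap times and venue names below are invented for illustration and are not taken from the Kaggle data; the column names mirror the cleaned qualifying data, and `pivot_wider()` is used here as one possible alternative to `spread()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical best qualifying laps in seconds (invented values, not real data)
toy_qualifying <- tibble(
  year  = c(2018, 2018, 2018, 2018),
  venue = c("Venue A", "Venue A", "Venue B", "Venue B"),
  name  = c("Lewis Hamilton", "Valtteri Bottas",
            "Lewis Hamilton", "Valtteri Bottas"),
  fastest_qualifying_lap = c(81.2, 81.5, 90.4, 90.1)
)

toy_difference <- toy_qualifying %>%
  # One row per race, one column per driver
  pivot_wider(names_from = name, values_from = fastest_qualifying_lap) %>%
  # A positive difference means Hamilton's lap was shorter, i.e. faster
  mutate(
    fastest_lap_difference = `Valtteri Bottas` - `Lewis Hamilton`,
    Driver = if_else(fastest_lap_difference > 0,
                     "Lewis Hamilton", "Valtteri Bottas")
  )

toy_difference
```

In this toy table Hamilton is faster at the first venue and Bottas at the second, so the sign of the difference alone decides which driver each bar is attributed to.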
# Avg.Speed is read as text, so convert it to a number first
fastest_laps_all_drivers_all_race$Avg.Speed <- as.numeric(fastest_laps_all_drivers_all_race$Avg.Speed)

# Median fastest-lap speed per driver and season, flagging Hamilton
fastest_lap_times <- fastest_laps_all_drivers_all_race %>%
  mutate(Correct_driver = ifelse(Name == 'Lewis Hamilton', "Yes", "No")) %>%
  dplyr::filter(Year > 2006) %>%
  dplyr::group_by(Name, Year, Correct_driver) %>%
  summarize(medianFastestLapSpeed = median(Avg.Speed, na.rm = TRUE), .groups = "drop")

# Hamilton's rows, used to label his points
label_1 <- fastest_lap_times %>%
  filter(Name == 'Lewis Hamilton') %>%
  arrange(desc(medianFastestLapSpeed))

fastest_lap_times %>%
  ggplot(aes(x = factor(Year), y = medianFastestLapSpeed)) +
  geom_boxplot(alpha = .25) + theme_fivethirtyeight() +
  #geom_jitter(shape=16,position=position_jitter(0.2),size=1.5) +
  geom_smooth(method = 'loess', aes(group = 1), color = 'black', lty = 2, size = .5) +
  geom_point(aes(x = factor(Year), y = medianFastestLapSpeed, color = Correct_driver)) +
  geom_text(data = label_1, aes(label = Name, color = Correct_driver), size = 3, hjust = -.2) +
  labs(title = 'Fastest Lap per Year',
       subtitle = 'in km/h, grouped by Grand Prix') +
  guides(color = "none") +   # "none" replaces the deprecated FALSE
  theme(axis.title = element_blank(),
        text = element_text(family = "Georgia"),
        axis.text = element_text(size = 20),
        plot.title = element_text(size = 40, margin = margin(b = 10)),
        plot.subtitle = element_text(size = 40, color = "darkslategrey", margin = margin(b = 25)))
From the boxplot above we can see how Hamilton has performed in terms of median fastest-lap speed per year. Overall, he is almost always above the smoothed trend line, except in 2011 and 2014. Furthermore, he is usually an outlier on the fast side. The result suggests that Hamilton has been so dominant because he consistently has faster lap speeds than the competition, especially in the last 5 years, when his performance has been well ahead of the field. From this plot we can also see that Hamilton’s fastest-lap speeds have been going up over the course of his career. When he was driving for McLaren his performance started to plateau. However, when he moved to Mercedes his performance improved and his speeds increased. This is most likely due to his increased driving experience and a better car.
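As a rough illustration of what each point in the boxplot represents, the sketch below computes a median fastest-lap speed per driver and season and checks whether it lies above the season's median. The drivers and speeds are invented placeholders, and `aboveSeasonMedian` is a hypothetical helper column that does not appear in our analysis.

```r
library(dplyr)

# Hypothetical fastest-lap speeds (invented values, not real data)
toy_speeds <- tibble(
  Year      = c(2019, 2019, 2019, 2019, 2019, 2019),
  Name      = c("Driver 1", "Driver 1", "Driver 2",
                "Driver 2", "Driver 3", "Driver 3"),
  Avg.Speed = c(210, 220, 205, NA, 200, 204)
)

toy_summary <- toy_speeds %>%
  group_by(Name, Year) %>%
  # One value per driver and season; missing speeds are ignored
  summarise(medianFastestLapSpeed = median(Avg.Speed, na.rm = TRUE),
            .groups = "drop") %>%
  group_by(Year) %>%
  # Compare each driver with the median of all drivers in that season
  mutate(aboveSeasonMedian =
           medianFastestLapSpeed > median(medianFastestLapSpeed)) %>%
  ungroup()

toy_summary
```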
# Read Time as text so it can be parsed as minutes:seconds below
fastest_lap <- read_csv(
  here::here("data_raw", "fastest_laps_all_drivers_all_race_1950-2020.csv"),
  col_types = cols(Time = col_character()))

fastest_lap <- fastest_lap %>%
  clean_names() %>%
  select(year, venue, name, team, time) %>%
  filter(!is.na(time)) %>%
  mutate(
    time = ms(time),                      # Converts from character to time period
    seconds = as.numeric(time)) %>%       # Converts time period to seconds
  group_by(year, venue) %>%
  mutate(minLapTime = min(seconds)) %>%   # Fastest lap set in each race
  ungroup() %>%
  mutate(secondsBehindLeader = seconds - minLapTime) %>%
  filter(secondsBehindLeader < 5) %>%     # Drop laps 5 or more seconds off the pace
  group_by(name) %>%
  mutate(medianSecondsBehindLeader = median(secondsBehindLeader)) %>%
  ungroup() %>%
  arrange(year, venue, minLapTime)

# Fastest laps of the McLaren Mercedes drivers, 2007-2013
fastest_lap_2007McLaren <- fastest_lap %>%
  filter(team == 'McLaren Mercedes') %>%
  filter(year > 2006) %>%
  filter(year < 2014) %>%
  arrange(medianSecondsBehindLeader)

# Fastest laps of the Mercedes drivers from 2013 onwards
fastest_lap_2013Mercedes <- fastest_lap %>%
  filter(team == 'Mercedes') %>%
  filter(year >= 2013) %>%
  arrange(medianSecondsBehindLeader)
We have compared Lewis Hamilton’s driving across many metrics and he is clearly at the top of the competition. The main question that remains is why he is so fast. His team has been very successful overall, in part because it has more money to invest in the performance of Hamilton and its other drivers. The upcoming visualizations use a data set to which we have added a variable titled secondsBehindLeader. This variable was calculated by finding the minimum fastest-lap time for each race and subtracting it from every driver’s fastest-lap time, after converting the times to seconds. A value of zero means that the driver set the fastest lap of that race. This metric shows which racers are consistently among the fastest: the faster the driver, the more heavily their distribution of secondsBehindLeader should be skewed towards zero.
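The calculation can be followed on a toy example. The drivers, venues and lap times below are invented placeholders rather than values from the data set; only the column names `seconds`, `minLapTime` and `secondsBehindLeader` follow our code.

```r
library(dplyr)

# Hypothetical fastest-lap times in seconds for two races (invented values)
toy_laps <- tibble(
  year    = c(2019, 2019, 2019, 2019, 2019, 2019),
  venue   = c("Venue A", "Venue A", "Venue A",
              "Venue B", "Venue B", "Venue B"),
  name    = c("Driver 1", "Driver 2", "Driver 3",
              "Driver 1", "Driver 2", "Driver 3"),
  seconds = c(80.0, 80.5, 82.0, 95.0, 94.0, 94.25)
)

toy_gaps <- toy_laps %>%
  group_by(year, venue) %>%
  mutate(minLapTime = min(seconds)) %>%   # fastest lap set in that race
  ungroup() %>%
  # Gap of each driver's fastest lap to the fastest lap of the race
  mutate(secondsBehindLeader = seconds - minLapTime)

toy_gaps
```

Driver 1 has a gap of zero at the first venue and Driver 2 at the second, which is how a pace setter shows up in the histograms.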
The two faceted graphs below contain only teammates of Lewis Hamilton from his time at McLaren and Mercedes respectively. As you can see, Lewis Hamilton’s distribution shows he is quite fast and it is very common for him to be the pace setter, which is why he has so many races in the bin at zero. Zero seconds behind the leader means he set the fastest lap himself.
To get a sense of how the secondsBehindLeader variable looks in a graph, we first produce a preliminary histogram of Lewis Hamilton’s distribution over his career.
# Distribution of Hamilton's gap to the fastest lap over his career
fastest_lap %>%
  filter(name == "Lewis Hamilton") %>%
  ggplot() +
  geom_histogram(aes(x = secondsBehindLeader), bins = 30) +
  labs(title = "Lewis Hamilton's Career Race Time Distribution",
       x = "Seconds Behind The Leader of a Race",
       y = "Number of Races") +
  theme_classic()
Following this histogram, we will compare Lewis with his teammates on both the McLaren and Mercedes racing teams that he has been a part of during his career.
# Hamilton drove for McLaren through 2012, so later seasons are dropped here
# to keep only drivers who were actually his teammates
fastest_lap_2007McLaren %>%
  filter(year <= 2012) %>%
  ggplot() +
  geom_histogram(aes(x = secondsBehindLeader), bins = 20) +
  facet_wrap(~factor(name, levels = c('Lewis Hamilton', 'Jenson Button', 'Heikki Kovalainen', 'Fernando Alonso'))) +
  labs(x = "Seconds Behind the Leader",
       y = "Number of Races",
       title = "Comparing McLaren Teammates By Gap to the Fastest Lap",
       subtitle = "Year: 2007-2012") +
  theme_minimal()
# One panel per Mercedes driver, stacked vertically for easy comparison
fastest_lap_2013Mercedes %>%
  ggplot() +
  geom_histogram(aes(x = secondsBehindLeader), bins = 25) +
  facet_wrap(vars(name), ncol = 1) +
  labs(x = "Seconds Behind the Leader",
       y = "Number of Races",
       title = "Comparing Mercedes Teammates By Gap to the Fastest Lap",
       subtitle = "Year: 2013-2020") +
  theme_minimal()
All these racers drive cars with the same features, horsepower and technology, and have the same pit crews and team structures. This breakdown suggests that Lewis Hamilton is a faster and more skilled driver than the racers who drive the same cars as he does. He also appears to have a clutch factor, which can be seen in the large bin at the “0” mark: he sets the fastest lap in many races, and when he does not, he is not far off the leader.
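One way to put a number on the size of the bin at zero is to compute, for each driver, the share of races with a gap of exactly zero alongside the median gap. The sketch below uses invented gaps for two placeholder drivers; the summary column names are hypothetical and are not part of our analysis.

```r
library(dplyr)

# Hypothetical gaps to the fastest lap in seconds (invented values)
toy_gap_samples <- tibble(
  name = c(rep("Driver 1", 4), rep("Driver 2", 4)),
  secondsBehindLeader = c(0, 0, 0.4, 1.2, 0, 0.8, 1.5, 2.0)
)

zero_gap_share <- toy_gap_samples %>%
  group_by(name) %>%
  summarise(
    races        = n(),
    fastestLaps  = sum(secondsBehindLeader == 0),    # races in the zero bin
    shareFastest = mean(secondsBehindLeader == 0),   # fraction of races at zero
    medianGap    = median(secondsBehindLeader)
  )

zero_gap_share
```

A driver whose histogram is concentrated at zero would show a high `shareFastest` and a small `medianGap` in such a table.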
Here is the data dictionary for the variables in the fastest_laps_all_drivers_all_race_1950-2020 data frame:
variable | class | description |
---|---|---|
X1 | double | index for ref |
Year | double | Year of Participation |
Position | double | Position as per Lap times |
Driver No. | double | Driver no. |
Venue | character | Race Venue |
Name | character | Driver Name |
NameTag | character | Driver Name Tag |
Team | character | Team Name |
Lap No. | character | Lap No. of Fastest Lap |
Time | time | Lap Time |
Avg Speed | character | Average speed of the car during lap |
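Because Time holds a lap time and Avg Speed is stored as text, both need converting before they can be used in calculations. The sketch below shows one way to do this with lubridate, assuming lap times are written as minutes:seconds text; the two sample values are invented, and the blank speed stands in for a missing entry.

```r
library(lubridate)

# Hypothetical raw values (invented, not real data)
raw_time  <- c("1:23.456", "1:30.000")   # lap times as minutes:seconds text
raw_speed <- c("215.5", "")              # speeds stored as text; blank is missing

# ms() parses minutes and seconds into a period, which is then expressed in seconds
lap_seconds <- period_to_seconds(ms(raw_time))

# as.numeric() turns the text into numbers; the blank entry becomes NA
avg_speed <- as.numeric(raw_speed)

lap_seconds
avg_speed
```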
Here is the data dictionary for the variables in the race_wins_1950-2020 data frame:
variable | class | description |
---|---|---|
X1 | double | index for ref |
Venue | character | Name of Grand Prix |
Date | time | Date of Race |
Name | character | Driver Name |
NameTag | character | Driver Name Tag |
Position | double | Position as per Lap times |
Team | character | Team Name |
Laps | double | Number of laps |
Time | double | Race time |
Here is the data dictionary for the variables in the constructors_championship_1958-2020.csv data frame:
variable | class | description |
---|---|---|
X1 | double | index for ref |
Year | double | Year of Participation |
Position | double | Position as per Lap times |
Team | character | Team Name |
Points | double | Points Scored |
#install.packages('ggthemes', dependencies = TRUE)
# Load libraries and settings here
library(tidyverse)
library(here)
library(ggplot2)
library(ggrepel)
library(lubridate)
library(janitor)
library(plotly)
library(ggthemes)
library(cowplot)
knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  comment = "#>",
  fig.path = "figs/",   # Folder where rendered plots are saved
  fig.width = 7.252,    # Default plot width
  fig.height = 4,       # Default plot height
  fig.retina = 3        # For better plot resolution
)
# Load data below here
fastest_laps_all_drivers_all_race <- read.csv("data_raw/fastest_laps_all_drivers_all_race_1950-2020.csv", sep = ',', stringsAsFactors = FALSE)
qualifying_times <- read.csv("data_raw/qualifying_times_2006-2020.csv")
wins <- read.csv("data_raw/race_wins_1950-2020.csv")
# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_bw(base_size = 20))