Leveling the Playing Field
How the MLB Competitive Balance Tax affects Team Performance
Research Question
image source
Data
Sources
Team Payrolls: 1950-2022
We collected team payroll data from
baseball-reference.com
, an online baseball encyclopedia.
Not all seasons have complete data, especially those closer to 1950. We
started using data from 1969 rather than 1950 for better data. These
payrolls amounts were also adjusted for inflation to try and get a
better understanding of the differences in payrolls from year to year
(by comparing real rather than nominal payrolls).
Team Performance Data
Sean Lahman began an effort to make MLB statistics publicly available
starting in 1994. Since then, he and his team have updated a database
containing different data frames in different formats for public use.
(The most useful format for this project is comma-delimited values,
csv’s.) The data frame we used is the Teams.csv
data frame
for regular season data. Thankfully, the data from the Lahman Baseball
Database (see documentation
has already been compiled into an R package: the Lahman
package.
Data Processing
The data from the Lahman database has been compiled into the
Lahman
package. The other data is loaded from the
data_raw
folder. The data frames from the
Lahman
package used are Teams
.
The below code is the initial data wrangling of the data frames as well as a data dictionary for each data frame.
Expectations
One variable’s distribution we estimated was team payroll–hypothesizing it to be bimodal (with teams either having very high or very low payrolls). This is disproved by the graph below, which shows a histogram of every team’s payroll. The visual is skewed to the right, indicating that our preliminary analysis was slightly flawed.
Here are the expected relationships we discussed in our proposal:
One relationship we might expect to find is a positive correlation between each team’s yearly payroll and their season record measured in wins and losses. It is the assumption that team’s that have more funding (a larger payroll) should be performing better.
The visual for this relationship is outlined in the following Data Visualization Section. Titled “Winning Percent based on Payroll”, we see that there is no obvious positive correlation between both payroll and winning percentage. (As an extra, the visual also displays Pre and Post CBT through the use of two colors).
Unfortunately, we could not find data that could provide an accurate representation of our second relationship assumption, stated below:
Another relationship we might expect to find is that the implementation of the competitive balance tax (CBT) has little to no effect on the team’s performance. It is predicted that the CBT is so inconsequential that it is almost negligible in terms of affecting team performance.
Data Visualizations
Chart 1: Density Plot
The density plot shows a symmetric, unimodal distribution centered at 0.5. This makes sense particularly because a team’s winning percentage at the end of a season is dependent on their performance against the other observations (i.e., teams) in the data set. Some teams perform consistently well, some do the opposite, and others have seasons that keep them around 50%. With no blatant cheating and advantages, this should be expected. Every team, more or less, has a reasonable chance of winning against any other team.
Chart 2: Scatterplot
This visualization is a faceted scatterplot comparing team payroll and regular season winning percentage from the pre- and post-CBT eras. The pre-CBT data have a correlation of r = 0.074 and the post-CBT data have a correlation of r = 0.357. Given these values, there does not appear to be a linear relationship in the pre-CBT era. However, there is a moderately weak linear relationship for the post-CBT era.
Chart 3: Dumbbell Chart
This plot displays the winning percentages of teams, specifically pointing out the differences between post and pre-CBT implementation. The data is organized from greatest to least post-CBT win percentage. Some teams’ average winning percentage increased in the post-CBT era, whereas some teams’ performance decreased. There are two teams not included in this visualization: the Arizona Diamondbacks and the Tampa Bay Rays. These franchises were not founded until 1998, so there is no pre-CBT era data to compare. The two teams with very low pre-CBT winning percentages–Colorado and Miami (formerly Florida)–have these because these franchises’ first season in MLB was 1993. Historically, expansion teams rarely achieve success in their opening season and it takes time to curate a competitive roster.
Conclusion
There are several factors that affect a professional sports team’s performance. One of the largest factors is the talent level of a team’s roster. Usually kept in check by a salary cap, some sports still don’t use the “conventional” approach to limiting wealthy franchises’ power. There has been a cry from fans for decades to implement a salary cap in Major League Baseball. The luxury tax was MLB’s attempt at keeping payrolls in control. Clearly, though, it has not. However, as the visualizations show, not every team that has great success in the post-CBT era are large-market teams (St. Louis at #5, Cleveland at #6). Only one team out of the 2 of the 10 teams whose performance worsened after the implementation of the CBT come from relatively large sports markets–New York (Mets) and Toronto. There is certainly more factors to a team’s performance other than its overall annual payroll. Scouting, effective free agent signings, smart trading, and knowledgeable leadership are just some of the other contributing factors. This analysis is only exploratory, but it may be time for fans to consider whether a salary cap would truly be useful in helping their team perform better.
Appendix
Data Dictionaries
There are three raw data sets used in this report
payroll_data
, inflation_factors
, and
Teams
. The respective data dictionaries are below, in the
same order as previously listed.
Variable | Type | Description |
---|---|---|
year | dbl | year |
franch_id | chr | three-letter franchise abbreviation |
nominal_est_payroll | dbl | nominal estimated payroll |
Variable | Type | Description |
---|---|---|
Year | dbl | year |
infl_factor | dbl | value of $ on Jan 1, Year is worth in September 2022 |
Variable | Type | Description |
---|---|---|
year_id | int | year |
team_id | chr | three-letter team abbreviation |
franch_id | chr | three-letter franchise abbreviation |
g | int | games played |
w | int | wins |
l | int | losses |
win_pct | dbl | winning percentage |
The clean data frame used is stats_and_payroll
. Below is
its data dictionary:
Variable | Type | Description |
---|---|---|
franch_id | chr | three-letter franchise abbreviation |
year | dbl | year |
nominal_est_payroll | dbl | nominal estimated payroll |
team_id | chr | three-letter team abbreviation |
g | int | games played |
w | int | wins |
l | int | losses |
win_pct | dbl | winning percentage |
post_CBT | chr | whether the season is pre- or post-CBT |
infl_factor | dbl | value of $ on Jan 1, Year is worth in September 2022 |
payroll_adj_for_infl | dbl | real estimated payroll |
Below is a list of the franchise IDs and the corresponding full team name to eliminate any confusion when abbreviations are used in the report:
franch_id | Team Name |
---|---|
ANA | Los Angeles Angels of Anaheim |
ARI | Arizona Diamondbacks |
ATL | Atlanta Braves |
BAL | Baltimore Orioles |
BOS | Boston Red Sox |
CHC | Chicago Cubs |
CHW | Chicago White Sox |
CIN | Cincinnati Reds |
CLE | Cleveland Guardians |
COL | Colorado Rockies |
DET | Detroit Tigers |
FLA | Miami Marlins (formerly Florida Marlins) |
HOU | Houston Astros |
KCR | Kansas City Royals |
LAD | Los Angeles Dodgers |
MIL | Milwaukee Brewers |
MIN | Minnesota Twins |
NYM | New York Mets |
NYY | New York Yankees |
OAK | Oakland Athletics |
PHI | Philadelphia Phillies |
PIT | Pittsburgh Pirates |
SDP | San Diego Padres |
SEA | Seattle Mariners |
SFG | San Francisco Giants |
STL | St. Louis Cardinals |
TBD | Tampa Bay Rays (formerly Tampa Bay Devil Rays) |
TEX | Texas Rangers |
TOR | Toronto Blue Jays |
WSN | Washington Nationals |
Code Used
Below is all the code used to generate this report.
# Load libraries and settings here
library(tidyverse)
library(here)
library(Lahman)
library(janitor)
library(plotly)
library(ggforce)
::opts_chunk$set(
knitrwarning = FALSE,
message = FALSE,
echo = F,
comment = "#>",
fig.path = "figs/", # Folder where rendered plots are saved
fig.width = 8, # Default plot width
fig.height = 5, # Default plot height
fig.retina = 3 # For better plot resolution
)
# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_minimal(base_size = 20))
# Write code below here to load any data used in project
<- read_csv(here('data_raw', 'payroll_data.csv'))
payroll_data <- read_csv(here('data_raw', 'inflation_factors.csv'))
inflation_factors
# For appendix
<- read_csv(here('data_raw', 'payroll_data_dict.csv'))
payroll_dict <- read_csv(here('data_raw', 'inflation_factors_dict.csv'))
infl_dict <- read_csv(here('data_raw', 'team_dict.csv'))
teams_dict <- read_csv(here('data_raw', 'stats_payroll_dict.csv'))
s_p_dict <- read_csv(here('data_raw', 'abbrevs.csv'))
abbrevs <- payroll_data %>%
payroll_data clean_names() %>%
mutate(nominal_est_payroll = est_payroll) %>%
select(c(year, franch_id, nominal_est_payroll)) %>%
filter(is.na(nominal_est_payroll) == F)
<- inflation_factors %>%
inflation_factors clean_names()
<- Teams %>%
teams clean_names() %>%
select(c(year_id, team_id, franch_id, g, w, l)) %>%
filter(year_id >= 1969) %>%
mutate(
team_id = as.character(team_id),
franch_id = as.character(team_id),
win_pct = w / g
)# CREATE Stats and Payroll Master Data Frame
$franch_id<-replace(teams$franch_id, teams$franch_id=="SF","SFG")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CAL","ANA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="LAA","ANA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CHA","CHW")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CHN","CHC")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="KCA","KCR")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="LAN","LAD")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="MON","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="WS2","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="WAS","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="NYA","NYY")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="NYN","NYM")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SDN","SDP")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SE1","MIL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="ML4","MIL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SFN","SFG")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SLN","STL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="FLO","FLA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="MIA","FLA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="TBA","TBD")
teams
# Merging df
<- merge(payroll_data, teams, by.x=c("franch_id", "year"), by.y=c("franch_id", "year_id"), all = TRUE)
stats_and_payroll
# Adjust df for new variables and removing NA
<- stats_and_payroll %>%
stats_and_payroll filter(!is.na(nominal_est_payroll) & !is.na(win_pct)) %>%
mutate(post_CBT = ifelse(year > 1993, "Post-CBT", "Pre-CBT"))
# Get rid of inflation variables that don't apply
<- inflation_factors %>%
inflation_factors_adj filter(year > 1968)
# Create new df of applicable inflation factors and add to stats_and_payroll
<- enframe(table(stats_and_payroll$year)) %>%
count mutate_if(is.table, as.numeric) %>%
mutate(name = as.numeric(name))
<- merge(count, inflation_factors_adj, by.x="name", by.y = "year")
infl_merge <- rep(infl_merge$infl_factor, times=infl_merge$value)
count_array <- stats_and_payroll[with(stats_and_payroll, order(year)),]
stats_and_payroll <- stats_and_payroll %>%
stats_and_payroll mutate(
infl_factor = count_array,
payroll_adj_for_infl = infl_factor*nominal_est_payroll,
)
<- stats_and_payroll %>%
plot ggplot(aes(x = payroll_adj_for_infl)) +
geom_histogram(size = 0.5, color = "#002982", fill = "#002982", alpha = 0.7) +
labs(
title = "Payroll Histogram",
subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
+
) xlab("Team Payroll (in Millions of $)") +
ylab("Count") +
theme_light() +
theme(legend.position = "none") +
scale_x_continuous(
breaks = c(0, 10^8, 2*10^8, 3*10^8),
labels = c("0", "100", "200", "300"),
limits = c(0, 3.005*10^8)) +
theme(
legend.title = element_blank(),
plot.title=element_text(size=20),
axis.title.y = element_text(margin = margin(r=9)),
axis.title.x = element_text(margin = margin(t=9))
)ggplotly(plot)
<- stats_and_payroll %>%
plot ggplot(aes(x = win_pct)) +
geom_density(size = 0.5, color = "#002982", alpha = 0.7, fill = "#002982") +
theme_light() +
labs(
title = "Winning Percent Distribution",
subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
+
) xlab("Regular Season Winning Percent") +
ylab("Density") +
theme(
plot.title=element_text(size=20),
axis.title.y = element_text(margin = margin(r=9)),
axis.title.x = element_text(margin = margin(t=9))
)
plot<- stats_and_payroll %>%
pre filter(year <= 1993)
<- stats_and_payroll %>%
post filter(year >= 1994)
<- stats_and_payroll %>%
plot ggplot(aes(
x = payroll_adj_for_infl,
y = win_pct),
color = post_CBT
+
) geom_point(size = 1, color = '#002982') +
facet_wrap(vars(post_CBT), nrow = 1) +
labs(
title = "Winning Percent based on Payroll",
subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
+
) scale_x_continuous(
breaks = c(0, 10^8, 2*10^8, 3*10^8),
labels = c("0", "100", "200", "300"),
limits = c(0, 3.005*10^8)) +
xlab("Team Payroll (in Millions of $)") +
ylab("Regular Season Winning Percent") +
theme_light() +
theme(
legend.title = element_blank(),
plot.title=element_text(size=20),
axis.title.y = element_text(margin = margin(r=9)),
axis.title.x = element_text(margin = margin(t=9))
)
plot<- stats_and_payroll %>%
preCBT filter(post_CBT == "Pre-CBT") %>%
group_by(franch_id) %>%
summarise(avg_pre_win_pct = mean(win_pct))
<- stats_and_payroll %>%
postCBT filter(post_CBT == "Post-CBT", franch_id != c('ARI', 'TBD')) %>%
group_by(franch_id) %>%
summarise(avg_post_win_pct = mean(win_pct))
<- merge(preCBT, postCBT)
lineGraphData
<- lineGraphData %>%
scatterData rename(Team=franch_id,
"Pre-CBT"=avg_pre_win_pct,
"Post-CBT"=avg_post_win_pct) %>%
gather("Pre-CBT", "Post-CBT", key = "CBT",value = number)
<- inner_join(scatterData, lineGraphData, by = c("Team" = "franch_id"))
combinedAvg
<- combinedAvg %>%
plot ggplot() +
geom_segment(size = 0.8, color ="grey",
aes(
x = avg_pre_win_pct,
xend = avg_post_win_pct,
y = reorder(Team, avg_post_win_pct),
yend = Team
)+
) geom_point(aes(y = reorder(Team, avg_post_win_pct), x=number, color=CBT, shape=CBT),) +
scale_color_manual(values = c("Pre-CBT" = "#002982","Post-CBT"= "#D50032")) +
scale_shape_manual(values = c("Pre-CBT" = 16, "Post-CBT" = 17)) +
labs(
title = "Average Winning Percent Pre- and Post-CBT",
subtitle = "By MLB Franchise"
+
) ylab("Team Franchise ID") +
xlab("Regular Season Winning Percent") +
theme_light() +
theme(
plot.title=element_text(size=18),
axis.title.y = element_text(margin = margin(r=9)),
axis.title.x = element_text(margin = margin(t=9))
)
plot::kable(payroll_dict)
knitr::kable(infl_dict)
knitr::kable(teams_dict)
knitr::kable(s_p_dict)
knitr::kable(abbrevs) knitr