Leveling the Playing Field

How the MLB Competitive Balance Tax affects Team Performance

Research Question

How have Major League Baseball (MLB) teams’ records and post-season performance changed before and after the competitive balance tax (CBT) was implemented in 1994?

image source

Data

Sources

Team Payrolls: 1950-2022

We collected team payroll data from baseball-reference.com, an online baseball encyclopedia. Not all seasons have complete data, especially those closer to 1950. We started using data from 1969 rather than 1950 for better data. These payrolls amounts were also adjusted for inflation to try and get a better understanding of the differences in payrolls from year to year (by comparing real rather than nominal payrolls).

Team Performance Data

Sean Lahman began an effort to make MLB statistics publicly available starting in 1994. Since then, he and his team have updated a database containing different data frames in different formats for public use. (The most useful format for this project is comma-delimited values, csv’s.) The data frame we used is the Teams.csv data frame for regular season data. Thankfully, the data from the Lahman Baseball Database (see documentation has already been compiled into an R package: the Lahman package.

Data Processing

The data from the Lahman database has been compiled into the Lahman package. The other data is loaded from the data_raw folder. The data frames from the Lahman package used are Teams.

The below code is the initial data wrangling of the data frames as well as a data dictionary for each data frame.

Expectations

One variable’s distribution we estimated was team payroll–hypothesizing it to be bimodal (with teams either having very high or very low payrolls). This is disproved by the graph below, which shows a histogram of every team’s payroll. The visual is skewed to the right, indicating that our preliminary analysis was slightly flawed.

Here are the expected relationships we discussed in our proposal:

One relationship we might expect to find is a positive correlation between each team’s yearly payroll and their season record measured in wins and losses. It is the assumption that team’s that have more funding (a larger payroll) should be performing better.

The visual for this relationship is outlined in the following Data Visualization Section. Titled “Winning Percent based on Payroll”, we see that there is no obvious positive correlation between both payroll and winning percentage. (As an extra, the visual also displays Pre and Post CBT through the use of two colors).

Unfortunately, we could not find data that could provide an accurate representation of our second relationship assumption, stated below:

Another relationship we might expect to find is that the implementation of the competitive balance tax (CBT) has little to no effect on the team’s performance. It is predicted that the CBT is so inconsequential that it is almost negligible in terms of affecting team performance.

Data Visualizations

Chart 1: Density Plot

The density plot shows a symmetric, unimodal distribution centered at 0.5. This makes sense particularly because a team’s winning percentage at the end of a season is dependent on their performance against the other observations (i.e., teams) in the data set. Some teams perform consistently well, some do the opposite, and others have seasons that keep them around 50%. With no blatant cheating and advantages, this should be expected. Every team, more or less, has a reasonable chance of winning against any other team.

Chart 2: Scatterplot

This visualization is a faceted scatterplot comparing team payroll and regular season winning percentage from the pre- and post-CBT eras. The pre-CBT data have a correlation of r = 0.074 and the post-CBT data have a correlation of r = 0.357. Given these values, there does not appear to be a linear relationship in the pre-CBT era. However, there is a moderately weak linear relationship for the post-CBT era.

Chart 3: Dumbbell Chart

This plot displays the winning percentages of teams, specifically pointing out the differences between post and pre-CBT implementation. The data is organized from greatest to least post-CBT win percentage. Some teams’ average winning percentage increased in the post-CBT era, whereas some teams’ performance decreased. There are two teams not included in this visualization: the Arizona Diamondbacks and the Tampa Bay Rays. These franchises were not founded until 1998, so there is no pre-CBT era data to compare. The two teams with very low pre-CBT winning percentages–Colorado and Miami (formerly Florida)–have these because these franchises’ first season in MLB was 1993. Historically, expansion teams rarely achieve success in their opening season and it takes time to curate a competitive roster.

Conclusion

There are several factors that affect a professional sports team’s performance. One of the largest factors is the talent level of a team’s roster. Usually kept in check by a salary cap, some sports still don’t use the “conventional” approach to limiting wealthy franchises’ power. There has been a cry from fans for decades to implement a salary cap in Major League Baseball. The luxury tax was MLB’s attempt at keeping payrolls in control. Clearly, though, it has not. However, as the visualizations show, not every team that has great success in the post-CBT era are large-market teams (St. Louis at #5, Cleveland at #6). Only one team out of the 2 of the 10 teams whose performance worsened after the implementation of the CBT come from relatively large sports markets–New York (Mets) and Toronto. There is certainly more factors to a team’s performance other than its overall annual payroll. Scouting, effective free agent signings, smart trading, and knowledgeable leadership are just some of the other contributing factors. This analysis is only exploratory, but it may be time for fans to consider whether a salary cap would truly be useful in helping their team perform better.

Appendix

Data Dictionaries

There are three raw data sets used in this report payroll_data, inflation_factors, and Teams. The respective data dictionaries are below, in the same order as previously listed.

Variable	Type	Description
year	dbl	year
franch_id	chr	three-letter franchise abbreviation
nominal_est_payroll	dbl	nominal estimated payroll

Variable	Type	Description
Year	dbl	year
infl_factor	dbl	value of $ on Jan 1, Year is worth in September 2022

Variable	Type	Description
year_id	int	year
team_id	chr	three-letter team abbreviation
franch_id	chr	three-letter franchise abbreviation
g	int	games played
w	int	wins
l	int	losses
win_pct	dbl	winning percentage

The clean data frame used is stats_and_payroll. Below is its data dictionary:

Variable	Type	Description
franch_id	chr	three-letter franchise abbreviation
year	dbl	year
nominal_est_payroll	dbl	nominal estimated payroll
team_id	chr	three-letter team abbreviation
g	int	games played
w	int	wins
l	int	losses
win_pct	dbl	winning percentage
post_CBT	chr	whether the season is pre- or post-CBT
infl_factor	dbl	value of $ on Jan 1, Year is worth in September 2022
payroll_adj_for_infl	dbl	real estimated payroll

Below is a list of the franchise IDs and the corresponding full team name to eliminate any confusion when abbreviations are used in the report:

franch_id	Team Name
ANA	Los Angeles Angels of Anaheim
ARI	Arizona Diamondbacks
ATL	Atlanta Braves
BAL	Baltimore Orioles
BOS	Boston Red Sox
CHC	Chicago Cubs
CHW	Chicago White Sox
CIN	Cincinnati Reds
CLE	Cleveland Guardians
COL	Colorado Rockies
DET	Detroit Tigers
FLA	Miami Marlins (formerly Florida Marlins)
HOU	Houston Astros
KCR	Kansas City Royals
LAD	Los Angeles Dodgers
MIL	Milwaukee Brewers
MIN	Minnesota Twins
NYM	New York Mets
NYY	New York Yankees
OAK	Oakland Athletics
PHI	Philadelphia Phillies
PIT	Pittsburgh Pirates
SDP	San Diego Padres
SEA	Seattle Mariners
SFG	San Francisco Giants
STL	St. Louis Cardinals
TBD	Tampa Bay Rays (formerly Tampa Bay Devil Rays)
TEX	Texas Rangers
TOR	Toronto Blue Jays
WSN	Washington Nationals

Code Used

Below is all the code used to generate this report.

# Load libraries and settings here
library(tidyverse)
library(here)
library(Lahman)
library(janitor)
library(plotly)
library(ggforce)

knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    echo = F,
    comment = "#>",
    fig.path = "figs/", # Folder where rendered plots are saved
    fig.width = 8, # Default plot width
    fig.height = 5, # Default plot height
    fig.retina = 3 # For better plot resolution
)

# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_minimal(base_size = 20))

# Write code below here to load any data used in project
payroll_data <- read_csv(here('data_raw', 'payroll_data.csv'))
inflation_factors <- read_csv(here('data_raw', 'inflation_factors.csv'))

# For appendix
payroll_dict <- read_csv(here('data_raw', 'payroll_data_dict.csv'))
infl_dict <- read_csv(here('data_raw', 'inflation_factors_dict.csv'))
teams_dict <- read_csv(here('data_raw', 'team_dict.csv'))
s_p_dict <- read_csv(here('data_raw', 'stats_payroll_dict.csv'))
abbrevs <- read_csv(here('data_raw', 'abbrevs.csv'))
payroll_data <- payroll_data %>% 
    clean_names() %>% 
    mutate(nominal_est_payroll = est_payroll) %>% 
    select(c(year, franch_id, nominal_est_payroll)) %>% 
    filter(is.na(nominal_est_payroll) == F)

inflation_factors <- inflation_factors %>% 
    clean_names() 

teams <- Teams %>% 
    clean_names() %>% 
    select(c(year_id, team_id, franch_id, g, w, l)) %>% 
    filter(year_id >= 1969) %>% 
    mutate(
        team_id = as.character(team_id),
        franch_id = as.character(team_id),
        win_pct = w / g
    )
# CREATE Stats and Payroll Master Data Frame
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SF","SFG")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CAL","ANA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="LAA","ANA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CHA","CHW")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="CHN","CHC")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="KCA","KCR")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="LAN","LAD")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="MON","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="WS2","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="WAS","WSN")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="NYA","NYY")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="NYN","NYM")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SDN","SDP")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SE1","MIL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="ML4","MIL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SFN","SFG")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="SLN","STL")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="FLO","FLA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="MIA","FLA")
teams$franch_id<-replace(teams$franch_id, teams$franch_id=="TBA","TBD")


# Merging df
stats_and_payroll <- merge(payroll_data, teams, by.x=c("franch_id", "year"), by.y=c("franch_id", "year_id"), all = TRUE) 

# Adjust df for new variables and removing NA
stats_and_payroll <- stats_and_payroll %>% 
    filter(!is.na(nominal_est_payroll) & !is.na(win_pct)) %>% 
    mutate(post_CBT = ifelse(year > 1993, "Post-CBT", "Pre-CBT"))

# Get rid of inflation variables that don't apply 
inflation_factors_adj <- inflation_factors %>% 
    filter(year > 1968)

# Create new df of applicable inflation factors and add to stats_and_payroll
count <- enframe(table(stats_and_payroll$year)) %>% 
    mutate_if(is.table, as.numeric) %>% 
    mutate(name = as.numeric(name))
infl_merge <- merge(count, inflation_factors_adj, by.x="name", by.y = "year")
count_array <- rep(infl_merge$infl_factor, times=infl_merge$value)
stats_and_payroll <- stats_and_payroll[with(stats_and_payroll, order(year)),]
stats_and_payroll <- stats_and_payroll %>% 
    mutate(
        infl_factor = count_array, 
        payroll_adj_for_infl = infl_factor*nominal_est_payroll, 
        ) 

plot <- stats_and_payroll %>% 
    ggplot(aes(x = payroll_adj_for_infl)) +
    geom_histogram(size = 0.5, color = "#002982", fill = "#002982", alpha = 0.7) + 
    labs(
        title = "Payroll Histogram", 
        subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
    ) + 
    xlab("Team Payroll (in Millions of $)") +
    ylab("Count") +
    theme_light() + 
    theme(legend.position = "none") +
     scale_x_continuous(
         breaks = c(0, 10^8, 2*10^8, 3*10^8), 
         labels = c("0", "100", "200", "300"),
         limits = c(0, 3.005*10^8)) +
    theme(
        legend.title = element_blank(),
        plot.title=element_text(size=20), 
        axis.title.y = element_text(margin = margin(r=9)),
        axis.title.x = element_text(margin = margin(t=9))
    )
ggplotly(plot)
plot <- stats_and_payroll %>%
    ggplot(aes(x = win_pct)) +
    geom_density(size = 0.5, color = "#002982", alpha = 0.7, fill = "#002982") +
    theme_light() + 
    labs(
        title = "Winning Percent Distribution", 
        subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
    ) + 
    xlab("Regular Season Winning Percent") + 
    ylab("Density") +
    theme(
        plot.title=element_text(size=20), 
        axis.title.y = element_text(margin = margin(r=9)),
        axis.title.x = element_text(margin = margin(t=9))
    )

plot
pre <- stats_and_payroll %>% 
    filter(year <= 1993)

post <- stats_and_payroll %>% 
    filter(year >= 1994)

plot <- stats_and_payroll %>% 
    ggplot(aes(
        x = payroll_adj_for_infl, 
        y = win_pct),
        color = post_CBT
    ) +
    geom_point(size = 1, color = '#002982') +
    facet_wrap(vars(post_CBT), nrow = 1) +
    labs(
        title = "Winning Percent based on Payroll", 
        subtitle = "MLB Franchises Pre and Post-CBT (1969-2021)"
    ) + 
    scale_x_continuous(
         breaks = c(0, 10^8, 2*10^8, 3*10^8), 
         labels = c("0", "100", "200", "300"),
         limits = c(0, 3.005*10^8)) +
    xlab("Team Payroll (in Millions of $)") + 
    ylab("Regular Season Winning Percent") + 
    theme_light() + 
    theme(
        legend.title = element_blank(),
        plot.title=element_text(size=20), 
        axis.title.y = element_text(margin = margin(r=9)),
        axis.title.x = element_text(margin = margin(t=9))
    )

plot
preCBT <- stats_and_payroll %>% 
    filter(post_CBT == "Pre-CBT") %>% 
    group_by(franch_id) %>% 
    summarise(avg_pre_win_pct = mean(win_pct)) 

postCBT <- stats_and_payroll %>% 
    filter(post_CBT == "Post-CBT", franch_id != c('ARI', 'TBD')) %>% 
    group_by(franch_id) %>% 
    summarise(avg_post_win_pct = mean(win_pct))

lineGraphData <- merge(preCBT, postCBT)

scatterData <- lineGraphData %>% 
    rename(Team=franch_id,
           "Pre-CBT"=avg_pre_win_pct,
           "Post-CBT"=avg_post_win_pct) %>%
    gather("Pre-CBT", "Post-CBT", key = "CBT",value = number)

combinedAvg <- inner_join(scatterData, lineGraphData, by = c("Team" = "franch_id"))

plot <- combinedAvg %>% 
    ggplot() +
    geom_segment(size = 0.8, color ="grey",
        aes(
            x = avg_pre_win_pct,
            xend = avg_post_win_pct,
            y = reorder(Team, avg_post_win_pct),
            yend = Team
        )
    ) +
    geom_point(aes(y = reorder(Team, avg_post_win_pct), x=number, color=CBT, shape=CBT),) +
    scale_color_manual(values = c("Pre-CBT" = "#002982","Post-CBT"= "#D50032")) +
    scale_shape_manual(values = c("Pre-CBT" = 16, "Post-CBT" =  17)) +
    labs(
        title = "Average Winning Percent Pre- and Post-CBT", 
        subtitle = "By MLB Franchise"
    ) + 
    ylab("Team Franchise ID") + 
    xlab("Regular Season Winning Percent") + 
    theme_light() + 
    theme(
        plot.title=element_text(size=18), 
        axis.title.y = element_text(margin = margin(r=9)),
        axis.title.x = element_text(margin = margin(t=9))
    )

plot
knitr::kable(payroll_dict)
knitr::kable(infl_dict)
knitr::kable(teams_dict)
knitr::kable(s_p_dict)
knitr::kable(abbrevs)