Research Question:

How does the date on which a game is held affect game attendance in recent years in Major League Baseball?

Annual attendance in Major League Baseball has been declining for the past 13 years. Since the league's attendance peak in 2007, overall attendance has generally declined from season to season. Baseball has been called America’s pastime and was even deemed an essential profession during World War II to maintain the country's morale. However, many modern fans believe the sport is too slow and the games are too long, and attendance continues to decline. To maximize attendance, teams need to understand what influences a modern fan’s decision to attend a game.

Data Sources:

For this project I identified three main data sources. The primary data source is a file called games, which comes from a website called Kaggle, at the link: The data was originally collected by Major League Baseball, published on its website, and then scraped into this CSV file. I could not find the original data, which is not ideal because I cannot identify what modifications have been made to it. Kaggle did provide a link to the original data; however, the link appeared to be out of date and no longer led to the correct page. The table below lists every variable in this data set, along with its type and description.

| Variable Name | Type | Description |
| --- | --- | --- |
| attendance | Double | Number of people who attended the game |
| away_final_score | Double | Final score of the away team |
| away_team | Character | Three-letter abbreviation of the away team |
| date | Date | Year, month, and day the game was played |
| elapsed_time | Double | Number of minutes of game play |
| g_id | Double | Game ID |
| home_final_score | Double | Final score of the home team |
| home_team | Character | Three-letter abbreviation of the home team |
| start_time | Time | The scheduled start time |
| umpire_1B | Character | The name of the first base umpire |
| umpire_2B | Character | The name of the second base umpire |
| umpire_3B | Character | The name of the third base umpire |
| venue_name | Character | The name of the stadium the game was played at |
| weather | Character | Temperature and weather conditions |
| wind | Character | Speed and direction of wind |
| delay | Double | Length of delay in minutes |

The variable I was most interested in from this data set was attendance. I wanted to investigate how it was affected by other variables such as date and weather.

The second data set I looked for was one on stadium capacity. I wanted to account for capacity because results could be misleading if one stadium always has higher attendance than another simply because it has more seats. I could not find a premade data set with this information, but since there are only 30 parks in Major League Baseball, I decided to find the data and create the tibble by hand. I copied the data from this Wikipedia site: While manually inputting data is not ideal, I felt it was my best option because the data set is small and could not be found elsewhere. This new tibble includes three variables, detailed below:

| Variable Name | Type | Description |
| --- | --- | --- |
| venue_name | Character | The current name of the venue |
| stadium_capacity | Double | The total number of seats in the venue |
| year_built | Double | The year the venue was built |

I then combined the first two data sets together into one data set before I began my graphical analysis.

The third data set I looked for went further back and included attendance data for more than three years. I found another data set on Kaggle that covers this topic at this link: This data was scraped from Baseball Reference, a website that tracks a huge amount of data about baseball. However, no link is provided to the original data set, so I cannot find the original source. This data set has data about each team for each season dating back to 1876. The challenge is that the source I got it from did not include variable descriptions, and I was unable to determine what some of the variables mean. I still think it is beneficial for looking at overall league trends. Also, attendance data in this data set is not available until 1890. The table below details the many variables in this data set; however, the only two I used in this project were year and attendance.

| Variable Name | Type | Description |
| --- | --- | --- |
| X1 | Double | The observation's row number in the data set |
| Rk | Double | Unknown |
| Year | Double | The year the observation is from |
| Tm | Character | The team the observation is about |
| Lg | Character | The league that the team was playing in |
| G | Double | Number of games played that season |
| W | Double | Number of games won |
| L | Double | Number of games lost |
| Ties | Double | Number of games tied |
| W.L. | Double | Ratio of wins to total games played |
| pythW.L. | Double | Pythagorean winning percentage (estimated games a team should have won) |
| Finish | Character | Final ranking out of total number of teams |
| GB | Character | Number of games back from top-ranked team |
| Playoffs | Character | How far into the playoffs the team made it that season |
| R | Double | Number of runs scored |
| RA | Double | Runs allowed (number of runs opposing teams scored) |
| Attendance | Double | Total number of attendees for the season |
| BatAge | Double | Unknown |
| PAge | Double | Unknown |
| Top.Player | Character | Top player for that season and team |
| Managers | Character | Managers of the team |
| current | Character | The current team name |

Process for Cleaning Data:

The primary data was already well organized in a CSV file with clean column names, so there was very little I needed to do to clean it. The one problem I ran into was that some parks changed names within the data set or have changed names since. To be consistent, I changed all park names to their most recent venue name. This also made it easier to combine my first two data sets. I joined them on the variable venue_name, which exists in both data sets, to add the data about each venue to the data about each game played at that venue.
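The renaming and joining step described above can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the code used for this report (which worked with tibbles). The rename map, the helper names normalize_venue and join_on_venue, and all attendance and capacity numbers are hypothetical placeholders.

```python
# Hypothetical rename map: older venue name -> most recent venue name.
# Only one illustrative entry is shown; the real map would cover every renamed park.
RENAMES = {"AT&T Park": "Oracle Park"}


def normalize_venue(name, renames=RENAMES):
    """Return the most recent name for a venue, or the name unchanged."""
    return renames.get(name, name)


def join_on_venue(games, venues):
    """Left-join venue attributes onto each game using venue_name as the key."""
    by_name = {v["venue_name"]: v for v in venues}
    joined = []
    for game in games:
        key = normalize_venue(game["venue_name"])
        venue = by_name.get(key, {})  # games with no matching venue keep only their own fields
        joined.append({**game, **venue, "venue_name": key})
    return joined


# Placeholder rows (not real attendance or capacity figures).
games = [
    {"g_id": 1, "venue_name": "AT&T Park", "attendance": 30000},
    {"g_id": 2, "venue_name": "Fenway Park", "attendance": 35000},
]
venues = [
    {"venue_name": "Oracle Park", "stadium_capacity": 40000, "year_built": 2000},
    {"venue_name": "Fenway Park", "stadium_capacity": 37000, "year_built": 1912},
]
joined = join_on_venue(games, venues)
```

Normalizing the name before the lookup is what lets a game recorded under an older park name pick up the capacity stored under the current name.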

Some games also had an attendance of 0 or 1; these are referred to as crowdless games. The first crowdless game in MLB was played in 2015 in Baltimore, with no fans due to civil unrest in the city. I was unable to investigate every crowdless game in this data set, though they appear to occur mainly because of civil unrest in the city where the game was to be played or in doubleheaders where attendance was recorded for only one of the games. I chose to remove them from the data set because they are rare and because they are generally crowdless due to factors not measured within this data set.
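As a rough illustration of this filter, the Python sketch below drops any game whose recorded attendance is 0 or 1. The function name drop_crowdless and the sample rows are hypothetical; only the 0-or-1 cutoff comes from the description above.

```python
def drop_crowdless(games, threshold=1):
    """Keep only games whose recorded attendance is above the threshold (default 1)."""
    return [g for g in games if g["attendance"] > threshold]


# Placeholder rows (not real games).
sample = [
    {"g_id": 1, "attendance": 0},      # crowdless
    {"g_id": 2, "attendance": 1},      # crowdless (e.g. one half of a doubleheader)
    {"g_id": 3, "attendance": 28000},  # kept
]
kept = drop_crowdless(sample)
```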

Creation of New Variables:

The data set had a single weather variable that stored both the temperature and the conditions as one string. I separated it into two variables: one holds the temperature stored as a double, and the other holds the weather conditions stored as a string, which can be used as categorical data.
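One way to perform this kind of split is with a regular expression. The Python sketch below assumes the weather strings look like "73 degrees, partly cloudy"; the raw format is not shown in this report, so that pattern, like the function name split_weather, is an assumption and would need adjusting to the real data.

```python
import re

# Assumed format: "<number> degrees, <conditions>". This is a guess at the raw layout.
WEATHER_PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*degrees?\s*,\s*(.+?)\s*$", re.IGNORECASE
)


def split_weather(weather):
    """Split a weather string into (temperature as float, conditions as lowercase str).

    Returns (None, None) when the string does not match the assumed format.
    """
    match = WEATHER_PATTERN.match(weather)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2).lower()


temperature, conditions = split_weather("73 degrees, partly cloudy")
```

Returning a pair of None values for unmatched strings makes malformed rows easy to count before deciding how to handle them.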

I also created a new variable called percent_capacity, which is the attendance divided by the total capacity of the stadium, expressed as a percentage. The goal was to allow a fairer comparison between venues of different sizes.
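A minimal Python sketch of this calculation is shown below, assuming the value is scaled to a percentage (the group means reported later are on a 0 to 100 scale). The function name and the example numbers are placeholders, not values from the data.

```python
def percent_capacity(attendance, stadium_capacity):
    """Return attendance as a percentage of stadium capacity."""
    if stadium_capacity <= 0:
        raise ValueError("stadium_capacity must be positive")
    return 100.0 * attendance / stadium_capacity


# Placeholder example: 30,000 fans in a 40,000-seat park.
example = percent_capacity(30000, 40000)
```

With this definition a value can exceed 100 whenever the reported attendance is higher than the listed seating capacity.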


The first data set I worked with in my results section was the historical data set. The two variables I used were year and total attendance per year, a summary variable that I created. The years range from 1890 to 2015. The spread of total attendance is shown in a box plot below. As the boxplot shows, there is a large spread, particularly above the median; however, there are no outliers in this data.

## Rows: 141
## Columns: 2
## $ Year             <dbl> 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884…
## $ total_attendance <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).
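The yearly summary variable can be sketched as a group-and-sum over the team-season rows. The Python below is illustrative only: the function name is hypothetical, the rows are placeholders, and the choice to record None for a year with no attendance figures (mirroring the NA values in the output above) is an assumption about how missing values were handled.

```python
def total_attendance_by_year(rows):
    """Sum team-season attendance per year.

    A year whose rows all have missing attendance maps to None.
    """
    totals = {}
    for row in rows:
        year, attendance = row["Year"], row["Attendance"]
        totals.setdefault(year, None)
        if attendance is not None:
            totals[year] = (totals[year] or 0) + attendance
    return dict(sorted(totals.items()))


# Placeholder rows (not real attendance figures).
rows = [
    {"Year": 2007, "Tm": "A", "Attendance": 100},
    {"Year": 1876, "Tm": "B", "Attendance": None},
    {"Year": 2007, "Tm": "C", "Attendance": 250},
]
yearly = total_attendance_by_year(rows)
```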

The next variables I worked with were day of the week, which takes the standard seven values, and mean game attendance percentage by venue for each day of the week. The boxplot below shows the spread of average attendance percentage. It shows a very even spread with no outliers.

## Rows: 217
## Columns: 4
## $ grouping_variable <chr> "Angel Stadium of Anaheim1", "Angel Stadium of Anah…
## $ mean_day_venue    <dbl> 82.43382, 76.46182, 77.18735, 77.66853, 76.64184, 8…
## $ week_day          <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "…
## $ venue_name        <chr> "Angel Stadium of Anaheim", "Angel Stadium of Anahe…
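The grouping behind this summary can be sketched as follows in Python. The numbering of weekdays from 1 to 7 with Sunday as 1 is an assumption inferred from the output above, and the function names and sample rows are hypothetical placeholders.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def week_day(d):
    """Number the weekday 1-7 with Sunday as 1 (an assumed convention)."""
    return d.isoweekday() % 7 + 1


def mean_by_venue_and_day(games):
    """Average percent_capacity for each (venue_name, week_day) pair."""
    groups = defaultdict(list)
    for game in games:
        key = (game["venue_name"], week_day(game["date"]))
        groups[key].append(game["percent_capacity"])
    return {key: mean(values) for key, values in groups.items()}


# Placeholder rows (not real attendance percentages).
games = [
    {"venue_name": "Fenway Park", "date": date(2018, 4, 7), "percent_capacity": 90.0},   # Saturday
    {"venue_name": "Fenway Park", "date": date(2018, 4, 14), "percent_capacity": 80.0},  # Saturday
    {"venue_name": "Fenway Park", "date": date(2018, 4, 10), "percent_capacity": 70.0},  # Tuesday
]
means = mean_by_venue_and_day(games)
```

Grouping on the pair of venue and weekday gives one row per combination, which matches the shape of the summary table above.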

After looking at the days of the week, I wanted to continue examining how the date affects attendance, so I experimented with grouping by month. The months range from 3 to 10, as baseball is only played March through October. The boxplot for mean attendance by month and venue is below. Again, it shows a very even spread with no outliers.

## Rows: 227
## Columns: 4
## $ grouping_variable <chr> "Angel Stadium of Anaheim10", "Angel Stadium of Ana…
## $ mean_month_venue  <dbl> 69.94456, 82.38399, 78.57694, 81.84634, 87.20427, 8…
## $ month             <chr> "0", "4", "5", "6", "7", "8", "9", "0", "4", "5", "…
## $ venue_name        <chr> "Angel Stadium of Anaheim1", "Angel Stadium of Anah…

Next I examined the temperature on the day each game was played. The interquartile range is relatively small, which has led to a larger number of outliers than seen in the previous variables.
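The link between a narrow interquartile range and more outliers follows from the usual boxplot convention of flagging points more than 1.5 times the IQR beyond the quartiles. The Python sketch below illustrates that rule under the assumption that the boxplots in this report use it; the temperatures are made-up placeholders and the function name is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Placeholder temperatures: tightly clustered around 70-76 with one cold game.
temps = [68, 70, 71, 72, 73, 74, 75, 76, 40]
flagged = iqr_outliers(temps)
```

Because the middle half of these temperatures spans only a few degrees, the fences sit close to the quartiles, so a single unusually cold game falls outside them.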

The last variable I examined was the weather conditions on the day the game was played. As seen in the histogram below, there is large variation in the number of games played in different conditions. For example, almost no games were played in snow, and very few were played in rain or drizzle. The most common condition for games to be played in is partly cloudy. This is helpful to keep in mind when looking at the final chart: results for the rarer weather conditions can be skewed more easily by one or two games whose attendance may have been affected by factors other than the weather.


## Rows: 141
## Columns: 2
## $ Year             <dbl> 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884…
## $ total_attendance <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## Rows: 125
## Columns: 3
## $ Year                <dbl> 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1…
## $ total_attendance    <dbl> 1329632, 1808069, 1817573, 2208271, 2143525, 2202…
## $ attendance_millions <dbl> 1.329632, 1.808069, 1.817573, 2.208271, 2.143525,…
## Rows: 141
## Columns: 2
## $ Year             <dbl> 2007, 2008, 2006, 2005, 2012, 2013, 2014, 2015, 2009…
## $ total_attendance <dbl> 79484718, 78624315, 76043902, 74915268, 74859268, 74…

As I mentioned at the beginning of this report, attendance at Major League Baseball games has been in decline since its peak in 2007. This chart shows trends in major league attendance over the last 130 years. Overall, attendance has trended clearly upward across that span. There have been periods of decline before, and we are clearly in one of those periods now. While this is not directly related to the research question, it provides context for the current state of baseball attendance.

## Rows: 217
## Columns: 4
## $ grouping_variable <chr> "Angel Stadium of Anaheim1", "Angel Stadium of Anah…
## $ mean_day_venue    <dbl> 82.43382, 76.46182, 77.18735, 77.66853, 76.64184, 8…
## $ week_day          <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "…
## $ venue_name        <chr> "Angel Stadium of Anaheim", "Angel Stadium of Anahe…

I decided to categorize this chart by venue rather than by team because the venue is inherently part of percent_capacity. This chart shows that the majority of venues see lower attendance on weekdays than on weekends. There are a few notable exceptions, such as Fenway Park, Oracle Park, and Wrigley Field, but most venues have higher attendance on weekend days and on days closer to the weekend.

## Rows: 7
## Columns: 2
## $ day_of_week <dbl> 1, 2, 3, 4, 5, 6, 7
## $ mean        <dbl> 75.91350, 63.58868, 62.45255, 63.37794, 65.32217, 74.4481…