Analyzing Olympic Success

Analyzing the Correlation Between Medal Count and Socioeconomic Factors

Author

Hannah Hodge, Sandy Alksninis, Jedidiah Bechtel, Aedan Bayhan

Published

December 8, 2024

Intro

Since the Olympics was restarted in 1896, The United States of America has seemingly run away with the total medal count. Most attribute this to the sheer number of athletes that comprise the United States delegation every Olympic Games, others gaze at the massive economy, and some even ponder if patriotism plays a role in the country’s success. Using data up to 2016 and not including countries that no longer exist such as the Soviet Union, the ten countries with the highest medal count are The United States, the United Kingdom, Germany, France, Italy, China, Sweden, Japan, Norway, and Australia. Of the 206 countries and delegations that have won a combined total of over 20,000 medals almost half of them have been won by an athlete from one of these top 10 countries.

Research Question

What are the correlations between the Olympic medal counts of different countries and various factors such as GDP, public healthcare, and the number of athletes per country? Does the evidence from these factors help to explain why the top 10 countries are so successful?

Data Sources

Medal Counts

The source was posted to Kaggle by a user called SJ. SJ scraped the data from the Olympics website using BeautifulSoup and then wrangled it so it could be analyzed. Neither of the data sources seems like they would be biased. The only problem we could think of is that some of the data could have gotten jumbled when it was being wrangled by SJ as for the GDP information it was downloaded straight from the World Bank Groups website and posted to Kaggle so we do not think that it would have been altered.

Medalists

Original Data From Sports Reference

The data that was used was posted to Kaggle by a user who goes by RGRIFFIN. The data is originally from a website called Sports-references and was scraped and put into R by RGRIFFIN. According to some of the comments, some medalists are missing from the data set, but after a brief investigation, it was just a few people from distant past Olympics where records most likely were lost or not kept well. The potential bias that might appear in datasets like this one is a grooming of data to make a certain country better than they are but there appears to be no bias based on what was collected and how.

The World Bank

The data for GDP Per Capita, GDP, and Healthcare Expenditure were all downloaded from the World Bank. The World Bank is an organization affiliated with the United Nations and specializes in unbiased data collection and management. They update their figures annually and we pulled data from the most recently updated data frames.

Results and Analysis

There are graphs and visualizations below that explore correlations and relationships between Medalists, GDP and GDP per capita, healthcare investments, and number of athletes. In the graphs, the top en countries by medal count are colored in red. This is to easily observe where they fall on each graph.

Changes in Medalists

Two graphs examine the top ten countries in terms of total overall medals and the Olympic games. The animated graph depicts the number of medalists at every Olympic Games since 1896. The purpose behind this graph is to observe a general trend throughout time thus the choice of a line chart. At the beginning of the time lapse the graph is fairly standardized with a few major shake-ups however, as time progresses the graph becomes hectic and messy so to speak, which is where the first graph comes in. The first graph is a dumbbell chart that displays the change in medalists from the year 2000 to the year 2016 to provide more clarity on what exactly was happening at the end of the time-lapse. Both of the graphs make use of the same data set which contains information on individual medalists. The data frame that the charts use, however, was a summary of individual medalist information grouped by year and country, which is how the total medalist for each country was found in a given year.

Code
df_people <- read_csv(here::here('data_raw','athlete_events.csv'))

df_country <- read_csv(here::here('data_raw','noc_regions.csv'))

df <- left_join(df_people, df_country, by = "NOC" ) %>% 
    filter(!is.na(Medal)) %>% 
    select(-notes)

df_by_country <- df %>% 
    group_by(Year, region) %>%
    summarise(
        count = n(),
        .groups = 'drop'  # This drops the grouping after summarizing
    ) %>%
    arrange(Year, desc(count)) %>% 
    filter(region %in% c("USA", "Australia", "China", "Germany", "Italy", "UK", "Canada", "Japan", "Sweden","France"))

view(df_by_country)

write_csv(df_by_country, "data_processed/df_by_country.csv")


change_in_medalists <- df_by_country %>% 
   mutate(Year = factor(Year)) %>% 
   filter(Year %in% c(2000,2016)) %>% 
   group_by(region) %>%
   arrange(Year) %>%
   mutate(delta_label = paste0(" Δ ", count[2] - count[1])) %>%
   ggplot(aes(x = count, y = region)) +
   geom_line(aes(group = region), color = 'lightblue') +
   geom_point(aes(color = Year), size = 2.5) +
   geom_text(
       data = . %>% group_by(region) %>% slice(n()), # only label last point of each group
       aes(x = count, label = delta_label), 
       hjust = -0.2, 
       vjust = 0.5,
       color = "darkgrey",
       size = 3
   ) +
   scale_color_manual(values = c('lightblue','steelblue')) +
   theme_minimal() +
   theme(
       panel.grid.major.y = element_line(color = "grey90", linetype = "dashed"),
       panel.grid.major.x = element_blank(),
       panel.grid.minor = element_blank(),
       legend.position = "top",
       text = element_text(family = "Arial", color = "grey30")
   ) +
   labs(
       title = "Total Medalists by Countries Across Years",
       x = "Count",
       y = "Country",
       color = "Year"
   ) +
   scale_x_continuous(
       expand = expansion(mult = c(0.1, 0.2))
   )

ggsave("figs/change_in_medalists.png", width = 10, height = 6, dpi = 300)

change_in_medalists

The dumbbell chart takes a closer look into the number of medalists from 2000 to 2016. The key point that this chart displays is the change in the number of medalists. This way countries with a large change can be looked into. For example, Australia experienced a massive decrease in medalists over that time interval. One factor that could have possibly contributed to this change is that in 2000 Sydney Australia hosted the Olympics. The home-field advantage could have led to an increase in medalists. Most countries experienced growth in their medalist count with the UK increasing theirs by a whopping 91 people. This success could be due to several factors that will be explored later on, but one that stuck out was an increase in funding for athletics by the country. In all the categories that received a funding increase, more success was found.

Code
#animated_region_chart <- ggplot(df_by_country, aes(x = Year, y = count, color = region, #group = region)) +
#  geom_line(size = 1.2) +
#  geom_point(size = 2) +
#  
#
#  geom_text(
#    aes(label = region),
#
#    nudge_x = 0.2,
# 
#    size = 3,
#    
#    fontface = 'bold'
#  ) +
#  
#
#  theme_minimal() +
#  theme(
#    legend.position = "none",  
#    plot.title = element_text(size = 15, face = "bold"),
#    axis.title = element_text(size = 12),
#    panel.grid.minor = element_blank()
#  ) +
#  
#
#  labs(
#    title = 'Medalists Count Progression by Country',
#    x = 'Year',
#    y = 'Medal Count'
#  ) +
#  
#  scale_color_viridis_d() +
#  
# 
#  scale_x_continuous(expand = expansion(mult = c(0.1, 0.2))) +
#  
#  
#  transition_reveal(Year) +
#  ease_aes('linear')
#
#
#animate(
#  animated_region_chart, 
#  duration = 20,   
#  fps = 20,        
#  width = 800,     
#  height = 500    
#)
#
#anim_save("figs/region_count_progression.gif")
#

Throughout the animated graph, it is evident that the United States and Germany have led the charge when it comes to medalists over the past 150 years. While the amount of medalists fluctuated when it came to different years the front runners remained pretty much the same throughout time until recently. The key takeaway from this graph is that over the past few years, The Olympics games have gotten much more competitive with more countries participating and more countries achieving success. Over the rest of the report, the question of what has sparked this increase in competitiveness will be explored.

GDP Per Capita

The interactive graph below shows the relationship between GDP per capita, total medal wins by country, and population. The graph only displays the top 50 medal-winning countries in order to reduce the clutter from countries that aren’t competitive on the global stage. The graph clearly demonstrates that countries with a higher GDP per capita win more medals. The graph also demonstrates that despite having smaller populations, GDP per capita makes up a bigger factor in medal wins.

Code
df_GDP_percap_long <- df_GDP_Per_Capita %>% 
    select(-`Indicator Name`, -`Indicator Code`) %>%
    pivot_longer(
        cols = -c(`Country Name`, `Country Code`),
        names_to = "year",
        values_to = "GDP_Per_Capita"
    )

df_GDP_long <- df_GDP %>% 
    pivot_longer(
        cols = -c(`Country`, `Country Code`),
        names_to = "year",
        values_to = "GDP"
    )

df_pop_long <- df_Pop %>% 
    select(-`Indicator Name`, -`Indicator Code`) %>%
    pivot_longer(
        cols = -c(`Country Name`, `Country Code`),
        names_to = "year",
        values_to = "Population"
    )

df_pop_2016 <- df_pop_long %>% 
    filter(year == "2016") %>% 
    select(-`Country Name`)%>% 
    select(-`year`)

df_GDP_2016 <- df_GDP_long %>% 
    filter(year == "2016") %>% 
    rename(countries = `Country`)

df_GDP_percap_2016 <- df_GDP_percap_long %>% 
    filter(year == "2016") %>% 
    rename(countries = `Country Name`)



df_medals <- df_medals %>%
    mutate(`Country Code` = gsub("[()]", "", ioc_code)) %>% 
    select(-ioc_code)%>%  
    select(-countries) %>%
    mutate(`Country Code` = ifelse(`Country Code` == "GER", "DEU", `Country Code`))





Percap_final_df <- left_join(df_medals, df_GDP_percap_2016, by = "Country Code")
#view(Percap_final_df)



Percap_final_df <- left_join(Percap_final_df, df_pop_2016, by = "Country Code")



GDP_final_df <- left_join(df_medals, df_GDP_2016, by = "Country Code")

#view(GDP_final_df)


Percap_final_df <- Percap_final_df %>%
    filter(!is.na(GDP_Per_Capita), !is.na(total_total), !is.na(Population))

Percap_final_df <- Percap_final_df %>% 
    slice_max(total_total, n = 50)

GDP_final_df <- GDP_final_df %>%
    filter(!is.na(GDP), !is.na(total_total))%>% 
    slice_max(total_total, n = 50)

top_10_countries_Percap <- Percap_final_df %>%
    arrange(desc(total_total)) %>%
    slice_head(n = 10)



p <- Percap_final_df %>% 
    ggplot(aes(text = paste("Country:", countries, 
                            "<br>GDP Per Capita:", round(GDP_Per_Capita),
                            "<br>Total Medals:", total_total,
                            "<br>Population:", Population),
               x = GDP_Per_Capita, y = total_total, size = Population)) +
    geom_point(aes(color = ifelse(countries %in% top_10_countries_Percap$countries, "red", "gray40"))) +
    scale_size_continuous(range = c(0.1, 15), name = "Population") +
    scale_y_continuous(labels = scales::comma) +
    theme_bw() +
    labs(
        x = "GDP Per Capita",
        y = "Total Number of Medals Won",
        title = "Total Number of Medals Won by Countries against GDP Per Capita \n(Bubble size indicates population)"
    )+
  scale_color_identity()+
  theme(legend.position = "none") +
    scale_x_log10()

ggplotly(p, tooltip = "text")
Code
ggsave("figs/GDP_per_capita.png", plot = p, width = 8, height = 6)

write.csv(GDP_final_df, "data_processed/GDP_final_df.csv", row.names = FALSE)
write.csv(Percap_final_df, "data_processed/Percap_final_df.csv", row.names = FALSE)

GDP

The graph below shows the correlation between GDP and medals won. There is a very clear correlation between the productivity of a nation and how many medals they win.

Code
top_10_countries_GDP <- GDP_final_df %>%
    arrange(desc(total_total)) %>%
    slice_head(n = 10)


library(ggrepel)

GDP_final_df %>% 
    ggplot(aes(x = GDP, y = total_total)) +
    geom_point(aes(color = ifelse(countries %in% top_10_countries_GDP$countries, "red", "gray"))) +
    scale_x_log10(
        breaks = scales::log_breaks(base = 10), 
        labels = scales::label_number(scale_cut = scales::cut_short_scale())) + 
    geom_smooth(color = "gray40", se = FALSE) +
    labs(
        title = "GDP vs. Total medals",
        subtitle = "GDP is displayed on a logarithmic scale for clarity",
        x = "GDP",
        y = "Total medals") +
    scale_color_identity() +
    theme_minimal()

Average Healthcare Expenditure

When thinking about possible causes for Olympic medal counts, healthcare is a possibility because a country’s investment in healthcare can have a significant impact on the overall health of its population. Countries with better healthcare systems may have better-performing athletes because of access to medical support and injury prevention. Additionally, well-developed healthcare infrastructure can contribute to higher levels of physical fitness and long-term athletic development programs. These factors may give athletes from these countries a competitive advantage in the Olympics, leading to higher medal counts.

The scatter plot below illustrates the average healthcare expenditure per capita versus the Olympic medal count. Each dot represents a country that has competed in the Olympics.

Code
healthcare <-  read_excel(here('data_raw','healthcare_investments.xls'), skip = 2)
df_medals <- read_csv(here('data_raw',"olympics_medals_country_wise.csv"))
df_medals <-  df_medals %>% 
    mutate('Country Code' = gsub("[()]", "", ioc_code)) %>% 
    select(-ioc_code)
df_medals <- df_medals[, -c(1:15)]

healthcare[56, "Country Code"] <- "GER"
healthcare <- healthcare[, colSums(!is.na(healthcare)) > 0]
healthcare <- healthcare[, -c(3, 4)]
healthcare$Total <- rowSums(healthcare[, c(3:23)], na.rm = TRUE)

mergedDF <- left_join(healthcare, df_medals, by = "Country Code") 
mergedDF <- mergedDF %>%
    filter(!is.na(total_total))
mergedDF <- mergedDF[mergedDF$Total != 0, ]
mergedDF$averageHC <- mergedDF$Total/22


mergedDF <- mergedDF %>% 
    arrange(desc(total_total))
top10 <- c('USA', 'GBR', 'GER', 'FRA', 'CHN', 'SWE', 'AUS', 'JPN', 'ITA', 'NOR')
mergedDF <- mergedDF %>% 
    mutate(color_group = ifelse(`Country Code` %in% top10, "Top 10", "Other"))

healthcare_vs_medals<- mergedDF %>%
    ggplot(aes(x = total_total, y = averageHC)) + 
    geom_smooth(se = FALSE, color = "gray40") +
    geom_point(aes(color = color_group))+
    scale_x_log10() + 
    labs(
        x = "Medal Count",
        y = "Healthcare Expenditure (thousands USD)",
        title = "Average Healthcare Expenditure versus Medal Count",
        subtitle = "",
        caption = "Sources: World Bank Group, olympics.com
        \n The x-axis is shown on a logarithmic scale for clarity. Highlighted in red are the countries with the highest medal counts. \n 
        The y-axis is the average healthcare expenditure per capita since 2000 and is shown in thousands of USD. "
    ) +
    theme_minimal() +
    scale_color_manual(values = c("Top 10" = "red", "Other" = "gray")) +
    theme(legend.position = "none") +
    geom_text(
        data = subset(mergedDF, `Country Code` == "CHN"),
        aes(label = `Country Name`), 
        nudge_y = 0.5, 
        nudge_x = 0.2,
        size = 4,
        color = "gray40" 
    ) +
    geom_text(
        data = subset(mergedDF, `Country Code` == "NOR"), 
        aes(label = `Country Name`), 
        nudge_y = 0.1,  
        nudge_x = 0.25,  
        size = 4,       
        color = "gray40" 
    )+
    geom_text(
        data = subset(mergedDF, `Country Code` == "USA"), 
        aes(label = `Country Name`), 
        nudge_y = 0.1,  
        nudge_x = -0.4,  
        size = 4,       
        color = "gray40"
    )

healthcare_vs_medals

Code
final_df <- mergedDF
write.csv(final_df, "data_processed/final_df", row.names = FALSE)

At first glance, there is a strong relationship between healthcare expenditure per capita and Olympic medal counts. Healthcare expenditure seems to be a big factor that influences Olympic medal counts. Out of the top ten countries in medal count, the only outliers are China and Norway. A reason the healthcare expenditure per capita in China is so low may be due to the unusually large population of China. Norway spends a large amount of money on healthcare and is a leading country in healthcare spending. The other eight top countries remain close to the trend.

Clearly, there are many countries that have high healthcare expenditure per capita but are not in the top Olympic medal counts. These countries fall on the left side of the graph above the lines of best fit. There are most likely other factors influencing the medal count. In general, more countries fall within the trend of average healthcare expenditure per capita versus the Olympic medal count.

Athletes at the Olympics

This chart illustrates the correlation between the total number of medals won by different countries and the number of athletes competing in the 2024 Summer Olympics in Paris.

Code
df_athletes <- read_csv(here("data_raw", "athletes.csv"))
df_medals_total <- read_csv(here("data_raw", "medals_total.csv"))

athlete_counts <- df_athletes %>%
  group_by(country) %>%
  summarize(athlete_count = n())

merged_data <- athlete_counts %>%
  inner_join(df_medals_total, by = "country")

top10 <- c('USA', 'GBR', 'GER', 'FRA', 'CHN', 'SWE', 'AUS', 'JPN', 'ITA', 'NOR')
merged_data <- merged_data %>% 
    mutate(color_group = ifelse(`country_code` %in% top10, "Top 10", "Other"))



ggplot(merged_data, aes(x = athlete_count, y = Total)) + geom_smooth(se = FALSE, color = "gray40") +
  geom_point(aes(color = color_group)) +
  labs(
    title = "Relationship Between Number of Athletes and Medal Count",
    x = "Number of Athletes",
    y = "Total Medal Count",
    subtitle = "The 2024 Paris Olympics"
  ) +
  theme_minimal() +
    scale_color_manual(values = c("Top 10" = "red", "Other" = "gray")) +
        theme(legend.position = "none")

Code
ggsave("athletes_medals_plot.png", width = 25, height = 6, dpi = 300)

Since countries with more athletes win more medals, there is a clear positive correlation. With the most athletes and the most medals, the United States stands out for its wide range of sporting talent and emphasis on sports infrastructure investing in almost every sports branch. Although China has fewer athletes than the United States, it still maintains a considerable scale of participation and a high number of medals, reflecting its investment in competitive sports. European countries such as Germany, France, and the United Kingdom performed well because they struck a balance between having a large number of athletes and distributing medals efficiently. Smaller countries such as South Korea and the Netherlands, with fewer participating athletes, won relatively small medal totals, demonstrating the limitations of size and resources.

This chart illustrates the correlation between the total number of medals won by different countries and the number of athletes competing in the 2020 Summer Olympics in Tokyo

Code
athletes_path <- "data_raw/Athletes.xlsx"
medals_path <- "data_raw/Medals.xlsx"

athletes <- read_excel(athletes_path, sheet = "Details")
medals <- read_excel(medals_path, sheet = "Details")

athletes_summary <- athletes %>%
  group_by(NOC) %>%
  summarise(Number_of_Athletes = n())

medals_summary <- medals %>%
  select(NOC = `Team/NOC`, Medal_Count = Total)

merged_data <- inner_join(athletes_summary, medals_summary, by = "NOC")

write.csv(merged_data, "data_processed/cleaned_athletes_medals.csv")

cleaned_data <- read_csv("data_processed/cleaned_athletes_medals.csv")

cleaned_data <- cleaned_data %>%
  mutate(Medal_Count = as.numeric(Medal_Count))

top_10_countries <- cleaned_data %>%
  arrange(desc(Medal_Count)) %>%
  slice(1:10)


top10 <- c('United States of America', 'Great Britain', '   
Germany', 'France', 'People\'s Republic of China', 'Sweden', 'Australia', 'Japan', 'Italy', 'Norway')

cleaned_data <- cleaned_data %>% 
    mutate(color_group = ifelse(`NOC` %in% top10, "Top 10", "Other"))

bubble_chart <- ggplot(cleaned_data, aes(x = Number_of_Athletes, y = Medal_Count)) +
        geom_smooth(se = FALSE, color = "gray40") +
  geom_point(aes(color = color_group)) + 
  labs(
    title = "Relationship Between Number of Athletes and Medal Count ",
    x = "Number of Athletes",
    y = "Total Medal Count",
    subtitle = "The 2020 Tokyo Olympics"
  ) +
  theme_minimal() +
    scale_color_manual(values = c("Top 10" = "red", "Other" = "gray"))+
    theme(legend.position = "none")


ggsave("bubble_chart_wide.png", plot = bubble_chart, width = 25, height = 6, dpi = 300)


print(bubble_chart)

The data shows how medal performance at international events such as the Olympics is influenced by scale and strategic focus. The two red dots to the left are Norway and Sweden who while not sending a massive number of athletes they more often participate in Winter sports which has much less competition.

Conclusion

To conclude, by analyzing factors such as GDP, healthcare expenditure, and number of athletes there are some very strong correlations that suggest why the top 10 countries are so successful. Unsurprisingly, if a country wants to be successful in the Olympics it needs a large economy, or high income population. It should also invest into the healthcare of its citizens to keep them fit and healthy. Finally, it needs to foster athletics and send as many of its athletes to the Olympics as possible. These factors can help to explain why a highly developed country such as Sweden is able to win so many medals while a highly populated country such as India has not yet achieved the same success.

Attribution

This project was made by Hannah Hodge, Jedidiah Bechtel, Sandy Alksninis, and Aedan Bayhan.

  • The “Changes in Medalists” section was made by Jedidiah Bechtel.
  • The “GDP” and “GDP Per Capita” sections were made by Sandy Alksninis.
  • The “Average Healthcare Expenditure” section was made by Hannah Hodge.
  • The “Athletes at the Olympics” section was made by Aedan Bayhan.

Edited and submitted by Sandy Alksninis and Hannah Hodge.