Gender Representation in the Film Industry

Research Question: How has the increasing presence of women in the film industry impacted the success of films?

An analysis from 2000 to 2016.

Throughout the 21st century, women have made incredible progress toward autonomy and equality in American society. Even so, they remain underrepresented in the film industry. This study examines how the representation of women in the film industry, from the directors to the stars of the movies, has impacted the success of films over the years.

Data Sources for Analysis:

Data Source One:

https://www.kaggle.com/danielgrijalvas/movies

The dataset we use as our base reference is titled “Movie Industry.” It is a CSV file that contains thousands of data points about movies from 1987 to 2016. It covers 6820 movies with the following attributes: budget, production company, country of origin, director, main genre, gross revenue, name of the movie, rating, release date, duration, Internet Movie Database (IMDb) user rating, number of user votes, main actor/actress, writer, and year. The data comes directly from Kaggle, which is considered one of the world’s largest public data science communities, and the dataset was constructed using information directly from IMDb. It also has a Kaggle usability score of 10.0/10.0, which suggests that it is easy to interpret and properly maintained.

Data Source Two:

https://github.com/taubergm/HollywoodGenderData/blob/master/all_actors_movies_gender_gold.csv

The second data source is titled “all_actors_movies_gender_gold,” from Hollywood Gender Data, and it is a CSV file that contains thousands of data points about the lead role in different movies from 2000 to 2018. The variables include the name of the lead actor/actress, gender of the lead role, year, name of the movie, country the film was made in, budget, gross revenue, average runtime, language of the film, and the exact date the film was released. The data comes directly from GitHub, a software development platform with more than 100 million projects that allows users to submit their own files for review and public use. The producer of this data has posted a number of repositories that have been used in many projects for analysis.

Data Source Three:

https://github.com/taubergm/HollywoodGenderData/blob/master/all_directors_gender.csv

The third data source is titled “all_directors_gender,” also from Hollywood Gender Data, and it is a CSV file that contains thousands of data points about the directors of different movies from 2000 to 2018. It covers 5077 movies with the following attributes: the name of the director, gender of the director, year the film was released, name of the movie, country the film was made in, budget, gross revenue, average runtime, language of the film, and the exact date it was released. The data comes directly from GitHub, from the same user who, as explained previously, has uploaded many datasets for public use.

Description of Data:

The relevant variables in our data are the gender of the actors and directors, IMDb score, release year of the movie, and gross revenue. For the purposes of our analyses, we computed a number of variables to evaluate our research question. First, we noticed that multiple countries were present in all the datasets. We decided to analyze solely films that were made and released in the United States to make our research more relevant to our audience. Next, the movie industry dataset contained data from 1987 through 2016, whereas the actor and director datasets contained data from 2000 to 2018. When we joined our base movie industry data to the actor and director datasets, the overlapping years were 2000 to 2016, which is the time period we used for our analysis.
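The overlap of the two coverage periods can be checked with a small sketch. It uses base R only, and the variable names are illustrative rather than taken from our analysis code.

```r
# Year coverage as described in the text above
movie_industry_years <- 1987:2016   # "Movie Industry" dataset
gender_data_years    <- 2000:2018   # actor and director datasets

# Years present in both sources
overlap_years <- intersect(movie_industry_years, gender_data_years)

range(overlap_years)    # first and last overlapping year
length(overlap_years)   # number of release years in the analysis window
```

Counting both endpoints, the window from 2000 to 2016 contains 17 release years.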

Hypothesis:

We expected that there would be fewer films featuring female involvement than films featuring male involvement. In terms of success, we expected that films with female involvement would show an increasing trend over time, but that male-featured films would still be the more successful group overall.

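knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE)
library(dplyr)    #Pipes, filtering, joins and counts
library(ggplot2)  #Charts
library(cowplot)  #theme_minimal_hgrid()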

#Cleaning each dataset - keeping only the columns used later, filtering for useful years (2000-2016)

movie_industry <- movie_industry %>%
    filter(year > 1999,
           country == 'USA') %>%     #Removing data 1999 and before & filtering for only USA movies
    arrange(year) %>%
    select(country, director, gross, name, score, star, year)

actors_gender <- actors_gender %>%
    select(year, name, country, gross, starring, gender) %>%
    mutate(
        gross = as.numeric(gross)
        )  %>%
    filter(year < 2017) %>%       #Removing data 2017 and after
    arrange(desc(year)) %>%
    distinct()                  #Dropping exact duplicate rows

directors_gender <- directors_gender %>%
    select(year, name, country, gross, director, gender) %>%
    mutate(
           gross = as.numeric(gross)
           )

movie_industry
actors_gender
directors_gender

#CREATE FULL ACTOR  DATASET
#Join actor_gender and movie_industry data together
#Keep only movie_industry columns (aka .x) for consistency with other datasets

movie_actor_gender <- movie_industry %>%
    left_join(actors_gender, by = 'name') %>%
    filter(star == starring) %>%      #Keep rows where both sources agree on the lead
    select(name, star, gender, country.x, director, gross.x, score, year.x) %>%
    rename(
        release_year = year.x,
        gross = gross.x,
        country = country.x
        ) %>% 
    mutate(
      gross = gross/1000000
    ) %>%
    filter(gender != 'unknown')
movie_actor_gender

#CREATE FULL DIRECTOR  DATASET
#Join director_gender and movie_industry data together
#Keep only movie_industry columns (aka .x) for consistency with other datasets

movie_director_gender <- movie_industry %>%
    left_join(directors_gender, by = 'name') %>%
    select(country.x, director.x, gross.x, name, score, star, year.x, gender) %>%
    rename(
        release_year = year.x,
        gross = gross.x,
        country = country.x,
        director = director.x
        ) %>% 
    filter(gender != 'unknown')
movie_director_gender

Given our hypothesis, we first wanted to explore the share of films by gender released per year to provide an overview of the representation of lead female actresses in films. To do so, we counted the number of films each year with a female lead and a male lead. To visualize this, we created a 100% y-axis bar chart, which conveys the share of female versus male leads out of all the movies released each year. A 100% y-axis bar chart was utilized to show the weight of gender representation each year without simply counting the number of movies. It is important to first understand the general make up of gender representation within the film industry before we can narrow the scope to explore how gender may impact the success of films.
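As a minimal sketch of what the 100% bar chart displays, the share for each gender is its count divided by the total number of films in that year. The counts below are hypothetical and only illustrate the arithmetic; they are not values from our datasets.

```r
# Hypothetical counts, for illustration only (not taken from the datasets)
counts <- data.frame(
  release_year = c(2000, 2000, 2001, 2001),
  gender       = c("female", "male", "female", "male"),
  movie_count  = c(20, 80, 30, 90)
)

# Share of each gender within a year = count / total films that year
year_totals  <- ave(counts$movie_count, counts$release_year, FUN = sum)
counts$share <- counts$movie_count / year_totals
counts
```

Because the shares within each year sum to one, every bar has the same height and only the split between the two colors changes from year to year.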

#ACTOR/ACTRESS SUMMARY VARIABLES AND COUNT GRAPH

#COUNT OF MOVIES:

#Summary of gender by year 
movie_actor_gender_count <- movie_actor_gender %>% 
    count(release_year, gender) %>% 
    rename(
        movie_count = 'n'
    )
movie_actor_gender_count

#Plot of Number of Movies with the lead role male/female by year

#gender.labs <- c("Female", "Male")

actor_gender_by_year_plot <- ggplot(movie_actor_gender_count, aes(release_year, movie_count)) +
  geom_col(width = 0.8, position = 'fill', aes(fill = gender)) +
  scale_y_continuous(
        expand = expansion(mult = c(0, 0.05)),
        labels = scales::percent_format()
        ) +
  scale_x_continuous(
        breaks = seq(2000, 2016, 2)
        ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'), 
    labels = c('Female', 'Male'))+
  labs(x = "Release Year",
       y = "Share of Films",
       title = "Share of Films Released by Gender of Lead Role", 
      fill = 'Gender of Lead Role') 

actor_gender_by_year_plot

As we expected, it is immediately apparent that from 2000 to 2016 male leads account for a far larger share of films than female leads. However, over this period female leads have become increasingly common in U.S. films. It is also important to consider the share of films released based on the gender of the director. The following chart, which shows the share of movies by gender of the director, further affirms our expectation that the film industry is male-dominated.

movie_director_gender_count <- movie_director_gender %>% 
  count(release_year, gender) %>% 
  rename(
    movie_count = 'n'
  )
movie_director_gender_count

director_gender_by_year_plot <- ggplot(movie_director_gender_count, aes(release_year, movie_count)) +
  geom_col(width = 0.8, position = 'fill', aes(fill = gender)) +
  scale_y_continuous(
        expand = expansion(mult = c(0, 0.05)),
        labels = scales::percent_format()
        ) +
  scale_x_continuous(
        breaks = seq(2000, 2016, 2)
        ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'), 
    labels = c('Female', 'Male'))+
  labs(x = "Release Year",
       y = "Share of Films",
       title = "Share of Films Released by Gender of Director", 
      fill = 'Gender of Director') 

director_gender_by_year_plot

Since 1929, when the Academy Awards were established to honor artistic and technical merit in the film industry, only five women have ever been nominated for Best Director. Of those five nominees, only one has won the award: Kathryn Bigelow, for “The Hurt Locker” in 2009. Since then, only one woman, Greta Gerwig, has been nominated in the category of Best Director. Although our analysis focuses only on the short period of 2000-2016, the lack of female directors seen in this chart is further emphasized by the lack of recognition women receive on a national scale.

An important area of analysis is the intersection between the gender of the lead role and the gender of the director. We wanted to determine whether the gender of the director influenced the gender of the lead of a film. Two 100% y-axis charts were created to show the gender representation of the leads in female-directed and male-directed films. These intersections matter because the number of female leads has been increasing over time, and it is important to understand why.
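The idea behind this comparison can be sketched as a cross-tabulation in which each row is normalized by its own total. The rows below are hypothetical and use base R only; the column names mirror the ones used in our joined data.

```r
# Hypothetical film-level rows, for illustration only
films <- data.frame(
  actor_gender    = c("female", "male", "male", "female", "male", "female"),
  director_gender = c("female", "female", "male", "male", "male", "female")
)

# Rows: director gender, columns: lead gender
combo_counts <- table(director = films$director_gender, lead = films$actor_gender)

# margin = 1 divides each row by its total, giving lead shares per director gender
lead_share_by_director <- prop.table(combo_counts, margin = 1)
lead_share_by_director
```

Comparing the two rows of the resulting table is the tabular counterpart of comparing the two charts side by side.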

#COUNT OF MOVIES BY GENDER OF LEAD ROLE AND DIRECTOR... INTERSECTION
movie_actor_gender2 <- movie_actor_gender %>% 
  select(name, star, gender, release_year)
  
movie_director_gender2 <- movie_director_gender %>% 
  select(name, director, gender)

actor_director_joint <- movie_actor_gender2 %>% 
  left_join(movie_director_gender2, by = c('name' = 'name')) %>% 
  rename( 
    actor_gender = gender.x,
    director_gender = gender.y) %>% 
  filter(!is.na(director_gender)) %>% 
  #Label each film with its lead/director gender combination
  mutate(combo = case_when(
    actor_gender == 'female' & director_gender == 'female' ~ 'Female Role & Female Director',
    actor_gender == 'female' & director_gender == 'male'   ~ 'Female Role & Male Director',
    actor_gender == 'male'   & director_gender == 'female' ~ 'Male Role & Female Director',
    actor_gender == 'male'   & director_gender == 'male'   ~ 'Male Role & Male Director',
    TRUE ~ 'other'))

actor_director_joint

lead_director_male_count <- actor_director_joint %>%
  filter(director_gender == "male") %>%
  count(release_year, actor_gender) %>%
  rename(male_director_count_by_lead_gender = n)
lead_director_male_count

lead_director_female_count <- actor_director_joint %>%
  filter(director_gender == "female") %>%
  count(release_year, actor_gender) %>%
  rename(female_director_count_by_lead_gender = n)
lead_director_female_count

lead_gender_by_male_director_plot <- ggplot(lead_director_male_count, aes(release_year, male_director_count_by_lead_gender)) +
  geom_col(width = 0.8, position = 'fill', aes(fill = actor_gender)) +
  scale_y_continuous(
        expand = expansion(mult = c(0, 0.05)),
        labels = scales::percent_format()
        ) +
  scale_x_continuous(
        breaks = seq(2000, 2016, 2)
        ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'), 
    labels = c('Female', 'Male'))+
  labs(x = "Release Year",
       y = "Share of Films",
       title = "Share of Male Directed Films by Gender of Lead Role", 
      fill = 'Gender of Lead Role') 
lead_gender_by_male_director_plot

lead_gender_by_female_director_plot <- ggplot(lead_director_female_count, aes(release_year, female_director_count_by_lead_gender)) +
  geom_col(width = 0.8, position = 'fill', aes(fill = actor_gender)) +
  scale_y_continuous(
        expand = expansion(mult = c(0, 0.05)),
        labels = scales::percent_format()
        ) +
  scale_x_continuous(
        breaks = seq(2000, 2016, 2)
        ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'), 
    labels = c('Female', 'Male'))+
  labs(x = "Release Year",
       y = "Share of Films",
       title = "Share of Female Directed Films by Gender of Lead Role", 
      fill = 'Gender of Lead Role') 
lead_gender_by_female_director_plot

In the chart of male-directed films by gender of the lead role, far more films pair a male director with a male lead than with a female lead. The share of male-directed films with a female lead increased slightly between 2000 and 2016 but has not changed much. It is clear that a larger share of leads are women when the director is a woman. Specifically, starting in 2007, female leads make up over 50% of female-directed films, and this holds for the majority of the following years. Given this pattern, we would expect more female leads as the number of female directors grows. In both 2001 and 2004, the graph shows no films that combine a female lead with a female director, which seems peculiar. Either there was not enough data to provide accurate results for these years, or many of the data points for these years were removed for missing information when we cleaned and filtered the data. Despite this, we believe the chart accurately represents the overall trend in the industry: female directors tend to hire female leads more often than male directors do.
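One way to investigate years such as 2001 and 2004 is to tabulate the female-directed films by year with the zero cells kept, so that empty combinations and small yearly totals are visible. The sketch below uses hypothetical rows and base R; it is a suggested check, not part of the analysis we ran.

```r
# Hypothetical joined rows, for illustration only
joined <- data.frame(
  release_year    = c(2000, 2000, 2001, 2001, 2002),
  actor_gender    = c("female", "male", "male", "male", "female"),
  director_gender = c("female", "female", "female", "male", "female")
)

female_directed <- joined[joined$director_gender == "female", ]

# table() on a factor with fixed levels keeps zero cells,
# which a plain grouped count would silently drop
cells <- table(
  year = female_directed$release_year,
  lead = factor(female_directed$actor_gender, levels = c("female", "male"))
)
cells

# Total female-directed films per year: a 0% share is least informative
# in years where this total is very small
rowSums(cells)
```

If the yearly totals turn out to be only one or two films, the 0% bars say more about sample size than about hiring patterns.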

Next, we wanted to analyze the success of films. To start, we looked at IMDb scores, which are user-generated ratings of films. It is important to look at audience reviews because it is the public that determines the success of a film. IMDb users can write reviews and give each film a score out of 10.0 reflecting how much they liked it. The following chart explores these user-generated scores to see whether the gender of the lead actor affects the average score.

#Summary of IMDb score by year by gender

movie_actor_female_score <- movie_actor_gender %>%
  filter(gender == "female") %>%
  group_by(release_year) %>%
  summarise(avg_female_score = mean(score))   #One average per release year
movie_actor_female_score

movie_actor_male_score <- movie_actor_gender %>%
  filter(gender == "male") %>%
  group_by(release_year) %>%
  summarise(avg_male_score = mean(score))     #One average per release year
movie_actor_male_score

#Plot of score per year by gender 
actor_score_plot <- ggplot() + 
  geom_line(data = movie_actor_female_score, aes(x = release_year, y = avg_female_score), size = 1, color = '#7d5eb5') + 
  geom_point(data = movie_actor_female_score, aes(x = release_year, y = avg_female_score), size = 1, alpha = 0.8, color = '#7d5eb5') +
  geom_line(data = movie_actor_male_score, aes(x = release_year, y = avg_male_score), size = 1, color = '#c4a262') +
  geom_point(data = movie_actor_male_score, aes(x = release_year, y = avg_male_score), size = 1, color = '#c4a262') +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05)),
    breaks = seq(5.6, 6.6, 0.2),
    limits = c(5.6, 6.6)) +
  scale_x_continuous(
    breaks = seq(2000, 2016, 2)
      ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  labs(x = "Release Year",
       y = "Average IMDb Score",
       title = "Average IMDb Score per Year") +
  #The two lines are labeled directly on the chart instead of using a legend
  annotate('text', 2016.63, 6.55, label = "Male", size = 3, color = '#c4a262') +
  annotate('text', 2016.8, 6.5, label = "Female", size = 3, color = '#7d5eb5')
actor_score_plot

This chart shows that movies with male leads tend to score higher than those with female leads. However, the average IMDb score of movies with female leads trends upward over time, indicating that female-led films are becoming more successful. This mirrors the earlier charts: men still outnumber women, but films led by women have gained ground. It is important to note that these average IMDb scores all fall between 5.6 and 6.6, so the spikes and drops do not necessarily indicate that a film failed in the audience's eyes.
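A trend like this can be quantified by fitting a straight line to the yearly averages and reading off the slope. The sketch below uses hypothetical averages and base R's lm(); the numbers are not results from our data.

```r
# Hypothetical yearly averages, for illustration only
yearly <- data.frame(
  release_year = 2000:2004,
  avg_score    = c(6.0, 6.1, 6.0, 6.2, 6.3)
)

# Simple linear trend: average score as a function of release year
trend <- lm(avg_score ~ release_year, data = yearly)

# The slope is the estimated change in average score per year
slope <- coef(trend)[["release_year"]]
slope
```

A positive slope supports an increasing trend, and its size shows how small the yearly change is relative to the 10-point IMDb scale.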

Another indicator of the success of a film is the revenue it generates. As in the previous chart, we compared films by the gender of the lead role, this time focusing on the average revenue generated each year. Gross revenue is an integral area of consideration because it is an indicator of both popularity and success.

#GROSS REVENUE: Summary of gross revenue by year per gender

movie_actor_female_rev <- movie_actor_gender %>%
    filter(gender == "female") %>%
    group_by(release_year) %>%
    mutate(avg_female_gross = mean(gross))
movie_actor_female_rev

movie_actor_male_rev <- movie_actor_gender %>%
    filter(gender == "male") %>%
    group_by(release_year) %>%
    mutate(avg_male_gross = mean(gross))
movie_actor_male_rev

#Plot of gross revenue per year by gender 
ggplot() + 
  geom_smooth(data = movie_actor_female_rev, aes(x = release_year, y = avg_female_gross), color = '#7d5eb5', se = FALSE) + 
  geom_point(data = movie_actor_female_rev, aes(x = release_year, y = avg_female_gross), size = .8, color = '#7d5eb5') +
  geom_smooth(data = movie_actor_male_rev, aes(x = release_year, y = avg_male_gross), color = '#c4a262', se = FALSE) +
  geom_point(data = movie_actor_male_rev, aes(x = release_year, y = avg_male_gross), size = .8, color = '#c4a262') +
  scale_x_continuous(
    breaks = seq(2000, 2016, 2)
        ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05))
        ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 15)
        )+
  labs(x = "Release Year",
         y = "Gross Revenue in Millions ($)",
         title = "Average Gross Revenue per Year") +
  annotate('text', 2016.5, 66, label = "Male", size = 3, color = '#c4a262')+
  annotate('text', 2016.7, 96.5, label = "Female", size = 3, color ='#7d5eb5')

Between 2000 and 2013, the average gross revenue of films with male leads was higher than that of films with female leads. Starting in 2013, female-led films begin to generate more revenue than those with male leads. It is important to note the female outlier in 2015. The average revenue generated by female-led films is so high in 2015 because of movies like “Star Wars: The Force Awakens,” which generated $936 million alone. We conclude that average revenue for films with male leads was higher than for films with female leads for most of the 2000s, but the revenue for films with female leads has increased dramatically in recent years.
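The effect of a single blockbuster on a yearly average can be illustrated by comparing the mean with the median. In the sketch below only the 936 figure comes from the text above; the other grosses are made-up values for illustration.

```r
# Hypothetical grosses in millions of dollars for one year; only the 936
# figure comes from the text above, the other values are made up
gross_millions <- c(40, 55, 60, 75, 936)

mean_gross   <- mean(gross_millions)     # pulled upward by the single large value
median_gross <- median(gross_millions)   # unaffected by how large the top value is

c(mean = mean_gross, median = median_gross)
```

Reporting the median alongside the mean is one way to show how much of a yearly spike is driven by one or two films.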

Conclusion:

As expected, films with female involvement have been less frequent and less successful than those featuring male involvement across the film industry. However, there has been an increasing presence of female actors and directors. Over time, as female representation in the film industry has increased, the success of female-led and female-directed films has increased as well. Charts were created to show female representation in different categories. Throughout our analysis, we found that the increase in female representation and success was consistent with our hypothesis and provided insight to answer our research question. If the success of a film is measured by ratings and generated revenue, our findings show that as the number of women involved in films increases, so do the ratings and generated revenue. More female representation has impacted audience scores and generated revenue, which are each large determinants of a film’s success.

Appendix

Dictionary for Source 1: