Throughout the 21st century, women have made considerable progress toward autonomy and equality in American society. Even so, they remain underrepresented in the film industry. This study examines how the representation of women in the film industry, from the directors to the stars of the movies, has affected the success of films over the years.
https://www.kaggle.com/danielgrijalvas/movies
The dataset that we use as our base reference is titled “Movie Industry.” It is a CSV file covering movies released from 1987 to 2016. It contains 6,820 movies with the following attributes: budget, production company, country of origin, director, main genre, gross revenue, name of the movie, rating, release date, duration, Internet Movie Database (IMDb) user rating, number of user votes, main actor/actress, writer, and year. The data comes directly from Kaggle, which is considered one of the world’s largest public data science communities, and it was constructed using information taken directly from IMDb. The dataset also has a Kaggle usability score of 10.0/10.0, which suggests that it is easy to interpret and properly maintained.
https://github.com/taubergm/HollywoodGenderData/blob/master/all_actors_movies_gender_gold.csv
The second data source is titled “all_actors_movie_gender” from Hollywood Gender Data, and it is a CSV file that contains thousands of data points about the lead role in different movies from 2000 to 2018. The variables are the name of the lead actor/actress, gender of the lead role, year, name of the movie, country the film was made in, budget, gross revenue, average runtime, language of the film, and the exact date the film was released. The data comes directly from GitHub, a software development platform with more than 100 million projects that allows users to publish their own files for review and public use. The producer of this data has posted a number of repositories that have been used for analysis in many projects.
https://github.com/taubergm/HollywoodGenderData/blob/master/all_directors_gender.csv
The third data source is titled “all_directors_gender,” also from Hollywood Gender Data, and it is a CSV file that contains data about the directors of different movies from 2000 to 2018. It contains 5,077 movies with the following attributes: the name of the director, gender of the director, year the film was released, name of the movie, country the film was made in, budget, gross revenue, average runtime, language of the film, and the exact date it was released. The data comes directly from GitHub, from the same user who, as explained previously, has uploaded many datasets for public use.
The relevant variables in our data are the gender of the actors and directors, IMDb score, release year of the movie, and gross revenue. For the purposes of our analyses, we computed a number of variables to evaluate our research question. First, we noticed there were multiple countries present in all the datasets. We decided to analyze solely films that were made and released in the United States to make our research more relevant to our audience. Next, the movie industry dataset contained data from 1987 through 2016, whereas the actor and director datasets contained data from 2000 to 2018. When we joined our base movie industry data to the actor and director datasets, the overlapping years were 2000 to 2016, which is the time period we used for our analysis.
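The year overlap and country filter described above can be illustrated with a minimal sketch. The toy data frames `toy_movies` and `toy_leads` and their rows are hypothetical and are not taken from the real datasets. The sketch also joins on both title and year, which is an assumption made here for illustration; it is not necessarily the key used in the analysis.

```r
library(dplyr)

# Year coverage of each source, as stated in the text
movie_industry_years <- 1987:2016
gender_data_years <- 2000:2018

# Years present in both sources
overlap_years <- intersect(movie_industry_years, gender_data_years)
range(overlap_years)

# Hypothetical toy rows standing in for the real datasets
toy_movies <- data.frame(
  name = c("Film A", "Film B", "Film C"),
  year = c(1995, 2005, 2016),
  country = c("USA", "USA", "UK"),
  stringsAsFactors = FALSE
)
toy_leads <- data.frame(
  name = c("Film B", "Film C", "Film D"),
  year = c(2005, 2016, 2018),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Keep U.S. films in the overlapping years, then match on title and year
toy_joined <- toy_movies %>%
  filter(country == "USA", year %in% overlap_years) %>%
  inner_join(toy_leads, by = c("name", "year"))
toy_joined
```

Matching on year as well as title guards against pairing two different films that happen to share a name, such as a remake.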
We expected that the number of films featuring female involvement would be lower than the number featuring males. In terms of success, we expected that films with female involvement would show an increasing trend over time. However, we expected male-featured films to remain more successful overall.
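knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE)
#Packages used by the code below
library(tidyverse) #dplyr verbs, the %>% pipe, and ggplot2
library(cowplot) #theme_minimal_hgrid()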
#Cleaning each dataset: keeping only the columns we need and filtering for the useful years (2000-2016)
movie_industry <- movie_industry %>%
  filter(year > 1999, country == 'USA') %>% #Removing data from 1999 and before & keeping only USA movies
  arrange(year) %>%
  select(country, director, gross, name, score, star, year)
actors_gender <- actors_gender %>%
  select(year, name, country, gross, starring, gender) %>%
  mutate(gross = as.numeric(gross)) %>%
  filter(year < 2017) %>% #Removing data from 2017 and after
  arrange(desc(year)) %>%
  distinct() #Dropping duplicate rows; the result has to be assigned to take effect
directors_gender <- directors_gender %>%
  select(year, name, country, gross, director, gender) %>%
  mutate(gross = as.numeric(gross))
movie_industry
actors_gender
directors_gender
#CREATE FULL ACTOR DATASET
#Join actor_gender and movie_industry data together
#Keep only movie_industry columns (aka .x) for consistency with other datasets
movie_actor_gender <- movie_industry %>%
  left_join(actors_gender, by = 'name') %>%
  filter(star == starring) %>% #Keep only rows where both sources agree on the lead
  select(name, star, gender, country.x, director, gross.x, score, year.x) %>%
  rename(
    release_year = year.x,
    gross = gross.x,
    country = country.x
  ) %>%
  mutate(gross = gross / 1000000) %>% #Gross revenue in millions of dollars
  filter(gender != 'unknown')
movie_actor_gender
#CREATE FULL DIRECTOR DATASET
#Join director_gender and movie_industry data together
#Keep only movie_industry columns (aka .x) for consistency with other datasets
movie_director_gender <- movie_industry %>%
  left_join(directors_gender, by = 'name') %>%
  select(country.x, director.x, gross.x, name, score, star, year.x, gender) %>%
  rename(
    release_year = year.x,
    gross = gross.x,
    country = country.x,
    director = director.x
  ) %>%
  filter(gender != 'unknown') #Also drops movies with no match in the director data
movie_director_gender
Given our hypothesis, we first wanted to explore the share of films released per year by gender of the lead, to provide an overview of the representation of female leads in films. To do so, we counted the number of films each year with a female lead and with a male lead. To visualize this, we created a 100% stacked bar chart, which conveys the share of female versus male leads out of all the movies released each year. We chose this chart type because it shows the balance of gender representation each year regardless of how many movies were released. It is important to understand the general makeup of gender representation within the film industry before narrowing the scope to explore how gender may affect the success of films.
#ACTOR/ACTRESS SUMMARY VARIABLES AND COUNT GRAPH
#COUNT OF MOVIES:
#Summary of gender by year
movie_actor_gender_count <- movie_actor_gender %>%
  count(release_year, gender, name = 'movie_count')
movie_actor_gender_count
#Plot of the share of movies with a male/female lead role by year
actor_gender_by_year_plot <- ggplot(movie_actor_gender_count, aes(release_year, movie_count)) +
  geom_col(aes(fill = gender), width = 0.8, position = 'fill') +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05)), #expansion() replaces the deprecated expand_scale()
    labels = scales::percent_format()
  ) +
  scale_x_continuous(breaks = seq(2000, 2016, 2)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'),
    labels = c('Female', 'Male')
  ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + #Must come after the complete theme
  labs(
    x = "Release Year",
    y = "Share of Films",
    title = "Share of Films Released by Gender of Lead Role",
    fill = 'Gender of Lead Role'
  )
actor_gender_by_year_plot
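The chart shows shares only visually, so it can help to compute them as numbers as well. The following is a minimal sketch of that calculation. The `toy_counts` data frame and its values are hypothetical and only mimic the shape of `movie_actor_gender_count`; they are not results from the real data.

```r
library(dplyr)

# Hypothetical counts in the same shape as movie_actor_gender_count
toy_counts <- data.frame(
  release_year = c(2000, 2000, 2016, 2016),
  gender = c("female", "male", "female", "male"),
  movie_count = c(20, 80, 30, 70),
  stringsAsFactors = FALSE
)

# Share of each gender within its release year,
# which is what position = 'fill' draws in the chart
toy_shares <- toy_counts %>%
  group_by(release_year) %>%
  mutate(share = movie_count / sum(movie_count)) %>%
  ungroup()

# Female-lead share per year, for comparing the first and last years
female_shares <- toy_shares %>%
  filter(gender == "female")
female_shares
```

Applying the same steps to `movie_actor_gender_count` would give the exact percentages behind each bar.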
As we expected, it is immediately apparent that from 2000 to 2016, male leads account for a larger share of films than female leads. However, the share of female leads in U.S. films increased over this period. It is also important to consider the share of films released based on the gender of the director. The following chart, which shows the share of movies by gender of the director, further affirms our expectation that the film industry is male-dominated.
#Summary of director gender by year
movie_director_gender_count <- movie_director_gender %>%
  count(release_year, gender, name = 'movie_count')
movie_director_gender_count
#Plot of the share of movies with a male/female director by year
director_gender_by_year_plot <- ggplot(movie_director_gender_count, aes(release_year, movie_count)) +
  geom_col(aes(fill = gender), width = 0.8, position = 'fill') +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05)), #expansion() replaces the deprecated expand_scale()
    labels = scales::percent_format()
  ) +
  scale_x_continuous(breaks = seq(2000, 2016, 2)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'),
    labels = c('Female', 'Male')
  ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + #Must come after the complete theme
  labs(
    x = "Release Year",
    y = "Share of Films",
    title = "Share of Films Released by Gender of Director",
    fill = 'Gender of Director'
  )
director_gender_by_year_plot
Since 1929, when the Academy Awards were established to honor artistic and technical merit in the film industry, only five women have ever been nominated for Best Director. Of those five nominees, only one has won the award: Kathryn Bigelow, for “The Hurt Locker” in 2009. Since then, only one woman, Greta Gerwig, has been nominated for Best Director. Although our analysis focuses only on the period from 2000 to 2016, the lack of female directors seen in this chart is further emphasized by the lack of recognition women receive on a national scale.
An important area of analysis is the intersection between the gender of the lead role and the gender of the director. We wanted to determine whether the gender of the director influenced the gender of the lead of a film. Two 100% stacked bar charts were created to show the gender representation of the leads in female-directed and male-directed films. These intersections matter because the number of female leads has been increasing over time, and we want to understand why.
#COUNT OF MOVIES BY GENDER OF LEAD ROLE AND DIRECTOR (INTERSECTION)
movie_actor_gender2 <- movie_actor_gender %>%
  select(name, star, gender, release_year)
movie_director_gender2 <- movie_director_gender %>%
  select(name, director, gender)
#Join on movie title: gender.x is the lead's gender, gender.y is the director's
actor_director_joint <- movie_actor_gender2 %>%
  left_join(movie_director_gender2, by = 'name') %>%
  rename(
    actor_gender = gender.x,
    director_gender = gender.y
  ) %>%
  filter(!is.na(director_gender)) %>%
  mutate(
    #Numeric codes: 1 = female, 2 = male, NA = anything else
    #Exact matching avoids 'male' also matching inside 'female'
    actor_num = case_when(actor_gender == 'female' ~ 1, actor_gender == 'male' ~ 2),
    director_num = case_when(director_gender == 'female' ~ 1, director_gender == 'male' ~ 2),
    combo = case_when(
      actor_num == 1 & director_num == 1 ~ 'Female Role & Female Director',
      actor_num == 1 & director_num == 2 ~ 'Female Role & Male Director',
      actor_num == 2 & director_num == 1 ~ 'Male Role & Female Director',
      actor_num == 2 & director_num == 2 ~ 'Male Role & Male Director',
      TRUE ~ 'other'
    )
  )
actor_director_joint
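Before splitting the data by year, one way to get a first read on whether lead gender differs by director gender is to pool all years and compute the share of each lead gender within each director gender. The following is a minimal sketch of that idea. The `toy_joint` rows are hypothetical and only mimic the relevant columns of `actor_director_joint`; the resulting shares are not findings.

```r
library(dplyr)

# Hypothetical rows with the same key columns as actor_director_joint
toy_joint <- data.frame(
  name = paste("Film", 1:6),
  actor_gender = c("female", "male", "male", "male", "female", "male"),
  director_gender = c("female", "female", "male", "male", "male", "male"),
  stringsAsFactors = FALSE
)

# Count each lead/director pairing, then convert the counts to shares
# within each director gender
lead_share_by_director <- toy_joint %>%
  count(director_gender, actor_gender, name = "movie_count") %>%
  group_by(director_gender) %>%
  mutate(share = movie_count / sum(movie_count)) %>%
  ungroup()
lead_share_by_director
```

A difference in shares between the two director groups would be descriptive only; with few female-directed films per year, small counts can move the percentages a great deal.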
lead_director_male_count <- actor_director_joint %>%
filter(director_gender == "male") %>%
count(release_year, actor_gender) %>%
rename(male_director_count_by_lead_gender = n)
lead_director_male_count
lead_director_female_count <- actor_director_joint %>%
filter(director_gender == "female") %>%
count(release_year, actor_gender) %>%
rename(female_director_count_by_lead_gender = n)
lead_director_female_count
#Plot of the share of male-directed movies by gender of the lead role
lead_gender_by_male_director_plot <- ggplot(lead_director_male_count, aes(release_year, male_director_count_by_lead_gender)) +
  geom_col(aes(fill = actor_gender), width = 0.8, position = 'fill') +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05)), #expansion() replaces the deprecated expand_scale()
    labels = scales::percent_format()
  ) +
  scale_x_continuous(breaks = seq(2000, 2016, 2)) +
  scale_fill_manual(
    values = c('#7d5eb5', '#c4a262'),
    labels = c('Female', 'Male')
  ) +
  theme_minimal_hgrid() +
  theme(plot.title = element_text(hjust = 0.5, size = 14)) + #Must come after the complete theme
  labs(
    x = "Release Year",
    y = "Share of Films",
    title = "Share of Male-Directed Films by Gender of Lead Role",
    fill = 'Gender of Lead Role'
  )
lead_gender_by_male_director_plot