Financial Trends, Ratings, and Profitability from TMDb and Rotten Tomatoes
Author
Ibrahim Ahmed; Paul Asamoah Boadu
Published
December 8, 2024
Introduction
The movie industry operates at the intersection of two primary objectives: profitability for production companies and qualitative entertainment for audiences. These dual goals raise intriguing questions about the factors that drive success in both financial and audience perception terms.
Are the same variables—such as actors, directors, genres, and production companies—responsible for both a movie’s profitability and its critical acclaim? Is there a strong correlation between financial success and audience ratings?
Using the TMDB movies dataset, this project seeks to explore these relationships by uncovering patterns among key contributors to highly profitable and highly rated movies. Additionally, it examines whether these patterns hold true for unprofitable and poorly rated films.
Primary Research Question
Are the factors that contribute to a movie’s profitability the same as those that influence its critical and audience ratings?
To address this overarching question, we explore the following sub-questions:
What is the trend of revenue, budget and tomatometer ratings over the years?
Is there a significant correlation between a movie’s profitability and its audience ratings/critical acclaim?
Do highly profitable movies share common patterns in actors, directors and content ratings with highly rated movies?
How do unprofitable and poorly rated movies compare to their highly profitable and highly rated counterparts in terms of key contributors (e.g., actors, directors, content ratings)?
Do specific genres or content ratings consistently yield movies that are both highly profitable and highly rated?
Can we develop a classification algorithm to predict a movie’s tomatometer status using financial data and other key variables such as runtime, primary genre, popularity, vote average and content ratings?
Data Sources and Preprocessing
Our analysis utilizes two key datasets: the TMDb Movie Metadata Dataset and the Rotten Tomatoes Movies and Critic Reviews Dataset. Together, these datasets provide a comprehensive view of the financial, audience, and critical performance of movies.
TMDb Movie Metadata Dataset
The TMDb dataset, hosted on Kaggle as the TMDB 5000 Movies Dataset, contains detailed information about over 5,000 movies. This dataset focuses on essential aspects of the movie industry, including movie titles, genres, production companies, release dates, budgets, and box office revenues. Originating from The Movie Database (TMDb), a widely recognized community-driven platform, the dataset spans movies released between 1916 and 2017. Kaggle contributors collected this dataset using the TMDb API.
The TMDb dataset is particularly valuable for its emphasis on the financial performance of movies, offering crucial metrics such as production budgets and box office revenues. These features serve as a foundation for exploring profitability and other financial dimensions of the film industry.
Rotten Tomatoes Movies and Critic Reviews Dataset
The Rotten Tomatoes dataset complements TMDb by providing insights into audience and critic perceptions of movies. It includes variables such as tomatometer ratings, vote counts, and tomatometer status, which reflect audience and critic evaluations. The dataset also shares common variables with TMDb, such as genres, cast, and directors, enabling integration and cross-referencing.
Scraped from the Rotten Tomatoes website as of October 31, 2020, this dataset is now hosted on Kaggle as Rotten Tomatoes Movies and Critic Reviews Dataset. By capturing both critical and audience responses, this dataset provides a well-rounded perspective on the reception of movies beyond their financial performance.
Data Integration and Preprocessing
To create a unified and enriched dataset, the TMDb and Rotten Tomatoes datasets were merged and preprocessed using several steps to ensure consistency and facilitate analysis. The datasets were joined using the “title” column as the common key, enabling financial data from TMDb to combine seamlessly with audience and critic evaluations from Rotten Tomatoes. Redundant columns were removed for simplicity.
Preprocessing Steps
The following key steps were performed during preprocessing:
Decomposition:
Nested variables like genres, production companies, production countries, spoken languages, and cast were transformed into comma-separated lists for easier analysis.
Feature Extraction:
From the decomposed lists, the first item was extracted to create new columns representing:
Primary Genre
Lead Actor
Lead Director
Primary Production Company
Reclassification:
Instances of genre misclassification were identified and corrected to ensure consistency and accuracy in categorization.
Derived Columns:
A profitability ratio was calculated by comparing revenue to budget, providing a measure of financial success. Based on this ratio, movies were further categorized into profitability classes, facilitating comparative analysis.
By integrating these datasets and applying thorough preprocessing, the final dataset harmonizes key aspects of financial performance and audience/critic evaluation. These enhancements establish a robust foundation for exploring the multi-faceted dimensions of the movie industry.
Code
# Load the cleaned_movies datasetmovies <-read.csv(here::here('data_processed', 'cleaned_movie_df.csv'))
Trend Analysis of budget, revenue and tatometer ratings
Code
# Calculate total revenue and budget per yearyearly_totals <- movies %>%# Extract year from release datemutate(year =as.numeric(substr(release_date, 1, 4)) ) %>%# Group by year and calculate total metricsgroup_by(year) %>%summarise(total_revenue =sum(revenue, na.rm =TRUE),total_budget =sum(budget, na.rm =TRUE),movie_count =n() ) %>%# Add a profitability ratio columnmutate(yearly_roi = (total_revenue - total_budget) / total_budget *100 )# Create the static ggplot without pointsfinancial_trend_plot <-ggplot(yearly_totals, aes(x = year)) +geom_line(aes(y = total_revenue, color ="Revenue"), size =0.8) +geom_line(aes(y = total_budget, color ="Budget"), size =0.8) +scale_y_continuous(labels =function(x) paste0("$", round(x /1e9, 1), "B"),name ="Total Amount (Billions USD)" ) +scale_color_manual(values =c("Revenue"="#2E8B57", "Budget"="#4169E1"),name ="Financial Metric" ) +labs(title ="Movie Industry Financials",subtitle ="Total Revenue and Budget Trends",x ="Year",caption ="Values represent yearly totals across all movies in the dataset" ) +theme_minimal_vgrid() +theme(plot.title =element_text(size =14, hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),axis.title =element_text(size =11),axis.text =element_text(size =10),legend.position ="bottom",legend.title =element_text(size =10),panel.grid.minor =element_blank(),panel.grid.major =element_line(color ="gray95"),plot.margin =unit(c(0.8, 0.8, 0.8, 0.8), "cm") )# Convert the ggplot to an interactive plotly chartinteractive_trend_plot <-ggplotly(financial_trend_plot)# Display the interactive chartinteractive_trend_plot
Code
# Calculate ratings datayearly_ratings <- movies %>%# Convert release dates to years for temporal analysismutate(year =as.numeric(substr(release_date, 1, 4)) ) %>%# Group by year to calculate annual averagesgroup_by(year) %>%summarise(avg_rating =mean(tomatometer_rating, na.rm =TRUE),movie_count =n() ) %>%filter(avg_rating >10)# Create the static ggplot without pointstrend_plot <-ggplot(yearly_ratings, aes(x = year, y = avg_rating)) +geom_line(color ="#E64A19", size =0.8, alpha =1 ) +scale_y_continuous(name ="Average Tomatometer Rating",limits =c(10, 100),breaks =seq(10, 100, by =10) ) +labs(title ="Evolution of Movie Ratings",x ="Year" ) +theme_minimal_vgrid() +theme(plot.title =element_text(size =14, hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),axis.title =element_text(size =11),axis.text =element_text(size =10),panel.grid.minor =element_blank(),panel.grid.major =element_line(color ="gray95"),plot.margin =unit(c(0.8, 0.8, 0.8, 0.8), "cm") )# Convert the ggplot to an interactive plotly chartinteractive_trend_plot <-ggplotly(trend_plot)# Display the interactive chartinteractive_trend_plot
The financial trends reveal a significant rise in both budgets and revenues in the movie industry since the 1980s, showcasing its growth in scale and investment, though profitability gaps have widened. In contrast, the average Tomatometer ratings have declined steadily since the 1950s, reflecting evolving audience expectations and critical standards.
Correlation between movies profitability and its audience ratings/critical acclaim
Code
corr <-cor( movies$profitability_ratio, movies$tomatometer_rating,method ="pearson",use ="complete.obs")# Label for annotationcorrLabel <-paste("r =", round(corr, 2))# Scatterplot with annotationscatterplot <- movies %>%ggplot() +geom_point(aes(x = profitability_ratio, y = tomatometer_rating),size =1, alpha =0.7 ) +scale_x_log10() +scale_y_log10() +theme_classic(base_size =20) +labs(x ="Profitability Ratio (log scale)",y ="Tomatometer Ratings (log scale)",title ="Correlation of Movie Profitability and Ratings" ) +annotate(geom ="text",x =0.01, y =200, label = corrLabel,hjust =0, size =7 )scatterplot
The scatterplot shows a weak correlation (r = 0.03) between profitability and Tomatometer ratings, indicating that financial success does not reliably translate to critical or audience acclaim.
Comparative Analysis of Tomatometer Ratings and Profitability
Highly Profitable and Certified-Fresh Movies
Comparison of Tomatometer Ratings and Profitability Classes
The waffle plots provide a distributional analysis of Certified-Fresh, Fresh, and Rotten ratings among highly profitable movies and the profitability classes within Certified-Fresh movies.
Code
# Tomatometer summary for highly profitable moviestomatometer_summary <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(tomatometer_status)) %>%count(tomatometer_status) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for tomatometer statusestomato_colors <-c("Certified-Fresh"="#FF6B6B","Fresh"="#4BC0C0", "Rotten"="#5C7CFA")# Waffle plot for highly profitable moviesprofitable_waffle_plot <-ggplot(tomatometer_summary) +geom_waffle(aes(fill = tomatometer_status, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = tomato_colors,name ="Tomatometer Status" ) +coord_equal() +labs(title ="Tomatometer Ratings in Highly Profitable Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(tomatometer_summary$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =14, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =10, color ="gray50"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Profitability summary for Certified-Fresh moviesprofitability_summary <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(profitability_class)) %>%count(profitability_class) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for profitability classesprofitability_colors <-c("Highly Profitable"="#FF6B6B","Moderately Profitable"="#4BC0C0","Unprofitable"="#5C7CFA")# Waffle plot for Certified-Fresh moviescertified_waffle_plot <-ggplot(profitability_summary) +geom_waffle(aes(fill = profitability_class, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = profitability_colors,name ="Profitability Class" ) +coord_equal() +labs(title ="Profitability Classes in Certified-Fresh Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(profitability_summary$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =15, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =11, color ="black"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )final_plot <- profitable_waffle_plot + certified_waffle_plot +plot_layout(ncol =2, widths =c(1.1, 1)) &theme(plot.margin = ggplot2::margin(t =20, r =20, b =20, l =20), legend.position ="right", legend.title =element_text(size =15), legend.text =element_text(size =15), legend.spacing.x =unit(15, "pt") )# Adjust titles for better spacing and readabilityfinal_plot <- final_plot &theme(plot.title =element_text(size =16, face ="bold", hjust =0.5, margin = ggplot2::margin(b =15)),plot.subtitle =element_text(size =12, hjust =0.5, color ="gray30") )print(final_plot)
The distributional analyses provided further clarity. Highly profitable movies show a three-way distribution, with Certified-Fresh movies making up the largest segment (around 45%), followed by 19% and 36% Fresh and Rotten ratings respectively. Examining Certified-Fresh movies specifically, they tend to be financially successful, with about 57% being highly profitable, 23% moderately profitable, and only 20% unprofitable.
Unprofitable and Rotten Movies
Comparison of Tomatometer Ratings and Profitability Classes
This second waffle plots examine the distribution of Certified-Fresh, Fresh, and Rotten ratings within unprofitable movies and the profitability classes among Rotten movies.
Code
# Tomatometer summary for unprofitable moviestomatometer_unprofitable <- movies %>%filter(profitability_class =="Unprofitable", !is.na(tomatometer_status)) %>%count(tomatometer_status) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for tomatometer statusestomato_colors <-c("Certified-Fresh"="#FF6B6B","Fresh"="#4BC0C0", "Rotten"="#5C7CFA")# Waffle plot for unprofitable moviesunprofitable_waffle_plot <-ggplot(tomatometer_unprofitable) +geom_waffle(aes(fill = tomatometer_status, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = tomato_colors,name ="Tomatometer Status" ) +coord_equal() +labs(title ="Tomatometer Ratings in Unprofitable Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(tomatometer_unprofitable$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =14, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =10, color ="gray50"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Profitability summary for Rotten moviesprofitability_rotten <- movies %>%filter(tomatometer_status =="Rotten", !is.na(profitability_class)) %>%count(profitability_class) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for profitability classesprofitability_colors <-c("Highly Profitable"="#FF6B6B","Moderately Profitable"="#4BC0C0","Unprofitable"="#5C7CFA")# Waffle plot for Rotten moviesrotten_waffle_plot <-ggplot(profitability_rotten) +geom_waffle(aes(fill = profitability_class, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = profitability_colors,name ="Profitability Class" ) +coord_equal() +labs(title ="Profitability Classes in Rotten Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(profitability_rotten$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =15, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =11, color ="black"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Combine the two plots side-by-sidefinal_plot <- unprofitable_waffle_plot + rotten_waffle_plot +plot_layout(ncol =2, widths =c(1.1, 1)) &theme(plot.margin = ggplot2::margin(t =20, r =20, b =20, l =20),legend.position ="right",legend.title =element_text(size =15),legend.text =element_text(size =15),legend.spacing.x =unit(15, "pt") )# Adjust titles for consistency and readabilityfinal_plot <- final_plot &theme(plot.title =element_text(size =16, face ="bold", hjust =0.5, margin = ggplot2::margin(b =15)),plot.subtitle =element_text(size =12, hjust =0.5, color ="gray30") )print(final_plot)
These waffle plots demonstrate that unprofitable movies are predominantly rated as Rotten (approximately 70%), with smaller proportions of Fresh (20%) and Certified-Fresh (10%) ratings. However, interestingly, when examining Rotten-rated movies, they show surprising financial diversity - about 40% are highly profitable, 40% moderately profitable, and only 20% unprofitable. This highlights that while poor critical reception often correlates with poor financial performance, many poorly-rated movies still achieve significant financial success.
Analysis of Highly Profitable Movies and Critically Acclaimed Movies (Certified-Fresh) Across Actors, Directors, Genres, and Content Ratings
Lead Actors in Highly Profitable and Certified-Fresh Movies
This subsection analyzes actors with at least five appearances in movies categorized as “Highly Profitable” and “Certified-Fresh.”
Code
highly_profitable_actors <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(lead_actor)) %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Then we created the highly rated actors datasethighly_rated_actors <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(lead_actor)) %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Now we combine themcombined_data <-full_join( highly_profitable_actors %>%rename(profitable_apps = appearances), highly_rated_actors %>%rename(rated_apps = appearances),by ="lead_actor") %>%replace_na(list(profitable_apps =0, rated_apps =0)) %>%filter(profitable_apps >5| rated_apps >5)# Then we made the plotggplot(combined_data) +geom_col(aes(x =reorder(lead_actor, profitable_apps), y =-profitable_apps), fill ="#69b3e7", width =0.8) +geom_col(aes(x =reorder(lead_actor, profitable_apps), y = rated_apps),fill ="#1e4b7a", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Actors",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x =NULL,y ="Number of Appearances") +theme(# Changed actor names to black and kept sizeaxis.text.y =element_text(size =10, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),# Removed horizontal gridlines, kept vertical onespanel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"), ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )
Lead Directors in Highly Profitable and Certified-Fresh Movies
This subsection examines directors who have made significant contributions to movies that are both highly profitable and critically acclaimed, with at least five appearances.
Code
highly_profitable_directors <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Then create the highly rated directors datasethighly_rated_directors <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Now combine themcombined_directors <-full_join( highly_profitable_directors %>%rename(profitable_apps = appearances), highly_rated_directors %>%rename(rated_apps = appearances),by ="lead_director") %>%replace_na(list(profitable_apps =0, rated_apps =0)) %>%filter(profitable_apps >5| rated_apps >5)# Create the diverging bar plot for directorsggplot(combined_directors) +geom_col(aes(x =reorder(lead_director, profitable_apps), y =-profitable_apps), fill ="#69b3e7", width =0.8) +geom_col(aes(x =reorder(lead_director, profitable_apps), y = rated_apps),fill ="#1e4b7a", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Directors",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10, color ="black"),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),# After coord_flip, panel.grid.major.x creates vertical linespanel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )
Content Analysis of Highly Profitable and Certified-Fresh Movies
This subsection delves into the genres and content ratings prevalent in highly profitable and Certified-Fresh movies, identifying patterns and their implications.
Code
# First create the content ratings for highly profitable moviescontent_ratings_profitable <- movies %>%filter(profitability_class =="Highly Profitable") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Then create the content ratings for highly rated moviescontent_ratings_rated <- movies %>%filter(tomatometer_status =="Certified-Fresh") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Now combine themcombined_ratings <-full_join( content_ratings_profitable %>%rename(profitable_count = count), content_ratings_rated %>%rename(rated_count = count),by ="content_rating") %>%replace_na(list(profitable_count =0, rated_count =0))ggplot(combined_ratings) +# Add vertical gridlinesgeom_vline(xintercept =seq(-300, 300, 50), color ="gray90") +# Add thicker zero linegeom_vline(xintercept =0, color ="gray40", linewidth =1) +# Left side (profitable) segments and pointsgeom_segment(aes(x =-profitable_count, xend =0, y =reorder(content_rating, profitable_count), yend =reorder(content_rating, profitable_count)), color ='grey70') +geom_point(aes(x =-profitable_count, y =reorder(content_rating, profitable_count)), size =3, color ='#69b3e7') +# Right side (rated) segments and pointsgeom_segment(aes(x =0, xend = rated_count, y =reorder(content_rating, profitable_count), yend =reorder(content_rating, profitable_count)), color ='grey70') +geom_point(aes(x = rated_count, y =reorder(content_rating, profitable_count)), size =3, color ='#1e4b7a') +# Customize the themetheme_minimal() +labs(title ="Content Ratings in Movies",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x ="Count of Movies",y =NULL) +theme(axis.text.y =element_text(size =11, colour ="black"),axis.text.x =element_text(size =10, colour ="black"),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +# Extend x-axis rangescale_x_continuous(labels = abs,breaks =seq(-500, 500, 50),limits =c(-500, 500) )
Analysis of Unprofitable Movies and Poorly Rated Movies (Rotten) Across Actors, Directors, Genres, and Content Ratings
Lead Actors in Unprofitable and Rotten Movies
This subsection analyzes actors with at least five notable appearances in movies categorized as “Unprofitable” and “Rotten.”
Code
# Analysis of lead actors in unprofitable moviesunprofitable_actors <- movies %>%filter(profitability_class =="Unprofitable") %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))rotten_actors <- movies %>%filter(tomatometer_status =="Rotten") %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Combine the datasetscombined_negative <-full_join( unprofitable_actors %>%rename(unprofitable_apps = appearances), rotten_actors %>%rename(rotten_apps = appearances),by ="lead_actor") %>%replace_na(list(unprofitable_apps =0, rotten_apps =0)) %>%filter(unprofitable_apps >5| rotten_apps >5)combined_negative <- combined_negative %>%arrange(desc(unprofitable_apps + rotten_apps)) %>%head(30)# Create the diverging bar plotggplot(combined_negative) +geom_col(aes(x =reorder(lead_actor, unprofitable_apps), y =-unprofitable_apps), fill ="#FF9999", width =0.8) +geom_col(aes(x =reorder(lead_actor, unprofitable_apps), y = rotten_apps),fill ="#CC3333", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Actors in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"),aspect.ratio =1.5 ) +scale_y_continuous(labels = abs,breaks =seq(-20, 20, 2) )
Lead Directors in Unprofitable and Rotten Movies
This subsection examines directors with at least five notable contributions to movies categorized as both “Unprofitable” and “Poorly Rated.”
Code
unprofitable_directors <- movies %>%filter(profitability_class =="Unprofitable", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Create rotten directors datasetrotten_directors <- movies %>%filter(tomatometer_status =="Rotten", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Combine datasetscombined_negative_directors <-full_join( unprofitable_directors %>%rename(unprofitable_apps = appearances), rotten_directors %>%rename(rotten_apps = appearances),by ="lead_director") %>%replace_na(list(unprofitable_apps =0, rotten_apps =0)) %>%filter(unprofitable_apps >5| rotten_apps >5)# Create the diverging bar plot for directorsggplot(combined_negative_directors) +geom_col(aes(x =reorder(lead_director, unprofitable_apps), y =-unprofitable_apps), fill ="#FF9999", width =0.8) +geom_col(aes(x =reorder(lead_director, unprofitable_apps), y = rotten_apps),fill ="#CC3333", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Directors in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"), ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )
Content Analysis of Unprofitable and Rotten Movies
This subsection delves into the genres and content ratings prevalent in unprofitable and Rotten movies, identifying patterns and their implications.
Code
# Analysis of content ratings in unprofitable moviescontent_ratings_unprofitable <- movies %>%filter(profitability_class =="Unprofitable") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Then create the rotten content ratings datasetcontent_ratings_rotten <- movies %>%filter(tomatometer_status =="Rotten") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Now combine themcombined_ratings_negative <-full_join( content_ratings_unprofitable %>%rename(unprofitable_count = count), content_ratings_rotten %>%rename(rotten_count = count),by ="content_rating") %>%replace_na(list(unprofitable_count =0, rotten_count =0))# Create the diverging lollipop plotggplot(combined_ratings_negative) +# Segments for unprofitable movies (left side)geom_segment(aes(x =-unprofitable_count, xend =0, y =reorder(content_rating, unprofitable_count), yend =reorder(content_rating, unprofitable_count)), color ='grey70') +geom_point(aes(x =-unprofitable_count, y =reorder(content_rating, unprofitable_count)), size =3, color ='#FF9999') +# Segments for rotten movies (right side)geom_segment(aes(x =0, xend = rotten_count, y =reorder(content_rating, unprofitable_count), yend =reorder(content_rating, unprofitable_count)), color ='grey70') +geom_point(aes(x = rotten_count, y =reorder(content_rating, unprofitable_count)), size =3, color ='#CC3333') +# Customize the themetheme_minimal() +labs(title ="Content Ratings in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x ="Count of Movies",y =NULL) +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +scale_x_continuous(labels = abs,breaks =seq(-500, 500, 50) )
Analysis of lead actors, directors, and content ratings revealed that most individuals associated with highly profitable movies were absent from Certified-Fresh movies, with only a few exceptions. This underscores the often divergent nature of commercial success and critical acclaim. Interestingly, content ratings exhibited consistent patterns across all categories, indicating their crucial role in influencing both profitability and ratings.
Do specific genres consistently yield movies that are highly profitable, highly rated (Certified Fresh), or both?
For this question, we implemented the following steps:
Contingency table:
Summarized movie counts by genre across three categories:
Highly Profitable
Certified Fresh
Both (i.e., movies that are both highly profitable and certified fresh).
Chi-Square test:
Assessed the statistical significance of the relationship between genres and the three categories.
Residuals:
Identify which genres contribute most to the observed differences in genre distributions.
Create a faceted bar plot:
Compare the counts of movies in each genre across the three categories visually using a faceted bar plot.
Results:
Chi-Squared Statistic: 53.883
Degrees of Freedom: 26
P-Value: 0.00105
The low (p)-value indicates a statistically significant relationship between genres and the three categories. Thus certain genres, such as drama and comedy, disproportionately contribute to the trends, showing consistent success in profitability, ratings, or both.
Code
# Filter the dataset to include valid rows for analysis and create binary indicators# 'highly_profitable' for movies classified as "Highly Profitable"# 'certified_fresh' for movies classified as "Certified Fresh"genre_data <- movies %>%filter(!is.na(primary_genre), !is.na(profitability_class), !is.na(tomatometer_status)) %>%mutate(highly_profitable =ifelse(profitability_class =="Highly Profitable", 1, 0),certified_fresh =ifelse(tomatometer_status =="Certified-Fresh", 1, 0) )# Group the data by genre and calculate counts for:# - Highly profitable movies# - Certified Fresh movies# - Movies that are both# Additionally, calculate the total count of movies in each genregenre_contingency <- genre_data %>%group_by(primary_genre) %>%summarise(highly_profitable_count =sum(highly_profitable),certified_fresh_count =sum(certified_fresh),both_count =sum(highly_profitable & certified_fresh),total_count =n() ) %>%arrange(desc(total_count)) # To view the resulting contingency table#print(genre_contingency)# Aggregate genres with fewer than 20 movies into an "Other" category# Summarize the counts again to consolidate small categoriesgenre_contingency_clean <- genre_contingency %>%mutate(primary_genre =ifelse(total_count <20, "Other", primary_genre) ) %>%group_by(primary_genre) %>%summarise(highly_profitable_count =sum(highly_profitable_count),certified_fresh_count =sum(certified_fresh_count),both_count =sum(both_count),total_count =sum(total_count) )# To view the cleaned contingency table with aggregated genres#print(genre_contingency_clean)# Create a matrix suitable for statistical testing# Select relevant columns (counts for different categories) and pivot into a matrixgenre_matrix <- genre_contingency_clean %>%select(primary_genre, highly_profitable_count, certified_fresh_count, both_count) %>%pivot_longer(cols =-primary_genre,names_to ="category",values_to ="count" ) %>%pivot_wider(names_from = category,values_from = count,values_fill =0 ) %>%column_to_rownames(var ="primary_genre") # Make genres the row names# Convert to a matrix for testing and printgenre_matrix <-as.matrix(genre_matrix)#print(genre_matrix)# Here we convert the matrix to numeric format for statistical testinggenre_matrix_numeric <-as.matrix(genre_matrix)# Here we perform a chi-square test to evaluate whether the distribution# of movie counts differs significantly across categoriesgenre_chisq <-chisq.test(genre_matrix_numeric)# Print the test result# print(genre_chisq)# Extract standardized residuals from the chi-square test# Residuals indicate which genres contribute most to any significant differencesresiduals <- genre_chisq$stdres#print(residuals)# Step 7: Preparing our data for visualization# we converted the matrix to a long-format data frame for visualization# and added readable labels for the metric namesgenre_df <-as.data.frame(genre_matrix)genre_df$Genre <-rownames(genre_matrix)genre_long <- genre_df %>%pivot_longer(cols =-Genre, names_to ="Metric", values_to ="Count") %>%mutate(Metric =recode(Metric,"highly_profitable_count"="Highly Profitable","certified_fresh_count"="Certified Fresh","both_count"="Both"))# Step 8: We sorted the data by counts within facets for each metric in descending ordergenre_long_sorted <- genre_long %>%group_by(Metric) %>%mutate(Genre =reorder_within(Genre, -Count, Metric)) %>%ungroup()# Step 9: we generated a faceted bar plot to compare genre distributions# across "Highly Profitable," "Certified Fresh," and "Both" categoriesgenre_order <- genre_long %>%filter(Metric =="Both") %>%arrange(desc(Count)) %>%pull(Genre) %>%rev()# Update the plot using this fixed ordergenre_faceted_sorted_plot <- genre_long %>%mutate(Genre =factor(Genre, levels = genre_order)) %>%ggplot(aes(y = Genre, x = Count, fill = Metric)) +geom_col(width =0.7) +facet_grid(. ~ Metric, scales ="free_x") +labs(title ="Genre Distribution in Movies",subtitle ="Highly Profitable vs Certified Fresh vs Both Categories",x ="Count of Movies",y =NULL ) +scale_fill_manual(values =c("Highly Profitable"="#69b3e7","Certified Fresh"="#1e4b7a","Both"="#2E5B88" ),name ="Category" ) +theme_minimal() +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),strip.text =element_text(size =12, face ="bold"),legend.position ="none",panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"),panel.spacing =unit(2, "lines") )print(genre_faceted_sorted_plot)
7. Can we develop a classification algorithm to predict a movie’s Tomatometer status?
Steps Taken:
Data Preparation:
Selected key features: budget, revenue, popularity, etc.
Created derived variables:
popularity_score: Product of popularity and vote_average.
revenue_efficiency: Ratio of revenue to budget.
Removed missing values and converted categorical variables (e.g., tomatometer_status, primary_genre) to factors.
Data Splitting:
Divided data into 80% training and 20% testing subsets using a fixed random seed for reproducibility.
Model Training:
Built a Random Forest classifier with:
ntree = 2000 (number of trees).
mtry = 6 (variables per split).
Balanced sampling to account for class imbalances.
Model Evaluation:
Used the test set to generate predictions.
Assessed model performance using a confusion matrix.
Visualized key predictors using a variable importance plot.
Classification Model Performance
Confusion Matrix
Prediction
Certified-Fresh
Fresh
Rotten
Certified-Fresh
135
28
27
Fresh
7
23
12
Rotten
35
51
248
Key Metrics
Metric
Value
Accuracy
0.7173
95% CI
(0.6783, 0.7541)
No Information Rate
0.5071
Kappa
0.5146
Mcnemar’s Test P-Value
3.154e-08
Statistics by Class
Metric
Certified-Fresh
Fresh
Rotten
Sensitivity
0.7627
0.22549
0.8641
Specificity
0.8586
0.95905
0.6918
Positive Predictive Value
0.7105
0.54762
0.7425
Negative Predictive Value
0.8883
0.84924
0.8319
Prevalence
0.3127
0.18021
0.5071
Detection Rate
0.2385
0.04064
0.4382
Detection Prevalence
0.3357
0.07420
0.5901
Balanced Accuracy
0.8107
0.59227
0.7779
Conclusions
Our analysis reveals that while the movie industry has seen significant growth in budgets and revenues, profitability is weakly correlated with audience ratings, showing that financial success does not ensure critical acclaim. Highly profitable movies show a balanced distribution across all rating categories, while Certified-Fresh movies demonstrate a strong tendency toward financial success. Key factors such as genre and content ratings significantly influence both profitability and ratings. Lastly, our random forest model achieved 72% accuracy in predicting Certified-Fresh and Rotten movies, highlighting its strength in identifying extreme cases of audience and critic evaluations.
Explore the data and trends further with our interactive Shiny app, featuring dynamic visualizations and insights. Access it here.
Attribution
All members contributed equally.
APPENDIX
A. Data Dictionary
Column Name
Data Type
Description
id
Integer
Unique identifier for each movie.
budget
Numeric
The production budget of the movie in US dollars.
genres
String
Comma-separated list of genres associated with the movie.
popularity
Numeric
Popularity score of the movie based on user activity, such as views or votes.
production_companies
String
Comma-separated list of companies involved in the movie’s production.
production_countries
String
Comma-separated list of countries where the movie was produced.
release_date
String
Release date of the movie in the format YYYY-MM-DD.
revenue
Numeric
Revenue earned by the movie in US dollars.
runtime
Integer
Duration of the movie in minutes.
title
String
Title of the movie.
vote_average
Numeric
Average rating of the movie on a scale of 1 to 10 based on user votes.
vote_count
Integer
Total number of user votes for the movie.
cast
String
Comma-separated list of prominent actors featured in the movie.
content_rating
String
Official content rating (e.g., G, PG, PG-13, R) assigned to the movie.
directors
String
Comma-separated list of directors involved in the movie’s production.
tomatometer_status
String
Status of the movie on Rotten Tomatoes (e.g., “Fresh”, “Rotten”, “Certified-Fresh”).
tomatometer_rating
Integer
Percentage score of the movie on Rotten Tomatoes based on critic reviews.
tomatometer_count
Integer
Number of critic reviews considered for the Rotten Tomatoes score.
primary_genre
String
The primary genre of the movie.
primary_company
String
The primary production company responsible for the movie.
lead_actor
String
Name of the lead actor in the movie.
lead_director
String
Name of the lead director in the movie.
profitability_ratio
Numeric
The ratio of revenue to budget, representing profitability.
profitability_class
String
Profitability classification (e.g., “Unprofitable”, “Highly Profitable”) based on the profitability ratio.
B. R Code Summary
Code
# Load libraries and settings herelibrary(here)library(cowplot)library(tidyverse)library(viridis)library(ggrepel)library(gganimate)library(magick)library(gifski)library(transformr)library(reticulate)library(lubridate)library(janitor)library(jsonlite)library(purrr)library(waffle)library(tibble)library(caret)library(randomForest)library(e1071)library(forcats)library(patchwork)library(tidytext)library(ggplot2)library(plotly)knitr::opts_chunk$set(warning =FALSE,message =FALSE,comment ="#>",fig.path ="figs/", fig.width =8, fig.height =6, out.width ="100%",fig.retina =4)# Put any other "global" settings here, e.g. a ggplot theme:theme_set(theme_bw(base_size =20))# Load the cleaned_movies datasetmovies <-read.csv(here::here('data_processed', 'cleaned_movie_df.csv'))# Calculate total revenue and budget per yearyearly_totals <- movies %>%# Extract year from release datemutate(year =as.numeric(substr(release_date, 1, 4)) ) %>%# Group by year and calculate total metricsgroup_by(year) %>%summarise(total_revenue =sum(revenue, na.rm =TRUE),total_budget =sum(budget, na.rm =TRUE),movie_count =n() ) %>%# Add a profitability ratio columnmutate(yearly_roi = (total_revenue - total_budget) / total_budget *100 )# Create the static ggplot without pointsfinancial_trend_plot <-ggplot(yearly_totals, aes(x = year)) +geom_line(aes(y = total_revenue, color ="Revenue"), size =0.8) +geom_line(aes(y = total_budget, color ="Budget"), size =0.8) +scale_y_continuous(labels =function(x) paste0("$", round(x /1e9, 1), "B"),name ="Total Amount (Billions USD)" ) +scale_color_manual(values =c("Revenue"="#2E8B57", "Budget"="#4169E1"),name ="Financial Metric" ) +labs(title ="Movie Industry Financials",subtitle ="Total Revenue and Budget Trends",x ="Year",caption ="Values represent yearly totals across all movies in the dataset" ) +theme_minimal_vgrid() +theme(plot.title =element_text(size =14, hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),axis.title =element_text(size =11),axis.text =element_text(size =10),legend.position ="bottom",legend.title =element_text(size =10),panel.grid.minor =element_blank(),panel.grid.major =element_line(color ="gray95"),plot.margin =unit(c(0.8, 0.8, 0.8, 0.8), "cm") )# Convert the ggplot to an interactive plotly chartinteractive_trend_plot <-ggplotly(financial_trend_plot)# Display the interactive chartinteractive_trend_plot# Calculate ratings datayearly_ratings <- movies %>%# Convert release dates to years for temporal analysismutate(year =as.numeric(substr(release_date, 1, 4)) ) %>%# Group by year to calculate annual averagesgroup_by(year) %>%summarise(avg_rating =mean(tomatometer_rating, na.rm =TRUE),movie_count =n() ) %>%filter(avg_rating >10)# Create the static ggplot without pointstrend_plot <-ggplot(yearly_ratings, aes(x = year, y = avg_rating)) +geom_line(color ="#E64A19", size =0.8, alpha =1 ) +scale_y_continuous(name ="Average Tomatometer Rating",limits =c(10, 100),breaks =seq(10, 100, by =10) ) +labs(title ="Evolution of Movie Ratings",x ="Year" ) +theme_minimal_vgrid() +theme(plot.title =element_text(size =14, hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),axis.title =element_text(size =11),axis.text =element_text(size =10),panel.grid.minor =element_blank(),panel.grid.major =element_line(color ="gray95"),plot.margin =unit(c(0.8, 0.8, 0.8, 0.8), "cm") )# Convert the ggplot to an interactive plotly chartinteractive_trend_plot <-ggplotly(trend_plot)# Display the interactive chartinteractive_trend_plotcorr <-cor( movies$profitability_ratio, movies$tomatometer_rating,method ="pearson",use ="complete.obs")# Label for annotationcorrLabel <-paste("r =", round(corr, 2))# Scatterplot with annotationscatterplot <- movies %>%ggplot() +geom_point(aes(x = profitability_ratio, y = tomatometer_rating),size =1, alpha =0.7 ) +scale_x_log10() +scale_y_log10() +theme_classic(base_size =20) +labs(x ="Profitability Ratio (log scale)",y ="Tomatometer Ratings (log scale)",title ="Correlation of Movie Profitability and Ratings" ) +annotate(geom ="text",x =0.01, y =200, label = corrLabel,hjust =0, size =7 )scatterplot# Tomatometer summary for highly profitable moviestomatometer_summary <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(tomatometer_status)) %>%count(tomatometer_status) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for tomatometer statusestomato_colors <-c("Certified-Fresh"="#FF6B6B","Fresh"="#4BC0C0", "Rotten"="#5C7CFA")# Waffle plot for highly profitable moviesprofitable_waffle_plot <-ggplot(tomatometer_summary) +geom_waffle(aes(fill = tomatometer_status, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = tomato_colors,name ="Tomatometer Status" ) +coord_equal() +labs(title ="Tomatometer Ratings in Highly Profitable Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(tomatometer_summary$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =14, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =10, color ="gray50"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Profitability summary for Certified-Fresh moviesprofitability_summary <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(profitability_class)) %>%count(profitability_class) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for profitability classesprofitability_colors <-c("Highly Profitable"="#FF6B6B","Moderately Profitable"="#4BC0C0","Unprofitable"="#5C7CFA")# Waffle plot for Certified-Fresh moviescertified_waffle_plot <-ggplot(profitability_summary) +geom_waffle(aes(fill = profitability_class, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = profitability_colors,name ="Profitability Class" ) +coord_equal() +labs(title ="Profitability Classes in Certified-Fresh Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(profitability_summary$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =15, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =11, color ="black"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )final_plot <- profitable_waffle_plot + certified_waffle_plot +plot_layout(ncol =2, widths =c(1.1, 1)) &theme(plot.margin = ggplot2::margin(t =20, r =20, b =20, l =20), legend.position ="right", legend.title =element_text(size =15), legend.text =element_text(size =15), legend.spacing.x =unit(15, "pt") )# Adjust titles for better spacing and readabilityfinal_plot <- final_plot &theme(plot.title =element_text(size =16, face ="bold", hjust =0.5, margin = ggplot2::margin(b =15)),plot.subtitle =element_text(size =12, hjust =0.5, color ="gray30") )print(final_plot)# Tomatometer summary for unprofitable moviestomatometer_unprofitable <- movies %>%filter(profitability_class =="Unprofitable", !is.na(tomatometer_status)) %>%count(tomatometer_status) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for tomatometer statusestomato_colors <-c("Certified-Fresh"="#FF6B6B","Fresh"="#4BC0C0", "Rotten"="#5C7CFA")# Waffle plot for unprofitable moviesunprofitable_waffle_plot <-ggplot(tomatometer_unprofitable) +geom_waffle(aes(fill = tomatometer_status, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = tomato_colors,name ="Tomatometer Status" ) +coord_equal() +labs(title ="Tomatometer Ratings in Unprofitable Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(tomatometer_unprofitable$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =14, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =10, color ="gray50"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Profitability summary for Rotten moviesprofitability_rotten <- movies %>%filter(tomatometer_status =="Rotten", !is.na(profitability_class)) %>%count(profitability_class) %>%mutate(fraction = n /sum(n) *100,parts =round(fraction) )# Define color palette for profitability classesprofitability_colors <-c("Highly Profitable"="#FF6B6B","Moderately Profitable"="#4BC0C0","Unprofitable"="#5C7CFA")# Waffle plot for Rotten moviesrotten_waffle_plot <-ggplot(profitability_rotten) +geom_waffle(aes(fill = profitability_class, values = parts),color ="white",size =0.5,n_rows =10,flip =TRUE ) +scale_fill_manual(values = profitability_colors,name ="Profitability Class" ) +coord_equal() +labs(title ="Profitability Classes in Rotten Movies",subtitle =sprintf("Each square ≈ %.1f%% of movies", 100/sum(profitability_rotten$parts)) ) +theme_minimal() +theme(plot.title =element_text(size =15, face ="bold", margin = ggplot2::margin(b =10)),plot.subtitle =element_text(size =11, color ="black"),legend.position ="right",legend.title =element_text(size =10),legend.text =element_text(size =9),panel.grid =element_blank(),axis.text =element_blank(),axis.title =element_blank() )# Combine the two plots side-by-sidefinal_plot <- unprofitable_waffle_plot + rotten_waffle_plot +plot_layout(ncol =2, widths =c(1.1, 1)) &theme(plot.margin = ggplot2::margin(t =20, r =20, b =20, l =20),legend.position ="right",legend.title =element_text(size =15),legend.text =element_text(size =15),legend.spacing.x =unit(15, "pt") )# Adjust titles for consistency and readabilityfinal_plot <- final_plot &theme(plot.title =element_text(size =16, face ="bold", hjust =0.5, margin = ggplot2::margin(b =15)),plot.subtitle =element_text(size =12, hjust =0.5, color ="gray30") )print(final_plot)highly_profitable_actors <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(lead_actor)) %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Then we created the highly rated actors datasethighly_rated_actors <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(lead_actor)) %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Now we combine themcombined_data <-full_join( highly_profitable_actors %>%rename(profitable_apps = appearances), highly_rated_actors %>%rename(rated_apps = appearances),by ="lead_actor") %>%replace_na(list(profitable_apps =0, rated_apps =0)) %>%filter(profitable_apps >5| rated_apps >5)# Then we made the plotggplot(combined_data) +geom_col(aes(x =reorder(lead_actor, profitable_apps), y =-profitable_apps), fill ="#69b3e7", width =0.8) +geom_col(aes(x =reorder(lead_actor, profitable_apps), y = rated_apps),fill ="#1e4b7a", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Actors",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x =NULL,y ="Number of Appearances") +theme(# Changed actor names to black and kept sizeaxis.text.y =element_text(size =10, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),# Removed horizontal gridlines, kept vertical onespanel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"), ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )highly_profitable_directors <- movies %>%filter(profitability_class =="Highly Profitable", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Then create the highly rated directors datasethighly_rated_directors <- movies %>%filter(tomatometer_status =="Certified-Fresh", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Now combine themcombined_directors <-full_join( highly_profitable_directors %>%rename(profitable_apps = appearances), highly_rated_directors %>%rename(rated_apps = appearances),by ="lead_director") %>%replace_na(list(profitable_apps =0, rated_apps =0)) %>%filter(profitable_apps >5| rated_apps >5)# Create the diverging bar plot for directorsggplot(combined_directors) +geom_col(aes(x =reorder(lead_director, profitable_apps), y =-profitable_apps), fill ="#69b3e7", width =0.8) +geom_col(aes(x =reorder(lead_director, profitable_apps), y = rated_apps),fill ="#1e4b7a", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Directors",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10, color ="black"),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),# After coord_flip, panel.grid.major.x creates vertical linespanel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )# First create the content ratings for highly profitable moviescontent_ratings_profitable <- movies %>%filter(profitability_class =="Highly Profitable") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Then create the content ratings for highly rated moviescontent_ratings_rated <- movies %>%filter(tomatometer_status =="Certified-Fresh") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Now combine themcombined_ratings <-full_join( content_ratings_profitable %>%rename(profitable_count = count), content_ratings_rated %>%rename(rated_count = count),by ="content_rating") %>%replace_na(list(profitable_count =0, rated_count =0))ggplot(combined_ratings) +# Add vertical gridlinesgeom_vline(xintercept =seq(-300, 300, 50), color ="gray90") +# Add thicker zero linegeom_vline(xintercept =0, color ="gray40", linewidth =1) +# Left side (profitable) segments and pointsgeom_segment(aes(x =-profitable_count, xend =0, y =reorder(content_rating, profitable_count), yend =reorder(content_rating, profitable_count)), color ='grey70') +geom_point(aes(x =-profitable_count, y =reorder(content_rating, profitable_count)), size =3, color ='#69b3e7') +# Right side (rated) segments and pointsgeom_segment(aes(x =0, xend = rated_count, y =reorder(content_rating, profitable_count), yend =reorder(content_rating, profitable_count)), color ='grey70') +geom_point(aes(x = rated_count, y =reorder(content_rating, profitable_count)), size =3, color ='#1e4b7a') +# Customize the themetheme_minimal() +labs(title ="Content Ratings in Movies",subtitle ="Highly Profitable (light blue) vs Certified-Fresh (dark blue)",x ="Count of Movies",y =NULL) +theme(axis.text.y =element_text(size =11, colour ="black"),axis.text.x =element_text(size =10, colour ="black"),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +# Extend x-axis rangescale_x_continuous(labels = abs,breaks =seq(-500, 500, 50),limits =c(-500, 500) )# Analysis of lead actors in unprofitable moviesunprofitable_actors <- movies %>%filter(profitability_class =="Unprofitable") %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))rotten_actors <- movies %>%filter(tomatometer_status =="Rotten") %>%group_by(lead_actor) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Combine the datasetscombined_negative <-full_join( unprofitable_actors %>%rename(unprofitable_apps = appearances), rotten_actors %>%rename(rotten_apps = appearances),by ="lead_actor") %>%replace_na(list(unprofitable_apps =0, rotten_apps =0)) %>%filter(unprofitable_apps >5| rotten_apps >5)combined_negative <- combined_negative %>%arrange(desc(unprofitable_apps + rotten_apps)) %>%head(30)# Create the diverging bar plotggplot(combined_negative) +geom_col(aes(x =reorder(lead_actor, unprofitable_apps), y =-unprofitable_apps), fill ="#FF9999", width =0.8) +geom_col(aes(x =reorder(lead_actor, unprofitable_apps), y = rotten_apps),fill ="#CC3333", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Actors in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"),aspect.ratio =1.5 ) +scale_y_continuous(labels = abs,breaks =seq(-20, 20, 2) )unprofitable_directors <- movies %>%filter(profitability_class =="Unprofitable", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Create rotten directors datasetrotten_directors <- movies %>%filter(tomatometer_status =="Rotten", !is.na(lead_director)) %>%group_by(lead_director) %>%summarise(appearances =n()) %>%filter(appearances >5) %>%arrange(desc(appearances))# Combine datasetscombined_negative_directors <-full_join( unprofitable_directors %>%rename(unprofitable_apps = appearances), rotten_directors %>%rename(rotten_apps = appearances),by ="lead_director") %>%replace_na(list(unprofitable_apps =0, rotten_apps =0)) %>%filter(unprofitable_apps >5| rotten_apps >5)# Create the diverging bar plot for directorsggplot(combined_negative_directors) +geom_col(aes(x =reorder(lead_director, unprofitable_apps), y =-unprofitable_apps), fill ="#FF9999", width =0.8) +geom_col(aes(x =reorder(lead_director, unprofitable_apps), y = rotten_apps),fill ="#CC3333", width =0.8) +geom_hline(yintercept =0, color ="gray40") +coord_flip() +theme_minimal() +labs(title ="Lead Directors in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x =NULL,y ="Number of Appearances") +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"), ) +scale_y_continuous(labels = abs,breaks =seq(-21, 21, 2) )# Analysis of content ratings in unprofitable moviescontent_ratings_unprofitable <- movies %>%filter(profitability_class =="Unprofitable") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Then create the rotten content ratings datasetcontent_ratings_rotten <- movies %>%filter(tomatometer_status =="Rotten") %>%group_by(content_rating) %>%summarise(count =n()) %>%mutate(content_rating =fct_reorder(content_rating, count))# Now combine themcombined_ratings_negative <-full_join( content_ratings_unprofitable %>%rename(unprofitable_count = count), content_ratings_rotten %>%rename(rotten_count = count),by ="content_rating") %>%replace_na(list(unprofitable_count =0, rotten_count =0))# Create the diverging lollipop plotggplot(combined_ratings_negative) +# Segments for unprofitable movies (left side)geom_segment(aes(x =-unprofitable_count, xend =0, y =reorder(content_rating, unprofitable_count), yend =reorder(content_rating, unprofitable_count)), color ='grey70') +geom_point(aes(x =-unprofitable_count, y =reorder(content_rating, unprofitable_count)), size =3, color ='#FF9999') +# Segments for rotten movies (right side)geom_segment(aes(x =0, xend = rotten_count, y =reorder(content_rating, unprofitable_count), yend =reorder(content_rating, unprofitable_count)), color ='grey70') +geom_point(aes(x = rotten_count, y =reorder(content_rating, unprofitable_count)), size =3, color ='#CC3333') +# Customize the themetheme_minimal() +labs(title ="Content Ratings in Poorly Performing Movies",subtitle ="Unprofitable (light red) vs Rotten (dark red)",x ="Count of Movies",y =NULL) +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm") ) +scale_x_continuous(labels = abs,breaks =seq(-500, 500, 50) )# Filter the dataset to include valid rows for analysis and create binary indicators# 'highly_profitable' for movies classified as "Highly Profitable"# 'certified_fresh' for movies classified as "Certified Fresh"genre_data <- movies %>%filter(!is.na(primary_genre), !is.na(profitability_class), !is.na(tomatometer_status)) %>%mutate(highly_profitable =ifelse(profitability_class =="Highly Profitable", 1, 0),certified_fresh =ifelse(tomatometer_status =="Certified-Fresh", 1, 0) )# Group the data by genre and calculate counts for:# - Highly profitable movies# - Certified Fresh movies# - Movies that are both# Additionally, calculate the total count of movies in each genregenre_contingency <- genre_data %>%group_by(primary_genre) %>%summarise(highly_profitable_count =sum(highly_profitable),certified_fresh_count =sum(certified_fresh),both_count =sum(highly_profitable & certified_fresh),total_count =n() ) %>%arrange(desc(total_count)) # To view the resulting contingency table#print(genre_contingency)# Aggregate genres with fewer than 20 movies into an "Other" category# Summarize the counts again to consolidate small categoriesgenre_contingency_clean <- genre_contingency %>%mutate(primary_genre =ifelse(total_count <20, "Other", primary_genre) ) %>%group_by(primary_genre) %>%summarise(highly_profitable_count =sum(highly_profitable_count),certified_fresh_count =sum(certified_fresh_count),both_count =sum(both_count),total_count =sum(total_count) )# To view the cleaned contingency table with aggregated genres#print(genre_contingency_clean)# Create a matrix suitable for statistical testing# Select relevant columns (counts for different categories) and pivot into a matrixgenre_matrix <- genre_contingency_clean %>%select(primary_genre, highly_profitable_count, certified_fresh_count, both_count) %>%pivot_longer(cols =-primary_genre,names_to ="category",values_to ="count" ) %>%pivot_wider(names_from = category,values_from = count,values_fill =0 ) %>%column_to_rownames(var ="primary_genre") # Make genres the row names# Convert to a matrix for testing and printgenre_matrix <-as.matrix(genre_matrix)#print(genre_matrix)# Here we convert the matrix to numeric format for statistical testinggenre_matrix_numeric <-as.matrix(genre_matrix)# Here we perform a chi-square test to evaluate whether the distribution# of movie counts differs significantly across categoriesgenre_chisq <-chisq.test(genre_matrix_numeric)# Print the test result# print(genre_chisq)# Extract standardized residuals from the chi-square test# Residuals indicate which genres contribute most to any significant differencesresiduals <- genre_chisq$stdres#print(residuals)# Step 7: Preparing our data for visualization# we converted the matrix to a long-format data frame for visualization# and added readable labels for the metric namesgenre_df <-as.data.frame(genre_matrix)genre_df$Genre <-rownames(genre_matrix)genre_long <- genre_df %>%pivot_longer(cols =-Genre, names_to ="Metric", values_to ="Count") %>%mutate(Metric =recode(Metric,"highly_profitable_count"="Highly Profitable","certified_fresh_count"="Certified Fresh","both_count"="Both"))# Step 8: We sorted the data by counts within facets for each metric in descending ordergenre_long_sorted <- genre_long %>%group_by(Metric) %>%mutate(Genre =reorder_within(Genre, -Count, Metric)) %>%ungroup()# Step 9: we generated a faceted bar plot to compare genre distributions# across "Highly Profitable," "Certified Fresh," and "Both" categoriesgenre_order <- genre_long %>%filter(Metric =="Both") %>%arrange(desc(Count)) %>%pull(Genre) %>%rev()# Update the plot using this fixed ordergenre_faceted_sorted_plot <- genre_long %>%mutate(Genre =factor(Genre, levels = genre_order)) %>%ggplot(aes(y = Genre, x = Count, fill = Metric)) +geom_col(width =0.7) +facet_grid(. ~ Metric, scales ="free_x") +labs(title ="Genre Distribution in Movies",subtitle ="Highly Profitable vs Certified Fresh vs Both Categories",x ="Count of Movies",y =NULL ) +scale_fill_manual(values =c("Highly Profitable"="#69b3e7","Certified Fresh"="#1e4b7a","Both"="#2E5B88" ),name ="Category" ) +theme_minimal() +theme(axis.text.y =element_text(size =11, color ="black"),axis.text.x =element_text(size =10),plot.title =element_text(size =14, face ="bold", hjust =0.5),plot.subtitle =element_text(size =11, hjust =0.5),strip.text =element_text(size =12, face ="bold"),legend.position ="none",panel.grid.major.y =element_blank(),panel.grid.minor =element_blank(),panel.grid.major.x =element_line(color ="gray85", linewidth =0.5),plot.margin =unit(c(1, 1, 1, 1), "cm"),panel.spacing =unit(2, "lines") )print(genre_faceted_sorted_plot)