Understanding Movie Performance Through Data

Financial Trends, Ratings, and Profitability from TMDb and Rotten Tomatoes

Author

Ibrahim Ahmed; Paul Asamoah Boadu

Published

December 8, 2024

Introduction

The movie industry operates at the intersection of two primary objectives: profitability for production companies and qualitative entertainment for audiences. These dual goals raise intriguing questions about the factors that drive success in both financial and audience perception terms.

Are the same variables—such as actors, directors, genres, and production companies—responsible for both a movie’s profitability and its critical acclaim? Is there a strong correlation between financial success and audience ratings?

Using the TMDB movies dataset, this project seeks to explore these relationships by uncovering patterns among key contributors to highly profitable and highly rated movies. Additionally, it examines whether these patterns hold true for unprofitable and poorly rated films.

Primary Research Question

Are the factors that contribute to a movie’s profitability the same as those that influence its critical and audience ratings?

To address this overarching question, we explore the following sub-questions:

What is the trend of revenue, budget and tomatometer ratings over the years?
Is there a significant correlation between a movie’s profitability and its audience ratings/critical acclaim?
Do highly profitable movies share common patterns in actors, directors and content ratings with highly rated movies?
How do unprofitable and poorly rated movies compare to their highly profitable and highly rated counterparts in terms of key contributors (e.g., actors, directors, content ratings)?
Do specific genres or content ratings consistently yield movies that are both highly profitable and highly rated?
Can we develop a classification algorithm to predict a movie’s tomatometer status using financial data and other key variables such as runtime, primary genre, popularity, vote average and content ratings?

Data Sources and Preprocessing

Our analysis utilizes two key datasets: the TMDb Movie Metadata Dataset and the Rotten Tomatoes Movies and Critic Reviews Dataset. Together, these datasets provide a comprehensive view of the financial, audience, and critical performance of movies.

TMDb Movie Metadata Dataset

The TMDb dataset, hosted on Kaggle as the TMDB 5000 Movies Dataset, contains detailed information about over 5,000 movies. This dataset focuses on essential aspects of the movie industry, including movie titles, genres, production companies, release dates, budgets, and box office revenues. Originating from The Movie Database (TMDb), a widely recognized community-driven platform, the dataset spans movies released between 1916 and 2017. Kaggle contributors collected this dataset using the TMDb API.

The TMDb dataset is particularly valuable for its emphasis on the financial performance of movies, offering crucial metrics such as production budgets and box office revenues. These features serve as a foundation for exploring profitability and other financial dimensions of the film industry.

Rotten Tomatoes Movies and Critic Reviews Dataset

The Rotten Tomatoes dataset complements TMDb by providing insights into audience and critic perceptions of movies. It includes variables such as tomatometer ratings, vote counts, and tomatometer status, which reflect audience and critic evaluations. The dataset also shares common variables with TMDb, such as genres, cast, and directors, enabling integration and cross-referencing.

Scraped from the Rotten Tomatoes website as of October 31, 2020, this dataset is now hosted on Kaggle as Rotten Tomatoes Movies and Critic Reviews Dataset. By capturing both critical and audience responses, this dataset provides a well-rounded perspective on the reception of movies beyond their financial performance.

Data Integration and Preprocessing

To create a unified and enriched dataset, the TMDb and Rotten Tomatoes datasets were merged and preprocessed using several steps to ensure consistency and facilitate analysis. The datasets were joined using the “title” column as the common key, enabling financial data from TMDb to combine seamlessly with audience and critic evaluations from Rotten Tomatoes. Redundant columns were removed for simplicity.

Preprocessing Steps

The following key steps were performed during preprocessing:

Decomposition:
Nested variables like genres, production companies, production countries, spoken languages, and cast were transformed into comma-separated lists for easier analysis.
Feature Extraction:
From the decomposed lists, the first item was extracted to create new columns representing:
- Primary Genre
- Lead Actor
- Lead Director
- Primary Production Company
Reclassification:
Instances of genre misclassification were identified and corrected to ensure consistency and accuracy in categorization.
Derived Columns:
A profitability ratio was calculated by comparing revenue to budget, providing a measure of financial success. Based on this ratio, movies were further categorized into profitability classes, facilitating comparative analysis.

By integrating these datasets and applying thorough preprocessing, the final dataset harmonizes key aspects of financial performance and audience/critic evaluation. These enhancements establish a robust foundation for exploring the multi-faceted dimensions of the movie industry.

Code

# Load the cleaned_movies dataset
movies <- read.csv(here::here('data_processed', 'cleaned_movie_df.csv'))

Trend Analysis of budget, revenue and tatometer ratings

Code

# Calculate total revenue and budget per year
yearly_totals <- movies %>%
  # Extract year from release date
  mutate(
    year = as.numeric(substr(release_date, 1, 4))
  ) %>%
  # Group by year and calculate total metrics
  group_by(year) %>%
  summarise(
    total_revenue = sum(revenue, na.rm = TRUE),
    total_budget = sum(budget, na.rm = TRUE),
    movie_count = n()
  ) %>%
  # Add a profitability ratio column
  mutate(
    yearly_roi = (total_revenue - total_budget) / total_budget * 100
  )

# Create the static ggplot without points
financial_trend_plot <- ggplot(yearly_totals, aes(x = year)) +
  geom_line(aes(y = total_revenue, color = "Revenue"), size = 0.8) +
  geom_line(aes(y = total_budget, color = "Budget"), size = 0.8) +
  scale_y_continuous(
    labels = function(x) paste0("$", round(x / 1e9, 1), "B"),
    name = "Total Amount (Billions USD)"
  ) +
  scale_color_manual(
    values = c("Revenue" = "#2E8B57", "Budget" = "#4169E1"),
    name = "Financial Metric"
  ) +
  labs(
    title = "Movie Industry Financials",
    subtitle = "Total Revenue and Budget Trends",
    x = "Year",
    caption = "Values represent yearly totals across all movies in the dataset"
  ) +
  theme_minimal_vgrid() +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95"),
    plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm")
  )

# Convert the ggplot to an interactive plotly chart
interactive_trend_plot <- ggplotly(financial_trend_plot)

# Display the interactive chart
interactive_trend_plot

Code

# Calculate ratings data
yearly_ratings <- movies %>%
  # Convert release dates to years for temporal analysis
  mutate(
    year = as.numeric(substr(release_date, 1, 4))
  ) %>%
  # Group by year to calculate annual averages
  group_by(year) %>%
  summarise(
    avg_rating = mean(tomatometer_rating, na.rm = TRUE),
    movie_count = n()
  ) %>%
  filter(avg_rating > 10)

# Create the static ggplot without points
trend_plot <- ggplot(yearly_ratings, aes(x = year, y = avg_rating)) +
  geom_line(
    color = "#E64A19",    
    size = 0.8,           
    alpha = 1             
  ) +
  scale_y_continuous(
    name = "Average Tomatometer Rating",
    limits = c(10, 100),
    breaks = seq(10, 100, by = 10)
  ) +
  labs(
    title = "Evolution of Movie Ratings",
    x = "Year"
  ) +
  theme_minimal_vgrid() +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95"),
    plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm")
  )

# Convert the ggplot to an interactive plotly chart
interactive_trend_plot <- ggplotly(trend_plot)

# Display the interactive chart
interactive_trend_plot

The financial trends reveal a significant rise in both budgets and revenues in the movie industry since the 1980s, showcasing its growth in scale and investment, though profitability gaps have widened. In contrast, the average Tomatometer ratings have declined steadily since the 1950s, reflecting evolving audience expectations and critical standards.

Correlation between movies profitability and its audience ratings/critical acclaim

Code

corr <- cor(
    movies$profitability_ratio, 
    movies$tomatometer_rating,
    method = "pearson",
    use = "complete.obs"
)

# Label for annotation
corrLabel <- paste("r =", round(corr, 2))

# Scatterplot with annotation
scatterplot <- movies %>%
    ggplot() +
    geom_point(
        aes(x = profitability_ratio, y = tomatometer_rating),
        size = 1, alpha = 0.7
    ) +
    scale_x_log10() +  
    scale_y_log10() + 
    theme_classic(base_size = 20) + 
    labs(
        x = "Profitability Ratio (log scale)",
        y = "Tomatometer Ratings (log scale)",
        title = "Correlation of Movie Profitability and Ratings"
    ) +
    annotate(
        geom = "text",
        x = 0.01, y = 200,  
        label = corrLabel,
        hjust = 0, size = 7
    )

scatterplot

The scatterplot shows a weak correlation (r = 0.03) between profitability and Tomatometer ratings, indicating that financial success does not reliably translate to critical or audience acclaim.

Comparative Analysis of Tomatometer Ratings and Profitability

Highly Profitable and Certified-Fresh Movies

Comparison of Tomatometer Ratings and Profitability Classes

The waffle plots provide a distributional analysis of Certified-Fresh, Fresh, and Rotten ratings among highly profitable movies and the profitability classes within Certified-Fresh movies.

Code

# Tomatometer summary for highly profitable movies
tomatometer_summary <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(tomatometer_status)) %>%
  count(tomatometer_status) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for tomatometer statuses
tomato_colors <- c(
  "Certified-Fresh" = "#FF6B6B",
  "Fresh" = "#4BC0C0",          
  "Rotten" = "#5C7CFA"           
)

# Waffle plot for highly profitable movies
profitable_waffle_plot <- ggplot(tomatometer_summary) +
  geom_waffle(
    aes(fill = tomatometer_status, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = tomato_colors,
    name = "Tomatometer Status"
  ) +
  coord_equal() +
  labs(
    title = "Tomatometer Ratings in Highly Profitable Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(tomatometer_summary$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 10, color = "gray50"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Profitability summary for Certified-Fresh movies
profitability_summary <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(profitability_class)) %>%
  count(profitability_class) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for profitability classes
profitability_colors <- c(
  "Highly Profitable" = "#FF6B6B",
  "Moderately Profitable" = "#4BC0C0",
  "Unprofitable" = "#5C7CFA"
)

# Waffle plot for Certified-Fresh movies
certified_waffle_plot <- ggplot(profitability_summary) +
  geom_waffle(
    aes(fill = profitability_class, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = profitability_colors,
    name = "Profitability Class"
  ) +
  coord_equal() +
  labs(
    title = "Profitability Classes in Certified-Fresh Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(profitability_summary$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 11, color = "black"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

final_plot <- profitable_waffle_plot + certified_waffle_plot +
  plot_layout(ncol = 2, widths = c(1.1, 1)) & 
  theme(
    plot.margin = ggplot2::margin(t = 20, r = 20, b = 20, l = 20), 
    legend.position = "right",  
    legend.title = element_text(size = 15),  
    legend.text = element_text(size = 15),  
    legend.spacing.x = unit(15, "pt")  
  )

# Adjust titles for better spacing and readability
final_plot <- final_plot & 
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 15)),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray30")
  )

print(final_plot)

The distributional analyses provided further clarity. Highly profitable movies show a three-way distribution, with Certified-Fresh movies making up the largest segment (around 45%), followed by 19% and 36% Fresh and Rotten ratings respectively. Examining Certified-Fresh movies specifically, they tend to be financially successful, with about 57% being highly profitable, 23% moderately profitable, and only 20% unprofitable.

Unprofitable and Rotten Movies

Comparison of Tomatometer Ratings and Profitability Classes

This second waffle plots examine the distribution of Certified-Fresh, Fresh, and Rotten ratings within unprofitable movies and the profitability classes among Rotten movies.

Code

# Tomatometer summary for unprofitable movies
tomatometer_unprofitable <- movies %>%
  filter(profitability_class == "Unprofitable", !is.na(tomatometer_status)) %>%
  count(tomatometer_status) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for tomatometer statuses
tomato_colors <- c(
  "Certified-Fresh" = "#FF6B6B",
  "Fresh" = "#4BC0C0",          
  "Rotten" = "#5C7CFA"           
)

# Waffle plot for unprofitable movies
unprofitable_waffle_plot <- ggplot(tomatometer_unprofitable) +
  geom_waffle(
    aes(fill = tomatometer_status, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = tomato_colors,
    name = "Tomatometer Status"
  ) +
  coord_equal() +
  labs(
    title = "Tomatometer Ratings in Unprofitable Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(tomatometer_unprofitable$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 10, color = "gray50"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Profitability summary for Rotten movies
profitability_rotten <- movies %>%
  filter(tomatometer_status == "Rotten", !is.na(profitability_class)) %>%
  count(profitability_class) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for profitability classes
profitability_colors <- c(
  "Highly Profitable" = "#FF6B6B",
  "Moderately Profitable" = "#4BC0C0",
  "Unprofitable" = "#5C7CFA"
)

# Waffle plot for Rotten movies
rotten_waffle_plot <- ggplot(profitability_rotten) +
  geom_waffle(
    aes(fill = profitability_class, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = profitability_colors,
    name = "Profitability Class"
  ) +
  coord_equal() +
  labs(
    title = "Profitability Classes in Rotten Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(profitability_rotten$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 11, color = "black"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Combine the two plots side-by-side
final_plot <- unprofitable_waffle_plot + rotten_waffle_plot +
  plot_layout(ncol = 2, widths = c(1.1, 1)) &  
  theme(
    plot.margin = ggplot2::margin(t = 20, r = 20, b = 20, l = 20),
    legend.position = "right",
    legend.title = element_text(size = 15),
    legend.text = element_text(size = 15),
    legend.spacing.x = unit(15, "pt")
  )

# Adjust titles for consistency and readability
final_plot <- final_plot & 
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 15)),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray30")
  )

print(final_plot)

These waffle plots demonstrate that unprofitable movies are predominantly rated as Rotten (approximately 70%), with smaller proportions of Fresh (20%) and Certified-Fresh (10%) ratings. However, interestingly, when examining Rotten-rated movies, they show surprising financial diversity - about 40% are highly profitable, 40% moderately profitable, and only 20% unprofitable. This highlights that while poor critical reception often correlates with poor financial performance, many poorly-rated movies still achieve significant financial success.

Analysis of Highly Profitable Movies and Critically Acclaimed Movies (Certified-Fresh) Across Actors, Directors, Genres, and Content Ratings

Lead Actors in Highly Profitable and Certified-Fresh Movies

This subsection analyzes actors with at least five appearances in movies categorized as “Highly Profitable” and “Certified-Fresh.”

Code

highly_profitable_actors <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(lead_actor)) %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Then we created the highly rated actors dataset
highly_rated_actors <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(lead_actor)) %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Now we combine them
combined_data <- full_join(
  highly_profitable_actors %>% rename(profitable_apps = appearances),
  highly_rated_actors %>% rename(rated_apps = appearances),
  by = "lead_actor"
) %>%
  replace_na(list(profitable_apps = 0, rated_apps = 0)) %>%
  filter(profitable_apps > 5 | rated_apps > 5)

# Then we made the plot
ggplot(combined_data) +
  geom_col(aes(x = reorder(lead_actor, profitable_apps), y = -profitable_apps), 
          fill = "#69b3e7", width = 0.8) +
  geom_col(aes(x = reorder(lead_actor, profitable_apps), y = rated_apps),
          fill = "#1e4b7a", width = 0.8) +
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Actors",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    # Changed actor names to black and kept size
    axis.text.y = element_text(size = 10, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    # Removed horizontal gridlines, kept vertical ones
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)  
  )

Lead Directors in Highly Profitable and Certified-Fresh Movies

This subsection examines directors who have made significant contributions to movies that are both highly profitable and critically acclaimed, with at least five appearances.

Code

highly_profitable_directors <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Then create the highly rated directors dataset
highly_rated_directors <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Now combine them
combined_directors <- full_join(
  highly_profitable_directors %>% rename(profitable_apps = appearances),
  highly_rated_directors %>% rename(rated_apps = appearances),
  by = "lead_director"
) %>%
  replace_na(list(profitable_apps = 0, rated_apps = 0)) %>%
  filter(profitable_apps > 5 | rated_apps > 5)

# Create the diverging bar plot for directors
ggplot(combined_directors) +
  geom_col(aes(x = reorder(lead_director, profitable_apps), y = -profitable_apps), 
          fill = "#69b3e7", width = 0.8) +
  geom_col(aes(x = reorder(lead_director, profitable_apps), y = rated_apps),
          fill = "#1e4b7a", width = 0.8) +
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Directors",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10, color = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    # After coord_flip, panel.grid.major.x creates vertical lines
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)
  )

Content Analysis of Highly Profitable and Certified-Fresh Movies

This subsection delves into the genres and content ratings prevalent in highly profitable and Certified-Fresh movies, identifying patterns and their implications.

Code

# First create the content ratings for highly profitable movies
content_ratings_profitable <- movies %>%
  filter(profitability_class == "Highly Profitable") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Then create the content ratings for highly rated movies
content_ratings_rated <- movies %>%
  filter(tomatometer_status == "Certified-Fresh") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Now combine them
combined_ratings <- full_join(
  content_ratings_profitable %>% rename(profitable_count = count),
  content_ratings_rated %>% rename(rated_count = count),
  by = "content_rating"
) %>%
  replace_na(list(profitable_count = 0, rated_count = 0))

ggplot(combined_ratings) +
  # Add vertical gridlines
  geom_vline(xintercept = seq(-300, 300, 50), color = "gray90") +
  # Add thicker zero line
  geom_vline(xintercept = 0, color = "gray40", linewidth = 1) +
  # Left side (profitable) segments and points
  geom_segment(aes(x = -profitable_count, xend = 0, 
                   y = reorder(content_rating, profitable_count), 
                   yend = reorder(content_rating, profitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = -profitable_count, 
                 y = reorder(content_rating, profitable_count)), 
             size = 3, color = '#69b3e7') +
  # Right side (rated) segments and points
  geom_segment(aes(x = 0, xend = rated_count, 
                   y = reorder(content_rating, profitable_count), 
                   yend = reorder(content_rating, profitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = rated_count, 
                 y = reorder(content_rating, profitable_count)), 
             size = 3, color = '#1e4b7a') +
  # Customize the theme
  theme_minimal() +
  labs(title = "Content Ratings in Movies",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = "Count of Movies",
       y = NULL) +
  theme(
    axis.text.y = element_text(size = 11, colour = "black"),
    axis.text.x = element_text(size = 10, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  # Extend x-axis range
  scale_x_continuous(
    labels = abs,
    breaks = seq(-500, 500, 50),
    limits = c(-500, 500)  
  )

Analysis of Unprofitable Movies and Poorly Rated Movies (Rotten) Across Actors, Directors, Genres, and Content Ratings

Lead Actors in Unprofitable and Rotten Movies

This subsection analyzes actors with at least five notable appearances in movies categorized as “Unprofitable” and “Rotten.”

Code

# Analysis of lead actors in unprofitable movies
unprofitable_actors <- movies %>%
  filter(profitability_class == "Unprofitable") %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

rotten_actors <- movies %>%
  filter(tomatometer_status == "Rotten") %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%  
  arrange(desc(appearances))

# Combine the datasets
combined_negative <- full_join(
  unprofitable_actors %>% rename(unprofitable_apps = appearances),
  rotten_actors %>% rename(rotten_apps = appearances),
  by = "lead_actor"
) %>%
  replace_na(list(unprofitable_apps = 0, rotten_apps = 0)) %>%
  filter(unprofitable_apps > 5 | rotten_apps > 5)

combined_negative <- combined_negative %>%
  arrange(desc(unprofitable_apps + rotten_apps)) %>%
  head(30)

# Create the diverging bar plot
ggplot(combined_negative) +
  geom_col(aes(x = reorder(lead_actor, unprofitable_apps), y = -unprofitable_apps), 
          fill = "#FF9999", width = 0.8) + 
  geom_col(aes(x = reorder(lead_actor, unprofitable_apps), y = rotten_apps),
          fill = "#CC3333", width = 0.8) +  
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Actors in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
    aspect.ratio = 1.5
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-20, 20, 2)
  )

Lead Directors in Unprofitable and Rotten Movies

This subsection examines directors with at least five notable contributions to movies categorized as both “Unprofitable” and “Poorly Rated.”

Code

unprofitable_directors <- movies %>%
  filter(profitability_class == "Unprofitable", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Create rotten directors dataset
rotten_directors <- movies %>%
  filter(tomatometer_status == "Rotten", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Combine datasets
combined_negative_directors <- full_join(
  unprofitable_directors %>% rename(unprofitable_apps = appearances),
  rotten_directors %>% rename(rotten_apps = appearances),
  by = "lead_director"
) %>%
  replace_na(list(unprofitable_apps = 0, rotten_apps = 0)) %>%
  filter(unprofitable_apps > 5 | rotten_apps > 5)

# Create the diverging bar plot for directors
ggplot(combined_negative_directors) +
  geom_col(aes(x = reorder(lead_director, unprofitable_apps), y = -unprofitable_apps), 
          fill = "#FF9999", width = 0.8) +  
  geom_col(aes(x = reorder(lead_director, unprofitable_apps), y = rotten_apps),
          fill = "#CC3333", width = 0.8) +  
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Directors in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)
  )

Content Analysis of Unprofitable and Rotten Movies

This subsection delves into the genres and content ratings prevalent in unprofitable and Rotten movies, identifying patterns and their implications.

Code

# Analysis of content ratings in unprofitable movies
content_ratings_unprofitable <- movies %>%
  filter(profitability_class == "Unprofitable") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Then create the rotten content ratings dataset
content_ratings_rotten <- movies %>%
  filter(tomatometer_status == "Rotten") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Now combine them
combined_ratings_negative <- full_join(
  content_ratings_unprofitable %>% rename(unprofitable_count = count),
  content_ratings_rotten %>% rename(rotten_count = count),
  by = "content_rating"
) %>%
  replace_na(list(unprofitable_count = 0, rotten_count = 0))

# Create the diverging lollipop plot
ggplot(combined_ratings_negative) +
  # Segments for unprofitable movies (left side)
  geom_segment(aes(x = -unprofitable_count, xend = 0, 
                   y = reorder(content_rating, unprofitable_count), 
                   yend = reorder(content_rating, unprofitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = -unprofitable_count, 
                 y = reorder(content_rating, unprofitable_count)), 
             size = 3, color = '#FF9999') + 
  
  # Segments for rotten movies (right side)
  geom_segment(aes(x = 0, xend = rotten_count, 
                   y = reorder(content_rating, unprofitable_count), 
                   yend = reorder(content_rating, unprofitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = rotten_count, 
                 y = reorder(content_rating, unprofitable_count)), 
             size = 3, color = '#CC3333') +  
  
  # Customize the theme
  theme_minimal() +
  labs(title = "Content Ratings in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = "Count of Movies",
       y = NULL) +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  scale_x_continuous(
    labels = abs,
    breaks = seq(-500, 500, 50)  
  )

Analysis of lead actors, directors, and content ratings revealed that most individuals associated with highly profitable movies were absent from Certified-Fresh movies, with only a few exceptions. This underscores the often divergent nature of commercial success and critical acclaim. Interestingly, content ratings exhibited consistent patterns across all categories, indicating their crucial role in influencing both profitability and ratings.

Do specific genres consistently yield movies that are highly profitable, highly rated (Certified Fresh), or both?

For this question, we implemented the following steps:

Contingency table:
Summarized movie counts by genre across three categories:
- Highly Profitable
- Certified Fresh
- Both (i.e., movies that are both highly profitable and certified fresh).
Chi-Square test:
Assessed the statistical significance of the relationship between genres and the three categories.
Residuals:
Identify which genres contribute most to the observed differences in genre distributions.
Create a faceted bar plot:
Compare the counts of movies in each genre across the three categories visually using a faceted bar plot.

Results:

Chi-Squared Statistic: 53.883
Degrees of Freedom: 26
P-Value: 0.00105

The low (p)-value indicates a statistically significant relationship between genres and the three categories. Thus certain genres, such as drama and comedy, disproportionately contribute to the trends, showing consistent success in profitability, ratings, or both.

Code

# Filter the dataset to include valid rows for analysis and create binary indicators
# 'highly_profitable' for movies classified as "Highly Profitable"
# 'certified_fresh' for movies classified as "Certified Fresh"
genre_data <- movies %>%
  filter(!is.na(primary_genre), !is.na(profitability_class), !is.na(tomatometer_status)) %>%
  mutate(
    highly_profitable = ifelse(profitability_class == "Highly Profitable", 1, 0),
    certified_fresh = ifelse(tomatometer_status == "Certified-Fresh", 1, 0)
  )

# Group the data by genre and calculate counts for:
# - Highly profitable movies
# - Certified Fresh movies
# - Movies that are both
# Additionally, calculate the total count of movies in each genre
genre_contingency <- genre_data %>%
  group_by(primary_genre) %>%
  summarise(
    highly_profitable_count = sum(highly_profitable),
    certified_fresh_count = sum(certified_fresh),
    both_count = sum(highly_profitable & certified_fresh),
    total_count = n()
  ) %>%
  arrange(desc(total_count)) 

# To view the resulting contingency table
#print(genre_contingency)

# Aggregate genres with fewer than 20 movies into an "Other" category
# Summarize the counts again to consolidate small categories
genre_contingency_clean <- genre_contingency %>%
  mutate(
    primary_genre = ifelse(total_count < 20, "Other", primary_genre)
  ) %>%
  group_by(primary_genre) %>%
  summarise(
    highly_profitable_count = sum(highly_profitable_count),
    certified_fresh_count = sum(certified_fresh_count),
    both_count = sum(both_count),
    total_count = sum(total_count)
  )

#  To view the cleaned contingency table with aggregated genres
#print(genre_contingency_clean)

# Create a matrix suitable for statistical testing
# Select relevant columns (counts for different categories) and pivot into a matrix
genre_matrix <- genre_contingency_clean %>%
  select(primary_genre, highly_profitable_count, certified_fresh_count, both_count) %>%
  pivot_longer(
    cols = -primary_genre,
    names_to = "category",
    values_to = "count"
  ) %>%
  pivot_wider(
    names_from = category,
    values_from = count,
    values_fill = 0  
  ) %>%
  column_to_rownames(var = "primary_genre")  # Make genres the row names

# Convert to a matrix for testing and print
genre_matrix <- as.matrix(genre_matrix)
#print(genre_matrix)

# Here we convert the matrix to numeric format for statistical testing
genre_matrix_numeric <- as.matrix(genre_matrix)

# Here we perform a chi-square test to evaluate whether the distribution
# of movie counts differs significantly across categories
genre_chisq <- chisq.test(genre_matrix_numeric)

# Print the test result
# print(genre_chisq)

# Extract standardized residuals from the chi-square test
# Residuals indicate which genres contribute most to any significant differences
residuals <- genre_chisq$stdres
#print(residuals)

# Step 7: Preparing our data for visualization
# we converted the matrix to a long-format data frame for visualization
# and added readable labels for the metric names
genre_df <- as.data.frame(genre_matrix)
genre_df$Genre <- rownames(genre_matrix)

genre_long <- genre_df %>%
  pivot_longer(cols = -Genre, names_to = "Metric", values_to = "Count") %>%
  mutate(Metric = recode(Metric,
                         "highly_profitable_count" = "Highly Profitable",
                         "certified_fresh_count" = "Certified Fresh",
                         "both_count" = "Both"))

# Step 8: We sorted the data by counts within facets for each metric in descending order
genre_long_sorted <- genre_long %>%
  group_by(Metric) %>%
  mutate(Genre = reorder_within(Genre, -Count, Metric)) %>%  
  ungroup()

# Step 9: we generated a faceted bar plot to compare genre distributions
# across "Highly Profitable," "Certified Fresh," and "Both" categories
genre_order <- genre_long %>%
  filter(Metric == "Both") %>%
  arrange(desc(Count)) %>%
  pull(Genre) %>% 
    rev()

# Update the plot using this fixed order
genre_faceted_sorted_plot <- genre_long %>%
  mutate(Genre = factor(Genre, levels = genre_order)) %>%
  ggplot(aes(y = Genre, x = Count, fill = Metric)) +
  geom_col(width = 0.7) +
  facet_grid(. ~ Metric, scales = "free_x") +
  labs(
    title = "Genre Distribution in Movies",
    subtitle = "Highly Profitable vs Certified Fresh vs Both Categories",
    x = "Count of Movies",
    y = NULL
  ) +
  scale_fill_manual(
    values = c(
      "Highly Profitable" = "#69b3e7",
      "Certified Fresh" = "#1e4b7a",
      "Both" = "#2E5B88"
    ),
    name = "Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "none",
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
    panel.spacing = unit(2, "lines")
  )

print(genre_faceted_sorted_plot)

7. Can we develop a classification algorithm to predict a movie’s Tomatometer status?

Steps Taken:

Data Preparation:
- Selected key features: budget, revenue, popularity, etc.
- Created derived variables:
  - popularity_score: Product of popularity and vote_average.
  - revenue_efficiency: Ratio of revenue to budget.
- Removed missing values and converted categorical variables (e.g., tomatometer_status, primary_genre) to factors.
Data Splitting:
- Divided data into 80% training and 20% testing subsets using a fixed random seed for reproducibility.
Model Training:
- Built a Random Forest classifier with:
  - ntree = 2000 (number of trees).
  - mtry = 6 (variables per split).
  - Balanced sampling to account for class imbalances.
Model Evaluation:
- Used the test set to generate predictions.
- Assessed model performance using a confusion matrix.
- Visualized key predictors using a variable importance plot.

Classification Model Performance

Confusion Matrix

Prediction	Certified-Fresh	Fresh	Rotten
Certified-Fresh	135	28	27
Fresh	7	23	12
Rotten	35	51	248

Key Metrics

Metric	Value
Accuracy	0.7173
95% CI	(0.6783, 0.7541)
No Information Rate	0.5071
Kappa	0.5146
Mcnemar’s Test P-Value	3.154e-08

Statistics by Class

Metric	Certified-Fresh	Fresh	Rotten
Sensitivity	0.7627	0.22549	0.8641
Specificity	0.8586	0.95905	0.6918
Positive Predictive Value	0.7105	0.54762	0.7425
Negative Predictive Value	0.8883	0.84924	0.8319
Prevalence	0.3127	0.18021	0.5071
Detection Rate	0.2385	0.04064	0.4382
Detection Prevalence	0.3357	0.07420	0.5901
Balanced Accuracy	0.8107	0.59227	0.7779

Conclusions

Our analysis reveals that while the movie industry has seen significant growth in budgets and revenues, profitability is weakly correlated with audience ratings, showing that financial success does not ensure critical acclaim. Highly profitable movies show a balanced distribution across all rating categories, while Certified-Fresh movies demonstrate a strong tendency toward financial success. Key factors such as genre and content ratings significantly influence both profitability and ratings. Lastly, our random forest model achieved 72% accuracy in predicting Certified-Fresh and Rotten movies, highlighting its strength in identifying extreme cases of audience and critic evaluations.

Explore the data and trends further with our interactive Shiny app, featuring dynamic visualizations and insights. Access it here.

Attribution

All members contributed equally.

APPENDIX

A. Data Dictionary

Column Name	Data Type	Description
`id`	Integer	Unique identifier for each movie.
`budget`	Numeric	The production budget of the movie in US dollars.
`genres`	String	Comma-separated list of genres associated with the movie.
`popularity`	Numeric	Popularity score of the movie based on user activity, such as views or votes.
`production_companies`	String	Comma-separated list of companies involved in the movie’s production.
`production_countries`	String	Comma-separated list of countries where the movie was produced.
`release_date`	String	Release date of the movie in the format YYYY-MM-DD.
`revenue`	Numeric	Revenue earned by the movie in US dollars.
`runtime`	Integer	Duration of the movie in minutes.
`title`	String	Title of the movie.
`vote_average`	Numeric	Average rating of the movie on a scale of 1 to 10 based on user votes.
`vote_count`	Integer	Total number of user votes for the movie.
`cast`	String	Comma-separated list of prominent actors featured in the movie.
`content_rating`	String	Official content rating (e.g., G, PG, PG-13, R) assigned to the movie.
`directors`	String	Comma-separated list of directors involved in the movie’s production.
`tomatometer_status`	String	Status of the movie on Rotten Tomatoes (e.g., “Fresh”, “Rotten”, “Certified-Fresh”).
`tomatometer_rating`	Integer	Percentage score of the movie on Rotten Tomatoes based on critic reviews.
`tomatometer_count`	Integer	Number of critic reviews considered for the Rotten Tomatoes score.
`primary_genre`	String	The primary genre of the movie.
`primary_company`	String	The primary production company responsible for the movie.
`lead_actor`	String	Name of the lead actor in the movie.
`lead_director`	String	Name of the lead director in the movie.
`profitability_ratio`	Numeric	The ratio of revenue to budget, representing profitability.
`profitability_class`	String	Profitability classification (e.g., “Unprofitable”, “Highly Profitable”) based on the profitability ratio.

B. R Code Summary

Code

# Load libraries and settings here
library(here)
library(cowplot)
library(tidyverse)
library(viridis)
library(ggrepel)
library(gganimate)
library(magick)
library(gifski)
library(transformr)
library(reticulate)
library(lubridate)
library(janitor)
library(jsonlite)
library(purrr)
library(waffle)
library(tibble)
library(caret)
library(randomForest)
library(e1071)
library(forcats)
library(patchwork)
library(tidytext)
library(ggplot2)
library(plotly)


knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  comment = "#>",
  fig.path = "figs/", 
  fig.width = 8, 
  fig.height = 6, 
  out.width = "100%",
  fig.retina = 4 
)

# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_bw(base_size = 20))

# Load the cleaned_movies dataset
movies <- read.csv(here::here('data_processed', 'cleaned_movie_df.csv'))

# Calculate total revenue and budget per year
yearly_totals <- movies %>%
  # Extract year from release date
  mutate(
    year = as.numeric(substr(release_date, 1, 4))
  ) %>%
  # Group by year and calculate total metrics
  group_by(year) %>%
  summarise(
    total_revenue = sum(revenue, na.rm = TRUE),
    total_budget = sum(budget, na.rm = TRUE),
    movie_count = n()
  ) %>%
  # Add a profitability ratio column
  mutate(
    yearly_roi = (total_revenue - total_budget) / total_budget * 100
  )

# Create the static ggplot without points
financial_trend_plot <- ggplot(yearly_totals, aes(x = year)) +
  geom_line(aes(y = total_revenue, color = "Revenue"), size = 0.8) +
  geom_line(aes(y = total_budget, color = "Budget"), size = 0.8) +
  scale_y_continuous(
    labels = function(x) paste0("$", round(x / 1e9, 1), "B"),
    name = "Total Amount (Billions USD)"
  ) +
  scale_color_manual(
    values = c("Revenue" = "#2E8B57", "Budget" = "#4169E1"),
    name = "Financial Metric"
  ) +
  labs(
    title = "Movie Industry Financials",
    subtitle = "Total Revenue and Budget Trends",
    x = "Year",
    caption = "Values represent yearly totals across all movies in the dataset"
  ) +
  theme_minimal_vgrid() +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 10),
    legend.position = "bottom",
    legend.title = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95"),
    plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm")
  )

# Convert the ggplot to an interactive plotly chart
interactive_trend_plot <- ggplotly(financial_trend_plot)

# Display the interactive chart
interactive_trend_plot

# Calculate ratings data
yearly_ratings <- movies %>%
  # Convert release dates to years for temporal analysis
  mutate(
    year = as.numeric(substr(release_date, 1, 4))
  ) %>%
  # Group by year to calculate annual averages
  group_by(year) %>%
  summarise(
    avg_rating = mean(tomatometer_rating, na.rm = TRUE),
    movie_count = n()
  ) %>%
  filter(avg_rating > 10)

# Create the static ggplot without points
trend_plot <- ggplot(yearly_ratings, aes(x = year, y = avg_rating)) +
  geom_line(
    color = "#E64A19",    
    size = 0.8,           
    alpha = 1             
  ) +
  scale_y_continuous(
    name = "Average Tomatometer Rating",
    limits = c(10, 100),
    breaks = seq(10, 100, by = 10)
  ) +
  labs(
    title = "Evolution of Movie Ratings",
    x = "Year"
  ) +
  theme_minimal_vgrid() +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "gray95"),
    plot.margin = unit(c(0.8, 0.8, 0.8, 0.8), "cm")
  )

# Convert the ggplot to an interactive plotly chart
interactive_trend_plot <- ggplotly(trend_plot)

# Display the interactive chart
interactive_trend_plot

corr <- cor(
    movies$profitability_ratio, 
    movies$tomatometer_rating,
    method = "pearson",
    use = "complete.obs"
)

# Label for annotation
corrLabel <- paste("r =", round(corr, 2))

# Scatterplot with annotation
scatterplot <- movies %>%
    ggplot() +
    geom_point(
        aes(x = profitability_ratio, y = tomatometer_rating),
        size = 1, alpha = 0.7
    ) +
    scale_x_log10() +  
    scale_y_log10() + 
    theme_classic(base_size = 20) + 
    labs(
        x = "Profitability Ratio (log scale)",
        y = "Tomatometer Ratings (log scale)",
        title = "Correlation of Movie Profitability and Ratings"
    ) +
    annotate(
        geom = "text",
        x = 0.01, y = 200,  
        label = corrLabel,
        hjust = 0, size = 7
    )

scatterplot
# Tomatometer summary for highly profitable movies
tomatometer_summary <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(tomatometer_status)) %>%
  count(tomatometer_status) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for tomatometer statuses
tomato_colors <- c(
  "Certified-Fresh" = "#FF6B6B",
  "Fresh" = "#4BC0C0",          
  "Rotten" = "#5C7CFA"           
)

# Waffle plot for highly profitable movies
profitable_waffle_plot <- ggplot(tomatometer_summary) +
  geom_waffle(
    aes(fill = tomatometer_status, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = tomato_colors,
    name = "Tomatometer Status"
  ) +
  coord_equal() +
  labs(
    title = "Tomatometer Ratings in Highly Profitable Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(tomatometer_summary$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 10, color = "gray50"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Profitability summary for Certified-Fresh movies
profitability_summary <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(profitability_class)) %>%
  count(profitability_class) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for profitability classes
profitability_colors <- c(
  "Highly Profitable" = "#FF6B6B",
  "Moderately Profitable" = "#4BC0C0",
  "Unprofitable" = "#5C7CFA"
)

# Waffle plot for Certified-Fresh movies
certified_waffle_plot <- ggplot(profitability_summary) +
  geom_waffle(
    aes(fill = profitability_class, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = profitability_colors,
    name = "Profitability Class"
  ) +
  coord_equal() +
  labs(
    title = "Profitability Classes in Certified-Fresh Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(profitability_summary$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 11, color = "black"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

final_plot <- profitable_waffle_plot + certified_waffle_plot +
  plot_layout(ncol = 2, widths = c(1.1, 1)) & 
  theme(
    plot.margin = ggplot2::margin(t = 20, r = 20, b = 20, l = 20), 
    legend.position = "right",  
    legend.title = element_text(size = 15),  
    legend.text = element_text(size = 15),  
    legend.spacing.x = unit(15, "pt")  
  )

# Adjust titles for better spacing and readability
final_plot <- final_plot & 
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 15)),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray30")
  )

print(final_plot)

# Tomatometer summary for unprofitable movies
tomatometer_unprofitable <- movies %>%
  filter(profitability_class == "Unprofitable", !is.na(tomatometer_status)) %>%
  count(tomatometer_status) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for tomatometer statuses
tomato_colors <- c(
  "Certified-Fresh" = "#FF6B6B",
  "Fresh" = "#4BC0C0",          
  "Rotten" = "#5C7CFA"           
)

# Waffle plot for unprofitable movies
unprofitable_waffle_plot <- ggplot(tomatometer_unprofitable) +
  geom_waffle(
    aes(fill = tomatometer_status, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = tomato_colors,
    name = "Tomatometer Status"
  ) +
  coord_equal() +
  labs(
    title = "Tomatometer Ratings in Unprofitable Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(tomatometer_unprofitable$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 10, color = "gray50"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Profitability summary for Rotten movies
profitability_rotten <- movies %>%
  filter(tomatometer_status == "Rotten", !is.na(profitability_class)) %>%
  count(profitability_class) %>%
  mutate(
    fraction = n / sum(n) * 100,
    parts = round(fraction)
  )

# Define color palette for profitability classes
profitability_colors <- c(
  "Highly Profitable" = "#FF6B6B",
  "Moderately Profitable" = "#4BC0C0",
  "Unprofitable" = "#5C7CFA"
)

# Waffle plot for Rotten movies
rotten_waffle_plot <- ggplot(profitability_rotten) +
  geom_waffle(
    aes(fill = profitability_class, values = parts),
    color = "white",
    size = 0.5,
    n_rows = 10,
    flip = TRUE
  ) +
  scale_fill_manual(
    values = profitability_colors,
    name = "Profitability Class"
  ) +
  coord_equal() +
  labs(
    title = "Profitability Classes in Rotten Movies",
    subtitle = sprintf("Each square ≈ %.1f%% of movies", 100 / sum(profitability_rotten$parts))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold", margin = ggplot2::margin(b = 10)),
    plot.subtitle = element_text(size = 11, color = "black"),
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

# Combine the two plots side-by-side
final_plot <- unprofitable_waffle_plot + rotten_waffle_plot +
  plot_layout(ncol = 2, widths = c(1.1, 1)) &  
  theme(
    plot.margin = ggplot2::margin(t = 20, r = 20, b = 20, l = 20),
    legend.position = "right",
    legend.title = element_text(size = 15),
    legend.text = element_text(size = 15),
    legend.spacing.x = unit(15, "pt")
  )

# Adjust titles for consistency and readability
final_plot <- final_plot & 
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 15)),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray30")
  )

print(final_plot)
highly_profitable_actors <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(lead_actor)) %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Then we created the highly rated actors dataset
highly_rated_actors <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(lead_actor)) %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Now we combine them
combined_data <- full_join(
  highly_profitable_actors %>% rename(profitable_apps = appearances),
  highly_rated_actors %>% rename(rated_apps = appearances),
  by = "lead_actor"
) %>%
  replace_na(list(profitable_apps = 0, rated_apps = 0)) %>%
  filter(profitable_apps > 5 | rated_apps > 5)

# Then we made the plot
ggplot(combined_data) +
  geom_col(aes(x = reorder(lead_actor, profitable_apps), y = -profitable_apps), 
          fill = "#69b3e7", width = 0.8) +
  geom_col(aes(x = reorder(lead_actor, profitable_apps), y = rated_apps),
          fill = "#1e4b7a", width = 0.8) +
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Actors",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    # Changed actor names to black and kept size
    axis.text.y = element_text(size = 10, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    # Removed horizontal gridlines, kept vertical ones
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)  
  )
highly_profitable_directors <- movies %>%
  filter(profitability_class == "Highly Profitable", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Then create the highly rated directors dataset
highly_rated_directors <- movies %>%
  filter(tomatometer_status == "Certified-Fresh", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Now combine them
combined_directors <- full_join(
  highly_profitable_directors %>% rename(profitable_apps = appearances),
  highly_rated_directors %>% rename(rated_apps = appearances),
  by = "lead_director"
) %>%
  replace_na(list(profitable_apps = 0, rated_apps = 0)) %>%
  filter(profitable_apps > 5 | rated_apps > 5)

# Create the diverging bar plot for directors
ggplot(combined_directors) +
  geom_col(aes(x = reorder(lead_director, profitable_apps), y = -profitable_apps), 
          fill = "#69b3e7", width = 0.8) +
  geom_col(aes(x = reorder(lead_director, profitable_apps), y = rated_apps),
          fill = "#1e4b7a", width = 0.8) +
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Directors",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10, color = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    # After coord_flip, panel.grid.major.x creates vertical lines
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)
  )
# First create the content ratings for highly profitable movies
content_ratings_profitable <- movies %>%
  filter(profitability_class == "Highly Profitable") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Then create the content ratings for highly rated movies
content_ratings_rated <- movies %>%
  filter(tomatometer_status == "Certified-Fresh") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Now combine them
combined_ratings <- full_join(
  content_ratings_profitable %>% rename(profitable_count = count),
  content_ratings_rated %>% rename(rated_count = count),
  by = "content_rating"
) %>%
  replace_na(list(profitable_count = 0, rated_count = 0))

ggplot(combined_ratings) +
  # Add vertical gridlines
  geom_vline(xintercept = seq(-300, 300, 50), color = "gray90") +
  # Add thicker zero line
  geom_vline(xintercept = 0, color = "gray40", linewidth = 1) +
  # Left side (profitable) segments and points
  geom_segment(aes(x = -profitable_count, xend = 0, 
                   y = reorder(content_rating, profitable_count), 
                   yend = reorder(content_rating, profitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = -profitable_count, 
                 y = reorder(content_rating, profitable_count)), 
             size = 3, color = '#69b3e7') +
  # Right side (rated) segments and points
  geom_segment(aes(x = 0, xend = rated_count, 
                   y = reorder(content_rating, profitable_count), 
                   yend = reorder(content_rating, profitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = rated_count, 
                 y = reorder(content_rating, profitable_count)), 
             size = 3, color = '#1e4b7a') +
  # Customize the theme
  theme_minimal() +
  labs(title = "Content Ratings in Movies",
       subtitle = "Highly Profitable (light blue) vs Certified-Fresh (dark blue)",
       x = "Count of Movies",
       y = NULL) +
  theme(
    axis.text.y = element_text(size = 11, colour = "black"),
    axis.text.x = element_text(size = 10, colour = "black"),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  # Extend x-axis range
  scale_x_continuous(
    labels = abs,
    breaks = seq(-500, 500, 50),
    limits = c(-500, 500)  
  )
# Analysis of lead actors in unprofitable movies
unprofitable_actors <- movies %>%
  filter(profitability_class == "Unprofitable") %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

rotten_actors <- movies %>%
  filter(tomatometer_status == "Rotten") %>%
  group_by(lead_actor) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%  
  arrange(desc(appearances))

# Combine the datasets
combined_negative <- full_join(
  unprofitable_actors %>% rename(unprofitable_apps = appearances),
  rotten_actors %>% rename(rotten_apps = appearances),
  by = "lead_actor"
) %>%
  replace_na(list(unprofitable_apps = 0, rotten_apps = 0)) %>%
  filter(unprofitable_apps > 5 | rotten_apps > 5)

combined_negative <- combined_negative %>%
  arrange(desc(unprofitable_apps + rotten_apps)) %>%
  head(30)

# Create the diverging bar plot
ggplot(combined_negative) +
  geom_col(aes(x = reorder(lead_actor, unprofitable_apps), y = -unprofitable_apps), 
          fill = "#FF9999", width = 0.8) + 
  geom_col(aes(x = reorder(lead_actor, unprofitable_apps), y = rotten_apps),
          fill = "#CC3333", width = 0.8) +  
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Actors in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
    aspect.ratio = 1.5
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-20, 20, 2)
  )
unprofitable_directors <- movies %>%
  filter(profitability_class == "Unprofitable", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Create rotten directors dataset
rotten_directors <- movies %>%
  filter(tomatometer_status == "Rotten", !is.na(lead_director)) %>%
  group_by(lead_director) %>%
  summarise(appearances = n()) %>%
  filter(appearances > 5) %>%
  arrange(desc(appearances))

# Combine datasets
combined_negative_directors <- full_join(
  unprofitable_directors %>% rename(unprofitable_apps = appearances),
  rotten_directors %>% rename(rotten_apps = appearances),
  by = "lead_director"
) %>%
  replace_na(list(unprofitable_apps = 0, rotten_apps = 0)) %>%
  filter(unprofitable_apps > 5 | rotten_apps > 5)

# Create the diverging bar plot for directors
ggplot(combined_negative_directors) +
  geom_col(aes(x = reorder(lead_director, unprofitable_apps), y = -unprofitable_apps), 
          fill = "#FF9999", width = 0.8) +  
  geom_col(aes(x = reorder(lead_director, unprofitable_apps), y = rotten_apps),
          fill = "#CC3333", width = 0.8) +  
  geom_hline(yintercept = 0, color = "gray40") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Lead Directors in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = NULL,
       y = "Number of Appearances") +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
  ) +
  scale_y_continuous(
    labels = abs,
    breaks = seq(-21, 21, 2)
  )
# Analysis of content ratings in unprofitable movies
content_ratings_unprofitable <- movies %>%
  filter(profitability_class == "Unprofitable") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Then create the rotten content ratings dataset
content_ratings_rotten <- movies %>%
  filter(tomatometer_status == "Rotten") %>%
  group_by(content_rating) %>%
  summarise(count = n()) %>%
  mutate(content_rating = fct_reorder(content_rating, count))

# Now combine them
combined_ratings_negative <- full_join(
  content_ratings_unprofitable %>% rename(unprofitable_count = count),
  content_ratings_rotten %>% rename(rotten_count = count),
  by = "content_rating"
) %>%
  replace_na(list(unprofitable_count = 0, rotten_count = 0))

# Create the diverging lollipop plot
ggplot(combined_ratings_negative) +
  # Segments for unprofitable movies (left side)
  geom_segment(aes(x = -unprofitable_count, xend = 0, 
                   y = reorder(content_rating, unprofitable_count), 
                   yend = reorder(content_rating, unprofitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = -unprofitable_count, 
                 y = reorder(content_rating, unprofitable_count)), 
             size = 3, color = '#FF9999') + 
  
  # Segments for rotten movies (right side)
  geom_segment(aes(x = 0, xend = rotten_count, 
                   y = reorder(content_rating, unprofitable_count), 
                   yend = reorder(content_rating, unprofitable_count)), 
               color = 'grey70') +
  geom_point(aes(x = rotten_count, 
                 y = reorder(content_rating, unprofitable_count)), 
             size = 3, color = '#CC3333') +  
  
  # Customize the theme
  theme_minimal() +
  labs(title = "Content Ratings in Poorly Performing Movies",
       subtitle = "Unprofitable (light red) vs Rotten (dark red)",
       x = "Count of Movies",
       y = NULL) +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm")
  ) +
  scale_x_continuous(
    labels = abs,
    breaks = seq(-500, 500, 50)  
  )

# Filter the dataset to include valid rows for analysis and create binary indicators
# 'highly_profitable' for movies classified as "Highly Profitable"
# 'certified_fresh' for movies classified as "Certified Fresh"
genre_data <- movies %>%
  filter(!is.na(primary_genre), !is.na(profitability_class), !is.na(tomatometer_status)) %>%
  mutate(
    highly_profitable = ifelse(profitability_class == "Highly Profitable", 1, 0),
    certified_fresh = ifelse(tomatometer_status == "Certified-Fresh", 1, 0)
  )

# Group the data by genre and calculate counts for:
# - Highly profitable movies
# - Certified Fresh movies
# - Movies that are both
# Additionally, calculate the total count of movies in each genre
genre_contingency <- genre_data %>%
  group_by(primary_genre) %>%
  summarise(
    highly_profitable_count = sum(highly_profitable),
    certified_fresh_count = sum(certified_fresh),
    both_count = sum(highly_profitable & certified_fresh),
    total_count = n()
  ) %>%
  arrange(desc(total_count)) 

# To view the resulting contingency table
#print(genre_contingency)

# Aggregate genres with fewer than 20 movies into an "Other" category
# Summarize the counts again to consolidate small categories
genre_contingency_clean <- genre_contingency %>%
  mutate(
    primary_genre = ifelse(total_count < 20, "Other", primary_genre)
  ) %>%
  group_by(primary_genre) %>%
  summarise(
    highly_profitable_count = sum(highly_profitable_count),
    certified_fresh_count = sum(certified_fresh_count),
    both_count = sum(both_count),
    total_count = sum(total_count)
  )

#  To view the cleaned contingency table with aggregated genres
#print(genre_contingency_clean)

# Create a matrix suitable for statistical testing
# Select relevant columns (counts for different categories) and pivot into a matrix
genre_matrix <- genre_contingency_clean %>%
  select(primary_genre, highly_profitable_count, certified_fresh_count, both_count) %>%
  pivot_longer(
    cols = -primary_genre,
    names_to = "category",
    values_to = "count"
  ) %>%
  pivot_wider(
    names_from = category,
    values_from = count,
    values_fill = 0  
  ) %>%
  column_to_rownames(var = "primary_genre")  # Make genres the row names

# Convert to a matrix for testing and print
genre_matrix <- as.matrix(genre_matrix)
#print(genre_matrix)

# Here we convert the matrix to numeric format for statistical testing
genre_matrix_numeric <- as.matrix(genre_matrix)

# Here we perform a chi-square test to evaluate whether the distribution
# of movie counts differs significantly across categories
genre_chisq <- chisq.test(genre_matrix_numeric)

# Print the test result
# print(genre_chisq)

# Extract standardized residuals from the chi-square test
# Residuals indicate which genres contribute most to any significant differences
residuals <- genre_chisq$stdres
#print(residuals)

# Step 7: Preparing our data for visualization
# we converted the matrix to a long-format data frame for visualization
# and added readable labels for the metric names
genre_df <- as.data.frame(genre_matrix)
genre_df$Genre <- rownames(genre_matrix)

genre_long <- genre_df %>%
  pivot_longer(cols = -Genre, names_to = "Metric", values_to = "Count") %>%
  mutate(Metric = recode(Metric,
                         "highly_profitable_count" = "Highly Profitable",
                         "certified_fresh_count" = "Certified Fresh",
                         "both_count" = "Both"))

# Step 8: We sorted the data by counts within facets for each metric in descending order
genre_long_sorted <- genre_long %>%
  group_by(Metric) %>%
  mutate(Genre = reorder_within(Genre, -Count, Metric)) %>%  
  ungroup()

# Step 9: we generated a faceted bar plot to compare genre distributions
# across "Highly Profitable," "Certified Fresh," and "Both" categories
genre_order <- genre_long %>%
  filter(Metric == "Both") %>%
  arrange(desc(Count)) %>%
  pull(Genre) %>% 
    rev()

# Update the plot using this fixed order
genre_faceted_sorted_plot <- genre_long %>%
  mutate(Genre = factor(Genre, levels = genre_order)) %>%
  ggplot(aes(y = Genre, x = Count, fill = Metric)) +
  geom_col(width = 0.7) +
  facet_grid(. ~ Metric, scales = "free_x") +
  labs(
    title = "Genre Distribution in Movies",
    subtitle = "Highly Profitable vs Certified Fresh vs Both Categories",
    x = "Count of Movies",
    y = NULL
  ) +
  scale_fill_manual(
    values = c(
      "Highly Profitable" = "#69b3e7",
      "Certified Fresh" = "#1e4b7a",
      "Both" = "#2E5B88"
    ),
    name = "Category"
  ) +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 11, color = "black"),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "none",
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray85", linewidth = 0.5),
    plot.margin = unit(c(1, 1, 1, 1), "cm"),
    panel.spacing = unit(2, "lines")
  )

print(genre_faceted_sorted_plot)