Research Question

How are demographic factors such as race, gender, and US state correlated with changes in US four-year program college completion rates between 1996 and 2014?

Introduction

This project aims to answer the following question: How are demographic factors such as race, gender, and US state correlated with changes in US four-year program college completion rates between 1996 and 2014? As current college students, we wanted to investigate factors that could contribute to an individual not graduating from college. There are a variety of reasons why someone may begin at a four-year institution and not graduate, and this report investigates race, gender, and the state as three of the possible factors. This project is also important for colleges and universities so in addition to observing completion rates, they can identify demographic factors that may affect the college completion rate and work to make higher education more equitable.

Data Description

https://nces.ed.gov/programs/digest/d21/tables/dt21_326.10.asp?current=yes#

The National Center for Education Statistics (NCES) is a branch of the US Department of Education. The institute collects, analyzes, and publicly provides published statistical data. This data is from the original source, which is the National Center for Education Statistics. Mostly the source collected the data through survey responses. This data set contains the percentages of completed 4-year postsecondary institutions students by sex, race, control of institution, and percentage of applications accepted Cohort entry years from 1996 through 2014.

https://data.ers.usda.gov/reports.aspx?ID=17829 This data source contains the average college dropout rates for undergraduates who did not complete their degree program for the year up until the year 2020. This data was pre-processed by calculating the average of the year 1996 to 2014, which were the years selected for this research problem. The original data was collected from surveys and entered into the U.S Census Bureau for 1970,1980,1990, 2000, and 2010. Some states are missing data for certain years, which means that they will not be observed. The data does not appear to be biased and is valid since it is from an official government data collection website.

https://educationdata.org/college-dropout-rates#:~:text=In%20the%20United%20States%2C%20the,college%20dropout%20rate%20at%2054%25

This data source contains the average college dropout rates for undergraduates who did not complete their degree program for the year up until the year 2022. This source was used to verify the dropout average that was calculated using source 1. This is the data from the original source made by Melanie Hanson. There was not any missing data from this set.

Data Processing

Figure 1 chart compares college completion rates by race. The original data also was divided by gender in addition to race, but gender was disregarded for this section. The cleaned data now has three variables: year, race, and percentage. The description of each of these variables can be found in the data dictionary in the appendix of the report.

In Figure 2, we are making a chart to compare the completion rates by Gender. To help us achieve that we cleaned the data to be more focused on the percentage, gender, and entry year. The code chunk can be seen below.

Figure 3 compares the average college dropout rates for undergraduates in each state. The data also lists the percentage of dropouts who are under 35. The variables were renamed to State, Total, and Under 35.

In Figure 4, the race data was split up by gender. Since this is an overall comparison graph, we chose to compare gender and race to see if one factor influenced the completion rates more than the other. In this case, two data frames were created, one with female and one with male; they were merged for the comparison graph.

clean <- races[1:15,]

races_clean <- clean %>% 
    clean_names() %>% 
    select(-(x7:x8)) %>% 
    select(-(x10:x11)) %>% 
    select(-(x2)) %>% 
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>% 
    rename(american_indian_alaska_native = x9) %>% 
    mutate(year = str_remove_all(year, "entry cohort")) %>% 
    mutate(year = as.double(year)) %>% 
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>% 
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native"))
savePath <- here('data_processed', "races_clean.csv")
write_csv(races_clean, savePath)

Race

Figure 1

animation_plot <- ggplot(races_clean) +
    geom_point(
        aes(
            x = year,
            y = percentage,
            color = race
        ), size = 2
    ) +
    geom_line(
        aes(
            x = year, 
            y = percentage,
            color = race
        )) +
    geom_text_repel(
        aes(x = year, y = percentage, color = race, label = race),
        hjust = 0.5, nudge_x = 2, direction = "y", size = 6, segment.color = "grey", na.rm = TRUE) +
    scale_color_manual(values = c("#619CFF", "#AB91C4", "#FF61C3", "#00BA38", "#FFAE90"
    )) +
    scale_x_continuous(
        limits = c(1996, 2012),
        breaks = c(1996, 2000, 2004, 2008, 2012)
        ) +
    scale_y_continuous(
        limits = c(10, 60),
        breaks = c(seq(10,60,5))) +
    theme_half_open() +
    labs(
        x = "Year",
        y = "Graduation Rate Percentage",
        title = "College Graduation Rate by Race/Ethnicity",
        caption = "Source: National Center for Education Statistics"
    ) +
    theme(legend.position="none") +
    theme(plot.title = element_text(hjust = 0.5))

animation <- animation_plot +
    transition_reveal(year)

animate(animation,
        end_pause = 15,
        duration = 10,
        width = 1100, height = 650, res = 150,
        renderer = magick_renderer())

The chart shows that there has been an overall increase in college graduation rates from 1996 to 2012 for all races. When looking specifically at each race, Asian/Pacific Islander and White students were both around 50% as of 2012, while American Indian/Alaska Native and Black students were both around 25%. Hispanic students were in the middle at around 37%.

# Data Export
pathToData <- here('data_raw', 'tabn326.10.xls')
races_genders <- read_excel(pathToData, sheet = "Digest 2021 Table 326.10", skip = 5)
# Males & Races
clean <- races[66:80,]

data_clean_m <- clean %>%
    clean_names() %>%
    select(-(x7:x8)) %>%
    select(-(x10:x11)) %>%
    select(-(x2)) %>%
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>%
    rename(american_indian_alaska_native = x9) %>%
    mutate(year = str_remove_all(year, "entry cohort")) %>%
    mutate(year = as.double(year)) %>%
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>%
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native")) %>%
    mutate(gender = "male") %>% 
    group_by(race) %>%
    summarise(average_percent = mean(percentage)) %>% 
    mutate(gender = "male")

savePath <- here('data_processed', "data_clean_m.csv")
write_csv(data_clean_m, savePath)
# Females & Races
clean <- races[131:145,]

data_clean_f <- clean %>%
    clean_names() %>%
    select(-(x7:x8)) %>%
    select(-(x10:x11)) %>%
    select(-(x2)) %>%
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>%
    rename(american_indian_alaska_native = x9) %>%
    mutate(year = str_remove_all(year, "entry cohort")) %>%
    mutate(year = as.double(year)) %>%
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>%
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native")) %>%
    mutate(gender = "female") %>% 
    group_by(race) %>%
    summarise(average_percent = mean(percentage)) %>% 
    mutate(gender = "female")

savePath <- here('data_processed', "data_clean_f.csv")
write_csv(data_clean_f, savePath)
# Gender
graduation_rate <- read_excel("data_raw/Graduation rate by race and sex 1996-2014.xls", 
    sheet = "Digest 2021 Table 326.10", skip = 1) %>%
    clean_names()

graduation_rate2 <- graduation_rate%>%
    select(time_to_completion_sex_control_of_institution_cohort_entry_year_and_percentage_of_applications_accepted:total) %>%
    filter(!total == "NA") 
    
### select certain rows 

graduation_rate3 <- 
    graduation_rate2 %>%
    slice(2:16) 

female <-c (37.98561, 40.14311, 40.83008, 40.98976, 42.17581, 42.22312, 43.06550,
43.54768, 43.98212, 44.22564, 45.05931, 45.91336, 48.42821, 49.95644, 51.29678)

graduation_rate3$female <- female
    
# Create New Gender Column 
graduation_rate4 <- graduation_rate3 %>%
    rename( "Year" = time_to_completion_sex_control_of_institution_cohort_entry_year_and_percentage_of_applications_accepted,
        "Male" = total,
        "Female" = female,
        )

view(graduation_rate4)

gender_final<- graduation_rate4 %>%
    mutate(Year = str_remove_all(Year, "entry cohort")) %>%
    pivot_longer(
        cols = Male:Female,
        names_to = "Gender",
        values_to = "Percentage"
    )

### Round the Percentages
gender_final2 <- gender_final %>%
    mutate(
    Percentage = round(Percentage, digits = 1)   
    )

Gender

Figure 2

ggplot(gender_final2, aes(y = Year, x = Percentage, label = Percentage, fill = 
    Gender, colour = Gender, label = Percentage)) +
    geom_segment(aes(x = 0, y = Year, xend = Percentage, yend = Year), color = "grey", size = 0.75) +
    geom_point(size = 7) +
    scale_color_manual(values = c('blue', 'red')) +
    geom_text(color = "white", size = 2) +
    theme_minimal_vgrid() +
    labs(
    x = 'Percentage',
    y = 'Year',
    title = 'Graduation Rate by Gender',
    caption = "Source: National Center for Education Statistics"
) 

From the above chart, we can conclude that females have the highest completion rates throughout the years 1996-2014. The highest percentage rate of completion was in 2014. The gap between the male and female completion rate every year is about 5%. This shows that women are outnumbering men when it comes to graduation from higher institutions. Every data exploratory research question will lead to a new research question. This will lead us to why there is a rise in women hi the question why there is higher education completion rate for women. This will be very difficult to answer without researching new data in-depth.

xlsxPath <- here('data_raw','college_dropout_state.xlsx')
dropout_one <- read_excel(xlsxPath, sheet = 'Sheet1')

US State

Figure 3

dropout_2 <- janitor::clean_names(dropout_one)

dropout_3 <- dropout_2 %>%
    mutate('Total Dropouts' =
               as.numeric('Total Dropout'),
           'Under 35' = as.numeric('Under 35'),
           total_dropouts= total_dropouts ,
           under_35 = percent(under_35, accuracy = 1))

library(plotly)

dropout_3$hover <- with(dropout_3, paste(state, '<br>', "Number of Dropouts:",
                                        total_dropouts, ", Under 35: ", under_35,"<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
    scope = 'usa',
    projection = list(type = 'albers usa'),
    showlakes = TRUE,
    lakecolor = toRGB('white')
)

fig <- plot_geo(dropout_3, locationmode = 'USA-states')
fig <- fig %>% add_trace(
    z = ~total_dropouts, text = ~hover, locations = dropout_3$state,
    color = ~total_dropouts, colors = 'Purples'
)
fig <- fig %>% colorbar(title = "Total Amount of Dropouts")
fig <- fig %>% layout(
    title = 'Average Number of College Dropouts for Each State from 1990 -2014<br>(Hover for breakdown)',
    geo = g
)

fig
write_csv(dropout_3,"State_Clean_data.csv")

Source: USDA Economic Research Service

First, we will examine a US map displaying the average college dropout rates from 1996 to 2012 to examine the correlations between the number of dropouts and state. For this graph, the average values from 1996 -2014 were displayed instead of each individual year because there was low variation between the selected years. By hovering over the different states in the plot above, one can see the number of dropouts from each state and the percentage of dropouts that are under the age of 35. States that are more purple have higher dropout rates while lighter states have lower rates. The desired scenario for a state is a low dropout rate, which would be the lighter values on this map.

By examining the graph, it can be concluded that California has the highest rate of college dropouts compared to any other state. College dropouts under the age of 35 are more likely to live in Delaware compared to other states since 53.7% of the dropouts were under the age of 35, which is the highest percentage of any other state. As a result of the findings, we can conclude that college dropout rates are higher in more populated states compared to less populated states. States such as California, Texas, New York, Illinois, and Florida had the highest dropout rates, with 39.4% of all dropouts living in these states. Additionally, each of those states is ranked among the top 10 most populated.

Factor Comparisons: Gender & Race

Figure 4

genders_data <- rbind(data_clean_m, data_clean_f)

m_f_data <- genders_data %>%
    ggplot() +
    geom_col(
        aes(x = average_percent,
            y = reorder(race, average_percent),
            fill = as.factor(gender)),
        position = "dodge") +
    theme_minimal_vgrid() +
    labs(
        title = "College Graduation Rate by Race/Ethnicity and Gender",
        x = "Graduation Percentage",
        y = "Race/Ethnicity",
        fill = "Gender",
        caption = "Source: National Center for Education Statistics"
    ) +
    theme(plot.title = element_text(hjust = 0.5))

m_f_data

Figure 4 serves as a comparison for two of the three factors our group analyzed in this report. In the end, we learned that female students had a higher graduation rate across all of the races (Black, American Indian/Alaska Native, Hispanic, White, and Asian/Pacific Islander). Additionally, there is a very big gap in the graduation rate between Asian/Pacific Islander and Black students.

Conclusion

After examining the data sets we found from different sources, we found that factors such as; gender, state, and race play a big role in completion percentages. We found Race to be the biggest factor. The results from “College graduation rate by race/ethnicity” shows that White students have a 50% graduation rate while Black, Hispanic, and American-Indian students have less than a 50% graduation rate. This suggests that minority students have a lower graduation rate. These results generate another question to look into different factors such as economic status. Gender was the other factor we explored. The results from the graph of the gender factor show that women have a higher college graduation rate than men. The last factor we looked into was states. The conclusion we made after looking at the state factor is that California has the biggest dropout rate which suggests that it has a lower graduation rate than all the states. California is also one of the states among the most populated states in the US.This led us to draw another conclusion which is that college dropout rates are higher in more populated states compared to less populated states. Overall, we found all the factors correlate strongly with the graduation rates in the US between the years 1996 and 2014. Even though we conclude race to be the biggest factor, states and gender also have a correlation with graduation rates. Overall, Asian/Pacific Islander female students have the highest overall college graduation rates and Black male students have the lowest overall graduation rates.

Appendix

Data Dictionary

data_dictionary_path <- here("data_processed", "data_d.xlsx")

data_dictionary <- read_excel(here::here("data_processed", "data_d.xlsx"))

kable(data_dictionary)
Variable  Description
Year Year of a 4-year institution completion 
Male The percentage of male completion rate
Female The percentage of female completion rate
Race White, Black, Hispanic, Asian/Pacific Islander, American Indian/Alaska Native
Percentage College Completion Rate
State  One of the United States of America
Total Dropouts The average number of dropouts from year state from the years 1996-2014
Under 35 The percentage of college dropouts from each state that are under the age of 35 
State  One of the United States of America
# Load libraries and settings here
library(tidyverse)
library(here)
library(readxl)
library(janitor)
library(ggplot2)
library(cowplot)
library(gganimate)
library(viridis)
library(ggrepel)
library(magick)
library(stringr)
library(lubridate)
library(dplyr)
#install.packages("jpeg")
library(jpeg)
library(readr)
#install.packages("leaflet")
#install.packages("maps")
library(maps)
library(leaflet)
library(scales)
#install.packages("plotly")
library(plotly)
library(knitr)

knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    comment = "#>",
    fig.path = "figs/", # Folder where rendered plots are saved
    fig.width = 7.252, # Default plot width
    fig.height = 4, # Default plot height
    fig.retina = 3 # For better plot resolution
)

# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_bw(base_size = 20))

# Write code below here to load any data used in project
pathToData <- here('data_raw', 'tabn326.10.xls')
races <- read_excel(pathToData, sheet = "Digest 2021 Table 326.10", skip = 5)

clean <- races[1:15,]

races_clean <- clean %>% 
    clean_names() %>% 
    select(-(x7:x8)) %>% 
    select(-(x10:x11)) %>% 
    select(-(x2)) %>% 
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>% 
    rename(american_indian_alaska_native = x9) %>% 
    mutate(year = str_remove_all(year, "entry cohort")) %>% 
    mutate(year = as.double(year)) %>% 
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>% 
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native"))

savePath <- here('data_processed', "races_clean.csv")
write_csv(races_clean, savePath)

animation_plot <- ggplot(races_clean) +
    geom_point(
        aes(
            x = year,
            y = percentage,
            color = race
        ), size = 2
    ) +
    geom_line(
        aes(
            x = year, 
            y = percentage,
            color = race
        )) +
    geom_text_repel(
        aes(x = year, y = percentage, color = race, label = race),
        hjust = 0.5, nudge_x = 2, direction = "y", size = 6, segment.color = "grey", na.rm = TRUE) +
    scale_color_manual(values = c("#619CFF", "#AB91C4", "#FF61C3", "#00BA38", "#FFAE90"
    )) +
    scale_x_continuous(
        limits = c(1996, 2012),
        breaks = c(1996, 2000, 2004, 2008, 2012)
        ) +
    scale_y_continuous(
        limits = c(10, 60),
        breaks = c(seq(10,60,5))) +
    theme_half_open() +
    labs(
        x = "Year",
        y = "Graduation Rate Percentage",
        title = "College Graduation Rate by Race/Ethnicity",
        caption = "Source: National Center for Education Statistics"
    ) +
    theme(legend.position="none") +
    theme(plot.title = element_text(hjust = 0.5))

animation <- animation_plot +
    transition_reveal(year)

animate(animation,
        end_pause = 15,
        duration = 10,
        width = 1100, height = 650, res = 150,
        renderer = magick_renderer())


# Data Export
pathToData <- here('data_raw', 'tabn326.10.xls')
races_genders <- read_excel(pathToData, sheet = "Digest 2021 Table 326.10", skip = 5)
# Males & Races
clean <- races[66:80,]

data_clean_m <- clean %>%
    clean_names() %>%
    select(-(x7:x8)) %>%
    select(-(x10:x11)) %>%
    select(-(x2)) %>%
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>%
    rename(american_indian_alaska_native = x9) %>%
    mutate(year = str_remove_all(year, "entry cohort")) %>%
    mutate(year = as.double(year)) %>%
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>%
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native")) %>%
    mutate(gender = "male") %>% 
    group_by(race) %>%
    summarise(average_percent = mean(percentage)) %>% 
    mutate(gender = "male")

savePath <- here('data_processed', "data_clean_m.csv")
write_csv(data_clean_m, savePath)
# Females & Races
clean <- races[131:145,]

data_clean_f <- clean %>%
    clean_names() %>%
    select(-(x7:x8)) %>%
    select(-(x10:x11)) %>%
    select(-(x2)) %>%
    rename(year = all_4_year_institutions) %>%
    rename(white = x3) %>%
    rename(black = x4) %>%
    rename(hispanic = x5) %>%
    rename(asian_pacific_islander = x6) %>%
    rename(american_indian_alaska_native = x9) %>%
    mutate(year = str_remove_all(year, "entry cohort")) %>%
    mutate(year = as.double(year)) %>%
    pivot_longer(
        cols = white:american_indian_alaska_native,
        names_to = "race",
        values_to = "percentage"
    ) %>%
    mutate(race = fct_recode(race,
                             "White" = "white",
                             "Black" = "black",
                             "Hispanic" = "hispanic",
                             "Asian/Pacific Islander" = "asian_pacific_islander",
                             "American Indian/Alaska Native" = "american_indian_alaska_native")) %>%
    mutate(gender = "female") %>% 
    group_by(race) %>%
    summarise(average_percent = mean(percentage)) %>% 
    mutate(gender = "female")

savePath <- here('data_processed', "data_clean_f.csv")
write_csv(data_clean_f, savePath)
# Gender
graduation_rate <- read_excel("data_raw/Graduation rate by race and sex 1996-2014.xls", 
    sheet = "Digest 2021 Table 326.10", skip = 1) %>%
    clean_names()

graduation_rate2 <- graduation_rate%>%
    select(time_to_completion_sex_control_of_institution_cohort_entry_year_and_percentage_of_applications_accepted:total) %>%
    filter(!total == "NA") 
    
### select certain rows 

graduation_rate3 <- 
    graduation_rate2 %>%
    slice(2:16) 

female <-c (37.98561, 40.14311, 40.83008, 40.98976, 42.17581, 42.22312, 43.06550,
43.54768, 43.98212, 44.22564, 45.05931, 45.91336, 48.42821, 49.95644, 51.29678)

graduation_rate3$female <- female
    
# Create New Gender Column 
graduation_rate4 <- graduation_rate3 %>%
    rename( "Year" = time_to_completion_sex_control_of_institution_cohort_entry_year_and_percentage_of_applications_accepted,
        "Male" = total,
        "Female" = female,
        )

view(graduation_rate4)

gender_final<- graduation_rate4 %>%
    mutate(Year = str_remove_all(Year, "entry cohort")) %>%
    pivot_longer(
        cols = Male:Female,
        names_to = "Gender",
        values_to = "Percentage"
    )

### Round the Percentages
gender_final2 <- gender_final %>%
    mutate(
    Percentage = round(Percentage, digits = 1)   
    )

ggplot(gender_final2, aes(y = Year, x = Percentage, label = Percentage, fill = 
    Gender, colour = Gender, label = Percentage)) +
    geom_segment(aes(x = 0, y = Year, xend = Percentage, yend = Year), color = "grey", size = 0.75) +
    geom_point(size = 7) +
    scale_color_manual(values = c('blue', 'red')) +
    geom_text(color = "white", size = 2) +
    theme_minimal_vgrid() +
    labs(
    x = 'Percentage',
    y = 'Year',
    title = 'Graduation Rate by Gender',
    caption = "Source: National Center for Education Statistics"
) 

xlsxPath <- here('data_raw','college_dropout_state.xlsx')
dropout_one <- read_excel(xlsxPath, sheet = 'Sheet1')


dropout_2 <- janitor::clean_names(dropout_one)

dropout_3 <- dropout_2 %>%
    mutate('Total Dropouts' =
               as.numeric('Total Dropout'),
           'Under 35' = as.numeric('Under 35'),
           total_dropouts= total_dropouts ,
           under_35 = percent(under_35, accuracy = 1))

library(plotly)

dropout_3$hover <- with(dropout_3, paste(state, '<br>', "Number of Dropouts:",
                                        total_dropouts, ", Under 35: ", under_35,"<br>"))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
    scope = 'usa',
    projection = list(type = 'albers usa'),
    showlakes = TRUE,
    lakecolor = toRGB('white')
)

fig <- plot_geo(dropout_3, locationmode = 'USA-states')
fig <- fig %>% add_trace(
    z = ~total_dropouts, text = ~hover, locations = dropout_3$state,
    color = ~total_dropouts, colors = 'Purples'
)
fig <- fig %>% colorbar(title = "Total Amount of Dropouts")
fig <- fig %>% layout(
    title = 'Average Number of College Dropouts for Each State from 1990 -2014<br>(Hover for breakdown)',
    geo = g
)

fig

write_csv(dropout_3,"State_Clean_data.csv")
genders_data <- rbind(data_clean_m, data_clean_f)

m_f_data <- genders_data %>%
    ggplot() +
    geom_col(
        aes(x = average_percent,
            y = reorder(race, average_percent),
            fill = as.factor(gender)),
        position = "dodge") +
    theme_minimal_vgrid() +
    labs(
        title = "College Graduation Rate by Race/Ethnicity and Gender",
        x = "Graduation Percentage",
        y = "Race/Ethnicity",
        fill = "Gender",
        caption = "Source: National Center for Education Statistics"
    ) +
    theme(plot.title = element_text(hjust = 0.5))

m_f_data


data_dictionary_path <- here("data_processed", "data_d.xlsx")

data_dictionary <- read_excel(here::here("data_processed", "data_d.xlsx"))

kable(data_dictionary)