class: middle, inverse .leftcol30[ <center> <img src="https://raw.githubusercontent.com/emse-eda-gwu/2022-Fall/master/images/logo.png" width=250> </center> ] .rightcol70[ # Week 8: .fancy[Comparisons] ###
EMSE 4572: Exploratory Data Analysis ###
John Paul Helveston ###
October 19, 2022 ] --- class: center, middle, inverse # .fancy[.blue[Tip of the week]] # Shortcut keys --- class: middle, inverse ## 1) Quick shortcuts .leftcol[ Insert a `<-` operator: - **Windows**: `ALT` + `-` - **Mac**: `OPTION` + `-` ] -- .rightcol[ Insert a `%>%` operator: - **Windows**: `CTRL` + `SHIFT` + `M` - **Mac**: `COMMAND` + `SHIFT` + `M` ] --- class: middle, inverse ## 2) Edit multiple lines of code at once 1. Press and hold `ALT` (Windows) or `OPTION` (Mac) 2. Select multiple lines of code https://twitter.com/i/status/995394452821721088 --- class: middle, inverse .rightcol80[ ## "At the heart of quantitative reasoning is a single question: Compared to what?" ## -- Edward Tufte ] --- ## Today's data ```r college_all_ages <- read_csv(here('data', 'college_all_ages.csv')) gapminder <- read_csv(here('data', 'gapminder.csv')) marathon <- read_csv(here('data', 'marathon.csv')) milk_production <- read_csv(here('data', 'milk_production.csv')) *internet_regions <- read_csv(here('data', 'internet_users_region.csv')) ``` ## New packages ```r install.packages("ggrepel") install.packages("ggridges") ``` --- class: inverse, middle # Week 8: .fancy[Comparisons] ## 1. Comparing to a reference ## 2. Comparing variables ## BREAK ## 3. Comparing distributions --- class: inverse, middle # Week 8: .fancy[Comparisons] ## 1. .orange[Comparing to a reference] ## 2. Comparing variables ## BREAK ## 3. 
Comparing distributions <!-- Comparing things to a reference line: - Add a simple line - diverging bars / lollipops, - In any of these plots, adding a benchmark can be really useful - Another way is to compare things to a **computed** benchmark, like the mean - diverging bars / lollipops --> --- class: middle ## .center[For this section, we'll be using this data frame:] .code100[ ```r gapminder_americas <- gapminder %>% filter(continent == "Americas", year == 2007) %>% mutate(country = fct_reorder(country, lifeExp)) ``` ] --- class: center ## Use reference lines to add context to chart .leftcol[ <img src="figs/life-exp-dots-1.png" width="432" style="display: block; margin: auto;" /> ] -- .rightcol[ <img src="figs/life-exp-dots-mean-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: center ## Or make zero the reference line .leftcol[ <img src="figs/life-exp-dots-diverging-1.png" width="432" style="display: block; margin: auto;" /> ] .rightcol[ <img src="figs/life-exp-bars-diverging-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## How to add a reference line .leftcol60[.code70[ Add horizontal line with `geom_hline()` Add vertical line with `geom_vline()` ```r ggplot(gapminder_americas) + geom_point( aes(x = lifeExp, y = country), color = 'steelblue', size = 2.5) + * geom_vline( * xintercept = mean(gapminder_americas$lifeExp), * color = 'red', linetype = 'dashed') + theme_minimal_vgrid() + labs(x = 'Life expectancy (years)', y = 'Country') ``` ]] .rightcol40[ <img src="figs/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## How to add a reference line .leftcol60[.code70[ Add text with `annotate()` ```r ggplot(gapminder_americas) + geom_point( aes(x = lifeExp, y = country), color = 'steelblue', size = 2.5) + geom_vline( xintercept = mean(gapminder_americas$lifeExp), color = 'red', linetype = 'dashed') + annotate( * 'text', x = 73.2, y = 'Puerto Rico', * color = 'red', hjust = 1, * label = 
'Mean\nLife\nExpectancy') + theme_minimal_vgrid() + labs(x = 'Life expectancy (years)', y = 'Country') ``` ]] .rightcol40[.center[ <img src="figs/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make zero the reference point .leftcol60[.code70[ ```r gapminder_diverging <- gapminder_americas %>% mutate( # Subtract the mean * lifeExp = lifeExp - mean(lifeExp), # Define the fill color * color = ifelse(lifeExp > 0, 'Above', 'Below')) ``` ```r ggplot(gapminder_diverging) + geom_col( * aes(x = lifeExp, y = country, fill = color), width = 0.7, alpha = 0.8) + scale_fill_manual( * values = c('steelblue', 'red')) + theme_minimal_vgrid() + theme(legend.position = 'none') + labs( x = 'Difference from mean life expectancy (years)', y = 'Country') ``` ]] .rightcol40[ <img src="figs/unnamed-chunk-12-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: inverse
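## Sketch: zero as the reference, with built-in data

A minimal sketch of the same recipe that runs without the course CSV files. It assumes the built-in `mtcars` data as a stand-in for `gapminder_americas`; the names `cars_diverging` and `diverging_plot` are made up for this example, and `theme_minimal()` replaces the cowplot theme so that only **dplyr** and **ggplot2** are needed.

```r
library(dplyr)
library(ggplot2)

# Assumption: mtcars (built into R) stands in for gapminder_americas
cars_diverging <- mtcars %>%
  mutate(
    model = rownames(mtcars),
    # Subtract the mean so that zero becomes the reference
    mpg_diff = mpg - mean(mpg),
    # Define the fill color
    direction = ifelse(mpg_diff > 0, 'Above', 'Below'),
    # Sort the bars by their difference from the mean
    model = reorder(model, mpg_diff))

diverging_plot <- ggplot(cars_diverging) +
  geom_col(
    aes(x = mpg_diff, y = model, fill = direction),
    width = 0.7, alpha = 0.8) +
  # Named values tie each color to its label
  scale_fill_manual(
    values = c(Above = 'steelblue', Below = 'red')) +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(
    x = 'Difference from mean fuel economy (mpg)',
    y = 'Car model')

diverging_plot
```

Naming the entries in `values` means the color mapping does not depend on the alphabetical order of the levels.

---
class: inverse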
20:00
### Your turn - comparing to a reference Use the `milk_production.csv` data to create the following charts<br>showing differences from the mean state milk production in 2017. .leftcol[ <img src="figs/milk-lollipop-mean-1.png" width="60%" style="display: block; margin: auto;" /> ] .rightcol[ <img src="figs/milk-bars-diverging-1.png" width="60%" style="display: block; margin: auto;" /> ] --- class: inverse, middle # Week 8: .fancy[Comparisons] ## 1. Comparing to a reference ## 2. .orange[Comparing variables] ## BREAK ## 3. Comparing distributions <!-- Comparing categories with facets Comparing two things (dodged bars, slope chart, dumbbell chart) - dodged comparisons are fine, but really no more than 2 things. - Finally, overlapping bars are great when you want to show when something exceeds a threshold. E.g. going over your budget. - Using facets to break up 3-4 groups of 2 is okay. - A better approach for multiple categories: - slope charts - dumbbell charts --> --- class: center, middle ## Neither of these charts is great .leftcol[ <img src="figs/diamonds_bars_stacked-1.png" width="522.144" /> ] .rightcol[ <img src="figs/diamonds_bars_dodged-1.png" width="522.144" /> ] --- class: center ## "Parallel Coordinates" plots work well .leftcol[.left[ ```r diamonds %>% count(clarity, cut) %>% ggplot( aes(x = clarity, y = n, * color = cut, group = cut)) + geom_line() + geom_point() + scale_y_continuous(limits = c(0, 5100)) + theme_half_open(font_size = 18) + labs(y = "Count") ``` ]] .rightcol[ <img src="figs/unnamed-chunk-15-1.png" width="504" /> ] --- ## Consider facets for **comparing across categories** .leftcol60[ ```r diamonds %>% count(clarity, cut) %>% ggplot() + geom_col(aes(x = clarity, y = n), width = 0.7) + * facet_wrap(vars(cut), nrow = 1) + scale_y_continuous( expand = expansion(mult = c(0, 0.05))) + theme_minimal_hgrid(font_size = 16) ``` ] <img src="figs/diamonds-facet-2-1.png" width="1296" /> --- .leftcol[.code70[ ## Consider facets for **comparing across 
categories** ```r diamonds %>% count(clarity, cut) %>% mutate(n = n / 1000) %>% ggplot() + geom_col(aes(x = clarity, y = n), width = 0.7) + * facet_wrap(vars(cut), ncol = 2) + * coord_flip() + scale_y_continuous( expand = expansion(mult = c(0, 0.05))) + * theme_minimal_vgrid(font_size = 16) + labs(y = "Count (thousands)") ``` ]] .rightcol[ <img src="figs/unnamed-chunk-16-1.png" width="576" style="display: block; margin: auto;" /> ] --- background-image: url("images/ft-coronavirus.jpg") background-size: contain .right[<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> From [Financial Times](https://www.ft.com/coronavirus-latest)] --- class: inverse, center ## When comparing across multiple categories, consider: .leftcol[ ## Parallel coordinates charts <img src="figs/unnamed-chunk-17-1.png" width="504" /> ] .rightcol[ ## Faceting <img src="figs/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" /> ] --- ## .center[When comparing **only 2** things,<br>dodged bars are a good starting point] .leftcol[.code70[ ```r milk_compare <- milk_production %>% filter(year %in% c(1970, 2017)) %>% mutate(state = fct_other(state, keep = c('California', 'Wisconsin'))) %>% group_by(year, state) %>% summarise( milk_produced = sum(milk_produced) / 10^9) ``` ``` #> # A tibble: 6 × 3 #> # Groups: year [2] #> year state milk_produced #> <dbl> <fct> <dbl> #> 1 1970 California 9.46 #> 2 1970 Wisconsin 18.4 #> 3 1970 Other 89.1 #> 4 2017 California 39.8 #> 5 2017 Wisconsin 30.3 #> 6 2017 Other 145. 
``` ]] .rightcol[ <img src="figs/unnamed-chunk-20-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## .center[When comparing **only 2** things,<br>dodged bars are a good starting point] .leftcol[.code70[ ```r ggplot(milk_compare) + geom_col( aes(x = milk_produced, y = state, fill = as.factor(year)), width = 0.7, alpha = 0.8, * position = 'dodge') + scale_fill_manual( values = c('grey', 'steelblue'), guide = guide_legend(reverse = TRUE)) + scale_x_continuous( expand = expansion(mult = c(0, 0.05))) + theme_minimal_vgrid() + labs( x = 'Milk produced (billion lbs)', y = NULL, fill = 'Year') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-22-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: center ## Avoid putting >2 categories in legend (if possible) .leftcol[ <center> <img src="images/check-bad.png" width=75> </center> <img src="figs/milk-compare-dodged-bad-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ <center> <img src="images/check-good.png" width=100> </center> <img src="figs/unnamed-chunk-23-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## .center[Or use facets to get rid of the legend!] 
.leftcol[ <center> <img src="images/check-bad.png" width=75> </center> <img src="figs/unnamed-chunk-24-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ <center> <img src="images/check-good.png" width=100> </center> <img src="figs/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: center ## "Bullet" charts are also effective for comparing **2** things <img src="figs/milk-compare-bullet-1.png" width="576" style="display: block; margin: auto;" /> --- ## How to make a "bullet" chart .leftcol[.code70[ ```r milk_compare %>% * pivot_wider( * names_from = year, * values_from = milk_produced) %>% ggplot() + geom_col( aes(x = `1970`, y = state, fill = '1970'), * width = 0.7) + geom_col( aes(x = `2017`, y = state, fill = '2017'), * width = 0.3) + scale_fill_manual( values = c('grey', 'black')) + scale_x_continuous( expand = expansion(mult = c(0, 0.05))) + theme_minimal_vgrid(font_size = 18) + labs( x = 'Milk produced (billion lbs)', y = NULL, fill = "Year") ``` ]] .rightcol[.center[ <img src="figs/unnamed-chunk-27-1.png" width="576" style="display: block; margin: auto;" /> ]] --- class: center ## With **more than 2** things, dodged bars can get confusing Still comparing 2 time periods, but across **10** categories <img src="figs/milk-compare-dodged-toomany-1.png" width="576" style="display: block; margin: auto;" /> --- class: center ### Strategies for comparing 2 things across **more than 2 categories** .leftcol[ **Dodged bars 😢** <img src="figs/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ **Dumbbell chart 😄** <img src="figs/milk-dumbbell-chart-1.png" width="396" style="display: block; margin: auto;" /> ] --- class: center ### Strategies for comparing 2 things across **more than 2 categories** .leftcol[ **Dodged bars 😭** <img src="figs/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ **Slope chart 😄** <img 
src="figs/milk-slope-chart-1.png" width="432" style="display: block; margin: auto;" /> ] --- .leftcol[ Dumbbell charts highlight: - **Magnitudes** across two periods / groups <img src="figs/unnamed-chunk-30-1.png" width="432" style="display: block; margin: auto;" /> ] -- .rightcol[ Slope charts highlight: - _Change_ in **rankings** - Individual categories <img src="figs/unnamed-chunk-31-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## How to make a **Dumbbell chart** .leftcol[.code70[ Create data frame for plotting ```r top10states <- milk_production %>% filter(year == 2017) %>% arrange(desc(milk_produced)) %>% slice(1:10) milk_summary_dumbbell <- milk_production %>% filter( year %in% c(1970, 2017), state %in% top10states$state) %>% mutate( # Reorder state variables state = fct_reorder2(state, year, desc(milk_produced)), # Convert year to discrete variable year = as.factor(year), # Modify units milk_produced = milk_produced / 10^9) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-33-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make a **Dumbbell chart** .leftcol[.code70[ Make lines (note the `group` variable) ```r ggplot(milk_summary_dumbbell, * aes(x = milk_produced, y = state)) + * geom_line(aes(group = state), color = 'lightblue', size = 1) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-34-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make a **Dumbbell chart** .leftcol[.code70[ Add points (note the `color` variable) ```r ggplot(milk_summary_dumbbell, aes(x = milk_produced, y = state)) + geom_line(aes(group = state), color = 'lightblue', size = 1) + * geom_point(aes(color = year), size = 2.5) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-35-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make a **Dumbbell chart** .leftcol[.code70[ Change the colors: ```r ggplot(milk_summary_dumbbell, aes(x = milk_produced, y = 
state)) + geom_line(aes(group = state), color = 'lightblue', size = 1) + geom_point(aes(color = year), size = 2.5) + * scale_color_manual( * values = c('lightblue', 'steelblue')) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-36-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make a **Dumbbell chart** .leftcol[.code70[ Adjust the theme and annotate ```r ggplot(milk_summary_dumbbell, aes(x = milk_produced, y = state)) + geom_line(aes(group = state), color = 'lightblue', size = 1) + geom_point(aes(color = year), size = 2.5) + scale_color_manual( values = c('lightblue', 'steelblue')) + * theme_minimal_vgrid() + # Remove y axis line and tick marks theme( axis.line.y = element_blank(), * axis.ticks.y = element_blank()) + * labs(x = 'Milk produced (billion lbs)', * y = 'State', * color = 'Year', * title = 'Top 10 milk producing states', * subtitle = '(1970 - 2017)') ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-38-1.png" width="432" style="display: block; margin: auto;" /> ]] --- .leftcol[.code60[ Create data frame for plotting ```r top10states <- milk_production %>% filter(year == 2017) %>% arrange(desc(milk_produced)) %>% slice(1:10) milk_summary_slope <- milk_production %>% filter( year %in% c(1970, 2017), state %in% top10states$state) %>% mutate( # Reorder state variables state = fct_reorder2(state, year, desc(milk_produced)), # Convert year to discrete variable year = as.factor(year), # Modify units milk_produced = milk_produced / 10^9, * # Define line color * lineColor = if_else( * state == 'California', 'CA', 'other'), * # Make labels * label = paste(state, ' (', * round(milk_produced), ')'), * label_left = ifelse(year == 1970, label, NA), * label_right = ifelse(year == 2017, label, NA)) ``` ]] .rightcol[.code50[ ## .center[How to make a<br>**Slope chart**] <img src="figs/unnamed-chunk-40-1.png" width="432" style="display: block; margin: auto;" /> ]] --- .leftcol[.code60[ Create data frame for plotting ```r top10states 
<- milk_production %>% filter(year == 2017) %>% arrange(desc(milk_produced)) %>% slice(1:10) milk_summary_slope <- milk_production %>% filter( year %in% c(1970, 2017), state %in% top10states$state) %>% mutate( # Reorder state variables state = fct_reorder2(state, year, desc(milk_produced)), # Convert year to discrete variable year = as.factor(year), # Modify units milk_produced = milk_produced / 10^9, * # Define line color * lineColor = if_else( * state == 'California', 'CA', 'other'), * # Make labels * label = paste(state, ' (', * round(milk_produced), ')'), * label_left = ifelse(year == 1970, label, NA), * label_right = ifelse(year == 2017, label, NA)) ``` ]] .rightcol[.code50[ ## .center[How to make a<br>**Slope chart**] ```r milk_summary_slope %>% select(state, year, milk_produced, label, lineColor) ``` ``` #> # A tibble: 20 × 5 #> state year milk_produced label lineColor #> <fct> <fct> <dbl> <chr> <chr> #> 1 New York 1970 10.3 New York ( 10 ) other #> 2 Pennsylvania 1970 7.12 Pennsylvania ( 7 ) other #> 3 Michigan 1970 4.60 Michigan ( 5 ) other #> 4 Wisconsin 1970 18.4 Wisconsin ( 18 ) other #> 5 Minnesota 1970 9.64 Minnesota ( 10 ) other #> 6 Texas 1970 3.06 Texas ( 3 ) other #> 7 Idaho 1970 1.49 Idaho ( 1 ) other #> 8 New Mexico 1970 0.304 New Mexico ( 0 ) other #> 9 Washington 1970 2.09 Washington ( 2 ) other #> 10 California 1970 9.46 California ( 9 ) CA #> 11 New York 2017 14.9 New York ( 15 ) other #> 12 Pennsylvania 2017 10.9 Pennsylvania ( 11 ) other #> 13 Michigan 2017 11.2 Michigan ( 11 ) other #> 14 Wisconsin 2017 30.3 Wisconsin ( 30 ) other #> 15 Minnesota 2017 9.86 Minnesota ( 10 ) other #> 16 Texas 2017 12.1 Texas ( 12 ) other #> 17 Idaho 2017 14.6 Idaho ( 15 ) other #> 18 New Mexico 2017 8.21 New Mexico ( 8 ) other #> 19 Washington 2017 6.53 Washington ( 7 ) other #> 20 California 2017 39.8 California ( 40 ) CA ``` ]] --- ## How to make a **Slope chart** .leftcol[.code70[ Start with a line plot - note the `group` variable: ```r 
ggplot(milk_summary_slope, * aes(x = year, y = milk_produced, * group = state)) + * geom_line(aes(color = lineColor)) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-43-1.png" width="432" style="display: block; margin: auto;" /> ]] --- ## How to make a **Slope chart** .leftcol[.code70[ Add labels: ```r ggplot(milk_summary_slope, aes(x = year, y = milk_produced, group = state)) + geom_line(aes(color = lineColor)) + # Add 1970 labels (left side) * geom_text(aes(label = label_left), * hjust = 1, nudge_x = -0.05) + # Add 2017 labels (right side) * geom_text(aes(label = label_right), * hjust = 0, nudge_x = 0.05) ``` Justification | `hjust` --------------|------- Left | 0 Center | 0.5 Right | 1 ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-44-1.png" width="432" style="display: block; margin: auto;" /> ]] --- class: center, middle ## Overlapping labels? .leftcol30[.border[ <center> <img src="images/overlapping_labels.png"> </center> ]] --- class: center, middle ## Overlapping labels?<br>**ggrepel** library to the rescue! 
.leftcol30[.border[ <center> <img src="images/overlapping_labels.png"> </center> ]] .rightcol70[ <center> <img src="images/horst_monsters_ggrepel.jpg" width=600> </center> ] .left[Artwork by [@allison_horst](https://twitter.com/allison_horst)] --- ## How to make a **Slope chart** .leftcol[.code70[ Align labels so they don't overlap: ```r *library(ggrepel) ggplot(milk_summary_slope, aes(x = year, y = milk_produced, group = state)) + geom_line(aes(color = lineColor)) + # Add 1970 labels (left side) * geom_text_repel( aes(label = label_left), hjust = 1, nudge_x = -0.05, * direction = 'y', segment.color = 'grey') + # Add 2017 labels (right side) * geom_text_repel( aes(label = label_right), hjust = 0, nudge_x = 0.05, * direction = 'y', segment.color = 'grey') ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-45-1.png" width="504" style="display: block; margin: auto;" /> ]] --- ## How to make a **Slope chart** .leftcol[.code70[ Adjust colors: ```r ggplot(milk_summary_slope, aes(x = year, y = milk_produced, group = state)) + geom_line(aes(color = lineColor)) + geom_text_repel( aes(label = label_left), hjust = 1, nudge_x = -0.05, direction = 'y', segment.color = 'grey') + geom_text_repel( aes(label = label_right), hjust = 0, nudge_x = 0.05, direction = 'y', segment.color = 'grey') + # Move year labels to top, modify line colors * scale_x_discrete(position = 'top') + * scale_color_manual(values = c('red', 'black')) ``` ]] .rightcol[.code70[ <img src="figs/unnamed-chunk-46-1.png" width="504" style="display: block; margin: auto;" /> ]] --- .leftcol[.code60[ Adjust the theme and annotate ```r ggplot(milk_summary_slope, aes(x = year, y = milk_produced, group = state)) + geom_line(aes(color = lineColor)) + # Add 1970 labels (left side) geom_text_repel( aes(label = label_left), hjust = 1, nudge_x = -0.05, direction = 'y', segment.color = 'grey') + # Add 2017 labels (right side) geom_text_repel(aes(label = label_right), hjust = 0, nudge_x = 0.05, direction = 'y', 
segment.color = 'grey') + # Move year labels to top, modify line colors scale_x_discrete(position = 'top') + scale_color_manual(values = c('red', 'black')) + # Annotate & adjust theme labs(x = NULL, y = 'Milk produced (billion lbs)', title = 'Top 10 milk producing states (1970 - 2017)') + * theme_minimal_grid() + * theme(panel.grid = element_blank(), * axis.text.y = element_blank(), * axis.ticks = element_blank(), * legend.position = 'none') ``` ]] .rightcol[.code70[ ## .center[How to make a<br>**Slope chart**] <img src="figs/unnamed-chunk-48-1.png" width="432" style="display: block; margin: auto;" /> ]] --- class: inverse
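## Sketch: a dumbbell chart with built-in data

A minimal sketch of the dumbbell recipe from the earlier slides that runs without `milk_production.csv`. It assumes the built-in `USPersonalExpenditure` matrix as a stand-in data source; `spend_dumbbell` and `dumbbell_plot` are names made up for this example, and `theme_minimal()` replaces the cowplot theme. `linewidth` is the ggplot2 3.4.0+ spelling of the `size` argument for lines.

```r
library(dplyr)
library(ggplot2)

# Assumption: the built-in USPersonalExpenditure matrix (billions of dollars,
# 5 categories x 5 years) stands in for milk_production
spend_dumbbell <- USPersonalExpenditure %>%
  as.table() %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rename(category = Var1, year = Var2, billions = Freq) %>%
  # Keep only the two periods being compared
  filter(year %in% c('1940', '1960')) %>%
  mutate(
    # Sort categories by their largest value
    category = reorder(category, billions, FUN = max),
    # Convert year to discrete variable
    year = as.factor(year))

dumbbell_plot <- ggplot(spend_dumbbell,
    aes(x = billions, y = category)) +
  # One line per category (note the group variable)
  geom_line(aes(group = category), color = 'lightblue', linewidth = 1) +
  # One point per year (note the color variable)
  geom_point(aes(color = year), size = 2.5) +
  scale_color_manual(values = c('lightblue', 'steelblue')) +
  theme_minimal() +
  labs(
    x = 'Personal expenditure (billion $)',
    y = NULL,
    color = 'Year')

dumbbell_plot
```

The only reshaping needed is getting the data into long form, with one row per category and year; the line layer then connects the two rows that share a `category`.

---
class: inverse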
20:00
## Your turn - comparing multiple categories Using the `internet_regions` data frame, pick a strategy and create an improved version of this chart. .leftcol30[ Strategies: - Dodged bars - Facets - Bullet chart - Dumbbell chart - Slope chart ] .rightcol70[ <img src="figs/internet-usage-bars-1.png" width="80%" /> ] --- class: inverse, middle # Week 8: .fancy[Comparisons] ## 1. Comparing to a reference ## 2. Comparing variables ## BREAK ## 3. .orange[Comparing distributions] <!-- Comparing distributions - Box plots - Transparent histograms & densities (good for maybe 2 categories) - Ridgeline plots (good for lots of categories) Ridgeline plots: https://wilkelab.org/ggridges/ --> <!-- Helpful: https://datavizf17.classes.andrewheiss.com/files/example_2017-09-19.Rmd --> --- class: center ## Overlapping histograms have issues .leftcol[ ### Bad <img src="figs/marathon_histogram_overlap-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ ### Slightly better <img src="figs/marathon_density_overlap-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: center ## Good when number of categories is **small** .leftcol[ ### Density facets <img src="figs/marathon_density_facet-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ ### Diverging histograms <img src="figs/marathon_diverging_histograms-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: center ## Good when number of categories is **large** .leftcol[ ### Boxplot <img src="figs/college_boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ ### Ridgeplot <img src="figs/college_ridgeplot-1.png" width="504" style="display: block; margin: auto;" /> ] --- .leftcol[.code70[ ## How to make density facets You can use `facet_wrap()`, but<br>you won't get the full density overlay ```r ggplot(marathon, aes(x = Age, y = ..count.., fill = `M/F`)) + * geom_density(alpha = 0.7) + * facet_wrap(vars(`M/F`)) + 
scale_fill_manual( values = c('sienna', 'steelblue')) + scale_y_continuous( expand = expansion(mult = c(0, 0.05))) + theme_minimal_hgrid() ``` ]] .rightcol[ <img src="figs/unnamed-chunk-51-1.png" width="504" style="display: block; margin: auto;" /> ] -- .rightcol[ .center[If you want the full density overlay,<br>you have to hand-make the facets] <img src="figs/unnamed-chunk-52-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to make density facets .leftcol[.code70[ Make the full density plot first ```r base <- ggplot(marathon, * aes(x = Age, y = ..count..)) + * geom_density(fill = 'grey', alpha = 0.7) + scale_y_continuous( expand = expansion(mult = c(0, 0.05))) + theme_minimal_hgrid() ``` ]] .rightcol[ <img src="figs/density_facet_base-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to make density facets .leftcol[.code70[ Separately create each sub-plot ```r male <- base + geom_density( * data = marathon %>% * filter(`M/F` == 'M'), fill = 'steelblue', alpha = 0.7) + theme(legend.position = 'none') ``` ```r female <- base + geom_density( * data = marathon %>% * filter(`M/F` == 'F'), fill = 'sienna', alpha = 0.7) + theme(legend.position = 'none') ``` ]] .rightcol[ <img src="figs/density_facet_male-1.png" width="288" style="display: block; margin: auto;" /> <img src="figs/density_facet_female-1.png" width="288" style="display: block; margin: auto;" /> ] --- ## How to make density facets .code70[ Combine into single plot ```r plot_grid(male, female, labels = c('Male', 'Female')) ``` ] <img src="figs/unnamed-chunk-57-1.png" width="792" style="display: block; margin: auto;" /> --- ## How to make diverging histograms .leftcol[.code70[ Make the histograms by filtering the data ```r ggplot(marathon, aes(x = Age)) + # Add histogram for Female runners: geom_histogram( * data = marathon %>% * filter(`M/F` == 'F'), aes(fill = `M/F`, y=..count..), alpha = 0.7, color = 'white') + # Add negative histogram for Male runners: 
geom_histogram( * data = marathon %>% * filter(`M/F` == 'M'), aes(fill = `M/F`, y=..count..*(-1)), alpha = 0.7, color = 'white') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-58-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to make diverging histograms .leftcol[.code70[ Make the histograms by filtering the data ```r ggplot(marathon, aes(x = Age)) + # Add histogram for Female runners: geom_histogram( data = marathon %>% filter(`M/F` == 'F'), * aes(fill = `M/F`, y=..count..), alpha = 0.7, color = 'white') + # Add negative histogram for Male runners: geom_histogram( data = marathon %>% filter(`M/F` == 'M'), * aes(fill = `M/F`, y=..count..*(-1)), alpha = 0.7, color = 'white') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-60-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to make diverging histograms .leftcol[.code70[ Rotate, adjust colors, theme, annotate ```r ggplot(marathon, aes(x = Age)) + # Add histogram for Female runners: geom_histogram( data = marathon %>% filter(`M/F` == 'F'), aes(fill = `M/F`, y=..count..), alpha = 0.7, color = 'white') + # Add negative histogram for Male runners: geom_histogram( data = marathon %>% filter(`M/F` == 'M'), aes(fill = `M/F`, y=..count..*(-1)), alpha = 0.7, color = 'white') + * scale_fill_manual( values = c('sienna', 'steelblue')) + * coord_flip() + * theme_minimal_hgrid() + * labs(fill = 'Gender', * y = 'Count') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-62-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to make ridgeplots .leftcol[.code70[ Make a ridgeplot with **ggridges** library ```r *library(ggridges) college_all_ages %>% mutate( major_category = fct_reorder( major_category, median)) %>% ggplot() + * geom_density_ridges( * aes(x = median, y = major_category), * scale = 4, alpha = 0.7) + scale_y_discrete(expand = c(0, 0)) + scale_x_continuous(expand = c(0, 0)) + * coord_cartesian(clip = "off") + * theme_ridges() + labs(x = 'Median income ($)', y 
= 'Major category') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-64-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
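## Sketch: density facets with a background layer

A possible alternative to hand-making each facet, sketched under stated assumptions: it is not the `plot_grid()` approach from the earlier slides. It relies on `facet_wrap()` repeating a layer in every panel when that layer's data lacks the faceting column. ggplot2's built-in `diamonds` data stands in for `marathon`; `diamonds_all` and `density_facets` are names made up for this example, and `after_stat(count)` is the current spelling of `..count..` (ggplot2 3.3.0+).

```r
library(dplyr)
library(ggplot2)

# Assumption: ggplot2's built-in diamonds data stands in for marathon.
# Background data: same rows without the faceting column,
# so ggplot2 repeats this layer in every panel
diamonds_all <- diamonds %>%
  select(-cut)

density_facets <- ggplot(diamonds,
    aes(x = price, y = after_stat(count))) +
  # Full density in grey behind every panel
  geom_density(data = diamonds_all, fill = 'grey', alpha = 0.7) +
  # Density for each panel's own group on top
  geom_density(aes(fill = cut), alpha = 0.7) +
  facet_wrap(vars(cut), nrow = 1) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.05))) +
  theme_minimal() +
  theme(legend.position = 'none') +
  labs(x = 'Price ($)', y = 'Count')

density_facets
```

Because both layers use counts rather than normalized densities, each colored curve sits inside the grey one, which makes each group's share of the whole visible.

---
class: inverse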
15:00
## Your turn - comparing distributions Use the `gapminder.csv` data to create the following charts comparing the distribution of country life expectancies across continents in 2007. .leftcol[ <img src="figs/gapminder_densities-1.png" width="468" style="display: block; margin: auto;" /> ] .rightcol[ <img src="figs/gapminder_ridges-1.png" width="468" style="display: block; margin: auto;" /> ]