class: middle, inverse

.leftcol30[
<center>
<img src="https://raw.githubusercontent.com/emse-eda-gwu/2022-Fall/master/images/logo.png" width=250>
</center>
]

.rightcol70[
# Week 9: .fancy[Trends]
### EMSE 4572: Exploratory Data Analysis
### John Paul Helveston
### October 26, 2022
]

---
class: center, middle, inverse

# .fancy[.blue[Tip of the week]]

## Code outline in RStudio

---

## Today's data

Old:

```r
gapminder <- read_csv(here::here('data', 'gapminder.csv'))
milk_production <- read_csv(here::here('data', 'milk_production.csv'))
global_temps <- read_csv(here::here('data', 'nasa_global_temps.csv'))
internet_region <- read_csv(here::here('data', 'internet_users_region.csv'))
```

New:

```r
us_covid <- read_csv(here::here('data', 'us_covid.csv'))
internet_country <- read_csv(here::here('data', 'internet_users_country.csv'))
hotdogs <- read_csv(here::here('data', 'hot_dog_winners.csv'))
us_diseases <- read_csv(here::here('data', 'us_contagious_diseases.csv'))
```

---

## New packages:

```r
install.packages('viridis')
install.packages('gganimate')
install.packages('magick')
```

---
class: inverse, middle

# Week 9: .fancy[Trends]

## 1. Single Variables
## 2. Animations
## BREAK
## 3. Multiple Variables

---
class: inverse, middle

# Week 9: .fancy[Trends]

## 1. .orange[Single Variables]
## 2. Animations
## BREAK
## 3. 
Multiple Variables --- .leftcol[ ## Points Plotting the data points is a good starting point for viewing trends <img src="figs/milk_ca_point-1.png" width="432" style="display: block; margin: auto;" /> ] -- .rightcol[ ## Points + line Adding lines between the points helps<br>see the overall trend <img src="figs/milk_ca_point_line-1.png" width="432" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Line Omitting the points emphasizes the overall trend<br><br><br> <img src="figs/milk_ca_line-1.png" width="432" style="display: block; margin: auto;" /> ] -- .rightcol[ ## Line + area Filling area below line emphasizes cumulative over time (y-axis should start at 0) <img src="figs/milk_ca_line_area-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: center, middle ### If points are too sparse, a line can be misleading .leftcol[ <img src="figs/milk_ca_point_line_sparse1-1.png" width="432" style="display: block; margin: auto;" /> ] .rightcol[ <img src="figs/milk_ca_point_line_sparse2-1.png" width="432" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Smoothed line Adding a "smoothed" line shows a modeled representation of the overall trend <img src="figs/milk_ca_smooth-1.png" width="432" style="display: block; margin: auto;" /> ] -- .rightcol[ ## Smoothed line + points Putting the smoothed line over the data points helps show whether **outliers** are driving the trend line <img src="figs/milk_ca_smooth_points-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: center, middle ### Bars are useful when emphasizing the **data points**<br>rather than the **slope between them** .leftcol[ <img src="figs/hotdog_bar_record-1.png" width="468" style="display: block; margin: auto;" /> ] -- .rightcol[ <img src="figs/hotdog_bar_winner-1.png" width="468" style="display: block; margin: auto;" /> ] --- ## How to: **Points + line** .leftcol55[.code70[ Be sure to draw the line first,<br>then overlay the points ```r ggplot(milk_ca, 
* aes(x = year, y = milk_produced)) + * geom_line(color = 'steelblue', size = 0.5) + * geom_point(color = 'steelblue', size = 2) + theme_half_open(font_size = 18) + labs(x = 'Year', y = 'Milk produced (billion lbs)', title = 'Milk production in California') ``` ]] .rightcol45[ <img src="figs/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## How to: **Line + area** .leftcol55[.code70[ Likewise, draw the area first then overlay the line ```r ggplot(milk_ca, aes(x = year, y = milk_produced)) + * geom_area(fill = 'steelblue', alpha = 0.5) + * geom_line(color = 'steelblue', size = 1) + scale_y_continuous( expand = expansion(mult = c(0, 0.05))) + theme_half_open(font_size = 18) + labs(x = 'Year', y = 'Milk produced (billion lbs)', title = 'Milk production in California') ``` ]] .rightcol45[ <img src="figs/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" /> ] --- ## How to: **Smoothed line + points** .leftcol[.code70[ Use `alpha` to make points slightly transparent ```r ggplot(milk_ca, aes(x = year, y = milk_produced)) + * geom_point(color = 'grey', * size = 2, alpha = 0.9) + * geom_smooth(color = 'steelblue', * size = 1, se = FALSE) + theme_half_open(font_size = 18) + labs( x = 'Year', y = 'Milk produced (billion lbs)', title = 'Milk production in California') ``` ]] .rightcol[ <img src="figs/unnamed-chunk-11-1.png" width="432" style="display: block; margin: auto;" /> ] --- class: inverse
## Your turn .leftcol[ Use the `global_temps` data frame to explore ways to visualize the change in average global temperatures. Consider using: - points - lines - areas - smoothed lines ] .rightcol[ ```r global_temps <- read_csv(here::here( 'data', 'nasa_global_temps.csv')) head(global_temps) ``` ``` #> # A tibble: 6 × 3 #> year meanTemp smoothTemp #> <dbl> <dbl> <dbl> #> 1 1880 -0.15 -0.08 #> 2 1881 -0.07 -0.12 #> 3 1882 -0.1 -0.15 #> 4 1883 -0.16 -0.19 #> 5 1884 -0.27 -0.23 #> 6 1885 -0.32 -0.25 ``` ] --- class: inverse, middle # Week 9: .fancy[Trends] ## 1. Single Variables ## 2. .orange[Animations] ## BREAK ## 3. Multiple Variables --- class: center, middle ### Animation adds emphasis to the **change over time** -- ### ...plus it's fun! --- class: center # Static chart <img src="figs/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> --- class: center # Animated chart <center> <img src="images/milk_region_animation.gif" width="800"> </center> --- class: center ### Animation is particularly helpful for the **time dimension** <center> <img src="images/gapminder_animation.gif" width=600> </center> .left["Gapminder" visualization by Hans Rosling] --- class: center ### Animation is particularly helpful for the **time dimension** <center> <img src="images/spiral_temps.gif" width=450> </center> .left[Source: https://www.climate-lab-book.ac.uk/spirals/] --- class: center ### Animation is particularly helpful for the **time dimension** <center> <img src="images/ft-flu.gif" width=800> </center> .left[Financial Times comparison of Flu seasons to COVID-19] --- class: center ### Animation is particularly helpful for the **time dimension** <center> <img src="images/milk_race_anim.gif" width=550> </center> .left["Bar chart race" of top 10 milk producing states] --- ## How to: **Animate a line plot** .leftcol[.code60[ Make a static plot w/labels for each year ```r milk_region_anim_plot <- milk_region %>% ggplot( aes(x = year, y = milk_produced, color 
= region)) + * geom_line(size = 1) + * geom_point(size = 2) + * geom_text_repel( * aes(label = region), * hjust = 0, nudge_x = 1, direction = "y", * size = 6, segment.color = NA) + scale_x_continuous( breaks = seq(1970, 2010, 10), expand = expansion(add = c(1, 13))) + scale_color_manual(values = c( 'sienna', 'forestgreen', 'dodgerblue', 'orange')) + theme_half_open(font_size = 18) + theme(legend.position = 'none') + labs(x = 'Year', y = 'Milk produced (billion lbs)', title = 'Milk production in four US regions') milk_region_anim_plot ``` ]] .rightcol[ <img src="figs/unnamed-chunk-16-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to: **Animate a line plot** .leftcol[.code60[ Now animate it **Note the pause at the end!** ```r *library(gganimate) milk_region_anim <- milk_region_anim_plot + * transition_reveal(year) # Render the animation *animate(milk_region_anim, * end_pause = 15, * duration = 10, * width = 1100, height = 650, res = 150, * renderer = magick_renderer()) # Save last animation anim_save(here::here( 'figs', 'milk_region_animation.gif')) ``` ]] .rightcol[ <center> <img src="images/milk_region_animation.gif"> </center> ] --- ## How to: **Change label based on year** .leftcol55[.code80[ First make a static plot ```r gapminder_anim_plot <- ggplot(gapminder, * aes(x = gdpPercap, y = lifeExp, * size = pop, color = continent)) + geom_point(alpha = 0.7) + * scale_size_area( * guide = FALSE, max_size = 15) + scale_color_brewer(palette = 'Set2') + scale_x_log10() + theme_bw(base_size = 18) + theme(legend.position = c(0.85, 0.3)) + labs(x = 'GDP per capita', y = 'Life expectancy', color = 'Continent') gapminder_anim_plot ``` ]] .rightcol45[ <img src="figs/unnamed-chunk-18-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## How to: **Change label based on year** .leftcol[.code80[ Now animate it **Note**: Year must be an integer! 
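
`read_csv()` parses numeric columns as doubles by default, so `year` may need converting first. A minimal sketch of that conversion, using a small hypothetical stand-in (`gapminder_toy`, not the real `gapminder` data):

```r
library(dplyr)

# Hypothetical stand-in for gapminder; the values are illustrative only
gapminder_toy <- tibble(
  year    = c(1952, 1957, 1962),  # doubles, as read_csv() would parse them
  lifeExp = c(28.8, 30.3, 32.0))

# Convert year to integer before building the animation
gapminder_toy <- gapminder_toy %>%
  mutate(year = as.integer(year))

# Check the column type
class(gapminder_toy$year)
```
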
```r gapminder_anim <- gapminder_anim_plot + * transition_time(year) + * labs(title = "Year: {frame_time}") # Render the animation animate(gapminder_anim, end_pause = 10, width = 800, height = 600, res = 150, renderer = magick_renderer()) ``` ]] .rightcol[ <center> <img src="images/gapminder_animation.gif" width=600> </center> ] --- .leftcol[.code40[ ```r milk_race_anim <- milk_production %>% group_by(year) %>% mutate( rank = rank(-milk_produced), Value_rel = milk_produced / milk_produced[rank==1], Value_lbl = paste0(' ', round(milk_produced))) %>% group_by(state) %>% filter(rank <= 10) %>% ungroup() %>% mutate(year = as.integer(year)) %>% ggplot(aes(x = rank, group = state, fill = region, color = region)) + * geom_tile(aes(y = milk_produced / 2, * height = milk_produced), * width = 0.9, alpha = 0.8, color = NA) + geom_text(aes(y = 0, label = paste(state, " ")), vjust = 0.2, hjust = 1) + geom_text(aes(y = milk_produced, label = Value_lbl), hjust = 0) + coord_flip(clip = 'off', expand = FALSE) + scale_y_continuous(labels = scales::comma) + scale_fill_viridis(discrete = TRUE) + scale_color_viridis(discrete = TRUE) + scale_x_reverse() + guides(color = FALSE) + theme_minimal_vgrid() + theme( axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(), legend.position = c(0.7, 0.3), legend.background = element_rect(fill = 'white'), plot.title = element_text( size = 22, hjust = 0.5, face = 'bold', colour = 'grey', vjust = -1), plot.subtitle = element_text( size = 18, hjust = 0.5, face = 'italic', color = 'grey'), plot.caption = element_text( size = 8, hjust = 0.5, face = 'italic', color = 'grey'), plot.margin = margin(0.5, 2, 0.5, 3, 'cm')) + * transition_time(year) + * view_follow(fixed_x = TRUE) + labs(title = 'Year : {frame_time}', subtitle = 'Top 10 states by milk produced', fill = 'Region', caption = 'Milk produced (billions lbs)') ``` ]] .rightcol[ ### .center[Making a bar chart race <br>[(tutorial 
here)](https://www.emilykuehler.com/portfolio/barchart-race/)] .code60[ ```r animate(milk_race_anim, duration = 17, end_pause = 15, width = 800, height = 700, res = 150, renderer = magick_renderer()) ``` ] <center> <img src="images/milk_race_anim.gif" width=500> </center> ] --- # Resources ### More animation options: - [More on gapminder + line charts](https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/) - [Customizing the animation](https://github.com/ropenscilabs/learngganimate/blob/2872425f08392f9f647005eb19a9d4afacd1ab44/animate.md) --- class: inverse
## Your turn .leftcol[ Use the `global_temps` data frame to explore ways to _animate_ the change in average global temperatures. Consider using: - points - lines - areas ] .rightcol[ <center> <img src="images/global_temps_points_anim.gif" width=500> </center> ] --- class: inverse, center
# Break!

## Stand up, Move around, Stretch!

---
class: inverse, middle

# Week 9: .fancy[Trends]

## 1. Single Variables
## 2. Animations
## BREAK
## 3. .orange[Multiple Variables]

---
class: center, middle

## With multiple categories,<br>points & lines can get messy

--

.leftcol[

<img src="figs/milk_region_dot-1.png" width="540" style="display: block; margin: auto;" />

]

--

.rightcol[

<img src="figs/milk_region_dot_line-1.png" width="540" style="display: block; margin: auto;" />

]

---
class: center, middle

.leftcol[

### **Better**: Lines alone make distinguishing trends easier

<img src="figs/milk_region_line-1.png" width="540" style="display: block; margin: auto;" />

]

--

.rightcol[

### **Even better**: Directly label<br>lines to remove legend

<img src="figs/milk_region_line_label-1.png" width="540" style="display: block; margin: auto;" />

]

---
class: center, middle

### If the goal is to communicate the **overall / total** trend,<br>consider a stacked area chart

--

.leftcol[

### Highlights **regional** trends

<img src="figs/internet_region_line_label-1.png" width="540" style="display: block; margin: auto;" />

]

--

.rightcol[

### Highlights **overall / total** trend

<img src="figs/internet_region_area-1.png" width="576" style="display: block; margin: auto;" />

]

---

### If you have **lots** of categories:<br>**1) Plot all the data with the average highlighted**

--

.leftcol[

.center[Measles in **California**]

<img src="figs/measles_line_ca-1.png" width="432" style="display: block; margin: auto;" />

]

--

.rightcol[

.center[Measles in **all 50 states**]

<img src="figs/measles_line_us-1.png" width="432" style="display: block; margin: auto;" />

]

---

### If you have **lots** of categories:<br>1) Plot all the data with the average highlighted<br>**2) Plot all the data with a heat map**

<img src="figs/measles_heat_map-1.png" width="55%" style="display: block; margin: auto;" />

---

## Heat maps are great for multiple divisions of time

### My activity on GitHub:

<center>
<img 
src="images/jhelvy_github.png" width=800>
</center>

<br>

--

### Check out this heat map on [Traffic fatalities](https://flowingdata.com/2017/04/27/traffic-fatalities-when-and-where/)

---

.leftcol55[.code60[

Make the basic line chart first

```r
# Format the data
milk_region <- milk_production %>%
  filter(region %in% c(
    'Pacific', 'Northeast', 'Lake States', 'Mountain')) %>%
  group_by(year, region) %>%
  summarise(milk_produced = sum(milk_produced)) %>%
  ungroup()

# Make the line chart
ggplot(milk_region,
*   aes(x = year, y = milk_produced,
*       color = region)) +
* geom_line(size = 1) +
  scale_color_manual(values = c(
    'sienna', 'forestgreen', 'dodgerblue', 'orange')) +
  theme_half_open(font_size = 18) +
  labs(
    x = 'Year',
    y = 'Milk produced (billion lbs)',
    color = 'Region',
    title = 'Milk production in four US regions')
```

]]

.rightcol45[

## .center[How to:<br>**Directly label lines**]

<br>

<img src="figs/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" />

]

---

.leftcol55[.code60[

```r
# Format the data
milk_region <- milk_production %>%
  filter(region %in% c(
    'Pacific', 'Northeast', 'Lake States', 'Mountain')) %>%
  group_by(year, region) %>%
  summarise(milk_produced = sum(milk_produced)) %>%
  ungroup()

# Make the line plot
ggplot(milk_region,
    aes(x = year, y = milk_produced,
        color = region)) +
  geom_line(size = 1) +
  # Add labels
* geom_text_repel(
*   data = milk_region %>%
*     filter(year == max(year)),
*   aes(label = region),
*   hjust = 0, nudge_x = 1, direction = "y",
*   size = 6, segment.color = NA) +
  # Create space for labels on right side
* scale_x_continuous(
*   breaks = seq(1970, 2010, 10),
*   expand = expansion(add = c(1, 13))) +
  scale_color_manual(values = c(
    'sienna', 'forestgreen', 'dodgerblue', 'orange')) +
  theme_half_open(font_size = 18) +
  # Remove legend
* theme(legend.position = 'none') +
  labs(x = 'Year',
       y = 'Milk produced (billion lbs)',
       title = 'Milk production in four US regions')
```

]]

.rightcol45[

## .center[How to:<br>**Directly label 
lines**] <br> <img src="figs/unnamed-chunk-28-1.png" width="540" style="display: block; margin: auto;" /> ] --- ## How to: **Stacked area** .leftcol55[.code70[ ```r internet_region %>% mutate(numUsers = numUsers / 10^9) %>% ggplot() + * geom_area(aes(x = year, y = numUsers, * fill = region)) + # Nice colors from "viridis" library: * scale_fill_viridis(discrete = TRUE) + # Sort the legend into 3 rows * guides(fill = guide_legend( * nrow = 3, byrow = FALSE)) + theme_minimal_grid(font_size = 15) + theme(legend.position = 'bottom') + labs( x = 'Year', y = NULL, fill = 'Region', title = 'Number of internet users (billions)') ``` ]] .rightcol45[ <img src="figs/unnamed-chunk-30-1.png" width="576" style="display: block; margin: auto;" /> ] --- .leftcol[.code60[ Format the data ```r # Format the data measles <- us_diseases %>% filter( disease == 'Measles', !state %in% c("Hawaii", "Alaska")) %>% mutate( rate = (count / population) * 10000, state = fct_reorder(state, rate)) %>% # Compute annual mean rate across all states * group_by(year) %>% * mutate( * mean_rate = sum(count) / sum(population) * 10000) ``` Make all the state lines in light grey color ```r ggplot(measles) + * geom_line(aes(x = year, y = rate, group = state), * color = 'grey', alpha = 0.3) + # Add reference line & label: geom_vline(xintercept = 1963, col = 'blue', linetype = 'dashed') + annotate('text', x = 1964, y = 150, hjust = 0, label = 'Vaccine introduced in 1964', color = 'blue') + theme_minimal_grid(font_size = 18) + labs(y = 'Cases per 10,000 people') ``` ]] .rightcol[ ## How to:<br>**Average line overlay** <img src="figs/unnamed-chunk-32-1.png" width="432" style="display: block; margin: auto;" /> ] --- .leftcol55[.code70[ Now overlay the annual mean line ```r ggplot(measles) + geom_line( aes(x = year, y = rate, group = state), color = 'grey', alpha = 0.3) + * geom_line( * aes(x = year, y = mean_rate), size = 0.8) + # Add US mean label * annotate( * 'text', x = 1945, y = 55, hjust = 0, * label = 'US 
Mean') + # Add reference line & label geom_vline(xintercept = 1963, col = 'blue', linetype = 'dashed') + annotate('text', x = 1964, y = 150, hjust = 0, label = 'Vaccine introduced in 1964', color = 'blue') + theme_minimal_grid(font_size = 18) + labs(y = 'Cases per 10,000 people') ``` ]] .rightcol45[ ## How to:<br>**Average line overlay** <img src="figs/unnamed-chunk-34-1.png" width="432" style="display: block; margin: auto;" /> ] --- .leftcol[.code70[ Create main grid with `geom_tile()` ```r ggplot(measles) + * geom_tile( * aes(x = year, y = state, fill = rate), * color = 'grey80') + # Add reference line & label geom_vline( xintercept = 1963, col = 'blue') + annotate( 'text', x = 1964, y = 50.5, hjust = 0, label = 'Vaccine introduced in 1964', color = 'blue') ``` ]] .rightcol[ ## How to: **Heat map** <img src="figs/unnamed-chunk-35-1.png" width="576" style="display: block; margin: auto;" /> ] --- .leftcol[.code60[ Adjust scales and adjust theme ```r ggplot(measles) + geom_tile(aes(x = year, y = state, fill = rate), color = 'grey80') + # Add reference line & label geom_vline(xintercept = 1963, col = 'blue') + annotate( 'text', x = 1964, y = 50.5, hjust = 0, label = 'Vaccine introduced in 1964', color = 'blue') + # Adjust scales * scale_x_continuous(expand = c(0, 0)) + * scale_fill_viridis( * option = 'inferno', direction = -1) + # Adjust theme * theme_minimal() + * theme( * panel.grid = element_blank(), * legend.position = 'bottom', * text = element_text(size = 10)) + * coord_cartesian(clip = 'off') + labs( x = NULL, y = NULL, fill = 'Cases per 10,000 people', title = 'Measles') ``` ]] .rightcol[ #### .center[Color scale is linear in this chart] <img src="figs/unnamed-chunk-36-1.png" width="576" style="display: block; margin: auto;" /> ] --- .leftcol[.code60[ Adjust scales and adjust theme ```r ggplot(measles) + geom_tile(aes(x = year, y = state, fill = rate), color = 'grey80') + # Add reference line & label geom_vline(xintercept = 1963, col = 'blue') + annotate( 
'text', x = 1964, y = 50.5, hjust = 0, label = 'Vaccine introduced in 1964', color = 'blue') + # Adjust scales scale_x_continuous(expand = c(0, 0)) + scale_fill_viridis( option = 'inferno', direction = -1, * trans = 'sqrt') + # Modify legend color bar * guides(fill = guide_colorbar( * title.position = 'top', reverse = TRUE)) + # Adjust theme theme_minimal() + theme( panel.grid = element_blank(), legend.position = 'bottom', text = element_text(size = 10)) + coord_cartesian(clip = 'off') + labs( x = NULL, y = NULL, fill = 'Cases per 10,000 people', title = 'Measles') ``` ]] .rightcol[ #### .center[Non-linear color scale<br>helps with large variations] <img src="figs/unnamed-chunk-38-1.png" width="576" style="display: block; margin: auto;" /> ] --- class: inverse
## Your turn

.leftcol[

Use the `us_covid` data frame to explore ways to visualize the number of daily cases using:

1. A labeled line chart
2. A stacked area chart
3. A heat map

]

.rightcol[

```r
us_covid <- read_csv(here::here(
  'data', 'us_covid.csv'))

head(us_covid)
```

```
#> # A tibble: 6 × 7
#>   date         day state   cases_daily deaths_daily cases_total deaths_total
#>   <date>     <dbl> <chr>         <dbl>        <dbl>       <dbl>        <dbl>
#> 1 2020-01-23     1 Alabama           0            0           0            0
#> 2 2020-01-24     2 Alabama           0            0           0            0
#> 3 2020-01-25     3 Alabama           0            0           0            0
#> 4 2020-01-26     4 Alabama           0            0           0            0
#> 5 2020-01-27     5 Alabama           0            0           0            0
#> 6 2020-01-28     6 Alabama           0            0           0            0
```

]

---
class: center, middle, inverse

# Two other examples for showing<br>change across multiple categories

---
class: center

# Seasonal chart

<center>
<img src="images/seasonal_chart.png" width=700>
</center>

.left[Source: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Seasonal%20Plot]

---
class: center

## Sankey chart

<center>
<img src="images/energy_sankey.png" width=800>
</center>

Source: https://flowcharts.llnl.gov/

---
class: middle

.center[Would you consider purchasing an electric car?]

<center>
<img src="images/bevSankey.png" width=700>
</center>

.leftcol70[.font70[
Roberson, Laura A. & Helveston, J.P. (2020) "Electric vehicle adoption: can short experiences lead to big change?," Environmental Research Letters. 15(0940c3).<br>Made using the [ggforce](https://www.data-imaginist.com/2019/the-ggforce-awakens-again/) package
]]
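
---

## How to: **Sankey-style chart** (sketch)

A rough sketch of how a chart like the one on the previous slide might be started with ggforce's parallel sets geoms. This is not the code behind that figure: the built-in `Titanic` table is used as a stand-in data set, and the names `titanic`, `titanic_sets`, and `sankey_plot` are made up for illustration.

```r
library(ggplot2)
library(ggforce)

# Stand-in data (an assumption for illustration): built-in Titanic counts,
# one row per combination of Class, Sex, Age, Survived, with a Freq column
titanic <- as.data.frame(Titanic)

# Reshape so each row is one (observation, axis) pair;
# this adds the id, x, and y columns that the geoms expect
titanic_sets <- gather_set_data(titanic, 1:4)

sankey_plot <- ggplot(titanic_sets,
    aes(x = x, id = id, split = y, value = Freq)) +
  # Ribbons flowing between the categories on neighboring axes
  geom_parallel_sets(aes(fill = Sex), alpha = 0.3, axis.width = 0.1) +
  # Boxes and labels for each category on each axis
  geom_parallel_sets_axes(axis.width = 0.1) +
  geom_parallel_sets_labels(color = 'white') +
  theme_minimal()

sankey_plot
```

The reshaping step is the key design choice: the ribbons need long-format data, so each original row is repeated once per axis.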