class: middle, inverse .leftcol30[ <center> <img src="https://raw.githubusercontent.com/emse-eda-gwu/2021-Spring/master/images/eda_hex_sticker.png" width=250> </center> ] .rightcol70[ # Week 3: .fancy[Centrality & Variability] ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 4575: Exploratory Data Analysis ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> January 27, 2021 ] --- class: center # Thanks for the heros 😄 |<center><img src="images/heros/helena.png" height="120"></center> Helena | <center><img src="images/heros/katie.gif" height="120"></center> Katie | <center><img src="images/heros/carolyne.jpg" height="120"></center> Carolyne | <center><img src="images/heros/kaveena.png" height="120"></center> Kaveena | <center><img src="images/heros/alejandro.jpg" height="120"></center> Alejandro | |-|-|-|-|-| |<center><img src="images/heros/kyara.jpg" height="120"></center> Kyara | <center><img src="images/heros/ebun.gif" height="120"></center> Ebun | <center><img src="images/heros/kareemot.gif" height="120"></center> Kareemot | <center><img src="images/heros/omar1.png" height="120"></center> Omar 1 | <center><img src="images/heros/omar2.jpg" height="120"></center> Omar 2 | |<center><img src="images/heros/matthew.gif" height="120"></center> Matthew | <center><img src="images/heros/michael.gif" height="120"></center> Michael | <center><img src="images/heros/alexa.jpg" height="120"></center> Alexa | <center><img src="images/heros/kazi.gif" height="120"></center> Kazi| <center><img src="images/heros/eliese.gif" height="120"></center> Eliese | --- # Updates -- ## Office hours are set (posted in #links in slack & on BB) : - 5-7pm Mondays w/Jenny K. - 4:30-6pm Tuesdays w/Lydia G. - 7-9pm Tuesdays w/Saurav P. - 2-4pm Fridays w/Prof. 
Helveston -- ## Meet Lydia -- ## Jenny has an announcement --- class: center, middle, inverse # Tip of the week: # `theme_set()` --- # Add "global" settings to all plots ```r library(knitr) library(tidyverse) library(here) knitr::opts_chunk$set( warning = FALSE, message = FALSE, comment = "#>", * fig.path = "figs/", # Plot save path * fig.width = 7.252, # Plot dimensions * fig.height = 4, * fig.retina = 3 # Better plot resolution ) *theme_set(theme_bw(base_size = 20)) # Set theme for all ggplots ``` --- ```r ggplot(mtcars) + geom_point(aes(x = mpg, y = hp)) ``` .leftcol[ Default theme <img src="figs/unnamed-chunk-3-1.png" width="522.144" /> ] .rightcol[ `theme_bw(base_size = 20)` <img src="figs/unnamed-chunk-4-1.png" width="522.144" /> ] --- class: inverse, middle # Week 3: .fancy[Centrality & Variability] ## 1. Data Types ## 2. Measures of Centrality & Variability ## BREAK ## 3. Visualizing Centrality & Variability ## 4. Relationships Between 2 Variables ## 5. Exploratory Data Analysis --- class: inverse, middle # Week 3: .fancy[Centrality & Variability] ## 1. .orange[Data Types] ## 2. Measures of Centrality & Variability ## BREAK ## 3. Visualizing Centrality & Variability ## 4. Relationships Between 2 Variables ## 5. Exploratory Data Analysis --- class: inverse, center, middle # 24,901 ??? If I walked up to you, and said, "The answer is 24,901," you would probably be confused. By itself, a number means nothing. --- class: inverse, center, middle # Earth's circumference at the equator:<br>24,901 miles ??? But if I were to tell you that the circumference of the earth at the equator is 24,901 miles, that would mean something. To be complete and meaningful, quantitative information consists of both quantitative data (the numbers) and categorical data (the labels that tell us what the numbers measure). --- # Types of Data -- .leftcol[ ### **Categorical** Subdivide things into _groups_ - What type? - Which category? ] -- .rightcol[ ### **Numerical** Measure things with numbers - How many? - How much? ] --- ## Categorical (discrete) variables -- .leftcol[ ### **Nominal** - Order doesn't matter - Differ in "name" (nominal) only e.g. 
`country` in TB case data: .code80[ ``` #> # A tibble: 6 x 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ]] -- .rightcol[ ### **Ordinal** - Order matters - Distance between units not equal e.g.: `Placement` 2017 Boston marathon: .code80[ ``` #> # A tibble: 6 x 3 #> Placement `Official Time` Name #> <dbl> <time> <chr> #> 1 1 02:09:37 Kirui, Geoffrey #> 2 2 02:09:58 Rupp, Galen #> 3 3 02:10:28 Osako, Suguru #> 4 4 02:12:08 Biwott, Shadrack #> 5 5 02:12:35 Chebet, Wilson #> 6 6 02:12:45 Abdirahman, Abdi ``` ]] --- ## Numerical data -- .leftcol[ ### **Interval** - Numerical scale with<br>arbitrary starting point - No "0" point - Can't say "x" is double "y" e.g.: `temp` in Beaver data ``` #> day time temp activ #> 1 346 840 36.33 0 #> 2 346 850 36.34 0 #> 3 346 900 36.35 0 #> 4 346 910 36.42 0 #> 5 346 920 36.55 0 #> 6 346 930 36.69 0 ``` ] -- .rightcol[ ### **Ratio** - Has a "0" point - Can be described as percentages - Can say "x" is double "y" e.g.: `height` & `speed` in wildlife impacts ``` #> # A tibble: 6 x 3 #> incident_date height speed #> <dttm> <dbl> <dbl> #> 1 2018-12-31 00:00:00 700 200 #> 2 2018-12-27 00:00:00 600 145 #> 3 2018-12-23 00:00:00 0 130 #> 4 2018-12-22 00:00:00 500 160 #> 5 2018-12-21 00:00:00 100 150 #> 6 2018-12-18 00:00:00 4500 250 ``` ] --- class: inverse, center, middle # Key Questions -- .leftcol[ ## Categorical ## .orange[Does the order matter?] | Yes | No | |---|---| | Ordinal | Nominal | ] -- .rightcol[ ## Numerical ## .orange[Is there a "baseline"?] | Yes | No | |---|---| | Ratio | Interval | ] --- class: center, middle # Be careful of how variables are encoded! --- ## .red[When numbers are categories] - "Dummy coding": e.g., `passedTest` = `1` or `0`) - "North", "South", "East", "West" = `1`, `2`, `3`, `4` -- ## .red[When ratio data are discrete (i.e. counts)] - Number of eggs in a carton, heart beats per minute, etc. - Continuous variables measured discretely (e.g. age) -- ## .red[Time] - As _ordinal_ categories: "Jan.", "Feb.", "Mar.", etc. - As _interval_ scale: "Jan. 1", "Jan. 2", "Jan. 3", etc. - As _ratio_ scale: "30 sec", "60 sec", "70 sec", etc. --- # **Quick practice**: What's the data type? > Decide [here](https://docs.google.com/presentation/d/1C9-MPyUaHuyYHfz0SxpZb11GDT4xUXvBjhz7wQoEoLE/edit?usp=sharing) (link also in #classroom) .code70[ ```r wildlife_impacts %>% filter(!is.na(cost_repairs_infl_adj)) %>% select(incident_date, time_of_day, species, cost_repairs_infl_adj) ``` ``` #> # A tibble: 615 x 4 #> incident_date time_of_day species cost_repairs_infl_adj #> <dttm> <chr> <chr> <dbl> #> 1 2018-10-25 00:00:00 Day Unknown bird - large 1000 #> 2 2018-09-05 00:00:00 <NA> Unknown bird - medium 200 #> 3 2018-08-09 00:00:00 Day Semipalmated sandpiper 10000 #> 4 2018-06-24 00:00:00 Day Unknown bird - large 100000 #> 5 2018-02-18 00:00:00 Day Rough-legged hawk 20000 #> 6 2018-01-05 00:00:00 Night Brant 487000 #> 7 2017-10-31 00:00:00 Day Unknown bird - small 51 #> 8 2017-10-12 00:00:00 <NA> Swainson's thrush 5120 #> 9 2017-09-17 00:00:00 Day Cattle egret 531763 #> 10 2017-09-16 00:00:00 <NA> Unknown bird - medium 102 #> # … with 605 more rows ``` ] ??? - incident_date: Interval - time_of_day: Ordinal - species: Nominal - cost_repairs_infl_adj: Ratio --- class: inverse, middle # Week 3: .fancy[Centrality & Variability] ## 1. 
Data Types ## 2. .orange[Measures of Centrality & Variability] ## BREAK ## 3. Visualizing Centrality & Variability ## 4. Relationships Between 2 Variables ## 5. Exploratory Data Analysis --- class: inverse, middle # .center[.font140[Summary Measures:]] # This week: .red[Centrality] & .blue[Variability] # Next week: .green[Correlation] --- # .red[Centrality (a.k.a. The "Average" Value)] -- ### A single number representing the _middle_ of a set of numbers -- ### **Mean**: `\(\frac{\text{Sum of values}}{\text{# of values}}\)` -- ### **Median**: "Middle" value (50% of data above & below) -- ### **Mode**: Most frequent value (usually for categorical data) --- # .center[Mean isn't always the "best" choice] .leftcol40[ ```r wildlife_impacts %>% filter(! is.na(height)) %>% summarise( mean = mean(height), median = median(height)) ``` ``` #> # A tibble: 1 x 2 #> mean median #> <dbl> <dbl> #> 1 984. 50 ``` Percent of data below mean: ``` #> [1] "73.9%" ``` ] -- .rightcol60[ <img src="figs/wildlife-hist-1.png"> ] ??? On average, where do planes hit birds? Saying ~1000 ft is misleading It's much more likely to be under 100 ft --- class: inverse # .center[Beware the "flaw of averages"] -- .leftcol[ ### What happened to the statistician that crossed a river with an average depth of 3 feet? ] -- .rightcol[ ### ...he drowned <img src="images/foa.jpg" width=600> ] --- # .blue[Variability ("Spread")] -- ### **Standard deviation**: distribution of values relative to the mean ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` -- ### **Interquartile range (IQR)**: `\(Q_3 - Q_1\)` (middle 50% of data) -- ### **Range**: max - min --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] -- .rightcol60[ Here, **averages** are misleading: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( * mean = mean(days), * median = median(days)) ``` ``` #> # A tibble: 2 x 3 #> warehouse mean median #> <chr> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 #> 2 warehouseB 4.25 4.5 ``` ] --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] .rightcol60[ **Variability** reveals difference in days to ship: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( mean = mean(days), median = median(days), * range = max(days) - min(days), * sd = sd(days)) ``` ``` #> # A tibble: 2 x 5 #> warehouse mean median range sd #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 2 0.866 #> 2 warehouseB 4.25 4.5 9 2.70 ``` ] --- # .center[.fancy[Example:] Days to ship] <center> <img src="figs/days-to-ship-1.png" width=960> </center> --- class: center # Interpreting the standard deviation .leftcol[ ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` <center> <img src="figs/days-to-ship-sd-1.png" width=380> </center> ] -- .rightcol[ <img src="images/sd.png"> ] --- class: inverse, center # Outliers <center> <img src = 
"images/outliers.jpeg" width = "730"> </center> --- ## **Mean** & **Standard Deviation** are sensitive to outliers **Outliers**: `\(Q_1 - 1.5 IQR\)` or `\(Q_3 + 1.5 IQR\)` **Extreme values**: `\(Q_1 - 3 IQR\)` or `\(Q_3 + 3 IQR\)` -- .leftcol[ ```r data1 <- c(3,3,4,5,5,6,6,7,8,9) ``` - Mean: 5.6 - Standard Deviation: 2.01 - Median: 5.5 - IQR: 2.5 ] -- .rightcol[ ```r data2 <- c(3,3,4,5,5,6,6,7,8,20) ``` - .red[Mean: 6.7] - .red[Standard Deviation: 4.95] - .blue[Median: 5.5] - .blue[IQR: 2.5] ] --- class: inverse, middle # .center[Robust statistics for continuous data] # .center[(less sensitive to outliers)] ## .red[Centrality]: Use _median_ rather than _mean_ ## .blue[Variability]: Use _IQR_ rather than _standard deviation_ --- class: inverse # Practice with summary measurements ### 1) Read in the following data sets: - `milk_production.csv` - `lotr_words.csv` ### 2) For each variable in each data set, if possible, summarize its ### 1. .red[Centrality] ### 2. .blue[Variability] --- class: inverse, center # Break! ## Stand up, Move around, Stretch!
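---

# .center[.fancy[Sketch:] Flagging outliers with the IQR rule]

A minimal sketch of the `\(1.5 \times IQR\)` rule from before the break, applied to the same `data2` example vector. It assumes base R's `quantile()` with its default quartile method:

```r
data2 <- c(3,3,4,5,5,6,6,7,8,20)

q1  <- quantile(data2, 0.25)  # first quartile
q3  <- quantile(data2, 0.75)  # third quartile
iqr <- q3 - q1                # interquartile range

lower <- q1 - 1.5 * iqr       # anything below this is an outlier
upper <- q3 + 1.5 * iqr       # anything above this is an outlier

data2[data2 < lower | data2 > upper]
```

```
#> [1] 20
```

`geom_boxplot()` follows the same convention: whiskers stop at the most extreme points inside these bounds, and anything beyond them is drawn as an individual point.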
--- class: inverse, middle # Week 3: .fancy[Centrality & Variability] ## 1. Data Types ## 2. Measures of Centrality & Variability ## BREAK ## 3. .orange[Visualizing Centrality & Variability] ## 4. Relationships Between 2 Variables ## 5. Exploratory Data Analysis --- class: center # "Visualizing data helps us think" <center> <img src = "images/anscombe_data.png" width = "740"> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- class: center # Anscombe's Quartet <center> <img src="figs/anscombe-quartet-1.png" width=600> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- class: inverse, center, middle # The data _type_ determines <br> how to summarize it --- .cols3[ ### **Nominal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions <br> <br> <br> <br> **Charts**: - Bars ] -- .cols3[ ### **Ordinal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions - .red[Centrality]:<br>Median, Mode - .blue[Variability]: IQR <br> **Charts**: - Bars ] -- .cols3[ ### **Numerical<br>(Continuous)** **Measures**: - .red[Centrality]:<br>Mean, median - .blue[Variability]: Range, standard deviation, IQR <br> <br> **Charts**: - Histogram - Boxplot ] --- ## Summarizing **Nominal** data .leftcol45[ Summarize with counts / percentages ```r wildlife_impacts %>% * count(operator, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 4 x 3 #> operator n p #> <chr> <int> <dbl> #> 1 SOUTHWEST AIRLINES 17970 0.315 #> 2 UNITED AIRLINES 15116 0.265 #> 3 AMERICAN AIRLINES 14887 0.261 #> 4 DELTA AIR LINES 9005 0.158 ``` ] -- .rightcol55[ Visualize with bars .code70[ ```r wildlife_impacts %>% count(operator, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = n, y = reorder(operator, n)), * width = 0.7) + labs(x = "Count", y = "Operator") ``` <img src="figs/wildlife-operator-bars-1.png" width="504" /> ]] --- ## Summarizing **Ordinal** data .leftcol[ **Summarize**: Counts / percentages .code70[ ```r wildlife_impacts %>% * count(incident_month, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 12 x 3 #> incident_month n p #> <dbl> <int> <dbl> #> 1 9 7980 0.140 #> 2 10 7754 0.136 #> 3 8 7104 0.125 #> 4 5 6161 0.108 #> 5 7 6133 0.108 #> 6 6 4541 0.0797 #> 7 4 4490 0.0788 #> 8 11 4191 0.0736 #> 9 3 2678 0.0470 #> 10 12 2303 0.0404 #> 11 1 1951 0.0342 #> 12 2 1692 0.0297 ``` ]] -- .rightcol[ **Visualize**: Bars .code70[ ```r wildlife_impacts %>% count(incident_month, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = as.factor(incident_month), * y = n), width = 0.7) + labs(x = "Incident month") ``` <img src="figs/wildlife-months-bar-1.png" width="504" /> ]] --- ## Summarizing **continuous** variables .leftcol30[ **Histograms**: - Skewness - Number of modes <br> **Boxplots**: - Outliers - Comparing variables ] .rightcol70[.border[ <img src = 'images/eda-boxplot.png'> ]] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol40[ **Summarise**:<br>Mean, median, sd, range, & IQR: ```r summary(wildlife_impacts$height) ``` ``` #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's #> 0.0 0.0 50.0 983.8 1000.0 25000.0 18038 ``` ] -- .rightcol60[ **Visualize**:<br>Histogram (identify skewness & modes) ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/wildlife-height-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = speed), bins = 50) + labs(x = 'speed (mph)', y = 'Count') ``` <img src="figs/wildlife-speed-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Boxplot**: Identify outliers .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = height)) + labs(x = 'Height (ft)', y = NULL) ``` <img src="figs/wildlife-height-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = speed)) + labs(x = 'Speed (mph)', y = NULL) ``` <img src="figs/wildlife-speed-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Histogram - Skewness - Modes <img src="figs/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ ## Boxplot - Outliers <br><br> <img src="figs/unnamed-chunk-29-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse # Practicing visual summaries .font90[ 1) Read in the following data sets: - `faithful.csv` - `marathon.csv` 2) Summarize the following variables using an appropriate chart (bar chart, histogram, and / or boxplot): - faithful: `eruptions` - faithful: `waiting` - marathon: `Age` - marathon: `State` - marathon: `Country` - marathon: `` `Official Time` `` ] --- class: inverse, middle # Week 3: .fancy[Centrality & Variability] ## 1. Data Types ## 2. Measures of Centrality & Variability ## BREAK ## 3. Visualizing Centrality & Variability ## 4. .orange[Relationships Between 2 Variables] ## 5. 
Exploratory Data Analysis

---

## Two **Categorical** Variables

Summarize with a table of counts

.leftcol60[

```r
wildlife_impacts %>%
* count(operator, time_of_day)
```

```
#> # A tibble: 20 x 3
#>    operator           time_of_day     n
#>    <chr>              <chr>       <int>
#>  1 AMERICAN AIRLINES  Dawn          458
#>  2 AMERICAN AIRLINES  Day          7809
#>  3 AMERICAN AIRLINES  Dusk          584
#>  4 AMERICAN AIRLINES  Night        3710
#>  5 AMERICAN AIRLINES  <NA>         2326
#>  6 DELTA AIR LINES    Dawn          267
#>  7 DELTA AIR LINES    Day          4846
#>  8 DELTA AIR LINES    Dusk          353
#>  9 DELTA AIR LINES    Night        2090
#> 10 DELTA AIR LINES    <NA>         1449
#> 11 SOUTHWEST AIRLINES Dawn          394
#> 12 SOUTHWEST AIRLINES Day          9109
#> 13 SOUTHWEST AIRLINES Dusk          599
#> 14 SOUTHWEST AIRLINES Night        5425
#> 15 SOUTHWEST AIRLINES <NA>         2443
#> 16 UNITED AIRLINES    Dawn          151
#> 17 UNITED AIRLINES    Day          3359
#> 18 UNITED AIRLINES    Dusk          181
#> 19 UNITED AIRLINES    Night        1510
#> 20 UNITED AIRLINES    <NA>         9915
```

]

---

## Two **Categorical** Variables

Convert to "wide" format with `spread()` to make it easier to compare values

.leftcol70[

```r
wildlife_impacts %>%
  count(operator, time_of_day) %>%
* spread(key = time_of_day, value = n)
```

```
#> # A tibble: 4 x 6
#>   operator            Dawn   Day  Dusk Night `<NA>`
#>   <chr>              <int> <int> <int> <int>  <int>
#> 1 AMERICAN AIRLINES    458  7809   584  3710   2326
#> 2 DELTA AIR LINES      267  4846   353  2090   1449
#> 3 SOUTHWEST AIRLINES   394  9109   599  5425   2443
#> 4 UNITED AIRLINES      151  3359   181  1510   9915
```

]

---

## Two **Categorical** Variables

.leftcol45[

Visualize with bars:<br>map **fill** to the 2nd categorical variable

```r
wildlife_impacts %>%
  count(operator, time_of_day) %>%
  ggplot() +
  geom_col(aes(x = n, y = reorder(operator, n),
*              fill = reorder(time_of_day, n)),
           width = 0.7,
*          position = 'dodge') +
  theme(legend.position = "bottom") +
  labs(fill = "Time of day", y = NULL)
```

]

.rightcol55[

<img src="figs/unnamed-chunk-33-1.png" width="648" style="display: block; margin: auto;" />

]

---

## Two **Continuous** Variables

Visualize with a scatterplot, looking for _clustering_ and/or a _correlational_ relationship

.leftcol45[

```r
ggplot(wildlife_impacts) +
* geom_point(aes(x = speed, y = height),
*            size = 0.5) +
  labs(x = 'Speed (mph)', y = 'Height (ft)')
```

]

.rightcol55[

<img src="figs/unnamed-chunk-34-1.png" width="504" style="display: block; margin: auto;" />

]

---

## One **Continuous**, One **Categorical**

Visualize with **boxplot**

.leftcol45[

```r
ggplot(wildlife_impacts) +
* geom_boxplot(aes(x = speed,
*                  y = operator)) +
  labs(x = 'Speed (mph)', y = 'Airline')
```

]

.rightcol55[

<img src="figs/unnamed-chunk-35-1.png" width="504" style="display: block; margin: auto;" />

]

---
class: inverse

# Practice with visualizing _relationships_

1) Read in the following data sets:

- `marathon.csv`
- `wildlife_impacts.csv`

2) Visualize the _relationships_ between the following variables using an appropriate chart (bar plots, scatterplots, and/or box plots):

- marathon: `Age` & `Official Time`
- marathon: `Country` & `Official Time`
- wildlife_impacts: `state` & `operator`

---
class: inverse, middle

# Week 3: .fancy[Centrality & Variability]

## 1. Data Types
## 2. Measures of Centrality & Variability
## BREAK
## 3. Visualizing Centrality & Variability
## 4. Relationships Between 2 Variables
## 5. .orange[Exploratory Data Analysis]

---

.leftcol[

# Exploratory Analysis

### Goal: **Form** hypotheses.

### Improves quality of **questions**.

### (do this in THIS class)

]

--

.rightcol[

# Confirmatory Analysis

### Goal: **Test** hypotheses.

### Improves quality of **answers**.
### (do this in your stats classes)

]

---
class: center, inverse

# Don't be Icarus

<center>
<img src="images/icarus.jpg" width=800>
</center>

---
class: inverse, middle

## "Far better an approximate answer to the _right_ question, which is often vague, than an exact answer to the _wrong_ question, which can always be made precise."

## — John Tukey

---
class: center
background-color: #FFFFFF

**EDA is an iterative process to help you<br>_understand_ your data and ask better questions**

<center>
<img src="images/eda.png" width=700>
</center>

---

## Visualizing variation

.leftcol30[

Ask yourself:

- What type of **variation** occurs within my variables?
- What type of **covariation** occurs between my variables?

Check out [these guides](https://eda.seas.gwu.edu/2021-Spring/ref-visualizing-data.html#choosing-the-right-chart)

]

.rightcol70[

<center>
<img src = "images/plots-table.png" width = "800">
</center>

]

---
class: inverse

# Practice doing EDA

1) Read in the `candy_rankings.csv` data set

2) Preview the data, noting the data types and what each variable represents.

3) Visualize (at least) three _relationships_ between pairs of variables (each guided by a question) using an appropriate chart:

- Bar chart
- Scatterplot
- Boxplot

---
class: center, middle, inverse

# Start thinking about research questions

---

# Writing a research question

Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question). Your question should be:

--

- **Clear**: your audience can easily understand its purpose without additional explanation.

--

- **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report.

--

- **Concise**: it is expressed in the fewest possible words.

--

- **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data.

--

- **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?)

---

# Writing a research question

--

**Bad question: Why are social networking sites harmful?**

- Unclear: it does not specify _which_ social networking sites or state what harm is being caused; it also assumes that "harm" exists.

--

**Improved question: How are online users experiencing or addressing privacy issues on such social networking sites as Facebook and Twitter?**

- Specifies the sites (Facebook and Twitter), the type of harm (privacy issues), and who is harmed (online users).

--

**Other good examples**: See the [Example Projects](https://eda.seas.gwu.edu/2021-Spring/ref-example-analyses.html) page

---

# Start self-organizing for projects

> Find your topic / teammate(s) [here](https://docs.google.com/presentation/d/15ohFg5a9y6ZMk5tGGm4ho7Rj97eVTgcBwgVac5W-Yzc/edit?usp=sharing) (link also in #classroom)