class: middle, inverse .leftcol30[ <center> <img src="https://eda.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 4: .fancy[Exploring Data] ###
EMSE 4572/6572: Exploratory Data Analysis ###
John Paul Helveston ###
September 18, 2024 ] --- class: center, middle, inverse # Quiz solution --- class: center, middle, inverse # Tip of the week: # `theme_set()` --- ``` r ggplot(mtcars) + geom_point(aes(x = mpg, y = hp)) ``` .leftcol[ Default theme <img src="figs/unnamed-chunk-3-1.png" width="522.144" /> ] .rightcol[ `theme_bw(base_size = 20)` <img src="figs/unnamed-chunk-4-1.png" width="522.144" /> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. .orange[Exploring Data] ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- .leftcol[ # Exploratory Analysis <br> ### Goal: **Form** hypotheses. ### Improves quality of **questions**. ### _(what we do in THIS class)_ ] -- .rightcol[ # Confirmatory Analysis <br> ### Goal: **Test** hypotheses. ### Improves quality of **answers**. ### _(what you do in a stats class)_ ] --- .leftcol[ # Exploratory Analysis <br> RQ: Do people bike more when the weather is nice? <center> <img src="images/biking.png" width=100%> </center> ] -- .rightcol[ # Confirmatory Analysis <br> Let's build a model to predict bike usage based on weather. ] --- class: center, inverse # Don't be Icarus <center> <img src="images/icarus.jpg" width=800> </center> --- class: inverse, middle ## "An _approximate_ answer to the _right_ question is better<br>than an _exact_ answer to the _wrong_ question."
## — [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) --- class: center background-color: #FFFFFF **EDA is an iterative process to help you<br>_understand_ your data and ask better questions** <center> <img src="images/eda.png" width=700> </center> --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. .orange[Data Types] ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, center, middle # 24,901 ??? If I walked up to you, and said, "The answer is 24,901," you would probably be confused. By itself, a number means nothing. --- class: inverse, center, middle # .orange[Earth's circumference at the equator:]<br>24,901 ??? But if I were to tell you that the circumference of the earth at the equator is 24,901 miles, that would mean something. --- class: inverse, center, middle # Earth's circumference at the equator:<br>24,901 .orange[miles] ??? To be complete and meaningful, quantitative information consists of both quantitative data (the numbers) and categorical data (the labels that tell us what the numbers measure). --- # Types of Data -- .leftcol[ ### **Categorical** Subdivide things into _groups_ - What type? - Which category? ] -- .rightcol[ ### **Numerical** Measure things with numbers - How many? - How much? ] --- ## Categorical (discrete) variables -- .leftcol[ ### **Nominal** - Order doesn't matter - Differ in "name" (nominal) only e.g. 
`country` in TB case data: .code80[ ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ]] -- .rightcol[ ### **Ordinal** - Order matters - Distance between units not equal e.g.: `Placement` 2017 Boston marathon: .code80[ ``` #> # A tibble: 6 × 3 #> Placement `Official Time` Name #> <dbl> <time> <chr> #> 1 1 02:09:37 Kirui, Geoffrey #> 2 2 02:09:58 Rupp, Galen #> 3 3 02:10:28 Osako, Suguru #> 4 4 02:12:08 Biwott, Shadrack #> 5 5 02:12:35 Chebet, Wilson #> 6 6 02:12:45 Abdirahman, Abdi ``` ]] --- ## Numerical data -- .leftcol[ ### **Interval** - Numerical scale with<br>arbitrary starting point - No "0" point - Can't say "x" is double "y" e.g.: `temp` in Beaver data ``` #> day time temp activ #> 1 346 840 36.33 0 #> 2 346 850 36.34 0 #> 3 346 900 36.35 0 #> 4 346 910 36.42 0 #> 5 346 920 36.55 0 #> 6 346 930 36.69 0 ``` ] -- .rightcol[ ### **Ratio** - Has a "0" point - Can be described as percentages - Can say "x" is double "y" e.g.: `height` & `speed` in wildlife impacts ``` #> # A tibble: 6 × 3 #> incident_date height speed #> <dttm> <dbl> <dbl> #> 1 2018-12-31 00:00:00 700 200 #> 2 2018-12-27 00:00:00 600 145 #> 3 2018-12-23 00:00:00 0 130 #> 4 2018-12-22 00:00:00 500 160 #> 5 2018-12-21 00:00:00 100 150 #> 6 2018-12-18 00:00:00 4500 250 ``` ] --- class: inverse, center, middle # Key Questions -- .leftcol[ ## Categorical ## .orange[Does the order matter?] Yes: **Ordinal** No: **Nominal** ] -- .rightcol[ ## Numerical ## .orange[Is there a "baseline"?] Yes: **Ratio** No: **Interval** ] --- class: center, middle # Be careful of how variables are encoded! 
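---

## Example: checking a variable's encoding

A quick sketch in base R (the data frame `df` below is made up for illustration): `str()` shows how each column is stored, and `factor()` turns number-coded categories into true categorical variables.

``` r
# Hypothetical data: both columns are stored as numbers,
# but 'direction' is really a nominal category
df <- data.frame(
  passedTest = c(1, 0, 1, 1),
  direction  = c(1, 2, 3, 4)
)
str(df)  # both columns show up as numeric

# Recode 'direction' so R treats it as categorical
df$direction <- factor(
  df$direction,
  levels = c(1, 2, 3, 4),
  labels = c("North", "South", "East", "West")
)
str(df)  # 'direction' is now a factor with 4 levels
```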
--- ## .red[When numbers are categories] - "Dummy coding": e.g., `passedTest` = `1` or `0`) - "North", "South", "East", "West" = `1`, `2`, `3`, `4` -- ## .red[When ratio data are discrete (i.e. counts)] - Number of eggs in a carton, heart beats per minute, etc. - Continuous variables measured discretely (e.g. age) -- ## .red[Time] - As _ordinal_ categories: "Jan.", "Feb.", "Mar.", etc. - As _interval_ scale: "Jan. 1", "Jan. 2", "Jan. 3", etc. - As _ratio_ scale: "30 sec", "60 sec", "70 sec", etc. --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. .orange[Centrality & Variability] ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # .center[.font140[Summary Measures:]] # Single variables: .red[Centrality] & .blue[Variability] # Two variables: .green[Correlation] --- # .center[.red[Centrality (a.k.a. The "Average" Value)]] -- ### .center[A single number representing the _middle_ of a set of numbers] <br> -- ### **Mean**: `\(\frac{\text{Sum of values}}{\text{# of values}}\)` -- ### **Median**: "Middle" value (50% of data above & below) --- # .center[Mean isn't always the "best" choice] .leftcol40[ ``` r wildlife_impacts %>% filter(! is.na(height)) %>% summarise( mean = mean(height), median = median(height) ) ``` ``` #> # A tibble: 1 × 2 #> mean median #> <dbl> <dbl> #> 1 984. 50 ``` Percent of data below mean: ``` #> [1] "73.9%" ``` ] -- .rightcol60[ .center[**On average, at what height do planes hit birds?**] <img src="figs/wildlife-hist.png"> ] ??? On average, where do planes hit birds? Saying ~1000 ft is misleading It's much more likely to be under 100 ft --- class: inverse # .center[Beware the "flaw of averages"] -- .leftcol[ ### What happened to the statistician that crossed a river with an average depth of 3 feet? 
] -- .rightcol[ ### ...he drowned <img src="images/foa.jpg" width=600> ] --- # .center[.blue[Variability ("Spread")]] -- ### **Standard deviation**: distribution of values relative to the mean ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` -- ### **Interquartile range (IQR)**: `\(Q_3 - Q_1\)` (middle 50% of data) -- ### **Range**: max - min --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ``` r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] -- .rightcol60[ Here, **averages** are misleading: ``` r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( * mean = mean(days), * median = median(days)) ``` ``` #> # A tibble: 2 × 3 #> warehouse mean median #> <chr> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 #> 2 warehouseB 4.25 4.5 ``` ] --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ``` r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] .rightcol60[ **Variability** reveals difference in days to ship: ``` r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( mean = mean(days), median = median(days), * range = max(days) - min(days), * sd = sd(days)) ``` ``` #> # A tibble: 2 × 5 #> warehouse mean median range sd #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 2 0.866 #> 2 warehouseB 4.25 4.5 9 2.70 ``` ] --- # .center[.fancy[Example:] Days to ship] <center> <img src="figs/days-to-ship.png" width=960> </center> --- class: center # Interpreting the standard 
deviation .leftcol[ ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` <center> <img src="figs/days-to-ship-sd.png" width=380> </center> ] -- .rightcol[ <img src="images/sd.png"> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- ## **Mean** & **Standard Deviation** are sensitive to outliers **Outliers**: `\(Q_1 - 1.5 IQR\)` or `\(Q_3 + 1.5 IQR\)` **Extreme values**: `\(Q_1 - 3 IQR\)` or `\(Q_3 + 3 IQR\)` -- .leftcol[ ``` r data1 <- c(3,3,4,5,5,6,6,7,8,9) ``` - Mean: 5.6 - Standard Deviation: 2.01 - Median: 5.5 - IQR: 2.5 ] -- .rightcol[ ``` r data2 <- data1 data2[10] <- 20 ``` - .red[Mean: 6.7] - .red[Standard Deviation: 4.95] - Median: 5.5 - IQR: 2.5 ] --- class: inverse, middle # .center[Robust statistics for continuous data] # .center[(less sensitive to outliers)] ## .red[Centrality]: Use _median_ rather than _mean_ ## .blue[Variability]: Use _IQR_ rather than _standard deviation_ --- class: inverse
<!-- countdown timer: 10:00 -->
# Practice with summary measurements ### 1) Read in the following data sets: - `milk_production.csv` - `lotr_words.csv` ### 2) For each variable in each data set, if possible, summarize its ### 1. .red[Centrality] ### 2. .blue[Variability] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. .orange[Visualizing Centrality & Variability] ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: center # "Visualizing data helps us think" <center> <img src = "images/anscombe_data.png" width = "740"> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center .leftcol60[ # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> ] .rightcol40[ <br> <center> <img src="https://eda.seas.gwu.edu/2023-Fall/images/logo.png" width=100%> </center> ] --- class: inverse, center, middle # The data _type_ determines <br> how to summarize it --- .cols3[ ### **Nominal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions <br> <br> <br> <br> **Charts**: - Bars ] -- .cols3[ ### **Ordinal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions - .red[Centrality]:<br>Median, Mode - .blue[Variability]: IQR <br> **Charts**: - Bars ] -- .cols3[ ### **Numerical<br>(Continuous)** **Measures**: - .red[Centrality]:<br>Mean, median - .blue[Variability]: Range, standard deviation, IQR <br> <br> **Charts**: - Histogram - Boxplot ] --- ## Summarizing **Nominal** data .leftcol45[ Summarize with counts / percentages ``` r wildlife_impacts %>% * count(operator, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 4 × 3 #> operator n p #> <chr> <int> <dbl> #> 1 
SOUTHWEST AIRLINES 17970 0.315 #> 2 UNITED AIRLINES 15116 0.265 #> 3 AMERICAN AIRLINES 14887 0.261 #> 4 DELTA AIR LINES 9005 0.158 ``` ] -- .rightcol55[ Visualize with (usually sorted) bars .code70[ ``` r wildlife_impacts %>% count(operator, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = n, y = reorder(operator, n)), * width = 0.7) + labs(x = "Count", y = "Operator") ``` <img src="figs/wildlife-operator-bars-1.png" width="504" /> ]] --- ## Summarizing **Ordinal** data .leftcol[ **Summarize**: Counts / percentages .code70[ ``` r wildlife_impacts %>% * count(incident_month, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 12 × 3 #> incident_month n p #> <dbl> <int> <dbl> #> 1 9 7980 0.140 #> 2 10 7754 0.136 #> 3 8 7104 0.125 #> 4 5 6161 0.108 #> 5 7 6133 0.108 #> 6 6 4541 0.0797 #> 7 4 4490 0.0788 #> 8 11 4191 0.0736 #> 9 3 2678 0.0470 #> 10 12 2303 0.0404 #> 11 1 1951 0.0342 #> 12 2 1692 0.0297 ``` ]] -- .rightcol[ **Visualize**: Bars .code70[ ``` r wildlife_impacts %>% count(incident_month, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = as.factor(incident_month), * y = n), width = 0.7) + labs(x = "Incident month") ``` <img src="figs/wildlife-months-bar-1.png" width="504" /> ]] --- ## Summarizing **continuous** variables .leftcol30[ **Histograms**: - Skewness - Number of modes <br> **Boxplots**: - Outliers - Comparing variables ] .rightcol70[.border[ <img src = 'images/eda-boxplot.png'> ]] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol40[ **Summarise**:<br>Mean, median, sd, range, & IQR: ``` r summary(wildlife_impacts$height) ``` ``` #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's #> 0.0 0.0 50.0 983.8 1000.0 25000.0 18038 ``` ] -- .rightcol60[ **Visualize**:<br>Histogram (identify skewness & modes) ``` r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/wildlife-height-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol[ **Height** ``` r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ``` r ggplot(wildlife_impacts) + * geom_histogram(aes(x = speed), bins = 50) + labs(x = 'speed (mph)', y = 'Count') ``` <img src="figs/wildlife-speed-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Boxplot**: Identify outliers .leftcol[ **Height** ``` r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = height)) + labs(x = 'Height (ft)', y = NULL) ``` <img src="figs/wildlife-height-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ``` r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = speed)) + labs(x = 'Speed (mph)', y = NULL) ``` <img src="figs/wildlife-speed-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Histogram - Skewness - Modes <img src="figs/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ ## Boxplot - Outliers <br><br> <img src="figs/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
<!-- countdown timer: 15:00 -->
# Practicing visual summaries .font90[ 1) Read in the following data sets: - `faithful.csv` - `marathon.csv` 2) Summarize the following variables using an appropriate chart (bar chart, histogram, and / or boxplot): - faithful: `eruptions` - faithful: `waiting` - marathon: `Age` - marathon: `State` - marathon: `Country` - marathon: `` `Official Time` `` ] --- class: inverse, center # Break! ## Stand up, Move around, Stretch!
<!-- countdown timer: 05:00 -->
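---

# .center[.fancy[Recap:] Centrality & variability in R]

A quick sketch before moving on (base R only; the vector is the small example from the outliers slide):

``` r
x <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 9)

# Centrality
mean(x)    # 5.6
median(x)  # 5.5

# Variability
sd(x)            # ~2.01
IQR(x)           # 2.5
max(x) - min(x)  # 6 (range)
```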
--- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. .orange[Correlation] ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- ## .center[Some pretty racist origins in [eugenics](https://en.wikipedia.org/wiki/Eugenics) ("well born")] -- .leftcol[ ### [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton) (1822 - 1911) - Charles Darwin's cousin. - "Father" of [eugenics](https://en.wikipedia.org/wiki/Eugenics). - Interested in heredity. <center> <img src="images/Francis_Galton_1850s.jpg" width=200> </center> ] -- .rightcol[ ### [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) (1857 - 1936) - Galton's ([hero-worshiping](https://en.wikipedia.org/wiki/Apotheosis)) protégé. - Defined correlation equation. - "Father" of mathematical statistics. <center> <img src="images/Karl_Pearson.jpg" width=220> <center> ] ??? The beautiful irony is that human genetics was also the field that conclusively demonstrated the biological falsity of race. --- .leftcol[ # Galton's family data Galton, F. (1886). ["Regression towards mediocrity in hereditary stature"](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf). _The Journal of the Anthropological Institute of Great Britain and Ireland_ 15: 246-263. 
**Galton's question**: Does marriage selection indicate a relationship between the heights of husbands and wives?<br>(He called this "assortative mating") "midparent height" is just a scaled mean: ``` r midparentHeight = (father + 1.08*mother)/2 ``` ] -- .rightcol[.code70[ ``` r library(HistData) galtonScatterplot <- ggplot(GaltonFamilies) + geom_point(aes(x = midparentHeight, y = childHeight), size = 0.5, alpha = 0.7) + theme_classic() + labs(x = 'Midparent height (inches)', y = 'Child height (inches)') ``` <center> <img src="figs/galtonScatterplot.png" width=450> </center> ]] --- class: center, middle # How do you measure correlation? <br> # Pearson came up with this: # `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` --- # How do you measure correlation? .leftcol60[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` .font130[ Assumptions: 1. Variables must be interval or ratio 2. Linear relationship ]] -- .rightcol40[ <center> <img src="figs/cor_vstrong_p.png" width=275> </center> <br> <center> <img src="figs/cor_quad.png" width=275> </center> ] --- # How do you _interpret_ `\(r\)`? .leftcol[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` Interpretation: - `\(-1 \le r \le 1\)` - Closer to 1 is stronger correlation - Closer to 0 is weaker correlation ] -- .rightcol[.code70[ ``` r cor(x = GaltonFamilies$midparentHeight, y = GaltonFamilies$childHeight, method = 'pearson') ``` ``` #> [1] 0.3209499 ``` ] <center> <img src="figs/galtonScatterplot.png" width=400> </center> ] --- ## What does `\(r\)` mean? .leftcol40[.font120[ - `\(\pm 0.1 - 0.3\)`: Weak - `\(\pm 0.3 - 0.5\)`: Moderate - `\(\pm 0.5 - 0.8\)`: Strong - `\(\pm 0.8 - 1.0\)`: Very strong ]] .rightcol60[ <center> <img src="figs/cor_p.png"> </center> ] --- class: center, middle # Visualizing correlation is...um...easy, right? 
<br> # [guessthecorrelation.com](http://guessthecorrelation.com/) # Click [here](https://docs.google.com/presentation/d/1-7VqNRJp53FawfNJwKLEkpoubGQ_x0wIkN2lAMP7Emw/edit?usp=sharing) to vote! --- class: middle .leftcol20[ ## The datasaurus ### (More [here](https://www.autodeskresearch.com/publications/samestats)) ] .rightcol80[ <img src="images/datasaurus.png"> ] --- # Coefficient of determination: `\(r^2\)` .leftcol[.font130[ Percent of variance in one variable that is explained by the other variable <center> <img src="images/rsquared_venn.png"> </center> ]] -- .rightcol[ `\(r\)` | `\(r^2\)` ----|------ 0.1 | 0.01 0.2 | 0.04 0.3 | 0.09 0.4 | 0.16 0.5 | 0.25 0.6 | 0.36 0.7 | 0.49 0.8 | 0.64 0.9 | 0.81 1.0 | 1.00 ] --- ## You should report both `\(r\)` and `\(r^2\)` <br> ### Correlation between parent and child height is 0.32, therefore 10% of the variance in the child height is explained by the parent height. --- # Correlation != Causation -- ### X causes Y - Training causes improved performance -- ### Y causes X - Good (bad) performance causes people to train harder (less hard). -- ### Z causes both X & Y - Commitment and motivation cause increased training and better performance. --- class: center ## Be wary of dual axes!
## ([They can cause spurious correlations](https://www.tylervigen.com/spurious-correlations)) -- .leftcol[ .font120[Dual axes] <center> <img src="images/hbr_two_axes1.png"> </center> ] -- .rightcol[ .font120[Single axis] <center> <img src="images/hbr_two_axes2.png"> </center> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- class: middle <center> <img src="figs/pearson_base.png" width=600> </center> --- class: middle <center> <img src="figs/pearson1.png" width=600> </center> --- class: middle <center> <img src="figs/pearson2.png" width=600> </center> --- class: center, middle ## **Pearson** correlation is highly sensitive to outliers <center> <img src="figs/pearson_grid.png" width=600> </center> --- # **Spearman**'s rank-order correlation # `\(r_s = \frac{\text{Cov}(\text{rank}(x), \text{rank}(y))}{\text{sd}(\text{rank}(x)) * \text{sd}(\text{rank}(y))}\)` -- .font120[ - Separately rank the values of X & Y. - Use Pearson's correlation on the _ranks_ instead of the `\(x\)` & `\(y\)` values. ] -- .font120[ Assumptions: - Variables can be ordinal, interval or ratio - Relationship must be monotonic (i.e.
does not require linearity) ] --- class: center, middle ## Spearman correlation more robust to outliers <center> <img src="figs/spearman_grid.png" width=600> </center> --- class: center, middle ## Spearman correlation more robust to outliers .cols3[ <center> <img src="figs/pearson_grid.png"> </center> ] .cols3[ <table> <thead> <tr> <th style="text-align:right;"> Pearson </th> <th style="text-align:right;"> Spearman </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> <tr> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.69 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.38 </td> <td style="text-align:right;"> 0.76 </td> </tr> <tr> <td style="text-align:right;"> 0.81 </td> <td style="text-align:right;"> 0.79 </td> </tr> <tr> <td style="text-align:right;"> 0.31 </td> <td style="text-align:right;"> 0.70 </td> </tr> <tr> <td style="text-align:right;"> 0.95 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.51 </td> <td style="text-align:right;"> 0.75 </td> </tr> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> </tbody> </table> ] .cols3[ <center> <img src="figs/outlier_compare.png"> </center> ] --- ## Summary of correlation .font120[ - **Pearson's correlation**: Described the strength of a **linear** relationship between two variables that are interval or ratio in nature. - **Spearman's rank-order correlation**: Describes the strength of a **monotonic** relationship between two variables that are ordinal, interval, or ratio. **It is more robust to outliers**. - The **coefficient of determination** ( `\(r^2\)` ) describes the amount of variance in one variable that is explained by the other variable. 
- **Correlation != Causation** ] -- R command (hint: add `use = "complete.obs"` to drop NA values) ``` r pearson <- cor(x, y, method = "pearson", use = "complete.obs") spearman <- cor(x, y, method = "spearman", use = "complete.obs") ``` --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. .orange[Visualizing Correlation] ### 7. Visualizing Relationships ] --- ## **Scatterplots**: The correlation workhorse .leftcol[ ``` r scatterplot <- mtcars %>% ggplot() + * geom_point( * aes(x = mpg, y = hp), * size = 2, alpha = 0.7 * ) + theme_classic(base_size = 20) + labs( x = 'Fuel economy (mpg)', y = 'Engine power (hp)' ) scatterplot ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplotBase.png"> </center> ] --- ## Adding a correlation label to a chart .leftcol[ Make the correlation label ``` r corr <- cor( mtcars$mpg, mtcars$hp, method = 'pearson') *corrLabel <- paste('r = ', round(corr, 2)) ``` Add label to the chart with `annotate()` ``` r scatterplot + * annotate( * geom = 'text', * x = 25, y = 310, * label = corrLabel, * hjust = 0, size = 7 * ) ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplot.png"> </center> ] --- class: middle, center background-color: #FFFFFF <center> <img src="images/all-the-correlations.jpeg" width=700> </center> --- ## Visualize all the correlations: `ggcorr()` .leftcol[ ``` r library('GGally') ``` ``` r mtcars %>% * ggcorr() ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ``` r library('GGally') ``` ``` r mtcars %>% * ggcorr(label = TRUE, * label_size = 3, * label_round = 2) ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_labels.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ``` r ggcor_mtcars_final <- mtcars %>% ggcorr(label = TRUE, label_size = 
3, label_round = 2, * label_color = 'white', * nbreaks = 5, * palette = "RdBu") ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_final.png"> </center> ] --- .leftcol[ ## .center[Pearson] ``` r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "pearson")) ``` <center> <img src="figs/ggcor_mtcars_pearson.png" width=400> </center> ] .rightcol[ ## .center[Spearman] ``` r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "spearman")) ``` <center> <img src="figs/ggcor_mtcars_spearman.png" width=400> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ``` r library('GGally') ``` ``` r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% * ggpairs() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars.png" width=600> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ``` r library('GGally') ``` ``` r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% ggpairs() + * theme_classic() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars_classic.png" width=600> </center> ] --- class: inverse ## Your turn
<!-- countdown timer: 15:00 -->
.leftcol[ Using the `penguins` data frame: 1. Find the two variables with the largest correlation in absolute value (i.e. closest to -1 or 1). 2. Create a scatter plot of those two variables. 3. Add an annotation for the Pearson correlation coefficient. ] .rightcol[ ### .center[[palmerpenguins library](https://allisonhorst.github.io/palmerpenguins/)] <center> <img src="images/lter_penguins.png" width=700> </center> .right[Artwork by [@allison_horst](https://twitter.com/allison_horst)] ] --- ## **Simpson's Paradox**: when correlation betrays you -- .leftcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins.png" width=450> </center> ] -- .rightcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins_good.png" width=600> </center> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. .orange[Visualizing Relationships] ] --- ## Visualizing variation .leftcol30[ Ask yourself: - What type of **variation** occurs within my variables? - What type of **covariation** occurs between my variables? 
Check out [these guides](https://eda.seas.gwu.edu/2023-Fall/references.html#choosing-the-right-chart) ] .rightcol70[ <center> <img src = "images/plots-table.png" width = "800"> </center> ] --- ## Two **Categorical** Variables Summarize with a table of counts .leftcol60[ ``` r wildlife_impacts %>% * count(operator, time_of_day) ``` ``` #> # A tibble: 20 × 3 #> operator time_of_day n #> <chr> <chr> <int> #> 1 AMERICAN AIRLINES Dawn 458 #> 2 AMERICAN AIRLINES Day 7809 #> 3 AMERICAN AIRLINES Dusk 584 #> 4 AMERICAN AIRLINES Night 3710 #> 5 AMERICAN AIRLINES <NA> 2326 #> 6 DELTA AIR LINES Dawn 267 #> 7 DELTA AIR LINES Day 4846 #> 8 DELTA AIR LINES Dusk 353 #> 9 DELTA AIR LINES Night 2090 #> 10 DELTA AIR LINES <NA> 1449 #> 11 SOUTHWEST AIRLINES Dawn 394 #> 12 SOUTHWEST AIRLINES Day 9109 #> 13 SOUTHWEST AIRLINES Dusk 599 #> 14 SOUTHWEST AIRLINES Night 5425 #> 15 SOUTHWEST AIRLINES <NA> 2443 #> 16 UNITED AIRLINES Dawn 151 #> 17 UNITED AIRLINES Day 3359 #> 18 UNITED AIRLINES Dusk 181 #> 19 UNITED AIRLINES Night 1510 #> 20 UNITED AIRLINES <NA> 9915 ``` ] --- ## Two **Categorical** Variables Convert to "wide" format with `pivot_wider()` to make it easier to compare values .leftcol70[ ``` r wildlife_impacts %>% count(operator, time_of_day) %>% * pivot_wider(names_from = time_of_day, values_from = n) ``` ``` #> # A tibble: 4 × 6 #> operator Dawn Day Dusk Night `NA` #> <chr> <int> <int> <int> <int> <int> #> 1 AMERICAN AIRLINES 458 7809 584 3710 2326 #> 2 DELTA AIR LINES 267 4846 353 2090 1449 #> 3 SOUTHWEST AIRLINES 394 9109 599 5425 2443 #> 4 UNITED AIRLINES 151 3359 181 1510 9915 ``` ] --- ## Two **Categorical** Variables .leftcol45[ Visualize with bars:<br>map **fill** to denote 2nd categorical var ``` r wildlife_impacts %>% count(operator, time_of_day) %>% ggplot() + geom_col( aes( x = n, y = reorder(operator, n), * fill = reorder(time_of_day, n) ), width = 0.7, * position = 'dodge') + theme(legend.position = "bottom") + labs( fill = "Time of day", y = "Airline" ) ``` ] 
.rightcol55[ <img src="figs/unnamed-chunk-56-1.png" width="648" style="display: block; margin: auto;" /> ] --- ## Two **Continuous** Variables Visualize with scatterplot - looking for _clustering_ and/or _correlational_ relationship .leftcol45[ ``` r ggplot(wildlife_impacts) + geom_point( aes( x = speed, y = height ), size = 0.5) + labs( x = 'Speed (mph)', y = 'Height (ft)' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## One **Continuous**, One **Categorical** Visualize with **boxplot** .leftcol45[ ``` r ggplot(wildlife_impacts) + geom_boxplot( aes( x = speed, y = operator) ) + labs( x = 'Speed (mph)', y = 'Airline' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-58-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
<!-- countdown timer: 15:00 -->
# Practice doing EDA 1) Read in the `candy_rankings.csv` data set 2) Preview the data, noting the data types and what each variable is. 3) Visualize (at least) three _relationships_ between two variables (guided by a question) using an appropriate chart: - Bar chart - Scatterplot - Boxplot --- class: inverse, middle # Reminders: ## You have **4** days until your [Project Proposal](https://eda.seas.gwu.edu/2024-Fall/project/1-proposal.html) is due. ## You have **6** days until your [Mini Project 1](https://eda.seas.gwu.edu/2024-Fall/mini/1-data-cleaning.html) is due. ## [Sign up](https://docs.google.com/spreadsheets/d/1qhV29wAumIuFv2Cwmy-sgEirfTi3_a8OCUsqFUWNkzw/edit?usp=sharing) for a meeting slot next week