class: middle, inverse .leftcol30[ <center> <img src="https://eda.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 4: .fancy[Exploring Data] ###
EMSE 4572/6572: Exploratory Data Analysis ###
John Paul Helveston ###
September 20, 2023 ] --- class: center, middle, inverse # Quiz solution --- class: center, middle, inverse # Tip of the week: # `theme_set()` --- ```r ggplot(mtcars) + geom_point(aes(x = mpg, y = hp)) ``` .leftcol[ Default theme <img src="figs/unnamed-chunk-3-1.png" width="522.144" /> ] .rightcol[ `theme_bw(base_size = 20)` <img src="figs/unnamed-chunk-4-1.png" width="522.144" /> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. .orange[Exploring Data] ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- .leftcol[ # Exploratory Analysis <br> ### Goal: **Form** hypotheses. ### Improves quality of **questions**. ### _(what we do in THIS class)_ ] -- .rightcol[ # Confirmatory Analysis <br> ### Goal: **Test** hypotheses. ### Improves quality of **answers**. ### _(what you do in a stats class)_ ] --- .leftcol[ # Exploratory Analysis <br> RQ: Do people bike more when the weather is nice? <center> <img src="images/biking.png" width=100%> </center> ] -- .rightcol[ # Confirmatory Analysis <br> Let's build a model to predict bike usage based on weather. ] --- class: center, inverse # Don't be Icarus <center> <img src="images/icarus.jpg" width=800> </center> --- class: inverse, middle ## "Far better an approximate answer to the _right_ question, which is often vague, than an exact answer to the _wrong_ question, which can always be made precise."
## — [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) --- class: center background-color: #FFFFFF **EDA is an iterative process to help you<br>_understand_ your data and ask better questions** <center> <img src="images/eda.png" width=700> </center> --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. .orange[Data Types] ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, center, middle # 24,901 ??? If I walked up to you, and said, "The answer is 24,901," you would probably be confused. By itself, a number means nothing. --- class: inverse, center, middle # Earth's circumference at the equator:<br>24,901 miles ??? But if I were to tell you that the circumference of the earth at the equator is 24,901 miles, that would mean something. To be complete and meaningful, quantitative information consists of both quantitative data (the numbers) and categorical data (the labels that tell us what the numbers measure). --- # Types of Data -- .leftcol[ ### **Categorical** Subdivide things into _groups_ - What type? - Which category? ] -- .rightcol[ ### **Numerical** Measure things with numbers - How many? - How much? ] --- ## Categorical (discrete) variables -- .leftcol[ ### **Nominal** - Order doesn't matter - Differ in "name" (nominal) only e.g. 
`country` in TB case data: .code80[ ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ]] -- .rightcol[ ### **Ordinal** - Order matters - Distance between units not equal e.g.: `Placement` 2017 Boston marathon: .code80[ ``` #> # A tibble: 6 × 3 #> Placement `Official Time` Name #> <dbl> <time> <chr> #> 1 1 02:09:37 Kirui, Geoffrey #> 2 2 02:09:58 Rupp, Galen #> 3 3 02:10:28 Osako, Suguru #> 4 4 02:12:08 Biwott, Shadrack #> 5 5 02:12:35 Chebet, Wilson #> 6 6 02:12:45 Abdirahman, Abdi ``` ]] --- ## Numerical data -- .leftcol[ ### **Interval** - Numerical scale with<br>arbitrary starting point - No "0" point - Can't say "x" is double "y" e.g.: `temp` in Beaver data ``` #> day time temp activ #> 1 346 840 36.33 0 #> 2 346 850 36.34 0 #> 3 346 900 36.35 0 #> 4 346 910 36.42 0 #> 5 346 920 36.55 0 #> 6 346 930 36.69 0 ``` ] -- .rightcol[ ### **Ratio** - Has a "0" point - Can be described as percentages - Can say "x" is double "y" e.g.: `height` & `speed` in wildlife impacts ``` #> # A tibble: 6 × 3 #> incident_date height speed #> <dttm> <dbl> <dbl> #> 1 2018-12-31 00:00:00 700 200 #> 2 2018-12-27 00:00:00 600 145 #> 3 2018-12-23 00:00:00 0 130 #> 4 2018-12-22 00:00:00 500 160 #> 5 2018-12-21 00:00:00 100 150 #> 6 2018-12-18 00:00:00 4500 250 ``` ] --- class: inverse, center, middle # Key Questions -- .leftcol[ ## Categorical ## .orange[Does the order matter?] Yes: **Ordinal** No: **Nominal** ] -- .rightcol[ ## Numerical ## .orange[Is there a "baseline"?] Yes: **Ratio** No: **Interval** ] --- class: center, middle # Be careful of how variables are encoded! 
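A quick base-R illustration (hypothetical `region` codes, not from the course data): the same values stored as numbers will happily produce a meaningless "average", while the same values stored as a factor get the appropriate categorical summary.

```r
# Hypothetical region codes: "North" = 1, "South" = 2, "East" = 3, "West" = 4
region_num <- c(1, 2, 2, 3, 4, 2)
region_fct <- factor(region_num, levels = 1:4,
                     labels = c("North", "South", "East", "West"))

mean(region_num)   # runs, but an "average region" is nonsense
table(region_fct)  # counts per category -- the meaningful summary
```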
--- ## .red[When numbers are categories] - "Dummy coding": e.g., `passedTest` = `1` or `0` - "North", "South", "East", "West" = `1`, `2`, `3`, `4` -- ## .red[When ratio data are discrete (i.e. counts)] - Number of eggs in a carton, heartbeats per minute, etc. - Continuous variables measured discretely (e.g. age) -- ## .red[Time] - As _ordinal_ categories: "Jan.", "Feb.", "Mar.", etc. - As _interval_ scale: "Jan. 1", "Jan. 2", "Jan. 3", etc. - As _ratio_ scale: "30 sec", "60 sec", "70 sec", etc. --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. .orange[Centrality & Variability] ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # .center[.font140[Summary Measures:]] # Single variables: .red[Centrality] & .blue[Variability] # Two variables: .green[Correlation] --- # .center[.red[Centrality (a.k.a. The "Average" Value)]] -- ### .center[A single number representing the _middle_ of a set of numbers] <br> -- ### **Mean**: `\(\frac{\text{Sum of values}}{\text{# of values}}\)` -- ### **Median**: "Middle" value (50% of data above & below) --- # .center[Mean isn't always the "best" choice] .leftcol40[ ```r wildlife_impacts %>% filter(! is.na(height)) %>% summarise( mean = mean(height), median = median(height) ) ``` ``` #> # A tibble: 1 × 2 #> mean median #> <dbl> <dbl> #> 1 984. 50 ``` Percent of data below mean: ``` #> [1] "73.9%" ``` ] -- .rightcol60[ .center[**On average, at what height do planes hit birds?**] <img src="figs/wildlife-hist.png"> ] ??? On average, where do planes hit birds? Saying ~1000 ft is misleading It's much more likely to be under 100 ft --- class: inverse # .center[Beware the "flaw of averages"] -- .leftcol[ ### What happened to the statistician who crossed a river with an average depth of 3 feet?
] -- .rightcol[ ### ...he drowned <img src="images/foa.jpg" width=600> ] --- # .center[.blue[Variability ("Spread")]] -- ### **Standard deviation**: distribution of values relative to the mean ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` -- ### **Interquartile range (IQR)**: `\(Q_3 - Q_1\)` (middle 50% of data) -- ### **Range**: max - min --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] -- .rightcol60[ Here, **averages** are misleading: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( * mean = mean(days), * median = median(days)) ``` ``` #> # A tibble: 2 × 3 #> warehouse mean median #> <chr> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 #> 2 warehouseB 4.25 4.5 ``` ] --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] .rightcol60[ **Variability** reveals difference in days to ship: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( mean = mean(days), median = median(days), * range = max(days) - min(days), * sd = sd(days)) ``` ``` #> # A tibble: 2 × 5 #> warehouse mean median range sd #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 2 0.866 #> 2 warehouseB 4.25 4.5 9 2.70 ``` ] --- # .center[.fancy[Example:] Days to ship] <center> <img src="figs/days-to-ship.png" width=960> </center> --- class: center # Interpreting the standard 
deviation .leftcol[ ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` <center> <img src="figs/days-to-ship-sd.png" width=380> </center> ] -- .rightcol[ <img src="images/sd.png"> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- ## **Mean** & **Standard Deviation** are sensitive to outliers **Outliers**: `\(Q_1 - 1.5 IQR\)` or `\(Q_3 + 1.5 IQR\)` **Extreme values**: `\(Q_1 - 3 IQR\)` or `\(Q_3 + 3 IQR\)` -- .leftcol[ ```r data1 <- c(3,3,4,5,5,6,6,7,8,9) ``` - Mean: 5.6 - Standard Deviation: 2.01 - Median: 5.5 - IQR: 2.5 ] -- .rightcol[ ```r data2 <- c(3,3,4,5,5,6,6,7,8,20) ``` - .red[Mean: 6.7] - .red[Standard Deviation: 4.95] - .blue[Median: 5.5] - .blue[IQR: 2.5] ] --- class: inverse, middle # .center[Robust statistics for continuous data] # .center[(less sensitive to outliers)] ## .red[Centrality]: Use _median_ rather than _mean_ ## .blue[Variability]: Use _IQR_ rather than _standard deviation_ --- class: inverse
# Practice with summary measurements ### 1) Read in the following data sets: - `milk_production.csv` - `lotr_words.csv` ### 2) For each variable in each data set, if possible, summarize its ### 1. .red[Centrality] ### 2. .blue[Variability] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. .orange[Visualizing Centrality & Variability] ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: center # "Visualizing data helps us think" <center> <img src = "images/anscombe_data.png" width = "740"> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center .leftcol60[ # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> ] .rightcol40[ <br> <center> <img src="https://eda.seas.gwu.edu/2023-Fall/images/logo.png" width=100%> </center> ] --- class: inverse, center, middle # The data _type_ determines <br> how to summarize it --- .cols3[ ### **Nominal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions <br> <br> <br> <br> **Charts**: - Bars ] -- .cols3[ ### **Ordinal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions - .red[Centrality]:<br>Median, Mode - .blue[Variability]: IQR <br> **Charts**: - Bars ] -- .cols3[ ### **Numerical<br>(Continuous)** **Measures**: - .red[Centrality]:<br>Mean, median - .blue[Variability]: Range, standard deviation, IQR <br> <br> **Charts**: - Histogram - Boxplot ] --- ## Summarizing **Nominal** data .leftcol45[ Summarize with counts / percentages ```r wildlife_impacts %>% * count(operator, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 4 × 3 #> operator n p #> <chr> <int> <dbl> #> 1 
SOUTHWEST AIRLINES 17970 0.315 #> 2 UNITED AIRLINES 15116 0.265 #> 3 AMERICAN AIRLINES 14887 0.261 #> 4 DELTA AIR LINES 9005 0.158 ``` ] -- .rightcol55[ Visualize with (usually sorted) bars .code70[ ```r wildlife_impacts %>% count(operator, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = n, y = reorder(operator, n)), * width = 0.7) + labs(x = "Count", y = "Operator") ``` <img src="figs/wildlife-operator-bars-1.png" width="504" /> ]] --- ## Summarizing **Ordinal** data .leftcol[ **Summarize**: Counts / percentages .code70[ ```r wildlife_impacts %>% * count(incident_month, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 12 × 3 #> incident_month n p #> <dbl> <int> <dbl> #> 1 9 7980 0.140 #> 2 10 7754 0.136 #> 3 8 7104 0.125 #> 4 5 6161 0.108 #> 5 7 6133 0.108 #> 6 6 4541 0.0797 #> 7 4 4490 0.0788 #> 8 11 4191 0.0736 #> 9 3 2678 0.0470 #> 10 12 2303 0.0404 #> 11 1 1951 0.0342 #> 12 2 1692 0.0297 ``` ]] -- .rightcol[ **Visualize**: Bars .code70[ ```r wildlife_impacts %>% count(incident_month, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = as.factor(incident_month), * y = n), width = 0.7) + labs(x = "Incident month") ``` <img src="figs/wildlife-months-bar-1.png" width="504" /> ]] --- ## Summarizing **continuous** variables .leftcol30[ **Histograms**: - Skewness - Number of modes <br> **Boxplots**: - Outliers - Comparing variables ] .rightcol70[.border[ <img src = 'images/eda-boxplot.png'> ]] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol40[ **Summarise**:<br>Mean, median, sd, range, & IQR: ```r summary(wildlife_impacts$height) ``` ``` #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's #> 0.0 0.0 50.0 983.8 1000.0 25000.0 18038 ``` ] -- .rightcol60[ **Visualize**:<br>Histogram (identify skewness & modes) ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/wildlife-height-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = speed), bins = 50) + labs(x = 'speed (mph)', y = 'Count') ``` <img src="figs/wildlife-speed-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Boxplot**: Identify outliers .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = height)) + labs(x = 'Height (ft)', y = NULL) ``` <img src="figs/wildlife-height-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = speed)) + labs(x = 'Speed (mph)', y = NULL) ``` <img src="figs/wildlife-speed-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Histogram - Skewness - Modes <img src="figs/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ ## Boxplot - Outliers <br><br> <img src="figs/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
# Practicing visual summaries .font90[ 1) Read in the following data sets: - `faithful.csv` - `marathon.csv` 2) Summarize the following variables using an appropriate chart (bar chart, histogram, and / or boxplot): - faithful: `eruptions` - faithful: `waiting` - marathon: `Age` - marathon: `State` - marathon: `Country` - marathon: `` `Official Time` `` ] --- class: inverse, center # Break! ## Stand up, Move around, Stretch!
--- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. .orange[Correlation] ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- ## .center[Some pretty racist origins in [eugenics](https://en.wikipedia.org/wiki/Eugenics) ("well born")] -- .leftcol[ ### [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton) (1822 - 1911) - Charles Darwin's cousin. - "Father" of [eugenics](https://en.wikipedia.org/wiki/Eugenics). - Interested in heredity. <center> <img src="images/Francis_Galton_1850s.jpg" width=200> </center> ] -- .rightcol[ ### [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) (1857 - 1936) - Galton's ([hero-worshiping](https://en.wikipedia.org/wiki/Apotheosis)) protégé. - Defined correlation equation. - "Father" of mathematical statistics. <center> <img src="images/Karl_Pearson.jpg" width=220> <center> ] ??? The beautiful irony is that human genetics was also the field that conclusively demonstrated the biological falsity of race. --- .leftcol[ # Galton's family data Galton, F. (1886). ["Regression towards mediocrity in hereditary stature"](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf). _The Journal of the Anthropological Institute of Great Britain and Ireland_ 15: 246-263. 
**Galton's question**: Does marriage selection indicate a relationship between the heights of husbands and wives?<br>(He called this "assortative mating") "midparent height" is just a scaled mean: ```r midparentHeight = (father + 1.08*mother)/2 ``` ] -- .rightcol[.code70[ ```r library(HistData) galtonScatterplot <- ggplot(GaltonFamilies) + geom_point(aes(x = midparentHeight, y = childHeight), size = 0.5, alpha = 0.7) + theme_classic() + labs(x = 'Midparent height (inches)', y = 'Child height (inches)') ``` <center> <img src="figs/galtonScatterplot.png" width=450> </center> ]] --- class: center, middle # How do you measure correlation? <br> # Pearson came up with this: # `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` --- # How do you measure correlation? .leftcol60[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` .font130[ Assumptions: 1. Variables must be interval or ratio 2. Linear relationship ]] -- .rightcol40[ <center> <img src="figs/cor_vstrong_p.png" width=275> </center> <br> <center> <img src="figs/cor_quad.png" width=275> </center> ] --- # How do you _interpret_ `\(r\)`? .leftcol[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` Interpretation: - `\(-1 \le r \le 1\)` - Closer to 1 is stronger correlation - Closer to 0 is weaker correlation ] -- .rightcol[.code70[ ```r cor(x = GaltonFamilies$midparentHeight, y = GaltonFamilies$childHeight, method = 'pearson') ``` ``` #> [1] 0.3209499 ``` ] <center> <img src="figs/galtonScatterplot.png" width=400> </center> ] --- ## What does `\(r\)` mean? .leftcol40[.font120[ - `\(\pm 0.1 - 0.3\)`: Weak - `\(\pm 0.3 - 0.5\)`: Moderate - `\(\pm 0.5 - 0.8\)`: Strong - `\(\pm 0.8 - 1.0\)`: Very strong ]] .rightcol60[ <center> <img src="figs/cor_p.png"> </center> ] --- class: center, middle # Visualizing correlation is...um...easy, right? 
<br> # [guessthecorrelation.com](http://guessthecorrelation.com/) # Click [here](https://docs.google.com/presentation/d/1-7VqNRJp53FawfNJwKLEkpoubGQ_x0wIkN2lAMP7Emw/edit?usp=sharing) to vote! --- class: middle .leftcol20[ ## The datasaurus ### (More [here](https://www.autodeskresearch.com/publications/samestats)) ] .rightcol80[ <img src="images/datasaurus.png"> ] --- # Coefficient of determination: `\(r^2\)` .leftcol[.font130[ Percent of variance in one variable that is explained by the other variable <center> <img src="images/rsquared_venn.png"> </center> ]] -- .rightcol[ `\(r\)` | `\(r^2\)` ----|------ 0.1 | 0.01 0.2 | 0.04 0.3 | 0.09 0.4 | 0.16 0.5 | 0.25 0.6 | 0.36 0.7 | 0.49 0.8 | 0.64 0.9 | 0.81 1.0 | 1.00 ] --- ## You should report both `\(r\)` and `\(r^2\)` <br> ### The correlation between parent and child height is 0.32, so about 10% of the variance in child height is explained by parent height. --- # Correlation != Causation -- ### X causes Y - Training causes improved performance -- ### Y causes X - Good (bad) performance causes people to train harder (less hard). -- ### Z causes both X & Y - Commitment and motivation cause increased training and better performance. --- class: center ## Be wary of dual axes!
## ([They can cause spurious correlations](https://www.tylervigen.com/spurious-correlations)) -- .leftcol[ .font120[Dual axes] <center> <img src="images/hbr_two_axes1.png"> </center> ] -- .rightcol[ .font120[Single axis] <center> <img src="images/hbr_two_axes2.png"> </center> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- class: middle <center> <img src="figs/pearson_base.png" width=600> </center> --- class: middle <center> <img src="figs/pearson1.png" width=600> </center> --- class: middle <center> <img src="figs/pearson2.png" width=600> </center> --- class: center, middle ## **Pearson** correlation is highly sensitive to outliers <center> <img src="figs/pearson_grid.png" width=600> </center> --- # **Spearman**'s rank-order correlation # `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` -- .font120[ - Separately rank the values of X & Y. - Use Pearson's correlation on the _ranks_ instead of the `\(x\)` & `\(y\)` values. ] -- .font120[ Assumptions: - Variables can be ordinal, interval or ratio - Relationship must be monotonic (i.e. 
does not require linearity) ] --- class: center, middle ## Spearman correlation is more robust to outliers <center> <img src="figs/spearman_grid.png" width=600> </center> --- class: center, middle ## Spearman correlation is more robust to outliers .cols3[ <center> <img src="figs/pearson_grid.png"> </center> ] .cols3[ <table> <thead> <tr> <th style="text-align:right;"> Pearson </th> <th style="text-align:right;"> Spearman </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> <tr> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.69 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.38 </td> <td style="text-align:right;"> 0.76 </td> </tr> <tr> <td style="text-align:right;"> 0.81 </td> <td style="text-align:right;"> 0.79 </td> </tr> <tr> <td style="text-align:right;"> 0.31 </td> <td style="text-align:right;"> 0.70 </td> </tr> <tr> <td style="text-align:right;"> 0.95 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.51 </td> <td style="text-align:right;"> 0.75 </td> </tr> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> </tbody> </table> ] .cols3[ <center> <img src="figs/outlier_compare.png"> </center> ] --- ## Summary of correlation .font120[ - **Pearson's correlation**: Describes the strength of a **linear** relationship between two variables that are interval or ratio in nature. - **Spearman's rank-order correlation**: Describes the strength of a **monotonic** relationship between two variables that are ordinal, interval, or ratio. **It is more robust to outliers**. - The **coefficient of determination** ( `\(r^2\)` ) describes the amount of variance in one variable that is explained by the other variable.
- **Correlation != Causation** ] -- R command (hint: add `use = "complete.obs"` to drop NA values) ```r pearson <- cor(x, y, method = "pearson", use = "complete.obs") spearman <- cor(x, y, method = "spearman", use = "complete.obs") ``` --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. .orange[Visualizing Correlation] ### 7. Visualizing Relationships ] --- ## **Scatterplots**: The correlation workhorse .leftcol[ ```r scatterplot <- mtcars %>% ggplot() + * geom_point( * aes(x = mpg, y = hp), * size = 2, alpha = 0.7 * ) + theme_classic(base_size = 20) + labs( x = 'Fuel economy (mpg)', y = 'Engine power (hp)' ) scatterplot ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplotBase.png"> </center> ] --- ## Adding a correlation label to a chart .leftcol[ Make the correlation label ```r corr <- cor( mtcars$mpg, mtcars$hp, method = 'pearson') *corrLabel <- paste('r = ', round(corr, 2)) ``` Add label to the chart with `annotate()` ```r scatterplot + * annotate( * geom = 'text', * x = 25, y = 310, * label = corrLabel, * hjust = 0, size = 7 * ) ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplot.png"> </center> ] --- class: middle, center background-color: #FFFFFF <center> <img src="images/all-the-correlations.jpeg" width=700> </center> --- ## Visualize all the correlations: `ggcorr()` .leftcol[ ```r library('GGally') ``` ```r mtcars %>% * ggcorr() ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ```r library('GGally') ``` ```r mtcars %>% * ggcorr(label = TRUE, * label_size = 3, * label_round = 2) ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_labels.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ```r ggcor_mtcars_final <- mtcars %>% ggcorr(label = TRUE, label_size = 3, 
label_round = 2, * label_color = 'white', * nbreaks = 5, * palette = "RdBu") ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_final.png"> </center> ] --- .leftcol[ ## .center[Pearson] ```r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "pearson")) ``` <center> <img src="figs/ggcor_mtcars_pearson.png" width=400> </center> ] .rightcol[ ## .center[Spearman] ```r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "spearman")) ``` <center> <img src="figs/ggcor_mtcars_spearman.png" width=400> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ```r library('GGally') ``` ```r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% * ggpairs() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars.png" width=600> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ```r library('GGally') ``` ```r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% ggpairs() + * theme_classic() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars_classic.png" width=600> </center> ] --- class: inverse ## Your turn
.leftcol[ Using the `penguins` data frame: 1. Find the two variables with the largest correlation in absolute value (i.e. closest to -1 or 1). 2. Create a scatter plot of those two variables. 3. Add an annotation for the Pearson correlation coefficient. ] .rightcol[ ### .center[[palmerpenguins library](https://allisonhorst.github.io/palmerpenguins/)] <center> <img src="images/lter_penguins.png" width=700> </center> .right[Artwork by [@allison_horst](https://twitter.com/allison_horst)] ] --- ## **Simpson's Paradox**: when correlation betrays you -- .leftcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins.png" width=450> </center> ] -- .rightcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins_good.png" width=600> </center> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. .orange[Visualizing Relationships] ] --- ## Visualizing variation .leftcol30[ Ask yourself: - What type of **variation** occurs within my variables? - What type of **covariation** occurs between my variables? 
Check out [these guides](https://eda.seas.gwu.edu/2023-Fall/references.html#choosing-the-right-chart) ] .rightcol70[ <center> <img src = "images/plots-table.png" width = "800"> </center> ] --- ## Two **Categorical** Variables Summarize with a table of counts .leftcol60[ ```r wildlife_impacts %>% * count(operator, time_of_day) ``` ``` #> # A tibble: 20 × 3 #> operator time_of_day n #> <chr> <chr> <int> #> 1 AMERICAN AIRLINES Dawn 458 #> 2 AMERICAN AIRLINES Day 7809 #> 3 AMERICAN AIRLINES Dusk 584 #> 4 AMERICAN AIRLINES Night 3710 #> 5 AMERICAN AIRLINES <NA> 2326 #> 6 DELTA AIR LINES Dawn 267 #> 7 DELTA AIR LINES Day 4846 #> 8 DELTA AIR LINES Dusk 353 #> 9 DELTA AIR LINES Night 2090 #> 10 DELTA AIR LINES <NA> 1449 #> 11 SOUTHWEST AIRLINES Dawn 394 #> 12 SOUTHWEST AIRLINES Day 9109 #> 13 SOUTHWEST AIRLINES Dusk 599 #> 14 SOUTHWEST AIRLINES Night 5425 #> 15 SOUTHWEST AIRLINES <NA> 2443 #> 16 UNITED AIRLINES Dawn 151 #> 17 UNITED AIRLINES Day 3359 #> 18 UNITED AIRLINES Dusk 181 #> 19 UNITED AIRLINES Night 1510 #> 20 UNITED AIRLINES <NA> 9915 ``` ] --- ## Two **Categorical** Variables Convert to "wide" format with `pivot_wider()` to make it easier to compare values .leftcol70[ ```r wildlife_impacts %>% count(operator, time_of_day) %>% * pivot_wider(names_from = time_of_day, values_from = n) ``` ``` #> # A tibble: 4 × 6 #> operator Dawn Day Dusk Night `NA` #> <chr> <int> <int> <int> <int> <int> #> 1 AMERICAN AIRLINES 458 7809 584 3710 2326 #> 2 DELTA AIR LINES 267 4846 353 2090 1449 #> 3 SOUTHWEST AIRLINES 394 9109 599 5425 2443 #> 4 UNITED AIRLINES 151 3359 181 1510 9915 ``` ] --- ## Two **Categorical** Variables .leftcol45[ Visualize with bars:<br>map **fill** to denote 2nd categorical var ```r wildlife_impacts %>% count(operator, time_of_day) %>% ggplot() + geom_col( aes( x = n, y = reorder(operator, n), * fill = reorder(time_of_day, n) ), width = 0.7, * position = 'dodge') + theme(legend.position = "bottom") + labs( fill = "Time of day", y = "Airline" ) ``` ] 
.rightcol55[ <img src="figs/unnamed-chunk-56-1.png" width="648" style="display: block; margin: auto;" /> ] --- ## Two **Continuous** Variables Visualize with scatterplot - looking for _clustering_ and/or _correlational_ relationship .leftcol45[ ```r ggplot(wildlife_impacts) + geom_point( aes( x = speed, y = height ), size = 0.5) + labs( x = 'Speed (mph)', y = 'Height (ft)' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## One **Continuous**, One **Categorical** Visualize with **boxplot** .leftcol45[ ```r ggplot(wildlife_impacts) + geom_boxplot( aes( x = speed, y = operator) ) + labs( x = 'Speed (mph)', y = 'Airline' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-58-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
−
+
15
:
00
# Practice doing EDA 1) Read in the `candy_rankings.csv` data set 2) Preview the data and note the data types and what each variable represents. 3) Visualize (at least) three _relationships_ between two variables (guided by a question) using an appropriate chart: - Bar chart - Scatterplot - Boxplot
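--- ## A template: one chart of each type A minimal sketch for the practice above, using the built-in `mtcars` data rather than `candy_rankings` (whose column names differ) - swap in your own variables:

```r
library(ggplot2)

# Bar chart: counts of a categorical variable
ggplot(mtcars) +
  geom_bar(aes(y = factor(cyl)), width = 0.7) +
  labs(x = "Count", y = "Cylinders")

# Scatterplot: two continuous variables, annotated with Pearson's r
r <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  annotate("text", x = 4.5, y = 30, label = paste("r =", round(r, 2))) +
  labs(x = "Weight (1,000 lbs)", y = "Fuel economy (mpg)")

# Boxplot: one continuous variable split by one categorical variable
ggplot(mtcars) +
  geom_boxplot(aes(x = mpg, y = factor(cyl))) +
  labs(x = "Fuel economy (mpg)", y = "Cylinders")
```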