class: middle, inverse .leftcol30[ <center> <img src="https://eda.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 4: .fancy[Exploring Data] ###
EMSE 4572/6572: Exploratory Data Analysis ###
John Paul Helveston ###
September 20, 2023 ] --- class: center, middle, inverse # Quiz solution --- class: center, middle, inverse # Tip of the week: # `theme_set()` --- ```r ggplot(mtcars) + geom_point(aes(x = mpg, y = hp)) ``` .leftcol[ Default theme <img src="figs/unnamed-chunk-3-1.png" width="522.144" /> ] .rightcol[ `theme_bw(base_size = 20)` <img src="figs/unnamed-chunk-4-1.png" width="522.144" /> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. .orange[Exploring Data] ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- .leftcol[ # Exploratory Analysis <br> ### Goal: **Form** hypotheses. ### Improves quality of **questions**. ### _(what we do in THIS class)_ ] -- .rightcol[ # Confirmatory Analysis <br> ### Goal: **Test** hypotheses. ### Improves quality of **answers**. ### _(what you do in a stats class)_ ] --- .leftcol[ # Exploratory Analysis <br> RQ: Do people bike more when the weather is nice? <center> <img src="images/biking.png" width=100%> </center> ] -- .rightcol[ # Confirmatory Analysis <br> Let's build a model to predict bike usage based on weather. ] --- class: center, inverse # Don't be Icarus <center> <img src="images/icarus.jpg" width=800> </center> --- class: inverse, middle ## "Far better an approximate answer to the _right_ question, which is often vague, than an exact answer to the _wrong_ question, which can always be made precise."
## — [John Tukey](https://en.wikipedia.org/wiki/John_Tukey) --- class: center background-color: #FFFFFF **EDA is an iterative process to help you<br>_understand_ your data and ask better questions** <center> <img src="images/eda.png" width=700> </center> --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. .orange[Data Types] ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, center, middle # 24,901 ??? If I walked up to you, and said, "The answer is 24,901," you would probably be confused. By itself, a number means nothing. --- class: inverse, center, middle # Earth's circumference at the equator:<br>24,901 miles ??? But if I were to tell you that the circumference of the earth at the equator is 24,901 miles, that would mean something. To be complete and meaningful, quantitative information consists of both quantitative data (the numbers) and categorical data (the labels that tell us what the numbers measure). --- # Types of Data -- .leftcol[ ### **Categorical** Subdivide things into _groups_ - What type? - Which category? ] -- .rightcol[ ### **Numerical** Measure things with numbers - How many? - How much? ] --- ## Categorical (discrete) variables -- .leftcol[ ### **Nominal** - Order doesn't matter - Differ in "name" (nominal) only e.g. 
`country` in TB case data: .code80[ ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ]] -- .rightcol[ ### **Ordinal** - Order matters - Distance between units not equal e.g.: `Placement` 2017 Boston marathon: .code80[ ``` #> # A tibble: 6 × 3 #> Placement `Official Time` Name #> <dbl> <time> <chr> #> 1 1 02:09:37 Kirui, Geoffrey #> 2 2 02:09:58 Rupp, Galen #> 3 3 02:10:28 Osako, Suguru #> 4 4 02:12:08 Biwott, Shadrack #> 5 5 02:12:35 Chebet, Wilson #> 6 6 02:12:45 Abdirahman, Abdi ``` ]] --- ## Numerical data -- .leftcol[ ### **Interval** - Numerical scale with<br>arbitrary starting point - No "0" point - Can't say "x" is double "y" e.g.: `temp` in Beaver data ``` #> day time temp activ #> 1 346 840 36.33 0 #> 2 346 850 36.34 0 #> 3 346 900 36.35 0 #> 4 346 910 36.42 0 #> 5 346 920 36.55 0 #> 6 346 930 36.69 0 ``` ] -- .rightcol[ ### **Ratio** - Has a "0" point - Can be described as percentages - Can say "x" is double "y" e.g.: `height` & `speed` in wildlife impacts ``` #> # A tibble: 6 × 3 #> incident_date height speed #> <dttm> <dbl> <dbl> #> 1 2018-12-31 00:00:00 700 200 #> 2 2018-12-27 00:00:00 600 145 #> 3 2018-12-23 00:00:00 0 130 #> 4 2018-12-22 00:00:00 500 160 #> 5 2018-12-21 00:00:00 100 150 #> 6 2018-12-18 00:00:00 4500 250 ``` ] --- class: inverse, center, middle # Key Questions -- .leftcol[ ## Categorical ## .orange[Does the order matter?] Yes: **Ordinal** No: **Nominal** ] -- .rightcol[ ## Numerical ## .orange[Is there a "baseline"?] Yes: **Ratio** No: **Interval** ] --- class: center, middle # Be careful of how variables are encoded! 
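A quick base-R illustration (hypothetical `region` codes, not from the course data): the same values stored as numbers will happily produce a meaningless "average", while the same values stored as a factor get the appropriate categorical summary.

```r
# Hypothetical region codes: "North" = 1, "South" = 2, "East" = 3, "West" = 4
region_num <- c(1, 2, 2, 3, 4, 2)
region_fct <- factor(region_num, levels = 1:4,
                     labels = c("North", "South", "East", "West"))

mean(region_num)   # runs, but an "average region" is nonsense
table(region_fct)  # counts per category -- the meaningful summary
```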
--- ## .red[When numbers are categories] - "Dummy coding": e.g., `passedTest` = `1` or `0` - "North", "South", "East", "West" = `1`, `2`, `3`, `4` -- ## .red[When ratio data are discrete (i.e. counts)] - Number of eggs in a carton, heartbeats per minute, etc. - Continuous variables measured discretely (e.g. age) -- ## .red[Time] - As _ordinal_ categories: "Jan.", "Feb.", "Mar.", etc. - As _interval_ scale: "Jan. 1", "Jan. 2", "Jan. 3", etc. - As _ratio_ scale: "30 sec", "60 sec", "70 sec", etc. --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. .orange[Centrality & Variability] ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: inverse, middle # .center[.font140[Summary Measures:]] # Single variables: .red[Centrality] & .blue[Variability] # Two variables: .green[Correlation] --- # .center[.red[Centrality (a.k.a. The "Average" Value)]] -- ### .center[A single number representing the _middle_ of a set of numbers] <br> -- ### **Mean**: `\(\frac{\text{Sum of values}}{\text{# of values}}\)` -- ### **Median**: "Middle" value (50% of data above & below) --- # .center[Mean isn't always the "best" choice] .leftcol40[ ```r wildlife_impacts %>% filter(! is.na(height)) %>% summarise( mean = mean(height), median = median(height) ) ``` ``` #> # A tibble: 1 × 2 #> mean median #> <dbl> <dbl> #> 1 984. 50 ``` Percent of data below mean: ``` #> [1] "73.9%" ``` ] -- .rightcol60[ .center[**On average, at what height do planes hit birds?**] <img src="figs/wildlife-hist.png"> ] ??? On average, where do planes hit birds? Saying ~1000 ft is misleading It's much more likely to be under 100 ft --- class: inverse # .center[Beware the "flaw of averages"] -- .leftcol[ ### What happened to the statistician who crossed a river with an average depth of 3 feet?
] -- .rightcol[ ### ...he drowned <img src="images/foa.jpg" width=600> ] --- # .center[.blue[Variability ("Spread")]] -- ### **Standard deviation**: distribution of values relative to the mean ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` -- ### **Interquartile range (IQR)**: `\(Q_3 - Q_1\)` (middle 50% of data) -- ### **Range**: max - min --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] -- .rightcol60[ Here, **averages** are misleading: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( * mean = mean(days), * median = median(days)) ``` ``` #> # A tibble: 2 × 3 #> warehouse mean median #> <chr> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 #> 2 warehouseB 4.25 4.5 ``` ] --- # .center[.fancy[Example:] Days to ship] .leftcol40[ Complaints are coming in about orders shipped from warehouse B, so you collect some data: .code70[ ```r daysToShip ``` ``` #> order warehouseA warehouseB #> 1 1 3 1 #> 2 2 3 1 #> 3 3 3 1 #> 4 4 4 3 #> 5 5 4 3 #> 6 6 4 4 #> 7 7 5 5 #> 8 8 5 5 #> 9 9 5 5 #> 10 10 5 6 #> 11 11 5 7 #> 12 12 5 10 ``` ]] .rightcol60[ **Variability** reveals difference in days to ship: ```r daysToShip %>% gather(warehouse, days, warehouseA:warehouseB) %>% group_by(warehouse) %>% summarise( mean = mean(days), median = median(days), * range = max(days) - min(days), * sd = sd(days)) ``` ``` #> # A tibble: 2 × 5 #> warehouse mean median range sd #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 warehouseA 4.25 4.5 2 0.866 #> 2 warehouseB 4.25 4.5 9 2.70 ``` ] --- # .center[.fancy[Example:] Days to ship] <center> <img src="figs/days-to-ship.png" width=960> </center> --- class: center # Interpreting the standard 
deviation .leftcol[ ### `\(s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}\)` <center> <img src="figs/days-to-ship-sd.png" width=380> </center> ] -- .rightcol[ <img src="images/sd.png"> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- ## **Mean** & **Standard Deviation** are sensitive to outliers **Outliers**: `\(Q_1 - 1.5 IQR\)` or `\(Q_3 + 1.5 IQR\)` **Extreme values**: `\(Q_1 - 3 IQR\)` or `\(Q_3 + 3 IQR\)` -- .leftcol[ ```r data1 <- c(3,3,4,5,5,6,6,7,8,9) ``` - Mean: 5.6 - Standard Deviation: 2.01 - Median: 5.5 - IQR: 2.5 ] -- .rightcol[ ```r data2 <- c(3,3,4,5,5,6,6,7,8,20) ``` - .red[Mean: 6.7] - .red[Standard Deviation: 4.95] - .blue[Median: 5.5] - .blue[IQR: 2.5] ] --- class: inverse, middle # .center[Robust statistics for continuous data] # .center[(less sensitive to outliers)] ## .red[Centrality]: Use _median_ rather than _mean_ ## .blue[Variability]: Use _IQR_ rather than _standard deviation_ --- class: inverse
# Practice with summary measurements ### 1) Read in the following data sets: - `milk_production.csv` - `lotr_words.csv` ### 2) For each variable in each data set, if possible, summarize its ### 1. .red[Centrality] ### 2. .blue[Variability] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. .orange[Visualizing Centrality & Variability] ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- class: center # "Visualizing data helps us think" <center> <img src = "images/anscombe_data.png" width = "740"> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> .left[.footer-small[Stephen Few (2009, pg. 6)]] --- background-color: #fff class: center .leftcol60[ # Anscombe's Quartet <center> <img src="figs/anscombe-quartet.png" width=600> </center> ] .rightcol40[ <br> <center> <img src="https://eda.seas.gwu.edu/2023-Fall/images/logo.png" width=100%> </center> ] --- class: inverse, center, middle # The data _type_ determines <br> how to summarize it --- .cols3[ ### **Nominal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions <br> <br> <br> <br> **Charts**: - Bars ] -- .cols3[ ### **Ordinal<br>(Categorical)** **Measures**: - Frequency counts /<br>Proportions - .red[Centrality]:<br>Median, Mode - .blue[Variability]: IQR <br> **Charts**: - Bars ] -- .cols3[ ### **Numerical<br>(Continuous)** **Measures**: - .red[Centrality]:<br>Mean, median - .blue[Variability]: Range, standard deviation, IQR <br> <br> **Charts**: - Histogram - Boxplot ] --- ## Summarizing **Nominal** data .leftcol45[ Summarize with counts / percentages ```r wildlife_impacts %>% * count(operator, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 4 × 3 #> operator n p #> <chr> <int> <dbl> #> 1 
SOUTHWEST AIRLINES 17970 0.315 #> 2 UNITED AIRLINES 15116 0.265 #> 3 AMERICAN AIRLINES 14887 0.261 #> 4 DELTA AIR LINES 9005 0.158 ``` ] -- .rightcol55[ Visualize with (usually sorted) bars .code70[ ```r wildlife_impacts %>% count(operator, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = n, y = reorder(operator, n)), * width = 0.7) + labs(x = "Count", y = "Operator") ``` <img src="figs/wildlife-operator-bars-1.png" width="504" /> ]] --- ## Summarizing **Ordinal** data .leftcol[ **Summarize**: Counts / percentages .code70[ ```r wildlife_impacts %>% * count(incident_month, sort = TRUE) %>% * mutate(p = n / sum(n)) ``` ``` #> # A tibble: 12 × 3 #> incident_month n p #> <dbl> <int> <dbl> #> 1 9 7980 0.140 #> 2 10 7754 0.136 #> 3 8 7104 0.125 #> 4 5 6161 0.108 #> 5 7 6133 0.108 #> 6 6 4541 0.0797 #> 7 4 4490 0.0788 #> 8 11 4191 0.0736 #> 9 3 2678 0.0470 #> 10 12 2303 0.0404 #> 11 1 1951 0.0342 #> 12 2 1692 0.0297 ``` ]] -- .rightcol[ **Visualize**: Bars .code70[ ```r wildlife_impacts %>% count(incident_month, sort = TRUE) %>% * ggplot() + * geom_col(aes(x = as.factor(incident_month), * y = n), width = 0.7) + labs(x = "Incident month") ``` <img src="figs/wildlife-months-bar-1.png" width="504" /> ]] --- ## Summarizing **continuous** variables .leftcol30[ **Histograms**: - Skewness - Number of modes <br> **Boxplots**: - Outliers - Comparing variables ] .rightcol70[.border[ <img src = 'images/eda-boxplot.png'> ]] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol40[ **Summarise**:<br>Mean, median, sd, range, & IQR: ```r summary(wildlife_impacts$height) ``` ``` #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's #> 0.0 0.0 50.0 983.8 1000.0 25000.0 18038 ``` ] -- .rightcol60[ **Visualize**:<br>Histogram (identify skewness & modes) ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/wildlife-height-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Histogram**: Identify Skewness & # of Modes .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = height), bins = 50) + labs(x = 'Height (ft)', y = 'Count') ``` <img src="figs/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_histogram(aes(x = speed), bins = 50) + labs(x = 'speed (mph)', y = 'Count') ``` <img src="figs/wildlife-speed-hist-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## **Boxplot**: Identify outliers .leftcol[ **Height** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = height)) + labs(x = 'Height (ft)', y = NULL) ``` <img src="figs/wildlife-height-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ **Speed** ```r ggplot(wildlife_impacts) + * geom_boxplot(aes(x = speed)) + labs(x = 'Speed (mph)', y = NULL) ``` <img src="figs/wildlife-speed-boxplot-1.png" width="504" style="display: block; margin: auto;" /> ] --- .leftcol[ ## Histogram - Skewness - Modes <img src="figs/unnamed-chunk-27-1.png" width="504" style="display: block; margin: auto;" /> ] .rightcol[ ## Boxplot - Outliers <br><br> <img src="figs/unnamed-chunk-28-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
# Practicing visual summaries .font90[ 1) Read in the following data sets: - `faithful.csv` - `marathon.csv` 2) Summarize the following variables using an appropriate chart (bar chart, histogram, and / or boxplot): - faithful: `eruptions` - faithful: `waiting` - marathon: `Age` - marathon: `State` - marathon: `Country` - marathon: `` `Official Time` `` ] --- class: inverse, center # Break! ## Stand up, Move around, Stretch!
--- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. .orange[Correlation] ### 6. Visualizing Correlation ### 7. Visualizing Relationships ] --- ## .center[Some pretty racist origins in [eugenics](https://en.wikipedia.org/wiki/Eugenics) ("well born")] -- .leftcol[ ### [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton) (1822 - 1911) - Charles Darwin's cousin. - "Father" of [eugenics](https://en.wikipedia.org/wiki/Eugenics). - Interested in heredity. <center> <img src="images/Francis_Galton_1850s.jpg" width=200> </center> ] -- .rightcol[ ### [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) (1857 - 1936) - Galton's ([hero-worshiping](https://en.wikipedia.org/wiki/Apotheosis)) protégé. - Defined correlation equation. - "Father" of mathematical statistics. <center> <img src="images/Karl_Pearson.jpg" width=220> <center> ] ??? The beautiful irony is that human genetics was also the field that conclusively demonstrated the biological falsity of race. --- .leftcol[ # Galton's family data Galton, F. (1886). ["Regression towards mediocrity in hereditary stature"](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf). _The Journal of the Anthropological Institute of Great Britain and Ireland_ 15: 246-263. 
**Galton's question**: Does marriage selection indicate a relationship between the heights of husbands and wives?<br>(He called this "assortative mating") "midparent height" is just a scaled mean: ```r midparentHeight = (father + 1.08*mother)/2 ``` ] -- .rightcol[.code70[ ```r library(HistData) galtonScatterplot <- ggplot(GaltonFamilies) + geom_point(aes(x = midparentHeight, y = childHeight), size = 0.5, alpha = 0.7) + theme_classic() + labs(x = 'Midparent height (inches)', y = 'Child height (inches)') ``` <center> <img src="figs/galtonScatterplot.png" width=450> </center> ]] --- class: center, middle # How do you measure correlation? <br> # Pearson came up with this: # `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` --- # How do you measure correlation? .leftcol60[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` .font130[ Assumptions: 1. Variables must be interval or ratio 2. Linear relationship ]] -- .rightcol40[ <center> <img src="figs/cor_vstrong_p.png" width=275> </center> <br> <center> <img src="figs/cor_quad.png" width=275> </center> ] --- # How do you _interpret_ `\(r\)`? .leftcol[ ## `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` Interpretation: - `\(-1 \le r \le 1\)` - Closer to 1 is stronger correlation - Closer to 0 is weaker correlation ] -- .rightcol[.code70[ ```r cor(x = GaltonFamilies$midparentHeight, y = GaltonFamilies$childHeight, method = 'pearson') ``` ``` #> [1] 0.3209499 ``` ] <center> <img src="figs/galtonScatterplot.png" width=400> </center> ] --- ## What does `\(r\)` mean? .leftcol40[.font120[ - `\(\pm 0.1 - 0.3\)`: Weak - `\(\pm 0.3 - 0.5\)`: Moderate - `\(\pm 0.5 - 0.8\)`: Strong - `\(\pm 0.8 - 1.0\)`: Very strong ]] .rightcol60[ <center> <img src="figs/cor_p.png"> </center> ] --- class: center, middle # Visualizing correlation is...um...easy, right? 
<br> # [guessthecorrelation.com](http://guessthecorrelation.com/) # Click [here](https://docs.google.com/presentation/d/1-7VqNRJp53FawfNJwKLEkpoubGQ_x0wIkN2lAMP7Emw/edit?usp=sharing) to vote! --- class: middle .leftcol20[ ## The datasaurus ### (More [here](https://www.autodeskresearch.com/publications/samestats)) ] .rightcol80[ <img src="images/datasaurus.png"> ] --- # Coefficient of determination: `\(r^2\)` .leftcol[.font130[ Percent of variance in one variable that is explained by the other variable <center> <img src="images/rsquared_venn.png"> </center> ]] -- .rightcol[ `\(r\)` | `\(r^2\)` ----|------ 0.1 | 0.01 0.2 | 0.04 0.3 | 0.09 0.4 | 0.16 0.5 | 0.25 0.6 | 0.36 0.7 | 0.49 0.8 | 0.64 0.9 | 0.81 1.0 | 1.00 ] --- ## You should report both `\(r\)` and `\(r^2\)` <br> ### The correlation between parent and child height is 0.32, so about 10% of the variance in child height is explained by parent height. --- # Correlation != Causation -- ### X causes Y - Training causes improved performance -- ### Y causes X - Good (bad) performance causes people to train harder (less hard). -- ### Z causes both X & Y - Commitment and motivation cause increased training and better performance. --- class: center ## Be wary of dual axes!
## ([They can cause spurious correlations](https://www.tylervigen.com/spurious-correlations)) -- .leftcol[ .font120[Dual axes] <center> <img src="images/hbr_two_axes1.png"> </center> ] -- .rightcol[ .font120[Single axis] <center> <img src="images/hbr_two_axes2.png"> </center> ] --- class: inverse, center # Outliers <center> <img src = "images/outliers.jpeg" width = "730"> </center> --- class: middle <center> <img src="figs/pearson_base.png" width=600> </center> --- class: middle <center> <img src="figs/pearson1.png" width=600> </center> --- class: middle <center> <img src="figs/pearson2.png" width=600> </center> --- class: center, middle ## **Pearson** correlation is highly sensitive to outliers <center> <img src="figs/pearson_grid.png" width=600> </center> --- # **Spearman**'s rank-order correlation # `\(r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}\)` -- .font120[ - Separately rank the values of X & Y. - Use Pearson's correlation on the _ranks_ instead of the `\(x\)` & `\(y\)` values. ] -- .font120[ Assumptions: - Variables can be ordinal, interval or ratio - Relationship must be monotonic (i.e. 
does not require linearity) ] --- class: center, middle ## Spearman correlation is more robust to outliers <center> <img src="figs/spearman_grid.png" width=600> </center> --- class: center, middle ## Spearman correlation is more robust to outliers .cols3[ <center> <img src="figs/pearson_grid.png"> </center> ] .cols3[ <table> <thead> <tr> <th style="text-align:right;"> Pearson </th> <th style="text-align:right;"> Spearman </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> <tr> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.69 </td> </tr> <tr> <td style="text-align:right;"> 0.94 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.38 </td> <td style="text-align:right;"> 0.76 </td> </tr> <tr> <td style="text-align:right;"> 0.81 </td> <td style="text-align:right;"> 0.79 </td> </tr> <tr> <td style="text-align:right;"> 0.31 </td> <td style="text-align:right;"> 0.70 </td> </tr> <tr> <td style="text-align:right;"> 0.95 </td> <td style="text-align:right;"> 0.81 </td> </tr> <tr> <td style="text-align:right;"> 0.51 </td> <td style="text-align:right;"> 0.75 </td> </tr> <tr> <td style="text-align:right;"> -0.56 </td> <td style="text-align:right;"> 0.53 </td> </tr> </tbody> </table> ] .cols3[ <center> <img src="figs/outlier_compare.png"> </center> ] --- ## Summary of correlation .font120[ - **Pearson's correlation**: Describes the strength of a **linear** relationship between two variables that are interval or ratio in nature. - **Spearman's rank-order correlation**: Describes the strength of a **monotonic** relationship between two variables that are ordinal, interval, or ratio. **It is more robust to outliers**. - The **coefficient of determination** ( `\(r^2\)` ) describes the amount of variance in one variable that is explained by the other variable.
- **Correlation != Causation** ] -- R command (hint: add `use = "complete.obs"` to drop NA values) ```r pearson <- cor(x, y, method = "pearson", use = "complete.obs") spearman <- cor(x, y, method = "spearman", use = "complete.obs") ``` --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. .orange[Visualizing Correlation] ### 7. Visualizing Relationships ] --- ## **Scatterplots**: The correlation workhorse .leftcol[ ```r scatterplot <- mtcars %>% ggplot() + * geom_point( * aes(x = mpg, y = hp), * size = 2, alpha = 0.7 * ) + theme_classic(base_size = 20) + labs( x = 'Fuel economy (mpg)', y = 'Engine power (hp)' ) scatterplot ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplotBase.png"> </center> ] --- ## Adding a correlation label to a chart .leftcol[ Make the correlation label ```r corr <- cor( mtcars$mpg, mtcars$hp, method = 'pearson') *corrLabel <- paste('r = ', round(corr, 2)) ``` Add label to the chart with `annotate()` ```r scatterplot + * annotate( * geom = 'text', * x = 25, y = 310, * label = corrLabel, * hjust = 0, size = 7 * ) ``` ] .rightcol[ <center> <img src="figs/mtcarsScatterplot.png"> </center> ] --- class: middle, center background-color: #FFFFFF <center> <img src="images/all-the-correlations.jpeg" width=700> </center> --- ## Visualize all the correlations: `ggcorr()` .leftcol[ ```r library('GGally') ``` ```r mtcars %>% * ggcorr() ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ```r library('GGally') ``` ```r mtcars %>% * ggcorr(label = TRUE, * label_size = 3, * label_round = 2) ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_labels.png"> </center> ] --- ## Visualizing correlations: `ggcorr()` .leftcol[ ```r ggcor_mtcars_final <- mtcars %>% ggcorr(label = TRUE, label_size = 3, 
label_round = 2, * label_color = 'white', * nbreaks = 5, * palette = "RdBu") ``` ] .rightcol[ <center> <img src="figs/ggcor_mtcars_final.png"> </center> ] --- .leftcol[ ## .center[Pearson] ```r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "pearson")) ``` <center> <img src="figs/ggcor_mtcars_pearson.png" width=400> </center> ] .rightcol[ ## .center[Spearman] ```r mtcars %>% ggcorr(label = TRUE, label_size = 3, label_round = 2, * method = c("pairwise", "spearman")) ``` <center> <img src="figs/ggcor_mtcars_spearman.png" width=400> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ```r library('GGally') ``` ```r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% * ggpairs() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars.png" width=600> </center> ] --- ## Correlograms: `ggpairs()` .leftcol40[ ```r library('GGally') ``` ```r mtcars %>% select(mpg, cyl, disp, hp, wt) %>% ggpairs() + * theme_classic() ``` - Look for linear relationships - View distribution of each variable ] .rightcol60[ <center> <img src="figs/ggpairs_mtcars_classic.png" width=600> </center> ] --- class: inverse ## Your turn
.leftcol[ Using the `penguins` data frame: 1. Find the two variables with the largest correlation in absolute value (i.e. closest to -1 or 1). 2. Create a scatter plot of those two variables. 3. Add an annotation for the Pearson correlation coefficient. ] .rightcol[ ### .center[[palmerpenguins library](https://allisonhorst.github.io/palmerpenguins/)] <center> <img src="images/lter_penguins.png" width=700> </center> .right[Artwork by [@allison_horst](https://twitter.com/allison_horst)] ] --- ## **Simpson's Paradox**: when correlation betrays you -- .leftcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins.png" width=450> </center> ] -- .rightcol[ .center[**Body mass vs. Bill depth**] <center> <img src="figs/simpson_penguins_good.png" width=600> </center> ] --- class: inverse, middle # Week 4: .fancy[Exploring Data] .leftcol[ ### 1. Exploring Data ### 2. Data Types ### 3. Centrality & Variability ### 4. Visualizing Centrality & Variability ] .rightcol[ ### BREAK ### 5. Correlation ### 6. Visualizing Correlation ### 7. .orange[Visualizing Relationships] ] --- ## Visualizing variation .leftcol30[ Ask yourself: - What type of **variation** occurs within my variables? - What type of **covariation** occurs between my variables? 
Check out [these guides](https://eda.seas.gwu.edu/2023-Fall/references.html#choosing-the-right-chart) ] .rightcol70[ <center> <img src = "images/plots-table.png" width = "800"> </center> ] --- ## Two **Categorical** Variables Summarize with a table of counts .leftcol60[ ```r wildlife_impacts %>% * count(operator, time_of_day) ``` ``` #> # A tibble: 20 × 3 #> operator time_of_day n #> <chr> <chr> <int> #> 1 AMERICAN AIRLINES Dawn 458 #> 2 AMERICAN AIRLINES Day 7809 #> 3 AMERICAN AIRLINES Dusk 584 #> 4 AMERICAN AIRLINES Night 3710 #> 5 AMERICAN AIRLINES <NA> 2326 #> 6 DELTA AIR LINES Dawn 267 #> 7 DELTA AIR LINES Day 4846 #> 8 DELTA AIR LINES Dusk 353 #> 9 DELTA AIR LINES Night 2090 #> 10 DELTA AIR LINES <NA> 1449 #> 11 SOUTHWEST AIRLINES Dawn 394 #> 12 SOUTHWEST AIRLINES Day 9109 #> 13 SOUTHWEST AIRLINES Dusk 599 #> 14 SOUTHWEST AIRLINES Night 5425 #> 15 SOUTHWEST AIRLINES <NA> 2443 #> 16 UNITED AIRLINES Dawn 151 #> 17 UNITED AIRLINES Day 3359 #> 18 UNITED AIRLINES Dusk 181 #> 19 UNITED AIRLINES Night 1510 #> 20 UNITED AIRLINES <NA> 9915 ``` ] --- ## Two **Categorical** Variables Convert to "wide" format with `pivot_wider()` to make it easier to compare values .leftcol70[ ```r wildlife_impacts %>% count(operator, time_of_day) %>% * pivot_wider(names_from = time_of_day, values_from = n) ``` ``` #> # A tibble: 4 × 6 #> operator Dawn Day Dusk Night `NA` #> <chr> <int> <int> <int> <int> <int> #> 1 AMERICAN AIRLINES 458 7809 584 3710 2326 #> 2 DELTA AIR LINES 267 4846 353 2090 1449 #> 3 SOUTHWEST AIRLINES 394 9109 599 5425 2443 #> 4 UNITED AIRLINES 151 3359 181 1510 9915 ``` ] --- ## Two **Categorical** Variables .leftcol45[ Visualize with bars:<br>map **fill** to denote 2nd categorical var ```r wildlife_impacts %>% count(operator, time_of_day) %>% ggplot() + geom_col( aes( x = n, y = reorder(operator, n), * fill = reorder(time_of_day, n) ), width = 0.7, * position = 'dodge') + theme(legend.position = "bottom") + labs( fill = "Time of day", y = "Airline" ) ``` ] 
.rightcol55[ <img src="figs/unnamed-chunk-56-1.png" width="648" style="display: block; margin: auto;" /> ] --- ## Two **Continuous** Variables Visualize with scatterplot - looking for _clustering_ and/or _correlational_ relationship .leftcol45[ ```r ggplot(wildlife_impacts) + geom_point( aes( x = speed, y = height ), size = 0.5) + labs( x = 'Speed (mph)', y = 'Height (ft)' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-57-1.png" width="504" style="display: block; margin: auto;" /> ] --- ## One **Continuous**, One **Categorical** Visualize with **boxplot** .leftcol45[ ```r ggplot(wildlife_impacts) + geom_boxplot( aes( x = speed, y = operator) ) + labs( x = 'Speed (mph)', y = 'Airline' ) ``` ] .rightcol55[ <img src="figs/unnamed-chunk-58-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: inverse
−
+
15
:
00
# Practice doing EDA 1) Read in the `candy_rankings.csv` data set 2) Preview the data and note the data types and what each variable represents. 3) Visualize (at least) three _relationships_ between two variables (guided by a question) using an appropriate chart: - Bar chart - Scatterplot - Boxplot
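--- ## A template: one chart of each type A minimal sketch for the practice above, using the built-in `mtcars` data rather than `candy_rankings` (whose column names differ) - swap in your own variables:

```r
library(ggplot2)

# Bar chart: counts of a categorical variable
ggplot(mtcars) +
  geom_bar(aes(y = factor(cyl)), width = 0.7) +
  labs(x = "Count", y = "Cylinders")

# Scatterplot: two continuous variables, annotated with Pearson's r
r <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  annotate("text", x = 4.5, y = 30, label = paste("r =", round(r, 2))) +
  labs(x = "Weight (1,000 lbs)", y = "Fuel economy (mpg)")

# Boxplot: one continuous variable split by one categorical variable
ggplot(mtcars) +
  geom_boxplot(aes(x = mpg, y = factor(cyl))) +
  labs(x = "Fuel economy (mpg)", y = "Cylinders")
```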