Correlation

.leftcol30[
<center>
<img src="https://raw.githubusercontent.com/emse-eda-gwu/2021-Spring/master/images/eda_hex_sticker.png" width=250>
</center>
]
.rightcol70[
# Week 4: .fancy[Correlation]

### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 4575: Exploratory Data Analysis
### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston
### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> February 03, 2021
]

---
class: center, middle, inverse

# .fancy[.blue[Tip of the week]]

# All data are biased

---
class: middle

.leftcol40[
<center>
<img src="images/Abraham_Wald.jpg" width=300>
</center>
## [.center[Abraham Wald]](https://en.wikipedia.org/wiki/Abraham_Wald)
]

---

## Today's data

```r
msleep <- read_csv(here::here('data', 'msleep.csv'))
```

## New packages:

```r
install.packages('HistData')
install.packages('palmerpenguins')
install.packages('GGally')
```

---

# Week 4: .fancy[Correlation]

## 1. What is correlation?
## 2. Visualizing correlation

## BREAK

## 3. Linear models
## 4. Visualizing linear models

---

# Week 4: .fancy[Correlation]

## 1. .orange[What is correlation?]
## 2. Visualizing correlation

## BREAK

## 3. Linear models
## 4. Visualizing linear models

---

## Some pretty racist origins in [eugenics](https://en.wikipedia.org/wiki/Eugenics) ("well born")

### [Sir Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton) (1822 - 1911)
- Charles Darwin's cousin.
- "Father" of [eugenics](https://en.wikipedia.org/wiki/Eugenics).
- Interested in heredity.

<center>
<img src="images/Francis_Galton_1850s.jpg" width=200>
</center>
]

### [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) (1857 - 1936)

- Galton's ([hero-worshiping](https://en.wikipedia.org/wiki/Apotheosis)) protégé.
- Defined correlation equation.
- "Father" of mathematical statistics.

<center>
<img src="images/Karl_Pearson.jpg" width=220>
<center>
]

???

The beautiful irony is that human genetics was also the field that conclusively demonstrated the biological falsity of race.

---

# Galton's family data

Galton, F. (1886). ["Regression towards mediocrity in hereditary stature"](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf). _The Journal of the Anthropological Institute of Great Britain and Ireland_ 15: 246-263.

**Galton's question**: Does marriage selection indicate a relationship between the heights of husbands and wives?<br>(He called this "assortative mating")

"midparent height" is just a scaled mean:

```r
midparentHeight =  (father + 1.08*mother)/2
```
]

```r
library(HistData)

galtonScatterplot <- ggplot(GaltonFamilies) +
    geom_point(aes(x = midparentHeight,
                   y = childHeight),
               size = 0.5, alpha = 0.7) +
    theme_classic() +
    labs(x = 'Midparent height (inches)',
         y = 'Child height (inches)')
```

<center>
<img src="figs/galtonScatterplot.png" width=450>
</center>
]]

---
class: center, middle

# How do you measure correlation?

<br>
# Pearson came up with this:

# `$r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}$`

---

# How do you measure correlation?

.rightcol40[
<center>
<img src="figs/cor_vstrong_p.png" width=275>
</center>
<br>
<center>
<img src="figs/cor_quad.png" width=275>
</center>
]

---

# How do you _interpret_ `$r$`?

Interpretation:
- `$-1 \le r \le 1$`
- Closer to 1 is stronger correlation
- Closer to 0 is weaker correlation
]
--
.rightcol[.code70[

```r
cor(x = GaltonFamilies$midparentHeight,
    y = GaltonFamilies$childHeight,
    method = 'pearson')
```

```
#> [1] 0.3209499
```
]
<center>
<img src="figs/galtonScatterplot.png" width=400>
</center>
]

---

## What does `$r$` mean?

- `$\pm 0.1 - 0.3$`: Weak
- `$\pm 0.3 - 0.5$`: Moderate
- `$\pm 0.5 - 0.8$`: Strong
- `$\pm 0.8 - 1.0$`: Very strong
]]
.rightcol60[
<center>
<img src="figs/cor_p.png">
</center>
]

---
class: center,  middle

# Visualizing correlation is...um...easy, right?

<br>

# [guessthecorrelation.com](http://guessthecorrelation.com/)

# Click [here](https://docs.google.com/presentation/d/1uN5FfQ4QiiaJ1SI1A5SRji58vWzubk2qwGi9Lykz5z0/edit?usp=sharing) to vote!

---
class: middle

### (More [here](https://www.autodeskresearch.com/publications/samestats))
]

---

# Coefficient of determination: `$r^2$`

<center>
<img src="images/rsquared_venn.png">
</center>
]]

`$r$` | `$r^2$`
----|------
0.1 | 0.01
0.2 | 0.04
0.3 | 0.09
0.4 | 0.16
0.5 | 0.25
0.6 | 0.36
0.7 | 0.49
0.8 | 0.64
0.9 | 0.81
1.0 | 1.00

]

---

## You should report both `$r$` and `$r^2$`

<br>

### Correlation between parent and child height is 0.32, therefore 10% of the variance in the child height is explained by the parent height.

---

# Correlation != Causation

### X causes Y

- Training causes improved performance

### Y causes X

- (Good / bad) performance causes people to train harder.

### Z causes both X & Y

- Commitment and motivation cause increased training and better performance.

---
class: center

## Be weary of dual axes!

## ([They can cause spurious correlations](https://www.tylervigen.com/spurious-correlations))

---
class: inverse, center

# Outliers

---
class: middle

---
class: middle

---
class: middle

---
class: center, middle

## Pearson correlation is highly sensitive to outliers

---

# Spearman's rank-order correlation

# `$r = \frac{\text{Cov}(x, y)}{\text{sd}(x) * \text{sd}(y)}$`

.font120[
- Separately rank the values of X & Y.
- Use Pearson's correlation on the _ranks_ instead of the `$x$` & `$y$` values.
]

- Variables can be ordinal, interval or ratio
- Relationship must be monotonic (i.e. does not require linearity)
]

---
class: center, middle

## Spearman correlation more robust to outliers

---
class: center, middle

## Spearman correlation more robust to outliers

.cols3[
<center>
<img src="figs/pearson_grid.png">
</center>
]
.cols3[
<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Pearson </th>
   <th style="text-align:right;"> Spearman </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -0.56 </td>
   <td style="text-align:right;"> 0.53 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.39 </td>
   <td style="text-align:right;"> 0.69 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.94 </td>
   <td style="text-align:right;"> 0.81 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.38 </td>
   <td style="text-align:right;"> 0.76 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.81 </td>
   <td style="text-align:right;"> 0.79 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.31 </td>
   <td style="text-align:right;"> 0.70 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.95 </td>
   <td style="text-align:right;"> 0.81 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.51 </td>
   <td style="text-align:right;"> 0.75 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> -0.56 </td>
   <td style="text-align:right;"> 0.53 </td>
  </tr>
</tbody>
</table>
]
.cols3[
<center>
<img src="figs/outlier_compare.png">
</center>
]

---

## Summary of correlation

.font120[
- **Pearson's correlation**: Described the strength of a **linear** relationship between two variables that are interval or ratio in nature.
- **Spearman's rank-order correlation**: Describes the strength of a **monotonic** relationship between two variables that are ordinal, interval, or ratio. **It is more robust to outliers**.
- The **coefficient of determination** ( `$r^2$` ) describes the amount of variance in one variable that is explained by the other variable.
- **Correlation != Causation**
]

R command (hint: add `use = "complete.obs"` to drop NA values)

```r
pearson  <- cor(x, y, method = "pearson", use = "complete.obs")
spearman <- cor(x, y, method = "spearman", use = "complete.obs")
```

---

# Week 4: .fancy[Correlation]

## 1. What is correlation?
## 2. .orange[Visualizing correlation]

## BREAK

## 3. Linear models
## 4. Visualizing linear models

---

## **Scatterplots**: The correlation workhorse

```r
scatterplot <- ggplot(mtcars) +
*   geom_point(aes(x = mpg, y = hp),
               size = 2, alpha = 0.7) +
    theme_classic(base_size = 20) +
    labs(x = 'Fuel economy (mpg)',
         y = 'Engine power (hp)')

scatterplot
```
]

---

## Adding a correlation label to a chart

```r
corr <- cor(
    mtcars$mpg, mtcars$hp,
    method = 'pearson')

*corrLabel <- paste('r = ', round(corr, 2))
```

Add label to the chart with `annotate()`

```r
scatterplot +
*   annotate(geom = 'text',
*            x = 25, y = 310,
*            label = corrLabel,
*            hjust = 0, size = 7)
```
]

---
class: middle, center
background-color: #FFFFFF

---

## Visualize all the correlations: `ggcorr()`

```r
library('GGally')
```

```r
mtcars %>%
*   ggcorr()
```
]

---

## Visualizing correlations: `ggcorr()`

```r
library('GGally')
```

```r
mtcars %>%
*   ggcorr(label = TRUE,
*          label_size = 3,
*          label_round = 2)
```
]

---

## Visualizing correlations: `ggcorr()`

```r
ggcor_mtcars_final <- mtcars %>%
    ggcorr(label = TRUE,
           label_size = 3,
           label_round = 2,
*          label_color = 'white',
*          nbreaks = 5,
*          palette = "RdBu")
```
]

---

## .center[Pearson]

```r
mtcars %>%
    ggcorr(label = TRUE,
           label_size = 3,
           label_round = 2,
*          method = c("pairwise", "pearson"))
```

<center>
<img src="figs/ggcor_mtcars_pearson.png" width=400>
</center>
]

## .center[Spearman]

```r
mtcars %>%
    ggcorr(label = TRUE,
           label_size = 3,
           label_round = 2,
*          method = c("pairwise", "spearman"))
```

<center>
<img src="figs/ggcor_mtcars_spearman.png" width=400>
</center>
]

---

## Correlograms: `ggpairs()`

```r
library('GGally')
```

```r
mtcars %>%
    select(mpg, cyl, disp, hp, wt) %>%
*   ggpairs()
```

- Look for linear relationships 
- View distribution of each variable
]

---

## Correlograms: `ggpairs()`

```r
library('GGally')
```

```r
mtcars %>%
    select(mpg, cyl, disp, hp, wt) %>%
    ggpairs() +
*   theme_classic()
```

- Look for linear relationships 
- View distribution of each variable
]

---
class: inverse

## Your turn

1. Find the two variables with the largest correlation in absolute value (i.e. closest to -1 or 1).

2. Create a scatter plot of those two variables.

3. Add an annotation for the Pearson correlation coefficient.
]

.rightcol[
### .center[[palmerpenguins library](https://allisonhorst.github.io/palmerpenguins/)]
<center>
<img src="images/lter_penguins.png" width=700>
</center>

---

## **Simpson's Paradox**: when correlation betrays you

.leftcol[
.center[**Body mass vs. Bill depth**]
<center>
<img src="figs/simpson_penguins.png" width=450>
</center>
]

.rightcol[
.center[**Body mass vs. Bill depth**]
<center>
<img src="figs/simpson_penguins_good.png" width=600>
</center>
]

---
class: inverse, center

# Break!

## Stand up, Move around, Stretch!

---

# Week 4: .fancy[Correlation]

## 1. What is correlation?
## 2. Visualizing correlation

## BREAK

## 3. .orange[Linear models]
## 4. Visualizing linear models

---

# Palmer Penguins

.leftcol[
The correlation of 0.87 means that the body mass (g) explains about 75% of the variation in the flipper length (mm).
]

---

# Palmer Penguins

.leftcol[
The correlation of 0.87 means that the body mass (g) explains about 75% of the variation in the flipper length (mm).

<br>

**Now let's fit a model to these points!**
]

---

## Modeling basics

1. **Model family**: e.g., `$y = ax + b$`
<br>

2. **Fitted model**: e.g., `$y = 3x + 7$`
]

<center>
<img src="images/sim_base.png">
</center>
]

---

## Modeling basics

.rightcol[
There are an infinite number of possible models
<center>
<img src="images/sim_models.png">
</center>
]

---

## Modeling basics

Two parts to a model:

1. **Model family**: linear model: `$y = ax + b$`
<br>

2. **Fitted model**: How to choose the "best" `$a$` and `$b$`?
]

<center>
<img src="images/sim_models.png">
</center>
]

---

## Modeling basics

2. **Fitted model**: How to choose the "best" `$a$` and `$b$`?

Compute the **.blue["residuals"]**:<br>The distance between the model line and the data<br>

<center>
<img src="images/sim_residuals.png">
</center>
]

---

## Residual: `$y_i - y_i'$`

---

## Sum of squared residuals: `$\text{SSR} = \sum_{i = 1}^{n} (y_i - y_i')^2$`

---

## Search algorithm

`$y = ax + b$`
<br>

<center>
<img src="images/sim_models.png">
</center>
]

`$\text{SSR} = \sum_{i = 1}^{n} (y_i - y_i')^2$`

<center>
<img src="images/sim_residuals.png">
</center>
]

<center>
<img src="images/sim_best_line.png">
</center>
]

---

## Fitting a linear model in R

```r
model <- lm(formula = y ~ x, data = data)
```

&zwj;Example: Penguin data

```r
model <- lm(
  formula = body_mass_g ~ flipper_length_mm,
  data    = penguins)
```

Get coefficients ( `$a$` & `$b$` in `$y = ax + b$` )

```r
coef(model)
```

```
#>       (Intercept) flipper_length_mm 
#>       -5780.83136          49.68557
```

---

## Fitting a linear model in R

```r
model <- lm(formula = y ~ x,
            data = data)
```

&zwj;Example: Penguin data

```r
model <- lm(
  formula = body_mass_g ~ flipper_length_mm,
  data    = penguins)
```

Get coefficients

```r
coef(model)
```

```
#>       (Intercept) flipper_length_mm 
#>       -5780.83136          49.68557
```
]

---
class: middle, center

## Interpreting results

---

## Example write up for Penguin data

The correlation between flipper length (mm) and body mass (g) is **0.87**. Therefore, **~75%** of the variance in body mass is explained by flipper length.

The slope of the best fitting regression line indicates that body mass increased by **49.7 g** as flipper length increased by one mm.
]]

---

## Making predictions

.leftcol[
**Interpolation is OK**: You may predict values of `$y$` for values of `$x$` that were not observed but are within the range of the observed values of `$x$`.

<center>
<img src="figs/penguinScatterplotEq.png" width=420>
</center>
]

.rightcol[
**Extrapolation is BAD**: You should NOT predict values of `$y$` using values of `$x$` that are outside the observed range of `$x$`.

<center>
<img src="images/extrapolating.png">
</center>
.center[[xkcd](https://m.xkcd.com/605/)]
]

---

## Repeat: Extrapolation is **BAD**

.leftcol[
"Extrapolation of these trends to the 2008 Olympiad indicates that the women’s 100-metre race could be won in a time of 10.57±0.232 seconds and the men’s event in 9.73±0.144 seconds. **Should these trends continue, the projections will intersect at the 2156 Olympics, when — for the first time ever — the winning women's 100-metre sprint time of 8.079 seconds will be lower than that of the men's winning time of 8.098 seconds (Fig. 1).**"
]

.rightcol[
<center>
<img src="images/nature_sprint.jpeg">
</center>
.font80[Tatem, A. J., Guerra, C. A., Atkinson, P. M., & Hay, S. I. (2004). Momentous sprint at the 2156 Olympics? _Nature_, 431(7008), 525-525. [View online](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3173856/)]
]

---

# Symantics

.leftcol[.font130[
These all mean the same thing:
- "Use X to predict Y"
- "Regress Y _on_ X"
- "Regression of Y _on_ X"

```r
model <- lm(formula = y ~ x,
            data = data)
```
]]

---

# Symantics

```r
model <- lm(formula = y ~ x,
            data = data)
```

.leftcol[.font130[
**Y: Dependent variable**
- Outcome variable
- Response variable
- Regressand
- Left-hand variable

]]

.rightcol[.font130[
**X: Independent variable**
- Predictor variable
- Explanatory variable
- Regressor
- Right-hand variable
]]

---

# Week 4: .fancy[Correlation]

## 1. What is correlation?
## 2. Visualizing correlation

## BREAK

## 3. Linear models
## 4. .orange[Visualizing linear models]

---

## Adding the correlation annotation

```r
# Make the correlation label
corr <- cor(
    penguins$body_mass_g, penguins$flipper_length_mm,
    method = 'pearson', use = "complete.obs")

*corrLabel <- paste("r = ", round(corr, 2))

# Make the chart!
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(size = 1, alpha = 0.7) +
*   annotate(geom = 'text', x = 175, y = 6000,
*            label = corrLabel,
*            hjust = 0, size = 5) +
    theme_classic(base_size = 20) +
    labs(x = "Flipper length (mm)",
         y = "Body mass (g)")
```
]

---

```r
# Make correlation label
corrLabel <- paste("r = ", round(cor(
    penguins$body_mass_g, penguins$flipper_length_mm,
    method = 'pearson', use = "complete.obs"), 2))

# Make model label
model <- lm(
  formula = body_mass_g ~ flipper_length_mm,
  data    = penguins)
coefs <- round(coef(model), 2)
*modelLabel <- paste('y = ', coefs[1], ' + ', coefs[2], 'x')

# Make the chart!
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(size = 1, alpha = 0.7) +
*   geom_smooth(method = 'lm', se = FALSE) +
    annotate(geom = 'text', x = 175, y = 6000,
             label = corrLabel,
             hjust = 0, size = 5) +
*   annotate(geom = 'text', x = 175, y = 5700,
*            label = modelLabel, color = "blue",
*            hjust = 0, size = 5)
    theme_classic(base_size = 20) +
    labs(x = "Flipper length (mm)",
         y = "Body mass (g)")
```
]

<center>
<img src="figs/penguinScatterplotEq.png">
</center>
]

---
class: inverse

## Your turn

Using the `msleep` data frame:

1. Create a scatter plot of `brainwt` versus `bodywt`.
2. Include an annotation for the Pearson correlation coefficient.
3. Include an annotation for the best fit line.

Bonus: Compare your results to a log-linear relationship by converting the x and y variables to the log of x and y, like this:

```r
model <- lm(log(x) ~ log(y), data = msleep)
```

You can also convert your plot to log axes by adding these layers:

```r
plot +    
    scale_x_log10() +
    scale_y_log10()  
```

---
class: middle, center, inverse

# Projects

---
class: middle, center

## Take your time and take breaks

---
class: center

<center>
<img src="images/horst_monsters_r_first_then.png" width=550>
</center>
.left[Artwork by [@allison_horst](https://twitter.com/allison_horst)]

---
class: center, middle, inverse

# Start thinking about research questions

---

# Writing a research question

Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question) - your question should be:

- **Clear**: your audience can easily understand its purpose without additional explanation.

- **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report.

- **Concise**: it is expressed in the fewest possible words.

- **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data.

- **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?)

---

# Writing a research question

**Bad question: Why are social networking sites harmful?**

- Unclear: it does not specify _which_ social networking sites or state what harm is being caused; assumes that "harm" exists.

**Improved question: How are online users experiencing or addressing privacy issues on such social networking sites as Facebook and Twitter?**

- Specifies the sites (Facebook and Twitter), type of harm (privacy issues), and who is harmed (online users).

**Other good examples**: See the [Example Projects Page](https://eda.seas.gwu.edu/2021-Spring/ref-example-analyses.html) page