Centrality & Variability


# Week 4: .fancy[Centrality & Variability]

### EMSE 4572: Exploratory Data Analysis
### John Paul Helveston
### September 21, 2022


---

# Quiz solution

---

# Tip of the week:

# `theme_set()`

---

# Add "global" settings to all plots

```r
library(knitr)
library(tidyverse)
library(here)
knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    comment = "#>",
*   fig.path = "figs/", # Plot save path
*   fig.width = 7.252, # Plot dimensions
*   fig.height = 4,
*   fig.retina = 3 # Better plot resolution
)

*theme_set(theme_bw(base_size = 20)) # Set theme for all ggplots
```

---

```r
ggplot(mtcars) +
  geom_point(aes(x = mpg, y = hp))
```

Default theme


`theme_bw(base_size = 20)`


---

# Week 4: .fancy[Centrality & Variability]

## 1. Data Types
## 2. Measures of Centrality & Variability

## BREAK

## 3. Visualizing Centrality & Variability
## 4. Relationships Between 2 Variables
## 5. Exploratory Data Analysis

---

# Week 4: .fancy[Centrality & Variability]

## 1. .orange[Data Types]
## 2. Measures of Centrality & Variability

## BREAK

## 3. Visualizing Centrality & Variability
## 4. Relationships Between 2 Variables
## 5. Exploratory Data Analysis

---

# 24,901

???

If I walked up to you, and said, "The answer is 24,901,"
you would probably be confused.
By itself, a number means nothing.

---

# Earth's circumference at the equator:<br>24,901 miles

???

But if I were to tell you that the circumference of the earth at the equator is 24,901 miles, that would mean something.

To be complete and meaningful, quantitative information consists of both quantitative data (the numbers) and categorical data (the labels that tell us what the numbers measure).

---

# Types of Data

### **Categorical**

Subdivide things into _groups_

- What type?
- Which category?


### **Numerical**

Measure things with numbers

- How many?
- How much?


---

## Categorical (discrete) variables

### **Nominal**

- Order doesn't matter
- Differ in "name" (nominal) only

e.g. `country` in TB case data:

```
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
```


### **Ordinal**

- Order matters
- Distance between units not equal

e.g.: `Placement` in the 2017 Boston Marathon:

```
#> # A tibble: 6 × 3
#>   Placement `Official Time` Name            
#>       <dbl> <time>          <chr>           
#> 1         1 02:09:37        Kirui, Geoffrey 
#> 2         2 02:09:58        Rupp, Galen     
#> 3         3 02:10:28        Osako, Suguru   
#> 4         4 02:12:08        Biwott, Shadrack
#> 5         5 02:12:35        Chebet, Wilson  
#> 6         6 02:12:45        Abdirahman, Abdi
```


---

## Numerical data

### **Interval**

- Numerical scale with<br>arbitrary starting point
- No true "0" point
- Can't say "x" is double "y"

e.g.: `temp` in Beaver data

```
#>   day time  temp activ
#> 1 346  840 36.33     0
#> 2 346  850 36.34     0
#> 3 346  900 36.35     0
#> 4 346  910 36.42     0
#> 5 346  920 36.55     0
#> 6 346  930 36.69     0
```


### **Ratio**

- Has a true "0" point
- Can be described as percentages
- Can say "x" is double "y"

e.g.: `height` & `speed` in wildlife impacts

```
#> # A tibble: 6 × 3
#>   incident_date       height speed
#>   <dttm>               <dbl> <dbl>
#> 1 2018-12-31 00:00:00    700   200
#> 2 2018-12-27 00:00:00    600   145
#> 3 2018-12-23 00:00:00      0   130
#> 4 2018-12-22 00:00:00    500   160
#> 5 2018-12-21 00:00:00    100   150
#> 6 2018-12-18 00:00:00   4500   250
```


---

# Key Questions

## Categorical

## .orange[Does the order matter?]

| Yes | No |
|---|---|
| Ordinal | Nominal |


## Numerical

## .orange[Is there a "baseline"?]

| Yes | No |
|---|---|
| Ratio | Interval |


---

# Be careful of how variables are encoded!

---

## .red[When numbers are categories]

- "Dummy coding" (e.g., `passedTest` = `1` or `0`)
- "North", "South", "East", "West" = `1`, `2`, `3`, `4`

## .red[When ratio data are discrete (i.e. counts)]

- Number of eggs in a carton, heart beats per minute, etc.
- Continuous variables measured discretely (e.g. age)

## .red[Time]

- As _ordinal_ categories: "Jan.", "Feb.", "Mar.", etc.
- As _interval_ scale: "Jan. 1", "Jan. 2", "Jan. 3", etc.
- As _ratio_ scale: "30 sec", "60 sec", "70 sec", etc.
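One way to make the encoding explicit in R is to store categories as factors. The sketch below uses hypothetical vectors (`direction_code` and `month_abbr` are made up for illustration, not taken from the course data): `factor()` keeps number-coded categories from being treated as quantities, and `ordered = TRUE` marks a variable as ordinal.

```r
# Hypothetical example data: compass directions stored as the numbers 1-4
direction_code <- c(1, 3, 3, 2, 4, 1)

# The mean of category codes can be computed, but it is meaningless
mean(direction_code)

# Convert to a nominal factor so R treats the values as labels
direction <- factor(
  direction_code,
  levels = 1:4,
  labels = c("North", "South", "East", "West")
)
table(direction)

# Months as an ordinal factor: the level order carries meaning
# (month.abb is R's built-in vector "Jan", "Feb", ..., "Dec")
month_abbr <- c("Mar", "Jan", "Feb", "Jan")
month <- factor(month_abbr, levels = month.abb, ordered = TRUE)
month < "Mar"  # comparisons follow the level order, not the alphabet
```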

---

# **Quick practice**: What's the data type?

> Decide [here](https://docs.google.com/presentation/d/1J8UtyEwkA5QEcQQ9LCAs4EU1gyhPY7QIoIW_T_gys6o/edit?usp=sharing) (link also in #classroom)

```r
wildlife_impacts %>%
  filter(!is.na(cost_repairs_infl_adj)) %>%
  select(incident_date, time_of_day, species, cost_repairs_infl_adj) %>%
  head()
```

```
#> # A tibble: 6 × 4
#>   incident_date       time_of_day species                cost_repairs_infl_adj
#>   <dttm>              <chr>       <chr>                                  <dbl>
#> 1 2018-10-25 00:00:00 Day         Unknown bird - large                    1000
#> 2 2018-09-05 00:00:00 <NA>        Unknown bird - medium                    200
#> 3 2018-08-09 00:00:00 Day         Semipalmated sandpiper                 10000
#> 4 2018-06-24 00:00:00 Day         Unknown bird - large                  100000
#> 5 2018-02-18 00:00:00 Day         Rough-legged hawk                      20000
#> 6 2018-01-05 00:00:00 Night       Brant                                 487000
```


???

- incident_date:         Interval
- time_of_day:           Ordinal
- species:               Nominal
- cost_repairs_infl_adj: Ratio

---

# Week 4: .fancy[Centrality & Variability]

## 1. Data Types
## 2. .orange[Measures of Centrality & Variability]

## BREAK

## 3. Visualizing Centrality & Variability
## 4. Relationships Between 2 Variables
## 5. Exploratory Data Analysis

---

# .center[.font140[Summary Measures:]]

# This week: .red[Centrality] &  .blue[Variability]

# Next week: .green[Correlation]

---

# .red[Centrality (a.k.a. The "Average" Value)]

### A single number representing the _middle_ of a set of numbers

### **Mean**: `$\frac{\text{Sum of values}}{\text{# of values}}$`

### **Median**: "Middle" value (50% of data above & below)

### **Mode**: Most frequent value (usually for categorical data)
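A minimal sketch of the three measures in base R, using made-up values. Base R has no built-in function for the statistical mode (`mode()` returns the storage type), so `stat_mode()` below is a hypothetical helper written for this example.

```r
# Hypothetical example values
x <- c(2, 3, 3, 5, 7, 10, 40)

mean(x)    # sum(x) / length(x)
median(x)  # middle value after sorting

# Count each value and keep the most frequent one(s)
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[counts == max(counts)]
}

stat_mode(x)                         # returned as a character vector
stat_mode(c("Day", "Night", "Day"))  # also works for categories
```

If several values tie for the highest count, this helper returns all of them.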

---

# .center[Mean isn't always the "best" choice]

```r
wildlife_impacts %>%
    filter(!is.na(height)) %>%
    summarise(
      mean = mean(height),
      median = median(height))
```

```
#> # A tibble: 1 × 2
#>    mean median
#>   <dbl>  <dbl>
#> 1  984.     50
```

Percent of data below mean:

```
#> [1] "73.9%"
```
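The code behind that percentage is not shown. One possible approach, sketched here with made-up right-skewed values (not the `wildlife_impacts` data), relies on the fact that the mean of a logical vector is the proportion of `TRUE` values:

```r
# Hypothetical right-skewed values (not the wildlife_impacts data)
height <- c(0, 0, 0, 50, 50, 100, 500, 1000, 3000, 10000)

# A logical vector averages to the proportion of TRUE values
share_below <- mean(height < mean(height))

# Format as a percentage string
paste0(round(100 * share_below, 1), "%")
```

With real data that contains missing values, drop the `NA`s first (as the `filter()` call above does), otherwise `mean()` returns `NA`.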


???

On average, where do planes hit birds?
Saying ~1000 ft is misleading.
A strike is much more likely to happen under 100 ft.

---

# .center[Beware the "flaw of averages"]

### What happened to the statistician who crossed a river with an average depth of 3 feet?


### ...he drowned


---

# .blue[Variability ("Spread")]

### **Standard deviation**: distribution of values relative to the mean
### `$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$`

### **Interquartile range (IQR)**: `$Q_3 - Q_1$` (middle 50% of data)

### **Range**: max - min
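Each of these measures has a base R counterpart. A short sketch with made-up values:

```r
# Hypothetical example values
x <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 9)

sd(x)            # standard deviation (divides by N - 1)
IQR(x)           # Q3 - Q1
max(x) - min(x)  # range as a single number

# range() returns c(min, max), so diff() collapses it to max - min
diff(range(x))

# IQR() matches the difference between the 75th and 25th percentiles
quantile(x, 0.75) - quantile(x, 0.25)
```

Note that `quantile()` supports several interpolation rules; the default may give slightly different quartiles than a hand calculation from a textbook.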

---

# .center[.fancy[Example:] Days to ship]

Complaints are coming in about orders shipped from warehouse B, so you collect some data:

```r
daysToShip
```

```
#>    order warehouseA warehouseB
#> 1      1          3          1
#> 2      2          3          1
#> 3      3          3          1
#> 4      4          4          3
#> 5      5          4          3
#> 6      6          4          4
#> 7      7          5          5
#> 8      8          5          5
#> 9      9          5          5
#> 10    10          5          6
#> 11    11          5          7
#> 12    12          5         10
```
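The slides print `daysToShip` without showing how it was created. A sketch of rebuilding it by hand from the printed values, assuming a plain `data.frame` (the original object may well have been read from a file):

```r
# Reconstructed from the printed table; the original source is not shown
daysToShip <- data.frame(
  order      = 1:12,
  warehouseA = c(3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  warehouseB = c(1, 1, 1, 3, 3, 4, 5, 5, 5, 6, 7, 10)
)
daysToShip
```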


Here, **averages** are misleading:

```r
daysToShip %>%
    pivot_longer(warehouseA:warehouseB,
                 names_to = "warehouse", values_to = "days") %>%
    group_by(warehouse) %>%
    summarise(
*       mean   = mean(days),
*       median = median(days))
```

```
#> # A tibble: 2 × 3
#>   warehouse   mean median
#>   <chr>      <dbl>  <dbl>
#> 1 warehouseA  4.25    4.5
#> 2 warehouseB  4.25    4.5
```


---

# .center[.fancy[Example:] Days to ship]

Complaints are coming in about orders shipped from warehouse B, so you collect some data:

```r
daysToShip
```


**Variability** reveals difference in days to ship:

```r
daysToShip %>%
    pivot_longer(warehouseA:warehouseB,
                 names_to = "warehouse", values_to = "days") %>%
    group_by(warehouse) %>%
    summarise(
        mean   = mean(days),
        median = median(days),
*       range = max(days) - min(days),
*       sd    = sd(days))
```

```
#> # A tibble: 2 × 5
#>   warehouse   mean median range    sd
#>   <chr>      <dbl>  <dbl> <dbl> <dbl>
#> 1 warehouseA  4.25    4.5     2 0.866
#> 2 warehouseB  4.25    4.5     9 2.70
```


---

# .center[.fancy[Example:] Days to ship]

---

# Interpreting the standard deviation

### `$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$`
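To see what the formula does, it can be worked through one step at a time. The sketch below uses the warehouse A shipping times from the example above and checks the result against R's built-in `sd()`:

```r
# Warehouse A shipping times (days) from the example above
x <- c(3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5)

deviations <- x - mean(x)                      # x_i - x_bar
squared    <- deviations^2                     # (x_i - x_bar)^2
variance   <- sum(squared) / (length(x) - 1)   # divide by N - 1
s          <- sqrt(variance)

s
all.equal(s, sd(x))  # the built-in sd() uses the same N - 1 formula
```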


---

# Outliers

---

## **Mean** & **Standard Deviation** are sensitive to outliers

**Outliers**: values below `$Q_1 - 1.5 \times IQR$` or above `$Q_3 + 1.5 \times IQR$`

**Extreme values**: values below `$Q_1 - 3 \times IQR$` or above `$Q_3 + 3 \times IQR$`

```r
data1 <- c(3,3,4,5,5,6,6,7,8,9)
```

- Mean: 5.6
- Standard Deviation: 2.01
- Median: 5.5
- IQR: 2.5


```r
data2 <- c(3,3,4,5,5,6,6,7,8,20)
```

- .red[Mean: 6.7]
- .red[Standard Deviation: 4.95]
- .blue[Median: 5.5]
- .blue[IQR: 2.5]
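A sketch of applying the IQR rules to the second data set in base R. The variable names `outlier_fence` and `extreme_fence` are made up for this example, and the quartiles come from R's default `quantile()` method:

```r
data2 <- c(3, 3, 4, 5, 5, 6, 6, 7, 8, 20)

q   <- unname(quantile(data2, c(0.25, 0.75)))  # Q1 and Q3
iqr <- q[2] - q[1]

# Fences for outliers (1.5 * IQR) and extreme values (3 * IQR)
outlier_fence <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
extreme_fence <- c(q[1] - 3 * iqr,   q[2] + 3 * iqr)

# Values outside each pair of fences
data2[data2 < outlier_fence[1] | data2 > outlier_fence[2]]
data2[data2 < extreme_fence[1] | data2 > extreme_fence[2]]

# Non-robust vs. robust summaries of the same data
c(mean = mean(data2), median = median(data2))
c(sd = sd(data2), IQR = IQR(data2))
```

Here the value 20 falls outside both sets of fences, so it counts as an extreme value as well as an outlier.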


---

# .center[Robust statistics for continuous data]
# .center[(less sensitive to outliers)]

## .red[Centrality]: Use _median_ rather than _mean_

## .blue[Variability]: Use _IQR_ rather than _standard deviation_

---

# Practice with summary measurements

### 1) Read in the following data sets:

- `milk_production.csv`
- `lotr_words.csv`

### 2) For each variable in each data set, if possible, summarize its

### 1. .red[Centrality]
### 2. .blue[Variability]

---

# Break!

## Stand up, Move around, Stretch!

---

# Week 4: .fancy[Centrality & Variability]

## 1. Data Types
## 2. Measures of Centrality & Variability

## BREAK

## 3. .orange[Visualizing Centrality & Variability]
## 4. Relationships Between 2 Variables
## 5. Exploratory Data Analysis

---

# "Visualizing data helps us think"

---

# Anscombe's Quartet
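R ships with the quartet as the built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). A small base R sketch of how similar the four sets look through summary statistics alone, and how to plot them to see the differences (`quartet_summary` is just a name chosen for this example):

```r
# Built-in data frame with columns x1..x4 and y1..y4
data(anscombe)

# Summary statistics for each of the four x/y pairs
quartet_summary <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), cor_xy = cor(x, y))
})
colnames(quartet_summary) <- paste("set", 1:4)
round(quartet_summary, 2)

# The differences only show up when the points are plotted
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)  # restore the previous plot layout
```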

---

# The data _type_ determines <br> how to summarize it

---

### **Nominal<br>(Categorical)**

**Measures**:
- Frequency counts /<br>Proportions
<br>
<br>
<br>
<br>

**Charts**:
- Bars


### **Ordinal<br>(Categorical)**

**Measures**:
- Frequency counts /<br>Proportions
- .red[Centrality]:<br>Median, Mode
- .blue[Variability]: IQR
<br>

**Charts**:
- Bars


### **Numerical<br>(Continuous)**

**Measures**:
- .red[Centrality]:<br>Mean, median
- .blue[Variability]: Range, standard deviation, IQR
<br>
<br>

**Charts**:
- Histogram
- Boxplot


---

## Summarizing **Nominal** data

Summarize with counts / percentages

```r
wildlife_impacts %>%
*   count(operator, sort = TRUE) %>%
*   mutate(p = n / sum(n))
```

```
#> # A tibble: 4 × 3
#>   operator               n     p
#>   <chr>              <int> <dbl>
#> 1 SOUTHWEST AIRLINES 17970 0.315
#> 2 UNITED AIRLINES    15116 0.265
#> 3 AMERICAN AIRLINES  14887 0.261
#> 4 DELTA AIR LINES     9005 0.158
```


Visualize with bars

```r
wildlife_impacts %>%
    count(operator, sort = TRUE) %>%
*   ggplot() +
*   geom_col(aes(x = n, y = reorder(operator, n)),
*            width = 0.7) +
    labs(x = "Count", y = "Operator")
```


---

## Summarizing **Ordinal** data

**Summarize**: Counts / percentages

```r
wildlife_impacts %>%
*   count(incident_month, sort = TRUE) %>%
*   mutate(p = n / sum(n))
```

```
#> # A tibble: 12 × 3
#>    incident_month     n      p
#>             <dbl> <int>  <dbl>
#>  1              9  7980 0.140 
#>  2             10  7754 0.136 
#>  3              8  7104 0.125 
#>  4              5  6161 0.108 
#>  5              7  6133 0.108 
#>  6              6  4541 0.0797
#>  7              4  4490 0.0788
#>  8             11  4191 0.0736
#>  9              3  2678 0.0470
#> 10             12  2303 0.0404
#> 11              1  1951 0.0342
#> 12              2  1692 0.0297
```


**Visualize**: Bars

```r
wildlife_impacts %>%
    count(incident_month, sort = TRUE) %>%
*   ggplot() +
*   geom_col(aes(x = as.factor(incident_month),
*                y = n), width = 0.7) +
    labs(x = "Incident month")
```


---

## Summarizing **continuous** variables

**Histograms**:

- Skewness
- Number of modes

<br>

**Boxplots**:

- Outliers
- Comparing variables


---

## **Histogram**: Identify Skewness & # of Modes

**Summarise**:<br>Mean, median, sd, range, & IQR:

```r
summary(wildlife_impacts$height)
```

```
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>     0.0     0.0    50.0   983.8  1000.0 25000.0   18038
```
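`summary()` reports the quartiles, mean, and `NA` count, but not the standard deviation or the IQR as single numbers. A sketch of computing those separately, using made-up values with missing entries (not the `wildlife_impacts` data):

```r
# Hypothetical heights with missing values (not the wildlife_impacts data)
height <- c(0, 0, 50, 100, 1000, 4500, NA, NA)

summary(height)  # min, quartiles, median, mean, max, and NA count

# sd and IQR are not part of summary(); na.rm = TRUE drops the NAs
c(
  sd    = sd(height, na.rm = TRUE),
  IQR   = IQR(height, na.rm = TRUE),
  range = diff(range(height, na.rm = TRUE))
)
```

Without `na.rm = TRUE`, `sd()` returns `NA` and `IQR()` stops with an error when the data contain missing values.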


**Visualize**:<br>Histogram (identify skewness & modes)

```r
ggplot(wildlife_impacts) +
* geom_histogram(aes(x = height), bins = 50) +
  labs(x = 'Height (ft)', y = 'Count')
```


---

## **Histogram**: Identify Skewness & # of Modes

**Height**

```r
ggplot(wildlife_impacts) +
* geom_histogram(aes(x = height), bins = 50) +
  labs(x = 'Height (ft)', y = 'Count')
```


**Speed**

```r
ggplot(wildlife_impacts) +
* geom_histogram(aes(x = speed), bins = 50) +
  labs(x = 'Speed (mph)', y = 'Count')
```


---

## **Boxplot**: Identify outliers

**Height**

```r
ggplot(wildlife_impacts) +
*   geom_boxplot(aes(x = height)) +
    labs(x = 'Height (ft)', y = NULL)
```

<img src="figs/wildlife-height-boxplot-1.png" width="504" style="display: block; margin: auto;" />

**Speed**

```r
ggplot(wildlife_impacts) +
*   geom_boxplot(aes(x = speed)) +
    labs(x = 'Speed (mph)', y = NULL)
```


---

## Histogram

- Skewness
- Modes


## Boxplot

- Outliers


---

# Practicing visual summaries

1) Read in the following data sets:

- `faithful.csv`
- `marathon.csv`

2) Summarize the following variables using an appropriate chart (bar chart, histogram, and / or boxplot):

- faithful: `eruptions`
- faithful: `waiting`
- marathon: `Age`
- marathon: `State`
- marathon: `Country`
- marathon: `` `Official Time` ``


---

# Week 4: .fancy[Centrality & Variability]

## 1. Data Types
## 2. Measures of Centrality & Variability

## BREAK

## 3. Visualizing Centrality & Variability
## 4. .orange[Relationships Between 2 Variables]
## 5. Exploratory Data Analysis

---

## Two **Categorical** Variables

Summarize with a table of counts

```r
wildlife_impacts %>%
*   count(operator, time_of_day)
```

```
#> # A tibble: 20 × 3
#>    operator           time_of_day     n
#>    <chr>              <chr>       <int>
#>  1 AMERICAN AIRLINES  Dawn          458
#>  2 AMERICAN AIRLINES  Day          7809
#>  3 AMERICAN AIRLINES  Dusk          584
#>  4 AMERICAN AIRLINES  Night        3710
#>  5 AMERICAN AIRLINES  <NA>         2326
#>  6 DELTA AIR LINES    Dawn          267
#>  7 DELTA AIR LINES    Day          4846
#>  8 DELTA AIR LINES    Dusk          353
#>  9 DELTA AIR LINES    Night        2090
#> 10 DELTA AIR LINES    <NA>         1449
#> 11 SOUTHWEST AIRLINES Dawn          394
#> 12 SOUTHWEST AIRLINES Day          9109
#> 13 SOUTHWEST AIRLINES Dusk          599
#> 14 SOUTHWEST AIRLINES Night        5425
#> 15 SOUTHWEST AIRLINES <NA>         2443
#> 16 UNITED AIRLINES    Dawn          151
#> 17 UNITED AIRLINES    Day          3359
#> 18 UNITED AIRLINES    Dusk          181
#> 19 UNITED AIRLINES    Night        1510
#> 20 UNITED AIRLINES    <NA>         9915
```


---

## Two **Categorical** Variables

Convert to "wide" format with `pivot_wider()` to make it easier to compare values

```r
wildlife_impacts %>%
    count(operator, time_of_day) %>%
*   pivot_wider(names_from = time_of_day, values_from = n)
```

```
#> # A tibble: 4 × 6
#>   operator            Dawn   Day  Dusk Night  `NA`
#>   <chr>              <int> <int> <int> <int> <int>
#> 1 AMERICAN AIRLINES    458  7809   584  3710  2326
#> 2 DELTA AIR LINES      267  4846   353  2090  1449
#> 3 SOUTHWEST AIRLINES   394  9109   599  5425  2443
#> 4 UNITED AIRLINES      151  3359   181  1510  9915
```


---

## Two **Categorical** Variables

Visualize with bars:<br>map **fill** to denote 2nd categorical var

```r
wildlife_impacts %>%
  count(operator, time_of_day) %>%
  ggplot() +
  geom_col(
    aes(
      x = n,
      y = reorder(operator, n),
*     fill = reorder(time_of_day, n)
    ), 
    width = 0.7,
*   position = 'dodge') +
  theme(legend.position = "bottom") +
  labs(
    fill = "Time of day", 
    y = "Airline"
  )
```

---

## Two **Continuous** Variables

Visualize with a scatterplot, looking for _clustering_ and/or a _correlational_ relationship

```r
ggplot(wildlife_impacts) +
  geom_point(
    aes(
      x = speed, 
      y = height  
    ),
    size = 0.5) +
  labs(
    x = 'Speed (mph)',
    y = 'Height (ft)'
  )
```


---

## One **Continuous**, One **Categorical**

Visualize with **boxplot**

```r
ggplot(wildlife_impacts) +
  geom_boxplot(
    aes(
      x = speed, 
      y = operator)
    ) + 
  labs(
    x = 'Speed (mph)',
    y = 'Airline'
  )
```


---

# Practice with visualizing _relationships_

1) Read in the following data sets:

- `marathon.csv`
- `wildlife_impacts.csv`

2) Visualize the _relationships_ between the following variables using an appropriate chart (bar plots, scatterplots, and / or box plots):

- marathon: `Age` & `Official Time`
- marathon: `Country` & `Official Time`
- wildlife_impacts: `state` & `operator`

---

# Week 4: .fancy[Centrality & Variability]

## 1. Data Types
## 2. Measures of Centrality & Variability

## BREAK

## 3. Visualizing Centrality & Variability
## 4. Relationships Between 2 Variables
## 5. .orange[Exploratory Data Analysis]

---

# Exploratory Analysis

### Goal: **Form** hypotheses.
### Improves quality of **questions**.
### _(do this in THIS class)_


# Confirmatory Analysis

### Goal: **Test** hypotheses.
### Improves quality of **answers**.
### _(do this in your stats classes)_


---

# Don't be Icarus

---

## "Far better an approximate answer to the _right_ question, which is often vague, than an exact answer to the _wrong_ question, which can always be made precise."
## — John Tukey

---

**EDA is an iterative process to help you<br>_understand_ your data and ask better questions**

---

## Visualizing variation

Ask yourself:

- What type of **variation** occurs within my variables?
- What type of **covariation** occurs between my variables?

Check out [these guides](https://eda.seas.gwu.edu/2022-Fall/help/visualizing-data.html#choosing-the-right-chart)


---

# Practice doing EDA

1) Read in the `candy_rankings.csv` data set

2) Preview the data, note the data types and what each variable is.

3) Visualize (at least) three _relationships_ between two variables (guided by a question) using an appropriate chart:

- Bar chart
- Scatterplot
- Boxplot

---

# Start thinking about research questions

---

# Writing a research question

Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question) - your question should be:

- **Clear**: your audience can easily understand its purpose without additional explanation.
- **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report.
- **Concise**: it is expressed in the fewest possible words.
- **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data.
- **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?)

---

# Writing a research question

**Look at examples**: See the [Example Projects](https://eda.seas.gwu.edu/2022-Fall/help/example-projects.html) page

---

# Start now!

## [Mini Project 1](https://eda.seas.gwu.edu/2022-Fall/project-mini/1-data-cleaning.html): Due next week (9/27)

## [Project Proposal](https://eda.seas.gwu.edu/2022-Fall/project-final/1-proposal.html): Due in two weeks (10/04)