Tidy Data


# Week 2: .fancy[Tidy Data]

### EMSE 4572/6572: Exploratory Data Analysis
### John Paul Helveston
### September 06, 2023


---

# Week 2: .fancy[Tidy Data]

## 1. Tidy Data
## 2. Tidy Data Wrangling

## BREAK

## 3. Tidy Data Visualization
## 4. Data Provenance & Curation
## 5. Writing a Research Question

---

# Week 2: .fancy[Tidy Data]

## 1. .orange[Tidy Data]
## 2. Tidy Data Wrangling

## BREAK

## 3. Tidy Data Visualization
## 4. Data Provenance & Curation
## 5. Writing a Research Question

---

## .center[Federal R&D Spending by Department]

```
#> # A tibble: 6 × 15
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS Interior  NASA   NIH   NSF Other  USDA    VA
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226     1152 12513  8025  2372  1191  1837   404
#> 2  1977     0   837 37967 13741  1095   966  9507     1082 12553  8214  2395  1280  1796   374
#> 3  1978     0   871 37022 15663  1156  1175 10533     1125 12516  8802  2446  1237  1962   356
#> 4  1979     0   952 37174 15612  1004  1102 10127     1176 13079  9243  2404  2321  2054   353
#> 5  1980     0   945 37005 15226  1048   903 10045     1082 13837  9093  2407  2468  1887   359
#> 6  1981     0   829 41737 14798   978   901  9644      990 13276  8580  2300  1925  1964   382
```

---

## .center[Federal R&D Spending by Department]

# "Wide" format


# "Long" format

```
#> # A tibble: 6 × 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```


---

## .center[Federal R&D Spending by Department]

# "Wide" format

```
#> [1] 42 15
```


# "Long" format

```
#> [1] 588   3
```


---

# .center[Tidy data = "Long" format]

- Each **variable** has its own **column**
- Each **observation** has its own **row**

---

# Tidy data

- Each **variable** has its own **column**
- Each **observation** has its own **row**
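
A minimal sketch of what this looks like in R, using four values copied from the spending table shown earlier. The object names `spend_long` and `nasa` are illustrative only and are not part of the course data.

```r
library(dplyr)
library(tibble)

# One row per observation (a department in a given year),
# one column per variable (department, year, rd_budget_mil)
spend_long <- tribble(
  ~department, ~year, ~rd_budget_mil,
  "DOD",        1976,          35696,
  "NASA",       1976,          12513,
  "DOD",        1977,          37967,
  "NASA",       1977,          12553
)

# The column names describe the values they hold
names(spend_long)

# Because each variable is a column, it can be referenced by name,
# e.g. to keep only the NASA observations
nasa <- spend_long %>%
  filter(department == "NASA")

nasa
```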


---

# "Long" format


# "Wide" format


---

# .center[**Do the names describe the values?**]

## **Yes**: "Long" format


## **No**: "Wide" format

```
#> # A tibble: 6 × 8
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226
#> 2  1977     0   837 37967 13741  1095   966  9507
#> 3  1978     0   871 37022 15663  1156  1175 10533
#> 4  1979     0   952 37174 15612  1004  1102 10127
#> 5  1980     0   945 37005 15226  1048   903 10045
#> 6  1981     0   829 41737 14798   978   901  9644
```


---

# **Quick practice 1**: "long" or "wide" format?

**Description**: Tuberculosis cases in various countries

```
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
```

---

# **Quick practice 2**: "long" or "wide" format?

**Description**: Word counts in LOTR trilogy

```
#> # A tibble: 9 × 4
#>   Film                       Race   Female  Male
#>   <chr>                      <chr>   <dbl> <dbl>
#> 1 The Fellowship Of The Ring Elf      1229   971
#> 2 The Fellowship Of The Ring Hobbit     14  3644
#> 3 The Fellowship Of The Ring Man         0  1995
#> 4 The Return Of The King     Elf       183   510
#> 5 The Return Of The King     Hobbit      2  2673
#> 6 The Return Of The King     Man       268  2459
#> 7 The Two Towers             Elf       331   513
#> 8 The Two Towers             Hobbit      0  2463
#> 9 The Two Towers             Man       401  3589
```

---

# **Quick practice 3**: "long" or "wide" format?

**Description**: Word counts in LOTR trilogy

```
#> # A tibble: 15 × 4
#>    Film                       Race   Gender Word_Count
#>    <chr>                      <chr>  <chr>       <dbl>
#>  1 The Fellowship Of The Ring Elf    Female       1229
#>  2 The Fellowship Of The Ring Elf    Male          971
#>  3 The Fellowship Of The Ring Hobbit Female         14
#>  4 The Fellowship Of The Ring Hobbit Male         3644
#>  5 The Fellowship Of The Ring Man    Female          0
#>  6 The Fellowship Of The Ring Man    Male         1995
#>  7 The Return Of The King     Elf    Female        183
#>  8 The Return Of The King     Elf    Male          510
#>  9 The Return Of The King     Hobbit Female          2
#> 10 The Return Of The King     Hobbit Male         2673
#> 11 The Return Of The King     Man    Female        268
#> 12 The Return Of The King     Man    Male         2459
#> 13 The Two Towers             Elf    Female        331
#> 14 The Two Towers             Elf    Male          513
#> 15 The Two Towers             Hobbit Female          0
```

---

# Reshaping data with

## `pivot_longer()` and `pivot_wider()`

---

background-color: #fff

# Reshaping data

## `pivot_longer()`<br>`pivot_wider()`


---

## .center[From "long" to "wide" with `pivot_wider()`]

---

## .center[From "long" to "wide" with `pivot_wider()`]

```r
head(fed_spend_long)
```


```r
fed_spend_wide <- fed_spend_long %>%
    pivot_wider(
*       names_from = department,
*       values_from = rd_budget_mil)

head(fed_spend_wide)
```

```
#> # A tibble: 6 × 15
#>    year   DOD  NASA   DOE   HHS   NIH   NSF  USDA Interior   DOT   EPA   DOC   DHS    VA Other
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976 35696 12513 10882  9226  8025  2372  1837     1152  1142   968   819     0   404  1191
#> 2  1977 37967 12553 13741  9507  8214  2395  1796     1082  1095   966   837     0   374  1280
#> 3  1978 37022 12516 15663 10533  8802  2446  1962     1125  1156  1175   871     0   356  1237
#> 4  1979 37174 13079 15612 10127  9243  2404  2054     1176  1004  1102   952     0   353  2321
#> 5  1980 37005 13837 15226 10045  9093  2407  1887     1082  1048   903   945     0   359  2468
#> 6  1981 41737 13276 14798  9644  8580  2300  1964      990   978   901   829     0   382  1925
```


---

## .center[From "wide" to "long" with `pivot_longer()`]

---

## .center[From "wide" to "long" with `pivot_longer()`]

```r
head(fed_spend_wide)
```


```r
fed_spend_long <- fed_spend_wide %>%
    pivot_longer( 
*       names_to = "department",
*       values_to = "rd_budget_mil",
*       cols = DOD:Other)

head(fed_spend_long)
```

```
#> # A tibble: 6 × 3
#>    year department rd_budget_mil
#>   <dbl> <chr>              <dbl>
#> 1  1976 DOD                35696
#> 2  1976 NASA               12513
#> 3  1976 DOE                10882
#> 4  1976 HHS                 9226
#> 5  1976 NIH                 8025
#> 6  1976 NSF                 2372
```
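
As a rough check that the two functions undo each other, the sketch below runs a round trip on a toy table built from four values in the tables above. The names `toy_long`, `toy_wide`, and `toy_back` are made up for this illustration.

```r
library(tidyr)
library(tibble)

# Toy long table: 2 years x 2 departments
toy_long <- tribble(
  ~year, ~department, ~rd_budget_mil,
   1976, "DOD",               35696,
   1976, "NASA",              12513,
   1977, "DOD",               37967,
   1977, "NASA",              12553
)

# Long -> wide: one column per department
toy_wide <- toy_long %>%
  pivot_wider(
    names_from = department,
    values_from = rd_budget_mil)

# Wide -> long again
toy_back <- toy_wide %>%
  pivot_longer(
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = -year)

dim(toy_wide)
dim(toy_back)

# Compare the round-trip result with the starting table
all.equal(as.data.frame(toy_long), as.data.frame(toy_back))
```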


---

## You can also set `cols` by specifying which columns _not_ to pivot

```r
names(fed_spend_wide)
```

```
#>  [1] "year"     "DOD"      "NASA"     "DOE"      "HHS"      "NIH"      "NSF"      "USDA"     "Interior" "DOT"      "EPA"      "DOC"      "DHS"      "VA"       "Other"
```


```r
fed_spend_long <- fed_spend_wide %>%
    pivot_longer(
        names_to = "department", 
        values_to = "rd_budget_mil",
*       cols = -year)

head(fed_spend_long)
```
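
For comparison, here is a small sketch with a two-department toy table (`toy_wide` is a made-up name) showing that excluding `year` with `-year`, negating it with `!year`, and naming the range `DOD:NASA` all select the same columns. It assumes a reasonably recent version of tidyr, where `cols` accepts tidyselect syntax.

```r
library(tidyr)
library(tibble)

toy_wide <- tribble(
  ~year,  ~DOD, ~NASA,
   1976, 35696, 12513,
   1977, 37967, 12553
)

# Exclude the column that should stay as-is
by_exclusion <- toy_wide %>%
  pivot_longer(
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = -year)

# Same idea using tidyselect negation
by_negation <- toy_wide %>%
  pivot_longer(
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = !year)

# Name the range of columns to pivot
by_range <- toy_wide %>%
  pivot_longer(
    names_to = "department",
    values_to = "rd_budget_mil",
    cols = DOD:NASA)

identical(by_exclusion, by_negation)
identical(by_exclusion, by_range)
```

Excluding columns tends to be more convenient when there are many columns to pivot and only a few to keep.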


---

# Your turn: Reshaping Data

Open the `practice.qmd` file.

Run the code chunk to read in the following two data files:

- `pv_cell_production.xlsx`: Data on solar photovoltaic cell production by country
- `milk_production.csv`: Data on milk production by state

Now modify the format of each:

- If the data are in "wide" format, convert them to "long" with `pivot_longer()`
- If the data are in "long" format, convert them to "wide" with `pivot_wider()`

---

# Week 2: .fancy[Tidy Data]

## 1. Tidy Data
## 2. .orange[Tidy Data Wrangling]

## BREAK

## 3. Tidy Data Visualization
## 4. Data Provenance & Curation
## 5. Writing a Research Question

---

# Why do we need tidy data?

(a quick explanation with cute graphics, by [Allison Horst](https://github.com/allisonhorst/stats-illustrations))

---

class: center
background-image: url("images/horst_tidydata_1.jpg")
background-size: contain

---

class: center
background-image: url("images/horst_tidydata_2.jpg")
background-size: contain

---

class: center
background-image: url("images/horst_tidydata_3.jpg")
background-size: contain

---

# Tidy data wrangling

Compute the total R&D spending in each year

```r
head(fed_spend_wide)
```

---

# Tidy data wrangling

Compute the total R&D spending in each year

**Approach 1**: Create a new `total` variable by adding up each department column

```r
fed_spend_wide %>%
  mutate(total = DHS + DOC + DOD + DOE + DOT + EPA + HHS + Interior + NASA + NIH + NSF + Other + USDA + VA) %>%
  select(year, total)
```

```
#> # A tibble: 42 × 2
#>     year  total
#>    <dbl>  <dbl>
#>  1  1976  86227
#>  2  1977  91807
#>  3  1978  94864
#>  4  1979  96601
#>  5  1980  96305
#>  6  1981  98304
#>  7  1982  95448
#>  8  1983  95010
#>  9  1984 105371
#> 10  1985 114818
#> # … with 32 more rows
```

---

# Tidy data wrangling

Compute the total R&D spending in each year

**Approach 2**: Reshape first, then summarise

```r
fed_spend_long <- fed_spend_wide %>%
    pivot_longer(
        names_to = "department", 
        values_to = "rd_budget_mil",
        cols = -year)

head(fed_spend_long)
```


```r
fed_spend_long %>%
    group_by(year) %>%
    summarise(total = sum(rd_budget_mil))
```


---

# Tidy data wrangling

Compute the total R&D spending in each year

**Approach 2**: Reshape first, then summarise

```r
total <- fed_spend_wide %>%
    pivot_longer(
        names_to = "department", 
        values_to = "rd_budget_mil",
        cols = -year) %>% 
    group_by(year) %>%
    summarise(total = sum(rd_budget_mil))
```


```r
head(total)
```

```
#> # A tibble: 6 × 2
#>    year total
#>   <dbl> <dbl>
#> 1  1976 86227
#> 2  1977 91807
#> 3  1978 94864
#> 4  1979 96601
#> 5  1980 96305
#> 6  1981 98304
```
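
One benefit of the long format is that changing the question only changes the grouping variable. As a sketch, the toy table below (four values copied from the tables above; `toy_long` and `dept_totals` are made-up names) is summarised by department instead of by year.

```r
library(dplyr)
library(tibble)

toy_long <- tribble(
  ~year, ~department, ~rd_budget_mil,
   1976, "DOD",               35696,
   1976, "NASA",              12513,
   1977, "DOD",               37967,
   1977, "NASA",              12553
)

# Same pattern as before, grouped by department instead of year
dept_totals <- toy_long %>%
    group_by(department) %>%
    summarise(total = sum(rd_budget_mil))

dept_totals
```

With the wide format, the same summary would require summing each department column separately.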


---

# Your turn: Tidy Data Wrangling

Open the `practice.qmd` file.

Run the code chunk to read in the following two data files:

- `gapminder.csv`: Life expectancy in different countries over time
- `gdp.csv`: GDP of different countries over time

Now convert the data into a tidy (long) structure, then create the following summary data frames:

- Mean life expectancy in each year.
- Mean GDP in each year.

---

# .fancy[Break]

---

# Week 2: .fancy[Tidy Data]

## 1. Tidy Data
## 2. Tidy Data Wrangling

## BREAK

## 3. .orange[Tidy Data Visualization]
## 4. Data Provenance & Curation
## 5. Writing a Research Question

---

# Tidy data visualization

Make a bar chart of total R&D spending by agency

```r
head(fed_spend_wide)
```


---

# Tidy data visualization

Make a bar chart of total R&D spending by agency

```r
ggplot(fed_spend_wide) +
* geom_col(aes(x = rd_budget_mil, y = department)) +
  theme_bw() +
  labs(
      x = "R&D Spending ($Millions)",
      y = "Federal Agency"
  )
```

```
#> Error in `geom_col()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error in `FUN()`:
#> ! object 'rd_budget_mil' not found
```


---

# Tidy data visualization

Make a bar chart of total R&D spending by agency

```r
fed_spend_wide %>%
* pivot_longer(
*   names_to = "department",
*   values_to = "rd_budget_mil",
*   cols = -year
  ) %>%
  ggplot() +
  geom_col(aes(x = rd_budget_mil, y = department)) +
  theme_bw() +
  labs(
    x = "R&D Spending ($Millions)",
    y = "Federal Agency"
  )
```
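
Note that `geom_col()` stacks one segment per row, so each bar in the chart above is built from one segment per year. A possible variation, sketched here on a toy table (`toy_long`, `dept_totals`, and `p` are made-up names), is to summarise first so that each department has a single value, which also makes it easy to sort the bars.

```r
library(dplyr)
library(ggplot2)
library(tibble)

toy_long <- tribble(
  ~year, ~department, ~rd_budget_mil,
   1976, "DOD",               35696,
   1976, "NASA",              12513,
   1977, "DOD",               37967,
   1977, "NASA",              12553
)

# One total per department
dept_totals <- toy_long %>%
  group_by(department) %>%
  summarise(total = sum(rd_budget_mil))

# Sort the bars by total spending
p <- ggplot(dept_totals) +
  geom_col(aes(x = total, y = reorder(department, total))) +
  theme_bw() +
  labs(
    x = "R&D Spending ($Millions)",
    y = "Federal Agency"
  )

p
```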


---

# Your turn: Tidy Data Visualization

Run the code chunk to read in the two data files, then convert the data into a tidy (long) structure to create the following charts:


---

# Week 2: .fancy[Tidy Data]

## 1. Tidy Data
## 2. Tidy Data Wrangling

## BREAK

## 3. Tidy Data Visualization
## 4. .orange[Data Provenance & Curation]
## 5. Writing a Research Question

---

### Data provenance - It matters where you get your data

**Validity**:

- Is this data trustworthy? Is it authentic?
- Where did the data come from?
- How has the data been changed / managed over time?
- Is the data complete?

**Comprehension**:

- Is this data accurate?
- Can you explain your results?
- Is this the right data to answer your question?

**Reproducibility**:

- I should be able to fully replicate your results from your raw data and code.

---

## **Document your source like a museum curator**

**Example**: View `README.md` file in the `data` folder

Whenever you download data, you should **at a minimum** record the following:

- The name of the file you are describing.
- The date you downloaded it.
- The original name of the downloaded file (in case you renamed it).
- The URL of the site you downloaded it from.
- The source of the _original_ data (sometimes different from the site you downloaded it from).
- A short description of the data and, if available, how they were collected.
- A dictionary for the data (e.g. a simple markdown table describing each variable).
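
One possible way to draft the data dictionary is to build it as a small data frame and print it as a markdown table. The sketch below assumes the `knitr` package is installed. The variable names and types come from the long-format spending data shown earlier, but the descriptions are illustrative guesses, not the official documentation.

```r
library(tibble)

# Hypothetical dictionary entries; the descriptions are assumptions
dictionary <- tribble(
  ~variable,       ~type,       ~description,
  "department",    "character", "Department or agency abbreviation",
  "year",          "numeric",   "Year of the budget",
  "rd_budget_mil", "numeric",   "R&D budget in millions of dollars"
)

# Print as a markdown (pipe) table that can be pasted into README.md
knitr::kable(dictionary, format = "pipe")
```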

---

# Your turn

Documentation in the "data/README.md" file is missing for the following data sets:

- wildlife_impacts.csv: [source](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-07-23)
- north_america_bear_killings.txt: [source](https://data.world/makeovermonday/2019w21)
- uspto_clean_energy_patents.xlsx: [source](https://www.nsf.gov/statistics/2018/nsb20181/report/sections/industry-technology-and-the-global-marketplace/global-trends-in-sustainable-energy-research-and-technologies)

Go to the above sites and add the following information to the "data/README.md" file:

- The name of the downloaded file.
- The web address to the site you downloaded the data from.
- The source of the _original_ data (if different from the website).
- A short description of the data and how they were collected.
- A dictionary for the data (hint: the site might already have this!).

---

# Week 2: .fancy[Tidy Data]

## 1. Tidy Data
## 2. Tidy Data Wrangling

## BREAK

## 3. Tidy Data Visualization
## 4. Data Provenance & Curation
## 5. .orange[Writing a Research Question]

---

# Writing a research question

Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question) - your question should be:

- **Clear**: your audience can easily understand its purpose without additional explanation.

- **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report.

- **Concise**: it is expressed in the fewest possible words.

- **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data.

- **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?)

---

# Writing a research question

**Bad question: Why are social networking sites harmful?**

- Unclear: it does not specify _which_ social networking sites or state what harm is being caused; assumes that "harm" exists.

**Improved question: How are online users experiencing or addressing privacy issues on social networking sites such as Facebook and Twitter?**

- Specifies the sites (Facebook and Twitter), type of harm (privacy issues), and who is harmed (online users).

---

# Writing a research question

**Examples from previous classes**:

- [Genders in the Workforce](https://eda.seas.gwu.edu/showcase/2021-Spring/gender_pay_gap.html): How has the US gender wage gap changed over time for different occupations and age groups?
- [NFL Suspensions](https://eda.seas.gwu.edu/showcase/2021-Spring/nfl_suspensions.html): What factors contribute to the severity of disciplinary actions towards NFL players from 2002-2014?

**Other good examples**: See the [Example Projects](https://eda.seas.gwu.edu/2023-Fall/project/examples.html) page

---

# Use [this link](https://docs.google.com/spreadsheets/d/1r1hSkC8oql3aurOLUEmYfGovqEL5FYc9puAdNw1guJk/edit?usp=sharing) to form teams