class: middle, inverse .leftcol30[ <center> <img src="https://eda.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 2: .fancy[Tidy Data] ###
EMSE 4572/6572: Exploratory Data Analysis ###
John Paul Helveston ###
September 04, 2024 ] --- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. Tidy Data ## 2. Tidy Data Wrangling ## BREAK ## 3. Tidy Data Visualization ## 4. Data Provenance & Curation ## 5. Writing a Research Question --- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. .orange[Tidy Data] ## 2. Tidy Data Wrangling ## BREAK ## 3. Tidy Data Visualization ## 4. Data Provenance & Curation ## 5. Writing a Research Question --- ## .center[Federal R&D Spending by Department] ``` #> # A tibble: 6 × 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` --- ## .center[Federal R&D Spending by Department] .leftcol60[.code70[ # "Wide" format ``` #> # A tibble: 6 × 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ]] -- .rightcol40[.code70[ # "Long" format ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> 
<chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ]] --- ## .center[Federal R&D Spending by Department] .leftcol60[.code70[ # "Wide" format ``` #> # A tibble: 6 × 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ``` #> [1] 42 15 ``` ]] .rightcol40[.code70[ # "Long" format ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ``` #> [1] 588 3 ``` ]] --- # .center[Tidy data = "Long" format] - Each **variable** has its own **column** - Each **observation** has its own **row** <center> <img src="images/tidy-data.png" width = "1000"> </center> --- .leftcol[ # Tidy data - Each **variable** has its own **column** - Each **observation** has its own **row** ] .rightcol[ ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ] <center> <img src="images/tidy-data.png" width = "1000"> </center> --- .leftcol40[.code70[ # "Long" format ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 
2372 ``` ]] .rightcol60[.code70[ # "Wide" format ``` #> # A tibble: 6 × 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ]] --- # .center[**Do the names describe the values?**] .leftcol40[.code70[ ## **Yes**: "Long" format ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ]] .rightcol60[.code70[ ## **No**: "Wide" format ``` #> # A tibble: 6 × 8 #> year DHS DOC DOD DOE DOT EPA HHS #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 #> 2 1977 0 837 37967 13741 1095 966 9507 #> 3 1978 0 871 37022 15663 1156 1175 10533 #> 4 1979 0 952 37174 15612 1004 1102 10127 #> 5 1980 0 945 37005 15226 1048 903 10045 #> 6 1981 0 829 41737 14798 978 901 9644 ``` ]] --- # **Quick practice 1**: "long" or "wide" format? **Description**: Tuberculosis cases in various countries .code100[ ``` #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ] --- # **Quick practice 2**: "long" or "wide" format? 
**Description**: Word counts in LOTR trilogy .code90[ ``` #> # A tibble: 9 × 4 #> Film Race Female Male #> <chr> <chr> <dbl> <dbl> #> 1 The Fellowship Of The Ring Elf 1229 971 #> 2 The Fellowship Of The Ring Hobbit 14 3644 #> 3 The Fellowship Of The Ring Man 0 1995 #> 4 The Return Of The King Elf 183 510 #> 5 The Return Of The King Hobbit 2 2673 #> 6 The Return Of The King Man 268 2459 #> 7 The Two Towers Elf 331 513 #> 8 The Two Towers Hobbit 0 2463 #> 9 The Two Towers Man 401 3589 ``` ] --- # **Quick practice 3**: "long" or "wide" format? **Description**: Word counts in LOTR trilogy ``` #> # A tibble: 15 × 4 #> Film Race Gender Word_Count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Fellowship Of The Ring Man Female 0 #> 6 The Fellowship Of The Ring Man Male 1995 #> 7 The Return Of The King Elf Female 183 #> 8 The Return Of The King Elf Male 510 #> 9 The Return Of The King Hobbit Female 2 #> 10 The Return Of The King Hobbit Male 2673 #> 11 The Return Of The King Man Female 268 #> 12 The Return Of The King Man Male 2459 #> 13 The Two Towers Elf Female 331 #> 14 The Two Towers Elf Male 513 #> 15 The Two Towers Hobbit Female 0 ``` --- class: inverse, center, middle # Reshaping data with ## `pivot_longer()` and `pivot_wider()` --- background-color: #fff .leftcol40[ # Reshaping data ## `pivot_longer()`<br>`pivot_wider()` ] .rightcol60[ <center> <img src="images/tidyr-pivoting.gif" width=530> </center> ] --- ## .center[From "long" to "wide" with `pivot_wider()`] <center> <img src="images/tidy-wider.png" width=600> </center> --- ## .center[From "long" to "wide" with `pivot_wider()`] .leftcol45[ ``` r head(fed_spend_long) ``` ``` #> # A tibble: 6 × 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 
1976 8025 #> 6 NSF 1976 2372 ``` ] .rightcol55[ ``` r fed_spend_wide <- fed_spend_long %>% pivot_wider( * names_from = department, * values_from = rd_budget_mil) head(fed_spend_wide) ``` ``` #> # A tibble: 6 × 15 #> year DOD NASA DOE HHS NIH NSF USDA Interior DOT EPA DOC DHS VA Other #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 35696 12513 10882 9226 8025 2372 1837 1152 1142 968 819 0 404 1191 #> 2 1977 37967 12553 13741 9507 8214 2395 1796 1082 1095 966 837 0 374 1280 #> 3 1978 37022 12516 15663 10533 8802 2446 1962 1125 1156 1175 871 0 356 1237 #> 4 1979 37174 13079 15612 10127 9243 2404 2054 1176 1004 1102 952 0 353 2321 #> 5 1980 37005 13837 15226 10045 9093 2407 1887 1082 1048 903 945 0 359 2468 #> 6 1981 41737 13276 14798 9644 8580 2300 1964 990 978 901 829 0 382 1925 ``` ] --- ## .center[From "wide" to "long" with `pivot_longer()`] <center> <img src="images/tidy-longer.png" width=600> </center> --- ## .center[From "wide" to "long" with `pivot_longer()`] .leftcol45[ ``` r head(fed_spend_wide) ``` ``` #> # A tibble: 6 × 15 #> year DOD NASA DOE HHS NIH NSF USDA Interior DOT EPA DOC DHS VA Other #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 35696 12513 10882 9226 8025 2372 1837 1152 1142 968 819 0 404 1191 #> 2 1977 37967 12553 13741 9507 8214 2395 1796 1082 1095 966 837 0 374 1280 #> 3 1978 37022 12516 15663 10533 8802 2446 1962 1125 1156 1175 871 0 356 1237 #> 4 1979 37174 13079 15612 10127 9243 2404 2054 1176 1004 1102 952 0 353 2321 #> 5 1980 37005 13837 15226 10045 9093 2407 1887 1082 1048 903 945 0 359 2468 #> 6 1981 41737 13276 14798 9644 8580 2300 1964 990 978 901 829 0 382 1925 ``` ] .rightcol55[ ``` r fed_spend_long <- fed_spend_wide %>% pivot_longer( * names_to = "department", * values_to = "rd_budget_mil", * cols = DOD:Other) head(fed_spend_long) ``` ``` #> # A tibble: 6 × 3 #> year department rd_budget_mil #> <dbl> <chr> <dbl> #> 1 1976 
DOD 35696 #> 2 1976 NASA 12513 #> 3 1976 DOE 10882 #> 4 1976 HHS 9226 #> 5 1976 NIH 8025 #> 6 1976 NSF 2372 ``` ] --- ## Can also set `cols` by selecting which columns _not_ to use .leftcol45[ ``` r names(fed_spend_wide) ``` ``` #> [1] "year" "DOD" "NASA" "DOE" "HHS" "NIH" "NSF" "USDA" "Interior" "DOT" "EPA" "DOC" "DHS" "VA" "Other" ``` ] .rightcol55[ ``` r fed_spend_long <- fed_spend_wide %>% pivot_longer( names_to = "department", values_to = "rd_budget_mil", * cols = -year) head(fed_spend_long) ``` ``` #> # A tibble: 6 × 3 #> year department rd_budget_mil #> <dbl> <chr> <dbl> #> 1 1976 DOD 35696 #> 2 1976 NASA 12513 #> 3 1976 DOE 10882 #> 4 1976 HHS 9226 #> 5 1976 NIH 8025 #> 6 1976 NSF 2372 ``` ] --- class: inverse
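
## .center[A round trip between "wide" and "long"]

The two pivots undo each other. The following is a minimal sketch, assuming the `dplyr` and `tidyr` packages are installed. The names `toy_wide`, `toy_long`, and `toy_back` are made up for illustration, and the values are the 1976 and 1977 budgets of three departments from the tables above. Pivoting longer turns 2 rows × 3 department columns into 2 × 3 = 6 rows, the same arithmetic that turns the full 42 × 15 table into 588 × 3 (42 years × 14 departments).

```r
library(dplyr)
library(tidyr)

# Toy "wide" table (hypothetical name): one column per department
toy_wide <- tibble(
  year = c(1976, 1977),
  DOD  = c(35696, 37967),
  NASA = c(12513, 12553),
  DOE  = c(10882, 13741)
)

# Wide -> long: one row per year-department pair
toy_long <- toy_wide %>%
  pivot_longer(
    names_to  = "department",
    values_to = "rd_budget_mil",
    cols      = -year)

dim(toy_wide)  # 2 rows, 4 columns
dim(toy_long)  # 6 rows, 3 columns

# Long -> wide: back to one column per department
toy_back <- toy_long %>%
  pivot_wider(
    names_from  = department,
    values_from = rd_budget_mil)

# Compare as plain data frames to check the round trip
all.equal(as.data.frame(toy_wide), as.data.frame(toy_back))
```

Note the quoting: `names_to` and `values_to` take strings because those columns do not exist yet, while `names_from` and `values_from` refer to existing columns and can be left unquoted.

---

class: inverse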
# Your turn: Reshaping Data Open the `practice.qmd` file. Run the code chunk to read in the following two data files: - `pv_cell_production.xlsx`: Data on solar photovoltaic cell production by country - `milk_production.csv`: Data on milk production by state Now modify the format of each: - If the data are in "wide" format, convert it to "long" with `pivot_longer()` - If the data are in "long" format, convert it to "wide" with `pivot_wider()` --- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. Tidy Data ## 2. .orange[Tidy Data Wrangling] ## BREAK ## 3. Tidy Data Visualization ## 4. Data Provenance & Curation ## 5. Writing a Research Question --- class: center, middle, inverse # Why do we need tidy data? (a quick explanation with cute graphics, by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)) --- class: center background-image: url("images/horst_tidydata_1.jpg") background-size: contain --- class: center background-image: url("images/horst_tidydata_2.jpg") background-size: contain --- class: center background-image: url("images/horst_tidydata_3.jpg") background-size: contain --- # Tidy data wrangling Compute the total R&D spending in each year ``` r head(fed_spend_wide) ``` ``` #> # A tibble: 6 × 15 #> year DOD NASA DOE HHS NIH NSF USDA Interior DOT EPA DOC DHS VA Other #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 35696 12513 10882 9226 8025 2372 1837 1152 1142 968 819 0 404 1191 #> 2 1977 37967 12553 13741 9507 8214 2395 1796 1082 1095 966 837 0 374 1280 #> 3 1978 37022 12516 15663 10533 8802 2446 1962 1125 1156 1175 871 0 356 1237 #> 4 1979 37174 13079 15612 10127 9243 2404 2054 1176 1004 1102 952 0 353 2321 #> 5 1980 37005 13837 15226 10045 9093 2407 1887 1082 1048 903 945 0 359 2468 #> 6 1981 41737 13276 14798 9644 8580 2300 1964 990 978 901 829 0 382 1925 ``` --- # Tidy data wrangling Compute the total R&D spending in each year **Approach 1**: Create new `total` by adding 
each variable ``` r fed_spend_wide %>% mutate(total = DHS + DOC + DOD + DOE + DOT + EPA + HHS + Interior + NASA + NIH + NSF + Other + USDA + VA) %>% select(year, total) ``` ``` #> # A tibble: 42 × 2 #> year total #> <dbl> <dbl> #> 1 1976 86227 #> 2 1977 91807 #> 3 1978 94864 #> 4 1979 96601 #> 5 1980 96305 #> 6 1981 98304 #> 7 1982 95448 #> 8 1983 95010 #> 9 1984 105371 #> 10 1985 114818 #> # ℹ 32 more rows ``` --- # Tidy data wrangling Compute the total R&D spending in each year **Approach 2**: Reshape first, then summarise .leftcol[ ``` r fed_spend_long <- fed_spend_wide %>% pivot_longer( names_to = "department", values_to = "rd_budget_mil", cols = -year) head(fed_spend_long) ``` ``` #> # A tibble: 6 × 3 #> year department rd_budget_mil #> <dbl> <chr> <dbl> #> 1 1976 DOD 35696 #> 2 1976 NASA 12513 #> 3 1976 DOE 10882 #> 4 1976 HHS 9226 #> 5 1976 NIH 8025 #> 6 1976 NSF 2372 ``` ] -- .rightcol[ ``` r fed_spend_long %>% group_by(year) %>% summarise(total = sum(rd_budget_mil)) ``` ``` #> # A tibble: 42 × 2 #> year total #> <dbl> <dbl> #> 1 1976 86227 #> 2 1977 91807 #> 3 1978 94864 #> 4 1979 96601 #> 5 1980 96305 #> 6 1981 98304 #> 7 1982 95448 #> 8 1983 95010 #> 9 1984 105371 #> 10 1985 114818 #> # ℹ 32 more rows ``` ] --- # Tidy data wrangling Compute the total R&D spending in each year **Approach 2**: Reshape first, then summarise .leftcol[ ``` r total <- fed_spend_wide %>% pivot_longer( names_to = "department", values_to = "rd_budget_mil", cols = -year) %>% group_by(year) %>% summarise(total = sum(rd_budget_mil)) ``` ] .rightcol[ ``` r head(total) ``` ``` #> # A tibble: 6 × 2 #> year total #> <dbl> <dbl> #> 1 1976 86227 #> 2 1977 91807 #> 3 1978 94864 #> 4 1979 96601 #> 5 1980 96305 #> 6 1981 98304 ``` ] --- class: inverse
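
# Tidy data wrangling

Both approaches should give the same yearly totals. The difference is that Approach 1 has to name every department by hand, while Approach 2 keeps working unchanged if a department is added or dropped. Here is a small sketch on a made-up three-department table, assuming `dplyr` and `tidyr` are installed. `toy_wide` and the `totals_*` names are hypothetical, and the values are the 1976 and 1977 budgets from the slides above.

```r
library(dplyr)
library(tidyr)

# Toy "wide" table (hypothetical name): one column per department
toy_wide <- tibble(
  year = c(1976, 1977),
  DOD  = c(35696, 37967),
  NASA = c(12513, 12553),
  DOE  = c(10882, 13741)
)

# Approach 1: add the department columns by name
totals_wide <- toy_wide %>%
  mutate(total = DOD + NASA + DOE) %>%
  select(year, total)

# Approach 2: reshape to long, then sum within each year
toy_long <- toy_wide %>%
  pivot_longer(
    names_to  = "department",
    values_to = "rd_budget_mil",
    cols      = -year)

totals_long <- toy_long %>%
  group_by(year) %>%
  summarise(total = sum(rd_budget_mil))

# Both give 59091 for 1976 and 64261 for 1977
all.equal(totals_wide$total, totals_long$total)

# Changing the grouping variable gives totals per department instead
totals_dept <- toy_long %>%
  group_by(department) %>%
  summarise(total = sum(rd_budget_mil))
```

Once the data are long, switching from "per year" to "per department" is a one-word change in `group_by()`. In the wide layout the same change would mean summing down each department column separately.

---

class: inverse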
# Your turn: Tidy Data Wrangling Open the `practice.qmd` file. Run the code chunk to read in the following two data files: - `gapminder.csv`: Life expectancy in different countries over time - `gdp.csv`: GDP of different countries over time Now convert the data into a tidy (long) structure, then create the following summary data frames: - Mean life expectancy in each year. - Mean GDP in each year. --- class: inverse, center # .fancy[Break]
--- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. Tidy Data ## 2. Tidy Data Wrangling ## BREAK ## 3. .orange[Tidy Data Visualization] ## 4. Data Provenance & Curation ## 5. Writing a Research Question --- # Tidy data visualization Make a bar chart of total R&D spending by agency .leftcol55[ ``` r head(fed_spend_wide) ``` ``` #> # A tibble: 6 × 15 #> year DOD NASA DOE HHS NIH NSF USDA Interior DOT EPA DOC DHS VA Other #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 35696 12513 10882 9226 8025 2372 1837 1152 1142 968 819 0 404 1191 #> 2 1977 37967 12553 13741 9507 8214 2395 1796 1082 1095 966 837 0 374 1280 #> 3 1978 37022 12516 15663 10533 8802 2446 1962 1125 1156 1175 871 0 356 1237 #> 4 1979 37174 13079 15612 10127 9243 2404 2054 1176 1004 1102 952 0 353 2321 #> 5 1980 37005 13837 15226 10045 9093 2407 1887 1082 1048 903 945 0 359 2468 #> 6 1981 41737 13276 14798 9644 8580 2300 1964 990 978 901 829 0 382 1925 ``` ] .rightcol45[ <img src="figs/fed-spend-bars-1.png" width="522.144" /> ] --- # Tidy data visualization Make a bar chart of total R&D spending by agency .leftcol55[ ``` r ggplot(fed_spend_wide) + * geom_col(aes(x = rd_budget_mil, y = department)) + theme_bw() + labs( x = "R&D Spending ($Millions)", y = "Federal Agency" ) ``` ``` #> Error in `geom_col()`: #> ! Problem while computing aesthetics. #> ℹ Error occurred in the 1st layer. #> Caused by error: #> !
object 'rd_budget_mil' not found ``` ] .rightcol45[ <img src="figs/unnamed-chunk-37-1.png" width="522.144" style="display: block; margin: auto;" /> ] --- # Tidy data visualization Make a bar chart of total R&D spending by agency .leftcol55[ ``` r fed_spend_wide %>% * pivot_longer( * names_to = "department", * values_to = "rd_budget_mil", * cols = -year ) %>% ggplot() + geom_col(aes(x = rd_budget_mil, y = department)) + theme_bw() + labs( x = "R&D Spending ($Millions)", y = "Federal Agency" ) ``` ] .rightcol45[ <img src="figs/unnamed-chunk-39-1.png" width="522.144" style="display: block; margin: auto;" /> ] --- class: inverse
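
# Tidy data visualization

The chart above gets its totals implicitly: `geom_col()` stacks one segment per year for each department. One possible variation is to compute the totals explicitly before plotting, which also makes it easy to sort the bars. This is a sketch under assumptions, not the code behind the figures in these slides: `toy_wide`, `dept_totals`, and `p` are made-up names, the table is a three-department subset of the 1976 and 1977 values shown earlier, and `dplyr`, `tidyr`, and `ggplot2` are assumed to be installed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy "wide" table (hypothetical name): one column per department
toy_wide <- tibble(
  year = c(1976, 1977),
  DOD  = c(35696, 37967),
  NASA = c(12513, 12553),
  DOE  = c(10882, 13741)
)

# Reshape to long, then total the spending for each department
dept_totals <- toy_wide %>%
  pivot_longer(
    names_to  = "department",
    values_to = "rd_budget_mil",
    cols      = -year) %>%
  group_by(department) %>%
  summarise(total = sum(rd_budget_mil))

# One bar per department, sorted by total spending
p <- ggplot(dept_totals) +
  geom_col(aes(x = total, y = reorder(department, total))) +
  theme_bw() +
  labs(
    x = "R&D Spending ($Millions)",
    y = "Federal Agency"
  )

p
```

Summarising first keeps the plotted numbers available for inspection in `dept_totals`, which is handy when a bar looks suspicious.

---

class: inverse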
# Your turn: Tidy Data Visualization Run the code chunk to read in the two data files, then convert the data into a tidy (long) structure to create the following charts: .leftcol[ <img src="figs/unnamed-chunk-41-1.png" width="522.144" /> ] .rightcol[ <img src="figs/unnamed-chunk-42-1.png" width="522.144" /> ] --- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. Tidy Data ## 2. Tidy Data Wrangling ## BREAK ## 3. Tidy Data Visualization ## 4. .orange[Data Provenance & Curation] ## 5. Writing a Research Question --- ### Data provenance - It matters where you get your data -- **Validity**: - Is this data trustworthy? Is it authentic? - Where did the data come from? - How has the data been changed / managed over time? - Is the data complete? -- **Comprehension**: - Is this data accurate? - Can you explain your results? - Is this the right data to answer your question? -- **Reproducibility**: - I should be able to fully replicate your results from your raw data and code. --- ##
**Document your source like a museum curator** **Example**: View the `README.md` file in the `data` folder -- Whenever you download data, you should **at a minimum** record the following: - The name of the file you are describing. - The date you downloaded it. - The original name of the downloaded file (in case you renamed it). - The URL of the site you downloaded it from. - The source of the _original_ data (sometimes different from the site you downloaded it from). - A short description of the data, including how they were collected (if available). - A dictionary for the data (e.g. a simple markdown table describing each variable). --- class: inverse
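
## **Drafting a data dictionary**

The variable names and types for a dictionary can be pulled from the data itself, leaving only the descriptions to write by hand. Below is a rough sketch in base R. `make_dictionary()` is a hypothetical helper written for this example, not a function from any package, and `toy_long` is a made-up two-row table using values from the earlier slides.

```r
# Hypothetical helper: build the lines of a markdown table with one
# row per variable, leaving the description to be filled in by hand
make_dictionary <- function(df) {
  types <- vapply(df, function(col) class(col)[1], character(1))
  c(
    "| Variable | Type | Description |",
    "|----------|------|-------------|",
    paste0("| `", names(df), "` | ", types, " | TODO |")
  )
}

# Made-up example data
toy_long <- data.frame(
  department    = c("DOD", "NASA"),
  year          = c(1976, 1976),
  rd_budget_mil = c(35696, 12513)
)

dictionary <- make_dictionary(toy_long)

# Print the table so it can be pasted into the README
writeLines(dictionary)
```

The generated table is only a starting point: the `TODO` cells still need a human-written description, including units such as "millions of dollars".

---

class: inverse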
# Your turn Documentation in the "data/README.md" file is missing for the following data sets: - wildlife_impacts.csv: [source](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-07-23) - north_america_bear_killings.txt: [source](https://data.world/makeovermonday/2019w21) - uspto_clean_energy_patents.xlsx: [source](https://www.nsf.gov/statistics/2018/nsb20181/report/sections/industry-technology-and-the-global-marketplace/global-trends-in-sustainable-energy-research-and-technologies) Go to the above sites and add the following information to the "data/README.md" file: - The name of the downloaded file. - The web address to the site you downloaded the data from. - The source of the _original_ data (if different from the website). - A short description of the data and how they were collected. - A dictionary for the data (hint: the site might already have this!). --- class: inverse, middle # Week 2: .fancy[Tidy Data] ## 1. Tidy Data ## 2. Tidy Data Wrangling ## BREAK ## 3. Tidy Data Visualization ## 4. Data Provenance & Curation ## 5. .orange[Writing a Research Question] --- # Writing a research question Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question) - your question should be: -- - **Clear**: your audience can easily understand its purpose without additional explanation. -- - **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report. -- - **Concise**: it is expressed in the fewest possible words. -- - **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data. -- - **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?) 
--- # Writing a research question -- **Bad question: Why are social networking sites harmful?** - Unclear: it does not specify _which_ social networking sites or state what harm is being caused; it also assumes that "harm" exists. -- **Improved question: How are online users experiencing or addressing privacy issues on social networking sites such as Facebook and Twitter?** - Specifies the sites (Facebook and Twitter), type of harm (privacy issues), and who is harmed (online users). --- # Writing a research question **Examples from previous classes**: - [Genders in the Workforce](https://eda.seas.gwu.edu/showcase/2021-Spring/gender_pay_gap.html): How has the US gender wage gap changed over time for different occupations and age groups? - [NFL Suspensions](https://eda.seas.gwu.edu/showcase/2021-Spring/nfl_suspensions.html): What factors contribute to the severity of disciplinary actions towards NFL players from 2002-2014? **Other good examples**: See the [Example Projects](https://eda.seas.gwu.edu/2024-Fall/project/examples.html) page --- class: inverse, middle, center # Use [this link](https://docs.google.com/spreadsheets/d/15pn9VNtYBG3XF-1OhvdKLMoNj4hOTXCGf4U1KNx0Tco/edit?usp=sharing) to form teams --- .leftcol[ <br> <center> <img src="images/car-size.png" width=100%> </center> ] .rightcol[ <br> ## Project idea: car bloat #### Data: Webscraped from [Car and Driver](https://www.caranddriver.com/) Summary: - [This tweet thread](https://x.com/curious_founder/status/1830623332892299510?t=keukWxtrhC-CMEeHAwbOtg&s=19) - [This youtube video](https://youtu.be/C5q_l8SXar0?si=fXvDE3NQVnemcm_y) ]