class: middle, inverse .leftcol30[ <center> <img src="https://eda.seas.gwu.edu/images/logo.png" width=250> </center> ] .rightcol70[ # Week 1: .fancy[Getting Started] ###
EMSE 4572/6572: Exploratory Data Analysis ###
John Paul Helveston ###
August 28, 2024 ] --- class: inverse, middle # Week 1: .fancy[Getting Started] ### 1. Course Goal ### 2. Course Introduction ### 3. Break: Install Stuff ### 4. Quarto ### 5. Workflow & Reading In Data ### 6. Wrangling Data ### 7. Visualizing Data --- class: inverse, middle # Week 1: .fancy[Getting Started] ### 1. .orange[Course Goal] ### 2. Course Introduction ### 3. Break: Install Stuff ### 4. Quarto ### 5. Workflow & Reading In Data ### 6. Wrangling Data ### 7. Visualizing Data --- ## Course 1: [Intro to Programming for Analytics](https://p4a.seas.gwu.edu/) **"Computational Literacy"** - Programming: Conditionals (if/else), loops, functions, testing, data types. - Analytics: Data structures, import / export, basic data manipulation & visualization. -- ## Course 2: [Exploratory Data Analysis](https://eda.seas.gwu.edu/) **"Data Literacy"** - Strategies for conducting an exploratory data analysis. - Design principles for visualizing and communicating _information_ extracted from data. - Reproducibility: Reports that contain code, equations, visualizations, and narrative text. --- class: center, inverse, middle # **Class goal**: translate _data_ into _information_ <center> <img src="images/truth.png" width=80%> </center> --- class: center # **Class goal**: translate _data_ into _information_ -- .leftcol[ **Data** Average student engagement scores Class | Type | City | County ------------|-------------|------|------- Special Ed. | Charter | 643 | 793 Special Ed. | Public | 735 | 928 General Ed. | Charter | 590 | 724 General Ed. 
| Public | 863 | 662 ] -- .rightcol[ **Information** <img src="figs/student-engagement-final-1.png" width="432" /> ] --- # Data exploration: an iterative process -- .leftcol[ Encode data: .code60[ ``` r engagement_data <- data.frame( City = c(643, 735, 590, 863), County = c(793, 928, 724, 662), School = c('Special Ed., Charter', 'Special Ed., Public', 'General Ed., Charter', 'General Ed., Public')) engagement_data ``` ``` #> City County School #> 1 643 793 Special Ed., Charter #> 2 735 928 Special Ed., Public #> 3 590 724 General Ed., Charter #> 4 863 662 General Ed., Public ``` ]] -- .rightcol[ Re-format data for plotting: .code60[ ``` r engagement_data <- engagement_data %>% gather(Location, Engagement, City:County) %>% mutate(Location = fct_relevel( Location, c('City', 'County'))) engagement_data ``` ``` #> School Location Engagement #> 1 Special Ed., Charter City 643 #> 2 Special Ed., Public City 735 #> 3 General Ed., Charter City 590 #> 4 General Ed., Public City 863 #> 5 Special Ed., Charter County 793 #> 6 Special Ed., Public County 928 #> 7 General Ed., Charter County 724 #> 8 General Ed., Public County 662 ``` ]] --- # Data exploration: an iterative process .leftcol[ Initial exploratory plotting: .code60[ ``` r engagement_data %>% ggplot() + geom_col(aes(x = Engagement, y = School, fill = Location), position = 'dodge') ``` <img src="figs/student-engagement-bars1-1.png" width="432" /> ]] -- .rightcol[ More exploratory plotting:<br>highlight difference <img src="figs/student-engagement-bars2-1.png" width="432" /> ] --- # Data exploration: an iterative process .leftcol[ Directly label figure: <img src="figs/student-engagement-bars3-1.png" width="432" /> ] -- .rightcol[ Remove unnecessary axes, change colors, fix labels: <img src="figs/unnamed-chunk-6-1.png" width="432" /> ] --- **A fully reproducible analysis** .panelset[ .panel[.panel-name[Code] .code40[.leftcol[ ``` r data <- data.frame( City = c(643, 735, 590, 863), County = c(793, 928, 724, 662), School = 
c('Special Ed., Charter', 'Special Ed., Public',
  'General Ed., Charter', 'General Ed., Public'),
    Highlight = c(0, 0, 0, 1)) %>%
    gather(Location, Engagement, City:County) %>%
    mutate(
        Location = fct_relevel(Location, c('City', 'County')),
        Highlight = as.factor(Highlight),
        x = ifelse(Location == 'County', 1, 0))
```
]

.rightcol[
``` r
plot <- ggplot(data, aes(x = x, y = Engagement, group = School, color = Highlight)) +
    geom_point() +
    geom_line() +
    scale_color_manual(values = c('#757575', '#ed573e')) +
    labs(x = 'Location', y = 'Engagement',
         title = paste0('Students in public, general education classes\n',
                        'in county schools have surprisingly low engagement')) +
    scale_x_continuous(limits = c(-1.2, 1.2), labels = c('City', 'County'), breaks = c(0, 1)) +
    # Label the engagement values on the County (right) side
    geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                    data = subset(data, Location == 'County'),
                    size = 5, nudge_x = 0.1, segment.color = NA) +
    # Label the engagement values on the City (left) side
    geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                    data = subset(data, Location == 'City'),
                    size = 5, nudge_x = -0.1, segment.color = NA) +
    # Label each line with its school type, left of the City values
    geom_text_repel(aes(label = School, color = as.factor(Highlight)),
                    data = subset(data, Location == 'City'),
                    size = 5, nudge_x = -0.25, hjust = 1, segment.color = NA) +
    theme_cowplot() +
    background_grid(major = 'x') +
    theme(axis.line = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')
```
]]]

.panel[.panel-name[Plot]

<img src="figs/unnamed-chunk-9-1.png" width="432" />

]]

---

background-color: #fff
class: center

# Data exploration: an iterative process

<center>
<img src="images/eda.png" width=600>
</center>

---

class: inverse, middle

# Week 1: .fancy[Getting Started]

### 1. Course Goal
### 2. .orange[Course Introduction]
### 3. Break: Install Stuff
### 4. Quarto
### 5. Workflow & Reading In Data
### 6. Wrangling Data
### 7. Visualizing Data

---

# Meet your instructor!
.leftcol30[.circle[
<img src="https://www.jhelvy.com/images/lab/john_helveston_square.png" width="300">
]]

.rightcol70[

### John Helveston, Ph.D.

.font80[
- 2018 - Present Assistant Professor, Engineering Management & Systems Engineering
- 2016-2018 Postdoc at [Institute for Sustainable Energy](https://www.bu.edu/ise/), Boston University
- 2016 PhD in Engineering & Public Policy at Carnegie Mellon University
- 2015 MS in Engineering & Public Policy at Carnegie Mellon University
- 2010 BS in Engineering Science & Mechanics at Virginia Tech
- Website: [www.jhelvy.com](https://www.jhelvy.com/)
]]

---

# Meet your tutors!

.leftcol30[.circle[
<img src="images/hu.jpg" width="300">
]]

.rightcol70[

### **Pingfan Hu**

- Graduate Teaching Assistant (GTA)
- PhD student in EMSE
- Website: [www.pingfanhu.com](https://www.pingfanhu.com/)

]

---

# Meet your tutors!

.leftcol30[.circle[
<img src="images/bunea.jpg" width="300">
]]

.rightcol70[

### **Bogdan Bunea**

- Learning Assistant (LA)
- EMSE Junior & P4A / EDA alumnus
- Check out his team's [project](https://eda.seas.gwu.edu/showcase/2023-Fall/ukraine-war.html) from 2023

]

---

# Prerequisites

## [EMSE 4574 / 6574: Intro to Programming for Analytics](https://p4a.seas.gwu.edu/)

You should be able to:

- Use RStudio to write basic R commands.
- Distinguish between the different R operators and data types, including numeric, string, and logical data.
- Use **tidyverse** functions to wrangle and manipulate data in R.
- Use the **ggplot2** library to create plots in R.

--

> [
Check out R for Analytics Primer](http://jhelvy.github.io/r4aPrimer/) --- # Course website ##
Everything you need will be on the course website:<br>https://eda.seas.gwu.edu/2024-Fall/ -- ##
The [schedule](https://emse-eda-gwu.github.io/2024-Fall/schedule.html) is the best starting point --- # **Quizzes** (10% of grade) -- ##
At the start of class every other week-ish. Make-ups only for excused absences (i.e., don't be late). -- ##
5 total, lowest dropped -- ##
~5 - 10 minutes -- > **Why quiz at all?** The "retrieval effect" - basically, you have to _practice_ remembering things, otherwise your brain won't remember them (see the book ["Make It Stick: The Science of Successful Learning"](https://www.hup.harvard.edu/catalog.php?isbn=9780674729018)) --- ## Assignments -- ## 1)
Weekly Homework / Readings: [HW1](https://eda.seas.gwu.edu/2024-Fall/hw/1-tidy-data.html) -- ## 2)
3 Mini Projects (due 2 weeks from date assigned) -- ## 3)
[Final Project](https://eda.seas.gwu.edu/2023-Fall/project/0-overview.html)

.leftcol[

**Undergrads**: Teams of 3 - 4 students

**Grads**: Teams of 2 students

]

.rightcol[

Item            | Due Date
----------------|---------------
Proposal        | Sep 22
Progress Report | Oct 27
Final Report    | Dec 08
Presentation    | Dec 11

]

---

# .center[Grades]

Item                           | Weight | Notes
-------------------------------|--------|-------------------------------------
Participation / Attendance     | 5 %    | (Yes, I take attendance)
Reflections                    | 12 %   | Weekly assignment, lowest dropped
Quizzes                        | 10 %   | 5 quizzes, lowest dropped
Mini Project 1                 | 10 %   | Individual assignments
Mini Project 2                 | 10 %   |
Mini Project 3                 | 10 %   |
Final Project: Proposal        | 6 %    |
Final Project: Progress Report | 6 %    |
Final Project: Report          | 15 %   |
Final Project: Presentation    | 6 %    |
Final Interview                | 10 %   | Individual interview

---

background-color: #FFF

# .center[Grades]

<center>
<img src="https://eda.seas.gwu.edu/2024-Fall/figs/grade-breakdown-1.png" width=90%>
</center>

---

# Course policies

--

.leftcol35[

- ## BE NICE
- ## BE HONEST
- ## DON'T CHEAT

]

--

.rightcol65[

## Copying is good, stealing is bad

> "Plagiarism is trying to pass someone else’s work off as your own. Copying is about reverse-engineering."
>
> .right[-- Austin Kleon, from [Steal Like An Artist](https://austinkleon.com/steal/) ]

]

---

## Use of ChatGPT and other AI tools

- Large language models (LLMs) are pretty good...but sometimes suck.

--

- Use of AI tools is generally permitted, but **be transparent**.
- All assignments must include a **Use of AI on this assignment** section where you:
    - Describe any AI tool and how it was used along with prompt(s) used.
    - Include a link to the chat transcript.

## **Use AI as an assistant, not a solutions manual**

> Curious how LLMs actually work?
Check out [this article](https://www.understandingai.org/p/large-language-models-explained-with), which provides a simplified description of how they work (which itself is still quite complicated). --- # Late submissions ## - **5** late days - use them anytime, no questions asked ## - No more than **2** late days on any one assignment ## - Contact me for special cases --- # How to succeed in this class -- ##
Participate during class! -- ##
Start assignments early and **read carefully**! -- ##
Actually read (before class)! -- ##
Get sleep and take breaks often! -- ##
Ask for help! --- # Getting Help -- ##
Use [Slack](https://emse-eda-f24.slack.com/) to ask questions. -- ##
Meet with your tutors -- ##
[Schedule a meeting](https://jhelvy.appointlet.com/b/professor-helveston) w/Prof. Helveston: - Mondays from 8:00-4:30pm - Tuesdays from 8:00-4:30pm - Fridays from 8:00-4:00pm -- ##
[GW Coders](http://gwcoders.github.io/) --- #
[Course Software](https://eda.seas.gwu.edu/2024-Fall/software.html) -- ##
[Slack](https://emse-eda-f24.slack.com/): See bb for link to join;<br>install on phone and **turn notifications on**! -- ##
[R](https://cloud.r-project.org/) & [RStudio](https://posit.co/download/rstudio-desktop/) (Install both) -- ##
[Posit Cloud](https://posit.cloud/) (Register for free!) --- class: inverse <br> # .center[.fancy[Break]] 1. If you haven't already, install everything on the [software page](https://eda.seas.gwu.edu/2024-Fall/software.html) 2. Stand up, meet each other, (maybe form teams?...use [this sheet](https://docs.google.com/spreadsheets/d/15pn9VNtYBG3XF-1OhvdKLMoNj4hOTXCGf4U1KNx0Tco/edit?usp=sharing))
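
Not sure whether the install step worked? Here is a minimal sketch for checking from the R console. It assumes the packages loaded later in these slides (`tidyverse`, `here`, `readxl`, `palmerpenguins`) are the ones you need; the software page may list others. `find_missing()` is a hypothetical helper written for this example, not part of any package.

```r
# Packages loaded later in these slides (an assumption: the course
# software page may list more or fewer packages).
needed <- c("tidyverse", "here", "readxl", "palmerpenguins")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(needed)

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All set!")
}
```

The `install.packages()` call is left commented out so that running the check never changes your setup by accident.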
--- class: inverse, middle # Week 1: .fancy[Getting Started] ### 1. Course Goal ### 2. Course Introduction ### 3. Break: Install Stuff ### 4. .orange[Quarto] ### 5. Workflow & Reading In Data ### 6. Wrangling Data ### 7. Visualizing Data --- class: middle, inverse # .center[Quick demo] <br> # 1. Open `quarto_demo.qmd` # 2. Click "Render" <center> <img src="images/how-qmd-works.png" width=100%> </center> --- # .center[Anatomy of a .qmd file] <br> # .red[Header] # Markdown text # R code --- # Define overall document options in header .leftcol[ Basic html page ``` --- title: Your title author: Author name format: html --- ``` ] .rightcol[ Add table of contents, change theme ``` --- title: Your title author: Author name toc: true format: html: theme: united --- ``` More on themes at https://quarto.org/docs/output-formats/html-themes.html ] --- # Render to multiple outputs .leftcol[ ### PDF uses LaTeX ``` --- title: Your title author: Author name format: pdf --- ``` If you don't have LaTeX on your computer, install tinytex in R: ``` r tinytex::install_tinytex() ``` ] .rightcol[ ### Microsoft Word ``` --- title: Your title author: Author name format: docx --- ``` ] --- # .center[Anatomy of a .qmd file] <br> # ~~Header~~ # .red[Markdown text] # R code --- class: center # Right now, bookmark this! 👇 # https://commonmark.org/help/ <br><hr><br> # (When you have 10 minutes, do this! 👇) # https://commonmark.org/help/tutorial/ --- # .center[Headers] -- .leftcol[ ```markdown # HEADER 1 ## HEADER 2 ### HEADER 3 #### HEADER 4 ##### HEADER 5 ###### HEADER 6 ``` ] -- .rightcol[ # HEADER 1 ## HEADER 2 ### HEADER 3 #### HEADER 4 ##### HEADER 5 ###### HEADER 6 ] --- # .center[Basic Text Formatting] .leftcol[ ## Type this... 
- `normal text` - `_italic text_` - `*italic text*` - `**bold text**` - `***bold italic text***` - `~~strikethrough~~` - `` `code text` `` ] .rightcol[ ## ..to get this - normal text - _italic text_ - *italic text* - **bold text** - ***bold italic text*** - ~~strikethrough~~ - `code text` ] --- class: top # .center[Lists] .leftcol[ Bullet list: ``` r - first item - second item - third item ``` - first item - second item - third item ] .rightcol[ Numbered list: ``` r 1. first item 2. second item 3. third item ``` 1. first item 2. second item 3. third item ] --- # .center[Links] Simple **url link** to another site: ``` r [Download R](http://www.r-project.org/) ``` [Download R](http://www.r-project.org/) --- class: middle, center # Don't want to use Markdown? # .red[Use Visual Mode!] <center> <img src="images/visual-mode.png" width=700> </center> --- # .center[Anatomy of a .qmd file] <br> # ~~Header (think of this as the "settings")~~ # ~~Markdown text~~ # .red[R code] --- class: center # R Code -- .leftcol[ ## Inline code .left[ ``` r `r insert code here` ``` ]] -- .rightcol[ ## Code chunks .left[ ````markdown ```{r} insert code here insert more code here ``` ```` ]] --- # Inline R code ``` r The sum of 3 and 4 is `r 3 + 4` ``` -- Produces this: The sum of 3 and 4 is 7 --- # R Code chunks .leftcol[ This code chunk... 
````markdown ```{r} library(palmerpenguins) head(penguins) ``` ```` ] -- .rightcol[ ...will produce this when compiled: ``` r library(palmerpenguins) head(penguins) ``` ``` #> # A tibble: 6 × 8 #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year #> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> #> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 #> 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007 #> 3 Adelie Torgersen 40.3 18 195 3250 female 2007 #> 4 Adelie Torgersen NA NA NA NA <NA> 2007 #> 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007 #> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 ``` ] --- # Chunk options Control what chunks output using options All options [here](https://quarto.org/docs/reference/cells/cells-knitr.html) <img src="images/chunks_options.png" width="60%" /> --- # .center[Chunk output options] .center[By default, code chunks print **code** + **output**] -- .cols3[ ````markdown ```{r} #| echo: false cat('hello world!') ``` ```` Prints only **output**<br>(doesn't show code) ``` #> hello world! ``` ] -- .cols3[ ````markdown ```{r} #| eval: false cat('hello world!') ``` ```` Prints only **code**<br>(doesn't run the code) ``` r cat('hello world!') ``` ] -- .cols3[ ````markdown ```{r} #| include: false cat('hello world!') ``` ```` Runs, but doesn't print anything ] --- # A global `setup` chunk 🌍 .leftcol[ ````markdown ```{r} #| label: setup #| include: false knitr::opts_chunk$set( warning = FALSE, message = FALSE, fig.path = "figs/", fig.width = 7.252, fig.height = 4, comment = "#>", fig.retina = 3 ) ``` ```` ] .rightcol[ - Typically the first chunk - All following chunks will use these options (i.e., sets global chunk options) - You can (and should) use individual chunk options too - Often where I load libraries, etc. ] --- class: inverse, middle # Week 1: .fancy[Getting Started] ### 1. Course Goal ### 2. Course Introduction ### 3. Break: Install Stuff ### 4. Quarto ### 5. 
.orange[Workflow & Reading In Data]
### 6. Wrangling Data
### 7. Visualizing Data

---

## Workflow for reading in data

1) Use R Projects (.Rproj files) to organize your analysis - **don't double-click .R files**!

<img src = "images/rproj.png" width = "75">

--

2) Use the `here` package to create file paths

``` r
path <- here::here("folder", "file.csv")
```

--

3) Import data with these functions:

File type  | Function       | Library
-----------|----------------|----------
`.csv`     | `read_csv()`   | **readr**
`.txt`     | `read.table()` | **utils**
`.xlsx`    | `read_excel()` | **readxl**

---

# Importing Comma Separated Values (.csv)

Read in `.csv` files with `read_csv()`:

``` r
library(tidyverse)
library(here)

csvPath <- here('data', 'milk_production.csv')
*milk_production <- read_csv(csvPath)

head(milk_production)
```

```
#> # A tibble: 6 × 4
#>   region    state          year milk_produced
#>   <chr>     <chr>         <dbl>         <dbl>
#> 1 Northeast Maine          1970     619000000
#> 2 Northeast New Hampshire  1970     356000000
#> 3 Northeast Vermont        1970    1970000000
#> 4 Northeast Massachusetts  1970     658000000
#> 5 Northeast Rhode Island   1970      75000000
#> 6 Northeast Connecticut    1970     661000000
```

---

# Importing Text Files (.txt)

Read in `.txt` files with `read.table()`:

``` r
txtPath <- here('data', 'nasa_global_temps.txt')
*global_temps <- read.table(txtPath, skip = 5, header = FALSE)

head(global_temps)
```

```
#>     V1    V2    V3
#> 1 1880 -0.15 -0.08
#> 2 1881 -0.07 -0.12
#> 3 1882 -0.10 -0.15
#> 4 1883 -0.16 -0.19
#> 5 1884 -0.27 -0.23
#> 6 1885 -0.32 -0.25
```

---

# Importing Text Files (.txt)

Read in `.txt` files with `read.table()`:

``` r
txtPath <- here('data', 'nasa_global_temps.txt')
global_temps <- read.table(txtPath, skip = 5, header = FALSE)
*names(global_temps) <- c('year', 'no_smoothing', 'loess') # Add header

head(global_temps)
```

```
#>   year no_smoothing loess
#> 1 1880        -0.15 -0.08
#> 2 1881        -0.07 -0.12
#> 3 1882        -0.10 -0.15
#> 4 1883        -0.16 -0.19
#> 5 1884        -0.27 -0.23
#> 6 1885        -0.32 -0.25
```

---

# Importing Excel
Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ``` r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') *pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) ``` .code70[ ``` r glimpse(pv_cells) ``` ``` #> Rows: 25 #> Columns: 10 #> $ Year <chr> NA, NA, "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", NA, "Note: NA = data not available.", NA, "Source: Compiled by E… #> $ China <chr> "Megawatts", NA, "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "24139… #> $ Taiwan <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.565000000000… #> $ Japan <dbl> NA, NA, 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0, NA, NA, NA, NA #> $ Malaysia <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "0", "0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59", NA, NA, NA, NA #> $ Germany <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054", … #> $ `South Korea` <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", "1127.0999999999999", NA, NA, NA, NA #> $ `United States` <dbl> NA, NA, 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 
120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250, NA, NA, NA, NA #> $ Others <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "66… #> $ World <dbl> NA, NA, 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496, NA, NA, NA, NA ``` ] --- # Importing Excel Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ``` r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) %>% * mutate(Year = as.numeric(Year)) %>% # Convert "non-years" to NA * filter(!is.na(Year)) # Drop NA rows in Year ``` .code60[ ``` r glimpse(pv_cells) ``` ``` #> Rows: 19 #> Columns: 10 #> $ Year <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 #> $ China <chr> "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "24139.014999999999", "… #> $ Taiwan <chr> "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.5650000000005" #> $ Japan <dbl> 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0 #> $ Malaysia <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "0", "0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59" #> $ Germany 
<chr> "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054", "1054.88… #> $ `South Korea` <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", "1127.0999999999999" #> $ `United States` <dbl> 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250 #> $ Others <chr> "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "663.660000… #> $ World <dbl> 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496 ``` ] --- class: inverse
# Your turn

Open the `practice.qmd` file.

Write code to import the following data files from the "data" folder:

- For `lotr_words.csv`, call the data frame `lotr`
- For `north_america_bear_killings.txt`, call the data frame `bears`
- For `uspto_clean_energy_patents.xlsx`, call the data frame `patents`

---

class: inverse, middle

# Week 1: .fancy[Getting Started]

### 1. Course Goal
### 2. Course Introduction
### 3. Break: Install Stuff
### 4. Quarto
### 5. Workflow & Reading In Data
### 6. .orange[Wrangling Data]
### 7. Visualizing Data

---

.leftcol[

# .center[The data frame...<br>in .darkgreen[Excel]]

<center>
<img src="images/spreadsheet.png" width=340>
</center>

]

.rightcol[

# .center[The data frame...<br>in R
] ``` r lotr ``` ``` #> # A tibble: 18 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Fellowship Of The Ring Man Female 0 #> 6 The Fellowship Of The Ring Man Male 1995 #> 7 The Return Of The King Elf Female 183 #> 8 The Return Of The King Elf Male 510 #> 9 The Return Of The King Hobbit Female 2 #> 10 The Return Of The King Hobbit Male 2673 #> 11 The Return Of The King Man Female 268 #> 12 The Return Of The King Man Male 2459 #> 13 The Two Towers Elf Female 331 #> 14 The Two Towers Elf Male 513 #> 15 The Two Towers Hobbit Female 0 #> 16 The Two Towers Hobbit Male 2463 #> 17 The Two Towers Man Female 401 #> 18 The Two Towers Man Male 3589 ``` ] --- ## **Columns**: _Vectors_ of values (must be same data type) Extract a column using `$` ``` r lotr$race ``` ``` #> [1] "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" ``` --- ## **Columns**: _Vectors_ of values (must be same data type) Can also use brackets: ``` r lotr$race ``` ``` #> [1] "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" ``` ``` r lotr[,2] ``` ``` #> # A tibble: 18 × 1 #> race #> <chr> #> 1 Elf #> 2 Elf #> 3 Hobbit #> 4 Hobbit #> 5 Man #> 6 Man #> 7 Elf #> 8 Elf #> 9 Hobbit #> 10 Hobbit #> 11 Man #> 12 Man #> 13 Elf #> 14 Elf #> 15 Hobbit #> 16 Hobbit #> 17 Man #> 18 Man ``` --- ## **Rows**: Information about individual observations Information about the first row: ``` r lotr[1,] ``` ``` #> # A tibble: 1 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 ``` -- Information about rows 1 & 2: ``` r lotr[1:2,] ``` ``` #> # A tibble: 2 × 4 #> film race gender word_count #> <chr> <chr> <chr> 
<dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 ``` --- class: inverse ## Quick Practice Read in the `data.csv` file in the "data" folder: ``` r data <- read_csv(here('data', 'data.csv')) ``` Now answer these questions: - How many rows and columns are in the data frame? - What type of data is each column? - Preview the different columns - what do you think this data is about? What might one row represent? - How many unique airlines are in the data frame? - What is the shortest and longest air time for any one flight in the data frame? --- class: center ### The tidyverse: `stringr` + `dplyr` + `readr` + `ggplot2` + ... <center> <img src="images/horst_monsters_tidyverse.jpeg" width="950"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # .center[The main `dplyr` "verbs"] <br> "Verb" | What it does --------------|-------------------- `select()` | Select columns by name `filter()` | Keep rows that match criteria `arrange()` | Sort rows based on column(s) `mutate()` | Create new columns `summarize()` | Create summary values --- # .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**] # .center[`%>%`] -- ## Think of the words "...and then..." 
``` r data %>% do_something() %>% do_something_else() ``` --- class: center, middle, inverse # Select columns with `select()` <br> <center> <img src="images/rstudio-cheatsheet-select.png" width="900"> </center> --- # Select columns with `select()` Select the columns `film` & `race` ``` r lotr %>% select(film, race) ``` ``` #> # A tibble: 18 × 2 #> film race #> <chr> <chr> #> 1 The Fellowship Of The Ring Elf #> 2 The Fellowship Of The Ring Elf #> 3 The Fellowship Of The Ring Hobbit #> 4 The Fellowship Of The Ring Hobbit #> 5 The Fellowship Of The Ring Man #> 6 The Fellowship Of The Ring Man #> 7 The Return Of The King Elf #> 8 The Return Of The King Elf #> 9 The Return Of The King Hobbit #> 10 The Return Of The King Hobbit #> 11 The Return Of The King Man #> 12 The Return Of The King Man #> 13 The Two Towers Elf #> 14 The Two Towers Elf #> 15 The Two Towers Hobbit #> 16 The Two Towers Hobbit #> 17 The Two Towers Man #> 18 The Two Towers Man ``` --- # Select columns with `select()` Use the `-` sign to drop columns ``` r lotr %>% select(-film) ``` ``` #> # A tibble: 18 × 3 #> race gender word_count #> <chr> <chr> <dbl> #> 1 Elf Female 1229 #> 2 Elf Male 971 #> 3 Hobbit Female 14 #> 4 Hobbit Male 3644 #> 5 Man Female 0 #> 6 Man Male 1995 #> 7 Elf Female 183 #> 8 Elf Male 510 #> 9 Hobbit Female 2 #> 10 Hobbit Male 2673 #> 11 Man Female 268 #> 12 Man Male 2459 #> 13 Elf Female 331 #> 14 Elf Male 513 #> 15 Hobbit Female 0 #> 16 Hobbit Male 2463 #> 17 Man Female 401 #> 18 Man Male 3589 ``` --- class: center, middle, inverse # Filter for rows with `filter()` <br> <center> <img src="images/rstudio-cheatsheet-filter.png" width="900"> </center> --- # Filter for rows with `filter()` Keep only the rows with Elf characters ``` r lotr %>% filter(race == "Elf") ``` ``` #> # A tibble: 6 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Return Of The King Elf Female 183 
#> 4 The Return Of The King Elf Male 510 #> 5 The Two Towers Elf Female 331 #> 6 The Two Towers Elf Male 513 ``` --- # Filter for rows with `filter()` Keep only the rows with Elf or Hobbit characters ``` r lotr %>% filter((race == "Elf") | (race == "Hobbit")) ``` ``` #> # A tibble: 12 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Return Of The King Elf Female 183 #> 6 The Return Of The King Elf Male 510 #> 7 The Return Of The King Hobbit Female 2 #> 8 The Return Of The King Hobbit Male 2673 #> 9 The Two Towers Elf Female 331 #> 10 The Two Towers Elf Male 513 #> 11 The Two Towers Hobbit Female 0 #> 12 The Two Towers Hobbit Male 2463 ``` --- # Filter for rows with `filter()` Keep only the rows with Elf or Hobbit characters ``` r lotr %>% filter(race %in% c("Elf", "Hobbit")) ``` ``` #> # A tibble: 12 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Return Of The King Elf Female 183 #> 6 The Return Of The King Elf Male 510 #> 7 The Return Of The King Hobbit Female 2 #> 8 The Return Of The King Hobbit Male 2673 #> 9 The Two Towers Elf Female 331 #> 10 The Two Towers Elf Male 513 #> 11 The Two Towers Hobbit Female 0 #> 12 The Two Towers Hobbit Male 2463 ``` --- # .center[Logic operators for `filter()`] <br> Description | Example ------------|------------ Values greater than 1 | `value > 1` Values greater than or equal to 1 | `value >= 1` Values less than 1 | `value < 1` Values less than or equal to 1 | `value <= 1` Values equal to 1 | `value == 1` Values not equal to 1 | `value != 1` Values in the set c(1, 4) | `value %in% c(1, 4)` --- # Combine 
`filter()` and `select()`

Keep only the rows with Elf characters that spoke more than 1,000 words, then select everything but the race column

``` r
lotr %>%
  filter((race == "Elf") & (word_count > 1000)) %>%
  select(-race)
```

```
#> # A tibble: 1 × 3
#>   film                       gender word_count
#>   <chr>                      <chr>       <dbl>
#> 1 The Fellowship Of The Ring Female       1229
```

---

class: center, middle, inverse

## Create new variables with `mutate()`

<br>

<center>
<img src="images/rstudio-cheatsheet-mutate.png" width="900">
</center>

---

# Create new variables with `mutate()`

Create a new variable, `word1000`, which is `TRUE` if the character spoke 1,000 or more words

``` r
lotr %>%
  mutate(word1000 = word_count >= 1000)
```

```
#> # A tibble: 18 × 5
#>    film                       race   gender word_count word1000
#>    <chr>                      <chr>  <chr>       <dbl> <lgl>
#>  1 The Fellowship Of The Ring Elf    Female       1229 TRUE
#>  2 The Fellowship Of The Ring Elf    Male          971 FALSE
#>  3 The Fellowship Of The Ring Hobbit Female         14 FALSE
#>  4 The Fellowship Of The Ring Hobbit Male         3644 TRUE
#>  5 The Fellowship Of The Ring Man    Female          0 FALSE
#>  6 The Fellowship Of The Ring Man    Male         1995 TRUE
#>  7 The Return Of The King     Elf    Female        183 FALSE
#>  8 The Return Of The King     Elf    Male          510 FALSE
#>  9 The Return Of The King     Hobbit Female          2 FALSE
#> 10 The Return Of The King     Hobbit Male         2673 TRUE
#> 11 The Return Of The King     Man    Female        268 FALSE
#> 12 The Return Of The King     Man    Male         2459 TRUE
#> 13 The Two Towers             Elf    Female        331 FALSE
#> 14 The Two Towers             Elf    Male          513 FALSE
#> 15 The Two Towers             Hobbit Female          0 FALSE
#> 16 The Two Towers             Hobbit Male         2463 TRUE
#> 17 The Two Towers             Man    Female        401 FALSE
#> 18 The Two Towers             Man    Male         3589 TRUE
```

---

# .center[Handling if/else conditions]

### .center[`ifelse(<condition>, <if TRUE>, <else>)`]

--

``` r
lotr %>%
  mutate(word1000 = ifelse(word_count >= 1000, TRUE, FALSE))
```

```
#> # A tibble: 18 × 5
#>    film                       race   gender word_count word1000
#>    <chr>                      <chr>  <chr>       <dbl> <lgl>
#>  1 The Fellowship Of The Ring Elf    Female       1229 TRUE
#>  2 The Fellowship Of The Ring Elf    Male          971 FALSE
#>  3 The Fellowship Of The Ring Hobbit Female         14 FALSE
#>  4 The Fellowship Of The Ring Hobbit Male         3644 TRUE
#>  5 The Fellowship Of The Ring Man    Female          0 FALSE
#>  6 The Fellowship Of The Ring Man    Male         1995 TRUE
#>  7 The Return Of The King     Elf    Female        183 FALSE
#>  8 The Return Of The King     Elf    Male          510 FALSE
#>  9 The Return Of The King     Hobbit Female          2 FALSE
#> 10 The Return Of The King     Hobbit Male         2673 TRUE
#> 11 The Return Of The King     Man    Female        268 FALSE
#> 12 The Return Of The King     Man    Male         2459 TRUE
#> 13 The Two Towers             Elf    Female        331 FALSE
#> 14 The Two Towers             Elf    Male          513 FALSE
#> 15 The Two Towers             Hobbit Female          0 FALSE
#> 16 The Two Towers             Hobbit Male         2463 TRUE
#> 17 The Two Towers             Man    Female        401 FALSE
#> 18 The Two Towers             Man    Male         3589 TRUE
```

---

# Sort data frame with `arrange()`

Sort the `lotr` data frame by `word_count`

``` r
lotr %>%
  arrange(word_count)
```

```
#> # A tibble: 18 × 4
#>    film                       race   gender word_count
#>    <chr>                      <chr>  <chr>       <dbl>
#>  1 The Fellowship Of The Ring Man    Female          0
#>  2 The Two Towers             Hobbit Female          0
#>  3 The Return Of The King     Hobbit Female          2
#>  4 The Fellowship Of The Ring Hobbit Female         14
#>  5 The Return Of The King     Elf    Female        183
#>  6 The Return Of The King     Man    Female        268
#>  7 The Two Towers             Elf    Female        331
#>  8 The Two Towers             Man    Female        401
#>  9 The Return Of The King     Elf    Male          510
#> 10 The Two Towers             Elf    Male          513
#> 11 The Fellowship Of The Ring Elf    Male          971
#> 12 The Fellowship Of The Ring Elf    Female       1229
#> 13 The Fellowship Of The Ring Man    Male         1995
#> 14 The Return Of The King     Man    Male         2459
#> 15 The Two Towers             Hobbit Male         2463
#> 16 The Return Of The King     Hobbit Male         2673
#> 17 The Two Towers             Man    Male         3589
#> 18 The Fellowship Of The Ring Hobbit Male         3644
```

---

# Sort data frame with `arrange()`

Use the `desc()` function to sort in descending order

``` r
lotr %>%
  arrange(desc(word_count))
```

```
#> # A tibble: 18 × 4
#>    film                       race   gender word_count
#>    <chr>                      <chr>  <chr>       <dbl>
#>  1 The Fellowship Of The Ring Hobbit Male         3644
#>  2 The Two Towers             Man    Male         3589
#>  3 The Return Of The King     Hobbit Male         2673
#>  4 The Two Towers             Hobbit Male         2463
#>  5 The Return Of The King     Man    Male         2459
#>  6 The Fellowship Of The Ring Man    Male         1995
#>  7 The Fellowship Of The Ring Elf    Female       1229
#>  8 The Fellowship Of The Ring Elf    Male          971
#>  9 The Two Towers             Elf    Male          513
#> 10 The Return Of The King     Elf    Male          510
#> 11 The Two Towers             Man    Female        401
#> 12 The Two Towers             Elf    Female        331
#> 13 The Return Of The King     Man    Female        268
#> 14 The Return Of The King     Elf    Female        183
#> 15 The Fellowship Of The Ring Hobbit Female         14
#> 16 The Return Of The King     Hobbit Female          2
#> 17 The Fellowship Of The Ring Man    Female          0
#> 18 The Two Towers             Hobbit Female          0
```

---

class: inverse
# Your turn

Read in the `data.csv` file in the "data" folder:

``` r
data <- read_csv(here('data', 'data.csv'))
```

Now answer these questions:

.font80[

- Create a new data frame, `flights_fall`, that contains only flights that departed in the fall semester.
- Create a new data frame, `flights_dc`, that contains only flights that flew to DC airports (Reagan or Dulles).
- Create a new data frame, `flights_dc_carrier`, that contains only flights that flew to DC airports (Reagan or Dulles) and only the columns about the month and airline.
- How many unique airlines were flying to DC airports in July?
- Create a new variable, `speed`, in miles per hour using the `air_time` (minutes) and `distance` (miles) variables.
- Which flight flew the fastest?
- Remove rows that have `NA` for `air_time` and re-arrange the resulting data frame based on the longest air time and longest flight distance.

]

---

class: inverse, middle

# Week 1: .fancy[Getting Started]

### 1. Course Goal
### 2. Course Introduction
### 3. Break: Install Stuff
### 4. Quarto
### 5. Workflow & Reading In Data
### 6. Wrangling Data
### 7. .orange[Visualizing Data]

---

.leftcol[

<img src="images/making_a_ggplot.jpeg" width=600>

]

.rightcol[

# "Grammar of Graphics"

Concept developed by Leland Wilkinson (1999)

**ggplot2** package developed by Hadley Wickham (2005)

]

---

# Making plot layers with ggplot2

<br>

### 1. The data
### 2. The aesthetic mapping (what goes on the axes?)
### 3. The geometries (points? bars? etc.)
### 4. The annotations / labels
### 5. The theme

---

# Layer 1: The data

``` r
head(mpg)
```

```
#> # A tibble: 6 × 11
#>   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compact
#> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compact
#> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compact
#> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compact
#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compact
#> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compact
```

---

# Layer 1: The data

The `ggplot()` function initializes the plot with whatever data you're using

.leftcol[

``` r
mpg %>%
  ggplot()
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-56-1.png" width="504" />

]]

---

# Layer 2: The aesthetic mapping

The `aes()` function determines which variables will be _mapped_ to the geometries<br>(e.g. the axes)

.leftcol[

``` r
mpg %>%
* ggplot(aes(x = displ, y = hwy))
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-57-1.png" width="504" />

]]

---

# Layer 3: The geometries

Use `+` to add geometries, e.g.
`geom_point()` for points

.leftcol[

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
* geom_point()
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-58-1.png" width="504" />

]]

---

# Layer 4: The annotations / labels

Use `labs()` to modify most labels

.leftcol[

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* labs(
*   x = "Engine displacement (liters)",
*   y = "Highway fuel economy (mpg)",
*   title = "Most larger engine vehicles are less fuel efficient"
* )
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-59-1.png" width="504" />

]]

---

# Layer 5: The theme

.leftcol[

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)",
    title = "Most larger engine vehicles are less fuel efficient"
  ) +
* theme_bw()
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-60-1.png" width="504" />

]]

---

### Common themes

.leftcol[

`theme_bw()`

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_bw()
```

<img src="figs/unnamed-chunk-61-1.png" width="432" />

]

.rightcol[

`theme_minimal()`

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_minimal()
```

<img src="figs/unnamed-chunk-62-1.png" width="432" />

]

---

### Common themes

.leftcol[

`theme_classic()`

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_classic()
```

<img src="figs/unnamed-chunk-63-1.png" width="432" />

]

.rightcol[

`theme_void()`

``` r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_void()
```

<img src="figs/unnamed-chunk-64-1.png" width="432" />

]

---

class: middle, inverse

.leftcol[

<img src="figs/unnamed-chunk-65-1.png" width="522.144" />

<img src="figs/unnamed-chunk-66-1.png" width="522.144" />

]

.rightcol[
## Your turn

Open `practice.qmd`

Use the `mpg` data frame and ggplot to create these charts

<img src="figs/unnamed-chunk-68-1.png" width="522.144" />

]

---

class: inverse

# Extra practice

.leftcol[

<img src="figs/ggbar_p1-1.png" width="504" />

]

.rightcol[

<img src="figs/unnamed-chunk-69-1.png" width="432" />

]
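
---

# Extra practice: from wrangling to a plot

A minimal sketch of how the wrangling verbs and the five plot layers from today might fit together in one pipeline. The `flights_demo` data frame and all of its values are made up for illustration (they are not taken from `data.csv`); only the `air_time` (minutes) and `distance` (miles) column names follow the wrangling exercise, and `carrier` is an assumed column name.

``` r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, NOT the course data set
flights_demo <- tibble(
  carrier  = c("AA", "DL", "UA", "WN"),
  air_time = c(60, 120, NA, 90),     # minutes
  distance = c(400, 900, 700, 450)   # miles
)

flights_speed <- flights_demo %>%
  filter(!is.na(air_time)) %>%                    # drop rows with missing air time
  mutate(speed = distance / (air_time / 60)) %>%  # minutes -> hours, so speed is in mph
  arrange(desc(speed))                            # fastest first

speed_plot <- flights_speed %>%
  ggplot(aes(x = speed, y = carrier)) +     # Layers 1 and 2: data + aesthetic mapping
  geom_col() +                              # Layer 3: geometry
  labs(x = "Speed (mph)", y = "Carrier") +  # Layer 4: labels
  theme_bw()                                # Layer 5: theme

speed_plot
```

Dividing `air_time` by 60 converts minutes to hours before computing miles per hour; the same pattern should carry over to the real data as long as the units match.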