class: middle, inverse .leftcol30[ <center> <img src="https://raw.githubusercontent.com/emse-eda-gwu/2022-Fall/master/images/logo.png" width=250> </center> ] .rightcol70[ # Week 1: .fancy[Getting Started] ###
EMSE 4575: Exploratory Data Analysis ###
John Paul Helveston ###
August 31, 2022 ] --- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. Course Goal ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. Workflow & Reading In Data ## 5. Wrangling Data ## 6. Visualizing Data --- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. .orange[Course Goal] ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. Workflow & Reading In Data ## 5. Wrangling Data ## 6. Visualizing Data --- ## Course 1: [Intro to Programming for Analytics](https://p4a.seas.gwu.edu/) **"Computational Literacy"** - Programming: Conditionals (if/else), loops, functions, testing, data types. - Analytics: Data structures, import / export, basic data manipulation & visualization. -- ## Course 2: [Exploratory Data Analysis](https://eda.seas.gwu.edu/) **"Data Literacy"** - Strategies for conducting an exploratory data analysis. - Design principles for visualizing and communicating _information_ extracted from data. - Reproducibility: Reports that contain code, equations, visualizations, and narrative text. --- class: center, inverse, middle # **Class goal**: translate _data_ into _information_ --- class: center # **Class goal**: translate _data_ into _information_ -- .leftcol[ **Data** Average student engagement scores Class | Type | City | County ------------|-------------|------|------- Special Ed. | Charter | 643 | 793 Special Ed. | Public | 735 | 928 General Ed. | Charter | 590 | 724 General Ed. 
| Public | 863 | 662 ] -- .rightcol[ **Information** <img src="figs/student-engagement-final-1.png" width="432" /> ] --- # Data exploration: an iterative process -- .leftcol[ Encode data: .code60[ ```r engagement_data <- data.frame( City = c(643, 735, 590, 863), County = c(793, 928, 724, 662), School = c('Special Ed., Charter', 'Special Ed., Public', 'General Ed., Charter', 'General Ed., Public')) engagement_data ``` ``` #> City County School #> 1 643 793 Special Ed., Charter #> 2 735 928 Special Ed., Public #> 3 590 724 General Ed., Charter #> 4 863 662 General Ed., Public ``` ]] -- .rightcol[ Re-format data for plotting: .code60[ ```r engagement_data <- engagement_data %>% gather(Location, Engagement, City:County) %>% mutate(Location = fct_relevel( Location, c('City', 'County'))) engagement_data ``` ``` #> School Location Engagement #> 1 Special Ed., Charter City 643 #> 2 Special Ed., Public City 735 #> 3 General Ed., Charter City 590 #> 4 General Ed., Public City 863 #> 5 Special Ed., Charter County 793 #> 6 Special Ed., Public County 928 #> 7 General Ed., Charter County 724 #> 8 General Ed., Public County 662 ``` ]] --- # Data exploration: an iterative process .leftcol[ Initial exploratory plotting: .code60[ ```r engagement_data %>% ggplot() + geom_col(aes(x = Engagement, y = School, fill = Location), position = 'dodge') ``` <img src="figs/student-engagement-bars1-1.png" width="432" /> ]] -- .rightcol[ More exploratory plotting:<br>highlight difference <img src="figs/student-engagement-bars2-1.png" width="432" /> ] --- # Data exploration: an iterative process .leftcol[ Directly label figure: <img src="figs/student-engagement-bars3-1.png" width="432" /> ] -- .rightcol[ Remove unnecessary axes, change colors, fix labels: <img src="figs/unnamed-chunk-5-1.png" width="432" /> ] --- **A fully reproducible analysis** .panelset[ .panel[.panel-name[Code] .code40[.leftcol[ ```r data <- data.frame( City = c(643, 735, 590, 863), County = c(793, 928, 724, 662), School = 
c('Special Ed., Charter', 'Special Ed., Public', 'General Ed., Charter', 'General Ed., Public'), Highlight = c(0, 0, 0, 1)) %>% gather(Location, Engagement, City:County) %>% mutate( Location = fct_relevel(Location, c('City', 'County')), Highlight = as.factor(Highlight), x = ifelse(Location == 'County', 1, 0)) ``` ]
.rightcol[
```r
library(ggrepel)  # geom_text_repel()
library(cowplot)  # theme_cowplot(), background_grid()

plot <- ggplot(data, aes(x = x, y = Engagement, group = School, color = Highlight)) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c('#757575', '#ed573e')) +
  labs(x = 'Location', y = 'Engagement',
       title = paste0('Students in public, general education classes\n',
                      'in county schools have surprisingly low engagement')) +
  scale_x_continuous(limits = c(-1.2, 1.2), labels = c('City', 'County'), breaks = c(0, 1)) +
  # Label the County values (right side), using the same `data` frame as the plot
  geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                  data = subset(data, Location == 'County'),
                  size = 5, nudge_x = 0.1, segment.color = NA) +
  # Label the City values (left side)
  geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                  data = subset(data, Location == 'City'),
                  size = 5, nudge_x = -0.1, segment.color = NA) +
  # Label each line with its school type
  geom_text_repel(aes(label = School, color = as.factor(Highlight)),
                  data = subset(data, Location == 'City'),
                  size = 5, nudge_x = -0.25, hjust = 1, segment.color = NA) +
  theme_cowplot() +
  background_grid(major = 'x') +
  theme(axis.line = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        legend.position = 'none')
```
]]]

.panel[.panel-name[Plot]

<img src="figs/unnamed-chunk-8-1.png" width="432" />

]]

---

class: inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. .orange[Course Introduction]
## 3. Break: Install Stuff
## 4. Workflow & Reading In Data
## 5. Wrangling Data
## 6. Visualizing Data

---

# Meet your instructor!

.leftcol30[.circle[
<img src="images/helveston.jpg" width="300">
]]

.rightcol70[

### John Helveston, Ph.D.
.font80[ - 2018 - Present Assistant Professor, Engineering Management & Systems Engineering - 2016-2018 Postdoc at [Institute for Sustainable Energy](https://www.bu.edu/ise/), Boston University - 2016 PhD in Engineering & Public Policy at Carnegie Mellon University - 2015 MS in Engineering & Public Policy at Carnegie Mellon University - 2010 BS in Engineering Science & Mechanics at Virginia Tech - Website: [www.jhelvy.com](http://www.jhelvy.com/) ]] --- # Meet your tutors! .leftcol30[.circle[ <img src="images/rossetti.jpg" width="300"> ]] .rightcol70[ ### **Michael Rossetti** - Graduate Assistant (GA) - PhD student in EMSE ] --- # Meet your tutors! .leftcol30[.circle[ <img src="images/ottinger.png" width="300"> ]] .rightcol70[ ### **Eliese Ottinger** - Learning Assistant (LA) - EMSE Senior & P4A / EDA alumni ] --- # Prerequisites ## [EMSE 4574: Intro to Programming for Analytics](https://p4a.seas.gwu.edu/2020-Fall/) You should be able to: - Use RStudio to write basic R commands. - Know the distinctions between different R operators and data types, including numeric, string, and logical data. - Use **tidyverse** functions to wrangle and manipulate data in R. - Use the **ggplot2** library to create plots in R. -- > [
Check out R for Analytics Primer](http://jhelvy.github.io/r4aPrimer/) --- # Course website ##
Everything you need will be on the course website:<br>https://eda.seas.gwu.edu/2022-Fall/ -- ##
The [schedule](https://emse-eda-gwu.github.io/2022-Fall/schedule.html) is the best starting point --- # **Quizzes** (8% of grade) -- ##
At the start of class every other week-ish, unannounced. Make-ups only for excused absences (i.e., don't be late). -- ##
5 total, lowest dropped -- ##
~5 - 10 minutes -- > **Why quiz at all?** The "retrieval effect" - basically, you have to _practice_ remembering things, otherwise your brain won't remember them (see the book ["Make It Stick: The Science of Successful Learning"](https://www.hup.harvard.edu/catalog.php?isbn=9780674729018)) --- ## Assignments -- ## 1)
Weekly Homework / Readings: [HW1](https://eda.seas.gwu.edu/2022-Fall/hw/1-tidy-data.html) -- ## 2)
3 Mini Projects (due 2 weeks from date assigned) -- ## 3)
[Final Project](https://eda.seas.gwu.edu/2022-Fall/project-final/0-overview.html) (Teams of 2 - 3 students) Item | Due Date ----------------|--------------- Proposal | March 12 Progress Report | April 16 Final Report | April 30 Presentation | May 03 Interview | Exam week --- background-color: #FFF # .center[Grades] <center> <img src="https://eda.seas.gwu.edu/2022-Fall/figs/grade-breakdown-1.png" width=90%> </center> --- # .center[Grades] Item | Weight | Notes -------------------------------|--------|------------------------------------- Weekly HW | 12 % | Quizzes | 8 % | 5 quizzes, lowest dropped Mini Project 1 | 8 % | Individual assignments Mini Project 2 | 8 % | Mini Project 3 | 8 % | Final Project: Proposal | 9 % | Teams of 2-3 students Final Project: Progress Report | 12 % | Final Project: Report | 16 % | Final Project: Presentation | 9 % | Final Interview | 10 % | Individual interview --- # Course policies -- .leftcol35[ - ## BE NICE - ## BE HONEST - ## DON'T CHEAT ] -- .rightcol65[ ## Copying is good, stealing is bad > "Plagiarism is trying to pass someone else’s work off as your own. Copying is about reverse-engineering." > > .right[-- Austin Kleon, from [Steal Like An Artist](https://austinkleon.com/steal/) ] ] --- # Late submissions ## - **5** late days - use them anytime, no questions asked ## - No more than **2** late days on any one assignment ## - Contact me for special cases --- # How to succeed in this class -- ##
Participate during class! -- ##
Start assignments early and **read carefully**! -- ##
Actually read (before class)! -- ##
Get sleep and take breaks often! -- ##
Ask for help! --- # [Getting Help](https://eda.seas.gwu.edu/2022-Fall/help/getting-help.html) -- ##
Use [Slack](https://emse-eda-f22.slack.com/) to ask questions. -- ##
Meet with your tutors -- ##
[Schedule a meeting](https://jhelvy.appointlet.com/b/professor-helveston) w/Prof. Helveston: - Mondays from 8:00-5:00pm - Wednesdays from 3:20-5:00pm - Thursdays from 12:00-5:00pm -- ##
[GW Coders](http://gwcoders.github.io/) --- #
[Course Software](https://eda.seas.gwu.edu/2022-Fall/help/course-software.html) -- ##
[Slack](https://emse-eda-f22.slack.com/): See Blackboard for the link to join;<br>install on phone and **turn notifications on**! -- ##
[R](https://cloud.r-project.org/) & [RStudio](https://rstudio.com/products/rstudio/download/) (Install both) -- ##
[RStudio Cloud](https://rstudio.cloud/) (Register for free!) --- class: inverse, center <br> # .fancy[Break] # Install Stuff
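---

# Install Stuff: R packages

A minimal sketch, not the official setup instructions: once R and RStudio are installed, something like the code below could check which packages from today's slides are still missing. The package list is an assumption taken from the `library()` calls in this deck, and the names `course_packages` and `missing_packages` are illustrative. The Course Software page remains the authoritative list.

```r
# Packages loaded in today's slides (assumed list, taken from this deck)
course_packages <- c("tidyverse", "here", "readxl")

# Which of them are not yet installed on this machine?
installed <- rownames(installed.packages())
missing_packages <- setdiff(course_packages, installed)
missing_packages

# Install only what is missing (uncomment to run; needs an internet connection)
# install.packages(missing_packages)
```

The install line is left commented out so the check itself has no side effects.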
--- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. Course Goal ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. .orange[Workflow & Reading In Data] ## 5. Wrangling Data ## 6. Visualizing Data --- ## Workflow for reading in data 1) Use R Projects (.Rproj files) to organize your analysis - **don't double-click .R files**! <img src = "images/rproj.png" width = "75"> -- 2) Use the `here` package to create file paths ```r path <- here::here("folder", "file.csv") ``` -- 3) Import data with these functions: File type | Function | Library -----------|----------------|---------- `.csv` | `read_csv()` | **readr** `.txt` | `read.table()` | **utils** `.xlsx` | `read_excel()` | **readxl** --- # Importing Comma Separated Values (.csv) Read in `.csv` files with `read_csv()`: ```r library(tidyverse) library(here) csvPath <- here('data', 'milk_production.csv') *milk_production <- read_csv(csvPath) head(milk_production) ``` ``` #> # A tibble: 6 × 4 #> region state year milk_produced #> <chr> <chr> <dbl> <dbl> #> 1 Northeast Maine 1970 619000000 #> 2 Northeast New Hampshire 1970 356000000 #> 3 Northeast Vermont 1970 1970000000 #> 4 Northeast Massachusetts 1970 658000000 #> 5 Northeast Rhode Island 1970 75000000 #> 6 Northeast Connecticut 1970 661000000 ``` --- # Importing Text Files (.txt) Read in `.txt` files with `read.table()`: ```r txtPath <- here('data', 'nasa_global_temps.txt') *global_temps <- read.table(txtPath, skip = 5, header = FALSE) head(global_temps) ``` ``` #> V1 V2 V3 #> 1 1880 -0.15 -0.08 #> 2 1881 -0.07 -0.12 #> 3 1882 -0.10 -0.15 #> 4 1883 -0.16 -0.19 #> 5 1884 -0.27 -0.23 #> 6 1885 -0.32 -0.25 ``` --- # Importing Text Files (.txt) Read in `.txt` files with `read.table()`: ```r txtPath <- here('data', 'nasa_global_temps.txt') global_temps <- read.table(txtPath, skip = 5, header = FALSE) *names(global_temps) <- c('year', 'no_smoothing', 'loess') # Add header head(global_temps) ``` ``` #> year no_smoothing loess #> 1 1880 -0.15 -0.08 #> 2 
1881 -0.07 -0.12 #> 3 1882 -0.10 -0.15 #> 4 1883 -0.16 -0.19 #> 5 1884 -0.27 -0.23 #> 6 1885 -0.32 -0.25 ``` --- # Importing Excel Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ```r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') *pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) ``` .code70[ ```r glimpse(pv_cells) ``` ``` #> Rows: 25 #> Columns: 10 #> $ Year <chr> NA, NA, "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", NA, "Note: NA = data not available.", NA, "Source: Compiled by E… #> $ China <chr> "Megawatts", NA, "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "24139… #> $ Taiwan <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.565000000000… #> $ Japan <dbl> NA, NA, 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0, NA, NA, NA, NA #> $ Malaysia <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "0", "0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59", NA, NA, NA, NA #> $ Germany <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054", … #> $ `South Korea` <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", 
"1127.0999999999999", NA, NA, NA, NA #> $ `United States` <dbl> NA, NA, 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250, NA, NA, NA, NA #> $ Others <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "66… #> $ World <dbl> NA, NA, 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496, NA, NA, NA, NA ``` ] --- # Importing Excel Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ```r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) %>% * mutate(Year = as.numeric(Year)) %>% # Convert "non-years" to NA * filter(!is.na(Year)) # Drop NA rows in Year ``` .code60[ ```r glimpse(pv_cells) ``` ``` #> Rows: 19 #> Columns: 10 #> $ Year <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 #> $ China <chr> "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "24139.014999999999", "… #> $ Taiwan <chr> "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.5650000000005" #> $ Japan <dbl> 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0 #> $ Malaysia <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", 
"0", "0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59" #> $ Germany <chr> "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054", "1054.88… #> $ `South Korea` <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", "1127.0999999999999" #> $ `United States` <dbl> 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250 #> $ Others <chr> "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "663.660000… #> $ World <dbl> 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496 ``` ] --- class: inverse
# Your turn

Open the `practice.Rmd` file.

Write code to import the following data files from the "data" folder:

- For `lotr_words.csv`, call the data frame `lotr`
- For `north_america_bear_killings.txt`, call the data frame `bears`
- For `uspto_clean_energy_patents.xlsx`, call the data frame `patents`

---

class: inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. Course Introduction
## 3. Break: Install Stuff
## 4. Workflow & Reading In Data
## 5. .orange[Wrangling Data]
## 6. Visualizing Data

---

.leftcol[

# .center[The data frame...<br>in .darkgreen[Excel]]

<center>
<img src="images/spreadsheet.png" width=340>
</center>

]

.rightcol[

# .center[The data frame...<br>in R
] ```r lotr ``` ``` #> # A tibble: 18 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Fellowship Of The Ring Man Female 0 #> 6 The Fellowship Of The Ring Man Male 1995 #> 7 The Return Of The King Elf Female 183 #> 8 The Return Of The King Elf Male 510 #> 9 The Return Of The King Hobbit Female 2 #> 10 The Return Of The King Hobbit Male 2673 #> 11 The Return Of The King Man Female 268 #> 12 The Return Of The King Man Male 2459 #> 13 The Two Towers Elf Female 331 #> 14 The Two Towers Elf Male 513 #> 15 The Two Towers Hobbit Female 0 #> 16 The Two Towers Hobbit Male 2463 #> 17 The Two Towers Man Female 401 #> 18 The Two Towers Man Male 3589 ``` ] --- ## **Columns**: _Vectors_ of values (must be same data type) Extract a column using `$` ```r lotr$race ``` ``` #> [1] "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" ``` --- ## **Columns**: _Vectors_ of values (must be same data type) Can also use brackets: ```r lotr$race ``` ``` #> [1] "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" "Elf" "Elf" "Hobbit" "Hobbit" "Man" "Man" ``` ```r lotr[,2] ``` ``` #> # A tibble: 18 × 1 #> race #> <chr> #> 1 Elf #> 2 Elf #> 3 Hobbit #> 4 Hobbit #> 5 Man #> 6 Man #> 7 Elf #> 8 Elf #> 9 Hobbit #> 10 Hobbit #> 11 Man #> 12 Man #> 13 Elf #> 14 Elf #> 15 Hobbit #> 16 Hobbit #> 17 Man #> 18 Man ``` --- ## **Rows**: Information about individual observations Information about the first row: ```r lotr[1,] ``` ``` #> # A tibble: 1 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 ``` -- Information about rows 1 & 2: ```r lotr[1:2,] ``` ``` #> # A tibble: 2 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> 
#> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 ``` --- class: inverse ## Quick Practice Read in the `data.csv` file in the "data" folder: ```r data <- read_csv(here('data', 'data.csv')) ``` Now answer these questions: - How many rows and columns are in the data frame? - What type of data is each column? - Preview the different columns - what do you think this data is about? What might one row represent? - How many unique airlines are in the data frame? - What is the shortest and longest air time for any one flight in the data frame? --- class: center ### The tidyverse: `stringr` + `dplyr` + `readr` + `ggplot2` + ... <center> <img src="images/horst_monsters_tidyverse.jpeg" width="950"> </center>Art by [Allison Horst](https://www.allisonhorst.com/) --- # .center[The main `dplyr` "verbs"] <br> "Verb" | What it does --------------|-------------------- `select()` | Select columns by name `filter()` | Keep rows that match criteria `arrange()` | Sort rows based on column(s) `mutate()` | Create new columns `summarize()` | Create summary values --- # .center[Core `tidyverse` concept:<br>**Chain functions together with "pipes"**] # .center[`%>%`] -- ## Think of the words "...and then..." 
```r data %>% do_something() %>% do_something_else() ``` --- class: center, middle, inverse # Select columns with `select()` <br> <center> <img src="images/rstudio-cheatsheet-select.png" width="900"> </center> --- # Select columns with `select()` Select the columns `film` & `race` ```r lotr %>% select(film, race) ``` ``` #> # A tibble: 18 × 2 #> film race #> <chr> <chr> #> 1 The Fellowship Of The Ring Elf #> 2 The Fellowship Of The Ring Elf #> 3 The Fellowship Of The Ring Hobbit #> 4 The Fellowship Of The Ring Hobbit #> 5 The Fellowship Of The Ring Man #> 6 The Fellowship Of The Ring Man #> 7 The Return Of The King Elf #> 8 The Return Of The King Elf #> 9 The Return Of The King Hobbit #> 10 The Return Of The King Hobbit #> 11 The Return Of The King Man #> 12 The Return Of The King Man #> 13 The Two Towers Elf #> 14 The Two Towers Elf #> 15 The Two Towers Hobbit #> 16 The Two Towers Hobbit #> 17 The Two Towers Man #> 18 The Two Towers Man ``` --- # Select columns with `select()` Use the `-` sign to drop columns ```r lotr %>% select(-film) ``` ``` #> # A tibble: 18 × 3 #> race gender word_count #> <chr> <chr> <dbl> #> 1 Elf Female 1229 #> 2 Elf Male 971 #> 3 Hobbit Female 14 #> 4 Hobbit Male 3644 #> 5 Man Female 0 #> 6 Man Male 1995 #> 7 Elf Female 183 #> 8 Elf Male 510 #> 9 Hobbit Female 2 #> 10 Hobbit Male 2673 #> 11 Man Female 268 #> 12 Man Male 2459 #> 13 Elf Female 331 #> 14 Elf Male 513 #> 15 Hobbit Female 0 #> 16 Hobbit Male 2463 #> 17 Man Female 401 #> 18 Man Male 3589 ``` --- class: center, middle, inverse # Filter for rows with `filter()` <br> <center> <img src="images/rstudio-cheatsheet-filter.png" width="900"> </center> --- # Filter for rows with `filter()` Keep only the rows with Elf characters ```r lotr %>% filter(race == "Elf") ``` ``` #> # A tibble: 6 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Return Of The King Elf Female 183 #> 4 
The Return Of The King Elf Male 510 #> 5 The Two Towers Elf Female 331 #> 6 The Two Towers Elf Male 513 ``` --- # Filter for rows with `filter()` Keep only the rows with Elf or Hobbit characters ```r lotr %>% filter((race == "Elf") | (race == "Hobbit")) ``` ``` #> # A tibble: 12 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Return Of The King Elf Female 183 #> 6 The Return Of The King Elf Male 510 #> 7 The Return Of The King Hobbit Female 2 #> 8 The Return Of The King Hobbit Male 2673 #> 9 The Two Towers Elf Female 331 #> 10 The Two Towers Elf Male 513 #> 11 The Two Towers Hobbit Female 0 #> 12 The Two Towers Hobbit Male 2463 ``` --- # Filter for rows with `filter()` Keep only the rows with Elf or Hobbit characters ```r lotr %>% filter(race %in% c("Elf", "Hobbit")) ``` ``` #> # A tibble: 12 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Elf Female 1229 #> 2 The Fellowship Of The Ring Elf Male 971 #> 3 The Fellowship Of The Ring Hobbit Female 14 #> 4 The Fellowship Of The Ring Hobbit Male 3644 #> 5 The Return Of The King Elf Female 183 #> 6 The Return Of The King Elf Male 510 #> 7 The Return Of The King Hobbit Female 2 #> 8 The Return Of The King Hobbit Male 2673 #> 9 The Two Towers Elf Female 331 #> 10 The Two Towers Elf Male 513 #> 11 The Two Towers Hobbit Female 0 #> 12 The Two Towers Hobbit Male 2463 ``` --- # .center[Logic operators for `filter()`] <br> Description | Example ------------|------------ Values greater than 1 | `value > 1` Values greater than or equal to 1 | `value >= 1` Values less than 1 | `value < 1` Values less than or equal to 1 | `value <= 1` Values equal to 1 | `value == 1` Values not equal to 1 | `value != 1` Values in the set c(1, 4) | `value %in% c(1, 4)` --- # Combine `filter()` 
and `select()` Keep only the rows with Elf characters that spoke more than 1000 words, then select everything but the race column ```r lotr %>% filter((race == "Elf") & (word_count > 1000)) %>% select(-race) ``` ``` #> # A tibble: 1 × 3 #> film gender word_count #> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Female 1229 ``` --- class: center, middle, inverse ## Create new variables with `mutate()` <br> <center> <img src="images/rstudio-cheatsheet-mutate.png" width="900"> </center> --- # Create new variables with `mutate()` Create a new variable, `word1000` which is `TRUE` if the character spoke 1,000 or more words ```r lotr %>% mutate(word1000 = word_count >= 1000) ``` ``` #> # A tibble: 18 × 5 #> film race gender word_count word1000 #> <chr> <chr> <chr> <dbl> <lgl> #> 1 The Fellowship Of The Ring Elf Female 1229 TRUE #> 2 The Fellowship Of The Ring Elf Male 971 FALSE #> 3 The Fellowship Of The Ring Hobbit Female 14 FALSE #> 4 The Fellowship Of The Ring Hobbit Male 3644 TRUE #> 5 The Fellowship Of The Ring Man Female 0 FALSE #> 6 The Fellowship Of The Ring Man Male 1995 TRUE #> 7 The Return Of The King Elf Female 183 FALSE #> 8 The Return Of The King Elf Male 510 FALSE #> 9 The Return Of The King Hobbit Female 2 FALSE #> 10 The Return Of The King Hobbit Male 2673 TRUE #> 11 The Return Of The King Man Female 268 FALSE #> 12 The Return Of The King Man Male 2459 TRUE #> 13 The Two Towers Elf Female 331 FALSE #> 14 The Two Towers Elf Male 513 FALSE #> 15 The Two Towers Hobbit Female 0 FALSE #> 16 The Two Towers Hobbit Male 2463 TRUE #> 17 The Two Towers Man Female 401 FALSE #> 18 The Two Towers Man Male 3589 TRUE ``` --- # .center[Handling if/else conditions] ### .center[`ifelse(<condition>, <if TRUE>, <else>)`] -- ```r lotr %>% mutate(word1000 = ifelse(word_count >= 1000, TRUE, FALSE)) ``` ``` #> # A tibble: 18 × 5 #> film race gender word_count word1000 #> <chr> <chr> <chr> <dbl> <lgl> #> 1 The Fellowship Of The Ring Elf Female 1229 TRUE #> 2 The Fellowship Of 
The Ring Elf Male 971 FALSE #> 3 The Fellowship Of The Ring Hobbit Female 14 FALSE #> 4 The Fellowship Of The Ring Hobbit Male 3644 TRUE #> 5 The Fellowship Of The Ring Man Female 0 FALSE #> 6 The Fellowship Of The Ring Man Male 1995 TRUE #> 7 The Return Of The King Elf Female 183 FALSE #> 8 The Return Of The King Elf Male 510 FALSE #> 9 The Return Of The King Hobbit Female 2 FALSE #> 10 The Return Of The King Hobbit Male 2673 TRUE #> 11 The Return Of The King Man Female 268 FALSE #> 12 The Return Of The King Man Male 2459 TRUE #> 13 The Two Towers Elf Female 331 FALSE #> 14 The Two Towers Elf Male 513 FALSE #> 15 The Two Towers Hobbit Female 0 FALSE #> 16 The Two Towers Hobbit Male 2463 TRUE #> 17 The Two Towers Man Female 401 FALSE #> 18 The Two Towers Man Male 3589 TRUE ``` --- # Sort data frame with `arrange()` Sort the `lotr` data frame by `word_count` ```r lotr %>% arrange(word_count) ``` ``` #> # A tibble: 18 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of The Ring Man Female 0 #> 2 The Two Towers Hobbit Female 0 #> 3 The Return Of The King Hobbit Female 2 #> 4 The Fellowship Of The Ring Hobbit Female 14 #> 5 The Return Of The King Elf Female 183 #> 6 The Return Of The King Man Female 268 #> 7 The Two Towers Elf Female 331 #> 8 The Two Towers Man Female 401 #> 9 The Return Of The King Elf Male 510 #> 10 The Two Towers Elf Male 513 #> 11 The Fellowship Of The Ring Elf Male 971 #> 12 The Fellowship Of The Ring Elf Female 1229 #> 13 The Fellowship Of The Ring Man Male 1995 #> 14 The Return Of The King Man Male 2459 #> 15 The Two Towers Hobbit Male 2463 #> 16 The Return Of The King Hobbit Male 2673 #> 17 The Two Towers Man Male 3589 #> 18 The Fellowship Of The Ring Hobbit Male 3644 ``` --- # Sort data frame with `arrange()` Use the `desc()` function to sort in descending order ```r lotr %>% arrange(desc(word_count)) ``` ``` #> # A tibble: 18 × 4 #> film race gender word_count #> <chr> <chr> <chr> <dbl> #> 1 The Fellowship Of 
The Ring Hobbit Male 3644 #> 2 The Two Towers Man Male 3589 #> 3 The Return Of The King Hobbit Male 2673 #> 4 The Two Towers Hobbit Male 2463 #> 5 The Return Of The King Man Male 2459 #> 6 The Fellowship Of The Ring Man Male 1995 #> 7 The Fellowship Of The Ring Elf Female 1229 #> 8 The Fellowship Of The Ring Elf Male 971 #> 9 The Two Towers Elf Male 513 #> 10 The Return Of The King Elf Male 510 #> 11 The Two Towers Man Female 401 #> 12 The Two Towers Elf Female 331 #> 13 The Return Of The King Man Female 268 #> 14 The Return Of The King Elf Female 183 #> 15 The Fellowship Of The Ring Hobbit Female 14 #> 16 The Return Of The King Hobbit Female 2 #> 17 The Fellowship Of The Ring Man Female 0 #> 18 The Two Towers Hobbit Female 0 ``` --- class: inverse
# Your turn Read in the `data.csv` file in the "data" folder: ```r data <- read_csv(here('data', 'data.csv')) ``` Now answer these questions: .font80[ - Create a new data frame, `flights_fall`, that contains only flights that departed in the fall semester. - Create a new data frame, `flights_dc`, that contains only flights that flew to DC airports (Reagan or Dulles). - Create a new data frame, `flights_dc_carrier`, that contains only flights that flew to DC airports (Reagan or Dulles) and only the columns about the month and airline. - How many unique airlines were flying to DC airports in July? - Create a new variable, `speed`, in miles per hour using the `time` (minutes) and `distance` (miles) variables. - Which flight flew the fastest? - Remove rows that have `NA` for `air_time` and re-arrange the resulting data frame based on the longest air time and longest flight distance. ] --- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. Course Goal ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. Workflow & Reading In Data ## 5. Wrangling Data ## 6. .orange[Visualizing Data] --- .leftcol[ <img src="images/making_a_ggplot.jpeg" width=600> ] .rightcol[ # "Grammar of Graphics" Concept developed by Leland Wilkinson (1999) **ggplot2** package developed by Hadley Wickham (2005) ] --- # Making plot layers with ggplot2 <br> ### 1. The data ### 2. The aesthetic mapping (what goes on the axes?) ### 3. The geometries (points? bars? etc.) ### 4. The annotations / labels ### 5. 
The theme --- # Layer 1: The data ```r head(mpg) ``` ``` #> # A tibble: 6 × 11 #> manufacturer model displ year cyl trans drv cty hwy fl class #> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> #> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact #> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact #> 3 audi a4 2 2008 4 manual(m6) f 20 31 p compact #> 4 audi a4 2 2008 4 auto(av) f 21 30 p compact #> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact #> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compact ``` --- # Layer 1: The data The `ggplot()` function initializes the plot with whatever data you're using .leftcol[ ```r mpg %>% ggplot() ``` ] .rightcol[.blackborder[ <img src="figs/unnamed-chunk-43-1.png" width="504" /> ]] --- # Layer 2: The aesthetic mapping The `aes()` function determines which variables will be _mapped_ to the geometries<br>(e.g. the axes) .leftcol[ ```r mpg %>% * ggplot(aes(x = displ, y = hwy)) ``` ] .rightcol[.blackborder[ <img src="figs/unnamed-chunk-44-1.png" width="504" /> ]] --- # Layer 3: The geometries Use `+` to add geometries, e.g. 
`geom_point()` for points

.leftcol[

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
* geom_point()
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-45-1.png" width="504" />

]]

---

# Layer 4: The annotations / labels

Use `labs()` to modify most labels

.leftcol[

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* labs(
*   x = "Engine displacement (liters)",
*   y = "Highway fuel economy (mpg)",
*   title = "Most larger engine vehicles are less fuel efficient"
* )
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-46-1.png" width="504" />

]]

---

# Layer 5: The theme

.leftcol[

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)",
    title = "Most larger engine vehicles are less fuel efficient"
  ) +
* theme_bw()
```

]

.rightcol[.blackborder[

<img src="figs/unnamed-chunk-47-1.png" width="504" />

]]

---

### Common themes

.leftcol[

`theme_bw()`

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_bw()
```

<img src="figs/unnamed-chunk-48-1.png" width="432" />

]

.rightcol[

`theme_minimal()`

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_minimal()
```

<img src="figs/unnamed-chunk-49-1.png" width="432" />

]

---

### Common themes

.leftcol[

`theme_classic()`

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_classic()
```

<img src="figs/unnamed-chunk-50-1.png" width="432" />

]

.rightcol[

`theme_void()`

```r
mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
* theme_void()
```

<img src="figs/unnamed-chunk-51-1.png" width="432" />

]

---

class: middle, inverse

.leftcol[

<img src="figs/unnamed-chunk-52-1.png" width="522.144" />
<img src="figs/unnamed-chunk-53-1.png" width="522.144" />

]

.rightcol[
## Your turn Open `practice.Rmd` Use the `mpg` data frame and ggplot to create these charts <img src="figs/unnamed-chunk-55-1.png" width="522.144" /> ] --- class: inverse # Extra practice .leftcol[ <img src="figs/ggbar_p1-1.png" width="504" /> ] .rightcol[ <img src="figs/unnamed-chunk-56-1.png" width="432" /> ]
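
---

# Recap: wrangle, then plot

A minimal sketch that chains the `dplyr` verbs and the five ggplot2 layers from today, using the `mpg` data frame that ships with ggplot2. The object names `compact_cars` and `compact_plot`, the choice of the "compact" class, and the plot title are illustrative assumptions, not part of `practice.Rmd`.

```r
library(dplyr)
library(ggplot2)

# Wrangle: keep compact cars, keep four columns, sort by highway mpg (high to low)
compact_cars <- mpg %>%
  filter(class == "compact") %>%
  select(manufacturer, model, displ, hwy) %>%
  arrange(desc(hwy))

# Plot: data -> aesthetic mapping -> geometry -> labels -> theme
compact_plot <- compact_cars %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)",
    title = "Compact cars only"
  ) +
  theme_bw()

compact_plot
```

Saving the wrangled data frame before plotting keeps each step easy to inspect with `head()` or `glimpse()`.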