class: middle, inverse .leftcol30[ <center> <img src="https://raw.githubusercontent.com/emse-eda-gwu/2021-Spring/master/images/eda_hex_sticker.png" width=250> </center> ] .rightcol70[ # Week 1: .fancy[Getting Started] ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> EMSE 4575: Exploratory Data Analysis ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"/></svg> John Paul Helveston ### <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 
16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> January 12, 2021 ] --- class: center background-image: url("images/planning.jpg") background-size: contain --- class: center # It's nice to see your faces 😄 .leftcol[ <center> <img src="images/zoom_video.jpg" width="500px"> </center> ] .rightcol[.left[ If you're okay with it, please turn on your camera - it creates a more engaging discussion environment and an opportunity for us to get to know each other better. Fun Zoom backgrounds encouraged 😄 (Your privacy is important, and I understand if you wish to keep cameras off. No pressure.) ]] --- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. Course Goal ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. Workflow & Reading In Data ## 5. Data Provenance ## 6. Tidy Data --- class: inverse, middle # Week 1: .fancy[Getting Started] ## 1. .orange[Course Goal] ## 2. Course Introduction ## 3. Break: Install Stuff ## 4. Workflow & Reading In Data ## 5. Data Provenance ## 6. Tidy Data --- ## Course 1: [Intro to Programming for Analytics](https://p4a.seas.gwu.edu/2020-Fall/) **"Computational Literacy"** - Programming: Conditionals (if/else), loops, functions, testing, data types. - Analytics: Data structures, import / export, basic data manipulation & visualization. -- ## Course 2: [Exploratory Data Analysis](https://emse-eda-gwu.github.io/2021-Spring/) **"Data Literacy"** - Strategies for conducting an exploratory data analysis. - Design principles for visualizing and communicating _information_ extracted from data. - Reproducibility: Reports that contain code, equations, visualizations, and narrative text. 
---

class: center, inverse, middle

# **Class goal**: translate _data_ into _information_

---

class: center

# **Class goal**: translate _data_ into _information_

--

.leftcol[

**Data**

Average student engagement scores

Class       | Type        | City | County
------------|-------------|------|-------
Special Ed. | Charter     | 643  | 793
Special Ed. | Public      | 735  | 928
General Ed. | Charter     | 590  | 724
General Ed. | Public      | 863  | 662

]

--

.rightcol[

**Information**

<img src="figs/student-engagement-final-1.png" width="432" />

]

---

# Data exploration: an iterative process

--

.leftcol[

Encode data:

.code60[

```r
engagement_data <- data.frame(
    City   = c(643, 735, 590, 863),
    County = c(793, 928, 724, 662),
    School = c('Special Ed., Charter', 'Special Ed., Public',
               'General Ed., Charter', 'General Ed., Public'))
engagement_data
```

```
#>   City County               School
#> 1  643    793 Special Ed., Charter
#> 2  735    928  Special Ed., Public
#> 3  590    724 General Ed., Charter
#> 4  863    662  General Ed., Public
```

]]

--

.rightcol[

Re-format data for plotting:

.code60[

```r
engagement_data <- engagement_data %>%
    gather(Location, Engagement, City:County) %>%
    mutate(Location = fct_relevel(
        Location, c('City', 'County')))
engagement_data
```

```
#>                 School Location Engagement
#> 1 Special Ed., Charter     City        643
#> 2  Special Ed., Public     City        735
#> 3 General Ed., Charter     City        590
#> 4  General Ed., Public     City        863
#> 5 Special Ed., Charter   County        793
#> 6  Special Ed., Public   County        928
#> 7 General Ed., Charter   County        724
#> 8  General Ed., Public   County        662
```

]]

---

# Data exploration: an iterative process

.leftcol[

Initial exploratory plotting:

.code60[

```r
engagement_data %>%
    ggplot() +
    geom_col(aes(x = Engagement, y = School, fill = Location),
             position = 'dodge')
```

<img src="figs/student-engagement-bars1-1.png" width="432" />

]]

--

.rightcol[

More exploratory plotting:<br>highlight difference

<img src="figs/student-engagement-bars2-1.png" width="432" />

]

---

# Data exploration: an iterative process
.leftcol[

Directly label figure:

<img src="figs/student-engagement-bars3-1.png" width="432" />

]

--

.rightcol[

Remove unnecessary axes, change colors, fix labels:

<img src="figs/unnamed-chunk-5-1.png" width="432" />

]

---

**A fully reproducible analysis**

.panelset[
.panel[.panel-name[Code]

.code40[.leftcol[

```r
library(tidyverse) # %>%, gather(), mutate(), fct_relevel(), ggplot()
library(ggrepel)   # geom_text_repel()
library(cowplot)   # theme_cowplot(), background_grid()

data <- data.frame(
    City      = c(643, 735, 590, 863),
    County    = c(793, 928, 724, 662),
    School    = c('Special Ed., Charter', 'Special Ed., Public',
                  'General Ed., Charter', 'General Ed., Public'),
    Highlight = c(0, 0, 0, 1)) %>%
    # Reshape to one row per school and location
    gather(Location, Engagement, City:County) %>%
    mutate(
        Location  = fct_relevel(Location, c('City', 'County')),
        Highlight = as.factor(Highlight),
        # Numeric x position: City = 0, County = 1
        x = ifelse(Location == 'County', 1, 0))
```

]

.rightcol[

```r
plot <- ggplot(data, aes(x = x, y = Engagement, group = School, color = Highlight)) +
    geom_point() +
    geom_line() +
    scale_color_manual(values = c('#757575', '#ed573e')) +
    labs(x = 'Location',
         y = 'Engagement',
         title = paste0('Students in public, general education classes\n',
                        'in county schools have surprisingly low engagement')) +
    scale_x_continuous(limits = c(-1.2, 1.2),
                       labels = c('City', 'County'),
                       breaks = c(0, 1)) +
    # Engagement values to the right of the County points
    geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                    data = subset(data, Location == 'County'),
                    size = 5, nudge_x = 0.1, segment.color = NA) +
    # Engagement values to the left of the City points
    geom_text_repel(aes(label = Engagement, color = as.factor(Highlight)),
                    data = subset(data, Location == 'City'),
                    size = 5, nudge_x = -0.1, segment.color = NA) +
    # School labels further left of the City points
    geom_text_repel(aes(label = School, color = as.factor(Highlight)),
                    data = subset(data, Location == 'City'),
                    size = 5, nudge_x = -0.25, hjust = 1, segment.color = NA) +
    theme_cowplot() +
    background_grid(major = 'x') +
    theme(axis.line = element_blank(),
          axis.title.x = element_blank(),
          axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')
```

]]]

.panel[.panel-name[Plot]

<img src="figs/unnamed-chunk-8-1.png" width="432" />

]]

---

class:
inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. .orange[Course Introduction]
## 3. Break: Install Stuff
## 4. Workflow & Reading In Data
## 5. Data Provenance
## 6. Tidy Data

---

# Meet your instructor!

.leftcol30[.circle[
<img src="images/helveston.jpg" width="300">
]]

.rightcol70[

### John Helveston, Ph.D.

.font80[
- 2018 - Present Assistant Professor, Engineering Management & Systems Engineering
- 2016-2018 Postdoc at [Institute for Sustainable Energy](https://www.bu.edu/ise/), Boston University
- 2016 PhD in Engineering & Public Policy at Carnegie Mellon University
- 2015 MS in Engineering & Public Policy at Carnegie Mellon University
- 2010 BS in Engineering Science & Mechanics at Virginia Tech
- Website: [www.jhelvy.com](http://www.jhelvy.com/)
]]

---

# Meet your tutors!

.leftcol30[.circle[
<img src="images/pantha.jpg" width="300">
]]

.rightcol70[

### **Saurav Pantha** (aka "The Firefighter")

- Graduate Assistant (GA)
- Master's student in EMSE

]

---

# Meet your tutors!

.leftcol30[.circle[
<img src="images/kim.png" width="300">
]]

.rightcol70[

### **Jennifer Kim** (aka "The Monitor")

- Learning Assistant (LA)
- EMSE junior & P4A alum

]

---

# Prerequisites

## [EMSE 4574: Intro to Programming for Analytics](https://p4a.seas.gwu.edu/2020-Fall/)

You should be able to:

- Use RStudio to write basic R commands.
- Distinguish between different R operators and data types, including numeric, string, and logical data.
- Use **tidyverse** functions to wrangle and manipulate data in R.
- Use the **ggplot2** library to create plots in R.
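
A quick self-check, offered only as a rough sketch: if the short example below reads naturally, you are probably comfortable with the prerequisites. It uses the `mtcars` dataset that ships with R; the object names `mpg_summary` and `prereq_plot` are illustrative and not part of any course material.

```r
library(dplyr)
library(ggplot2)

# Wrangle: average fuel economy for each cylinder count
mpg_summary <- mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))

# Plot: one bar per cylinder count
prereq_plot <- ggplot(mpg_summary) +
    geom_col(aes(x = factor(cyl), y = mean_mpg)) +
    labs(x = 'Cylinders', y = 'Mean miles per gallon')

prereq_plot
```
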
-- > [<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M496 128v16a8 8 0 0 1-8 8h-24v12c0 6.627-5.373 12-12 12H60c-6.627 0-12-5.373-12-12v-12H24a8 8 0 0 1-8-8v-16a8 8 0 0 1 4.941-7.392l232-88a7.996 7.996 0 0 1 6.118 0l232 88A8 8 0 0 1 496 128zm-24 304H40c-13.255 0-24 10.745-24 24v16a8 8 0 0 0 8 8h464a8 8 0 0 0 8-8v-16c0-13.255-10.745-24-24-24zM96 192v192H60c-6.627 0-12 5.373-12 12v20h416v-20c0-6.627-5.373-12-12-12h-36V192h-64v192h-64V192h-64v192h-64V192H96z"/></svg> Check out R for Analytics Primer](http://jhelvy.github.io/r4aPrimer/) --- # Course website ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 496 512"><path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"/></svg> Everything you need will be on the course website:<br>https://eda.seas.gwu.edu/2021-Spring/ -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 
5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> The [schedule](https://emse-eda-gwu.github.io/2021-Spring/schedule.html) is the best starting point --- # Quizzes -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M0 464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48V192H0v272zm320-196c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM192 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12h-40c-6.6 0-12-5.4-12-12v-40zM64 268c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zm0 128c0-6.6 5.4-12 12-12h40c6.6 0 12 5.4 12 12v40c0 6.6-5.4 12-12 12H76c-6.6 0-12-5.4-12-12v-40zM400 64h-48V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H160V16c0-8.8-7.2-16-16-16h-32c-8.8 0-16 7.2-16 16v48H48C21.5 64 0 85.5 0 112v48h448v-48c0-26.5-21.5-48-48-48z"/></svg> In class every other week-ish (5 total, lowest dropped) -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8zm57.1 350.1L224.9 294c-3.1-2.3-4.9-5.9-4.9-9.7V116c0-6.6 5.4-12 12-12h48c6.6 0 12 5.4 12 12v137.7l63.5 46.2c5.4 3.9 6.5 11.4 2.6 16.8l-28.2 38.8c-3.9 5.3-11.4 6.5-16.8 2.6z"/></svg> ~5 minutes -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path 
d="M139.61 35.5a12 12 0 0 0-17 0L58.93 98.81l-22.7-22.12a12 12 0 0 0-17 0L3.53 92.41a12 12 0 0 0 0 17l47.59 47.4a12.78 12.78 0 0 0 17.61 0l15.59-15.62L156.52 69a12.09 12.09 0 0 0 .09-17zm0 159.19a12 12 0 0 0-17 0l-63.68 63.72-22.7-22.1a12 12 0 0 0-17 0L3.53 252a12 12 0 0 0 0 17L51 316.5a12.77 12.77 0 0 0 17.6 0l15.7-15.69 72.2-72.22a12 12 0 0 0 .09-16.9zM64 368c-26.49 0-48.59 21.5-48.59 48S37.53 464 64 464a48 48 0 0 0 0-96zm432 16H208a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h288a16 16 0 0 0 16-16v-32a16 16 0 0 0-16-16zm0-320H208a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h288a16 16 0 0 0 16-16V80a16 16 0 0 0-16-16zm0 160H208a16 16 0 0 0-16 16v32a16 16 0 0 0 16 16h288a16 16 0 0 0 16-16v-32a16 16 0 0 0-16-16z"/></svg> [Example quiz](https://p4aquizdemo.formr.org/) -- > **Why quiz at all?** The "retrieval effect" - basically, you have to _practice_ remembering things, otherwise your brain won't remember them (see the book ["Make It Stick: The Science of Successful Learning"](https://www.hup.harvard.edu/catalog.php?isbn=9780674729018)) --- ## Assignments -- ## 1) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M448 360V24c0-13.3-10.7-24-24-24H96C43 0 0 43 0 96v320c0 53 43 96 96 96h328c13.3 0 24-10.7 24-24v-16c0-7.5-3.5-14.3-8.9-18.7-4.2-15.4-4.2-59.3 0-74.7 5.4-4.3 8.9-11.1 8.9-18.6zM128 134c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm0 64c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm253.4 250H96c-17.7 0-32-14.3-32-32 0-17.6 14.4-32 32-32h285.4c-1.9 17.1-1.9 46.9 0 64z"/></svg> Weekly "reflections" on [readings](https://eda.seas.gwu.edu/2021-Spring/r1-exploring-data.html) -- ## 2) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M109.46 244.04l134.58-134.56-44.12-44.12-61.68 61.68a7.919 7.919 0 0 1-11.21 0l-11.21-11.21c-3.1-3.1-3.1-8.12 0-11.21l61.68-61.68-33.64-33.65C131.47-3.1 111.39-3.1 99 9.29L9.29 99c-12.38 12.39-12.39 32.47 
0 44.86l100.17 100.18zm388.47-116.8c18.76-18.76 18.75-49.17 0-67.93l-45.25-45.25c-18.76-18.76-49.18-18.76-67.95 0l-46.02 46.01 113.2 113.2 46.02-46.03zM316.08 82.71l-297 296.96L.32 487.11c-2.53 14.49 10.09 27.11 24.59 24.56l107.45-18.84L429.28 195.9 316.08 82.71zm186.63 285.43l-33.64-33.64-61.68 61.68c-3.1 3.1-8.12 3.1-11.21 0l-11.21-11.21c-3.09-3.1-3.09-8.12 0-11.21l61.68-61.68-44.14-44.14L267.93 402.5l100.21 100.2c12.39 12.39 32.47 12.39 44.86 0l89.71-89.7c12.39-12.39 12.39-32.47 0-44.86z"/></svg> 3 Mini Projects (due 2 weeks from date assigned) -- ## 3) <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M109.46 244.04l134.58-134.56-44.12-44.12-61.68 61.68a7.919 7.919 0 0 1-11.21 0l-11.21-11.21c-3.1-3.1-3.1-8.12 0-11.21l61.68-61.68-33.64-33.65C131.47-3.1 111.39-3.1 99 9.29L9.29 99c-12.38 12.39-12.39 32.47 0 44.86l100.17 100.18zm388.47-116.8c18.76-18.76 18.75-49.17 0-67.93l-45.25-45.25c-18.76-18.76-49.18-18.76-67.95 0l-46.02 46.01 113.2 113.2 46.02-46.03zM316.08 82.71l-297 296.96L.32 487.11c-2.53 14.49 10.09 27.11 24.59 24.56l107.45-18.84L429.28 195.9 316.08 82.71zm186.63 285.43l-33.64-33.64-61.68 61.68c-3.1 3.1-8.12 3.1-11.21 0l-11.21-11.21c-3.09-3.1-3.09-8.12 0-11.21l61.68-61.68-44.14-44.14L267.93 402.5l100.21 100.2c12.39 12.39 32.47 12.39 44.86 0l89.71-89.7c12.39-12.39 12.39-32.47 0-44.86z"/></svg> [Final Project](https://emse-eda-gwu.github.io/2021-Spring/a-project.html) (Teams of 2 - 3 students) Item | Due Date ----------------|--------------- Proposal | March 12 Progress Report | April 16 Final Report | April 30 Presentation | May 03 Interview | Exam week --- background-color: #FFF # .center[Grades] <img src="figs/grade-breakdown-1.png" width="936" style="display: block; margin: auto;" /> --- # .center[Grades] Item | Weight | Notes ------------------------------|--------|------------------------------------- Reflections | 6 % | Weekly assignment (12 x 0.5%) Quizzes | 12 % | 5 quizzes, lowest dropped Mini Project 1 | 9 % | 
Individual projects Mini Project 2 | 9 % | Mini Project 3 | 9 % | Final Project Proposal | 10 % | Teams of 2-3 students Final Project Progress Report | 10 % | Final Project Report | 15 % | Final Project Presentation | 10 % | Final Interview | 10 % | Individual interview about your project --- # Course policies -- .leftcol35[ - ## BE NICE - ## BE HONEST - ## DON'T CHEAT ] -- .rightcol65[ ## Copying is good, stealing is bad > "Plagiarism is trying to pass someone else’s work off as your own. Copying is about reverse-engineering." > > .right[-- Austin Kleon, from [Steal Like An Artist](https://austinkleon.com/steal/) ] ] --- # Late submissions ## - **5** late days - use them anytime, no questions asked ## - No more than **2** late days on any one assignment ## - Contact me for special cases --- # How to succeed in this class -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M96 224c35.3 0 64-28.7 64-64s-28.7-64-64-64-64 28.7-64 64 28.7 64 64 64zm448 0c35.3 0 64-28.7 64-64s-28.7-64-64-64-64 28.7-64 64 28.7 64 64 64zm32 32h-64c-17.6 0-33.5 7.1-45.1 18.6 40.3 22.1 68.9 62 75.1 109.4h66c17.7 0 32-14.3 32-32v-32c0-35.3-28.7-64-64-64zm-256 0c61.9 0 112-50.1 112-112S381.9 32 320 32 208 82.1 208 144s50.1 112 112 112zm76.8 32h-8.3c-20.8 10-43.9 16-68.5 16s-47.6-6-68.5-16h-8.3C179.6 288 128 339.6 128 403.2V432c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48v-28.8c0-63.6-51.6-115.2-115.2-115.2zm-223.7-13.4C161.5 263.1 145.6 256 128 256H64c-35.3 0-64 28.7-64 64v32c0 17.7 14.3 32 32 32h65.9c6.3-47.4 34.9-87.3 75.2-109.4z"/></svg> Participate during class! 
-- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M109.46 244.04l134.58-134.56-44.12-44.12-61.68 61.68a7.919 7.919 0 0 1-11.21 0l-11.21-11.21c-3.1-3.1-3.1-8.12 0-11.21l61.68-61.68-33.64-33.65C131.47-3.1 111.39-3.1 99 9.29L9.29 99c-12.38 12.39-12.39 32.47 0 44.86l100.17 100.18zm388.47-116.8c18.76-18.76 18.75-49.17 0-67.93l-45.25-45.25c-18.76-18.76-49.18-18.76-67.95 0l-46.02 46.01 113.2 113.2 46.02-46.03zM316.08 82.71l-297 296.96L.32 487.11c-2.53 14.49 10.09 27.11 24.59 24.56l107.45-18.84L429.28 195.9 316.08 82.71zm186.63 285.43l-33.64-33.64-61.68 61.68c-3.1 3.1-8.12 3.1-11.21 0l-11.21-11.21c-3.09-3.1-3.09-8.12 0-11.21l61.68-61.68-44.14-44.14L267.93 402.5l100.21 100.2c12.39 12.39 32.47 12.39 44.86 0l89.71-89.7c12.39-12.39 12.39-32.47 0-44.86z"/></svg> Start assignments early and **read carefully**! -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M448 360V24c0-13.3-10.7-24-24-24H96C43 0 0 43 0 96v320c0 53 43 96 96 96h328c13.3 0 24-10.7 24-24v-16c0-7.5-3.5-14.3-8.9-18.7-4.2-15.4-4.2-59.3 0-74.7 5.4-4.3 8.9-11.1 8.9-18.6zM128 134c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm0 64c0-3.3 2.7-6 6-6h212c3.3 0 6 2.7 6 6v20c0 3.3-2.7 6-6 6H134c-3.3 0-6-2.7-6-6v-20zm253.4 250H96c-17.7 0-32-14.3-32-32 0-17.6 14.4-32 32-32h285.4c-1.9 17.1-1.9 46.9 0 64z"/></svg> Actually read (before class)! -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M176 256c44.11 0 80-35.89 80-80s-35.89-80-80-80-80 35.89-80 80 35.89 80 80 80zm352-128H304c-8.84 0-16 7.16-16 16v144H64V80c0-8.84-7.16-16-16-16H16C7.16 64 0 71.16 0 80v352c0 8.84 7.16 16 16 16h32c8.84 0 16-7.16 16-16v-48h512v48c0 8.84 7.16 16 16 16h32c8.84 0 16-7.16 16-16V240c0-61.86-50.14-112-112-112z"/></svg> Get sleep and take breaks often! 
-- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M128 96c26.5 0 48-21.5 48-48S154.5 0 128 0 80 21.5 80 48s21.5 48 48 48zm384 0c26.5 0 48-21.5 48-48S538.5 0 512 0s-48 21.5-48 48 21.5 48 48 48zm125.7 372.1l-44-110-41.1 46.4-2 18.2 27.7 69.2c5 12.5 17 20.1 29.7 20.1 4 0 8-.7 11.9-2.3 16.4-6.6 24.4-25.2 17.8-41.6zm-34.2-209.8L585 178.1c-4.6-20-18.6-36.8-37.5-44.9-18.5-8-39-6.7-56.1 3.3-22.7 13.4-39.7 34.5-48.1 59.4L432 229.8 416 240v-96c0-8.8-7.2-16-16-16H240c-8.8 0-16 7.2-16 16v96l-16.1-10.2-11.3-33.9c-8.3-25-25.4-46-48.1-59.4-17.2-10-37.6-11.3-56.1-3.3-18.9 8.1-32.9 24.9-37.5 44.9l-18.4 80.2c-4.6 20 .7 41.2 14.4 56.7l67.2 75.9 10.1 92.6C130 499.8 143.8 512 160 512c1.2 0 2.3-.1 3.5-.2 17.6-1.9 30.2-17.7 28.3-35.3l-10.1-92.8c-1.5-13-6.9-25.1-15.6-35l-43.3-49 17.6-70.3 6.8 20.4c4.1 12.5 11.9 23.4 24.5 32.6l51.1 32.5c4.6 2.9 12.1 4.6 17.2 5h160c5.1-.4 12.6-2.1 17.2-5l51.1-32.5c12.6-9.2 20.4-20 24.5-32.6l6.8-20.4 17.6 70.3-43.3 49c-8.7 9.9-14.1 22-15.6 35l-10.1 92.8c-1.9 17.6 10.8 33.4 28.3 35.3 1.2.1 2.3.2 3.5.2 16.1 0 30-12.1 31.8-28.5l10.1-92.6 67.2-75.9c13.6-15.5 19-36.7 14.4-56.7zM46.3 358.1l-44 110c-6.6 16.4 1.4 35 17.8 41.6 16.8 6.6 35.1-1.7 41.6-17.8l27.7-69.2-2-18.2-41.1-46.4z"/></svg> Ask for help! 
--- # [Getting Help](https://p4a.seas.gwu.edu/2020-Fall/ref-getting-help.html) -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M94.12 315.1c0 25.9-21.16 47.06-47.06 47.06S0 341 0 315.1c0-25.9 21.16-47.06 47.06-47.06h47.06v47.06zm23.72 0c0-25.9 21.16-47.06 47.06-47.06s47.06 21.16 47.06 47.06v117.84c0 25.9-21.16 47.06-47.06 47.06s-47.06-21.16-47.06-47.06V315.1zm47.06-188.98c-25.9 0-47.06-21.16-47.06-47.06S139 32 164.9 32s47.06 21.16 47.06 47.06v47.06H164.9zm0 23.72c25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06H47.06C21.16 243.96 0 222.8 0 196.9s21.16-47.06 47.06-47.06H164.9zm188.98 47.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06h-47.06V196.9zm-23.72 0c0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06V79.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06V196.9zM283.1 385.88c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06v-47.06h47.06zm0-23.72c-25.9 0-47.06-21.16-47.06-47.06 0-25.9 21.16-47.06 47.06-47.06h117.84c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06H283.1z"/></svg> Use [Slack](https://emse-eda-s21.slack.com/) to ask questions. 
-- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M208 352c-2.39 0-4.78.35-7.06 1.09C187.98 357.3 174.35 360 160 360c-14.35 0-27.98-2.7-40.95-6.91-2.28-.74-4.66-1.09-7.05-1.09C49.94 352-.33 402.48 0 464.62.14 490.88 21.73 512 48 512h224c26.27 0 47.86-21.12 48-47.38.33-62.14-49.94-112.62-112-112.62zm-48-32c53.02 0 96-42.98 96-96s-42.98-96-96-96-96 42.98-96 96 42.98 96 96 96zM592 0H208c-26.47 0-48 22.25-48 49.59V96c23.42 0 45.1 6.78 64 17.8V64h352v288h-64v-64H384v64h-76.24c19.1 16.69 33.12 38.73 39.69 64H592c26.47 0 48-22.25 48-49.59V49.59C640 22.25 618.47 0 592 0z"/></svg> Meet with your tutors -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M496 224c-79.6 0-144 64.4-144 144s64.4 144 144 144 144-64.4 144-144-64.4-144-144-144zm64 150.3c0 5.3-4.4 9.7-9.7 9.7h-60.6c-5.3 0-9.7-4.4-9.7-9.7v-76.6c0-5.3 4.4-9.7 9.7-9.7h12.6c5.3 0 9.7 4.4 9.7 9.7V352h38.3c5.3 0 9.7 4.4 9.7 9.7v12.6zM320 368c0-27.8 6.7-54.1 18.2-77.5-8-1.5-16.2-2.5-24.6-2.5h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h347.1c-45.3-31.9-75.1-84.5-75.1-144zm-96-112c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128z"/></svg> [Schedule a meeting](https://jhelvy.appointlet.com/b/professor-helveston) w/Prof. 
Helveston: - Mondays from 8:00-5:00pm - Wednesdays from 3:20-5:00pm - Thursdays from 12:00-5:00pm -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M278.9 511.5l-61-17.7c-6.4-1.8-10-8.5-8.2-14.9L346.2 8.7c1.8-6.4 8.5-10 14.9-8.2l61 17.7c6.4 1.8 10 8.5 8.2 14.9L293.8 503.3c-1.9 6.4-8.5 10.1-14.9 8.2zm-114-112.2l43.5-46.4c4.6-4.9 4.3-12.7-.8-17.2L117 256l90.6-79.7c5.1-4.5 5.5-12.3.8-17.2l-43.5-46.4c-4.5-4.8-12.1-5.1-17-.5L3.8 247.2c-5.1 4.7-5.1 12.8 0 17.5l144.1 135.1c4.9 4.6 12.5 4.4 17-.5zm327.2.6l144.1-135.1c5.1-4.7 5.1-12.8 0-17.5L492.1 112.1c-4.8-4.5-12.4-4.3-17 .5L431.6 159c-4.6 4.9-4.3 12.7.8 17.2L523 256l-90.6 79.7c-5.1 4.5-5.5 12.3-.8 17.2l43.5 46.4c4.5 4.9 12.1 5.1 17 .6z"/></svg> [GW Coders](http://gwcoders.github.io/) --- # <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 576 512"><path d="M576 304v96c0 26.51-21.49 48-48 48H48c-26.51 0-48-21.49-48-48v-96c0-26.51 21.49-48 48-48h480c26.51 0 48 21.49 48 48zm-48-80a79.557 79.557 0 0 1 30.777 6.165L462.25 85.374A48.003 48.003 0 0 0 422.311 64H153.689a48 48 0 0 0-39.938 21.374L17.223 230.165A79.557 79.557 0 0 1 48 224h480zm-48 96c-17.673 0-32 14.327-32 32s14.327 32 32 32 32-14.327 32-32-14.327-32-32-32zm-96 0c-17.673 0-32 14.327-32 32s14.327 32 32 32 32-14.327 32-32-14.327-32-32-32z"/></svg> [Course Software](https://eda.seas.gwu.edu/2021-Spring/ref-course-software.html) -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M94.12 315.1c0 25.9-21.16 47.06-47.06 47.06S0 341 0 315.1c0-25.9 21.16-47.06 47.06-47.06h47.06v47.06zm23.72 0c0-25.9 21.16-47.06 47.06-47.06s47.06 21.16 47.06 47.06v117.84c0 25.9-21.16 47.06-47.06 47.06s-47.06-21.16-47.06-47.06V315.1zm47.06-188.98c-25.9 0-47.06-21.16-47.06-47.06S139 32 164.9 32s47.06 21.16 47.06 47.06v47.06H164.9zm0 23.72c25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06H47.06C21.16 243.96 0 222.8 0 196.9s21.16-47.06 47.06-47.06H164.9zm188.98 47.06c0-25.9 21.16-47.06 
47.06-47.06 25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06h-47.06V196.9zm-23.72 0c0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06V79.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06V196.9zM283.1 385.88c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06v-47.06h47.06zm0-23.72c-25.9 0-47.06-21.16-47.06-47.06 0-25.9 21.16-47.06 47.06-47.06h117.84c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06H283.1z"/></svg> [Slack](https://emse-eda-s21.slack.com/): See bb for link to join;<br>install on phone and **turn notifications on**! -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 581 512"><path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"/></svg> [R](https://cloud.r-project.org/) & [RStudio](https://rstudio.com/products/rstudio/download/) (Install both) -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 640 512"><path d="M634.91 154.88C457.74-8.99 182.19-8.93 5.09 154.88c-6.66 6.16-6.79 16.59-.35 22.98l34.24 33.97c6.14 6.1 16.02 6.23 22.4.38 145.92-133.68 371.3-133.71 517.25 0 6.38 5.85 16.26 5.71 22.4-.38l34.24-33.97c6.43-6.39 6.3-16.82-.36-22.98zM320 352c-35.35 0-64 28.65-64 64s28.65 64 64 64 64-28.65 64-64-28.65-64-64-64zm202.67-83.59c-115.26-101.93-290.21-101.82-405.34 0-6.9 6.1-7.12 16.69-.57 23.15l34.44 33.99c6 5.92 15.66 6.32 22.05.8 83.95-72.57 209.74-72.41 293.49 0 6.39 5.52 16.05 5.13 22.05-.8l34.44-33.99c6.56-6.46 
6.33-17.06-.56-23.15z"/></svg> Install [Cisco AnyConnect VPN Client](https://seascf.seas.gwu.edu/vpn-access) to use RStudio in the cloud: https://rstudio.seas.gwu.edu/ -- ## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 448 512"><path d="M448 73.143v45.714C448 159.143 347.667 192 224 192S0 159.143 0 118.857V73.143C0 32.857 100.333 0 224 0s224 32.857 224 73.143zM448 176v102.857C448 319.143 347.667 352 224 352S0 319.143 0 278.857V176c48.125 33.143 136.208 48.572 224 48.572S399.874 209.143 448 176zm0 160v102.857C448 479.143 347.667 512 224 512S0 479.143 0 438.857V336c48.125 33.143 136.208 48.572 224 48.572S399.874 369.143 448 336z"/></svg> [DataCamp](https://www.datacamp.com/): sign up with your **@gwu.edu** email --- class: inverse, center <br> # .fancy[Break] # Install Stuff
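---

# Checking your installation

Once R and RStudio are installed, a short check like the one below can confirm that the packages used later in these slides are available. This is a minimal sketch: the package list is an assumption based on the packages this deck loads (**tidyverse**, **here**, **readxl**), and the variable names are illustrative.

```r
# Packages loaded later in this deck (assumption: your setup may need more)
required_pkgs <- c("tidyverse", "here", "readxl")

# installed.packages() returns a matrix with one row per installed package
is_installed <- required_pkgs %in% rownames(installed.packages())
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) == 0) {
    message("All set: every required package is installed.")
} else {
    message("Still to install: ", paste(missing_pkgs, collapse = ", "))
}
```

If anything is listed as missing, running `install.packages(missing_pkgs)` in the RStudio console should install it.
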
---

class: inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. Course Introduction
## 3. Break: Install Stuff
## 4. .orange[Workflow & Reading In Data]
## 5. Data Provenance
## 6. Tidy Data

---

## Workflow for reading in data

1) Use R Projects (.Rproj files) to organize your analysis - **don't double-click .R files**!

<img src = "images/rproj.png" width = "75">

--

2) Use the `here` package to create file paths

```r
path <- here::here("folder", "file.csv")
```

--

3) Import data with these functions:

File type  | Function       | Library
-----------|----------------|----------
`.csv`     | `read_csv()`   | **readr**
`.txt`     | `read.table()` | **utils**
`.xlsx`    | `read_excel()` | **readxl**

---

# Importing Comma Separated Values (.csv)

Read in `.csv` files with `read_csv()`:

```r
library(tidyverse)
library(here)

csvPath <- here('data', 'milk_production.csv')
*milk_production <- read_csv(csvPath)

head(milk_production)
```

```
#> # A tibble: 6 x 4
#>   region    state          year milk_produced
#>   <chr>     <chr>         <dbl>         <dbl>
#> 1 Northeast Maine          1970     619000000
#> 2 Northeast New Hampshire  1970     356000000
#> 3 Northeast Vermont        1970    1970000000
#> 4 Northeast Massachusetts  1970     658000000
#> 5 Northeast Rhode Island   1970      75000000
#> 6 Northeast Connecticut    1970     661000000
```

---

# Importing Text Files (.txt)

Read in `.txt` files with `read.table()`:

```r
txtPath <- here('data', 'nasa_global_temps.txt')
*global_temps <- read.table(txtPath, skip = 5, header = FALSE)

head(global_temps)
```

```
#>     V1    V2    V3
#> 1 1880 -0.18 -0.11
#> 2 1881 -0.10 -0.14
#> 3 1882 -0.11 -0.17
#> 4 1883 -0.19 -0.21
#> 5 1884 -0.28 -0.24
#> 6 1885 -0.31 -0.26
```

---

# Importing Text Files (.txt)

Read in `.txt` files with `read.table()`:

```r
txtPath <- here('data', 'nasa_global_temps.txt')
global_temps <- read.table(txtPath, skip = 5, header = FALSE)
*names(global_temps) <- c('year', 'no_smoothing', 'loess') # Add header

head(global_temps)
```

```
#>   year no_smoothing loess
#> 1 1880        -0.18 -0.11
#> 2 1881
-0.10 -0.14 #> 3 1882 -0.11 -0.17 #> 4 1883 -0.19 -0.21 #> 5 1884 -0.28 -0.24 #> 6 1885 -0.31 -0.26 ``` --- # Importing Excel Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ```r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') *pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) ``` .code70[ ```r glimpse(pv_cells) ``` ``` #> Rows: 25 #> Columns: 10 #> $ Year <chr> NA, NA, "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", NA, "Note: NA = data not available.", NA, "Source: Compiled by … #> $ China <chr> "Megawatts", NA, "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "2413… #> $ Taiwan <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.56500000000… #> $ Japan <dbl> NA, NA, 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0, NA, NA, NA, NA #> $ Malaysia <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "0", "0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59", NA, NA, NA, NA #> $ Germany <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054",… #> $ `South Korea` <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", 
"1127.0999999999999", NA, NA, NA, NA #> $ `United States` <dbl> NA, NA, 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250, NA, NA, NA, NA #> $ Others <chr> NA, NA, "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "6… #> $ World <dbl> NA, NA, 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496, NA, NA, NA, NA ``` ] --- # Importing Excel Files (.xlsx) Read in `.xlsx` files with `read_excel()`: ```r library(readxl) xlsxPath <- here('data', 'pv_cell_production.xlsx') pv_cells <- read_excel(xlsxPath, sheet = 'Cell Prod by Country', skip = 2) %>% * mutate(Year = as.numeric(Year)) %>% # Convert "non-years" to NA * filter(!is.na(Year)) # Drop NA rows in Year ``` .code60[ ```r glimpse(pv_cells) ``` ``` #> Rows: 19 #> Columns: 10 #> $ Year <dbl> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 #> $ China <chr> "NA", "NA", "NA", "NA", "NA", "2.5", "3", "10", "13", "40", "128.30000000000001", "341.8", "1192.8735755126208", "2535.9804999999997", "5193.2335000000003", "12882.114299891044", "24338.646000000004", "24139.014999999999", … #> $ Taiwan <chr> "NA", "NA", "NA", "NA", "NA", "NA", "3.5", "8", "17", "39.299999999999997", "88", "169.5", "413.19362206495737", "871.4", "1573.2", "3755.9046488657718", "4773.1499999999996", "5270.1999999999989", "6338.5650000000005" #> $ Japan <dbl> 16.4, 21.2, 35.0, 49.0, 80.0, 128.6, 171.2, 251.1, 363.9, 601.5, 833.0, 926.4, 937.5, 1268.0, 1503.0, 2169.0, 2707.0, 2641.8, 3679.0 #> $ Malaysia <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "0", 
"0", "100.1", "397.9", "1228.0566037735848", "1919.0129442119946", "2684.5953947368421", "2597.365436241611", "3072.59" #> $ Germany <chr> "NA", "NA", "NA", "NA", "NA", "22.5", "23.5", "55", "121.5", "193", "339", "469.1", "815.35421116529074", "1476.6923205919056", "1606.0497978436656", "2181.2726133183096", "2152.8626315789475", "1406.7827181208054", "1054.8… #> $ `South Korea` <chr> "NA", "NA", "NA", "NA", "NA", "NA", "0", "0", "0", "0", "5.3", "13", "31.883935905674612", "70.848164851527258", "234", "886.29518449560589", "1227.3", "1107.0999999999999", "1127.0999999999999" #> $ `United States` <dbl> 34.7500, 38.8500, 51.0000, 53.7000, 60.8000, 75.0000, 100.3000, 120.6000, 103.0000, 138.7000, 153.1000, 177.6000, 261.9804, 403.1250, 594.7922, 1162.5177, 1044.1895, 886.4018, 868.4250 #> $ Others <chr> "NA", "NA", "NA", "NA", "NA", "48.200000000000017", "69.800000000000011", "97.299999999999955", "131", "186.29999999999995", "235.70000000000027", "361.09999999999991", "410.97322650945807", "709.03112641453299", "663.66000… #> $ World <dbl> 77.600, 88.600, 125.800, 154.900, 201.300, 276.800, 371.300, 542.000, 749.400, 1198.800, 1782.400, 2458.500, 4163.859, 7732.977, 12595.992, 26399.539, 40761.761, 39523.565, 44464.496 ``` ] --- class: inverse
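The `glimpse()` output above still lists several country columns as `<chr>`, because the sheet stores missing values as the text "NA". Here is a minimal sketch of one way to finish the cleanup, assuming every text column should end up numeric. The small `pv_demo` tibble is a hypothetical stand-in built from a few of the values printed above, not the real file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for a few rows and columns of `pv_cells`
# (values copied from the glimpse() output; this is not the full data)
pv_demo <- tribble(
  ~Year, ~China, ~Japan, ~Germany,
  1999,  "NA",   80.0,   "NA",
  2000,  "2.5",  128.6,  "22.5",
  2001,  "3",    171.2,  "23.5"
)

# Convert every character column to numeric;
# the text "NA" becomes a real missing value (NA)
pv_demo_numeric <- pv_demo %>%
  mutate(across(where(is.character), as.numeric))

glimpse(pv_demo_numeric)
```

With the real file, `read_excel()` also has an `na` argument (for example `na = "NA"`) that treats such strings as missing at import time. That may not be enough on its own here, since the units row ("Megawatts") is also text.

---
class: inverse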
# Your turn

Download [today's class notes](https://eda.seas.gwu.edu/2021-Spring/class/1-getting-started/1-getting-started.zip)

Write code to import the following data files from the "data" folder:

- `lotr_words.csv`
- `north_america_bear_killings.txt`
- `uspto_clean_energy_patents.xlsx`

---
class: inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. Course Introduction
## 3. Break: Install Stuff
## 4. Workflow & Reading In Data
## 5. .orange[Data Provenance]
## 6. Tidy Data

---

### Data provenance - It matters where you get your data

--

**Validity**:

- Is this data trustworthy? Is it authentic?
- Where did the data come from?
- How has the data been changed / managed over time?
- Is the data complete?

--

**Comprehension**:

- Is this data accurate?
- Can you explain your results?
- Is this the _right_ data to answer your question?

--

**Reproducibility**: The data source is the start of the reproducibility chain.

---

## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M505 442.7L405.3 343c-4.5-4.5-10.6-7-17-7H372c27.6-35.3 44-79.7 44-128C416 93.1 322.9 0 208 0S0 93.1 0 208s93.1 208 208 208c48.3 0 92.7-16.4 128-44v16.3c0 6.4 2.5 12.5 7 17l99.7 99.7c9.4 9.4 24.6 9.4 33.9 0l28.3-28.3c9.4-9.4 9.4-24.6.1-34zM208 336c-70.7 0-128-57.2-128-128 0-70.7 57.2-128 128-128 70.7 0 128 57.2 128 128 0 70.7-57.2 128-128 128z"/></svg> **Document your source like a museum curator**

**Example**: View the `README.md` file in the `data` folder

--

Whenever you download data, you should **at a minimum** record the following:

- The name of the file you are describing.
- The date you downloaded it.
- The original name of the downloaded file (in case you renamed it).
- The URL of the site you downloaded it from.
- The source of the _original_ data (sometimes different from the site you downloaded it from).
- A short description of the data, maybe how they were collected (if available).
- A dictionary for the data (e.g. a simple markdown table describing each variable).

---
class: inverse
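The data dictionary mentioned in the checklist above can be started from the data itself. This is a minimal base-R sketch, not a required workflow: `milk_demo` is a hypothetical stand-in for the first rows of `milk_production` shown earlier, and the description text is a placeholder to fill in by hand.

```r
# Hypothetical stand-in for the first rows of `milk_production`
milk_demo <- data.frame(
  region = c("Northeast", "Northeast"),
  state = c("Maine", "New Hampshire"),
  year = c(1970, 1970),
  milk_produced = c(619000000, 356000000),
  stringsAsFactors = FALSE
)

# One row per variable: its name, its storage type, and a placeholder description
dictionary <- data.frame(
  variable = names(milk_demo),
  type = unname(vapply(milk_demo, function(column) class(column)[1], character(1))),
  description = "TODO: describe this variable",
  stringsAsFactors = FALSE
)

# Build a markdown table that can be pasted into data/README.md
readme_lines <- c(
  "| Variable | Type | Description |",
  "|----------|------|-------------|",
  sprintf("| `%s` | %s | %s |",
          dictionary$variable, dictionary$type, dictionary$description)
)
writeLines(readme_lines)
```

The variable names and types come from the data frame, so only the descriptions need to be written manually after checking the original source.

---
class: inverse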
# Your turn

Documentation in the "data/README.md" file is missing for the following data sets:

- wildlife_impacts.csv: [source](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-07-23) (Breakout Rooms 1 & 2)
- north_america_bear_killings.txt: [source](https://data.world/makeovermonday/2019w21) (Breakout Rooms 3 & 4)
- uspto_clean_energy_patents.xlsx: [source](https://www.nsf.gov/statistics/2018/nsb20181/report/sections/industry-technology-and-the-global-marketplace/global-trends-in-sustainable-energy-research-and-technologies) (Breakout Rooms 5 & 6)

Go to the above sites and add the following information to the "data/README.md" file:

- The name of the downloaded file.
- The web address of the site you downloaded the data from.
- The source of the _original_ data (if different from the website).
- A short description of the data and how they were collected.
- A dictionary for the data (hint: the site might already have this!).

---
class: inverse, middle

# Week 1: .fancy[Getting Started]

## 1. Course Goal
## 2. Course Introduction
## 3. Break: Install Stuff
## 4. Workflow & Reading In Data
## 5. Data Provenance
## 6. .orange[Tidy Data]

---

# Variables, values, and observations

- **Variable**: Something you can measure
- **Value**: The measurement of a variable
- **Observation**: A set of associated measurements across different variables

--

.code100[
```r
head(fed_spend_long)
```

```
#> # A tibble: 6 x 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```
]

---

# Tidy data

Tidy data follows three rules:

- Each **variable** has its own **column**
- Each **observation** has its own **row**
- Each **value** has its own **cell**

<center>
<img src="images/tidy-data.png" width = "850">
</center>

---

## Tidy data

.leftcol[
```
#> # A tibble: 6 x 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```
]

.rightcol[
]

<center>
<img src="images/tidy-data.png" width = "850">
</center>

---

.leftcol40[.code70[

# Tidy ("long")

```r
head(fed_spend_long)
```

```
#> # A tibble: 6 x 3
#>   department  year rd_budget_mil
#>   <chr>      <dbl>         <dbl>
#> 1 DOD         1976         35696
#> 2 NASA        1976         12513
#> 3 DOE         1976         10882
#> 4 HHS         1976          9226
#> 5 NIH         1976          8025
#> 6 NSF         1976          2372
```
]]

.rightcol60[.code70[

# Untidy ("wide")

```r
head(fed_spend_wide)
```

```
#> # A tibble: 6 x 15
#>    year   DHS   DOC   DOD   DOE   DOT   EPA   HHS Interior  NASA   NIH   NSF Other  USDA    VA
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  1976     0   819 35696 10882  1142   968  9226     1152 12513  8025  2372  1191  1837   404
#> 2  1977     0   837 37967 13741  1095   966  9507     1082 12553  8214  2395  1280  1796   374
#> 3  1978     0   871 37022 15663  1156  1175 10533     1125 12516  8802  2446  1237  1962   356
#> 4  1979     0   952 37174 15612  1004  1102 10127     1176 13079  9243  2404  2321  2054   353
#> 5  1980     0   945 37005 15226  1048   903 10045     1082 13837  9093  2407  2468  1887   359
#> 6  1981     0   829 41737 14798   978   901  9644      990
13276 8580 2300 1925 1964 382 ``` ]] --- ## Identifying tidy data > 1. Pick a cell in a column > 2. Ask "is **cell** a _value_ of **column**?" > 3. Repeat for each column .leftcol40[.code70[ ```r head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ]] -- .rightcol60[.code70[ ```r head(fed_spend_wide) ``` ``` #> # A tibble: 6 x 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ]] --- ## Identifying tidy data > Are the column names _values_ of a variable? 
.leftcol40[.code70[ ```r head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ]] .rightcol60[.code70[ ```r head(fed_spend_wide) ``` ``` #> # A tibble: 6 x 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ]] --- # **Quick practice 1**: Is this data frame "tidy"? > Decide [here](https://docs.google.com/presentation/d/1c66hRe_adYoWUPrhPf2A17pL4fmhAlCmNNqkSIloNnE/edit?usp=sharing) (link also in #classroom) **Description**: Tuberculosis cases in various countries .code90[ ``` #> # A tibble: 6 x 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 ``` ] --- # **Quick practice 2**: Is this data frame "tidy"? 
> Decide [here](https://docs.google.com/presentation/d/1c66hRe_adYoWUPrhPf2A17pL4fmhAlCmNNqkSIloNnE/edit?usp=sharing) (link also in #classroom) **Description**: Word counts by character type in "Lord of the Rings" trilogy .code80[ ``` #> # A tibble: 9 x 4 #> Film Race Female Male #> <chr> <chr> <dbl> <dbl> #> 1 The Fellowship Of The Ring Elf 1229 971 #> 2 The Fellowship Of The Ring Hobbit 14 3644 #> 3 The Fellowship Of The Ring Man 0 1995 #> 4 The Return Of The King Elf 183 510 #> 5 The Return Of The King Hobbit 2 2673 #> 6 The Return Of The King Man 268 2459 #> 7 The Two Towers Elf 331 513 #> 8 The Two Towers Hobbit 0 2463 #> 9 The Two Towers Man 401 3589 ``` ] --- # **Quick practice 3**: Is this data frame "tidy"? > Decide [here](https://docs.google.com/presentation/d/1c66hRe_adYoWUPrhPf2A17pL4fmhAlCmNNqkSIloNnE/edit?usp=sharing) (link also in #classroom) **Description**: Photovoltaic cell production by country .code90[ ``` #> # A tibble: 6 x 10 #> Year China Taiwan Japan Malaysia Germany `South Korea` `United States` Others World #> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> #> 1 1995 NA NA 16.4 NA NA NA 34.8 NA 77.6 #> 2 1996 NA NA 21.2 NA NA NA 38.8 NA 88.6 #> 3 1997 NA NA 35 NA NA NA 51 NA 126. #> 4 1998 NA NA 49 NA NA NA 53.7 NA 155. #> 5 1999 NA NA 80 NA NA NA 60.8 NA 201. #> 6 2000 2.5 NA 129. NA 22.5 NA 75 48.200000000000017 277. ``` ] --- class: center, middle, inverse # Why do we need tidy data? 
(a quick explanation with cute graphics, by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)) --- class: center background-image: url("images/horst_tidydata_1.jpg") background-size: contain --- class: center background-image: url("images/horst_tidydata_2.jpg") background-size: contain --- class: center background-image: url("images/horst_tidydata_3.jpg") background-size: contain --- class: center background-image: url("images/horst_tidydata_4.jpg") background-size: contain --- # Some tidy examples: data wrangling Compute the total R&D spending in each year .leftcol[ ```r head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ] .rightcol[ ```r fed_spend_long %>% group_by(year) %>% summarise(total = sum(rd_budget_mil)) ``` ``` #> # A tibble: 42 x 2 #> year total #> <dbl> <dbl> #> 1 1976 86227 #> 2 1977 91807 #> 3 1978 94864 #> 4 1979 96601 #> 5 1980 96305 #> 6 1981 98304 #> 7 1982 95448 #> 8 1983 95010 #> 9 1984 105371 #> 10 1985 114818 #> # … with 32 more rows ``` ] --- # Some tidy examples: data wrangling Compute the total R&D spending in each year .leftcol[ ```r head(fed_spend_wide) ``` ``` #> # A tibble: 6 x 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ] .rightcol[ ```r fed_spend_wide %>% 
mutate(total = DHS + DOC + DOD + DOE + DOT + EPA + HHS + Interior + NASA + NIH + NSF + Other + USDA + VA) %>% select(year, total) ``` ``` #> # A tibble: 42 x 2 #> year total #> <dbl> <dbl> #> 1 1976 86227 #> 2 1977 91807 #> 3 1978 94864 #> 4 1979 96601 #> 5 1980 96305 #> 6 1981 98304 #> 7 1982 95448 #> 8 1983 95010 #> 9 1984 105371 #> 10 1985 114818 #> # … with 32 more rows ``` ] --- # Some tidy examples: plotting Make a bar chart of total R&D spending by agency .leftcol45[ ```r head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ] .rightcol55[ ```r ggplot(fed_spend_long) + * geom_col(aes(x = rd_budget_mil, y = reorder(department, rd_budget_mil)), width = 0.7, alpha = 0.8) + theme_bw(base_size = 15) + labs(x = "R&D Spending ($Millions)", y = "Federal Agency") ``` <img src="figs/fed-spend-bars-1.png" width="432" /> ] --- class: inverse, center, middle # Tidying and Untidying your data with # `spread()` and `gather()` --- ## `spread()`: from tidy ("long") to untidy ("wide") ### `key` = column names, `value` = cells <center> <img src="images/tidy-spread.png" width=550> </center> --- ## `spread()`: from tidy ("long") to untidy ("wide") ### `key` = column names, `value` = cells -- .leftcol45[ ```r head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> department year rd_budget_mil #> <chr> <dbl> <dbl> #> 1 DOD 1976 35696 #> 2 NASA 1976 12513 #> 3 DOE 1976 10882 #> 4 HHS 1976 9226 #> 5 NIH 1976 8025 #> 6 NSF 1976 2372 ``` ] .rightcol55[ ```r fed_spend_wide <- fed_spend_long %>% * spread(key = department, * value = rd_budget_mil) head(fed_spend_wide) ``` ``` #> # A tibble: 6 x 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 
1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ] --- ## `gather()`: from untidy ("wide") to tidy ("long") ### `key` = column names, `value` = cells <center> <img src="images/tidy-gather.png" width=550> </center> --- ## `gather()`: from untidy ("wide") to tidy ("long") ### `key` = column names, `value` = cells -- .leftcol55[ ``` #> # A tibble: 6 x 15 #> year DHS DOC DOD DOE DOT EPA HHS Interior NASA NIH NSF Other USDA VA #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1976 0 819 35696 10882 1142 968 9226 1152 12513 8025 2372 1191 1837 404 #> 2 1977 0 837 37967 13741 1095 966 9507 1082 12553 8214 2395 1280 1796 374 #> 3 1978 0 871 37022 15663 1156 1175 10533 1125 12516 8802 2446 1237 1962 356 #> 4 1979 0 952 37174 15612 1004 1102 10127 1176 13079 9243 2404 2321 2054 353 #> 5 1980 0 945 37005 15226 1048 903 10045 1082 13837 9093 2407 2468 1887 359 #> 6 1981 0 829 41737 14798 978 901 9644 990 13276 8580 2300 1925 1964 382 ``` ] .rightcol45[ ```r fed_spend_long <- fed_spend_wide %>% * gather(key = "department", * value = "rd_budget_mil", * DHS:VA) head(fed_spend_long) ``` ``` #> # A tibble: 6 x 3 #> year department rd_budget_mil #> <dbl> <chr> <dbl> #> 1 1976 DHS 0 #> 2 1977 DHS 0 #> 3 1978 DHS 0 #> 4 1979 DHS 0 #> 5 1980 DHS 0 #> 6 1981 DHS 0 ``` ] --- class: inverse
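`spread()` and `gather()` still work, but tidyr 1.0.0 and later mark them as superseded in favor of `pivot_wider()` and `pivot_longer()`. The following is a minimal sketch of the same round trip with the newer functions, assuming tidyr 1.0.0 or later is installed. `fed_wide_demo` is a hypothetical three-department excerpt of the wide table above, not the full data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical excerpt of `fed_spend_wide`: two years, three departments
fed_wide_demo <- tribble(
  ~year, ~DOD,  ~DOE,  ~NASA,
  1976,  35696, 10882, 12513,
  1977,  37967, 13741, 12553
)

# Wide -> long: the counterpart of gather(key = "department", value = "rd_budget_mil", ...)
fed_long_demo <- fed_wide_demo %>%
  pivot_longer(
    cols = DOD:NASA,
    names_to = "department",
    values_to = "rd_budget_mil"
  )

# Long -> wide: the counterpart of spread(key = department, value = rd_budget_mil)
fed_wide_again <- fed_long_demo %>%
  pivot_wider(
    names_from = department,
    values_from = rd_budget_mil
  )

fed_long_demo
fed_wide_again
```

The argument names state the direction of each step: `names_to` and `values_to` create new columns, while `names_from` and `values_from` read existing ones.

---
class: inverse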
# Your turn: Tidy <--> Untidy

We already read in the following two data frames:

- `pv_cells`
- `milk_production`

Now we'll modify the format of each:

1. Use `spread()` to "untidy" the `milk_production` data into a format where the columns are state names and the values are the milk produced in each state.
2. Use `gather()` to "tidy" the `pv_cells` data into a data frame with three columns: `year`, `country`, `numCells`

---
class: center, middle, inverse

# Start thinking about research questions

---

# Writing a research question

Follow [these guidelines](https://writingcenter.gmu.edu/guides/how-to-write-a-research-question) - your question should be:

--

- **Clear**: your audience can easily understand its purpose without additional explanation.

--

- **Focused**: it is narrow enough that it can be addressed thoroughly with the data available and within the limits of the final project report.

--

- **Concise**: it is expressed in the fewest possible words.

--

- **Complex**: it is not answerable with a simple "yes" or "no," but rather requires synthesis and analysis of data.

--

- **Arguable**: its potential answers are open to debate rather than accepted facts (do others care about it?).

---

# Writing a research question

--

**Bad question: Why are social networking sites harmful?**

- Unclear: it does not specify _which_ social networking sites or state what harm is being caused; assumes that "harm" exists.

--

**Improved question: How are online users experiencing or addressing privacy issues on such social networking sites as Facebook and Twitter?**

- Specifies the sites (Facebook and Twitter), type of harm (privacy issues), and who is harmed (online users).

--

**Other good examples**: See the [Example Projects Page](https://eda.seas.gwu.edu/2021-Spring/ref-example-analyses.html)