Due: 16 February, 11:59 pm

Weight: This assignment is worth 9% of your final grade.

Purpose: When you get a data file to analyze, one of the first things you’ll want to do is explore it. But that’s often easier said than done. The purpose of this assignment is to give you a chance to practice some of the techniques we’ve discussed in class for exploring a data set.

Assessment: I will use this rubric to grade your submissions.

Background: For this assignment, you’ll be exploring data on the costs of hundreds of transit projects around the world collected by the Transit Costs Project. The data span more than 50 countries and totals more than 11,000 km of urban rail built since the late 1990s.


1. Get organized

Download and unzip this template for your project, then open the project.Rproj file. In the setup chunk in your report.Rmd file, read in the transit_cost.csv file - you can read it in directly from the web with this line of code:

transit_cost <- read_csv('http://eda.seas.gwu.edu/2021-Spring/data/transit_cost.csv')

Note: The template comes with some text and code explaining how to use it - you should delete this code and adjust the content in the YAML for your report.

2. Document the data

Go to this GitHub page and read about the data we’ll be using (that’s where I got the data in from). Identify and document in your report.Rmd file the original data source as well as the source from where you downloaded the data (note that the GitHub page is not the original data source - it is just a repository where someone else put the data).

Format: This can just be a single paragraph describing the data sources.

3. Preview the data

Preview the data (e.g. using head(), glimpse(), View(), and / or make some quick plots). Take note of what variables are available, their types, and what they measure (Hint: look at the data dictionary on the GitHub page!).

Format: This can be a mix of code chunks and written text. It is more for you than me - think of this like jotting down notes to help you understand what is in the data.

4. Identify research questions

Once you have a sense for the available variables in the data, list at least three questions you think you may be able to answer with these data (you can list more if you want). For each question, also note which variables you plan to explore to address the question. A question can be about what the data captures (e.g. “Which countries in the data have the longest total amount of projects in km?”) or about a relationship between different variables (e.g. “Are projects with tunnels more expensive on a $/km basis than projects without tunnels?”).

Format: Use a numbered list to list each question and the associated variables you will explore.

Note: It is okay if you end up not being able to answer your question - just write down what you think you might be able to find out by exploring the data.

5. Explore the data

Go through each of your questions and search for answers. You should include at least one visualization for each question (so a minimum of 3 charts total).

For each question, follow these steps:

  1. If necessary, modify the transit_cost data frame to address your question. For example, you may need to create new variables (e.g. using mutate()), filter the data (e.g. if you’re only interested in a single country, year, etc.), or you may want to rename some of the variables. For this task, you’ll be relying on your data wrangling skills - if you’re unfamiliar with how to do this, review this data wrangling lesson, come to office hours, and ask questions on Slack - we’re here to help!
  2. Examine summary measures (centrality, variability, correlation) in the variables relevant to your question by making charts and / or printing out summary values / tables. Your chart(s) should be appropriately chosen according to the data type and / or relationship you are searching for. At this stage in the course, don’t worry to much about how “pretty” your charts looks - we’ll get better at polishing our charts throughout the class. Just make sure they are clear and legible.
  3. Write a paragraph describing what you found. If you found an answer to your question, make sure you have included at least one chart (or a few charts if appropriate) that helps explain your answer. If this process did not lead to an answer your question, write about how you might adjust your question or perhaps what other data you may need to address your question.

6. Knit and submit

Click the “knit” button to compile your .Rmd file into a html web page, then create a zip file of everything in your R Project folder. Go to the “Assignment Submission” page on Blackboard and submit your zip file under “Mini Project 1.”

EMSE 4575: Exploratory Data Analysis (Spring 2021)
Wednesdays | 12:45 - 3:15 PM | Dr. John Paul Helveston | jph@gwu.edu |