Due: 16 February, 11:59 pm
Weight: This assignment is worth 9% of your final grade.
Purpose: When you get a data file to analyze, one of the first things you’ll want to do is explore it. But that’s often easier said than done. The purpose of this assignment is to give you a chance to practice some of the techniques we’ve discussed in class for exploring a data set.
Assessment: I will use this rubric to grade your submissions.
Background: For this assignment, you’ll be exploring data on the costs of hundreds of transit projects around the world collected by the Transit Costs Project. The data span more than 50 countries and total more than 11,000 km of urban rail built since the late 1990s.
Download and unzip this template for your project, then open the project.Rproj file. In the setup chunk in your report.Rmd file, read in the transit_cost.csv file - you can read it in directly from the web with this line of code:
transit_cost <- read_csv('http://eda.seas.gwu.edu/2021-Spring/data/transit_cost.csv')
Note: The template comes with some text and code explaining how to use it - you should delete this text and code and adjust the content in the YAML for your report.
Go to this GitHub page and read about the data we’ll be using (that’s where I got the data from). In your report.Rmd file, identify and document the original data source as well as the source from which you downloaded the data (note that the GitHub page is not the original data source - it is just a repository where someone else put the data).
Format: This can just be a single paragraph describing the data sources.
Preview the data (e.g. using head(), glimpse(), View(), and / or make some quick plots). Take note of what variables are available, their types, and what they measure (Hint: look at the data dictionary on the GitHub page!).
Format: This can be a mix of code chunks and written text. It is more for you than me - think of this like jotting down notes to help you understand what is in the data.
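As a rough sketch of what a preview chunk might look like, the code below runs a few of these functions on a small toy data frame. The data frame toy_transit and its column names (country, length_km, has_tunnel) are made up for illustration and are not the real columns in transit_cost - check the data dictionary for those. View() is left out because it only works in an interactive session.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for transit_cost. The column names and values are
# assumptions for illustration, not the real schema.
toy_transit <- tibble(
  country    = c("US", "US", "FR", "JP", "JP"),
  length_km  = c(5.2, 12.0, 8.4, 3.1, 20.5),
  has_tunnel = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Look at the first three rows
first_rows <- head(toy_transit, 3)
first_rows

# One line per column: name, type, and the first few values
glimpse(toy_transit)

# Count the rows for each country, most frequent first
country_counts <- count(toy_transit, country, sort = TRUE)
country_counts

# Quick histogram of a single numeric variable
length_hist <- ggplot(toy_transit, aes(x = length_km)) +
  geom_histogram(bins = 5)
length_hist
```

In your own report you would swap toy_transit for transit_cost and use the variable names you find in the data.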
Once you have a sense for the available variables in the data, list at least three questions you think you may be able to answer with these data (you can list more if you want). For each question, also note which variables you plan to explore to address the question. A question can be about what the data captures (e.g. “Which countries in the data have the greatest total project length in km?”) or about a relationship between different variables (e.g. “Are projects with tunnels more expensive on a $/km basis than projects without tunnels?”).
Format: Use a numbered list to list each question and the associated variables you will explore.
Note: It is okay if you end up not being able to answer your question - just write down what you think you might be able to find out by exploring the data.
Go through each of your questions and search for answers. You should include at least one visualization for each question (so a minimum of 3 charts total).
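To give a sense of what working through one question can involve, here is a minimal sketch based on the tunnel example question, using a toy data frame rather than the real data. The names toy_transit, length_km, tunnel_km, and cost_million are hypothetical, and the values are invented, so treat this as one possible pattern (wrangle, summarise, plot) rather than the expected answer.

```r
library(dplyr)
library(ggplot2)

# Toy data. Column names and values are made up for illustration
# and do not come from transit_cost.
toy_transit <- tibble(
  country      = c("US", "US", "FR", "FR", "JP", "JP"),
  length_km    = c(5, 10, 8, 4, 20, 10),
  tunnel_km    = c(5, 0, 4, 0, 10, 0),
  cost_million = c(2000, 1500, 1200, 400, 3000, 900)
)

# Wrangle: drop zero-length rows, then create the variables needed
toy_costs <- toy_transit %>%
  filter(length_km > 0) %>%                   # avoid dividing by zero
  mutate(
    cost_per_km = cost_million / length_km,   # cost in millions per km
    has_tunnel  = tunnel_km > 0               # TRUE if any tunnel length
  )

# Summarise: median cost per km with and without tunnels
cost_summary <- toy_costs %>%
  group_by(has_tunnel) %>%
  summarise(median_cost_per_km = median(cost_per_km), n = n())
cost_summary

# Visualize: compare the two groups with a boxplot
cost_plot <- ggplot(toy_costs, aes(x = has_tunnel, y = cost_per_km)) +
  geom_boxplot() +
  labs(x = "Project includes a tunnel", y = "Cost per km (millions)")
cost_plot
```

A boxplot is only one option here - depending on your question, a bar chart, scatterplot, or histogram may show the pattern more clearly.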
For each question, follow these steps:
Manipulate the transit_cost data frame as needed to address your question. For example, you may need to create new variables (e.g. using mutate()), filter the data (e.g. if you’re only interested in a single country, year, etc.), or rename some of the variables. For this task, you’ll be relying on your data wrangling skills - if you’re unfamiliar with how to do this, review this data wrangling lesson, come to office hours, and ask questions on Slack - we’re here to help!
Click the “knit” button to compile your .Rmd file into an HTML web page, then create a zip file of everything in your R Project folder. Go to the “Assignment Submission” page on Blackboard and submit your zip file under “Mini Project 1.”