Due: 27 March, 11:00 pm

Weight: This assignment is worth 8% of your final grade.

Purpose: This assignment is designed to be a final practice run before we shift our focus towards the class projects. You will practice and further hone skills you have already developed to prepare you for similar challenges that you may run into in your projects.

Skills & Knowledge: In this assignment, you will practice exploring data in which time is a central variable. Your analysis will involve writing text and code to create a reproducible document in the form of an HTML page.

Assessment: I will use this rubric to grade the finished product.

Background: The author of this article from 7 years ago wrote about how attendance at NHL games was climbing. In an effort to communicate this message, the author created this, um…unfortunate chart:



Tasks:

For this assignment, you will use the ggplot2 library in R and data from ESPN to explore this question:

How has NHL game attendance changed over the past two decades?

Here’s what you need to do:

  1. Download this .zip file. It contains an analysis.Rmd file that you should use as a template to write your analysis in, a data folder with the relevant data needed for this assignment, and a project.RProj file to help you stay organized.

  2. Clean the data. Read in the file NHL_Attendance.xlsx in the data folder. Write code to preview the data. Take note of the type of each variable and whether there are any missing values. Are all the variables encoded the way you would expect? Write code to modify variable types and names to get your data frame cleaned up for analysis. (Hint: The janitor::clean_names() function will come in handy.) When you’re done cleaning your data, write a few sentences describing any modifications you made to the original data and why you made them.

  3. Create some new variables. First, use the SEASON variable to create a new variable representing the year stored as a number. For example, for the season "2017-18", the year should be the number 2017. Also, remember that we are interested in assessing the change in NHL game attendance over time. To facilitate that, create a variable for the percentage change in attendance for each team in each season relative to the first season in the data ("2000-01"). For example, Tampa Bay’s attendance at home was 611,173 in the 2000-01 season and grew to 782,772 in the 2017-18 season. Thus, the percentage growth in home attendance between these two seasons was 100*(782,772 - 611,173) / 611,173 ≈ 28%. You should end up with three new variables (growth in “home”, “road”, and “total” attendance) that store the percentage growth for each team and each season relative to the "2000-01" season.

  4. Summarize the data. Examine measures of centrality and variability in the important variables relevant to our research question, including the new variables you created in step 3.

  5. Visualize the data. The original chart is pretty terrible, so we’re going to just scratch that and start over. Create an appropriate visualization that highlights the change in NHL game attendance for each team in each season since the 2000-01 season. You are free to use whatever chart type you wish. Here are some of the options we covered in week 6:

    • Points
    • Points + line
    • Points + smoothed line
    • Line + area
    • Bars
    • Stacked area
    • Lines with mean line overlay
    • Heatmap
    • Seasonal chart
    • Sankey chart
    • Animations (yes, you can totally do an animation if you want!)

    Your chart should follow the design principles we have covered in class, and it should be “polished” following the techniques we covered in week 8.

  6. Visualize the data (again). While the first visualization highlights the change in NHL game attendance for each team, this second chart should highlight the overall trend across all teams. To do this, you will probably need to create a summary data frame from your original one. Again, you are free to use whatever chart type you wish, but your chart should follow the design principles we have covered in class and it should be “polished”.

  7. Write a summary of your analysis. I’m specifically looking for a discussion of the following:

    • What was wrong with the original chart? Discuss specific design principles we have covered in class.
    • Discuss the message you intended to convey with each of your charts and how your design choices draw attention to that message.

  8. Click the “knit” button to compile your .Rmd file into an HTML web page.

  9. Create a zip file of your whole project (.Rmd, .html, .RProj, and the data folder), then go to the “Assignment Submission” page on Blackboard and submit your zip file.
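
If the arithmetic in step 3 is unclear, here is a minimal sketch of the calculation on a tiny toy data frame. It is not the solution to the assignment. The column names (team, season, home) are assumptions about what the cleaned data might look like, the Tampa Bay numbers are the ones quoted in step 3, and the "Example Team" rows are made up purely for illustration. The sketch also assumes every team has a 2000-01 row; a team without one would need separate handling. The last part hints at the kind of summary data frame mentioned in step 6.

```r
library(dplyr)

# Toy data (NOT the real data set). The Tampa Bay rows use the figures
# quoted in step 3; the "Example Team" rows are invented for illustration.
# Column names are assumed, not taken from NHL_Attendance.xlsx.
attendance <- tibble(
  team   = c("Tampa Bay", "Tampa Bay", "Example Team", "Example Team"),
  season = c("2000-01", "2017-18", "2000-01", "2017-18"),
  home   = c(611173, 782772, 700000, 735000)
)

growth <- attendance %>%
  # "2017-18" -> 2017: keep the first four characters, convert to a number
  mutate(year = as.numeric(substr(season, 1, 4))) %>%
  group_by(team) %>%
  # Percentage change relative to each team's own 2000-01 value
  mutate(home_growth = 100 * (home - home[year == 2000]) / home[year == 2000]) %>%
  ungroup()

# One row per year: the mean growth across all teams in that year
overall <- growth %>%
  group_by(year) %>%
  summarise(mean_home_growth = mean(home_growth))

growth
overall
```

The same pattern would extend to the “road” and “total” columns, either by repeating the mutate() line or by using dplyr's across().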


EMSE 4197 (CRN 78916): Exploratory Data Analysis - Spring 2020
George Washington University | School of Engineering & Applied Science
Dr. John Paul Helveston | jph@gwu.edu | Wednesdays | 12:45–3:15 PM | District House B205
This work is licensed under a Creative Commons Attribution 4.0 International License.
See the licensing page for more details about copyright information.
Content 2020 John Paul Helveston