Mini Project 1: Data Cleaning

Due: Sep 23 by 11:59pm

Weight: This assignment is worth 10% of your final grade.

Purpose: When you start looking for data to address your project for this course, you will inevitably come across some rather messy data. Fear not! After completing this assignment, you will be well-prepared to tackle the messiness of real-world data! This assignment will put your data wrangling and cleaning skills to the test and will challenge you to become a master data janitor!

Assessment: Your submission will be assessed using the rubric at the bottom of this page.

Your mission, should you choose to accept it, is to clean up a relatively messy data file that contains sales of passenger cars by country between 2019 and 2024. The data are from the International Organization of Motor Vehicle Manufacturers (OICA). Your final product should be a tidy (long format) data frame with three columns: country, year, and sales. It should look like this:

head(df)
#> # A tibble: 6 × 3
#>   country    year   sales
#>   <chr>     <dbl>   <dbl>
#> 1 Argentina  2019  333183
#> 2 Australia  2019  799263
#> 3 Austria    2019  320381
#> 4 Belgium    2019  550008
#> 5 Brazil     2019 2262073
#> 6 Bulgaria   2019   35371
tail(df)
#> # A tibble: 6 × 3
#>   country                   year   sales
#>   <chr>                    <dbl>   <dbl>
#> 1 Ukraine                   2024   69599
#> 2 United Arab Emirates      2024  268876
#> 3 United Kingdom            2024 1952778
#> 4 United States Of America  2024 2984039
#> 5 Uzbekistan                2024  166058
#> 6 Vietnam                   2024  291797

With this in mind, your job is to develop and execute a strategy to go from the raw excel file to this cleaned version of the data.

1. Get organized

Download and unzip this template for your project, then open the report.Rproj file.

Once RStudio opens, click on the report.qmd file. That is the primary file you will edit to conduct your analysis.

2. Document the data

Inside the data folder, there is a README.md file with some missing information. Click on that file and edit it to fill in the missing information. Here is some info that will help:

The main data file we’ll be working with is the pc_sales_2024.xlsx file in the data folder. You can find this file online at https://www.oica.net/category/sales-statistics/. We’re using the “Passengers Cars” data (the “pc” in pc_sales_2024). The link to the original data file can be found by right-clicking on the Excel icon on the right side of the page, like this:

3. Preview the data

With messy Excel files, it is often helpful to first open and view them so you can learn about what might be needed to clean them up in R, such as how many lines you may need to skip at the top when reading in the data. To make sure Excel doesn’t corrupt your data, make a copy of the Excel file and open that copy with Microsoft Excel. You can keep that copy open throughout your data cleaning journey and can be confident that you haven’t corrupted the original file!

4. Load the data

Use read_excel() from the readxl package to read in the pc_sales_2024.xlsx data file. Do you need to skip any rows while reading it in?

5. Develop Your Strategy

Before writing any cleaning code, you need a plan.

Examine your imported data using functions like head() and glimpse() to understand the structure of the data in R.

Write out your strategy: Create a numbered list of the major steps you’ll need to take to transform this messy data into the target format shown above. Consider:

  • What needs to happen to the column names?
  • Do you need to reshape the data (wide vs long format)?
  • What cleaning is needed in each column?
  • Are there rows that shouldn’t be included?
  • How will you get the data into the final column structure?

Deliverable: A written strategy with numbered steps describing your planned approach. Be specific about what each step will accomplish.

6. Implement Your Strategy

Now execute your plan! Work through each step of your strategy and write the code to implement it.

Important: Your code should transform the messy Excel data into the exact target format shown at the top of this assignment. The final dataset should contain:

  • Only observations of countries (not regions or totals)
  • Years as numbers (e.g., 2019, 2020, 2021, 2022, 2023, 2024)
  • Sales values as numbers
  • Country names in title case (e.g., “France” not “FRANCE”)
  • Rows arranged by year, then country

Deliverable: Clean, well-commented code that successfully produces the target dataset.

7. Validate and Document

Save your cleaned data as my_clean_data.csv in the data folder.

Create a brief summary addressing:

  • What were the 2-3 biggest challenges in cleaning this data?
  • How did you decide what counted as a “country” vs. a “region”?
  • How confident are you that your cleaning preserved the data integrity?
  • What would you do differently if you encountered similar data again?

Deliverable: Saved CSV file and written reflection (4-6 sentences).

8. Render and submit

Click the “Render” button to compile your .qmd file into a html web page. Then open the report.html file in a web browser and proofread your report.

Does all of the formatting look correct? Make sure there are no errors in the rendered file before submitting it.

Once you’ve proofread your report, create a zip file of all the files in your R project folder for this assignment and submit it on the corresponding assignment submission on Blackboard.

BONUS: Make a summary visualization (+5%)

For a 5% bonus, add a code chunk at the bottom of your report to generate the plot below. If your cleaned data is not properly formatted, you can read in the pc_sales_clean.csv file and use it to make the plot.

Some hints to perfectly replicate the figure:

  • Consider using ifelse() to make a new variable for the bar color based on the country variable.
  • You can use fct_relevel() to re-order the country factors (the order of how they are stacked).
  • The fill colors are 'grey', 'red', and 'blue'.
  • The theme is theme_minimal().

Grading Rubric

35 Total Points

Category Excellent Good Needs work
Organization & Formatting 5
All formatting guidelines are followed; YAML is correctly filled out.
4
Most formatting guidelines are followed; YAML is correctly filled out.
3
Several or all formatting guidelines not followed; YAML contains elements that aren't updated from the template; template file wasn’t used.
Data Documentation 5
The README file included in the template was updated with thoughtful and accurate information about the data.
4
The README file included in the template was updated, but some of the information was missing or inaccurate.
3
The README file included in the template was not updated, missing, or most of the information in it was inaccurate.
Data Cleaning 18 - 20
All cleaning steps are completed and the resulting "clean" data frame is exactly (or very close to) the desired final state.
15 - 17
All cleaning steps are completed, though there are one or two significant errors that prevents the resulting "clean" data frame from being in the desired final state.
10 - 14
More than one of the cleaning steps are missing; multiple errors prevent the resulting "clean" data frame from being in the desired final state.
Technical things 5
All code runs without errors; the html file can be compiled from the Rmd; all files included in the submitted .zip file.
4
Code has only one or two error, otherwise runs; all files included in the submitted .zip file.
3
Code has multiple errors; html file cannot be rendered without significant changes to the Rmd file; submitted .zip file is missing components necessary to reproduce analysis.