HW 03 - Data tidying

Homework

Modified

February 20, 2025

Important

This homework is due February 26 at 11:59pm.

Learning objectives

Pivot untidy data sets
Join relational data tables
Implement interpretable and accessible data visualizations

Getting started

Go to the info2951-sp25 organization on GitHub. Click on the repo with the prefix hw-03. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

General guidance

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title and labeled axes, and you should give careful consideration to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

Update the author name in your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed; avoid tiny or huge plots.
Turn in an organized, well-formatted document.

Packages

We’ll use the {tidyverse} package for much of the data wrangling and the {scales} package for better plot labels.

library(tidyverse)
library(scales)

Data

The datasets that you will work with in this assignment come from the Organisation for Economic Co-operation and Development (OECD), stats.oecd.org.

Part 1: Inflation across the world

For this part of the analysis you will work with inflation data from various countries in the world over the last 30 years.

country_inflation <- read_csv("data/country-inflation.csv")

Exercise 1

Describe the data structure. What does each row of the country_inflation dataset represent? What are the columns in the dataset and what do they represent?

Exercise 2

Tidy the data set. Reshape country_inflation such that each row represents a country/year combination, with columns country, year, and annual_inflation. Make sure that annual_inflation is a numeric variable. Save the result as a new data frame with a concise and informative name.
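As a rough illustration of this kind of reshaping, here is a minimal sketch on made-up data. The wide_toy tibble, its values, and the assumption that years are stored as column names are all hypothetical; inspect country_inflation to see how it is actually laid out before adapting anything.

```r
library(tidyverse)

# Hypothetical wide data: one row per country, one column per year.
# These values are invented for illustration only.
wide_toy <- tibble(
  country = c("A", "B"),
  `2020` = c(1.2, 3.4),
  `2021` = c(2.5, 4.1)
)

# Pivot every column except `country` into year/value pairs,
# converting the former column names to integers along the way.
long_toy <- wide_toy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "annual_inflation",
    names_transform = list(year = as.integer)
  )

long_toy
```

If the value columns had been read in as character, the values_transform argument of pivot_longer() accepts the same kind of named list.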

Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

Analyze inflation across the globe. In a single pipeline, filter your reshaped dataset to include a subset of countries of your choosing, and create a plot of annual inflation over time for these countries. Then, in a few sentences, state why you chose these countries and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries’ economies.

Data should be represented with points as well as lines connecting the points for each country.
Each country should be represented by a different color line.
Axes and legend should be properly labeled.
The plot should have an appropriate title (and optionally a subtitle).
Axis labels for annual inflation should be shown in percentages (e.g., 25% not 25).

label_percent()

The label_percent() function from the {scales} package will be useful.

ggplot(...) +
  ... +
  scale_y_continuous(labels = label_percent())
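One detail worth checking is the unit of your inflation values. By default, label_percent() treats its input as a proportion and multiplies by 100, while its scale argument changes that multiplier. The sketch below uses invented numbers, and whether your data is stored as proportions or as percent units is something to verify yourself.

```r
library(scales)

# Default (scale = 100): input is treated as a proportion.
label_percent(accuracy = 1)(0.25)
#> [1] "25%"

# scale = 1: input is assumed to already be in percent units.
label_percent(accuracy = 1, scale = 1)(25)
#> [1] "25%"
```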

Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Part 2: Inflation in the US

The OECD defines inflation as follows:

Inflation is a rise in the general level of prices of goods and services that households acquire for the purpose of consumption in an economy over a period of time.

The main measure of inflation is the annual inflation rate which is the movement of the Consumer Price Index (CPI) from one month/period to the same month/period of the previous year expressed as percentage over time.

Source: OECD CPI FAQ

CPI is broken down into 12 divisions such as food, housing, health, etc. Your goal in this part is to create another time series plot of annual inflation, this time for only the United States.

The data you will need to create this visualization is spread across two files:

us-inflation.csv: Annual inflation rate for the US for 12 CPI divisions. Each division is identified by an ID number.
cpi-divisions.csv: A “lookup table” of CPI division ID numbers and their descriptions.

Let’s load both of these files.

us_inflation <- read_csv("data/us-inflation.csv")
cpi_divisions <- read_csv("data/cpi-divisions.csv")

Exercise 4

Join the data frames. Add a column to the us_inflation dataset called description which has the CPI division description that matches the cpi_division_id, by joining the two datasets.

Determine which type of join is the most appropriate one and use that.
Note that the two datasets don’t have common column names. Review the documentation for the *_join() functions to determine how to use the by argument when the names of the variables that the datasets should be joined by are different.
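To illustrate joining on differently named keys, here is a minimal sketch with made-up tables. The scores and roster tibbles and their column names are hypothetical, and left_join() is used only for demonstration; choosing the appropriate join for this exercise is still up to you. It assumes a version of {dplyr} that provides join_by() (1.1.0 or later).

```r
library(tidyverse)

# Hypothetical tables whose key columns have different names.
scores <- tibble(student_id = c(1, 2, 3), score = c(90, 85, 77))
roster <- tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))

# join_by() pairs the key in the left table with the key in the right table.
scores_named <- scores |>
  left_join(roster, by = join_by(student_id == id))

scores_named
```

The older named-vector form, by = c("student_id" = "id"), expresses the same match.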

Exercise 5

Analyze inflation across divisions. In a single pipeline, filter your joined dataset to include a subset of CPI divisions which you wish to examine, and create a plot of annual inflation over time for these divisions. Then, in a few sentences, state why you chose these divisions and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of inflation rates in the US over the last decade.

Data should be represented with points as well as lines connecting the points for each division.
Each division should be represented by a different color line.
Axes and legend should be properly labeled.
The plot should have an appropriate title (and optionally a subtitle).
Axis labels for annual inflation should be shown in percentages (e.g., 25% not 25).
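The general shape of a filter-then-plot pipeline that meets requirements like these might look like the sketch below. The toy tibble, its column names, and its values are invented for illustration and are assumed to be proportions (0.02 meaning 2%); your joined data will have different columns and possibly different units.

```r
library(tidyverse)
library(scales)

# Hypothetical long-format data; values are proportions (0.02 = 2%).
toy <- tibble(
  group = rep(c("A", "B", "C"), each = 3),
  year = rep(2020:2022, times = 3),
  rate = c(0.01, 0.02, 0.04, 0.02, 0.03, 0.05, 0.00, 0.01, 0.02)
)

# Filter and plot in one pipeline: points plus connecting lines,
# one color per group, percent-formatted y-axis, and explicit labels.
p <- toy |>
  filter(group %in% c("A", "B")) |>
  ggplot(aes(x = year, y = rate, color = group)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 2020:2022) +
  scale_y_continuous(labels = label_percent()) +
  labs(
    title = "Toy rates over time",
    x = "Year",
    y = "Rate",
    color = "Group"
  )

p
```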

Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials → Cornell University NetID and log in using your NetID credentials.
Click on your INFO 2951 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Component	Points
Ex 1	3
Ex 2	5
Ex 3	16
Ex 4	5
Ex 5	16
Workflow & formatting	5
Total	50

Workflow & formatting criteria

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

Having at least 3 informative commit messages
Following {tidyverse} code style
All code being visible in the rendered PDF (no more than 80 characters per line)

Acknowledgments

This assignment is derived from STA 199: Introduction to Data Science