HW 02 - Data wrangling + tidying
This homework is due February 11 at 11:59pm ET.
Learning objectives
- Transform data to extract meaning from it
- Pivot untidy data sets
- Join relational data tables
- Implement interpretable and accessible data visualizations
Getting started
Go to the info2951-sp26 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this and other assignments in this course. Periodic reminders throughout this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well formatted document.
Packages
We’ll use the {tidyverse} package for much of the data wrangling and visualization and the {scales} package for better formatting of labels on visualizations.
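A minimal sketch of loading the two packages named above, assuming both are already installed in your R library:

```r
# Load the packages used throughout this assignment
library(tidyverse) # data wrangling and visualization
library(scales) # label formatting for plots
```

Loading {scales} after {tidyverse} prints a few messages about masked functions; these are informational and not errors.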
Part 1: College majors and earnings
The first step in the process of turning information into knowledge is to summarize and describe the raw information - the data. In this part we explore data on college majors and earnings, inspired by this FiveThirtyEight story “The Economic Guide To Picking A College Major”.
The data come from the American Community Survey (ACS) 2019-2023 Public Use Microdata Series, focusing specifically on recent college graduates.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
Data
The data can be found in data/college-recent-grads.csv, and you can load it with read_csv(). It contains the following variables:
| Variable | Description |
|---|---|
| rank | Rank by median earnings |
| major_code | Major code, FO1DP in ACS PUMS |
| major | Major description |
| major_category | Category of major from Carnevale et al |
| total | Total number of people with major |
| sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
| men | Men with major |
| women | Women with major |
| sharewomen | Proportion women |
| employed | Number employed (ESR == 1 or 2) |
| employed_fulltime | Employed 35 hours or more |
| employed_parttime | Employed less than 35 hours |
| employed_fulltime_yearround | Employed at least 48 weeks and at least 30 hours |
| unemployed | Number unemployed (ESR == 3) |
| unemployment_rate | Unemployed / (Unemployed + Employed) |
| p25th | 25th percentile of earnings |
| median | Median earnings of full-time, year-round workers |
| p75th | 75th percentile of earnings |
You can also take a quick peek at your data frame and view its dimensions with the glimpse() function.
glimpse(college_recent_grads)
Rows: 174
Columns: 18
$ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ major_code <dbl> 4005, 2417, 2407, 2102, 2419, 2408, 2405, …
$ major <chr> "Mathematics And Computer Science", "Naval…
$ major_category <chr> "Computers & Mathematics", "Engineering", …
$ total <dbl> 2087, 2368, 85839, 302333, 5188, 109217, 6…
$ sample_size <dbl> 107, 95, 3705, 12857, 218, 4844, 2833, 401…
$ men <dbl> 1296, 1843, 65771, 234122, 4201, 85888, 39…
$ women <dbl> 791, 525, 20068, 68211, 987, 23329, 24122,…
$ sharewomen <dbl> 0.3790129, 0.2217061, 0.2337865, 0.2256155…
$ employed <dbl> 1899, 1595, 69907, 248415, 4630, 91870, 55…
$ employed_fulltime <dbl> 1705, 2018, 64633, 226707, 4280, 84015, 51…
$ employed_parttime <dbl> 382, 350, 21206, 75626, 908, 25202, 12267,…
$ employed_fulltime_yearround <dbl> 1353, 1668, 54911, 189733, 3205, 69432, 42…
$ unemployed <dbl> 54, 44, 4002, 17776, 139, 3751, 2299, 190,…
$ unemployment_rate <dbl> 0.02764977, 0.02684564, 0.05414767, 0.0667…
$ median <dbl> 100000, 89000, 80000, 78000, 78000, 76000,…
$ p25th <dbl> 70000, 65000, 56000, 52000, 56000, 60000, …
$ p75th <dbl> 114000, 100000, 108000, 110000, 104000, 10…
Exercise 1
Which majors have the highest and lowest representation of women in the workforce? Answer the question using two data wrangling pipelines. One table should contain the ten majors with the highest proportions of women, and the other table should contain the ten majors with the lowest proportions of women. The tables should be sorted in an appropriate order. In a few sentences, describe any trends you observe.
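As a hedged illustration of the "top n / bottom n" pattern this exercise asks for, here is a sketch on a small made-up table. The `snacks` data and its column names are hypothetical and are not part of the homework data; `slice_max()` and `slice_min()` are one of several ways to do this (`arrange()` followed by `slice_head()` is another).

```r
library(tidyverse)

# Hypothetical toy data, not the homework dataset
snacks <- tibble(
  snack = c("apple", "granola bar", "chips", "yogurt", "cookie"),
  share_sugar = c(0.10, 0.29, 0.02, 0.12, 0.35)
)

# Two rows with the largest values, sorted from largest to smallest
highest <- snacks |>
  slice_max(order_by = share_sugar, n = 2)

# Two rows with the smallest values, sorted from smallest to largest
lowest <- snacks |>
  slice_min(order_by = share_sugar, n = 2)
```

Both functions keep ties by default, so a result can contain more than `n` rows when values are tied at the cutoff.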
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 2
How much are college graduates making?
a. Plot the distribution of all median incomes using a histogram with an appropriate binwidth (you will need to determine what is “appropriate” – remember there is not one single value you should use).
b. Calculate the mean and median for median income. Based on the shape of the histogram, determine which of these summary statistics is useful for describing the distribution.
c. Describe the distribution of median incomes of college graduates across various majors based on your histogram from part (a) and incorporating the statistic you chose in part (b) to help your narrative.

Hint: Mention shape, center, spread, and any unusual observations.
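The mechanics of parts (a) and (b) can be sketched on made-up numbers. The `toy_incomes` values below are hypothetical, and the binwidth of 10000 is chosen only for this toy data; it is not a recommendation for the homework data.

```r
library(tidyverse)

# Hypothetical right-skewed incomes, not the homework dataset
toy_incomes <- tibble(
  income = c(30000, 32000, 35000, 36000, 40000, 41000, 45000, 60000, 110000)
)

# binwidth is expressed in the units of the x variable (dollars here)
income_hist <- ggplot(toy_incomes, aes(x = income)) +
  geom_histogram(binwidth = 10000) +
  labs(title = "Toy income distribution", x = "Income", y = "Count")

# Compare the two measures of center
income_summary <- toy_incomes |>
  summarize(
    mean_income = mean(income),
    median_income = median(income)
  )
```

In this toy data the single large value pulls the mean above the median, which is the kind of pattern to look for when deciding which statistic describes the center better.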
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
How do the distributions of median income compare across major categories?
a. Calculate the minimum, median, and maximum median income per major category as well as the number of majors in each category. Your summary statistics should be in decreasing order of median income.
b. Create box plots of the distribution of median income by major category.
  - The variable major_category should be on the y-axis and median on the x-axis.
  - The boxes should be sorted meaningfully, with the major category with the largest median income at the top of the chart and the major category with the smallest median income at the bottom of the chart.
  - Use color to enhance your plot, and turn off any legends providing redundant information.
  - Style the x-axis labels such that the values are shown in thousands, e.g., 20000 should show up as $20K.
c. In 1-2 sentences, describe how median incomes across various major categories compare. Your description should also touch on where your own intended/declared major (yes, your major at Cornell) falls.
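The pieces this exercise combines (grouped summaries, reordered boxes, and dollar labels in thousands) can be sketched on a toy table. The `toy_jobs` data and its column names are hypothetical. The label formatting uses `label_dollar()` from {scales} with an explicit `accuracy = 1`, which is one reasonable choice rather than the only one.

```r
library(tidyverse)
library(scales)

# Hypothetical toy data, not the homework dataset
toy_jobs <- tibble(
  field = rep(c("A", "B", "C"), each = 3),
  pay = c(20000, 25000, 30000, 50000, 55000, 70000, 35000, 40000, 42000)
)

# Per-group summaries, sorted by decreasing median
pay_summary <- toy_jobs |>
  group_by(field) |>
  summarize(
    min_pay = min(pay),
    median_pay = median(pay),
    max_pay = max(pay),
    n_rows = n()
  ) |>
  arrange(desc(median_pay))

# fct_reorder() orders groups by median pay, so the largest ends up on top
pay_boxplot <- ggplot(
  toy_jobs,
  aes(x = pay, y = fct_reorder(field, pay), fill = field)
) +
  geom_boxplot(show.legend = FALSE) +
  scale_x_continuous(
    labels = label_dollar(scale = 1 / 1000, suffix = "K", accuracy = 1)
  ) +
  labs(x = "Pay", y = "Field")
```

`fct_reorder()` uses the median as its default summary function, which is why no extra argument is needed to sort the boxes by median.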
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Part 2: Inflation across the world
The Organization for Economic Co-Operation and Development (OECD) defines inflation as:
Inflation is a rise in the general level of prices of goods and services that households acquire for the purpose of consumption in an economy over a period of time.
The main measure of inflation is the annual inflation rate which is the movement of the Consumer Price Index (CPI) from one month/period to the same month/period of the previous year expressed as percentage over time.
Source: OECD CPI FAQ
For this part you will examine annualized inflation rates from various countries in the world over the last 30 years using data from stats.oecd.org. The data you will use is stored in data/country-inflation.csv.
Exercise 4
Describe the data structure. What does each row of the country_inflation dataset represent? What are the columns in the dataset and what do they represent?
Exercise 5
Cleaning the data set. Structure the data so it is tidy. Save the result as a new data frame – you should give it a concise and informative name.
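The layout of country_inflation is not shown in this document, so the following sketch assumes a common untidy shape: one row per country and one column per year. The `toy_wide` table and the output column names are hypothetical; adapt them to what you described in Exercise 4.

```r
library(tidyverse)

# Hypothetical wide table: one row per country, one column per year
toy_wide <- tibble(
  country = c("Country A", "Country B"),
  `2020` = c(1.2, 3.4),
  `2021` = c(2.0, 4.1)
)

# Reshape so that each row is one country-year pair
toy_long <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "annual_inflation",
    names_transform = list(year = as.integer)
  )
```

Converting the year to a number with `names_transform` matters for later plotting, since column names are otherwise stored as character strings.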
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 6
Analyze inflation across the globe. In a single pipeline, filter your reshaped dataset to countries of your choosing and create a plot of annual inflation over time for these countries. Then, in a few sentences, state why you chose these countries and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of these countries' economies.
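A "single pipeline" here means piping the filtered data straight into `ggplot()` without saving an intermediate data frame. The sketch below shows that shape on made-up data; `toy_inflation`, its column names, and the country labels are all hypothetical.

```r
library(tidyverse)

# Hypothetical tidy data, not the homework dataset
toy_inflation <- tibble(
  country = rep(c("Country A", "Country B", "Country C"), each = 3),
  year = rep(2020:2022, times = 3),
  annual_inflation = c(1.2, 2.0, 5.1, 3.4, 4.1, 8.0, 0.5, 0.9, 2.5)
)

# Filter and plot in one pipeline; note the switch from |> to + for ggplot2
inflation_plot <- toy_inflation |>
  filter(country %in% c("Country A", "Country C")) |>
  ggplot(aes(x = year, y = annual_inflation, color = country)) +
  geom_line() +
  labs(
    title = "Toy annual inflation over time",
    x = "Year",
    y = "Annual inflation (%)",
    color = "Country"
  )
```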
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Part 3: Inflation in the United States
In the United States the Bureau of Labor Statistics (BLS) is responsible for calculating and reporting the CPI. The BLS breaks down the CPI into various components to better understand how prices are changing for different types of goods and services. Your goal in this part is to create another time series plot of monthly inflation, this time for only the United States.
The data you will need to create this visualization is split across two files:
- us-inflation.csv: Monthly inflation rate for the US for 30 CPI divisions. Each division is identified by an ID number.
- cpi-divisions.csv: A “lookup table” of CPI division ID numbers and their descriptions.
Let’s load both of these files.
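A minimal sketch of the loading step, under two assumptions: that both files live in the data/ folder like the other datasets in this assignment, and that the lookup table is stored as `cpi_divisions` (a hypothetical name; `us_inflation` is the name used in Exercise 7). The `file.exists()` check only keeps the sketch from erroring when run outside the project directory.

```r
library(tidyverse)

# Assumed paths, following the data/ convention used elsewhere
us_inflation_path <- "data/us-inflation.csv"
cpi_divisions_path <- "data/cpi-divisions.csv"

# Read both files only if they are present in the working directory
if (file.exists(us_inflation_path) && file.exists(cpi_divisions_path)) {
  us_inflation <- read_csv(us_inflation_path)
  cpi_divisions <- read_csv(cpi_divisions_path)
}
```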
Exercise 7
Join the data frames. Add a column to the us_inflation dataset which has the CPI division description that matches the cpi_series_id, by joining the two datasets.
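Joining against a lookup table can be sketched on toy data as follows. Only the `cpi_series_id` name comes from the exercise text; the toy tables, the `id` and `description` column names in the lookup table, and the values are hypothetical, so check the real column names before adapting this.

```r
library(tidyverse)

# Hypothetical toy tables; only the cpi_series_id name comes from the text
toy_us_inflation <- tibble(
  cpi_series_id = c(1, 1, 2),
  month = c("2024-01", "2024-02", "2024-01"),
  inflation = c(3.1, 3.2, 2.4)
)

toy_cpi_divisions <- tibble(
  id = c(1, 2),
  description = c("Food", "Energy")
)

# left_join() keeps every row of the left table and adds matching columns
toy_joined <- toy_us_inflation |>
  left_join(toy_cpi_divisions, by = join_by(cpi_series_id == id))
```

A left join is a natural fit when the goal is to add a column to an existing table, because it never drops rows from that table. `join_by()` requires dplyr 1.1.0 or later.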
Exercise 8
Analyze inflation across divisions. In a single pipeline, filter your joined dataset to include a subset of CPI divisions which you wish to examine, and create a plot of inflation over time for these divisions. Then, in a few sentences, state why you chose these divisions and describe the patterns you observe in the plot, particularly focusing on anything you find surprising or not surprising, based on your knowledge (or lack thereof) of inflation rates in the US over the last 25 years.
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2951 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 3 points
- Exercise 2: 8 points
- Exercise 3: 8 points
- Exercise 4: 2 points
- Exercise 5: 5 points
- Exercise 6: 6 points
- Exercise 7: 4 points
- Exercise 8: 9 points
- Workflow + formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- At least 3 informative commit messages
- Following {tidyverse} code style
- All code being visible in the rendered PDF without automatic wrapping (no more than 80 characters per line)
Acknowledgments
- This assignment is derived in part from Data Science in a Box and licensed under CC BY-SA 4.0.
- This assignment is derived in part from STA 199: Introduction to Data Science