HW 02 - Data wrangling
This homework is due February 12 at 11:59pm ET.
The first step in the process of turning information into knowledge is to summarize and describe the raw information - the data. In this assignment we explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story “The Economic Guide To Picking A College Major”.
These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
Learning objectives
- Transform data to extract meaning from it
- Utilize relational join operations to combine tables
- Implement data visualizations
Getting started
Go to the info2951-sp25 organization on GitHub. Click on the repo with the prefix hw-02. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
Packages
We’ll use the {tidyverse} package for much of the data wrangling and visualization, the {scales} package for better formatting of labels on visualizations, and the {fivethirtyeight} package for the data sets.
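A minimal setup chunk for the packages named above might look like the following. It assumes all three packages are already installed; the comments only restate the roles described in the text.

```r
# Load the packages used in this assignment.
library(tidyverse)       # data wrangling and visualization
library(scales)          # label formatting for plots
library(fivethirtyeight) # provides the college_recent_grads data
```

Loading {fivethirtyeight} is what makes college_recent_grads available without a separate import step.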
Data
The data can be found in the {fivethirtyeight} package, and it’s called college_recent_grads. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?college_recent_grads in the Console or using the Help menu in RStudio to search for college_recent_grads. You can also find this information here.
You can also take a quick peek at your data frame and view its dimensions with the glimpse() function.
glimpse(college_recent_grads)
Rows: 173
Columns: 21
$ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ major_code <int> 2419, 2416, 2415, 2417, 2405, 2418, 6202, …
$ major <chr> "Petroleum Engineering", "Mining And Miner…
$ major_category <chr> "Engineering", "Engineering", "Engineering…
$ total <int> 2339, 756, 856, 1258, 32260, 2573, 3777, 1…
$ sample_size <int> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, …
$ men <int> 2057, 679, 725, 1123, 21239, 2200, 2110, 8…
$ women <int> 282, 77, 131, 135, 11021, 373, 1667, 960, …
$ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132…
$ employed <int> 1976, 640, 648, 758, 25694, 1857, 2912, 15…
$ employed_fulltime <int> 1849, 556, 558, 1069, 23170, 2038, 2924, 1…
$ employed_parttime <int> 270, 170, 133, 150, 5180, 264, 296, 553, 1…
$ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, 2482, 82…
$ unemployed <int> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, …
$ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.0…
$ p25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, …
$ median <dbl> 110000, 75000, 73000, 70000, 65000, 65000,…
$ p75th <dbl> 125000, 90000, 105000, 80000, 75000, 10200…
$ college_jobs <int> 1534, 350, 456, 529, 18314, 1142, 1768, 97…
$ non_college_jobs <int> 364, 257, 176, 102, 4440, 657, 314, 500, 1…
$ low_wage_jobs <int> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3…
Exercises
Exercise 1
Which majors have the lowest unemployment rate? Answer the question using a single data wrangling pipeline. The output should be a tibble with the columns major and unemployment_rate. Only the 10 majors with the lowest unemployment rates should be included. The major(s) with the lowest unemployment rate should be at the top. In a few sentences, describe any trends you observe.
Exercise 2
Which majors have the highest percentage of women? Answer the question using a single data wrangling pipeline. The output should be a tibble with the columns major and sharewomen. Only the five majors with the highest proportions of women should be included. The major with the highest proportion of women should be at the top. In a few sentences, describe any trends you observe.
Render, commit (with a descriptive and concise commit message), and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
How much are college graduates (those who finished undergrad) making?
a. Plot the distribution of all median incomes using a histogram with an appropriate binwidth (you will need to determine what is “appropriate”; remember there is not one single value you should use).
b. Calculate the mean and median for median income. Based on the shape of the histogram, determine which of these summary statistics is more useful for describing the distribution.
c. Describe the distribution of median incomes of college graduates across various majors based on your histogram from part (a) and incorporating the statistic you chose in part (b) to help your narrative.
Hint: Mention shape, center, spread, and any unusual observations.
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 4
How do the distributions of median income compare across major categories?
a. Calculate the minimum, median, and maximum median income per major category as well as the number of majors in each category. Your summary statistics should be in decreasing order of median income.
b. Create box plots of the distribution of median income by major category.
- The variable major_category should be on the y-axis and median on the x-axis.
- The boxes should be sorted meaningfully, with the major category with the largest median income at the top of the chart and the major category with the smallest median income at the bottom of the chart.
- Use color to enhance your plot, and turn off any legends providing redundant information.
- Style the x-axis labels such that the values are shown in thousands, e.g., 20000 should show up as $20K.
c. In 1-2 sentences, describe how median incomes across various major categories compare. Your description should also touch on where your own intended/declared major (yes, your major at Cornell) falls.
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 5
One of the sections of the FiveThirtyEight story is “All STEM fields aren’t the same”. Let’s see if this is true.
a. First, let’s create a new vector called stem_categories that lists the major categories that are considered STEM fields.
stem_categories <- c(
  "Biology & Life Science",
  "Computers & Mathematics",
  "Engineering",
  "Physical Sciences"
)
b. Then, fill in the partial code to create a new variable in our data frame indicating whether a major is STEM or not. Note that you need to figure out the logical operator that goes into ___. Double check that you have successfully created this variable by selecting the variables major_type and major_category.
college_recent_grads <- college_recent_grads |>
  mutate(
    major_type = if_else(major_category ___ stem_categories, "STEM", "Not STEM")
  )
c. In a single pipeline, determine which STEM majors’ median earnings are less than $36,000. Your answer should be a tibble with the columns major and median, arranged in order of descending median.
Once again, render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 6
Not only do we have data on former undergraduate students, we also have similar data available for individuals who earned a graduate degree.
Rows: 173
Columns: 12
$ major_code <dbl> 5601, 6004, 6211, 2201, 2001, 3201, 6…
$ major <chr> "Construction Services", "Commercial …
$ major_category <chr> "Industrial Arts & Consumer Services"…
$ grad_total <dbl> 9173, 53864, 24417, 5411, 9109, 1542,…
$ grad_sample_size <dbl> 200, 882, 437, 72, 171, 22, 3738, 386…
$ grad_employed <dbl> 7098, 40492, 18368, 3590, 7512, 1008,…
$ grad_employed_fulltime_yearround <dbl> 6511, 29553, 14784, 2701, 5622, 860, …
$ grad_unemployed <dbl> 681, 2482, 1465, 316, 466, 0, 8324, 4…
$ grad_unemployment_rate <dbl> 0.08754339, 0.05775585, 0.07386679, 0…
$ grad_p25th <dbl> 110000, 89000, 100000, 85000, 83700, …
$ grad_median <dbl> 75000, 60000, 65000, 47000, 57000, 75…
$ grad_p75th <dbl> 53000, 40000, 45000, 24500, 40600, 55…
Let’s compare median incomes of STEM majors with and without a graduate degree in their major.
a. To do so, we will first join data that contains information on median incomes of those with undergraduate and graduate degrees. Join the college_recent_grads and the college_grad_students data sets. Join them so that only rows whose major_code appears in both data sets are included. Name the new data set major_income.
b. Create a new variable called grad_premium: the percentage difference in median income between individuals with a graduate degree and those with just an undergraduate degree, for STEM majors. For example, if the median income for a STEM major with a graduate degree is $60,000 and the median income for the same major with just an undergraduate degree is $50,000, the grad_premium would be 20%.
The result should be a tibble with the variables major, grad_premium, grad_median, and undergrad_median.
Report two tables: one with the 5 majors with the highest grad_premium and one with the 5 majors with the lowest grad_premium. In a few sentences, describe any trends you observe.
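The 20% figure in the worked example can be checked by hand. The sketch below uses only the two numbers from that example and assumes the undergraduate median is the baseline for the percentage difference; the variable names are hypothetical and are not columns in either data set.

```r
# Hypothetical values from the worked example in the text, not from the data.
grad_median_example <- 60000
undergrad_median_example <- 50000

# Difference in medians, expressed relative to the undergraduate median.
premium_example <- (grad_median_example - undergrad_median_example) /
  undergrad_median_example

premium_example # 0.2, i.e. 20%
```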
Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2951 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 9 points
- Exercise 4: 10 points
- Exercise 5: 6 points
- Exercise 6: 10 points
- Workflow & formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- At least 3 informative commit messages
- Following {tidyverse} code style
- All code being visible in the rendered PDF (lines no longer than 80 characters)
Acknowledgments
- This assignment is derived from Data Science in a Box and licensed under CC BY-SA 4.0.