AE 02: Wrangling college education metrics

Application exercise

Modified

January 28, 2026

Important

Go to the course GitHub organization and locate the repo titled ae-02-YOUR_GITHUB_USERNAME to get started.

This AE is due January 29 at 11:59pm.

To demonstrate data wrangling, we will use data from College Scorecard.¹ The subset we will analyze contains a small number of metrics for all four-year colleges and universities in the United States for the 2023-24 academic year.²

library(tidyverse)

The data is stored in scorecard.csv. The variables are:

unit_id - Unit ID for institution
name - Name of the college
state - State abbreviation
type - Type of college (Public; Private, nonprofit; Private, for-profit)
adm_rate - Undergraduate admissions rate (stored as a proportion from 0 to 1)
sat_avg - Average SAT equivalent score of students admitted
cost - The average annual total cost of attendance, including tuition and fees, books and supplies, and living expenses
net_cost - The average annual net cost of attendance (annual cost of attendance minus the average grant/scholarship aid)
avg_fac_sal - Average faculty salary (9 month)
pct_pell - Percentage of undergraduates who receive a Pell Grant
comp_rate - Rate of first-time, full-time students at four-year institutions who complete their degree within six years
first_gen - Share of first-generation students
debt - Median debt of students after leaving school
locale - Locale of institution

scorecard <- read_csv("data/scorecard.csv")

The data frame has 1,712 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to explore the data.

glimpse(scorecard)

Rows: 1,712
Columns: 14
$ unit_id     <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 10…
$ name        <chr> "Alabama A & M University", "University of Alabama at Birm…
$ state       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ type        <chr> "Public", "Public", "Public", "Public", "Public", "Public"…
$ adm_rate    <dbl> 0.6622, 0.8842, 0.7425, 0.9564, 0.7582, 0.9263, 0.5047, 0.…
$ sat_avg     <dbl> 947, 1251, 1321, 977, 1287, 1090, 1318, 1197, 1016, NA, 11…
$ cost        <dbl> 23751, 27826, 27098, 22028, 32024, 21873, 34402, 38385, 36…
$ net_cost    <dbl> 14559, 17727, 19880, 13889, 22150, 14596, 23897, 23351, 21…
$ avg_fac_sal <dbl> 77490, 109899, 93699, 72135, 99810, 79407, 107163, 59274, …
$ pct_pell    <dbl> 0.6441, 0.3318, 0.2250, 0.7203, 0.1799, 0.4275, 0.1226, 0.…
$ comp_rate   <dbl> 0.2874, 0.6260, 0.6191, 0.3018, 0.7369, 0.3568, 0.7921, 0.…
$ first_gen   <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.3…
$ debt        <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 15…
$ locale      <chr> "City", "City", "City", "City", "City", "City", "City", "C…

names(scorecard)

 [1] "unit_id"     "name"        "state"       "type"        "adm_rate"   
 [6] "sat_avg"     "cost"        "net_cost"    "avg_fac_sal" "pct_pell"   
[11] "comp_rate"   "first_gen"   "debt"        "locale"

head(scorecard)

# A tibble: 6 × 14
  unit_id name  state type  adm_rate sat_avg  cost net_cost avg_fac_sal pct_pell
    <dbl> <chr> <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>    <dbl>
1  100654 Alab… AL    Publ…    0.662     947 23751    14559       77490    0.644
2  100663 Univ… AL    Publ…    0.884    1251 27826    17727      109899    0.332
3  100706 Univ… AL    Publ…    0.742    1321 27098    19880       93699    0.225
4  100724 Alab… AL    Publ…    0.956     977 22028    13889       72135    0.720
5  100751 The … AL    Publ…    0.758    1287 32024    22150       99810    0.180
6  100830 Aubu… AL    Publ…    0.926    1090 21873    14596       79407    0.428
# ℹ 4 more variables: comp_rate <dbl>, first_gen <dbl>, debt <dbl>,
#   locale <chr>

The head() function prints the header “A tibble: 6 × 14” followed by the first six rows of the scorecard data.

Data wrangling with dplyr

{dplyr} is the primary package in the {tidyverse} for data wrangling.

Helpful data wrangling resources

Quick summary of key {dplyr} functions³

Rows:

filter(): chooses rows based on column values.
slice(): chooses rows based on location.
arrange(): changes the order of the rows.
slice_sample(): takes a random subset of the rows.
slice_min()/slice_max(): select rows with the minimum/maximum values of a variable.
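As a minimal sketch of the row functions above, the chunk below uses a small made-up tibble called toy. The object name and all of its values are assumptions for illustration only; just the column names mirror scorecard.

```r
library(dplyr)

# Toy data with made-up values; only the column names mirror scorecard.
toy <- tibble(
  name     = c("College A", "College B", "College C", "College D"),
  type     = c("Public", "Public", "Private, nonprofit", "Private, for-profit"),
  net_cost = c(9000, 15000, 30000, 22000)
)

# filter(): keep rows where a condition on column values is TRUE
public_only <- toy |> filter(type == "Public")

# slice(): keep rows by position (here, the first two rows)
first_two <- toy |> slice(1:2)

# arrange(): reorder rows, here from highest to lowest net cost
by_cost <- toy |> arrange(desc(net_cost))

# slice_max(): keep the n rows with the largest values of a variable
priciest <- toy |> slice_max(net_cost, n = 1)

# slice_sample(): keep a random subset of rows (result changes run to run)
random_two <- toy |> slice_sample(n = 2)
```

None of these calls modify toy itself; each returns a new data frame, which is why the results are assigned to new names.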

Columns:

select(): changes whether or not a column is included.
rename(): changes the name of columns.
mutate(): changes the values of columns and creates new columns.
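The column functions can be sketched the same way. The tibble toy, its values, and the new column names sticker_price and avg_aid are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Toy data with made-up values; only the column names mirror scorecard.
toy <- tibble(
  name     = c("College A", "College B"),
  cost     = c(20000, 40000),
  net_cost = c(12000, 25000)
)

# select(): keep only the listed columns
costs <- toy |> select(name, net_cost)

# rename(): the syntax is new_name = old_name
renamed <- toy |> rename(sticker_price = cost)

# mutate(): add a new column computed from existing columns
# (cost minus net cost is the average grant/scholarship aid)
with_aid <- toy |> mutate(avg_aid = cost - net_cost)
```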

Groups of rows:

summarize(): collapses a group into a single row.
count(): counts unique values of one or more variables.
group_by(): performs calculations separately for each value of a variable.
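A short sketch of the grouping functions, again on a made-up tibble (toy and its values are assumptions, not part of the scorecard data). One value is deliberately missing to show why na.rm = TRUE is often needed when summarizing real data.

```r
library(dplyr)

# Toy data with made-up values, including one missing debt value.
toy <- tibble(
  type = c("Public", "Public", "Private, nonprofit", "Private, nonprofit"),
  debt = c(10000, 14000, 20000, NA)
)

# count(): number of rows for each unique value of type
n_by_type <- toy |> count(type)

# summarize() without groups: collapses the whole data frame into one row
overall <- toy |> summarize(mean_debt = mean(debt, na.rm = TRUE))

# group_by() + summarize(): one row per group;
# na.rm = TRUE drops missing values before averaging
by_type <- toy |>
  group_by(type) |>
  summarize(mean_debt = mean(debt, na.rm = TRUE))
```

Without na.rm = TRUE, any group containing an NA would have NA as its mean.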

Operators

To make comparisons, we will use comparison and logical operators. These should be familiar from other programming languages. The table below is a reference for how to use these operators in R.

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x | y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).
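To see how these operators behave, here is a small sketch on two made-up vectors (rate and state are hypothetical and stand in for columns of a data frame). Each operator works element by element and returns a logical vector, which is what filter() uses to decide which rows to keep.

```r
# Made-up vectors standing in for two columns of a data frame.
rate  <- c(0.25, NA, 0.90)
state <- c("NY", "CA", "NJ")

low        <- rate < 0.5                    # comparison with NA gives NA
missing    <- is.na(rate)                   # TRUE where rate is NA
nearby     <- state %in% c("NY", "NJ")      # TRUE where state is in the set
both       <- nearby & !missing             # AND: both conditions must hold
either     <- state == "CA" | state == "NY" # OR: at least one must hold
not_nearby <- !nearby                       # negation of a logical vector
```

Note that rate < 0.5 is NA, not FALSE, for the missing value, which is why is.na() is needed to test for missingness.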

The pipe

Before working with data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.
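As a small illustration with made-up numbers (not the scorecard data), the pipeline below reads as "take these values, and then take their square roots, and then add them up". The names piped and nested are hypothetical. The native pipe |> requires R 4.1 or later.

```r
# With the pipe: each result becomes the first argument of the next function.
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# Without the pipe: the same steps written as nested calls, read inside out.
nested <- sum(sqrt(c(1, 4, 9)))
```

Both versions compute the same value; the piped version lists the steps in the order they happen.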

Your turn (3 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.

scorecard |>
  select(name, type) |>
  head()

# A tibble: 6 × 2
  name                                type  
  <chr>                               <chr> 
1 Alabama A & M University            Public
2 University of Alabama at Birmingham Public
3 University of Alabama in Huntsville Public
4 Alabama State University            Public
5 The University of Alabama           Public
6 Auburn University at Montgomery     Public

# add code here

Exercises

Demo: Filter the data frame to keep only schools with a greater than 40% share of first-generation students.

# add code here

Your turn: Filter the data frame to keep only public schools with a net cost of attendance below $12,000.

# add code here

Your turn: How many public colleges and universities in each state have a net cost of attendance below $12,000?

# add code here

Your turn: Generate a data frame with the 10 most expensive colleges in 2023-24 based on net cost of attendance.

# add code here

Your turn: Generate a data frame with the average SAT score for each type of college.

# add code here

Your turn: Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name, net cost of attendance, average faculty salary, and the calculated value. How many Cornell and Ithaca College students does it take to pay their average faculty member’s salary?

Note

You should use the net cost of attendance measure, not the sticker price.

# add code here

Footnotes

¹ College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.
² The full database contains thousands of variables from 1996-2024.
³ From {dplyr} vignette
