AE 02: Wrangling college education metrics
Suggested answers
These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.
To demonstrate data wrangling we will use data from College Scorecard.1 The subset we will analyze contains a small number of metrics for all four-year colleges and universities in the United States for the 2023-24 academic year. 2
The data is stored in scorecard.csv. The variables are:
-
unit_id- Unit ID for institution -
name- Name of the college -
state- State abbreviation -
type- Type of college (Public; Private, nonprofit; Private, for-profit) -
adm_rate- Undergraduate admissions rate (from 0-100%) -
sat_avg- Average SAT equivalent score of students admitted -
cost- The average annual total cost of attendance, including tuition and fees, books and supplies, and living expenses -
net_cost- The average annual net cost of attendance (annual cost of attendance minus the average grant/scholarship aid) -
avg_fac_sal- Average faculty salary (9 month) -
pct_pell- Percentage of undergraduates who receive a Pell Grant -
comp_rate- Rate of first-time, full-time students at four-year institutions who complete their degree within six years -
first_gen- Share of first-generation students -
debt- Median debt of students after leaving school -
locale- Locale of institution
scorecard <- read_csv("data/scorecard.csv")The data frame has over 1700 observations (rows), 1712 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.
glimpse(scorecard)Rows: 1,712
Columns: 14
$ unit_id <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 10…
$ name <chr> "Alabama A & M University", "University of Alabama at Birm…
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ type <chr> "Public", "Public", "Public", "Public", "Public", "Public"…
$ adm_rate <dbl> 0.6622, 0.8842, 0.7425, 0.9564, 0.7582, 0.9263, 0.5047, 0.…
$ sat_avg <dbl> 947, 1251, 1321, 977, 1287, 1090, 1318, 1197, 1016, NA, 11…
$ cost <dbl> 23751, 27826, 27098, 22028, 32024, 21873, 34402, 38385, 36…
$ net_cost <dbl> 14559, 17727, 19880, 13889, 22150, 14596, 23897, 23351, 21…
$ avg_fac_sal <dbl> 77490, 109899, 93699, 72135, 99810, 79407, 107163, 59274, …
$ pct_pell <dbl> 0.6441, 0.3318, 0.2250, 0.7203, 0.1799, 0.4275, 0.1226, 0.…
$ comp_rate <dbl> 0.2874, 0.6260, 0.6191, 0.3018, 0.7369, 0.3568, 0.7921, 0.…
$ first_gen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.3…
$ debt <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 15…
$ locale <chr> "City", "City", "City", "City", "City", "City", "City", "C…
names(scorecard) [1] "unit_id" "name" "state" "type" "adm_rate"
[6] "sat_avg" "cost" "net_cost" "avg_fac_sal" "pct_pell"
[11] "comp_rate" "first_gen" "debt" "locale"
head(scorecard)# A tibble: 6 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal pct_pell
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100654 Alab… AL Publ… 0.662 947 23751 14559 77490 0.644
2 100663 Univ… AL Publ… 0.884 1251 27826 17727 109899 0.332
3 100706 Univ… AL Publ… 0.742 1321 27098 19880 93699 0.225
4 100724 Alab… AL Publ… 0.956 977 22028 13889 72135 0.720
5 100751 The … AL Publ… 0.758 1287 32024 22150 99810 0.180
6 100830 Aubu… AL Publ… 0.926 1090 21873 14596 79407 0.428
# ℹ 4 more variables: comp_rate <dbl>, first_gen <dbl>, debt <dbl>,
# locale <chr>
The head() function returns “A tibble: 6 x 14” and then the first six rows of the scorecard data.
Data wrangling with dplyr
{dplyr} is the primary package in the {tidyverse} for data wrangling.
Quick summary of key {dplyr} functions3
Rows:
-
filter():chooses rows based on column values. -
slice(): chooses rows based on location. -
arrange(): changes the order of the rows -
slice_sample(): take a random subset of the rows -
slice_min()/slice_max(): select rows with minimum/maximum values of a variable.
Columns:
-
select(): changes whether or not a column is included. -
rename(): changes the name of columns. -
mutate(): changes the values of columns and creates new columns.
Groups of rows:
-
summarize(): collapses a group into a single row. -
count(): count unique values of one or more variables. -
group_by(): perform calculations separately for each value of a variable
Operators
In order to make comparisons, we will use logical operators. These should be familiar from other programming languages. See below for a reference table for how to use these operators in R.
| operator | definition |
|---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x is logical (TRUE / FALSE).
The pipe
Before working with data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
- Your turn (3 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
# A tibble: 6 × 2
name type
<chr> <chr>
1 Alabama A & M University Public
2 University of Alabama at Birmingham Public
3 University of Alabama in Huntsville Public
4 Alabama State University Public
5 The University of Alabama Public
6 Auburn University at Montgomery Public
Exercises
Demo: Filter the data frame to keep only schools with a greater than 40% share of first-generation students.
filter(.data = scorecard, first_gen > .40)# A tibble: 349 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 101189 Faulkner Uni… AL Priv… 0.661 1016 36910 21165 59166
2 101365 Herzing Univ… AL Priv… 0.945 NA 27195 20957 60696
3 101587 University o… AL Publ… 0.871 1015 22402 13832 68643
4 102270 Stillman Col… AL Priv… 0.742 NA 23485 12272 48735
5 104717 Grand Canyon… AZ Priv… 0.604 1095 31493 21931 67311
6 106467 Arkansas Tec… AR Publ… 0.963 1050 21536 12171 61812
7 107983 Southern Ark… AR Publ… 0.666 1050 24721 15944 66339
8 108092 University o… AR Publ… 0.815 1050 18052 10097 65052
9 110361 California B… CA Priv… 0.736 NA 52408 27384 96012
10 110486 California S… CA Publ… 0.882 NA 18815 5825 92466
# ℹ 339 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
Your turn: Filter the data frame to keep only public schools with a net cost of attendance below $12,000.
filter(.data = scorecard, type == "Public", net_cost < 12000)# A tibble: 144 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 101879 University o… AL Publ… 0.965 NA 22086 11024 81441
2 102632 University o… AK Publ… 0.625 NA 16796 7233 81819
3 106412 University o… AR Publ… 0.575 916 20907 11679 54900
4 108092 University o… AR Publ… 0.815 1050 18052 10097 65052
5 110486 California S… CA Publ… 0.882 NA 18815 5825 92466
6 110495 California S… CA Publ… 0.975 NA 18542 5671 95049
7 110510 California S… CA Publ… 0.982 NA 16910 5918 97290
8 110529 California S… CA Publ… 0.737 NA 22672 11580 104904
9 110547 California S… CA Publ… 0.913 NA 15449 3659 96687
10 110556 California S… CA Publ… 0.964 NA 18650 6480 92475
# ℹ 134 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
filter(.data = scorecard, type == "Public" & net_cost < 12000)# A tibble: 144 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 101879 University o… AL Publ… 0.965 NA 22086 11024 81441
2 102632 University o… AK Publ… 0.625 NA 16796 7233 81819
3 106412 University o… AR Publ… 0.575 916 20907 11679 54900
4 108092 University o… AR Publ… 0.815 1050 18052 10097 65052
5 110486 California S… CA Publ… 0.882 NA 18815 5825 92466
6 110495 California S… CA Publ… 0.975 NA 18542 5671 95049
7 110510 California S… CA Publ… 0.982 NA 16910 5918 97290
8 110529 California S… CA Publ… 0.737 NA 22672 11580 104904
9 110547 California S… CA Publ… 0.913 NA 15449 3659 96687
10 110556 California S… CA Publ… 0.964 NA 18650 6480 92475
# ℹ 134 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
Your turn: How many public colleges and universities in each state have a net cost of attendance below $12,000?
# A tibble: 39 × 2
state n
<chr> <int>
1 AK 1
2 AL 1
3 AR 2
4 CA 15
5 CT 2
6 FL 12
7 FM 1
8 GA 8
9 HI 1
10 IL 3
# ℹ 29 more rows
# A tibble: 39 × 2
state n
<chr> <int>
1 AK 1
2 AL 1
3 AR 2
4 CA 15
5 CT 2
6 FL 12
7 FM 1
8 GA 8
9 HI 1
10 IL 3
# ℹ 29 more rows
Your turn: Generate a data frame with the 10 most expensive colleges in 2023-24 based on net cost of attendance.
We could use a combination of arrange() and slice() to sort the data frame from most to least expensive, then keep the first 10 rows:
# A tibble: 10 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 487861 Felbry Colle… OH Priv… 0.647 NA 66505 63922 52479
2 197151 School of Vi… NY Priv… 0.913 1337 75016 58628 35442
3 136774 Ringling Col… FL Priv… 0.674 NA 75726 55390 85779
4 193654 The New Scho… NY Priv… 0.625 NA 82757 55359 118188
5 122931 Santa Clara … CA Priv… 0.438 1417 77768 50959 127134
6 247649 Landmark Col… VT Priv… 0.436 NA 81895 50879 63180
7 194578 Pratt Instit… NY Priv… 0.532 1339 76533 50647 111807
8 164748 Berklee Coll… MA Priv… 0.499 NA 69874 50404 95085
9 113698 Dominican Un… CA Priv… 0.956 1194 68620 50218 84888
10 109651 Art Center C… CA Priv… 0.750 NA 71652 50209 78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
# A tibble: 10 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 487861 Felbry Colle… OH Priv… 0.647 NA 66505 63922 52479
2 197151 School of Vi… NY Priv… 0.913 1337 75016 58628 35442
3 136774 Ringling Col… FL Priv… 0.674 NA 75726 55390 85779
4 193654 The New Scho… NY Priv… 0.625 NA 82757 55359 118188
5 122931 Santa Clara … CA Priv… 0.438 1417 77768 50959 127134
6 247649 Landmark Col… VT Priv… 0.436 NA 81895 50879 63180
7 194578 Pratt Instit… NY Priv… 0.532 1339 76533 50647 111807
8 164748 Berklee Coll… MA Priv… 0.499 NA 69874 50404 95085
9 113698 Dominican Un… CA Priv… 0.956 1194 68620 50218 84888
10 109651 Art Center C… CA Priv… 0.750 NA 71652 50209 78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
We can also use the slice_max() function in {dplyr} to accomplish the same thing with a single function.
slice_max(.data = scorecard, order_by = net_cost, n = 10)# A tibble: 10 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 487861 Felbry Colle… OH Priv… 0.647 NA 66505 63922 52479
2 197151 School of Vi… NY Priv… 0.913 1337 75016 58628 35442
3 136774 Ringling Col… FL Priv… 0.674 NA 75726 55390 85779
4 193654 The New Scho… NY Priv… 0.625 NA 82757 55359 118188
5 122931 Santa Clara … CA Priv… 0.438 1417 77768 50959 127134
6 247649 Landmark Col… VT Priv… 0.436 NA 81895 50879 63180
7 194578 Pratt Instit… NY Priv… 0.532 1339 76533 50647 111807
8 164748 Berklee Coll… MA Priv… 0.499 NA 69874 50404 95085
9 113698 Dominican Un… CA Priv… 0.956 1194 68620 50218 84888
10 109651 Art Center C… CA Priv… 0.750 NA 71652 50209 78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
# debt <dbl>, locale <chr>
Your turn: Generate a data frame with the average SAT score for each type of college.
Note that since the sat_avg column contains NAs (missing values), we need to explicitly exclude them from our mean calculation. Otherwise the resulting data frame contains NAs.
# A tibble: 3 × 2
type mean_sat
<chr> <dbl>
1 Private, for-profit NA
2 Private, nonprofit NA
3 Public NA
# A tibble: 3 × 2
type mean_sat
<chr> <dbl>
1 Private, for-profit 1121
2 Private, nonprofit 1206.
3 Public 1137.
# A tibble: 3 × 2
type mean_sat
<chr> <dbl>
1 Private, for-profit 1121
2 Private, nonprofit 1206.
3 Public 1137.
Your turn: Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name, net cost of attendance, average faculty salary, and the calculated value. How many Cornell and Ithaca College students does it take to pay their average faculty member’s salary?
You should use the net cost of attendance measure, not the sticker price.
scorecard |>
# mutate() to create a column with the ratio
mutate(ratio = avg_fac_sal / net_cost) |>
# select() to keep only the name and ratio columns
select(name, net_cost, avg_fac_sal, ratio) |>
# filter() to keep only Cornell and Ithaca College
filter(name == "Cornell University" | name == "Ithaca College")# A tibble: 2 × 4
name net_cost avg_fac_sal ratio
<chr> <dbl> <dbl> <dbl>
1 Cornell University 32337 155277 4.80
2 Ithaca College 32965 86337 2.62
sessioninfo::session_info()─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.5.2 (2025-10-31)
os macOS Tahoe 26.2
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2026-01-30
pandoc 3.4 @ /usr/local/bin/ (via rmarkdown)
quarto 1.9.17 @ /usr/local/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P archive 1.1.12.1 2025-12-24 [?] RSPM
P bit 4.6.0 2025-03-06 [?] RSPM (R 4.5.0)
P bit64 4.6.0-1 2025-01-16 [?] RSPM (R 4.5.0)
P cli 3.6.5 2025-04-23 [?] RSPM (R 4.5.0)
P crayon 1.5.3 2024-06-20 [?] RSPM (R 4.5.0)
P digest 0.6.39 2025-11-19 [?] RSPM (R 4.5.0)
P dplyr * 1.1.4 2023-11-17 [?] RSPM (R 4.5.0)
P evaluate 1.0.5 2025-08-27 [?] RSPM (R 4.5.0)
P farver 2.1.2 2024-05-13 [?] RSPM (R 4.5.0)
P fastmap 1.2.0 2024-05-15 [?] RSPM (R 4.5.0)
P forcats * 1.0.1 2025-09-25 [?] RSPM (R 4.5.0)
P generics 0.1.4 2025-05-09 [?] RSPM (R 4.5.0)
P ggplot2 * 4.0.1 2025-11-14 [?] RSPM (R 4.5.0)
P glue 1.8.0 2024-09-30 [?] RSPM (R 4.5.0)
P gtable 0.3.6 2024-10-25 [?] RSPM (R 4.5.0)
P here 1.0.2 2025-09-15 [?] CRAN (R 4.5.0)
P hms 1.1.4 2025-10-17 [?] RSPM (R 4.5.0)
P htmltools 0.5.9 2025-12-04 [?] RSPM (R 4.5.0)
P htmlwidgets 1.6.4 2023-12-06 [?] RSPM (R 4.5.0)
P jsonlite 2.0.0 2025-03-27 [?] RSPM (R 4.5.0)
P knitr 1.51 2025-12-20 [?] RSPM (R 4.5.0)
P lifecycle 1.0.4 2023-11-07 [?] RSPM (R 4.5.0)
P lubridate * 1.9.4 2024-12-08 [?] RSPM (R 4.5.0)
P magrittr 2.0.4 2025-09-12 [?] RSPM (R 4.5.0)
P otel 0.2.0 2025-08-29 [?] RSPM (R 4.5.0)
P pillar 1.11.1 2025-09-17 [?] RSPM (R 4.5.0)
P pkgconfig 2.0.3 2019-09-22 [?] RSPM (R 4.5.0)
P purrr * 1.2.0 2025-11-04 [?] CRAN (R 4.5.0)
P R6 2.6.1 2025-02-15 [?] RSPM (R 4.5.0)
P RColorBrewer 1.1-3 2022-04-03 [?] RSPM (R 4.5.0)
P readr * 2.1.6 2025-11-14 [?] RSPM (R 4.5.0)
renv 1.1.5 2025-07-24 [1] RSPM (R 4.5.0)
P rlang 1.1.6 2025-04-11 [?] RSPM (R 4.5.0)
P rmarkdown 2.30 2025-09-28 [?] RSPM (R 4.5.0)
P rprojroot 2.1.1 2025-08-26 [?] RSPM (R 4.5.0)
P S7 0.2.1 2025-11-14 [?] RSPM (R 4.5.0)
P scales 1.4.0 2025-04-24 [?] RSPM (R 4.5.0)
P sessioninfo 1.2.3 2025-02-05 [?] RSPM (R 4.5.0)
P stringi 1.8.7 2025-03-27 [?] RSPM (R 4.5.0)
P stringr * 1.6.0 2025-11-04 [?] RSPM (R 4.5.0)
P tibble * 3.3.0 2025-06-08 [?] RSPM (R 4.5.0)
P tidyr * 1.3.2 2025-12-19 [?] RSPM (R 4.5.0)
P tidyselect 1.2.1 2024-03-11 [?] RSPM (R 4.5.0)
P tidyverse * 2.0.0 2023-02-22 [?] RSPM (R 4.5.0)
P timechange 0.3.0 2024-01-18 [?] RSPM (R 4.5.0)
P tzdb 0.5.0 2025-03-15 [?] RSPM (R 4.5.0)
P utf8 1.2.6 2025-06-08 [?] RSPM (R 4.5.0)
P vctrs 0.6.5 2023-12-01 [?] RSPM (R 4.5.0)
P vroom 1.6.7 2025-11-28 [?] RSPM (R 4.5.0)
P withr 3.0.2 2024-10-28 [?] RSPM (R 4.5.0)
P xfun 0.55 2025-12-16 [?] CRAN (R 4.5.2)
P yaml 2.3.12 2025-12-10 [?] RSPM (R 4.5.0)
[1] /Users/bcs88/Projects/info-2950/course-site/renv/library/macos/R-4.5/aarch64-apple-darwin20
[2] /Users/bcs88/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.5/aarch64-apple-darwin20/4cd76b74
* ── Packages attached to the search path.
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
Footnotes
College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.↩︎
The full database contains thousands of variables from 1996-2024.↩︎
From {dplyr} vignette↩︎