AE 02: Wrangling college education metrics

Data wrangling with dplyr

{dplyr} is the primary package in the {tidyverse} for data wrangling.

Helpful data wrangling resources

Quick summary of key {dplyr} functions³

Rows:

filter():chooses rows based on column values.
slice(): chooses rows based on location.
arrange(): changes the order of the rows
slice_sample(): take a random subset of the rows
slice_min()/slice_max(): select rows with minimum/maximum values of a variable.

Columns:

select(): changes whether or not a column is included.
rename(): changes the name of columns.
mutate(): changes the values of columns and creates new columns.

Groups of rows:

summarize(): collapses a group into a single row.
count(): count unique values of one or more variables.
group_by(): perform calculations separately for each value of a variable

Operators

In order to make comparisons, we will use logical operators. These should be familiar from other programming languages. See below for a reference table for how to use these operators in R.

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x \| y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

The pipe

Before working with data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.

Your turn (3 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.

scorecard |>
  select(name, type) |>
  head()

# A tibble: 6 × 2
  name                                type  
  <chr>                               <chr> 
1 Alabama A & M University            Public
2 University of Alabama at Birmingham Public
3 University of Alabama in Huntsville Public
4 Alabama State University            Public
5 The University of Alabama           Public
6 Auburn University at Montgomery     Public

Exercises

Demo: Filter the data frame to keep only schools with a greater than 40% share of first-generation students.

filter(.data = scorecard, first_gen > .40)

# A tibble: 349 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  101189 Faulkner Uni… AL    Priv…    0.661    1016 36910    21165       59166
 2  101365 Herzing Univ… AL    Priv…    0.945      NA 27195    20957       60696
 3  101587 University o… AL    Publ…    0.871    1015 22402    13832       68643
 4  102270 Stillman Col… AL    Priv…    0.742      NA 23485    12272       48735
 5  104717 Grand Canyon… AZ    Priv…    0.604    1095 31493    21931       67311
 6  106467 Arkansas Tec… AR    Publ…    0.963    1050 21536    12171       61812
 7  107983 Southern Ark… AR    Publ…    0.666    1050 24721    15944       66339
 8  108092 University o… AR    Publ…    0.815    1050 18052    10097       65052
 9  110361 California B… CA    Priv…    0.736      NA 52408    27384       96012
10  110486 California S… CA    Publ…    0.882      NA 18815     5825       92466
# ℹ 339 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

Your turn: Filter the data frame to keep only public schools with a net cost of attendance below $12,000.

filter(.data = scorecard, type == "Public", net_cost < 12000)

# A tibble: 144 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  101879 University o… AL    Publ…    0.965      NA 22086    11024       81441
 2  102632 University o… AK    Publ…    0.625      NA 16796     7233       81819
 3  106412 University o… AR    Publ…    0.575     916 20907    11679       54900
 4  108092 University o… AR    Publ…    0.815    1050 18052    10097       65052
 5  110486 California S… CA    Publ…    0.882      NA 18815     5825       92466
 6  110495 California S… CA    Publ…    0.975      NA 18542     5671       95049
 7  110510 California S… CA    Publ…    0.982      NA 16910     5918       97290
 8  110529 California S… CA    Publ…    0.737      NA 22672    11580      104904
 9  110547 California S… CA    Publ…    0.913      NA 15449     3659       96687
10  110556 California S… CA    Publ…    0.964      NA 18650     6480       92475
# ℹ 134 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

filter(.data = scorecard, type == "Public" & net_cost < 12000)

# A tibble: 144 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  101879 University o… AL    Publ…    0.965      NA 22086    11024       81441
 2  102632 University o… AK    Publ…    0.625      NA 16796     7233       81819
 3  106412 University o… AR    Publ…    0.575     916 20907    11679       54900
 4  108092 University o… AR    Publ…    0.815    1050 18052    10097       65052
 5  110486 California S… CA    Publ…    0.882      NA 18815     5825       92466
 6  110495 California S… CA    Publ…    0.975      NA 18542     5671       95049
 7  110510 California S… CA    Publ…    0.982      NA 16910     5918       97290
 8  110529 California S… CA    Publ…    0.737      NA 22672    11580      104904
 9  110547 California S… CA    Publ…    0.913      NA 15449     3659       96687
10  110556 California S… CA    Publ…    0.964      NA 18650     6480       92475
# ℹ 134 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

Your turn: How many public colleges and universities in each state have a net cost of attendance below $12,000?

# using group_by() and summarize()
scorecard |>
  filter(type == "Public", net_cost < 12000) |>
  group_by(state) |>
  summarize(n = n())

# A tibble: 39 × 2
   state     n
   <chr> <int>
 1 AK        1
 2 AL        1
 3 AR        2
 4 CA       15
 5 CT        2
 6 FL       12
 7 FM        1
 8 GA        8
 9 HI        1
10 IL        3
# ℹ 29 more rows

# using count()
scorecard |>
  filter(type == "Public", net_cost < 12000) |>
  count(state)

# A tibble: 39 × 2
   state     n
   <chr> <int>
 1 AK        1
 2 AL        1
 3 AR        2
 4 CA       15
 5 CT        2
 6 FL       12
 7 FM        1
 8 GA        8
 9 HI        1
10 IL        3
# ℹ 29 more rows

Your turn: Generate a data frame with the 10 most expensive colleges in 2023-24 based on net cost of attendance.

We could use a combination of arrange() and slice() to sort the data frame from most to least expensive, then keep the first 10 rows:

# using desc()
arrange(.data = scorecard, desc(net_cost)) |>
  slice(1:10)

# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  487861 Felbry Colle… OH    Priv…    0.647      NA 66505    63922       52479
 2  197151 School of Vi… NY    Priv…    0.913    1337 75016    58628       35442
 3  136774 Ringling Col… FL    Priv…    0.674      NA 75726    55390       85779
 4  193654 The New Scho… NY    Priv…    0.625      NA 82757    55359      118188
 5  122931 Santa Clara … CA    Priv…    0.438    1417 77768    50959      127134
 6  247649 Landmark Col… VT    Priv…    0.436      NA 81895    50879       63180
 7  194578 Pratt Instit… NY    Priv…    0.532    1339 76533    50647      111807
 8  164748 Berklee Coll… MA    Priv…    0.499      NA 69874    50404       95085
 9  113698 Dominican Un… CA    Priv…    0.956    1194 68620    50218       84888
10  109651 Art Center C… CA    Priv…    0.750      NA 71652    50209       78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

# using -
arrange(.data = scorecard, -net_cost) |>
  slice(1:10)

# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  487861 Felbry Colle… OH    Priv…    0.647      NA 66505    63922       52479
 2  197151 School of Vi… NY    Priv…    0.913    1337 75016    58628       35442
 3  136774 Ringling Col… FL    Priv…    0.674      NA 75726    55390       85779
 4  193654 The New Scho… NY    Priv…    0.625      NA 82757    55359      118188
 5  122931 Santa Clara … CA    Priv…    0.438    1417 77768    50959      127134
 6  247649 Landmark Col… VT    Priv…    0.436      NA 81895    50879       63180
 7  194578 Pratt Instit… NY    Priv…    0.532    1339 76533    50647      111807
 8  164748 Berklee Coll… MA    Priv…    0.499      NA 69874    50404       95085
 9  113698 Dominican Un… CA    Priv…    0.956    1194 68620    50218       84888
10  109651 Art Center C… CA    Priv…    0.750      NA 71652    50209       78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

We can also use the slice_max() function in {dplyr} to accomplish the same thing with a single function.

slice_max(.data = scorecard, order_by = net_cost, n = 10)

# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  487861 Felbry Colle… OH    Priv…    0.647      NA 66505    63922       52479
 2  197151 School of Vi… NY    Priv…    0.913    1337 75016    58628       35442
 3  136774 Ringling Col… FL    Priv…    0.674      NA 75726    55390       85779
 4  193654 The New Scho… NY    Priv…    0.625      NA 82757    55359      118188
 5  122931 Santa Clara … CA    Priv…    0.438    1417 77768    50959      127134
 6  247649 Landmark Col… VT    Priv…    0.436      NA 81895    50879       63180
 7  194578 Pratt Instit… NY    Priv…    0.532    1339 76533    50647      111807
 8  164748 Berklee Coll… MA    Priv…    0.499      NA 69874    50404       95085
 9  113698 Dominican Un… CA    Priv…    0.956    1194 68620    50218       84888
10  109651 Art Center C… CA    Priv…    0.750      NA 71652    50209       78534
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

Your turn: Generate a data frame with the average SAT score for each type of college.

Note that since the sat_avg column contains NAs (missing values), we need to explicitly exclude them from our mean calculation. Otherwise the resulting data frame contains NAs.

# incorrect - ignores NAs
scorecard |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg))

# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit       NA
2 Private, nonprofit        NA
3 Public                    NA

# exclude NAs using mean()
scorecard |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg, na.rm = TRUE))

# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit    1121 
2 Private, nonprofit     1206.
3 Public                 1137.

# exclude NAs using drop_na() to remove the rows prior to summarizing
scorecard |>
  drop_na(sat_avg) |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg))

# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit    1121 
2 Private, nonprofit     1206.
3 Public                 1137.

Your turn: Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name, net cost of attendance, average faculty salary, and the calculated value. How many Cornell and Ithaca College students does it take to pay their average faculty member’s salary?

Note

You should use the net cost of attendance measure, not the sticker price.

scorecard |>
  # mutate() to create a column with the ratio
  mutate(ratio = avg_fac_sal / net_cost) |>
  # select() to keep only the name and ratio columns
  select(name, net_cost, avg_fac_sal, ratio) |>
  # filter() to keep only Cornell and Ithaca College
  filter(name == "Cornell University" | name == "Ithaca College")

# A tibble: 2 × 4
  name               net_cost avg_fac_sal ratio
  <chr>                 <dbl>       <dbl> <dbl>
1 Cornell University    32337      155277  4.80
2 Ithaca College        32965       86337  2.62

Session information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.2 (2025-10-31)
 os       macOS Tahoe 26.3.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2026-03-13
 pandoc   3.6.3 @ /Applications/Positron.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
 quarto   1.9.33 @ /Applications/quarto/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version  date (UTC) lib source
 P archive        1.1.12.1 2025-12-24 [?] RSPM
 P bit            4.6.0    2025-03-06 [?] RSPM (R 4.5.0)
 P bit64          4.6.0-1  2025-01-16 [?] RSPM (R 4.5.0)
 P cli            3.6.5    2025-04-23 [?] RSPM (R 4.5.0)
 P crayon         1.5.3    2024-06-20 [?] RSPM (R 4.5.0)
 P digest         0.6.39   2025-11-19 [?] RSPM (R 4.5.0)
 P dplyr        * 1.1.4    2023-11-17 [?] RSPM (R 4.5.0)
 P evaluate       1.0.5    2025-08-27 [?] RSPM (R 4.5.0)
 P farver         2.1.2    2024-05-13 [?] RSPM (R 4.5.0)
 P fastmap        1.2.0    2024-05-15 [?] RSPM (R 4.5.0)
 P forcats      * 1.0.1    2025-09-25 [?] RSPM (R 4.5.0)
 P generics       0.1.4    2025-05-09 [?] RSPM (R 4.5.0)
 P ggplot2      * 4.0.1    2025-11-14 [?] RSPM (R 4.5.0)
 P glue           1.8.0    2024-09-30 [?] RSPM (R 4.5.0)
 P gtable         0.3.6    2024-10-25 [?] RSPM (R 4.5.0)
 P here           1.0.2    2025-09-15 [?] CRAN (R 4.5.0)
 P hms            1.1.4    2025-10-17 [?] RSPM (R 4.5.0)
 P htmltools      0.5.9    2025-12-04 [?] RSPM (R 4.5.0)
 P htmlwidgets    1.6.4    2023-12-06 [?] RSPM (R 4.5.0)
 P jsonlite       2.0.0    2025-03-27 [?] RSPM (R 4.5.0)
 P knitr          1.51     2025-12-20 [?] RSPM (R 4.5.0)
 P lifecycle      1.0.4    2023-11-07 [?] RSPM (R 4.5.0)
 P lubridate    * 1.9.4    2024-12-08 [?] RSPM (R 4.5.0)
 P magrittr       2.0.4    2025-09-12 [?] RSPM (R 4.5.0)
 P otel           0.2.0    2025-08-29 [?] RSPM (R 4.5.0)
 P pillar         1.11.1   2025-09-17 [?] RSPM (R 4.5.0)
 P pkgconfig      2.0.3    2019-09-22 [?] RSPM (R 4.5.0)
 P purrr        * 1.2.0    2025-11-04 [?] CRAN (R 4.5.0)
 P R6             2.6.1    2025-02-15 [?] RSPM (R 4.5.0)
 P RColorBrewer   1.1-3    2022-04-03 [?] RSPM (R 4.5.0)
 P readr        * 2.1.6    2025-11-14 [?] RSPM (R 4.5.0)
   renv           1.1.5    2025-07-24 [1] RSPM (R 4.5.0)
 P rlang          1.1.6    2025-04-11 [?] RSPM (R 4.5.0)
 P rmarkdown      2.30     2025-09-28 [?] RSPM (R 4.5.0)
 P rprojroot      2.1.1    2025-08-26 [?] RSPM (R 4.5.0)
 P S7             0.2.1    2025-11-14 [?] RSPM (R 4.5.0)
 P scales         1.4.0    2025-04-24 [?] RSPM (R 4.5.0)
 P sessioninfo    1.2.3    2025-02-05 [?] RSPM (R 4.5.0)
 P stringi        1.8.7    2025-03-27 [?] RSPM (R 4.5.0)
 P stringr      * 1.6.0    2025-11-04 [?] RSPM (R 4.5.0)
 P tibble       * 3.3.0    2025-06-08 [?] RSPM (R 4.5.0)
 P tidyr        * 1.3.2    2025-12-19 [?] RSPM (R 4.5.0)
 P tidyselect     1.2.1    2024-03-11 [?] RSPM (R 4.5.0)
 P tidyverse    * 2.0.0    2023-02-22 [?] RSPM (R 4.5.0)
 P timechange     0.3.0    2024-01-18 [?] RSPM (R 4.5.0)
 P tzdb           0.5.0    2025-03-15 [?] RSPM (R 4.5.0)
 P utf8           1.2.6    2025-06-08 [?] RSPM (R 4.5.0)
 P vctrs          0.6.5    2023-12-01 [?] RSPM (R 4.5.0)
 P vroom          1.6.7    2025-11-28 [?] RSPM (R 4.5.0)
 P withr          3.0.2    2024-10-28 [?] RSPM (R 4.5.0)
 P xfun           0.55     2025-12-16 [?] CRAN (R 4.5.2)
 P yaml           2.3.12   2025-12-10 [?] RSPM (R 4.5.0)

 [1] /Users/bcs88/Projects/info-2950/course-site/renv/library/macos/R-4.5/aarch64-apple-darwin20
 [2] /Users/bcs88/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.5/aarch64-apple-darwin20/4cd76b74

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Footnotes

College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.↩︎
The full database contains thousands of variables from 1996-2024.↩︎
From {dplyr} vignette ↩︎

Data wrangling with dplyr

Quick summary of key {dplyr} functions3

Operators

The pipe

Exercises

Footnotes

Quick summary of key {dplyr} functions³