Grammar of data wrangling

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

February 4, 2025

Announcements


  • Suggested solutions posted for ae-01 and ae-02
  • Homework 01 due tomorrow by 11:59pm
  • Ask course questions on GitHub Discussion

Clarification for hw-01

Home age variable

tompkins <- tompkins |>
  mutate(home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960"))
  1. Create a new column of data
  2. Save the modified data frame as tompkins

Ordering of health categories

brfss <- brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(
      general_health, "Excellent", "Very good",
      "Good", "Fair", "Poor"
    )
  )
  1. Modify a column of data
  2. Save the modified data frame as brfss
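
As a small illustration of why the releveling step matters, the sketch below uses a hypothetical five-value vector (invented here, not the brfss data) and assumes the {forcats} package is installed. as.factor() orders levels alphabetically, and fct_relevel() moves the named levels to the front in the order given.

```r
library(forcats)

# Hypothetical toy vector, invented for illustration (not the brfss data)
health <- as.factor(c("Good", "Poor", "Excellent", "Very good", "Fair"))

# By default, factor levels are sorted alphabetically
default_levels <- levels(health)

# Put the levels in their natural order, from best to worst
health <- fct_relevel(
  health, "Excellent", "Very good",
  "Good", "Fair", "Poor"
)
ordered_levels <- levels(health)

print(default_levels)
print(ordered_levels)
```

Plots and tables built from the releveled factor should then list the categories in this order rather than alphabetically.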

Computational problem-solving

Charles Xavier from 'X-Men' sitting with his eyes closed wearing Cerebro.

Meet the Palmer penguins


penguins

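library(dplyr) # provides glimpse() and the data transformation functions used below
library(palmerpenguins)
# The output below has 333 rows, so rows with missing values appear to have been dropped beforehand
glimpse(penguins)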
Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <fct> male, female, female, female, male, female, male, fe…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

What is the average body mass of an Adelie penguin?

  1. First we need to identify the input, or the data we’re going to analyze.
  2. Next we need to select only the observations which are Adelie penguins.
  3. Finally we need to calculate the average value, or mean, of body_mass_g.
penguins |>
  filter(species == "Adelie") |>
  summarize(avg_mass = mean(body_mass_g))
# A tibble: 1 × 1
  avg_mass
     <dbl>
1    3706.

What is the average body mass of a penguin for each species?

penguins |>
  group_by(species) |>
  summarize(avg_mass = mean(body_mass_g))
# A tibble: 3 × 2
  species   avg_mass
  <fct>        <dbl>
1 Adelie       3706.
2 Chinstrap    3733.
3 Gentoo       5092.

What are the average bill length and body mass of Adelie penguins by sex?

penguins |>
  filter(species == "Adelie") |>
  group_by(sex) |>
  summarize(
    bill = mean(bill_length_mm),
    avg_mass = mean(body_mass_g)
  )
# A tibble: 2 × 3
  sex     bill avg_mass
  <fct>  <dbl>    <dbl>
1 female  37.3    3369.
2 male    40.4    4043.
penguins |>
  group_by(sex) |>
  filter(species == "Adelie") |>
  summarize(
    bill = mean(bill_length_mm),
    avg_mass = mean(body_mass_g)
  )
# A tibble: 2 × 3
  sex     bill avg_mass
  <fct>  <dbl>    <dbl>
1 female  37.3    3369.
2 male    40.4    4043.

The pipe |> operator

Avoids more complex syntax such as:

Nested functions

summarize(
  group_by(
    filter(
      .data = penguins,
      species == "Adelie"
    ),
    sex
  ),
  bill = mean(bill_length_mm),
  avg_mass = mean(body_mass_g)
)

Intermediate objects

penguins1 <- filter(
  .data = penguins,
  species == "Adelie"
)
penguins2 <- group_by(.data = penguins1, sex)
summarize(
  .data = penguins2,
  bill = mean(bill_length_mm),
  avg_mass = mean(body_mass_g)
)
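
To check that the three styles are interchangeable, here is a minimal sketch on a hypothetical four-row data frame (the values are invented for illustration and are not the real penguins data). It assumes {dplyr} is installed and R 4.1 or later for the native pipe.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not the real penguins data)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  sex = c("female", "male", "male", "female"),
  body_mass_g = c(3400, 4000, 4200, 4700)
)

# Piped version: read top to bottom as "and then"
piped <- toy |>
  filter(species == "Adelie") |>
  group_by(sex) |>
  summarize(avg_mass = mean(body_mass_g))

# Nested version of the same steps: read from the inside out
nested <- summarize(
  group_by(filter(toy, species == "Adelie"), sex),
  avg_mass = mean(body_mass_g)
)

# Both approaches should produce the same summary table
print(all.equal(piped, nested))
```

The result is the same either way; the pipe only changes how easy the steps are to read.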

Cartoon of a fuzzy monster with a cowboy hat and lasso, riding another fuzzy monster labeled 'dplyr', lassoing a group of angry / unruly looking creatures labeled 'data.'

Verbiage for data transformation

  1. The first argument is a data frame
  2. Subsequent arguments describe what to do with the data frame
  3. The result is a new data frame
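
The three points above can be sketched with a single call to filter(). The data frame toy and its columns are hypothetical, made up for this example, and {dplyr} is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  id = 1:4,
  mass = c(3400, 4000, 4200, 4700)
)

# 1. The data frame is the first argument
# 2. The second argument describes what to do: keep rows with mass > 3900
heavy <- filter(toy, mass > 3900)

# 3. The result is a new data frame; the original is left unchanged
print(is.data.frame(heavy))
print(nrow(heavy))
print(nrow(toy))
```

Because the original object is not modified, the result has to be assigned (here to heavy) if it is needed later.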

Key functions in {dplyr}

function()    Action performed
filter()      Subsets observations based on their values
arrange()     Changes the order of observations based on their values
select()      Selects a subset of columns from the data frame
rename()      Changes the name of columns in the data frame
mutate()      Creates new columns (or variables)
group_by()    Changes the unit of analysis from the complete dataset to individual groups
summarize()   Collapses the data frame to a smaller number of rows which summarize the larger data
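
The earlier slides already show filter(), group_by(), and summarize(), so here is a brief sketch of the remaining functions chained together. The data frame and its column names are hypothetical, invented for this example, and {dplyr} is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  name = c("a", "b", "c"),
  mass_g = c(4200, 3400, 4700),
  year = c(2007, 2008, 2009)
)

result <- toy |>
  # keep only two of the three columns
  select(name, mass_g) |>
  # rename uses new_name = old_name
  rename(body_mass_g = mass_g) |>
  # add a new column computed from an existing one
  mutate(body_mass_kg = body_mass_g / 1000) |>
  # sort rows from heaviest to lightest
  arrange(desc(body_mass_g))

print(result)
```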

Application exercise

ae-03

Instructions

  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

Wrap up

Recap

  • The pipe operator, |>, can be read as “and then”.

  • The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.

    sum(1, 2)
    [1] 3
    1 |>
      sum(2)
    [1] 3
  • Always use a line break after the pipe, and indent the next line of code.

  • Use {dplyr} functions to transform your data
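
As a slightly longer illustration of the "and then" reading, the sketch below chains three base R functions, so no packages are needed. It assumes R 4.1 or later, where the native pipe is available; the input numbers are arbitrary.

```r
# Nested call: read from the inside out
nested <- round(sqrt(sum(c(9, 16))), 1)

# Piped call: take the numbers, and then sum them,
# and then take the square root, and then round to 1 decimal place
piped <- c(9, 16) |>
  sum() |>
  sqrt() |>
  round(1)

# The two forms should give the same value
print(identical(nested, piped))
```

In the last step, the piped value becomes the first argument of round() and 1 is passed as the second argument (the number of digits).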

Six more weeks of winter