Working with relational data

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2026

February 3, 2026

Announcements

Announcements

fill vs. color

  • Use fill when a geom is filled
  • Use color when a geom is outlined

Learning objectives

  • Introduce relational data
  • Define keys
  • Demonstrate how tables can be linked to one another
  • Demonstrate methods in {dplyr} for linking and merging related tables
  • Practice joining tables

Relational joins

Introduction to relational data

  • Multiple tables of data that when combined together accomplish goals
  • Relations define the important element, not just the individual tables
  • Relations are defined between a pair of tables
  • Relational verbs
    • Mutating joins
    • Filtering joins

Keys

Keys uniquely link rows between tables

  • Primary key: a variable (or set of variables) that uniquely identifies each observation
  • Foreign key: a variable (or set of variables) that corresponds to a primary key in another table

Often (but not always) the variable names are the same in both tables

Comic book characters

Name Alignment Gender Publisher
Deadpool Chaotic Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics

Publishers

Publisher Year founded
DC 1934
Marvel 1939
Image 1992
  • Primary key for superheroes
  • Primary key for publishers
  • Foreign key in superheroes and publishers

Mutating joins

inner_join()

inner_join(x = superheroes, y = publishers, by = join_by(Publisher))

inner_join()

inner_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 5
  Name     Alignment Gender Publisher `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>               <dbl>
1 Deadpool Chaotic   Male   Marvel               1939
2 Batman   Good      Male   DC                   1934

left_join()

left_join(x = superheroes, y = publishers, by = join_by(Publisher))

left_join()

left_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name     Alignment Gender Publisher        `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>                      <dbl>
1 Deadpool Chaotic   Male   "Marvel"                    1939
2 Batman   Good      Male   "DC"                        1934
3 Sabrina  Good      Female "Archie\nComics"              NA

right_join()

right_join(x = superheroes, y = publishers, by = join_by(Publisher))

right_join()

right_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name     Alignment Gender Publisher `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>               <dbl>
1 Deadpool Chaotic   Male   Marvel               1939
2 Batman   Good      Male   DC                   1934
3 <NA>     <NA>      <NA>   Image                1992

right_join() reversed

left_join(x = publishers, y = superheroes, by = join_by(Publisher))

full_join()

full_join(x = superheroes, y = publishers, by = join_by(Publisher))

full_join()

full_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 4 × 5
  Name     Alignment Gender Publisher        `Year\nfounded`
  <chr>    <chr>     <chr>  <chr>                      <dbl>
1 Deadpool Chaotic   Male   "Marvel"                    1939
2 Batman   Good      Male   "DC"                        1934
3 Sabrina  Good      Female "Archie\nComics"              NA
4 <NA>     <NA>      <NA>   "Image"                     1992

Filtering joins

semi_join()

semi_join(x = superheroes, y = publishers, by = join_by(Publisher))

semi_join()

semi_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 4
  Name     Alignment Gender Publisher
  <chr>    <chr>     <chr>  <chr>    
1 Deadpool Chaotic   Male   Marvel   
2 Batman   Good      Male   DC       

anti_join()

anti_join(x = superheroes, y = publishers, by = join_by(Publisher))

anti_join()

anti_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 1 × 4
  Name    Alignment Gender Publisher       
  <chr>   <chr>     <chr>  <chr>           
1 Sabrina Good      Female "Archie\nComics"

Application exercise

Goal

Instructions

Join data from multiple data frames, summarize it, and create this plot.

ae-03

Instructions

  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your GitHub name).
  • Clone the repo in Positron, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Wrap up

Recap

  • Use the *_join() function appropriate for your analysis
  • Leverage the difference between mutating and filtering joins

TidyTuesday

A logo for Tidy Tuesday. Describes Tidy Tuesday as a weekly data project from the Data Science Learning Community.