AE 08: Using {duckdb} for NYC flights

Application exercise
Modified

February 25, 2025

Packages

We will use the following packages in this application exercise.

  • {tidyverse}: For data import, wrangling, and visualization.
  • {duckdb}: To query a DuckDB.
  • {dbplyr}: To translate our {dplyr} and {tidyr} functions to SQL.
  • {scales}: Formatting graph labels.

New York City flights data

The U.S. Department of Transportation publishes detailed statistics on flights originating in the United States. {nycflights13} is a highly popular R package containing information on all flights that departed NYC1 in 2013. The package was originally developed to provide practice wrangling and working with relational datasets.

1 Specifically, John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and Newark International Airport (EWR).

That said, this package has become quite dated since it was originally developed in the mid-2010s. The same air travel data can be accessed for any airports in the United States using the {anyflights} package. The challenge is that data for more than a trivial handful of airports and years becomes significantly large and unwieldy to utilize.

To solve these issues, we will utilize a DuckDB that contains detailed air travel data for all NYC airports from 2015-2023. This dataset can be access from data/nyc-flights.duckdb.

Connect to the database

Demonstration: Connect to the local database and identify the tables.

con <- duckdb(dbdir = "data/nyc-flights.duckdb") |>
  dbConnect()

dbListTables(con)
[1] "airlines" "airports" "flights"  "planes"   "weather" 

The database has the following relational structure:

Connections between all five tables in the database. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows. Source: R4DS.

Connections between all five tables in the database. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows. Source: R4DS.

Demonstration: Use tbl() to create an object that represents each database table. This will allow us to use {dbplyr} to translate our code to the corresponding SQL query.

airlines <- tbl(con, "airlines")
airports <- tbl(con, "airports")
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")
weather <- tbl(con, "weather")

Translate some basic SQL queries

Your turn: Explain what each of the following SQL queries does, and recreate them using {dbplyr}.

SELECT flights.*
FROM flights
WHERE (dep_delay < arr_delay)

Add response here.

# add code here
SELECT flights.*, distance / (air_time / 60.0) AS speed
FROM flights

Add response here.

# add code here

What causes flight delays?

Your turn: What is the relationship between weather events and departure delays? Calculate the average departure delay and join with the weather table. How do these R commands translate to SQL? What trends do you see?

Hints
  • Before summarizing flights, examine the weather table and figure out its unit of analysis. This will clue you in to how best to summarize the flights data.
  • Use show_query() to examine the translated SQL code.
# add code here

Number of flights departing NYC airports over time

Now let’s examine potential patterns or trends in the volume of flights departing NYC-area airports over time.

Create a date variable

Your turn: Currently flights does not have a standalone date variable that identifies the date of the flight. Create one and store it at the beginning of the table.

Make a date from component columns

lubridate::make_date() can create a date variable from numeric columns which contain each of the date components.

Modifying contents of the database

tbl() provides a read-only interface to the table. You can modify the contents of the table and store them as an R object, but it does not actually change the contents of the table in the database. To modify the database directly, you need to use an appropriate method from {DBI}.

# add code here

Visualize the volume of flights over time

Demonstration: Calculate the daily number of departing flights from NYC and visualize as a line graph.

# add code here

Demonstration: Calculate a rolling seven-day average number of flights departing from NYC.

# add code here