Databases + SQL

Lecture 10

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

February 25, 2025

Announcements

Announcements

  • Lab 03
  • Homework 03
  • Reminder about team participation

Reading data into R

  • Local data files
  • Databases
  • Web scraping
  • Application programming interfaces (APIs)

Databases

  • Large volumes of data lives in databases
  • Rather than retrieving snapshots of databases to periodically analyze, better to query the database directly
  • Structured Query Language

Database basics

Data is organized in tables

Major differences from data frames

  • Stored on disk
  • Tables are indexed
  • Row-oriented structure for collecting data

Running databases

Databases run on database management systems (DBMS’s)

  • Client-server
    • PostgreSQL, MariaDB, SQL Server
  • Cloud
    • Snowflake, Amazon’s RedShift, and Google’s BigQuery
  • In-process
    • SQLite or DuckDB

Intended benefits

  • Centralized data storage
  • Controlled access
  • Unconstrained by system memory
  • Faster queries

SQL queries and {dplyr} equivalents

SQL {dplyr}
SELECT select(), mutate(), rename(), relocate(), summarize()
FROM N/A
GROUP BY group_by()
WHERE filter()
ORDER BY arrange()

Relevant R packages

Objective: make interacting with databases as seamless as possible

Application exercise

ae-08

Instructions

  • Go to the course GitHub org and find your ae-08 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day

Wrap up

Recap

  • Databases are used to structure large data sets
  • Use DBMS and {DBI} to connected to databases using R
  • {dbplyr} translates common data tidying and wrangling tasks to SQL