AE 07: Using {duckdb} for Lego® data
Packages
We will use the following packages in this application exercise.
- {tidyverse}: For data import, wrangling, and visualization.
- {duckdb}: To query a DuckDB.
- {dbplyr}: To translate our {dplyr} and {tidyr} functions to SQL.
- {scales}: Formatting graph labels.
Lego®
The Lego® Group was founded in 1932 by Ole Kirk Kristiansen. “LEGO” is an abbreviation of the two Danish words “leg” and “godt”, meaning “play well”. In the 1950s, the Lego® Group launched its most popular line of toys: LEGO Bricks. The company has produced over 600 billion LEGO bricks since its founding, and has become one of the most popular toys in the world. As of 2026, the company has produced over 26,000 different sets and over 60,000 unique parts. Hobbyists and afficianados of the toy have created a vibrant community around collecting and building with LEGO bricks, and there are many online resources for sharing and discussing LEGO-related content.
One of the most popular resources for LEGO fans is Rebrickable, a website that provides a comprehensive database of LEGO sets, parts, and themes. The website allows users to search for specific sets or parts, create and share their own collections, and connect with other LEGO enthusiasts. Rebrickable also provides a variety of tools and resources for building and customizing LEGO creations, making it a valuable resource for both casual fans and serious collectors. They publish a comprehensive dataset of LEGO-related data, which can be accessed through their website or downloaded in bulk. Unfortunately the dataset is difficult to work with in its raw form, as it is provided as a collection of CSV files that are not optimized for analysis. In total there are over 1.7 million rows of data split across 12 tables.
To solve these issues, we will utilize a DuckDB that contains all the tables provided by Rebrickable. This dataset can be access from data/lego.duckdb.
Connect to the database
Demonstration: Connect to the local database and identify the tables.
con <- duckdb(dbdir = "data/lego.duckdb") |>
dbConnect()
dbListTables(con) [1] "colors" "elements" "inventories"
[4] "inventory_minifigs" "inventory_parts" "inventory_sets"
[7] "minifigs" "part_categories" "part_relationships"
[10] "parts" "sets" "themes"
The database has the following relational structure:
Demonstration: Use tbl() to create an object that represents each database table. This will allow us to use {dbplyr} to translate our code to the corresponding SQL query.
lego_colors <- tbl(con, "colors")
lego_elements <- tbl(con, "elements")
lego_inventories <- tbl(con, "inventories")
lego_inventory_minifigs <- tbl(con, "inventory_minifigs")
lego_inventory_parts <- tbl(con, "inventory_parts")
lego_inventory_sets <- tbl(con, "inventory_sets")
lego_minifigs <- tbl(con, "minifigs")
lego_part_categories <- tbl(con, "part_categories")
lego_part_relationships <- tbl(con, "part_relationships")
lego_parts <- tbl(con, "parts")
lego_sets <- tbl(con, "sets")
lego_themes <- tbl(con, "themes")Examine the relational structure of the database
Your turn: Identify the primary key for colors. What is its foreign key in the elements and inventory_parts tables?
Add response here.
Your turn: Identify the primary key for themes. What does the dashed line between themes and themes mean?
Add response here.
Translate some basic SQL queries
Your turn: Explain what each of the following SQL queries does, and recreate them using {dbplyr}.
SELECT "sets".*
FROM "sets"
WHERE (num_parts > 200.0)Add response here.
# add code hereSELECT "year", AVG(num_parts) AS num_parts
FROM "sets"
GROUP BY "year"Add response here.
# add code hereHow many red parts are in the LEGO® data?
Your turn: How many red parts are in the LEGO® data? Count all parts with red anywhere in the color name.
-
stringr::str_detect()can be used to identify rows where a string contains a particular pattern. In this case, we can use it to identify colors with “Red” anywhere in the name. - Use
show_query()to examine the translated SQL code.
# add code hereWhat is the parent theme with the most “children” themes?
Your turn: What is the parent theme with the most “children” themes? A “child” theme is a theme that has a parent theme. A “parent” theme is a theme that has one or more child themes. The themes table contains a parent_id column which identifies the parent theme for each theme. If this column is NA, then the theme does not have a parent and is therefore a parent theme itself.
# add code hereWhich theme has the most total parts across all sets?
Your turn: Which theme has the most total parts across all sets? To answer this question, we need to identify the number of parts in each set, and then sum these values for each theme.
# add code hereChange over time in set complexity
Now let’s examine how the complexity of LEGO® sets has changed over time. We can use the number of parts in a set as a proxy for its complexity.
Demonstration: Visualize the change over time in the number of parts in sets.
# add code hereVisualize the average set size over time
Your turn: Visualize the change over time in the average number of parts in sets. For each year, calculate the average number of parts in sets released that year, and visualize the change in this value over time.
# add code hereCalculate a rolling average
Demonstration: Calculate a rolling five-year average set size.
# add code here