Build better training data

Lecture 23

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

April 17, 2025

Announcements

Announcements

  • Project draft
  • Homework 08

Application exercise

ae-21

Instructions

  • Go to the course GitHub org and find your ae-21 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day)

Import data

library(tidyverse)  # read_csv(), mutate(), across(), where()

hotels <- read_csv("data/hotels.csv") |>
  mutate(across(where(is.character), as.factor))

count(hotels, children)
# A tibble: 2 × 2
  children     n
  <fct>    <int>
1 children  4039
2 none     45961

👩🏼‍🍳 Build a better training set with {recipes}

Preprocessing/feature engineering options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more!

To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step

recipe()

Creates a recipe for a set of variables

recipe(children ~ ., data = hotels)


rec <- recipe(children ~ ., data = hotels)

step_*()

Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date)

Before recipe

# A tibble: 45,000 × 1
   arrival_date
   <date>      
 1 2016-04-28  
 2 2016-12-29  
 3 2016-10-17  
 4 2016-05-22  
 5 2016-03-02  
 6 2016-06-16  
 7 2017-02-13  
 8 2017-08-20  
 9 2017-08-22  
10 2017-05-18  
# ℹ 44,990 more rows

After recipe

# A tibble: 45,000 × 4
   arrival_date arrival_date_dow arrival_date_month arrival_date_year
   <date>       <fct>            <fct>                          <int>
 1 2016-04-28   Thu              Apr                             2016
 2 2016-12-29   Thu              Dec                             2016
 3 2016-10-17   Mon              Oct                             2016
 4 2016-05-22   Sun              May                             2016
 5 2016-03-02   Wed              Mar                             2016
 6 2016-06-16   Thu              Jun                             2016
 7 2017-02-13   Mon              Feb                             2017
 8 2017-08-20   Sun              Aug                             2017
 9 2017-08-22   Tue              Aug                             2017
10 2017-05-18   Thu              May                             2017
# ℹ 44,990 more rows
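The before-and-after comparison above can be reproduced by training the recipe and applying it to its own data; a minimal sketch, assuming the rec defined above:

rec |>
  prep() |>                  # estimate the steps from the recipe's data
  bake(new_data = NULL) |>   # return the processed training set
  select(starts_with("arrival_date"))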

step_holiday() + step_rm()

step_holiday() generates a set of indicator variables for specific holidays; step_rm() then removes the original arrival_date column, which is no longer needed.

holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter", 
              "ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date)

step_holiday() + step_rm()

Rows: 45,000
Columns: 11
$ arrival_date_dow          <fct> Thu, Thu, Mon, Sun, Wed, Thu, Mon, Sun, Tue,…
$ arrival_date_month        <fct> Apr, Dec, Oct, May, Mar, Jun, Feb, Aug, Aug,…
$ arrival_date_year         <int> 2016, 2016, 2016, 2016, 2016, 2016, 2017, 20…
$ arrival_date_AllSouls     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_AshWednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_ChristmasEve <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_Easter       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_ChristmasDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_GoodFriday   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_NewYearsDay  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_PalmSunday   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

K Nearest Neighbors (KNN)

K Nearest Neighbors (KNN)

To predict the outcome of a new data point:

  • Find the K most similar old data points
  • Take the average (regression) or majority vote (classification) of their outcomes, as sketched below
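A minimal sketch of the idea on toy data (all values hypothetical): find the K = 3 nearest training points by Euclidean distance, then take a majority vote.

library(dplyr)

train <- tibble(
  x1 = c(1, 2, 3, 6, 7, 8),
  x2 = c(1, 1, 2, 5, 6, 5),
  class = factor(c("a", "a", "a", "b", "b", "b"))
)
new_point <- c(x1 = 2.5, x2 = 1.5)

train |>
  mutate(dist = sqrt((x1 - new_point["x1"])^2 + (x2 - new_point["x2"])^2)) |>
  slice_min(dist, n = 3) |>  # the K = 3 nearest neighbors
  count(class) |>            # tally their classes...
  slice_max(n, n = 1)        # ...and take the majority vote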

To specify a model with {parsnip}

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)

To specify a KNN model with {parsnip}

knn_mod <- nearest_neighbor() |>              
  set_engine("kknn") |>             
  set_mode("classification")        

Fact

KNN requires all numeric predictors, and all need to be centered and scaled.

What does that mean?

Quiz

Why do you need to "train" a recipe?

Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?
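A minimal sketch of the answer (toy data assumed): prep() learns the mean and sd from the training data, and bake() reuses those same numbers on new data instead of recomputing them.

library(recipes)
library(dplyr)

train_df <- tibble(x = c(10, 20, 30))  # mean = 20, sd = 10
new_df   <- tibble(x = 25)

norm_rec <- recipe(~ x, data = train_df) |>
  step_normalize(x) |>
  prep(training = train_df)  # "train" the recipe on the training data

bake(norm_rec, new_data = new_df)  # (25 - 20) / 10 = 0.5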

Guess

# A tibble: 5 × 1
  meal     
  <fct>    
1 SC       
2 BB       
3 HB       
4 Undefined
5 FB       
# A tibble: 50,000 × 5
      SC    BB    HB Undefined    FB
   <dbl> <dbl> <dbl>     <dbl> <dbl>
 1     1     0     0         0     0
 2     0     1     0         0     0
 3     0     1     0         0     0
 4     0     1     0         0     0
 5     0     1     0         0     0
 6     0     1     0         0     0
 7     0     0     1         0     0
 8     0     1     0         0     0
 9     0     0     1         0     0
10     1     0     0         0     0
# ℹ 49,990 more rows
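An indicator matrix like the one above can be generated with step_dummy(); a sketch, assuming one_hot = TRUE so that every level (not just all-but-one) gets its own column. Note that step_dummy() would name the columns meal_SC, meal_BB, and so on.

recipe(children ~ meal, data = hotels) |>
  step_dummy(meal, one_hot = TRUE) |>  # one indicator column per level
  prep() |>
  bake(new_data = NULL) |>
  select(starts_with("meal"))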

Dummy Variables

logistic_reg() |>
  fit(children ~ meal, data = hotels) |> 
  broom::tidy()
# A tibble: 5 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      2.38     0.0183    130.   0       
2 mealFB          -1.15     0.165      -6.98 2.88e-12
3 mealHB          -0.118    0.0465     -2.54 1.12e- 2
4 mealSC           1.43     0.104      13.7  1.37e-42
5 mealUndefined    0.570    0.188       3.03 2.47e- 3

step_dummy()

Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors())

Quiz

How does {recipes} know which variables are numeric and which are nominal?

rec <- recipe(
  children ~ ., 
  data = hotels
  )
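The answer: the recipe catalogs each column's type from data when it is created. Calling summary() on a recipe shows that catalog; a quick sketch:

summary(rec)  # one row per variable: name, type, role, source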

Quiz

How does {recipes} know what is a predictor and what is an outcome?

rec <- recipe(
  children ~ .,
  data = hotels
  )

The formula → indicates outcomes vs. predictors

The data → is only used to catalog the names and types of each variable

Selectors

Helper functions for selecting sets of variables

rec |> 
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

Some common selector functions

selector                   description
all_predictors()           Each x variable (right side of ~)
all_outcomes()             Each y variable (left side of ~)
all_numeric()              Each numeric variable
all_nominal()              Each categorical variable (e.g. factor, string)
all_nominal_predictors()   Each categorical variable that is defined as a predictor
all_numeric_predictors()   Each numeric variable that is defined as a predictor
dplyr::select() helpers    starts_with("NY_"), etc.

Guess

What would happen if you try to normalize a variable that doesn't vary?

Error! You'd be dividing by zero!

step_zv()

Removes zero-variance variables (variables that contain only a single value)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors())
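A tiny sketch of the behavior (toy data assumed): the constant column is dropped before it can cause a divide-by-zero in a later normalization step.

library(recipes)
library(dplyr)

df <- tibble(y = c(1, 2, 3), constant = 5, varying = c(4, 6, 8))

recipe(y ~ ., data = df) |>
  step_zv(all_predictors()) |>  # drops `constant`
  prep() |>
  bake(new_data = NULL)         # only `varying` and `y` remain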

step_normalize()

Centers then scales numeric variables (mean = 0, sd = 1)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric())

Imbalanced outcome

step_downsample()

library(themis)

rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric()) |>
  step_downsample(children)

After downsampling
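One way to verify the effect of downsampling (a sketch, assuming the rec built above): prep the recipe and count the classes in the processed training data.

rec |>
  prep() |>
  bake(new_data = NULL) |>
  count(children)  # both classes now match the minority class size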

⏱️ Your Turn 1

Instructions

Unscramble! You have all the steps from our knn_rec; your challenge is to unscramble them into the right order!

Save the result as knn_rec

03:00

knn_rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric()) |>
  step_downsample(children)
knn_rec

Now we've built a recipe.

But, how do we use a recipe?

Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

🪒🪵 Bundling machine learning workflows with workflow()

workflow()

Creates a workflow to which you can add a model (and more)

workflow()

add_formula()

Adds a formula to a workflow (an alternative to adding a recipe)

workflow() |> add_formula(children ~ average_daily_rate)

add_model()

Adds a parsnip model spec to a workflow

workflow() |> add_model(knn_mod)

Guess

If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let's see!

⏱️ Your Turn 2

Instructions

Fill in the blanks to make a workflow that combines knn_rec with knn_mod.

01:00

knn_wf <- workflow() |> 
  add_recipe(knn_rec) |> 
  add_model(knn_mod)
knn_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps

• step_date()
• step_holiday()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
• step_downsample()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Computational engine: kknn 

add_recipe()

Adds a recipe to a workflow.

knn_wf <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_mod)

Guess

Do you need to add a formula if you have a recipe?

Nope!

rec <- recipe(
  children ~ .,
  data = hotels
)

fit()

Fit a workflow that bundles a recipe (or a formula) and a model.

knn_wf |> 
  fit(data = hotels_train) |> 
  augment(hotels_test)
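Here hotels_train and hotels_test are assumed to come from a train/test split made earlier; a plausible sketch with {rsample} (the seed and prop = 0.9 are assumptions, with 0.9 matching the 45,000-row training set seen earlier):

library(tidymodels)

set.seed(100)
hotels_split <- initial_split(hotels, prop = 0.9, strata = children)
hotels_train <- training(hotels_split)
hotels_test  <- testing(hotels_split)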

Preprocess k-fold resamples?

set.seed(100)
hotels_folds <- vfold_cv(hotels_train, v = 10,
                         strata = children)

fit_resamples()

Fit a workflow that bundles a recipe (or a formula) and a model, with resampling.

knn_wf |> 
  fit_resamples(resamples = hotels_folds)

⏱️ Your Turn 3

Instructions

Run the first chunk. Then try our KNN workflow on hotels_folds. What is the ROC AUC?

03:00

set.seed(100)
hotels_folds <- vfold_cv(hotels_train, v = 10, strata = children)

knn_wf |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics()
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.741    10 0.00209 Preprocessor1_Model1
2 brier_class binary     0.173    10 0.00161 Preprocessor1_Model1
3 roc_auc     binary     0.830    10 0.00231 Preprocessor1_Model1

update_recipe()

Replace the recipe in a workflow.

knn_wf |>
  update_recipe(glmnet_rec)

update_model()

Replace the model in a workflow.

knn_wf |>
  update_model(tree_mod)

⏱️ Your Turn 4

Instructions

Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let's try it out!

plr_mod <- logistic_reg(penalty = .01, mixture = 1) |> 
  set_engine("glmnet") |> 
  set_mode("classification")

plr_mod |> 
  translate()
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    alpha = 1, family = "binomial")
03:00

glmnet_wf <- knn_wf |> 
  update_model(plr_mod)

glmnet_wf |> 
  fit_resamples(resamples = hotels_folds) |> 
  collect_metrics() 
# A tibble: 3 × 6
  .metric     .estimator  mean     n  std_err .config             
  <chr>       <chr>      <dbl> <int>    <dbl> <chr>               
1 accuracy    binary     0.828    10 0.00218  Preprocessor1_Model1
2 brier_class binary     0.139    10 0.000871 Preprocessor1_Model1
3 roc_auc     binary     0.873    10 0.00210  Preprocessor1_Model1

Wrap up

Recap

  • Feature engineering defines a series of preprocessing steps that prepare the data for modeling the outcome of interest
    • Some feature engineering steps are required for specific types of models
    • Others are dependent on specific types of variables/data structures
  • Feature engineering and modeling are two halves of a single predictive workflow
  • Feature engineering requires training, just like the model
  • Implement feature engineering using {recipes}
  • Leverage workflow() to create explicit, logical pipelines for training a machine learning model

Acknowledgments