Functions

Lecture 12

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

March 4, 2025

Announcements

Homework 04
Project mentor office hours

Why write functions?

Rationale

Makes code easier to read
As requirements change, you only need to update code in one place
Avoid errors caused by copy-paste-replace
Increases productivity from project-to-project by reusing code

`penguins`

library(tidyverse) # provides glimpse(), mutate(), summarize()
library(palmerpenguins)
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

We have several measurements on different scales:

bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g

Standardizing variables

Difficult to plot on same axis or determine what value is large for that variable
A common solution is to apply a $z$ score transformation to each variable.
Normalizes the values to have a mean of 0 and a standard deviation of 1

\[z = \frac{x - \bar{x}}{\text{s.d.}}\]
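As a small worked illustration of the formula, here is a sketch on a made-up vector (the values are assumptions chosen so the arithmetic is easy to follow, not penguin data):

```r
# Toy values: mean is 4 and standard deviation is 2
x <- c(2, 4, 6)

# Subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

z        # -1  0  1
mean(z)  # 0
sd(z)    # 1
```

After the transformation the values are expressed in standard deviations from the mean, so variables measured in millimetres and grams become comparable.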

Apply transformation

We can apply the same transformation to each variable:

penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )

Long, unclear

(bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE)

Quite a lot of code
Difficult to determine what the transformation is

How can we shorten it and make it clearer?

Write a function

Can be named to make transformation transparent
Will make code shorter
Can be reused

Types of function

Vector functions: one or more vectors as input, one vector as output
1. Output same length as input
2. Summary functions
Data frame functions: data frame as input and data frame as output

Vector functions

Output same length as input
Summary functions

Output same length as input

Output same length as input
Works well in mutate() or filter()
Appropriate for the z-transformation example

Writing a function

To turn your code into a function you need:

A name
The arguments - which represent the parts that vary
The code body for the function

name <- function(arguments) {
  # code body
}

Whitespace does not matter in R

Unlike Python, R does not use whitespace or indentation to define the structure of a function; the braces do that. Indentation is there purely for readability.
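To illustrate, here is a minimal sketch with two hypothetical functions (add_one_tidy() and add_one_messy() are made-up names) that differ only in layout. Note that a line break can still end an expression, so the second version works only because the line ends with an operator that signals more is coming.

```r
# Conventional layout
add_one_tidy <- function(x) {
  x + 1
}

# Same code with no indentation and an awkward line break;
# the trailing `+` tells R the expression continues on the next line
add_one_messy <- function(x) {
x +
          1
}

add_one_tidy(1:3)   # 2 3 4
add_one_messy(1:3)  # 2 3 4
```

Both run the same way, but only the first is pleasant to read, which is the reason to indent consistently.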

Function name

Use a verb

Struggling to find a name? The function may be doing too much. Should it be two or three functions instead?

What should we call the function we write to do a $z$ score transformation?

Arguments

The input vector
Additional arguments

name <- function(x) {
  # body does things with x
}

Example

\[z = \frac{x - \bar{x}}{\text{s.d.}}\]

penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )

Example

Identify the arguments: the things that vary across calls

(bill_length_mm    - mean(bill_length_mm,    na.rm = TRUE)) / sd(bill_length_mm,    na.rm = TRUE)
(bill_depth_mm     - mean(bill_depth_mm,     na.rm = TRUE)) / sd(bill_depth_mm,     na.rm = TRUE)
(flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE)
(body_mass_g       - mean(body_mass_g,       na.rm = TRUE)) / sd(body_mass_g,       na.rm = TRUE)

(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)
(🟧 - mean(🟧, na.rm = TRUE)) / sd(🟧, na.rm = TRUE)

🟧 is x

Example

Put into the template

name <- function(x) {
  # body does things with x
}

to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Apply

Rewrite the call to mutate() as:

penguins <- penguins |>
  mutate(
    z_bill_length_mm = to_z(bill_length_mm),
    z_bill_depth_mm = to_z(bill_depth_mm),
    z_flipper_length_mm = to_z(flipper_length_mm),
    z_body_mass_g = to_z(body_mass_g)
  )

Much shorter and much clearer.

A modification

mean() has a trim argument: mean(x, trim = 0, na.rm = FALSE, ...)

The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.

Suppose we want to specify the middle proportion left rather than the proportion trimmed from each end

A modification

A value of 0.1 for trim trims 0.1 from each end leaving 0.8 in the middle
trim = (1 - middle)/2

[Figure: schematic of trim and middle, showing that trim = (1 - middle)/2]

Trim is the proportion trimmed off each end; middle is what’s left
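The relationship can be checked on a small made-up vector. This is a sketch under assumptions: the values and the helper name middle_to_trim() are hypothetical, and middle = 0.5 is chosen so the arithmetic is exact.

```r
# Eight toy values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 100)

# Hypothetical helper: convert the middle proportion kept
# into the proportion trimmed from each end
middle_to_trim <- function(middle) {
  (1 - middle) / 2
}

trim <- middle_to_trim(0.5)  # 0.25 trimmed from each end

mean(x)               # 16, pulled upwards by the value 100
mean(x, trim = trim)  # 4.5, the mean of the middle four values 3, 4, 5, 6
```

Keeping the middle half of eight values drops two from each end, which is what trim = 0.25 asks mean() to do.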

Add an argument

to_z <- function(x, middle) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

Try it out

to_z(penguins$bill_length_mm, middle = 0.2)

  [1] -0.92838057 -0.85511491 -0.70858359          NA -1.36797452 -0.89174774
  [7] -0.96501340 -0.91006415 -1.84420131 -0.39720454 -1.16649396 -1.16649396
 [13] -0.56205227 -1.01996264 -1.75261923 -1.38629094 -1.00164623 -0.30562246
 [19] -1.78925206  0.33545205 -1.16649396 -1.18481038 -1.51450584 -1.09322830
 [25] -0.98332981 -1.62440433 -0.65363435 -0.67195076 -1.14817755 -0.67195076
 [31] -0.85511491 -1.27639245 -0.85511491 -0.59868510 -1.42292377 -0.91006415
 [37] -0.98332981 -0.36057171 -1.20312679 -0.80016566 -1.40460735 -0.61700152
 [43] -1.49618943 -0.01255983 -1.31302528 -0.83679849 -0.56205227 -1.22144320
 [49] -1.49618943 -0.34225529 -0.83679849 -0.74521642 -1.67935358 -0.39720454
 [55] -1.77093565 -0.50710303 -0.94669698 -0.65363435 -1.40460735 -1.20312679
 [61] -1.55113867 -0.52541944 -1.20312679 -0.56205227 -1.42292377 -0.47047020
 [67] -1.58777150 -0.56205227 -1.51450584 -0.43383737 -1.95409980 -0.81848208
 [73] -0.83679849  0.29881922 -1.58777150 -0.25067322 -0.59868510 -1.27639245
 [79] -1.45955660 -0.37888812 -1.75261923 -0.23235681 -1.36797452 -1.66103716
 [85] -1.25807603 -0.52541944 -1.44124018 -1.33134169 -1.07491189 -0.96501340
 [91] -1.55113867 -0.56205227 -1.86251772 -0.83679849 -1.45955660 -0.61700152
 [97] -1.11154472 -0.70858359 -2.02736546 -0.17740756 -1.67935358 -0.58036869
[103] -1.18481038 -1.16649396 -1.14817755 -0.81848208 -1.01996264 -1.09322830
[109] -1.11154472 -0.17740756 -1.11154472  0.26218639 -0.81848208 -0.36057171
[115] -0.83679849 -0.26898963 -1.01996264 -1.25807603 -1.55113867 -0.56205227
[121] -1.45955660 -1.18481038 -0.72690000 -0.50710303 -1.64272075 -0.65363435
[127] -0.98332981 -0.48878661 -0.94669698 -0.01255983 -1.03827906 -0.19572398
[133] -1.34965811 -1.22144320 -1.11154472 -0.56205227 -1.56945509 -0.72690000
[139] -1.31302528 -0.81848208 -0.72690000 -0.65363435 -2.21052960 -0.63531793
[145] -1.25807603 -0.94669698 -0.91006415 -1.38629094 -1.49618943 -1.16649396
[151] -1.49618943 -0.48878661  0.35376847  1.06810865  0.82999525  1.06810865
[157]  0.62851469  0.42703413  0.22555357  0.46366696 -0.15909115  0.48198337
[163] -0.59868510  0.88494450  0.24386998  0.77504601  0.29881922  0.93989374
[169] -0.39720454  0.92157733  0.37208488  0.82999525  1.10474148  0.17060432
[175]  0.42703413  0.39040130 -0.23235681  0.35376847  0.06070583  0.66514752
[181]  0.73841318  1.06810865  0.57356545 -0.25067322  0.17060432  2.82648447
[187]  0.90326091  0.77504601 -0.28730605  0.04238942 -0.03087624  0.82999525
[193] -0.26898963  0.99484299  0.20723715  0.99484299  1.15969072 -0.10414190
[199]  0.24386998  1.15969072  0.13397149  0.18892074  0.44535054  0.79336242
[205]  0.17060432  1.08642506  0.42703413  0.15228791 -0.06750907  0.24386998
[211] -0.17740756  1.14137431  0.20723715  0.37208488  0.28050281  1.85571448
[217]  0.29881922  1.03147582  0.37208488  0.97652657 -0.12245832  1.19632355
[223]  0.64683111  0.40871771  0.73841318  0.42703413  0.40871771  0.81167884
[229]  0.61019828  1.26958921  0.18892074  0.18892074  0.90326091  1.52601902
[235]  0.59188186  1.06810865  0.13397149  1.21463997 -0.14077473  1.30622204
[241]  0.61019828  1.45275336  0.61019828  1.47106977  0.24386998  0.97652657
[247]  0.06070583  1.21463997  0.95821016  0.50029979  0.77504601  1.26958921
[253]  0.79336242  2.14877712  0.55524903  0.90326091  0.57356545  0.48198337
[259] -0.45215378  1.69086675 -0.15909115  0.72009677  1.15969072  1.03147582
[265] -0.12245832  1.34285487  0.37208488  2.00224580  0.06070583  0.84831167
[271]  0.55524903          NA  0.48198337  1.14137431  0.18892074  1.04979223
[277]  0.42703413  1.06810865  1.30622204  0.22555357  1.56265185  0.18892074
[283]  0.35376847  1.30622204  0.33545205  1.30622204  0.44535054  1.37948770
[289]  0.51861620  1.43443694  0.31713564  1.15969072  1.12305789  2.53342183
[295]  0.40871771  0.92157733 -0.32393888  0.79336242 -0.17740756  1.17800714
[301]  0.46366696  1.43443694  1.15969072  0.97652657  0.40871771  1.58096826
[307] -0.59868510  1.83739807 -0.30562246  1.25127279  1.01315940  0.61019828
[313]  0.62851469  1.43443694  0.50029979  1.70918316  0.88494450  0.37208488
[319]  1.23295638  0.24386998  1.23295638  1.21463997  1.08642506  0.88494450
[325]  1.34285487  1.03147582  0.72009677  1.32453845  0.28050281  1.19632355
[331] -0.30562246  1.47106977  0.18892074  0.93989374  1.10474148  0.26218639
[337]  1.41612053  0.48198337  0.28050281  2.13046071 -0.12245832  0.99484299
[343]  1.21463997  1.10474148

But what if we forget?

to_z(penguins$bill_length_mm)

Error in to_z(penguins$bill_length_mm): argument "middle" is missing, with no default

Give a default

Give defaults whenever possible:

to_z <- function(x, middle = 1) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

Try it out

to_z(penguins$bill_length_mm)

  [1] -0.88320467 -0.80993901 -0.66340769          NA -1.32279862 -0.84657184
  [7] -0.91983750 -0.86488825 -1.79902541 -0.35202864 -1.12131806 -1.12131806
 [13] -0.51687637 -0.97478674 -1.70744334 -1.34111504 -0.95647033 -0.26044656
 [19] -1.74407616  0.38062795 -1.12131806 -1.13963448 -1.46932994 -1.04805240
 [25] -0.93815391 -1.57922843 -0.60845845 -0.62677486 -1.10300165 -0.62677486
 [31] -0.80993901 -1.23121655 -0.80993901 -0.55350920 -1.37774787 -0.86488825
 [37] -0.93815391 -0.31539581 -1.15795089 -0.75498976 -1.35943145 -0.57182562
 [43] -1.45101353  0.03261607 -1.26784938 -0.79162259 -0.51687637 -1.17626731
 [49] -1.45101353 -0.29707939 -0.79162259 -0.70004052 -1.63417768 -0.35202864
 [55] -1.72575975 -0.46192713 -0.90152108 -0.60845845 -1.35943145 -1.15795089
 [61] -1.50596277 -0.48024354 -1.15795089 -0.51687637 -1.37774787 -0.42529430
 [67] -1.54259560 -0.51687637 -1.46932994 -0.38866147 -1.90892390 -0.77330618
 [73] -0.79162259  0.34399512 -1.54259560 -0.20549732 -0.55350920 -1.23121655
 [79] -1.41438070 -0.33371222 -1.70744334 -0.18718091 -1.32279862 -1.61586126
 [85] -1.21290014 -0.48024354 -1.39606428 -1.28616579 -1.02973599 -0.91983750
 [91] -1.50596277 -0.51687637 -1.81734182 -0.79162259 -1.41438070 -0.57182562
 [97] -1.06636882 -0.66340769 -1.98218956 -0.13223166 -1.63417768 -0.53519279
[103] -1.13963448 -1.12131806 -1.10300165 -0.77330618 -0.97478674 -1.04805240
[109] -1.06636882 -0.13223166 -1.06636882  0.30736229 -0.77330618 -0.31539581
[115] -0.79162259 -0.22381374 -0.97478674 -1.21290014 -1.50596277 -0.51687637
[121] -1.41438070 -1.13963448 -0.68172411 -0.46192713 -1.59754485 -0.60845845
[127] -0.93815391 -0.44361071 -0.90152108  0.03261607 -0.99310316 -0.15054808
[133] -1.30448221 -1.17626731 -1.06636882 -0.51687637 -1.52427919 -0.68172411
[139] -1.26784938 -0.77330618 -0.68172411 -0.60845845 -2.16535371 -0.59014203
[145] -1.21290014 -0.90152108 -0.86488825 -1.34111504 -1.45101353 -1.12131806
[151] -1.45101353 -0.44361071  0.39894437  1.11328455  0.87517115  1.11328455
[157]  0.67369059  0.47221003  0.27072946  0.50884286 -0.11391525  0.52715927
[163] -0.55350920  0.93012040  0.28904588  0.82022191  0.34399512  0.98506964
[169] -0.35202864  0.96675323  0.41726078  0.87517115  1.14991738  0.21578022
[175]  0.47221003  0.43557720 -0.18718091  0.39894437  0.10588173  0.71032342
[181]  0.78358908  1.11328455  0.61874135 -0.20549732  0.21578022  2.87166037
[187]  0.94843681  0.82022191 -0.24213015  0.08756532  0.01429966  0.87517115
[193] -0.22381374  1.04001889  0.25241305  1.04001889  1.20486662 -0.05896600
[199]  0.28904588  1.20486662  0.17914739  0.23409663  0.49052644  0.83853832
[205]  0.21578022  1.13160096  0.47221003  0.19746381 -0.02233317  0.28904588
[211] -0.13223166  1.18655021  0.25241305  0.41726078  0.32567871  1.90089038
[217]  0.34399512  1.07665172  0.41726078  1.02170247 -0.07728242  1.24149945
[223]  0.69200701  0.45389361  0.78358908  0.47221003  0.45389361  0.85685474
[229]  0.65537418  1.31476511  0.23409663  0.23409663  0.94843681  1.57119492
[235]  0.63705776  1.11328455  0.17914739  1.25981586 -0.09559883  1.35139794
[241]  0.65537418  1.49792926  0.65537418  1.51624567  0.28904588  1.02170247
[247]  0.10588173  1.25981586  1.00338606  0.54547569  0.82022191  1.31476511
[253]  0.83853832  2.19395302  0.60042493  0.94843681  0.61874135  0.52715927
[259] -0.40697788  1.73604265 -0.11391525  0.76527266  1.20486662  1.07665172
[265] -0.07728242  1.38803077  0.41726078  2.04742170  0.10588173  0.89348757
[271]  0.60042493          NA  0.52715927  1.18655021  0.23409663  1.09496813
[277]  0.47221003  1.11328455  1.35139794  0.27072946  1.60782775  0.23409663
[283]  0.39894437  1.35139794  0.38062795  1.35139794  0.49052644  1.42466360
[289]  0.56379210  1.47961284  0.36231154  1.20486662  1.16823379  2.57859773
[295]  0.45389361  0.96675323 -0.27876298  0.83853832 -0.13223166  1.22318303
[301]  0.50884286  1.47961284  1.20486662  1.02170247  0.45389361  1.62614416
[307] -0.55350920  1.88257397 -0.26044656  1.29644869  1.05833530  0.65537418
[313]  0.67369059  1.47961284  0.54547569  1.75435906  0.93012040  0.41726078
[319]  1.27813228  0.28904588  1.27813228  1.25981586  1.13160096  0.93012040
[325]  1.38803077  1.07665172  0.76527266  1.36971435  0.32567871  1.24149945
[331] -0.26044656  1.51624567  0.23409663  0.98506964  1.14991738  0.30736229
[337]  1.46129643  0.52715927  0.32567871  2.17563660 -0.07728242  1.04001889
[343]  1.25981586  1.14991738

Application exercise

`ae-10`

Instructions

Go to the course GitHub org and find your ae-10 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline – end of the day

Your turn

Instructions

Write a function that performs the Box-Cox power transformation using the value of (non-zero) lambda ($\lambda$) supplied.

\[bc = \frac{x^{\lambda} - 1}{\lambda} \text{ for }\lambda \ne 0\]
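One possible shape for such a function, offered as a sketch rather than the expected answer (the name box_cox() and the early check on lambda are assumptions):

```r
# Hypothetical helper: Box-Cox power transformation for non-zero lambda
box_cox <- function(x, lambda) {
  # The formula divides by lambda, so stop early if it is zero
  stopifnot(lambda != 0)
  (x^lambda - 1) / lambda
}

# With lambda = 0.5 this is (sqrt(x) - 1) / 0.5
box_cox(c(1, 4, 9), lambda = 0.5)  # 0 2 4
```

Like to_z(), the output has the same length as the input, so it could be used inside mutate().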

Vector functions

Output same length as input
Summary functions

Summary functions

Input is vector
Output is a single value
Could be used in summarize()

Example

Write a function to compute the standard error of the mean of a sample.

\[\text{s.e.} = \frac{\text{s.d.}}{\sqrt{n}}\]

Example

sd_error <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}

Try it out

Call the function on penguins$bill_length_mm

sd_error(penguins$bill_length_mm)

[1] 0.2952205

Or in a pipeline

penguins |>
  summarize(se = sd_error(bill_length_mm))

# A tibble: 1 × 1
     se
  <dbl>
1 0.295

Data frame functions

Data frame as input and data frame as output

For example, we might summarize one of our columns like this:

penguins |>
  summarize(
    mean = mean(bill_length_mm, na.rm = TRUE),
    n = sum(!is.na(bill_length_mm)),
    sd = sd(bill_length_mm, na.rm = TRUE),
    se = sd_error(bill_length_mm)
  )

# A tibble: 1 × 4
   mean     n    sd    se
  <dbl> <int> <dbl> <dbl>
1  43.9   342  5.46 0.295

Output is a data frame

Data frame functions

What if we want to summarize several data frames in the same way?

Good candidate for a function to avoid repetitive code: my_summary()

Define `my_summary()` function

my_summary <- function(df, column) {
  df |>
    summarize(
      mean = mean(column, na.rm = TRUE),
      n = sum(!is.na(column)),
      sd = sd(column, na.rm = TRUE),
      se = sd_error(column)
    )
}

Use function

my_summary(df = penguins, column = bill_length_mm)

Error in `summarize()`:
ℹ In argument: `mean = mean(column, na.rm = TRUE)`.
Caused by error:
! object 'bill_length_mm' not found

Tidy evaluation

{tidyverse} functions like dplyr::summarize() use tidy evaluation so you can refer to the names of variables inside data frames. For example, you can use:

# tidyverse
penguins |> filter(species == "Adelie", body_mass_g >= 4000)

# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g >= 4000, ]

This is known as data-masking: the data frame environment masks the user environment by giving priority to the data frame.
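The priority given to the data frame can be seen in a small sketch. The tibble toy and the object mass are made up for illustration; the same name exists both as a column and as an object in the user environment.

```r
library(dplyr)

# A column called `mass` inside the data frame
toy <- tibble(mass = c(10, 20, 30))

# An object with the same name in the user environment
mass <- 1000

# Inside summarize(), `mass` refers to the column, not the object
result <- toy |>
  summarize(total = sum(mass))

result$total  # 60
```

The column wins, which is convenient interactively but is also why a function argument is not automatically treated as a column name.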

Data masking is great…

and makes life easier when working interactively

But not so useful in functions

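Because of data-masking, summarize() in my_summary() first looks for a column literally called column in the data frame that has been passed in. Finding none, it falls back to the function argument column and tries to evaluate bill_length_mm as an ordinary object, which does not exist outside the data frame. It never treats the contents of column as the name of the column you want to summarize.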

Fix `my_summary()` function

The solution is to use embracing: {{ var }}

my_summary <- function(df, column) {
  df |>
    summarize(
      mean = mean({{ column }}, na.rm = TRUE),
      n = sum(!is.na({{ column }})),
      sd = sd({{ column }}, na.rm = TRUE),
      se = sd_error({{ column }}),
      .groups = "drop"
    )
}

{{ column }} tells summarize() to look inside the column argument for the variable to use
Style with a space inside the braces: {{ column }}
.groups = "drop" avoids a message and leaves the data in an ungrouped state

Use function

my_summary(df = penguins, column = bill_length_mm)

# A tibble: 1 × 4
   mean     n    sd    se
  <dbl> <int> <dbl> <dbl>
1  43.9   342  5.46 0.295

When to embrace?

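When an argument is passed on to a function that uses tidy evaluation, such as filter(), mutate(), summarize(), group_by(), or select()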
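As a sketch of that rule, the hypothetical helper below (filter_above() and the toy data are assumptions) embraces only the argument that names a column. The threshold is an ordinary value, so it is used as-is.

```r
library(dplyr)

# Hypothetical helper: keep rows where `var` exceeds `threshold`
filter_above <- function(df, var, threshold) {
  df |>
    # `var` names a column used inside filter(), so it is embraced;
    # `threshold` is a plain number and needs no embracing
    filter({{ var }} > threshold)
}

toy <- tibble(id = 1:4, mass = c(3000, 4200, 3900, 5100))

heavy <- filter_above(toy, mass, threshold = 4000)
heavy$id  # 2 4
```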

Your turn

Instructions

Write a function to calculate the median, maximum and minimum values of a variable grouped by another variable.

Wrap up

Recap

Writing functions can make you more efficient and make your code more readable
Vector functions take one or more vectors as input; output can be a vector (useful in mutate() and filter()) or a single value (useful in summarize())
Data frame functions take a data frame as input and output a data frame
Give arguments a default value where possible
Use {{ var }} embracing to manage data masking
Use pick() to select more than one variable
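The last point can be sketched as follows. The helper count_missing() and the toy data are assumptions for illustration, and the example assumes dplyr 1.1.0 or later, where pick() is available.

```r
library(dplyr)

# Hypothetical helper: count missing values of `x_var`
# within groups defined by one or more variables
count_missing <- function(df, group_vars, x_var) {
  df |>
    # pick() lets the caller supply several grouping columns at once
    group_by(pick({{ group_vars }})) |>
    summarize(n_miss = sum(is.na({{ x_var }})), .groups = "drop")
}

toy <- tibble(
  g1 = c("a", "a", "b"),
  g2 = c("x", "x", "y"),
  v  = c(1, NA, 3)
)

res <- count_missing(toy, c(g1, g2), v)
res$n_miss  # 1 0
```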

Acknowledgements

Slides are derived from From R User to R Programmer and licensed under CC BY 4.0.
