Meet the toolkit

Lecture 2

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2025

January 23, 2025

Announcements

  • Homework 00
  • Lab 00 tomorrow

Making INFO 2951 a success

Five tips for success

  1. Complete all the preparation work before class.
  2. Ask questions.
  3. Do the readings.
  4. Do the lab and homework assignments.
  5. Don’t procrastinate and don’t let a week pass by with lingering questions.

Course FAQ

Q - What data science background does this course assume?
A - None! Sort of…

Q - Is this an intro stats course?
A - No. We presume you have already met the prereq and taken one of AEM 2100, BTRY 3010, CEE 3040, ECON 3110, ECON 3130, ENGRD 2700, ILRST 2100, MATH 1710, PAM 2100, PSYCH 2500, SOC 3010, STSCI 2100, STSCI 2150, STSCI 2200 🙄

While statistics \(\ne\) data science, they are very closely related and have a tremendous amount of overlap.

Q - Will we be doing computing?
A - Yes! Lots of it.

Q - Is this an intro CS course?
A - No – you’ve already taken CS 1110 or 1112

Q - What computing language will we learn?
A - R.

Q - How is this course different from INFO 2950?
A - R rather than Python.

Q - I don’t want to learn R! When can I take INFO 2950?
A - Probably next fall.

Course toolkit

Course operation

Doing data science

  • Computing:
    • R
    • RStudio
    • {tidyverse}
    • Quarto
  • Version control and collaboration:
    • Git
    • GitHub

Toolkit: Computing

Learning goals

By the end of the semester, you will…

  • Conduct exploratory data analysis through data wrangling and munging as well as visualizations and summary statistics.
  • Identify patterns in data to make predictions or to identify associations between variables.
  • Evaluate the strength of patterns using statistical and substantive significance.
  • Implement data science workflows using common, reproducible methods and software tools.
  • Use data ethically and responsibly.

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability \(\rightarrow\) R
  • Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto
  • Version control \(\rightarrow\) Git / GitHub
  • Reproducible environments \(\rightarrow\) {renv}

R and RStudio

Tour: R and RStudio

{tidyverse}

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The {tidyverse} is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar
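
To make the "common grammar" concrete, here is a minimal sketch that chains three {dplyr} verbs. The built-in mtcars data set and the chosen columns are assumptions for illustration only; they are not part of the course materials.

```r
# Assumes {dplyr} is installed; mtcars ships with base R.
library(dplyr)

mpg_by_cyl <- mtcars |>
  filter(hp > 100) |>                        # keep cars with more than 100 horsepower
  group_by(cyl) |>                           # one group per cylinder count
  summarize(mean_mpg = mean(mpg), n = n())   # collapse each group to a single row

mpg_by_cyl
```

Each verb takes a data frame as its first argument and returns a data frame, which is why the steps can be piped together in the order you would describe them out loud.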

Quarto

  • Fully reproducible documents – each time you render, the analysis is run from the beginning

  • YAML header to define document settings

  • Code goes in chunks

  • Narrative goes outside of chunks

  • A visual editor for a familiar, Google Docs-like editing experience

  • Plain-text file format for easy editing and version control

  • More robust and flexible compared to Jupyter Notebooks

    But you can still use the Jupyter engine to run Python natively
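
The pieces listed above (YAML header, narrative, code chunk, plain-text format) can be seen in a small sketch that writes a Quarto document from R. The title, heading, and chunk contents are hypothetical placeholders, not a course file.

```r
# Build a minimal Quarto document as plain text and save it to a temp file.
fence <- strrep("`", 3)  # three backticks open and close a code chunk

qmd_lines <- c(
  "---",                                  # YAML header starts
  "title: \"My first analysis\"",         # hypothetical title
  "format: html",
  "---",                                  # YAML header ends
  "",
  "## Fuel economy",                      # narrative goes outside of chunks
  "",
  "The chunk below summarizes miles per gallon.",
  "",
  paste0(fence, "{r}"),                   # code chunk starts
  "summary(mtcars$mpg)",
  fence                                   # code chunk ends
)

qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)

# Print the file to see its plain-text structure
cat(readLines(qmd_path), sep = "\n")
```

Opening the resulting file in RStudio and clicking Render would run the chunk from a clean start, assuming Quarto is installed.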

Tour: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

First, run the following in the console:

x <- 2
x * 3
[1] 6


All looks good, eh?

Then, add the following to an R chunk in your Quarto document and render it:

x * 3
Error in eval(expr, envir, enclos): object 'x' not found


What happens? Why the error?
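
One way to think about the answer: the Console and the rendered document each keep their own set of objects. The sketch below mimics that with two R environments. The names console_env and render_env are hypothetical stand-ins; an actual render runs in a fresh R session rather than in an environment object like this.

```r
console_env <- new.env()   # stand-in for the Console session
render_env  <- new.env()   # stand-in for the fresh session used when rendering

# "Run in the console": x is created in the console's environment only
assign("x", 2, envir = console_env)

# Where is x visible? inherits = FALSE looks in that one environment only
in_console <- exists("x", envir = console_env, inherits = FALSE)  # TRUE
in_render  <- exists("x", envir = render_env, inherits = FALSE)   # FALSE

# The fix: define x inside the document itself, then use it
assign("x", 2, envir = render_env)
fixed_result <- get("x", envir = render_env, inherits = FALSE) * 3  # 6
```

In practice the fix is simply to put x <- 2 in a chunk above the one that uses it, so the document carries everything it needs.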

{renv} for reproducible environments

Reproducible environments

Project-based workflows benefit from reproducible environments

  • Isolated
  • Portable
  • Reproducible

Global library management

  • All packages installed in a global library directory

    install.packages("dplyr")
  • Only one version of a package can be installed on a system at a time

  • Requires manual setup to ensure identical package versions are installed on multiple systems

  • Extreme hassle for collaborative workflows
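
To see the global library on your own machine, a short sketch using only functions that ship with R follows. The variable names are illustrative, and the paths printed will differ from system to system.

```r
# Directories R searches for installed packages; the first entry is the
# default install location used by install.packages()
lib_dirs <- .libPaths()
lib_dirs

# A library holds a single version of each package.
# {stats} ships with R, so this call works on any installation.
stats_version <- packageVersion("stats")
stats_version
```

Two collaborators running this could easily see different versions, which is the manual-setup problem described above.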

Reproducible environments with {renv}
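
A typical {renv} workflow revolves around three calls: init, snapshot, and restore. The sketch below wraps them in a hypothetical helper named renv_step so that nothing touches your project until you call it on purpose; the helper is an illustration, not part of {renv} or the course materials.

```r
# Hypothetical helper that groups the three core {renv} calls.
# Defining it changes nothing; a step only runs when you call it.
renv_step <- function(step = c("init", "snapshot", "restore")) {
  step <- match.arg(step)
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("Install {renv} first with install.packages(\"renv\")")
  }
  switch(step,
    init     = renv::init(),      # set up a project-specific library and lockfile
    snapshot = renv::snapshot(),  # record the package versions currently in use
    restore  = renv::restore()    # reinstall the versions recorded in the lockfile
  )
}
```

For example, renv_step("restore") in a freshly cloned project would install the package versions its lockfile records, assuming the project was set up with {renv}.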

The Bechdel test

In order to pass the test, a movie must have

  1. 👭 At least two named women in it
  2. 🗣️ Who talk to each other
  3. 🚫 About something besides a man
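
The three criteria combine with a logical "and": a movie passes only if every one holds. A small sketch, using a hypothetical helper and made-up inputs rather than the ae-00 data:

```r
# Hypothetical helper: returns TRUE only when all three criteria hold
passes_bechdel <- function(n_named_women, talk_to_each_other, about_non_man_topic) {
  n_named_women >= 2 && talk_to_each_other && about_non_man_topic
}

passes_bechdel(3, TRUE, TRUE)    # TRUE
passes_bechdel(2, TRUE, FALSE)   # FALSE: the conversation is about a man
passes_bechdel(1, FALSE, FALSE)  # FALSE: fewer than two named women
```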

ae-00

Instructions

  • Go to ae-00-bechdel.
  • Clone the repo in RStudio
  • Run renv::restore() to install the required packages
  • Open the Quarto document in the repo and follow along and complete the exercises.

Warning

ae-00 is hosted on GitHub.com because we have not configured your authentication method for Cornell’s GitHub. We will do this tomorrow in lab.

Toolkit: Version control and collaboration

Git and GitHub

  • Git is a version control system – like the “Track Changes” feature in Microsoft Word, on steroids
  • It’s not the only version control system, but it’s a very popular one

  • GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better

  • We will use GitHub (Enterprise) as a platform for web hosting and collaboration

Kevin McCallister in Home Alone cocking an air rifle and saying 'Don't get scared now'.

Git and GitHub tips

  • There are hundreds of Git commands – you don’t have to know them all. 99% of the time you will use Git to add, commit, push, and pull.
  • We will be doing Git things and interfacing with GitHub through RStudio. If you Google for help, you might come across methods for doing these things on the command line; skip those and move on to the next resource unless you feel comfortable trying them out.
  • There is a great resource for working with Git and R: happygitwithr.com. Some of the content in there is beyond the scope of this course, but it’s a good place to look for help.

Tour: Git + GitHub

  • In lab section
  • Make sure to access Cornell’s GitHub so I can add you to the course organization on GitHub

Wrap up

Recap

  • Use R for scriptable data science workflows
  • Literate programming with Quarto weaves together code, output, and narrative in a single document
  • Version control enables reproducibility and collaboration
  • Reproducible environments with {renv} keep package versions consistent across systems and collaborators

Acknowledgements

Taking #BookTok by storm

Book cover of 'Onyx Storm' by Rebecca Yarros