AE 02: Visualizing the prognosticators

Application exercise
Modified: January 30, 2025

Important

Go to the course GitHub organization and locate the repo titled ae-02-YOUR_GITHUB_USERNAME to get started.

This AE is due January 30 at 11:59pm.

For all analyses, we’ll use the {tidyverse} packages.

Data: The prognosticators

The dataset we will visualize is called seers.[1] It contains summary statistics for all known Groundhog Day forecasters.[2] Let’s glimpse() at it.

[1] I would prefer prognosticators, but I made way too many typos preparing these materials to make you all use it.

[2] Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists, originally published on FiveThirtyEight.

# load the tidyverse, which provides read_csv(), glimpse(), and ggplot2
library(tidyverse)
# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")
Rows: 154 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): name, forecaster_type, forecaster_simple, climate_region, town, state
dbl (11): preds_n, preds_long_winter, preds_long_winter_pct, preds_correct, ...
lgl  (1): alive

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(seers)
Rows: 154
Columns: 18
$ name                  <chr> "Allen McButterpants", "Arboretum Annie", "Babyl…
$ forecaster_type       <chr> "Groundhog", "Groundhog", "Groundhog Mascot", "S…
$ forecaster_simple     <chr> "Groundhog", "Groundhog", "Groundhog Mascot", "O…
$ alive                 <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,…
$ climate_region        <chr> "Northeast", "South", "Northeast", "Northeast", …
$ town                  <chr> "Hampton Bays", "Dallas", "Babylon", "Bridgeport…
$ state                 <chr> "NY", "TX", "NY", "CT", "TX", "OH", "TX", "TX", …
$ preds_n               <dbl> 2, 3, 1, 13, 14, 9, 8, 1, 1, 12, 2, 4, 13, 10, 1…
$ preds_long_winter     <dbl> 0, 1, 0, 1, 3, 4, 5, 1, 1, 6, 2, 2, 8, 9, 0, 1, …
$ preds_long_winter_pct <dbl> 0.00000000, 0.33333333, 0.00000000, 0.07692308, …
$ preds_correct         <dbl> 2, 2, 1, 10, 10, 6, 4, 0, 0, 5, 0, 2, 5, 2, 0, 2…
$ preds_rate            <dbl> 1.0000000, 0.6666667, 1.0000000, 0.7692308, 0.71…
$ temp_mean             <dbl> 33.70000, 50.18333, 35.05000, 31.43462, 51.57143…
$ temp_hist             <dbl> 30.31167, 51.32333, 30.47667, 29.63667, 50.99310…
$ temp_sd               <dbl> 4.154767, 3.908807, 4.154767, 4.154767, 3.908807…
$ precip_mean           <dbl> 3.007500, 2.768333, 3.620000, 3.059231, 2.577500…
$ precip_hist           <dbl> 3.0251667, 2.5588889, 3.0760000, 3.0700000, 2.56…
$ precip_sd             <dbl> 0.9715631, 0.8999887, 0.9715631, 0.9715631, 0.89…

The variables are:

  • name - name of the prognosticator
  • forecaster_type - what kind of animal or thing is the prognosticator?
  • forecaster_simple - a simplified version that lumps together the least-frequently appearing types of prognosticators
  • alive - is the prognosticator an animate (alive) being?[3]
  • climate_region - the NOAA climate region in which the prognosticator is located
  • town - town where the prognosticator is located
  • state - state (or territory) where prognosticator is located
  • preds_n - number of predictions in the database
  • preds_long_winter - number of predictions for a “Late Winter” (as opposed to “Early Spring”)
  • preds_long_winter_pct - proportion of predictions for a “Late Winter” (stored as a value between 0 and 1, not a percentage)
  • preds_correct - number of correct predictions[4]
  • preds_rate - proportion of predictions that are correct
  • temp_mean - average temperature (in degrees Fahrenheit) in February and March in the climate region across all prognostication years
  • temp_hist - average of the rolling 15-year historic average temperature in February and March across all prognostication years
  • temp_sd - standard deviation of average February and March temperatures across all prognostication years
  • precip_mean - average amount of precipitation (in inches of rainfall) in February and March across all prognostication years
  • precip_hist - average of the rolling 15-year historic average precipitation in February and March across all prognostication years
  • precip_sd - standard deviation of average February and March precipitation across all prognostication years
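As a quick check of how the rate columns relate to the count columns, the sketch below rebuilds preds_rate and preds_long_winter_pct for the first four rows shown in the glimpse() output. The toy tibble and the *_check column names are illustrative and not part of the dataset, and the code assumes both columns are simple ratios over preds_n.

```r
library(dplyr)

# Count columns for the first four rows, copied from the glimpse() output
toy <- tibble(
  preds_n           = c(2, 3, 1, 13),
  preds_long_winter = c(0, 1, 0, 1),
  preds_correct     = c(2, 2, 1, 10)
)

# Assumption: each rate is the matching count divided by preds_n
toy_checked <- toy |>
  mutate(
    preds_rate_check            = preds_correct / preds_n,
    preds_long_winter_pct_check = preds_long_winter / preds_n
  )

toy_checked
```

The recomputed values can be compared by eye with the preds_rate and preds_long_winter_pct entries printed by glimpse().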

[3] Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.

[4] We adopt the same definition as FiveThirtyEight. An “Early Spring” is any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” is any year in which the average temperature in both months was lower than or the same as the historic average.
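The “Early Spring” versus “Late Winter” rule can be sketched as a small function. The helper classify_season(), its argument names, and the temperatures passed to it are all hypothetical and serve only to illustrate the definition; this is not the code used to build the dataset.

```r
# Hypothetical helper: label each year from its February and March mean
# temperatures and the matching historic averages (all in the same units).
classify_season <- function(feb_temp, mar_temp, feb_hist, mar_hist) {
  # "Early Spring": either month was warmer than its historic average
  early_spring <- feb_temp > feb_hist | mar_temp > mar_hist
  # Otherwise both months were at or below average: "Late Winter"
  ifelse(early_spring, "Early Spring", "Late Winter")
}

# Made-up temperatures for three years
classify_season(
  feb_temp = c(34, 30, 31),
  mar_temp = c(40, 41, 38),
  feb_hist = c(32, 32, 31),
  mar_hist = c(41, 41, 39)
)
```

Under this rule a tie with the historic average does not count as warmer, so the second and third made-up years are labeled “Late Winter”.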

Visualizing prediction success rate

Single variable - Demo

Note

Analyzing a single variable is called univariate analysis.

Create visualizations of the distribution of preds_rate for the prognosticators.

  1. Make a histogram. Set an appropriate binwidth.
# add code here
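To see what binwidth controls, here is a minimal sketch on made-up data rather than on seers. The binwidth argument is expressed in the units of the x variable, so for a proportion between 0 and 1 a width of 0.1 gives roughly ten bins. The toy data frame, the column name rate, and the choice of 0.1 are illustrative assumptions.

```r
library(ggplot2)

# Made-up proportions between 0 and 1
toy <- data.frame(
  rate = c(0.05, 0.2, 0.25, 0.4, 0.45, 0.5, 0.5, 0.6, 0.75, 1)
)

# binwidth sets the width of every bar on the x scale;
# boundary = 0 aligns bin edges with 0, 0.1, 0.2, ...
p_hist <- ggplot(toy, aes(x = rate)) +
  geom_histogram(binwidth = 0.1, boundary = 0)

p_hist
```

Trying a few values (for example 0.05 or 0.25) shows the trade-off: narrow bins give a noisy picture, while wide bins hide the shape of the distribution.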

Two variables - Your turn

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive). For each plot, write a 1-2 sentence interpretation of the graph and what you learn from it.

  1. Use multiple histograms via faceting, one for each value of alive, and set an appropriate binwidth.
# add code here

Add response here.

  1. Make a color-coded frequency polygon with geom_freqpoly(), mapping alive to the color channel. Set an appropriate binwidth.
# add code here

Add response here.

  1. Use a density plot. Add color as you see fit.
# add code here

Add response here.

  1. Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
Warning

You do not need to use faceting to create a box plot. Simply map the categorical and continuous variables to the appropriate axes and they are automatically drawn side-by-side.
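As a minimal sketch of that mapping on made-up data (not seers), the categorical variable goes on one axis and the continuous variable on the other. The data frame toy and its columns group and value are hypothetical.

```r
library(ggplot2)

# Made-up data: a two-level categorical variable and a numeric variable
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(0.2, 0.4, 0.5, 0.6, 0.9, 0.1, 0.3, 0.35, 0.5, 0.7)
)

# One box per level of group is drawn automatically, side by side.
# The fill repeats what the x-axis already shows, so its legend is hidden.
p_box <- ggplot(toy, aes(x = group, y = value, fill = group)) +
  geom_boxplot(show.legend = FALSE)

p_box
```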

# add code here

Add response here.

  1. Use a violin plot. Add color as you see fit and turn off legends if not needed.
# add code here

Add response here.

  1. Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
# add code here

Add response here.

  1. Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)

# add code here

Add response here.

  1. Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
# add code here

Add response here.
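Layer order matters because ggplot2 draws geoms in the order they are added, so later layers sit on top of earlier ones. The sketch below uses made-up data (the data frame toy and the columns group and value are hypothetical) and is only one of many reasonable combinations of geoms, theme, and color scale.

```r
library(ggplot2)

# Made-up data: two groups with six observations each
toy <- data.frame(
  group = rep(c("A", "B"), each = 6),
  value = c(0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.1, 0.3, 0.35, 0.5, 0.6, 0.7)
)

p_layers <- ggplot(toy, aes(x = group, y = value)) +
  # Drawn first, so the violins sit underneath the points
  geom_violin(aes(fill = group), alpha = 0.5, show.legend = FALSE) +
  # Drawn second, so the points stay visible; jitter horizontally only
  geom_jitter(width = 0.1, height = 0) +
  scale_fill_viridis_d(end = 0.8) +
  theme_minimal() +
  labs(
    title = "Made-up values by group",
    x = "Group",
    y = "Value"
  )

p_layers
```

Swapping the two geom lines would draw the violins over the points and partly obscure them.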

Multiple variables - Demo

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

  1. Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.
# add code here

Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by adding these execute options to the YAML header.

---
execute:
  fig-width: 8
  fig-height: 4
  warning: false
---

Visualizing other variables - Your turn!

  1. Pick a single categorical variable from the data set and make a bar plot of its distribution. What do you learn?
# add code here

Add response here.

  1. Pick two categorical variables and make a plot that shows the relationship between them. Along with your code and output, provide an interpretation of the visualization.
# add code here

Add response here.

  1. Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here

Add response here.