AE 02: Visualizing the prognosticators
Go to the course GitHub organization and locate the repo titled ae-02-YOUR_GITHUB_USERNAME to get started.
This AE is due January 30 at 11:59pm.
For all analyses, we’ll use the {tidyverse} packages.
Data: The prognosticators
The dataset we will visualize is called seers.[1] It contains summary statistics for all known Groundhog Day forecasters.[2] Let’s glimpse() at it.
[1] I would prefer prognosticators, but I had way too many typos preparing these materials to make you all use it.
[2] Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists, originally published on FiveThirtyEight.
# load the tidyverse, which provides read_csv() and glimpse()
library(tidyverse)

# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")
Rows: 154 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): name, forecaster_type, forecaster_simple, climate_region, town, state
dbl (11): preds_n, preds_long_winter, preds_long_winter_pct, preds_correct, ...
lgl (1): alive
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(seers)
Rows: 154
Columns: 18
$ name <chr> "Allen McButterpants", "Arboretum Annie", "Babyl…
$ forecaster_type <chr> "Groundhog", "Groundhog", "Groundhog Mascot", "S…
$ forecaster_simple <chr> "Groundhog", "Groundhog", "Groundhog Mascot", "O…
$ alive <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,…
$ climate_region <chr> "Northeast", "South", "Northeast", "Northeast", …
$ town <chr> "Hampton Bays", "Dallas", "Babylon", "Bridgeport…
$ state <chr> "NY", "TX", "NY", "CT", "TX", "OH", "TX", "TX", …
$ preds_n <dbl> 2, 3, 1, 13, 14, 9, 8, 1, 1, 12, 2, 4, 13, 10, 1…
$ preds_long_winter <dbl> 0, 1, 0, 1, 3, 4, 5, 1, 1, 6, 2, 2, 8, 9, 0, 1, …
$ preds_long_winter_pct <dbl> 0.00000000, 0.33333333, 0.00000000, 0.07692308, …
$ preds_correct <dbl> 2, 2, 1, 10, 10, 6, 4, 0, 0, 5, 0, 2, 5, 2, 0, 2…
$ preds_rate <dbl> 1.0000000, 0.6666667, 1.0000000, 0.7692308, 0.71…
$ temp_mean <dbl> 33.70000, 50.18333, 35.05000, 31.43462, 51.57143…
$ temp_hist <dbl> 30.31167, 51.32333, 30.47667, 29.63667, 50.99310…
$ temp_sd <dbl> 4.154767, 3.908807, 4.154767, 4.154767, 3.908807…
$ precip_mean <dbl> 3.007500, 2.768333, 3.620000, 3.059231, 2.577500…
$ precip_hist <dbl> 3.0251667, 2.5588889, 3.0760000, 3.0700000, 2.56…
$ precip_sd <dbl> 0.9715631, 0.8999887, 0.9715631, 0.9715631, 0.89…
The variables are:
- name: name of the prognosticator
- forecaster_type: what kind of animal or thing is the prognosticator?
- forecaster_simple: a simplified version of forecaster_type that lumps together the least frequently appearing types of prognosticators
- alive: is the prognosticator an animate (alive) being?[3]
- climate_region: the NOAA climate region in which the prognosticator is located
- town: town where the prognosticator is located
- state: state (or territory) where the prognosticator is located
- preds_n: number of predictions in the database
- preds_long_winter: number of predictions for a “Late Winter” (as opposed to “Early Spring”)
- preds_long_winter_pct: proportion of predictions for a “Late Winter” (stored as a value between 0 and 1)
- preds_correct: number of correct predictions[4]
- preds_rate: proportion of predictions that are correct
- temp_mean: average temperature (in Fahrenheit) in February and March in the climate region across all prognostication years
- temp_hist: average of the rolling 15-year historic average temperature in February and March across all prognostication years
- temp_sd: standard deviation of average February and March temperatures across all prognostication years
- precip_mean: average amount of precipitation in February and March across all prognostication years (measured in inches of rainfall)
- precip_hist: average of the rolling 15-year historic average precipitation in February and March across all prognostication years
- precip_sd: standard deviation of average February and March precipitation across all prognostication years
[3] Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.
[4] We adopt the same definition as FiveThirtyEight. An “Early Spring” is any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” is any year in which the average temperature in both months was lower than or the same as the historic average.
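The season definition and the preds_rate variable can be sketched in a few lines of R. This is a minimal illustration, not the code used to build the dataset: the function name classify_season() and its four temperature arguments are hypothetical, and the temperatures passed to it are made up. The preds_n and preds_correct values are the first four entries shown in the glimpse() output.

```r
# Hypothetical helper: label one year using the FiveThirtyEight definition.
# feb, mar           = observed average temperatures for February and March
# feb_hist, mar_hist = the corresponding historic averages
classify_season <- function(feb, mar, feb_hist, mar_hist) {
  # "Early Spring" if EITHER month was warmer than its historic average;
  # otherwise both months were at or below average, i.e. "Late Winter".
  ifelse(feb > feb_hist | mar > mar_hist, "Early Spring", "Late Winter")
}

# Made-up temperatures, for illustration only
season_warm_march <- classify_season(feb = 30, mar = 41, feb_hist = 31, mar_hist = 40)
season_cold_both  <- classify_season(feb = 30, mar = 40, feb_hist = 31, mar_hist = 40)

# preds_rate is the share of correct predictions: preds_correct / preds_n
preds_n       <- c(2, 3, 1, 13)
preds_correct <- c(2, 2, 1, 10)
preds_rate    <- preds_correct / preds_n
```

The second call shows the boundary case: a March average exactly equal to the historic average does not count as warmer, so the year is still a “Late Winter”.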
Visualizing prediction success rate
Single variable - Demo
Analyzing a single variable is called univariate analysis.
Create visualizations of the distribution of preds_rate for the prognosticators.
- Make a histogram. Set an appropriate binwidth.
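One possible starting point for the histogram, sketched on only the first eight prognosticators visible in the glimpse() output so that it runs without the CSV file. The object name seers_head and the binwidth of 0.1 are assumptions for illustration; with the full seers data you would try several binwidths and keep the one that shows the shape of the distribution most clearly.

```r
library(ggplot2)

# First eight entries of preds_n and preds_correct from the glimpse() output
seers_head <- data.frame(
  preds_n       = c(2, 3, 1, 13, 14, 9, 8, 1),
  preds_correct = c(2, 2, 1, 10, 10, 6, 4, 0)
)
seers_head$preds_rate <- seers_head$preds_correct / seers_head$preds_n

# preds_rate runs from 0 to 1, so a binwidth of 0.1 gives roughly ten bins
p_hist <- ggplot(seers_head, aes(x = preds_rate)) +
  geom_histogram(binwidth = 0.1, color = "white")

p_hist
```

A much smaller binwidth (say 0.01) would leave most bins empty, while a much larger one (say 0.5) would hide the shape entirely.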
# add code here
Two variables - Your turn
Analyzing the relationship between two variables is called bivariate analysis.
Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive). For each plot, write a 1-2 sentence interpretation of the graph and what you learn from it.
- Use multiple histograms via faceting, one for each type, and set an appropriate binwidth.
# add code here
Add response here.
- Make a color-coded frequency polygon with geom_freqpoly(), mapping alive to the color channel. Set an appropriate binwidth.
# add code here
Add response here.
- Use a density plot. Add color as you see fit.
# add code here
Add response here.
- Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
You do not need to use faceting to create a box plot. Simply map the categorical and continuous variables to the appropriate axes and they are automatically drawn side-by-side.
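The tip can be sketched as follows, using only the first eight rows visible in the glimpse() output (seers_head is a hypothetical name, and the fill color is an illustrative choice). The categorical variable alive goes on one axis and the continuous preds_rate on the other; no faceting function is involved.

```r
library(ggplot2)

# First eight entries of alive, preds_n, and preds_correct from glimpse()
seers_head <- data.frame(
  alive         = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  preds_n       = c(2, 3, 1, 13, 14, 9, 8, 1),
  preds_correct = c(2, 2, 1, 10, 10, 6, 4, 0)
)
seers_head$preds_rate <- seers_head$preds_correct / seers_head$preds_n

# A discrete x and a continuous y: geom_boxplot() draws one box per level of alive.
# The legend is turned off because the x axis already identifies the groups.
p_box <- ggplot(seers_head, aes(x = alive, y = preds_rate, fill = alive)) +
  geom_boxplot(show.legend = FALSE)

p_box
```

With only one non-living prognosticator in these eight rows, its box collapses to a single line; the full dataset gives both groups enough observations to compare.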
# add code here
Add response here.
- Use a violin plot. Add color as you see fit and turn off legends if not needed.
# add code here
Add response here.
- Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
# add code here
Add response here.
- Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)
# add code here
Add response here.
- Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
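A minimal sketch of layering, assuming the same eight rows taken from the glimpse() output (seers_head is a hypothetical name, and the theme, color scale, and label text are illustrative choices rather than the course's solution). Layers are drawn in the order they are added, so the box plot goes first and the jittered points are drawn on top of it.

```r
library(ggplot2)

# First eight entries of alive, preds_n, and preds_correct from glimpse()
seers_head <- data.frame(
  alive         = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  preds_n       = c(2, 3, 1, 13, 14, 9, 8, 1),
  preds_correct = c(2, 2, 1, 10, 10, 6, 4, 0)
)
seers_head$preds_rate <- seers_head$preds_correct / seers_head$preds_n

p_layers <- ggplot(seers_head, aes(x = alive, y = preds_rate, color = alive)) +
  # Layer 1: box plot; outliers hidden so no point is plotted twice
  geom_boxplot(outlier.shape = NA) +
  # Layer 2: raw points on top, jittered horizontally only
  geom_jitter(width = 0.1, height = 0) +
  # Discrete viridis palette; legend dropped because x already shows alive
  scale_color_viridis_d(end = 0.8, guide = "none") +
  theme_minimal() +
  labs(
    title = "Prediction success rate by prognosticator status",
    x = "Alive",
    y = "Proportion of correct predictions"
  )

p_layers
```

Swapping the two geom lines would draw the boxes over the points and partly hide them, which is why the order matters.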
# add code here
Add response here.
Multiple variables - Demo
Analyzing the relationship between three or more variables is called multivariate analysis.
- Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.
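Faceting brings in a third variable by splitting one plot into panels. The sketch below runs on synthetic data, not the seers results: the toy data frame, its randomly generated preds_rate values, and the two forecaster_simple labels ("Type A", "Type B") are all made up for illustration.

```r
library(ggplot2)

# Synthetic data, generated only to make the example runnable
set.seed(1)
toy <- data.frame(
  forecaster_simple = rep(c("Type A", "Type B"), each = 20),
  alive             = rep(c(TRUE, FALSE), times = 20),
  preds_rate        = runif(40)  # random values between 0 and 1
)

p_facet <- ggplot(toy, aes(x = alive, y = preds_rate, color = alive)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, height = 0) +
  # One panel per value of forecaster_simple
  facet_wrap(vars(forecaster_simple)) +
  labs(
    x = "Alive",
    y = "Proportion of correct predictions",
    color = "Alive"
  )

p_facet
```

Each panel now shows preds_rate by alive for one forecaster type, so three variables appear on a single figure.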
# add code here
Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by adding these execute options to the YAML header.
---
execute:
  fig-width: 8
  fig-height: 4
  warning: false
---
Visualizing other variables - Your turn!
- Pick a single categorical variable from the data set and make a bar plot of its distribution. What do you learn?
# add code here
Add response here.
- Pick two categorical variables and make a plot that shows the relationship between them. Along with your code and output, provide an interpretation of the visualization.
# add code here
Add response here.
- Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.
# add code here
Add response here.