AE 01: Visualizing college athletics financial data

Application exercise
Modified: January 27, 2026

Important

Go to the course GitHub organization and locate the repo titled ae-01-YOUR_GITHUB_USERNAME to get started.

This AE is due January 27 at 11:59pm.

For all analyses, we’ll use the {tidyverse} packages.

Data: The Knight-Newhouse College Athletics Database

College athletics in the United States is big business. The Knight Commission on Intercollegiate Athletics is an independent organization that promotes the education, health, safety and success of college athletes. As part of this mission, the Commission maintains a database of college athletics finances for all public schools that compete in Division I (the top college athletics division) [1]. The data include information on revenues and expenses for each school’s athletics department, as well as information on student fees, enrollment, and other institutional characteristics.

We will analyze the 2024 fiscal year data for all public Division I schools that operate a football program. The data are stored in the file data/ncaa-finances.csv.

library(tidyverse)
# import data using readr::read_csv()
ncaa <- read_csv("data/ncaa-finances.csv")
Rows: 193 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): school, stabbr, number_of_sports_teams, ncaa_subdivision, fbs_conf...
dbl (28): fips, ipeds_id, year, total_unduplicated_athletes, other_revenue, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ncaa)
Rows: 193
Columns: 34
$ school                                                              <chr> "A…
$ fips                                                                <dbl> 1,…
$ stabbr                                                              <chr> "A…
$ ipeds_id                                                            <dbl> 10…
$ year                                                                <dbl> 20…
$ total_unduplicated_athletes                                         <dbl> 35…
$ number_of_sports_teams                                              <chr> "1…
$ ncaa_subdivision                                                    <chr> "F…
$ fbs_conference                                                      <chr> NA…
$ p4                                                                  <chr> NA…
$ other_revenue                                                       <dbl> 38…
$ corporate_sponsorship_advertising_licensing                         <dbl> 22…
$ donor_contributions                                                 <dbl> 12…
$ competition_guarantees_revenues                                     <dbl> 73…
$ conference_ncaa_distributions_media_rights_and_post_season_football <dbl> 84…
$ ticket_sales                                                        <dbl> 11…
$ institutional_government_support                                    <dbl> 12…
$ student_fees                                                        <dbl> 19…
$ total_revenue                                                       <dbl> 17…
$ allocated_revenue                                                   <dbl> 13…
$ allocated_revenue_pct                                               <dbl> 0.…
$ student_athlete_meals_non_travel                                    <dbl> 21…
$ excess_transfers_back                                               <dbl> 0,…
$ total_coaching_severance                                            <dbl> 0,…
$ other_expenses                                                      <dbl> 23…
$ medical                                                             <dbl> 0,…
$ competition_guarantees_expenses                                     <dbl> 85…
$ recruiting                                                          <dbl> 0,…
$ game_expenses_and_travel                                            <dbl> 23…
$ facilities_debt_service_and_equipment                               <dbl> 46…
$ coaches_compensation                                                <dbl> 21…
$ non_coaching_athletics_staff_compensation                           <dbl> 16…
$ athletic_student_aid                                                <dbl> 39…
$ total_expense                                                       <dbl> 17…

The database includes many more variables than we need. We will focus on a small subset of them for this exercise:

  • ncaa_subdivision: Whether the school competes in the Football Bowl Subdivision (FBS) or the Football Championship Subdivision (FCS) (previously classified as Division I-A and Division I-AA, respectively).
  • total_expense: Total expenses paid by the athletics department (in USD).
  • number_of_sports_teams: Number of sports teams sponsored by the athletics department.

We provide the remaining variables in the dataset for your own exploration.
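
As a quick orientation, the focus variables can be pulled out with select() and tallied with count(). The sketch below is only illustrative: `toy` is a small made-up stand-in for `ncaa` so that the block runs without the CSV file. Its school names and dollar values are invented, and the number_of_sports_teams labels are placeholders, since the real categories of that column may differ.

```r
library(tidyverse)

# Made-up stand-in for `ncaa`; every value here is invented for illustration
toy <- tibble(
  school = c("School A", "School B", "School C", "School D"),
  ncaa_subdivision = c("FBS", "FBS", "FCS", "FCS"),
  total_expense = c(150e6, 90e6, 25e6, 18e6),
  # stored as character, like the real column; labels are placeholders
  number_of_sports_teams = c("more teams", "more teams", "fewer teams", "fewer teams")
)

# Keep only the school name and the three focus variables
focus <- toy |>
  select(school, ncaa_subdivision, total_expense, number_of_sports_teams)

# Number of schools in each subdivision
subdivision_counts <- focus |>
  count(ncaa_subdivision)
subdivision_counts
```

With the real data, replacing `toy` with `ncaa` should give the same kind of summary.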

Visualizing total expenses

Single variable - Demo

Note

Analyzing a single variable is called univariate analysis.

Create visualizations of the distribution of total_expense (sum of all expenses paid by a school’s athletics department). Write a 1-2 sentence interpretation of the graph and what you learn from it.

  1. Make a histogram. Set an appropriate binwidth.
# add code here

Add response here.
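
For reference, here is a minimal sketch of a histogram with an explicit binwidth. It assumes the tidyverse is installed and uses simulated values in place of `ncaa` (random draws, not real finances). The binwidth of 10 million dollars is only a starting guess; the right choice depends on the spread of the actual data.

```r
library(tidyverse)

# Simulated stand-in for `ncaa`; random draws, not real finances
set.seed(123)
toy <- tibble(
  total_expense = c(rlnorm(60, log(25e6), 0.4), rlnorm(60, log(90e6), 0.5))
)

# binwidth is expressed in the units of x (USD here), so 10e6 means
# each bar covers a 10 million dollar range
p_hist <- ggplot(toy, aes(x = total_expense)) +
  geom_histogram(binwidth = 10e6)
p_hist
```

Trying a few different binwidths is usually worthwhile: too narrow and the plot is noisy, too wide and the shape of the distribution is hidden.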

Two variables - Your turn

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of total_expense by ncaa_subdivision (whether the school competes in the Football Bowl Subdivision or the Football Championship Subdivision). For each plot, write a 1-2 sentence interpretation of the graph and what you learn from it.

  1. Use multiple histograms via faceting, one for each subdivision, and set an appropriate binwidth.
# add code here

Add response here.

  1. Make a color-coded frequency polygon with geom_freqpoly(), mapping ncaa_subdivision to the color channel. Set an appropriate binwidth.
# add code here

Add response here.

  1. Use a density plot. Add color as you see fit.
# add code here

Add response here.

  1. Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.
Warning

You do not need to use faceting to create a box plot. Simply map the categorical and continuous variables to the appropriate axes and the boxes are automatically drawn side by side.
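
To illustrate the point about axes, the sketch below maps a categorical variable to x and a continuous variable to y, with no faceting involved. The data are simulated stand-ins (random draws, not real finances), and the choice of fill color is just one option.

```r
library(tidyverse)

# Simulated stand-in for `ncaa`; random draws, not real finances
set.seed(123)
toy <- tibble(
  ncaa_subdivision = rep(c("FBS", "FCS"), each = 60),
  total_expense = c(rlnorm(60, log(90e6), 0.5), rlnorm(60, log(25e6), 0.4))
)

# One box per category of x; the fill repeats the x-axis information,
# so the legend is switched off
p_box <- ggplot(
  toy,
  aes(x = ncaa_subdivision, y = total_expense, fill = ncaa_subdivision)
) +
  geom_boxplot(show.legend = FALSE)
p_box
```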

# add code here

Add response here.

  1. Use a violin plot. Add color as you see fit and turn off legends if not needed.
# add code here

Add response here.

  1. Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.
# add code here

Add response here.

  1. Use beeswarm plots. Add color as you see fit and turn off legends if not needed.
library(ggbeeswarm)

# add code here

Add response here.

  1. Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.
# add code here
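
One possible version of this demonstration is sketched below; it is not the only reasonable combination of geoms. It layers jittered points over violins, so the violins are added first and the points second. The data are simulated stand-ins (random draws, not real finances), and the theme, color scale, and label text are illustrative choices.

```r
library(tidyverse)

# Simulated stand-in for `ncaa`; random draws, not real finances
set.seed(123)
toy <- tibble(
  ncaa_subdivision = rep(c("FBS", "FCS"), each = 60),
  total_expense = c(rlnorm(60, log(90e6), 0.5), rlnorm(60, log(25e6), 0.4))
)

p_multi <- ggplot(
  toy,
  aes(x = ncaa_subdivision, y = total_expense, color = ncaa_subdivision)
) +
  # added first, so it is drawn underneath the points
  geom_violin(show.legend = FALSE) +
  # added second, so the points sit on top of the violins
  geom_jitter(width = 0.1, alpha = 0.5, show.legend = FALSE) +
  # colorblind-friendly discrete palette
  scale_color_viridis_d(end = 0.8) +
  # show the y axis in millions of dollars
  scale_y_continuous(labels = scales::label_dollar(scale = 1e-6, suffix = "M")) +
  theme_minimal() +
  labs(
    title = "Total athletics expenses by subdivision (simulated data)",
    x = "NCAA subdivision",
    y = "Total expenses"
  )
p_multi
```

Reversing the order of the two geoms would draw the violins over the points, which matters more when the top layer has a fill.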

Multiple variables - Demo

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

  1. Facet the plot you created in the previous exercise by number_of_sports_teams. Adjust labels accordingly.
# add code here
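
A hedged sketch of the faceting step follows, using facet_wrap() on a violin-plus-jitter plot. The data are simulated stand-ins (random draws, not real finances), and the number_of_sports_teams labels are placeholders because the real categories of that column may differ.

```r
library(tidyverse)

# Simulated stand-in for `ncaa`; random draws, not real finances
set.seed(123)
toy <- tibble(
  ncaa_subdivision = rep(c("FBS", "FCS"), each = 60),
  # placeholder labels; the real column's categories may differ
  number_of_sports_teams = rep(c("fewer teams", "more teams"), times = 60),
  total_expense = c(rlnorm(60, log(90e6), 0.5), rlnorm(60, log(25e6), 0.4))
)

p_facet <- ggplot(
  toy,
  aes(x = ncaa_subdivision, y = total_expense, color = ncaa_subdivision)
) +
  geom_violin(show.legend = FALSE) +
  geom_jitter(width = 0.1, alpha = 0.5, show.legend = FALSE) +
  # one panel per category of number_of_sports_teams
  facet_wrap(vars(number_of_sports_teams)) +
  theme_minimal() +
  labs(
    title = "Total athletics expenses by subdivision and number of teams",
    subtitle = "Simulated data",
    x = "NCAA subdivision",
    y = "Total expenses (USD)"
  )
p_facet
```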

Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by adding these execute options to the YAML header.

---
execute:
  fig-width: 8
  fig-height: 4
  warning: false
---

Footnotes

  1. The database does not include private universities such as Cornell, which are not required to publish their financial data.