HW 01 - Data visualization
This homework is due February 4 at 11:59pm ET.
Learning objectives
- Visualize numeric and categorical data
- Transform scales using interpretable methods
- Select optimal chart types for specific variables
- Utilize reference documentation
Getting started
Go to the info2951-sp26 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.
Packages
library(tidyverse)
General guidance
As we’ve discussed in lecture, your plots should include an informative title and labeled axes, and you should give careful consideration to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course. Periodic reminders in this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
Do not let R output answer the question for you unless the question specifically asks for just a plot. For example, if the question asks for the number of columns in the data set, please type out the number of columns. If you are generating a plot, include a written interpretation of the graph. Minimal effort will likely earn minimal credit.
Part 1: More college athletics finance data
Use this dataset for Exercises 1-3.
We previously worked with data from the Knight Commission on Intercollegiate Athletics on the finances of NCAA Division I athletic programs. For this homework we will extend our analysis to additional variables and visualizations.
The dataset can be found in the data folder of your repo. It is called ncaa-finances.csv.
ncaa <- read_csv("data/ncaa-finances.csv")
Rows: 193 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): school, stabbr, number_of_sports_teams, ncaa_subdivision, fbs_conf...
dbl (28): fips, ipeds_id, year, total_unduplicated_athletes, other_revenue, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
read_csv() produces a bunch of text output.
This is just the default behavior of read_csv() as it summarizes how it has imported the data file. We will learn more about data types in a couple of weeks. For now, if you don’t like seeing the message in your rendered document you can disable it using the code chunk option message: false, like below:
```{r}
#| message: false
ncaa <- read_csv("data/ncaa-finances.csv")
```
As with the application exercise, we focus on the 2024 fiscal year data for all public Division I schools that operate a football program.
Exercise 1
College athletics burden on students. College athletics programs rely on a range of revenue sources to fund their operations. While some athletics departments are completely self-sufficient (generating all revenue directly through media rights, event tickets, donations, etc.), many college athletics programs rely on student fees to help fund their operations. The NCAA defines student fees as fees “paid by students and allocated for the restricted use of the athletics department.”
College athletics programs are organized into distinct conferences.[1] Within the Football Bowl Subdivision (FBS), some conferences are considered “Power 4” conferences due to their larger media contracts and greater overall revenues. These include the Atlantic Coast Conference (ACC), Big Ten Conference, Big 12 Conference, and Southeastern Conference (SEC). Other conferences are considered “Group of 6” conferences.
- Create a boxplot comparing the distribution of student fees (student_fees) by conference type (p4) for only FBS schools (i.e. exclude non-FBS schools with NA values for p4).
- Then, plot p4 vs. student_fees.
- Include informative title and axis labels.
- Finally, include a brief (2-3 sentence) narrative comparing the distributions of student fees between Power 4 and Group of 6 schools.
By default ggplot() will include an NA category when plotting categorical variables with missing values. To exclude these missing values from your plot, you can use the drop_na() function from {tidyr} to remove the rows with missing values. For example, to drop rows with missing values in the p4 column you can do
ncaa |>
  drop_na(p4) |>
  ggplot(...)
Now is a good time to render, commit, and push.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
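To see what drop_na() does in isolation, here is a minimal sketch on toy data. The tibble, its column names (group, value), and its values are made up for illustration and are not part of ncaa-finances.csv. It shows that naming a column in drop_na() removes only the rows where that column is missing.

```r
library(tidyverse)

# Toy data (hypothetical values, not from ncaa-finances.csv)
toy <- tibble(
  group = c("a", "b", NA, "a"),
  value = c(1, 2, 3, NA)
)

# Drop rows with a missing `group` only; the NA in `value` is kept
toy_complete <- toy |>
  drop_na(group)

nrow(toy)
nrow(toy_complete)
```

Calling drop_na() with no column names would instead drop every row that has a missing value in any column, which can remove more data than intended.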
Exercise 2
Coaches’ compensation and institutional funding sources. Coaches’ compensation is often one of the largest expenses for college athletics programs. However, the sources of revenue used to pay for coaches’ salaries can vary widely between schools. Some schools generate a large portion of their athletics revenue directly from the educational institution (e.g. state government allocations, student fees), while others rely more heavily on external revenue sources such as ticket sales, media rights, and donations.
One critique of high coaching salaries is that they divert funds away from educational institutions, especially public universities that receive state funding. But is this actually the case? Do schools that receive a greater percentage of their athletics revenue directly from the educational institution tend to pay their coaches more?
Construct a scatterplot to visualize the relationship between the percentage of athletics revenue from the educational institution (allocated_revenue_pct) and total coaches’ compensation (coaches_compensation). Include a smoothing trend line to summarize the relationship. Remember to include informative title and axis labels. Finally, include a brief (2-3 sentence) narrative commenting on the relationship between these two variables.
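As a reminder of the general pattern for a scatterplot with a smoothing trend line, here is a minimal sketch. It uses the mpg dataset that ships with {ggplot2} as a stand-in, so the variables and labels are not those of the homework data, and the choice of a loess smoother is an assumption rather than a requirement.

```r
library(tidyverse)

# `mpg` ships with ggplot2; it stands in for the homework data here
p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # One point per car model
  geom_point(alpha = 0.5) +
  # Smoothing trend line; loess is one option among several
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(
    title = "Larger engines tend to have lower highway fuel economy",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon"
  )

p_smooth
```

Setting method and formula explicitly avoids the message geom_smooth() prints when it picks a smoother on its own.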
Now is a good time to render, commit, and push.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
Choose your own adventure. Select (at least) two variables from the dataset not previously visualized in Exercises 1 and 2. Create a visualization to explore the relationship between these two variables. You may choose any type of plot you think is appropriate (scatterplot, bar chart, histogram, boxplot, etc.). Ensure the plot adheres to best practices for interpretability. Finally, include a brief (1 paragraph) narrative explaining why you selected these variables and comment on the relationship you observe in the visualization.
Now is a good time to render, commit, and push.
Part 2: BRFSS
Use this dataset for Exercises 4-6.
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Source: cdc.gov/brfss
In the following exercises we will work with data from the 2024 BRFSS survey. The original survey contains over 300 columns and 400k+ rows. These have already been sampled for you and the dataset you’ll use can be found in the data folder of your repo. It’s called brfss-24.csv.
brfss <- read_csv("data/brfss-24.csv")
Exercise 4
- How many rows are in the brfss dataset? What does each row represent?
- How many columns are in the brfss dataset? Indicate the type of each variable.
- Include the code and resulting output used to support your answer.
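A few functions are commonly used to inspect the size and column types of a data frame. The sketch below uses the starwars dataset that ships with {dplyr} as a stand-in; the object names n_rows and n_cols are illustrative only.

```r
library(tidyverse)

# `starwars` ships with dplyr; it stands in for the homework data here
n_rows <- nrow(starwars)
n_cols <- ncol(starwars)

# glimpse() prints the dimensions plus each column's name, type,
# and first few values
glimpse(starwars)
```

Remember that the code output supports your answer but does not replace it; you still need to state the numbers and types in writing.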
Now is a good time to render, commit, and push.
Exercise 5
Do people who use marijuana more frequently tend to be satisfied with life?
- Use a segmented bar chart (also known as a standardized or filled bar chart) to visualize the relationship between marijuana usage (marijuana) and satisfaction with life (satisfaction). Decide which variable to represent with bars and which variable to use for the fill color of the bars.
- Pay attention to the order of the bars and, if need be, use the fct_relevel function to reorder the levels of the variables.
  - Below is sample code for releveling satisfaction. Here we first convert satisfaction to a factor (how R stores categorical data) and then order the levels from Very satisfied to Very dissatisfied.
# Save the modified data frame as brfss
brfss <- brfss |>
  # Modify a column of data
  mutate(
    satisfaction = as.factor(satisfaction),
    # Ensure specific ordering of satisfaction when plotted
    satisfaction = fct_relevel(
      .f = satisfaction,
      "Very satisfied",
      "Satisfied",
      "Dissatisfied",
      "Very dissatisfied"
    )
  )
- Include informative title, axis, and legend labels.
- Comment on the motivating question based on evidence from the visualization: Do people who use marijuana more frequently tend to be satisfied with life?
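For reference, the general shape of a filled bar chart with releveled categories can be sketched on toy data. Everything in this example (the toy_survey tibble, its exercise and sleep columns, and their values) is hypothetical and unrelated to the BRFSS variables.

```r
library(tidyverse)

# Toy survey data (hypothetical values and column names)
toy_survey <- tibble(
  exercise = c("Often", "Often", "Often", "Rarely", "Rarely", "Rarely"),
  sleep = c("Good", "Good", "Poor", "Good", "Poor", "Poor")
) |>
  mutate(
    # Convert to a factor, then put the levels in a deliberate order
    sleep = fct_relevel(as.factor(sleep), "Poor", "Good")
  )

p_fill <- ggplot(toy_survey, aes(x = exercise, fill = sleep)) +
  # position = "fill" rescales every bar to height 1,
  # so the segments show proportions rather than counts
  geom_bar(position = "fill") +
  labs(
    title = "Sleep quality by exercise frequency (toy data)",
    x = "Exercise frequency",
    y = "Proportion",
    fill = "Sleep quality"
  )

p_fill
```

Because every bar has the same height, a filled bar chart makes proportions comparable across groups of different sizes, at the cost of hiding the group sizes themselves.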
Now is a good time to render, commit, and push.
Exercise 6
How are mental health and marijuana usage associated?
- Create a visualization displaying the relationship between the number of mental health days in a month that are not good (mental_health) and marijuana.
- Include informative title and axis labels.
- Modify your plot to use a different theme than the default.
- Comment on the motivating question based on evidence from the visualization: How are mental health and marijuana usage associated?
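Changing the theme is a matter of adding a theme function as one more layer. The sketch below uses the mpg dataset from {ggplot2} as a stand-in, and both the histogram and theme_minimal() are arbitrary choices for illustration; any of the complete themes that ship with {ggplot2} is added the same way.

```r
library(tidyverse)

# `mpg` ships with ggplot2; the plot type here is arbitrary
p_themed <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = "Distribution of highway fuel economy",
    x = "Highway miles per gallon",
    y = "Number of car models"
  ) +
  # Replace the default gray theme with a complete alternative
  theme_minimal()

p_themed
```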
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2951 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 7 points
- Exercise 2: 8 points
- Exercise 3: 8 points
- Exercise 4: 6 points
- Exercise 5: 8 points
- Exercise 6: 8 points
- Workflow + formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- At least 3 informative commit messages
- Following {tidyverse} code style
- All code being visible in the rendered PDF without automatic wrapping (lines of no more than 80 characters)
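One way to spot overly long lines before rendering is a quick check in base R. The helper below, long_lines(), is a hypothetical function written for this sketch and is not part of any package; the example input is made up.

```r
# Hypothetical helper: return the positions of lines longer than `limit`
long_lines <- function(lines, limit = 80) {
  which(nchar(lines) > limit)
}

# Made-up input: the second element is 81 characters long
example_lines <- c(
  "ncaa |>",
  strrep("x", 81)
)

long_lines(example_lines)
```

To check a real document, pass the result of readLines() on your own file path to long_lines().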
Footnotes
[1] For instance, Cornell University belongs to the Ivy League. Yes, the Ivy League was originally organized as a sports conference.