HW 07 - Generalized linear models

Homework

Modified

April 10, 2026

Important

This homework is due April 15 at 11:59pm ET.

Learning objectives

Use an API to import data
Estimate linear regression models
Interpret the results of regression models
Estimate logistic regression models
Generate and interpret predicted probabilities
Evaluate the effectiveness of generalized linear models

Getting started

Go to the info2951-sp26 organization on GitHub. Click on the repo with the prefix hw-07. It contains the starter documents you need to complete the homework.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.

General guidance

Guidelines + tips

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the homework and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Workflow + formatting

Make sure to

Update author name on your document.
Label all code chunks informatively and concisely.
Follow the Tidyverse code style guidelines.
Make at least 3 commits.
Resize figures where needed; avoid tiny or huge plots.
Turn in an organized, well formatted document.

Packages

library(tidyverse)
library(tidymodels)
library(probably)
library(wbstats)
library(gt)

Part 1: Explaining economic development

Exercise 1 - Get data from the World Bank

The World Bank publishes extensive socioeconomic data on countries/economies around the world.¹ We will use an excerpt of their data to explore relationships among world health metrics across countries and regions for 2023.

Rather than downloading every country’s complete indicators as standalone spreadsheet files and performing a complex iterative operation to import, tidy, and combine them together, we will use the World Bank’s API to obtain our required data in a tidy format. Specifically, use {wbstats} to obtain GDP per capita (current US$) and Life expectancy at birth, total (years) for 2023. Then combine those indicators with the World Bank’s country metadata² so we know each country’s region. Filter out any rows with missing values. Store the final data frame as gdp_wb.

The resulting data frame should contain 203 observations.
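The combine-and-filter step can be sketched with toy data. This is a minimal illustration, not the homework solution: the two tibbles below merely stand in for what an API might return, and every column name and value (`iso3c`, `gdp_per_cap`, `life_exp`, `region`) is a made-up assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for indicator data (hypothetical values)
indicators <- tibble(
  iso3c = c("AAA", "BBB", "CCC"),
  gdp_per_cap = c(1200, NA, 45000),
  life_exp = c(61.5, 70.2, 82.1)
)

# Toy stand-in for country metadata (hypothetical regions)
metadata <- tibble(
  iso3c = c("AAA", "BBB", "CCC"),
  region = c("Region X", "Region Y", "Region Y")
)

# Attach the region to each country, then drop rows with any missing value
combined <- indicators |>
  left_join(metadata, by = "iso3c") |>
  drop_na()

combined
```

Here the row for `BBB` is dropped because its GDP value is missing. With real data, check the row count after each step so you know where observations are lost.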

Exercise 2 - GDP vs. life expectancy

Visualization: We are interested in learning more about GDP per capita (hereafter referred to simply as GDP), and we’ll start with exploring the relationship between GDP and life expectancy. Create two visualizations:
- Scatter plot of GDP vs. life expectancy. GDP is the dependent variable/outcome of interest for all of the exercises.
- Scatter plot of logged GDP vs. life expectancy. Modify the plot’s scale to implement a log transformation for GDP per capita.

Note

For the visualizations, you can use scale_*_log10() to transform the axis. For model estimation, use natural logarithms so that the coefficients are interpretable using the method described in class. The default base for log() is $e$, so you don’t have to change any arguments.

First, describe the relationship between each pair of variables. Then, comment on which relationship would be better modeled using a linear model, and explain your reasoning.
Model fitting and interpretation:
- Fit a linear model predicting log GDP from life expectancy. Display the tidy summary.
- Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).
- Interpret the coefficient for life expectancy, making sure that your interpretation is in the units of the original data (not on log scale).
Hypothesis testing: Test the null hypothesis that there is no relationship between life expectancy and GDP against the alternative hypothesis that there is a relationship. Use a significance level of $\alpha = 0.05$.
Model evaluation: Calculate the $R^2$ and adjusted $R^2$ of the model and interpret in the context of the data and the research question.
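The back-transformation needed for interpreting a model with a logged outcome can be written out as follows. This is a sketch of the standard algebra using generic symbols $b_0$ and $b_1$ (not estimates from this homework); if the method described in class differs, follow the class method.

$$
\widehat{\ln(y)} = b_0 + b_1 x
\quad\Longrightarrow\quad
\widehat{y} = e^{b_0} \cdot e^{b_1 x}
$$

$$
\frac{\widehat{y}(x + 1)}{\widehat{y}(x)}
= \frac{e^{b_0} \, e^{b_1 (x + 1)}}{e^{b_0} \, e^{b_1 x}}
= e^{b_1}
$$

Under this form, $e^{b_0}$ is the predicted value of $y$ on its original scale when $x = 0$, and each one-unit increase in $x$ multiplies the predicted $y$ by $e^{b_1}$, which is a change of $100 \times (e^{b_1} - 1)$ percent.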

Exercise 3 - GDP vs. life expectancy + region

Next, we want to examine if the relationship between GDP and life expectancy that we observed in the previous exercise holds across all regions in our data. We’ll continue to work with logged GDP.

Justification: Create a scatter plot of logged GDP vs. life expectancy, with some method of differentiating between each region. Do you think the trend between life expectancy and GDP is different for different regions? Justify your answer with specific features of the plot.
Model fitting and interpretation:
- Regardless of your answer to the justification question above, fit an additive model (main effects) that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output.
- Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).
- Interpret the coefficient for life expectancy, making sure that your interpretation is in the units of the original data (not on log scale).
Hypothesis testing: Test the null hypothesis that there is no relationship between life expectancy and GDP against the alternative hypothesis that there is a relationship while holding constant the relationship between region and GDP. Use a significance level of $\alpha = 0.05$.
Prediction: Predict the GDP of a country in East Asia & Pacific where the average life expectancy is 70 years.
Model evaluation: Calculate the $R^2$ and adjusted $R^2$ of the model and interpret in the context of the data and the research question.

Exercise 4 - GDP vs. life expectancy x region

Finally, we want to re-examine whether the relationship between GDP and life expectancy holds across all regions in our data, this time allowing that relationship to differ from region to region. We’ll continue to work with logged GDP.

Model fitting and interpretation: Fit an interaction model that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output and in 2-3 sentences, explain how this model differs from the additive model.
Prediction: Predict the GDP of a country in East Asia & Pacific where the average life expectancy is 70 years. Is this prediction different from your prediction with the additive model from Exercise 3?
Model evaluation:
- Calculate the $R^2$ and adjusted $R^2$ of the model and interpret in the context of the data and the research question.
- Based on the $R^2$ and adjusted $R^2$ values, which model (additive or interaction) is better at explaining the variation in GDP? What does this suggest about the relationship between life expectancy and GDP across regions?
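The difference between an additive and an interaction specification can be sketched on simulated data. This is an illustration under stated assumptions rather than the homework solution: the data, the variable names (`x`, `group`, `y`), and the choice of `"B"` as the baseline level are all hypothetical.

```r
library(tidymodels)

set.seed(123)

# Simulated data where the slope of log(y) on x differs by group (made up)
toy <- tibble(
  x = runif(120, min = 50, max = 85),
  group = factor(rep(c("A", "B", "C"), each = 40))
) |>
  mutate(
    slope = case_when(group == "A" ~ 0.05, group == "B" ~ 0.10, TRUE ~ 0.15),
    y = exp(1 + slope * x + rnorm(120, sd = 0.3)),
    # Make "B" the baseline (reference) level
    group = relevel(group, ref = "B")
  )

# Additive model: one common slope, a separate intercept shift per group
additive_fit <- linear_reg() |>
  fit(log(y) ~ x + group, data = toy)

# Interaction model: each group gets its own intercept and its own slope
interaction_fit <- linear_reg() |>
  fit(log(y) ~ x * group, data = toy)

tidy(interaction_fit)

# Compare fit statistics side by side
bind_rows(
  additive = glance(additive_fit),
  interaction = glance(interaction_fit),
  .id = "model"
) |>
  select(model, r.squared, adj.r.squared)

# Predict on the log scale, then exponentiate to return to original units
new_obs <- tibble(x = 70, group = factor("A", levels = levels(toy$group)))
pred <- predict(interaction_fit, new_data = new_obs) |>
  mutate(.pred_original = exp(.pred))
```

Because the additive model is nested in the interaction model, $R^2$ cannot decrease when the interaction terms are added; adjusted $R^2$ is the more informative comparison since it penalizes the extra terms.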

Exercise 5 - Construct a regression results table

Construct a regression results table for the best-performing model. The table should include the estimated coefficients, 95% confidence intervals, and $p$-values for each predictor. Confidence intervals and $p$-values must be generated using randomization-based approaches (i.e., permutation or bootstrap). The table should be formatted in a clear and professional manner, with appropriate labels and formatting to enhance readability.
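One possible shape for such a table is sketched below on simulated data, assuming the {infer} workflow (`specify()`, `generate()`, `fit()`) loaded through {tidymodels}. The data, the variable names `x` and `log_y`, and the small number of replicates are assumptions chosen for illustration; the approach shown in the application exercise may differ.

```r
library(tidymodels)
library(gt)

set.seed(321)

# Simulated data with the outcome already on the log scale (made up)
toy <- tibble(x = rnorm(100)) |>
  mutate(log_y = 2 + 0.5 * x + rnorm(100, sd = 0.5))

# Observed coefficient estimates
obs_fit <- toy |>
  specify(log_y ~ x) |>
  fit()

# Bootstrap distribution of the coefficients, used for confidence intervals
boot_fits <- toy |>
  specify(log_y ~ x) |>
  generate(reps = 200, type = "bootstrap") |>
  fit()

# Permutation-based null distribution, used for p-values
null_fits <- toy |>
  specify(log_y ~ x) |>
  hypothesize(null = "independence") |>
  generate(reps = 200, type = "permute") |>
  fit()

ci <- get_confidence_interval(boot_fits, point_estimate = obs_fit, level = 0.95)
p_vals <- get_p_value(null_fits, obs_stat = obs_fit, direction = "two-sided")

# Combine estimates, intervals, and p-values into one row per term
results <- obs_fit |>
  left_join(ci, by = "term") |>
  left_join(p_vals, by = "term")

results_table <- results |>
  gt() |>
  fmt_number(columns = c(estimate, lower_ci, upper_ci), decimals = 2) |>
  fmt_number(columns = p_value, decimals = 3) |>
  cols_label(
    term = "Term",
    estimate = "Estimate",
    lower_ci = "95% CI (lower)",
    upper_ci = "95% CI (upper)",
    p_value = "p-value"
  )
```

In practice you would use many more replicates than 200 (for example 1,000 or more); the small number here only keeps the sketch fast to run.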

Review past application exercises

Application exercise 15 includes a clear example of how to construct a regression results table using

Part 2: Predicting attitudes on scientific advancement

The General Social Survey is a biennial survey of the American public.

About the GSS

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

In the 2022 edition, respondents were asked to respond to the following prompt:

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

data/gss-science.rds contains a tidy data frame with the relevant variables for this exercise. The dependent variable is advfront, which captures respondents’ agreement with the above statement as either “Agree” or “Not agree”. The independent variables are educ (years of education) and polviews (political ideology defined as “Liberal”, “Moderate”, or “Conservative”).

Exercise 6 - Partition the data

Reproducibly partition the data into a training set (80%) and a testing set (20%).
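A reproducible split can be sketched as follows on a toy data frame. The seed value and the column names are arbitrary assumptions; what matters is that the seed is set before the split so the same rows land in each set every time the document renders.

```r
library(tidymodels)

# Setting a seed makes the random partition repeatable (value is arbitrary)
set.seed(2951)

# Toy data frame standing in for the real data (made up)
toy <- tibble(
  id = 1:100,
  outcome = factor(rep(c("yes", "no"), times = 50))
)

# Reserve 80% of rows for training and the remaining 20% for testing
toy_split <- initial_split(toy, prop = 0.8)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)
```

Fit models on the training set only, and hold the testing set back until the final performance evaluation.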

Exercise 7 - Predict scientific advancement by education

Fit a logistic regression model that predicts advfront by educ. Report the model coefficients and briefly summarize what they indicate.

Using your estimated model, predict the probability of agreeing with the statement across all plausible values of educ. Visualize the predicted probabilities using a line chart and interpret the relationship between education and agreement with the statement.

Tip

You should create a data frame that contains all possible combinations of educ and polviews. Do not rely solely on the output from augment() since there is no guarantee that all possible combinations actually were observed in the dataset. I recommend considering tidyr::expand_grid() for this task.
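The grid-then-predict pattern from the tip can be sketched on simulated data. Everything here is hypothetical (the variables `hours`, `track`, and `passed`, and the coefficients used to simulate them); it shows the mechanics only, not the homework model.

```r
library(tidymodels)

set.seed(42)

# Simulated binary outcome driven by one numeric and one categorical predictor
toy <- tibble(
  hours = sample(0:10, size = 300, replace = TRUE),
  track = factor(sample(c("day", "night"), size = 300, replace = TRUE))
) |>
  mutate(
    p = plogis(-2 + 0.4 * hours + 0.5 * (track == "night")),
    passed = factor(
      if_else(runif(300) < p, "Pass", "Fail"),
      levels = c("Pass", "Fail")
    )
  )

toy_fit <- logistic_reg() |>
  fit(passed ~ hours + track, data = toy)

# Every combination of the predictors, whether or not it was observed
grid <- expand_grid(
  hours = 0:10,
  track = factor(c("day", "night"))
)

# augment() appends the predicted class and the class probabilities
grid_preds <- augment(toy_fit, new_data = grid)

# Line chart of the predicted probability of "Pass", one line per track
prob_plot <- ggplot(
  grid_preds,
  aes(x = hours, y = .pred_Pass, color = track)
) +
  geom_line() +
  labs(
    title = "Predicted probability of passing (simulated data)",
    x = "Hours", y = "Predicted probability", color = "Track"
  )
```

The probability columns are named after the outcome's factor levels (here `.pred_Pass` and `.pred_Fail`), so the column to plot depends on how the outcome is labeled in your data.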

Exercise 8 - Predict scientific advancement by education + political ideology

Fit a new model that adds the additional explanatory variable of polviews. Report the model coefficients and briefly summarize what they indicate.

Using your estimated model, predict the probability of agreeing with the statement across all plausible values of educ and polviews. Visualize the predicted probabilities using a line chart and interpret the relationship between education and agreement with the statement.

Exercise 9 - Evaluate the model’s performance

Evaluate the performance of the model you fit in Exercise 7 using the testing set. Report the confusion matrix, accuracy, sensitivity, and specificity of the model, along with the ROC AUC. Interpret these metrics in the context of the data and the research question. If you need to predict whether or not a new respondent would agree with the statement, what decision threshold would you use and why?
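The metrics named above can be computed with {yardstick} once you have a data frame of true classes, predicted classes, and predicted probabilities. The sketch below uses simulated predictions with hypothetical labels `"yes"` and `"no"`; it assumes the first factor level is the event of interest, which is the {yardstick} default.

```r
library(tidymodels)

set.seed(7)

# Simulated test-set predictions (made up); "yes" is the event of interest
preds <- tibble(
  truth = factor(
    sample(c("yes", "no"), size = 200, replace = TRUE, prob = c(0.7, 0.3)),
    levels = c("yes", "no")
  )
) |>
  mutate(
    .pred_yes = plogis(rnorm(200, mean = if_else(truth == "yes", 1, -1))),
    # Classify using a 0.5 decision threshold
    .pred_class = factor(
      if_else(.pred_yes >= 0.5, "yes", "no"),
      levels = c("yes", "no")
    )
  )

conf_mat(preds, truth = truth, estimate = .pred_class)

acc <- accuracy(preds, truth = truth, estimate = .pred_class)
sens <- sensitivity(preds, truth = truth, estimate = .pred_class)
spec <- specificity(preds, truth = truth, estimate = .pred_class)

# ROC AUC uses the predicted probabilities, not the predicted classes
auc <- roc_auc(preds, truth = truth, .pred_yes)

# Re-classify with a stricter threshold to see the sensitivity trade-off
strict <- preds |>
  mutate(
    .pred_class = factor(
      if_else(.pred_yes >= 0.7, "yes", "no"),
      levels = c("yes", "no")
    )
  )
sens_strict <- sensitivity(strict, truth = truth, estimate = .pred_class)
```

Raising the threshold can only keep sensitivity the same or lower it, because fewer observations are classified as the event. Which direction of error matters more is a judgment about the application, not something the metrics decide for you.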

Wrap up

Submission

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials $\rightarrow$ Cornell University NetID and log in using your NetID credentials.
Click on your INFO 2951 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 5 points
Exercise 2: 5 points
Exercise 3: 5 points
Exercise 4: 5 points
Exercise 5: 5 points
Exercise 6: 5 points
Exercise 7: 5 points
Exercise 8: 5 points
Exercise 9: 5 points
Workflow + formatting: 5 points
Total: 50 points

Workflow & formatting criteria

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

Following {tidyverse} code style
All code being visible in the rendered PDF without automatic wrapping (lines of no more than 80 characters)
Appropriate figure sizing, and figures with informative labels and legends
Ensuring reproducibility by setting a random seed value

Acknowledgments

This assignment is adapted from STA 199: Introduction to Data Science

Footnotes

¹ You knew this from homework 05.
² See wb_countries() for more information.