HW 07 - Generalized linear models
This homework is due April 15 at 11:59pm ET.
Learning objectives
- Use an API to import data
- Estimate linear regression models
- Interpret the results of regression models
- Estimate logistic regression models
- Generate and interpret predicted probabilities
- Evaluate the effectiveness of generalized linear models
Getting started
Go to the info2951-sp26 organization on GitHub. Click on the repo with the prefix hw-07. It contains the starter documents you need to complete the homework.
Clone the repo and start a new workspace in Positron. See the Homework 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title and labeled axes, and you should give careful consideration to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the homework and other assignments in this course. This assignment includes periodic reminders to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update the author name in your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well formatted document.
Packages
Part 1: Explaining economic development
Exercise 1 - Get data from the World Bank
The World Bank publishes extensive socioeconomic data on countries/economies around the world.[1] We will use an excerpt of their data to explore relationships among world health metrics across countries and regions for 2023.
Rather than downloading every country’s complete indicators as standalone spreadsheet files and performing a complex iterative operation to import, tidy, and combine them together, we will use the World Bank’s API to obtain our required data in a tidy format. Specifically, use {wbstats} to obtain GDP per capita (current US$) and Life expectancy at birth, total (years) for 2023. Then combine those indicators with the World Bank’s country metadata[2] so we know each country’s region. Filter out any rows with missing values. Store the final data frame as gdp_wb.
The resulting data frame should contain 203 observations.
Exercise 2 - GDP vs. life expectancy
Visualization: We are interested in learning more about GDP per capita (hereafter referred to simply as GDP), and we’ll start by exploring the relationship between GDP and life expectancy. Create two visualizations:
- Scatter plot of GDP vs. life expectancy. GDP is the dependent variable/outcome of interest for all of the exercises.
- Scatter plot of logged GDP vs. life expectancy. Modify the plot’s scale to implement a log transformation for GDP per capita.
Note: For the visualizations, you can use scale_*_log10() to transform the axis. For model estimation, use natural logarithms so that the coefficients are interpretable using the method described in class. The default base for log() is \(e\), so you don’t have to change any arguments.
First describe the relationship between each pair of variables. Then, comment on which relationship would be better modeled using a linear model, and explain your reasoning.
Model fitting and interpretation:
- Fit a linear model predicting log GDP from life expectancy. Display the tidy summary.
- Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).
- Interpret the coefficient, making sure that your interpretation is in the units of the original data (not on log scale).
Hypothesis testing: Test the null hypothesis that there is no relationship between life expectancy and GDP against the alternative hypothesis that there is a relationship. Use a significance level of \(\alpha = 0.05\).
Model evaluation: Calculate the \(R^2\) and adjusted \(R^2\) of the model and interpret in the context of the data and the research question.
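For the interpretation tasks in this exercise, one common way to move from the log scale back to the original units is sketched below. The symbols \(\hat\beta_0\) (intercept) and \(\hat\beta_1\) (life expectancy coefficient) are notation assumed here for illustration; the method described in class may use different notation.

```latex
\begin{align*}
\log\big(\widehat{\text{GDP}}\big) &= \hat\beta_0 + \hat\beta_1 \cdot \text{LifeExp} \\
\widehat{\text{GDP}} &= e^{\hat\beta_0} \cdot e^{\hat\beta_1 \cdot \text{LifeExp}} \\
\frac{\widehat{\text{GDP}}\ \text{at}\ (\text{LifeExp} + 1)}{\widehat{\text{GDP}}\ \text{at}\ \text{LifeExp}} &= e^{\hat\beta_1}
\end{align*}
```

Under this sketch, \(e^{\hat\beta_0}\) is the predicted GDP when life expectancy is 0, and each additional year of life expectancy multiplies the predicted GDP by \(e^{\hat\beta_1}\), a change of \(100 \times (e^{\hat\beta_1} - 1)\) percent.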
Exercise 3 - GDP vs. life expectancy + region
Next, we want to examine if the relationship between GDP and life expectancy that we observed in the previous exercise holds across all regions in our data. We’ll continue to work with logged GDP.
Justification: Create a scatter plot of logged GDP vs. life expectancy, with some method of differentiating between each region. Do you think the trend between life expectancy and GDP is different for different regions? Justify your answer with specific features of the plot.
Model fitting and interpretation:
- Regardless of your answer in part (a), fit an additive model (main effects) that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output.
- Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).
- Interpret the coefficient for life expectancy, making sure that your interpretation is in the units of the original data (not on log scale).
Hypothesis testing: Test the null hypothesis that there is no relationship between life expectancy and GDP against the alternative hypothesis that there is a relationship while holding constant the relationship between region and GDP. Use a significance level of \(\alpha = 0.05\).
Prediction: Predict the GDP of a country in East Asia and Pacific where the average life expectancy is 70 years.
Model evaluation: Calculate the \(R^2\) and adjusted \(R^2\) of the model and interpret in the context of the data and the research question.
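The mechanics of setting a baseline level and back-transforming a prediction can be sketched in base R as follows. This is a minimal illustration on simulated data, not the World Bank data: the data frame `toy`, its column names (`gdp`, `life_exp`, `region`), and the region labels are all hypothetical, and the course workflow may use different functions for the same steps.

```r
# Simulated stand-in data; all names and values here are hypothetical
set.seed(123)
n <- 150
regions <- c(
  "East Asia & Pacific", "Europe & Central Asia", "Sub-Saharan Africa"
)
toy <- data.frame(
  life_exp = runif(n, min = 55, max = 85),
  region = sample(regions, size = n, replace = TRUE)
)
toy$gdp <- exp(-2 + 0.15 * toy$life_exp + rnorm(n, sd = 0.5))

# Make Europe & Central Asia the baseline (reference) level
toy$region <- relevel(factor(toy$region), ref = "Europe & Central Asia")

# Additive (main effects) model on the natural-log scale
fit_add <- lm(log(gdp) ~ life_exp + region, data = toy)

# Predict on the log scale, then exponentiate to return to original units
new_country <- data.frame(
  life_exp = 70,
  region = factor("East Asia & Pacific", levels = levels(toy$region))
)
pred_log <- predict(fit_add, newdata = new_country)
pred_gdp <- exp(pred_log)
pred_gdp
```

Because the model is fit to `log(gdp)`, the raw output of `predict()` is on the log scale; forgetting the `exp()` step is an easy mistake to make.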
Exercise 4 - GDP vs. life expectancy x region
Finally, we want to revisit whether the relationship between GDP and life expectancy that we observed in the previous exercise holds across all regions in our data, this time allowing that relationship to differ by region. We’ll continue to work with logged GDP.
Model fitting and interpretation: Fit an interaction model that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output and in 2-3 sentences, explain how this model differs from the additive model.
Prediction: Predict the GDP of a country in East Asia and Pacific where the average life expectancy is 70 years. Is this prediction different from your prediction with the additive model from Exercise 3?
Model evaluation:
- Calculate the \(R^2\) and adjusted \(R^2\) of the model and interpret in the context of the data and the research question.
- Based on the \(R^2\) and adjusted \(R^2\) values, which model (additive or interaction) is better at explaining the variation in GDP? What does this suggest about the relationship between life expectancy and GDP across regions?
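When comparing the additive and interaction models, it may help to recall how the two statistics are defined. In the sketch below, \(n\) is the number of observations and \(p\) is the number of estimated coefficients excluding the intercept; these symbols are notation assumed here for illustration.

```latex
\begin{align*}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
R^2_{\text{adj}} &= 1 - \left(1 - R^2\right) \frac{n - 1}{n - p - 1}
\end{align*}
```

\(R^2\) never decreases when terms are added to a model, so the interaction model will have an \(R^2\) at least as large as the additive model's. Adjusted \(R^2\) penalizes the extra coefficients, which makes it the more informative statistic for comparing models of different sizes.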
Exercise 5 - Construct a regression results table
Construct a regression results table for the best-performing model. The table should include the estimated coefficients, 95% confidence intervals, and \(p\)-values for each predictor. Confidence intervals and \(p\)-values must be generated using randomization-based approaches (i.e. permutation or bootstrap). The table should be formatted in a clear and professional manner, with appropriate labels and formatting to enhance readability.
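The randomization-based quantities can be sketched in base R as follows. This is a minimal illustration on simulated data with hypothetical names (`toy`, `x`, `y`), using a percentile bootstrap interval and a simple permutation test for a single slope. It is one possible approach, not necessarily the one used in the application exercise.

```r
# Simulated stand-in data; names and values are hypothetical
set.seed(2951)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 0.5 * toy$x + rnorm(n)

# Slope estimate from the observed data
obs_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Bootstrap: refit the model on 1,000 resamples of the rows
# (sampling with replacement)
boot_slopes <- replicate(1000, {
  idx <- sample(nrow(toy), replace = TRUE)
  coef(lm(y ~ x, data = toy[idx, ]))[["x"]]
})

# Percentile 95% confidence interval for the slope
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))

# Permutation: shuffle the outcome to break any association with x,
# then compare the observed slope to the shuffled slopes
perm_slopes <- replicate(1000, {
  coef(lm(sample(y) ~ x, data = toy))[["x"]]
})
p_value <- mean(abs(perm_slopes) >= abs(obs_slope))

ci
p_value
```

The bootstrap answers "how variable is the estimate?", while the permutation test answers "how unusual is the estimate if there were no relationship?", which is why the two procedures resample the data differently.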
Application exercise 15 includes a clear example of how to construct a regression results table using
Part 2: Predicting attitudes on scientific advancement
The General Social Survey is a biennial survey of the American public.
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
In the 2022 edition, respondents were asked to respond to the prompt:
Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.
data/gss-science.rds contains a tidy data frame with the relevant variables for this exercise. The dependent variable is advfront, which captures respondents’ agreement with the above statement as either “Agree” or “Not agree”. The independent variables are educ (years of education) and polviews (political ideology defined as “Liberal”, “Moderate”, or “Conservative”).
Exercise 6 - Partition the data
Reproducibly partition the data into a training set (80%) and a testing set (20%).
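The idea behind a reproducible split can be sketched in base R as follows. This is a minimal illustration on a hypothetical data frame `toy`. The seed value is arbitrary, and the course's tidymodels workflow may use a dedicated function such as rsample::initial_split() instead.

```r
# Hypothetical stand-in data
set.seed(2951)  # fixing the seed makes the random split repeatable
toy <- data.frame(
  id = 1:200,
  educ = sample(8:20, size = 200, replace = TRUE)
)

# Randomly choose 80% of the row indices for training
n_train <- floor(0.8 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

toy_train <- toy[train_idx, ]   # 80% of rows
toy_test <- toy[-train_idx, ]   # remaining 20% of rows

nrow(toy_train)
nrow(toy_test)
```

Without the call to `set.seed()`, each render would produce a different partition and therefore different model estimates and metrics.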
Exercise 7 - Predict scientific advancement by education
Fit a logistic regression model that predicts advfront by educ. Report the model coefficients and briefly summarize what they indicate.
Using your estimated model, predict the probability of agreeing with the statement across all plausible values of educ. Visualize the predicted probabilities using a line chart and interpret the relationship between education and agreement with the statement.
You should create a data frame that contains all possible combinations of educ and polviews. Do not rely solely on the output from augment() since there is no guarantee that all possible combinations actually were observed in the dataset. I recommend considering tidyr::expand_grid() for this task.
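Predicting over a grid of values, rather than over the observed rows, can be sketched in base R as follows. This is a minimal illustration on simulated data that borrows the variable names from this exercise; the simulated values, the 0 to 20 range for `educ`, and the use of base `expand.grid()` in place of tidyr::expand_grid() are all assumptions made for illustration.

```r
# Simulated stand-in data; values are hypothetical
set.seed(2951)
n <- 500
ideologies <- c("Liberal", "Moderate", "Conservative")
toy <- data.frame(
  educ = sample(8:20, size = n, replace = TRUE),
  polviews = sample(ideologies, size = n, replace = TRUE)
)
p_true <- plogis(-1 + 0.12 * toy$educ)
toy$advfront <- factor(
  ifelse(runif(n) < p_true, "Agree", "Not agree"),
  levels = c("Not agree", "Agree")
)

# With a factor outcome, glm() treats the first level as "failure",
# so this model estimates the probability of "Agree"
fit <- glm(advfront ~ educ, data = toy, family = binomial)

# Every combination of the predictors, whether or not it was observed
grid <- expand.grid(
  educ = 0:20,
  polviews = ideologies,
  stringsAsFactors = FALSE
)

# type = "response" returns probabilities instead of log-odds
grid$prob_agree <- predict(fit, newdata = grid, type = "response")
head(grid)
```

The same grid can be reused for a model that also includes `polviews`, since it already contains every education and ideology combination.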
Exercise 8 - Predict scientific advancement by education + political ideology
Fit a new model that adds the additional explanatory variable of polviews. Report the model coefficients and briefly summarize what they indicate.
Using your estimated model, predict the probability of agreeing with the statement across all plausible values of educ and polviews. Visualize the predicted probabilities using a line chart and interpret the relationship between education, political ideology, and agreement with the statement.
Exercise 9 - Evaluate the model’s performance
Evaluate the performance of the model you fit in Exercise 7 using the testing set. Report the confusion matrix, accuracy, sensitivity, and specificity of the model, along with the ROC AUC. Interpret these metrics in the context of the data and the research question. If you need to predict whether or not a new respondent would agree with the statement, what decision threshold would you use and why?
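How these metrics follow from predicted probabilities and a decision threshold can be sketched in base R as follows. This is a minimal illustration on ten made-up observations; the vectors `truth` and `prob` are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formula rather than with any particular package.

```r
# Hypothetical test-set outcomes (1 = "Agree") and predicted probabilities
truth <- c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1)
prob <- c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.55, 0.1, 0.65)

# Convert probabilities to class predictions at a chosen threshold
threshold <- 0.5
pred <- as.integer(prob >= threshold)

# Confusion matrix with the event of interest ("1") listed first
conf_mat <- table(
  truth = factor(truth, levels = c(1, 0)),
  predicted = factor(pred, levels = c(1, 0))
)

tp <- conf_mat["1", "1"]  # true positives
fn <- conf_mat["1", "0"]  # false negatives
fp <- conf_mat["0", "1"]  # false positives
tn <- conf_mat["0", "0"]  # true negatives

accuracy <- (tp + tn) / sum(conf_mat)
sensitivity <- tp / (tp + fn)  # share of actual "Agree" correctly flagged
specificity <- tn / (tn + fp)  # share of actual "Not agree" correctly flagged

# ROC AUC: probability that a random positive case is ranked above
# a random negative case (threshold-free)
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

conf_mat
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, auc = auc)
```

Changing `threshold` trades sensitivity against specificity while leaving the AUC unchanged, which is a useful starting point for reasoning about which threshold suits a given purpose.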
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2951 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. Every page of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 5 points
- Exercise 2: 5 points
- Exercise 3: 5 points
- Exercise 4: 5 points
- Exercise 5: 5 points
- Exercise 6: 5 points
- Exercise 7: 5 points
- Exercise 8: 5 points
- Exercise 9: 5 points
- Workflow & formatting: 5 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Following {tidyverse} code style
- All code being visible in the rendered PDF without automatic wrapping (no line longer than 80 characters)
- Appropriate figure sizing, and figures with informative labels and legends
- Ensuring reproducibility by setting a random seed value.
Acknowledgments
- This assignment is adapted from STA 199: Introduction to Data Science
Footnotes
1. You knew this from homework 05.
2. See wb_countries() for more information.