HW 10 - Predicting Congressional speechmakers
This homework is due Tuesday May 6 at 11:59pm ET.
Getting started
Go to the info2951-sp25 organization on GitHub. Click on the repo with the prefix hw-10. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
General guidance
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders throughout this assignment to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
Make sure to
- Update author name on your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Turn in an organized, well-formatted document.
The models in this assignment take a long time to fit. If you wait until the last minute to render your Quarto document, you will likely run out of time to submit it, especially if you wait until the slip day deadline. I highly recommend using code caching to store the results of your model code chunks so they don't have to rerun unnecessarily on every render. Refer back to the slides for setting up chunk dependencies.
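As a minimal sketch (the chunk labels here are hypothetical; adapt them to your own document), a cached model chunk might look like:

```r
#| label: fit-lasso
#| cache: true
#| dependson: "create-folds"

# With cache: true, this chunk reruns only when its own code (or the
# "create-folds" chunk it depends on) changes; otherwise the cached
# results are loaded on render instead of refitting the model.
```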
Data and packages
We’ll use the {tidyverse} and {tidymodels} packages for this assignment. You are welcome and encouraged to load additional packages if you desire.
U.S. Congressional speeches
Legislators frequently give speeches on the floor of the U.S. Congress. The target audience of these speeches varies from the general public to other legislators, and the speeches are often used to persuade others to support a particular bill or policy. The speeches are typically recorded in the Congressional Record, which is a transcript of the proceedings of Congress.
In this assignment you will estimate a series of machine learning models to predict the partisan identity (Democrat or Republican) of a speechmaker based on the text of the speech itself. The dataset contains speeches from the 117th Congress, which met from 2021 to 2023.1
1 The dataset is a filtered version of the legislative speech dataset from Aroyehun, S.T., Simchon, A., Carrella, F. et al. Computational analysis of US congressional speeches reveals a shift from evidence to intuition. Nat Hum Behav (2025). https://doi.org/10.1038/s41562-025-02136-2
There are two files in the data folder:
- cong-speech-train.rds - this contains the training set of observations
- cong-speech-test.rds - this contains the test set of observations
We have already split the data into training/test sets for you. You do not need to use initial_split() to partition the data. Unless otherwise specified, all models should be fit using 5-fold cross-validation.
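As a sketch, assuming the files live in a data/ folder and the outcome column is named party (check the actual column names after loading), reading the data and creating the folds might look like:

```r
library(tidyverse)
library(tidymodels)

# read the pre-split training and test sets
cong_train <- read_rds("data/cong-speech-train.rds")
cong_test <- read_rds("data/cong-speech-test.rds")

# 5-fold cross-validation on the training set, stratified by the
# outcome (assumed here to be a column named party)
set.seed(123)
cong_folds <- vfold_cv(cong_train, v = 5, strata = party)
```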
Exercises
Exercise 1
Analyze word and document frequency. Use the training dataset to identify which words or phrases are most commonly associated with Democrats and Republicans. Calculate the tf-idf scores for Democratic versus Republican speeches for all unigrams and bigrams in the legislative speeches. Visualize and interpret the 20 terms with the highest tf-idf scores for each party. Which make sense to you? Which do not? Consider investigating the more confusing ones to see why they might have such a high tf-idf score.
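One possible starting point using {tidytext}, again assuming party and text columns:

```r
library(tidytext)

# tokenize speeches into unigrams, count terms per party, and
# compute tf-idf treating each party's speeches as one "document"
party_tf_idf <- cong_train |>
  unnest_tokens(word, text) |>
  count(party, word) |>
  bind_tf_idf(word, party, n)

# highest tf-idf terms for each party
party_tf_idf |>
  group_by(party) |>
  slice_max(tf_idf, n = 20)

# repeat with unnest_tokens(bigram, text, token = "ngrams", n = 2)
# for bigrams, then visualize with ggplot() + geom_col()
```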
Did you complete your preparation activities?
Exercise 2
Fit a null model to predict the partisan identity (Democrat or Republican) of the speechmaker. Report the accuracy and ROC AUC, and interpret them in the context of the model.
When fitting a null model, {parsnip} doesn't actually use the specified predictor(s) to fit the model. However, you still need to explicitly provide a model formula to the fit() function. Since the text column is a character vector, I recommend using only speech_id as the "predictor" variable for the sake of efficiency.
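A minimal sketch of this setup, assuming the outcome column is named party:

```r
# null model specification: ignores the predictors and predicts the
# majority class / mean class probabilities
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

# speech_id stands in as the "predictor" in the required formula
null_res <- workflow() |>
  add_formula(party ~ speech_id) |>
  add_model(null_spec) |>
  fit_resamples(
    resamples = cong_folds,
    metrics = metric_set(accuracy, roc_auc)
  )

collect_metrics(null_res)
```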
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
Fit a naive Bayes model. Use the transcript of the legislative speech from text to fit a naive Bayes model predicting the partisan identity (Democrat or Republican) of the speechmaker.2
2 Don’t recognize this model type? Look back at the preparation materials for this week.
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 5000 most frequently occurring tokens, and calculate tf-idf scores. But you are encouraged (and will likely be rewarded) to go beyond the minimum and use additional pre-processing steps based on your understanding of the data.
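A sketch of the minimum recipe and model specification ({discrim} supplies the naive Bayes model; column names are assumed as before):

```r
library(textrecipes)
library(discrim)

# minimum recipe: tokenize, keep the 5000 most frequent tokens,
# and compute tf-idf scores
nb_rec <- recipe(party ~ text, data = cong_train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 5000) |>
  step_tfidf(text)

# naive Bayes model specification
nb_spec <- naive_Bayes() |>
  set_engine("naivebayes") |>
  set_mode("classification")

nb_wf <- workflow() |>
  add_recipe(nb_rec) |>
  add_model(nb_spec)
```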
Report the accuracy and ROC AUC values for this model, along with the confusion matrix for the predictions. How does this model perform? Which outcome is it more likely to predict incorrectly?
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 4
Fit a lasso regression model. Estimate a lasso logistic regression model to predict the partisan identity (Democrat or Republican) of the speechmaker.
A lasso regression model is a form of penalized regression where the mixture hyperparameter is set to 1.
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 1000 most frequently occurring tokens, and calculate tf-idf scores.3 But you are encouraged (and will likely be rewarded) to go beyond the minimum and use additional pre-processing steps based on your understanding of the data.
3 Lasso regression requires all features to be scaled and normalized so they have the same mean and variance. Fortunately for us, tf-idf scores are by definition already normalized and directly comparable, so you do not have to explicitly perform this feature engineering step if all features are tf-idf scores.
Tune the model over the penalty hyperparameter using a regular grid of at least 30 values. Check out dials::grid_regular().
Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
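A hedged sketch of the tuning setup, assuming a lasso_rec recipe built to the minimum described above (1000 tokens, tf-idf):

```r
# lasso: penalized logistic regression with mixture fixed at 1
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

# regular grid of 30 penalty values (log10 scale by default)
lasso_grid <- grid_regular(penalty(), levels = 30)

lasso_res <- workflow() |>
  add_recipe(lasso_rec) |>  # as above, but max_tokens = 1000
  add_model(lasso_spec) |>
  tune_grid(
    resamples = cong_folds,
    grid = lasso_grid,
    metrics = metric_set(roc_auc)
  )

show_best(lasso_res, metric = "roc_auc", n = 5)
autoplot(lasso_res)
```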
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 5
Use sparse encoding to improve model fitting efficiency. Review 7.5 Case study: sparse encoding from your preparation materials. Use sparse encoding to improve the efficiency of fitting the lasso regression model from exercise 4.
Perform the same tuning process again using the sparse-encoded dataset, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform compared to the models from exercise 4? What, if anything, did you notice about the runtime?
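The key change is a {hardhat} blueprint that hands the recipe output to the model as a sparse matrix; a sketch, reusing the object names from exercise 4:

```r
library(hardhat)

# blueprint that passes the recipe output to glmnet as a sparse matrix
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

sparse_wf <- workflow() |>
  add_recipe(lasso_rec, blueprint = sparse_bp) |>
  add_model(lasso_spec)

# then tune with tune_grid() exactly as in exercise 4
```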
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 6
Build a better recipe. Revise the feature engineering recipe from the lasso model to improve its performance. At minimum, you should:
- Remove stop words
- Stem the tokens
- Calculate all possible 1-grams, 2-grams, and 3-grams
- Generate additional text features using step_textfeature(). This creates a series of numeric features based on the original character strings.
- Ensure non-tf-idf predictors are normalized to the same mean and variance4
4 Once again, only non-tf-idf features need to be normalized. Check out The effects of feature scaling for more thorough coverage of this topic.
But you are encouraged (and will likely be rewarded) to go beyond the minimum and use additional pre-processing steps based on your understanding of the data.
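A sketch of a recipe meeting these minimum requirements; copying the text column so step_textfeature() can consume one copy while the other is tokenized is one common approach (column names are assumptions):

```r
# copy text so step_textfeature() and step_tokenize() each get a column
better_rec <- recipe(party ~ text, data = cong_train) |>
  step_mutate(text_copy = text) |>
  step_textfeature(text_copy) |>
  step_tokenize(text) |>
  step_stopwords(text) |>
  step_stem(text) |>
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) |>
  step_tokenfilter(text, max_tokens = 1000) |>
  step_tfidf(text) |>
  # normalize only the non-tf-idf text features
  step_normalize(starts_with("textfeature_"))
```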
Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 7
Fit a model of your own choosing. Select and fit one additional model to predict the partisan identity of the speechmaker. You are responsible for implementing appropriate feature engineering and/or hyperparameter tuning for the model.
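For instance (purely illustrative, not a recommendation), a tunable random forest specification might look like:

```r
# one possibility among many: a random forest with tuned mtry and min_n
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")
```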
Briefly summarize how you decided on the workflow for your model (e.g. feature engineering + model specification). How does this model perform compared to the previous models? Report relevant metrics and plots to support your conclusions.
Credit will be earned based on the effort applied to fitting appropriate models and utilizing the techniques taught in this class. Do the minimum and you can expect to earn minimal credit.
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 8
Pick the best performing model. Select a single model to train using the entire training set and provide a brief written justification as to why you chose this specific model.
Fit the recipe + model using the full training set. Report the accuracy and ROC AUC values for this model, along with the ROC curve and confusion matrix for the predictions. How does this model perform? Does it perform equally well for speeches by Democrats and Republicans, or does it have a built-in bias towards one specific outcome?
Finally, report the top 20 most relevant features of the model.5 What features were most relevant? Do these features make sense?
5 If the final model uses penalized regression, report the top 20 most relevant features for both outcomes. For all other model types, feature importance is measured the same for both outcomes.
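Because the training/test split is already done for you, fit the chosen workflow on the full training set with fit() and evaluate on the held-out test set. A sketch, assuming a final workflow named final_wf and that Democrat is the first factor level of party:

```r
# fit the chosen workflow on the full training set
final_fit <- fit(final_wf, data = cong_train)

# predict on the held-out test set
final_preds <- augment(final_fit, new_data = cong_test)

final_preds |> accuracy(truth = party, estimate = .pred_class)
final_preds |> roc_auc(truth = party, .pred_Democrat)
final_preds |> conf_mat(truth = party, estimate = .pred_class)
final_preds |> roc_curve(truth = party, .pred_Democrat) |> autoplot()

# for a penalized regression model, tidy(extract_fit_parsnip(final_fit))
# returns the coefficients, which you can rank by absolute magnitude
```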
Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2951 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be "checked").
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 6 points
- Exercise 2: 4 points
- Exercise 3: 6 points
- Exercise 4: 8 points
- Exercise 5: 4 points
- Exercise 6: 6 points
- Exercise 7: 6 points
- Exercise 8: 6 points
- Workflow + formatting: 4 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Following {tidyverse} code style
- All code visible in the rendered PDF without automatic wrapping (lines no longer than 80 characters)
- Appropriate figure sizing, and figures with informative labels and legends
- Ensuring reproducibility by setting a random seed value