# A tibble: 146 × 3
film critics audience
<chr> <int> <int>
1 Avengers: Age of Ultron 74 86
2 Cinderella 85 80
3 Ant-Man 80 90
4 Do You Believe? 18 84
5 Hot Tub Time Machine 2 14 28
6 The Water Diviner 63 62
7 Irrational Man 42 53
8 Top Five 86 64
9 Shaun the Sheep Movie 99 82
10 Love & Mercy 89 87
# ℹ 136 more rows
Modeling film ratings
What is the relationship between the critics and audience scores for films?
What is your best guess for a film’s audience score if the critics rated it a 73?
Predictor (explanatory variable): critics
audience critics
      86      74
      80      85
      90      80
      84      18
      28      14
      62      63
     ...     ...
Outcome (response variable): audience
audience critics
      86      74
      80      85
      90      80
      84      18
      28      14
      62      63
     ...     ...
Regression line
Regression line: slope
Regression line: intercept
Correlation
Ranges between -1 and 1.
Same sign as the slope.
Models with a single predictor
Regression model
A regression model is a function that describes the relationship between the outcome, \(Y\), and the predictor, \(X\).
The regression line goes through the center of mass point (the coordinates corresponding to average \(X\) and average \(Y\)): \(b_0 = \bar{Y} - b_1~\bar{X}\)
Slope has the same sign as the correlation coefficient: \(b_1 = r \frac{sd_Y}{sd_X}\)
Sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\), where the residual \(e_i\) is the observed \(Y\) minus the predicted \(Y\) for observation \(i\)
Residuals and \(X\) values are uncorrelated
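The properties above can be checked numerically. The following is a minimal sketch in Python (standard library only), chosen here for illustration even though the course materials use R. It uses only the ten films printed at the top, not the full 146, so the estimates will differ from the values reported for the full data. The helper names `mean`, `sd`, and `correlation` are illustrative and not part of the course code.

```python
from math import sqrt

# First ten rows of the tibble shown at the top (NOT the full 146 films).
critics = [74, 85, 80, 18, 14, 63, 42, 86, 99, 89]
audience = [86, 80, 90, 84, 28, 62, 53, 64, 82, 87]


def mean(v):
    return sum(v) / len(v)


def sd(v):
    # Sample standard deviation (divides by n - 1).
    m = mean(v)
    return sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))


def correlation(x, y):
    # Sample correlation coefficient r.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (sd(x) * sd(y))


r = correlation(critics, audience)
b1 = r * sd(audience) / sd(critics)        # slope: b1 = r * sd_Y / sd_X
b0 = mean(audience) - b1 * mean(critics)   # intercept: b0 = Ybar - b1 * Xbar

# Residual = observed outcome minus predicted outcome.
residuals = [y - (b0 + b1 * x) for x, y in zip(critics, audience)]

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}")
print(f"sum of residuals: {sum(residuals):.2e}")
print(f"correlation of residuals with X: {correlation(residuals, critics):.2e}")
```

Up to floating-point rounding, the last two printed values should be zero, and the slope should share its sign with `r`.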
Interpreting the slope
The slope of the model for predicting audience score from critics score is 0.519. Which of the following is the best interpretation of this value?
For every one point increase in the critics score, the audience score goes up by 0.519 points, on average.
For every one point increase in the critics score, we expect the audience score to be higher by 0.519 points, on average.
For every one point increase in the critics score, the audience score goes up by 0.519 points.
For every one point increase in the audience score, the critics score goes up by 0.519 points, on average.
Slope: For every one point increase in the critics score, we expect the audience score to be higher by 0.519 points, on average.
Intercept: If the critics score is 0 points, we expect the audience score to be 32.3 points.
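Prediction: Using the rounded estimates above, a best guess for the audience score of a film that critics rated 73 can be worked out as follows (the result is approximate because the slope and intercept are rounded):

\[
\widehat{\text{audience}} = 32.3 + 0.519 \times \text{critics} = 32.3 + 0.519 \times 73 = 32.3 + 37.887 \approx 70.2
\]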
Is the intercept meaningful?
✅ The intercept is meaningful in the context of the data if
the predictor can feasibly take values equal to or near zero or
the predictor has values near zero in the observed data
🛑 Otherwise, it might not be meaningful!
Statistical inference with a single predictor
Birth weights and gestation time
Null hypothesis framework
Population of interest: All births in the United States
\(H_0: \beta_1= 0\), there is no linear relationship between weight and weeks.
\(H_A: \beta_1 \ne 0\), there is some linear relationship between weight and weeks.
Permutation-based approach
Observed data
Characteristic   Beta   SE      Statistic   95% CI        p-value
(Intercept)      -5.7   1.61    -3.54       -8.9, -2.5    <0.001
weeks            0.34   0.042   8.07        0.25, 0.42    <0.001
Abbreviations: CI = Confidence Interval, SE = Standard Error
Variability of the statistic
Observed statistic vs. null statistics
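The comparison of the observed statistic with null statistics can be sketched as a permutation test. This is a hedged illustration in Python (standard library only) rather than the R workflow the course uses. The births data are not reproduced in these slides, so the ten films printed at the top serve as stand-in data. The seed, the number of repetitions, and the helper `slope` are assumptions made for this example.

```python
import random

# Stand-in data: first ten rows of the film tibble (the births data are not shown here).
critics = [74, 85, 80, 18, 14, 63, 42, 86, 99, 89]
audience = [86, 80, 90, 84, 28, 62, 53, 64, 82, 87]


def slope(x, y):
    # Least squares slope: sum of cross-deviations over sum of squared X deviations.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1234)   # fixed seed so the run is reproducible (arbitrary choice)
n_reps = 1000               # number of permutations (arbitrary choice)

observed = slope(critics, audience)

# Under H0 (slope = 0) the outcome is unrelated to the predictor, so shuffling
# the outcome values generates slopes we could see if H0 were true.
null_slopes = []
for _ in range(n_reps):
    shuffled = audience[:]
    rng.shuffle(shuffled)
    null_slopes.append(slope(critics, shuffled))

# Two-sided p-value: share of null slopes at least as extreme as the observed one.
p_value = sum(abs(s) >= abs(observed) for s in null_slopes) / n_reps

print(f"observed slope: {observed:.3f}")
print(f"permutation p-value: {p_value:.3f}")
```

The spread of `null_slopes` plays the role of the null distribution: the further the observed slope sits in its tails, the smaller the p-value.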
Bootstrap CI for the slope
Characteristic   Beta   SE      Statistic   95% CI        p-value
(Intercept)      6.2    0.708   8.79        4.8, 7.6      <0.001
mage             0.04   0.024   1.50        -0.01, 0.08   0.14
Abbreviations: CI = Confidence Interval, SE = Standard Error
Bootstrap resampling
Bootstrap resampling (repeated)
Bootstrap CI for the slope
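A percentile bootstrap interval for the slope can be sketched as follows. This is a minimal Python illustration (standard library only), not the R code used in the course. Because the births data are not reproduced in these slides, it resamples the ten films printed at the top as stand-in data. The seed, the number of resamples, and the simple index-based percentile rule are assumptions made for this example.

```python
import random

# Stand-in data: first ten rows of the film tibble (the births data are not shown here).
critics = [74, 85, 80, 18, 14, 63, 42, 86, 99, 89]
audience = [86, 80, 90, 84, 28, 62, 53, 64, 82, 87]


def slope(x, y):
    # Least squares slope: sum of cross-deviations over sum of squared X deviations.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2024)   # fixed seed so the run is reproducible (arbitrary choice)
n_reps = 1000               # number of bootstrap resamples (arbitrary choice)
n = len(critics)

boot_slopes = []
while len(boot_slopes) < n_reps:
    # Resample rows (film pairs) with replacement, keeping each pair together.
    idx = [rng.randrange(n) for _ in range(n)]
    xs = [critics[i] for i in idx]
    ys = [audience[i] for i in idx]
    if len(set(xs)) < 2:
        continue  # slope is undefined when every resampled X is identical
    boot_slopes.append(slope(xs, ys))

# 95% percentile interval: cut off roughly 2.5% of the slopes in each tail.
boot_slopes.sort()
lower = boot_slopes[int(0.025 * n_reps)]
upper = boot_slopes[int(0.975 * n_reps) - 1]

print(f"95% bootstrap CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Resampling whole rows, rather than each column separately, preserves the pairing between predictor and outcome that the slope is meant to measure.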
Checking model conditions
Linearity
Independent observations
Nearly normal residuals
Constant or equal variability
Violations of model conditions
Application exercise
ae-16
Instructions
Go to the course GitHub org and find your ae-16 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline (end of the day).
Wrap up
Recap
Linear regression fits a line to the relationship between \(X\) and \(Y\)
It estimates the coefficients (intercept and slope) that minimize the sum of squared residuals
We can use permutation tests and bootstrapping to make inferences about the coefficients
Linear regression has specific assumptions that must be met in order to make valid inferences
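The least squares idea in the recap can be illustrated with a small check: moving the intercept or slope away from the fitted values should increase the sum of squared residuals. This is a hedged sketch in Python (standard library only), using just the ten films printed at the top. The function `sse` and the particular perturbations are hypothetical choices for illustration.

```python
# First ten rows of the film tibble shown at the top (NOT the full 146 films).
critics = [74, 85, 80, 18, 14, 63, 42, 86, 99, 89]
audience = [86, 80, 90, 84, 28, 62, 53, 64, 82, 87]


def sse(b0, b1, x, y):
    # Sum of squared residuals for the line y = b0 + b1 * x.
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares estimates from the closed-form formulas.
mx = sum(critics) / len(critics)
my = sum(audience) / len(audience)
b1 = sum((x - mx) * (y - my) for x, y in zip(critics, audience)) / sum(
    (x - mx) ** 2 for x in critics
)
b0 = my - b1 * mx

best = sse(b0, b1, critics, audience)
print(f"fitted line: intercept = {b0:.3f}, slope = {b1:.3f}, SSE = {best:.1f}")

# Nudge the coefficients and recompute the SSE; each nudge should make it larger.
perturbations = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.05), (0.0, -0.05), (2.0, -0.03)]
perturbed = [sse(b0 + d0, b1 + d1, critics, audience) for d0, d1 in perturbations]
for (d0, d1), value in zip(perturbations, perturbed):
    print(f"intercept {d0:+.2f}, slope {d1:+.2f}: SSE = {value:.1f}")
```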