AE 17: Traffic fines and electoral politics - regression with multiple predictors

Suggested answers

Application exercise
Answers
Modified

March 26, 2026

Last class we worked with a dataset on traffic fines revenue in California counties to investigate the question of whether or not politicians manipulate government policy during electoral years in an effort to gain an electoral advantage. We estimated a simple linear regression model with one predictor variable and found that there was not a statistically significant relationship between whether or not it is an election year and per capita traffic fines revenue. However, when we checked the assumptions of the linear model, we found that several of these assumptions were violated, casting doubt on our original results.

In this exercise we will build a series of more robust models to better evaluate the hypotheses.

The replication data file can be found in data/traffic_fines.csv. Let’s load the data and take a look at the summary statistics.1

traffic_fines <- read_csv("data/traffic_fines.csv")
Rows: 1025 Columns: 95
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): county_name, elec_dummy
dbl (93): year, county_code, vehicle_code_fines, vehicle_code_fines_i_p, she...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(traffic_fines)
Data summary
Name traffic_fines
Number of rows 1025
Number of columns 95
_______________________
Column type frequency:
character 2
numeric 93
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
county_name 0 1 4 15 0 57 0
elec_dummy 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2011.49 5.19 2003.00 2007.00 2011.00 2016.00 2020.00 ▇▆▇▆▇
county_code 0 1.00 29.37 16.86 1.00 15.00 29.00 44.00 58.00 ▇▇▇▇▇
vehicle_code_fines 0 1.00 1640704.86 3145153.92 0.00 68944.00 429544.00 1545628.00 21712704.00 ▇▁▁▁▁
vehicle_code_fines_i_p 0 1.00 3.90 7.23 0.00 0.43 2.16 4.87 108.73 ▇▁▁▁▁
sheriff_incumb 807 0.21 0.75 0.44 0.00 0.25 1.00 1.00 1.00 ▃▁▁▁▇
pre_elec 0 1.00 1.61 1.11 0.00 1.00 2.00 3.00 3.00 ▆▆▁▇▇
no_incumb 163 0.84 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
incumb 55 0.95 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
sheriff_margin 918 0.10 0.24 0.17 0.00 0.10 0.23 0.32 0.97 ▇▇▂▁▁
rep_share 0 1.00 46.75 13.67 14.66 36.42 48.13 57.27 74.83 ▂▅▅▇▂
dem_share 0 1.00 49.60 13.48 21.13 39.06 47.57 59.62 82.33 ▂▇▆▅▂
otherparty_share 0 1.00 3.66 2.58 0.80 1.88 2.54 5.66 16.00 ▇▂▂▁▁
white_share 0 1.00 57.83 19.31 12.67 40.35 56.92 75.99 89.98 ▁▇▆▆▇
asian_share 0 1.00 6.03 6.70 0.21 1.27 3.58 7.04 32.82 ▇▁▁▁▁
black_share 0 1.00 3.04 3.19 0.00 0.81 1.82 3.50 14.65 ▇▂▁▁▁
hispanic_share 0 1.00 28.49 17.22 5.15 13.41 24.73 42.22 82.03 ▇▆▃▂▁
other_share 0 1.00 4.61 2.93 1.55 2.89 3.75 5.52 21.52 ▇▂▁▁▁
young_drivers 0 1.00 14.24 2.73 8.65 12.10 14.22 15.81 25.39 ▅▇▅▁▁
density 0 1.00 364.64 694.47 1.53 24.97 100.92 321.29 4070.64 ▇▁▁▁▁
areain_square_miles 0 1.00 2781.69 3096.29 440.00 1003.00 1598.00 3510.00 20164.00 ▇▁▁▁▁
med_inc 0 1.00 55598.49 17114.92 28533.00 43237.00 52078.00 63398.00 139462.00 ▇▇▂▁▁
unemp 0 1.00 8.72 3.99 2.10 5.70 8.00 10.70 28.90 ▇▇▂▁▁
own_source_share 0 1.00 43.67 16.07 15.17 32.46 41.09 49.33 97.96 ▃▇▃▁▁
emp_goods 0 1.00 43390.45 85018.15 0.00 2107.00 13093.00 47344.00 642230.00 ▇▁▁▁▁
emp_service 0 1.00 175746.56 442315.84 0.00 6374.00 34955.00 124505.00 3439959.00 ▇▁▁▁▁
pay_goods_i 0 1.00 39124.06 16100.25 0.00 29667.93 35001.89 43053.98 160212.25 ▅▇▁▁▁
pay_service_i 0 1.00 31512.13 11371.71 0.00 25082.33 28896.69 34273.00 116676.48 ▂▇▁▁▁
arte_share 56 0.95 1.75 2.30 0.00 0.60 1.11 1.83 17.49 ▇▁▁▁▁
collect_share 56 0.95 1.75 1.85 0.00 0.80 1.22 1.94 14.79 ▇▁▁▁▁
cnty_le_sworn_1000p 0 1.00 1.93 1.34 0.72 1.39 1.65 2.02 12.89 ▇▁▁▁▁
felony_tot_1000p 0 1.00 12.70 4.58 4.08 9.48 11.97 15.49 34.35 ▅▇▃▁▁
misdemeanor_tot_1000p 0 1.00 29.72 11.26 8.18 21.48 28.20 36.72 156.12 ▇▂▁▁▁
forfeitures_i_p 0 1.00 3.34 6.06 0.00 0.43 1.98 4.61 154.06 ▇▁▁▁▁
other_court_fines_i_p 0 1.00 10.81 16.44 0.00 2.79 6.36 11.62 154.86 ▇▁▁▁▁
delinquent_fines_i_p 0 1.00 11.42 14.56 0.00 5.09 8.59 13.24 190.04 ▇▁▁▁▁
i_trend_1 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_2 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_3 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_4 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_5 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_6 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_7 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_8 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_9 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_10 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_11 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_12 0 1.00 0.15 1.31 0.00 0.00 0.00 0.00 17.00 ▇▁▁▁▁
i_trend_13 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_14 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_15 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_16 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_17 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_18 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_19 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_20 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_21 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_22 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_23 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_24 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_25 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_26 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_27 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_28 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_29 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_30 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_31 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_32 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_33 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_34 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_35 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_36 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_37 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_38 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_39 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_40 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_41 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_42 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_43 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_44 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_45 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_46 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_47 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_48 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_49 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_50 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_51 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_52 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_53 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_54 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_55 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_56 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_57 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
vic_margin 807 0.21 0.12 0.17 0.00 0.00 0.00 0.22 0.97 ▇▂▁▁▁

Review: why was the original model bad?

Recall the question we sought to answer last class: do politicians manipulate government policy during electoral years in an effort to gain an electoral advantage? We used a simple linear regression model to evaluate the hypotheses:

  • Null hypothesis: There is no linear relationship between whether or not it is an election year and per capita traffic fines revenue.

    \[H_0: \beta_1 = 0\]

  • Alternative hypothesis: There is some linear relationship between whether or not it is an election year and per capita traffic fines revenue.

    \[H_A: \beta_1 \neq 0\]

# fit the model
fines_elec_dummy_fit <- linear_reg() |>
  fit(vehicle_code_fines_i_p ~ elec_dummy, data = traffic_fines)

tidy(fines_elec_dummy_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.90       0.256   15.2    2.84e-47
2 elec_dummyYes   0.0171     0.543    0.0315 9.75e- 1
# evaluate the hypotheses
# calculate observed fit
obs_fit <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ elec_dummy) |>
  fit()

# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
# A tibble: 2 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes   0.928
2 intercept       0.928

When we checked the linearity assumptions for this model, we discovered some problems.

NoteRobustness tests

Recall the technical conditions for linear regression:

  • L: linear model
  • I: independent observations
  • N: points are normally distributed around the line
  • E: equal variability around the line for all values of the explanatory variable
# augment() allows us to extract observation-level statistics from a model object
fines_elec_dummy_aug <- augment(fines_elec_dummy_fit, new_data = traffic_fines)
fines_elec_dummy_aug
# A tibble: 1,025 × 97
   .pred .resid  year county_code county_name vehicle_code_fines
   <dbl>  <dbl> <dbl>       <dbl> <chr>                    <dbl>
 1  3.90 -0.507  2003           1 Alameda                4973766
 2  3.90 -0.107  2004           1 Alameda                5701454
 3  3.90 -0.899  2005           1 Alameda                4645572
 4  3.91 -0.639  2006           1 Alameda                5256212
 5  3.90 -0.591  2007           1 Alameda                5498544
 6  3.90 -0.388  2008           1 Alameda                6124421
 7  3.90 -0.276  2009           1 Alameda                6346925
 8  3.91 -0.713  2010           1 Alameda                5753802
 9  3.90 -1.15   2011           1 Alameda                5154709
10  3.90 -1.63   2012           1 Alameda                4409751
# ℹ 1,015 more rows
# ℹ 91 more variables: vehicle_code_fines_i_p <dbl>, elec_dummy <chr>,
#   sheriff_incumb <dbl>, pre_elec <dbl>, no_incumb <dbl>, incumb <dbl>,
#   sheriff_margin <dbl>, rep_share <dbl>, dem_share <dbl>,
#   otherparty_share <dbl>, white_share <dbl>, asian_share <dbl>,
#   black_share <dbl>, hispanic_share <dbl>, other_share <dbl>,
#   young_drivers <dbl>, density <dbl>, areain_square_miles <dbl>, …
# the linear regression model
ggplot(
  data = fines_elec_dummy_aug,
  mapping = aes(
    x = as.numeric(elec_dummy == "Yes"),
    y = vehicle_code_fines_i_p
  )
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# distribution of the residuals
ggplot(data = fines_elec_dummy_aug, mapping = aes(x = .resid)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# use the .resid column to plot the predicted values vs. the residuals
# jitter because the explanatory variable only has 2 unique values
ggplot(data = fines_elec_dummy_aug, mapping = aes(x = .pred, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed")

  • L: linear model

    Doesn’t seem very linear in the relationship. The fact that we have a categorical variable as the predictor is not inherently a problem, but it doesn’t seem like there is a straight, monotonic relationship between the predictor and the response.

  • I: independent observations

    Absolutely no. There are many reasons why the observations are not independent. The obvious reason is that it is a time series cross-sectional (TSCS) panel structure. Each county is observed over multiple years, and the observations within a county are likely to be correlated. Alternatively, each year is observed over multiple counties, and the observations within a year are likely to be correlated.

  • N: points are normally distributed around the line

    No. The boxplot earlier shows there are many outliers in the data. The residuals are not normally distributed around the line.

  • E: equal variability around the line for all values of the explanatory variable

    No. The residuals are not equally variable around the line for all values of the explanatory variable. The residuals are more variable for counties in an election year than for counties not in an election year.

    fines_elec_dummy_aug |>
      group_by(elec_dummy) |>
      summarize(var = var(.resid))
    # A tibble: 2 × 2
      elec_dummy   var
      <chr>      <dbl>
    1 No          48.9
    2 Yes         64.2

Today we will build a series of more robust models to better evaluate the hypotheses.

Estimate a multiple variables model

One possible problem with our model yesterday was omitted variable bias. Perhaps we omitted a crucial variable that explains the differences in per capita traffic fine revenue. Since median household income seemed to be predictive of the outcome, let’s estimate a model that includes both whether or not it is an election year and median household income.

Demo: Estimate a model that includes both elec_dummy and med_inc.

fines_inc_elec_fit <- linear_reg() |>
  fit(vehicle_code_fines_i_p ~ med_inc + elec_dummy, data = traffic_fines)

tidy(fines_inc_elec_fit)
# A tibble: 3 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    6.93      0.770        8.99   1.16e-18
2 med_inc       -0.0000546 0.0000131   -4.17   3.37e- 5
3 elec_dummyYes  0.0400    0.539        0.0743 9.41e- 1

Your turn: Interpret the coefficients of the model above.

  • Intercept: Add response here.

    Counties not in an election year and with median household income of $0 are expected to earn, on average, $6.93 per capita in traffic fines revenue.

  • Coefficient for median household income:

    All else held constant, for each $1,000 increase in median household income, per capita traffic fines revenue is expected to decrease by $0.055.

  • Coefficient for whether or not it is an election year:

    All else held constant, counties in an election year are expected to earn, on average, $0.04 more per capita in traffic fines revenue than counties not in an election year.

Hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

  • Median household income

    • Null hypothesis: There is no linear relationship between median household income and per capita traffic fines revenue.

      \[H_0: \beta_1 = 0\]

    • Alternative hypothesis: There is some linear relationship between median household income and per capita traffic fines revenue.

      \[H_A: \beta_1 \neq 0\]

  • Whether or not it is an election year

    • Null hypothesis: There is no linear relationship between whether or not it is an election year and per capita traffic fines revenue.

      \[H_0: \beta_2 = 0\]

    • Alternative hypothesis: There is some linear relationship between whether or not it is an election year and per capita traffic fines revenue.

      \[H_A: \beta_2 \neq 0\]

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ med_inc + elec_dummy) |>
  fit()

# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ med_inc + elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 3 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes   0.842
2 intercept       0    
3 med_inc         0    

Your turn: Interpret the results of the hypothesis test. Use a significance level of 5%.

  • With respect to median household income, the probability of observing a relationship as strong as the one in the data if the null hypothesis is true is less than \(0.001\). Since this is less than 0.05, we reject the null hypothesis and conclude that there is a relationship between median household income and per capita traffic fines revenue.
  • With respect to whether or not it is an election year, the probability of observing a relationship as strong as the one in the data if the null hypothesis is true is \(0.842\). Since this is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is a relationship between whether or not it is an election year and per capita traffic fines revenue.

Transform a variable

Let’s examine the per capita traffic fines revenue.

Your turn: Run the code chunk below and create two separate plots. How are the two plots different than each other? Which plot’s better depicts an unskewed variable?

# Plot A
ggplot(
  data = traffic_fines,
  mapping = aes(x = vehicle_code_fines_i_p)
) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Plot B
ggplot(
  data = traffic_fines,
  mapping = aes(x = vehicle_code_fines_i_p)
) +
  geom_histogram() +
  scale_x_log10()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 87 rows containing non-finite outside the scale range
(`stat_bin()`).

Fit the model

Demo: Fit a linear regression model with the transformed outcome variable.

# calculate natural-log transformed outcome
traffic_fines <- traffic_fines |>
  mutate(
    log_vehicle_code_fines_i_p = log(vehicle_code_fines_i_p),
    # replace -Inf with NA - these were counties with $0 in traffic fine revenue
    log_vehicle_code_fines_i_p = if_else(
      log_vehicle_code_fines_i_p == -Inf,
      NA,
      log_vehicle_code_fines_i_p
    )
  )

fines_elec_log_fit <- linear_reg() |>
  fit(log_vehicle_code_fines_i_p ~ elec_dummy, data = traffic_fines)

tidy(fines_elec_log_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      0.544    0.0657     8.28  4.34e-16
2 elec_dummyYes   -0.117    0.139     -0.845 3.98e- 1

Your turn: Interpret the coefficients of the model above.

  • Intercept: Counties not in an election year are expected to earn, on average, \(e^{0\\.544} \approx \$1\\.72\) per capita in traffic fines revenue.

  • Slope: Counties in an election year are expected to earn, on average, \(100 \times [e^{−0\\.117} - 1] \approx 11\\.05%\)% less per capita in traffic fines revenue than those not in an election year.

Conduct a hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy) |>
  fit()
Warning: Removed 87 rows containing missing values.
# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()
Warning: Removed 87 rows containing missing values.
# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
# A tibble: 2 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes    0.35
2 intercept        0.35

Your turn: Interpret the \(p\)-value in context of the data and the research question. Use a significance level of 5%.

If in fact the true relationship between whether or not it is an election and per capita traffic fines revenue is zero, the probability of observing a relationship as strong as the one in the data is \(0.35\). Since this is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is a relationship between whether or not it is an election and per capita traffic fines revenue.

What about other differences?

So far we have only estimated a model with at most two explanatory variables. Yet there are many other variables in the data set that could explain the variation in per capita traffic fines revenue.

Recall that the data set is a panel structure - we observe multiple counties over multiple years. This violates the linear model assumption of independence of the observations. Consider the most obvious example - traffic fines revenue within a single county is likely to be correlated across years (e.g. knowing how much revenue was generated in Los Angeles County this year allows me to make a better guess for next year’s revenue).

Fixed effects

One way we can account for this by estimating a fixed effects model. This model includes a separate coefficient for each county, which allows us to account for the correlation in the data.

Demo: Let’s estimate a model that includes elec_dummy and a fixed effect for each county. We’ll also use the log-transformed outcome of interest.

fines_fe_fit <- linear_reg() |>
  fit(
    log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name),
    data = traffic_fines
  )

tidy(fines_fe_fit)
# A tibble: 57 × 5
   term                            estimate std.error statistic  p.value
   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                       0.963     0.214     4.50   7.65e- 6
 2 elec_dummyYes                    -0.0981    0.0708   -1.38   1.67e- 1
 3 factor(county_name)Alpine         1.07      0.385     2.77   5.69e- 3
 4 factor(county_name)Amador        -1.11      0.323    -3.44   6.03e- 4
 5 factor(county_name)Butte         -0.832     0.458    -1.82   6.95e- 2
 6 factor(county_name)Calaveras      0.402     0.302     1.33   1.84e- 1
 7 factor(county_name)Colusa         0.0297    0.302     0.0984 9.22e- 1
 8 factor(county_name)Contra Costa  -0.471     0.302    -1.56   1.19e- 1
 9 factor(county_name)Del Norte     -5.36      0.370   -14.5    7.97e-43
10 factor(county_name)El Dorado     -0.177     0.302    -0.585  5.58e- 1
# ℹ 47 more rows

Your turn: Interpret the coefficient for elec_dummy.

Holding county constant, counties in an election year are expected to earn, on average, \(100 \times [e^{−0\\.098} - 1] \approx 9\\.34%\)% less per capita in traffic fines revenue than those not in an election year.

Conduct a hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name)) |>
  fit()
Warning: Removed 87 rows containing missing values.
# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name)) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()
Warning: Removed 87 rows containing missing values.
# calculate p-value
get_p_value(null_dist, obs_fit, direction = "both")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 57 × 2
   term                            p_value
   <chr>                             <dbl>
 1 elec_dummyYes                     0.476
 2 factor(county_name)Alpine         0.158
 3 factor(county_name)Amador         0.072
 4 factor(county_name)Butte          0.298
 5 factor(county_name)Calaveras      0.496
 6 factor(county_name)Colusa         0.944
 7 factor(county_name)Contra Costa   0.402
 8 factor(county_name)Del Norte      0    
 9 factor(county_name)El Dorado      0.756
10 factor(county_name)Fresno         0.762
# ℹ 47 more rows

Your turn: Interpret the results of the hypothesis test in context of the data and the research question. Use a significance level of 5%.

If in fact the true relationship between whether or not it is an election and per capita traffic fines revenue is zero, the probability of observing a relationship as strong as the one in the data is \(0.476\). Since this is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is a relationship between whether or not it is an election and per capita traffic fines revenue.

Include everything

Finally, let’s estimate a model that includes all the variables in the data set.

# keep all variables used in original analysis - table 1 column 2
tf_replication <- traffic_fines |>
  select(
    log_vehicle_code_fines_i_p,
    elec_dummy,
    dem_share,
    otherparty_share,
    asian_share,
    black_share,
    hispanic_share,
    other_share,
    young_drivers,
    density,
    med_inc,
    unemp,
    own_source_share,
    emp_goods,
    emp_service,
    pay_goods_i,
    pay_service_i,
    arte_share,
    collect_share,
    cnty_le_sworn_1000p,
    felony_tot_1000p,
    misdemeanor_tot_1000p,
    county_name,
    # separate columns for each county to include the year trend
    starts_with("i_trend")
  )

fines_all_fit <- linear_reg() |>
  # use all remaining columns as predictors
  fit(log_vehicle_code_fines_i_p ~ ., data = tf_replication)

tidy(fines_all_fit)
# A tibble: 134 × 5
   term              estimate std.error statistic      p.value
   <chr>                <dbl>     <dbl>     <dbl>        <dbl>
 1 (Intercept)       3.13      10.7        0.294  0.769       
 2 elec_dummyYes    -0.0895     0.0505    -1.77   0.0767      
 3 dem_share        -0.0275     0.0128    -2.14   0.0327      
 4 otherparty_share -0.0285     0.0165    -1.73   0.0836      
 5 asian_share      -0.0298     0.182     -0.164  0.870       
 6 black_share      -1.05       0.346     -3.03   0.00252     
 7 hispanic_share    0.450      0.0795     5.66   0.0000000215
 8 other_share       1.30       0.414      3.14   0.00175     
 9 young_drivers     0.124      0.0620     2.00   0.0459      
10 density           0.000306   0.00448    0.0683 0.946       
# ℹ 124 more rows

Your turn: Interpret the coefficient for elec_dummy.

Holding all other variables constant, counties in an election year are expected to earn, on average, \(100 \times [e^{−0\\.089} - 1] \approx 8\\.56%\)% less per capita in traffic fines revenue than those not in an election year.

Conduct a hypothesis test

At this point it seems reasonable that we have met the assumptions of the linear model. Let’s conduct a hypothesis test to evaluate the relationship between whether or not it is an election year and per capita traffic fines revenue. Instead of relying on the permutation-based approach, we will use the standard \(t\)-test that was reported when we estimated the model above.

Your turn: Interpret the results of the hypothesis test in context of the data and the research question. Use a significance level of 5%.

If in fact the true relationship between whether or not it is an election and per capita traffic fines revenue is zero, the probability of observing a relationship as strong as the one in the data is \(0.0767\). Since this is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is a relationship between whether or not it is an election and per capita traffic fines revenue.

Comparing across models

Let’s compare the models we estimated today. Remember that we are interested in the model simplest best model.

glance(fines_elec_dummy_fit)
# A tibble: 1 × 12
    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
        <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1 0.000000968     -0.000977  7.23  0.000990   0.975     1 -3481. 6969. 6983.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_inc_elec_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0167        0.0148  7.17      8.67 0.000184     2 -3473. 6953. 6973.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_elec_log_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.000763     -0.000305  1.77     0.715   0.398     1 -1867. 3739. 3754.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_fe_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.754         0.739 0.906      48.3 7.62e-229    56 -1209. 2533. 2814.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_all_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.896         0.878 0.610      49.5 4.38e-297   132  -750. 1769. 2411.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Your turn: Which model do you believe is most appropriate? Why?

The model that includes all variables is the most appropriate. The fixed-effects model is potentially a good choice with a high adjusted-\(R^2\) value, but the model that also includes the control variables has an adjusted-\(R^2\) value nearly 14 points higher. In this situation, the additional complexity of the model is outweighed by the improvement in performance.

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.2 (2025-10-31)
 os       macOS Tahoe 26.4
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2026-04-02
 pandoc   3.4 @ /usr/local/bin/ (via rmarkdown)
 quarto   1.9.36 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version    date (UTC) lib source
 P archive        1.1.12.1   2025-12-24 [?] RSPM
 P backports      1.5.0      2024-05-23 [?] RSPM (R 4.5.0)
 P base64enc      0.1-3      2015-07-28 [?] RSPM (R 4.5.0)
 P bit            4.6.0      2025-03-06 [?] RSPM (R 4.5.0)
 P bit64          4.6.0-1    2025-01-16 [?] RSPM (R 4.5.0)
 P broom        * 1.0.11     2025-12-04 [?] RSPM (R 4.5.0)
 P class          7.3-23     2025-01-01 [?] RSPM (R 4.5.0)
 P cli            3.6.5      2025-04-23 [?] RSPM (R 4.5.0)
 P codetools      0.2-20     2024-03-31 [?] RSPM (R 4.5.0)
 P crayon         1.5.3      2024-06-20 [?] RSPM (R 4.5.0)
 P data.table     1.18.0     2025-12-24 [?] RSPM (R 4.5.0)
 P dials        * 1.4.2      2025-09-04 [?] RSPM
 P DiceDesign     1.10       2023-12-07 [?] RSPM (R 4.5.0)
 P digest         0.6.39     2025-11-19 [?] RSPM (R 4.5.0)
 P dplyr        * 1.2.0      2026-02-03 [?] RSPM
 P evaluate       1.0.5      2025-08-27 [?] RSPM (R 4.5.0)
 P farver         2.1.2      2024-05-13 [?] RSPM (R 4.5.0)
 P fastmap        1.2.0      2024-05-15 [?] RSPM (R 4.5.0)
 P forcats      * 1.0.1      2025-09-25 [?] RSPM (R 4.5.0)
 P furrr          0.3.1      2022-08-15 [?] RSPM
 P future         1.68.0     2025-11-17 [?] RSPM
 P future.apply   1.20.1     2025-12-09 [?] CRAN (R 4.5.2)
 P generics       0.1.4      2025-05-09 [?] RSPM (R 4.5.0)
 P ggplot2      * 4.0.1      2025-11-14 [?] RSPM (R 4.5.0)
 P globals        0.18.0     2025-05-08 [?] RSPM
 P glue           1.8.0      2024-09-30 [?] RSPM (R 4.5.0)
 P gower          1.0.2      2024-12-17 [?] RSPM
 P GPfit          1.0-9      2025-04-12 [?] RSPM (R 4.5.0)
 P gtable         0.3.6      2024-10-25 [?] RSPM (R 4.5.0)
 P hardhat        1.4.2      2025-08-20 [?] RSPM
 P here           1.0.2      2025-09-15 [?] CRAN (R 4.5.0)
 P hms            1.1.4      2025-10-17 [?] RSPM (R 4.5.0)
 P htmltools      0.5.9      2025-12-04 [?] RSPM (R 4.5.0)
 P htmlwidgets    1.6.4      2023-12-06 [?] RSPM (R 4.5.0)
 P infer        * 1.1.0      2025-12-18 [?] CRAN (R 4.5.2)
 P ipred          0.9-15     2024-07-18 [?] RSPM
 P jsonlite       2.0.0      2025-03-27 [?] RSPM (R 4.5.0)
 P knitr          1.51       2025-12-20 [?] RSPM (R 4.5.0)
 P labeling       0.4.3      2023-08-29 [?] RSPM (R 4.5.0)
 P lattice        0.22-7     2025-04-02 [?] RSPM (R 4.5.0)
 P lava           1.8.2      2025-10-30 [?] RSPM
 P lhs            1.2.0      2024-06-30 [?] RSPM (R 4.5.0)
 P lifecycle      1.0.5      2026-01-08 [?] RSPM (R 4.5.0)
 P listenv        0.10.0     2025-11-02 [?] RSPM
 P lubridate    * 1.9.4      2024-12-08 [?] RSPM (R 4.5.0)
 P magrittr       2.0.4      2025-09-12 [?] RSPM (R 4.5.0)
 P MASS           7.3-65     2025-02-28 [?] RSPM (R 4.5.0)
 P Matrix         1.7-4      2025-08-28 [?] RSPM (R 4.5.0)
 P mgcv           1.9-4      2025-11-07 [?] CRAN (R 4.5.0)
 P modeldata    * 1.5.1      2025-08-22 [?] RSPM
 P nlme           3.1-168    2025-03-31 [?] RSPM (R 4.5.0)
 P nnet           7.3-20     2025-01-01 [?] RSPM (R 4.5.0)
 P otel           0.2.0      2025-08-29 [?] RSPM (R 4.5.0)
 P parallelly     1.46.0     2025-12-12 [?] CRAN (R 4.5.2)
 P parsnip      * 1.4.0      2025-12-01 [?] CRAN (R 4.5.2)
 P patchwork      1.3.2      2025-08-25 [?] RSPM
 P pillar         1.11.1     2025-09-17 [?] RSPM (R 4.5.0)
 P pkgconfig      2.0.3      2019-09-22 [?] RSPM (R 4.5.0)
 P prodlim        2025.04.28 2025-04-28 [?] RSPM
 P purrr        * 1.2.0      2025-11-04 [?] CRAN (R 4.5.0)
 P R6             2.6.1      2025-02-15 [?] RSPM (R 4.5.0)
 P RColorBrewer   1.1-3      2022-04-03 [?] RSPM (R 4.5.0)
 P Rcpp           1.1.0      2025-07-02 [?] RSPM (R 4.5.0)
 P readr        * 2.1.6      2025-11-14 [?] RSPM (R 4.5.0)
 P recipes      * 1.3.1      2025-05-21 [?] RSPM
   renv           1.1.5      2025-07-24 [1] RSPM (R 4.5.0)
 P repr           1.1.7      2024-03-22 [?] RSPM
 P rlang          1.1.7      2026-01-09 [?] RSPM (R 4.5.0)
 P rmarkdown      2.30       2025-09-28 [?] RSPM (R 4.5.0)
 P rpart          4.1.24     2025-01-07 [?] RSPM (R 4.5.0)
 P rprojroot      2.1.1      2025-08-26 [?] RSPM (R 4.5.0)
 P rsample      * 1.3.1      2025-07-29 [?] RSPM
 P rstudioapi     0.17.1     2024-10-22 [?] RSPM (R 4.5.0)
 P S7             0.2.1      2025-11-14 [?] RSPM (R 4.5.0)
 P scales       * 1.4.0      2025-04-24 [?] RSPM (R 4.5.0)
 P sessioninfo    1.2.3      2025-02-05 [?] RSPM (R 4.5.0)
 P skimr        * 2.2.1      2025-07-26 [?] RSPM
 P sparsevctrs    0.3.5      2025-12-09 [?] CRAN (R 4.5.2)
 P stringi        1.8.7      2025-03-27 [?] RSPM (R 4.5.0)
 P stringr      * 1.6.0      2025-11-04 [?] RSPM (R 4.5.0)
 P survival       3.8-3      2024-12-17 [?] RSPM (R 4.5.0)
 P tailor       * 0.1.0      2025-08-25 [?] RSPM
 P tibble       * 3.3.0      2025-06-08 [?] RSPM (R 4.5.0)
 P tidymodels   * 1.4.1      2025-09-08 [?] RSPM
 P tidyr        * 1.3.2      2025-12-19 [?] RSPM (R 4.5.0)
 P tidyselect     1.2.1      2024-03-11 [?] RSPM (R 4.5.0)
 P tidyverse    * 2.0.0      2023-02-22 [?] RSPM (R 4.5.0)
 P timechange     0.3.0      2024-01-18 [?] RSPM (R 4.5.0)
 P timeDate       4051.111   2025-10-17 [?] RSPM
 P tune         * 2.0.1      2025-10-17 [?] RSPM
 P tzdb           0.5.0      2025-03-15 [?] RSPM (R 4.5.0)
 P utf8           1.2.6      2025-06-08 [?] RSPM (R 4.5.0)
 P vctrs          0.7.1      2026-01-23 [?] RSPM
 P vroom          1.6.7      2025-11-28 [?] RSPM (R 4.5.0)
 P withr          3.0.2      2024-10-28 [?] RSPM (R 4.5.0)
 P workflows    * 1.3.0      2025-08-27 [?] RSPM
 P workflowsets * 1.1.1      2025-05-27 [?] RSPM
 P xfun           0.55       2025-12-16 [?] CRAN (R 4.5.2)
 P yaml           2.3.12     2025-12-10 [?] RSPM (R 4.5.0)
 P yardstick    * 1.3.2      2025-01-22 [?] RSPM

 [1] /Users/bcs88/Projects/info-2950/course-site/renv/library/macos/R-4.5/aarch64-apple-darwin20
 [2] /Users/bcs88/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.5/aarch64-apple-darwin20/4cd76b74

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. The codebook is available from Dataverse. The data set has been lightly cleaned for the application exercise.↩︎