AE 17: Traffic fines and electoral politics - regression with multiple predictors

Application exercise
Modified

March 26, 2026

Last class we worked with a dataset on traffic fines revenue in California counties to investigate the question of whether or not politicians manipulate government policy during electoral years in an effort to gain an electoral advantage. We estimated a simple linear regression model with one predictor variable and found that there was not a statistically significant relationship between whether or not it is an election year and per capita traffic fines revenue. However, when we checked the assumptions of the linear model, we found that several of these assumptions were violated, casting doubt on our original results.

In this exercise we will build a series of more robust models to better evaluate the hypotheses.

The replication data file can be found in data/traffic_fines.csv. Let’s load the data and take a look at the summary statistics.1

traffic_fines <- read_csv("data/traffic_fines.csv")
Rows: 1025 Columns: 95
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): county_name, elec_dummy
dbl (93): year, county_code, vehicle_code_fines, vehicle_code_fines_i_p, she...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(traffic_fines)
Data summary
Name traffic_fines
Number of rows 1025
Number of columns 95
_______________________
Column type frequency:
character 2
numeric 93
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
county_name 0 1 4 15 0 57 0
elec_dummy 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2011.49 5.19 2003.00 2007.00 2011.00 2016.00 2020.00 ▇▆▇▆▇
county_code 0 1.00 29.37 16.86 1.00 15.00 29.00 44.00 58.00 ▇▇▇▇▇
vehicle_code_fines 0 1.00 1640704.86 3145153.92 0.00 68944.00 429544.00 1545628.00 21712704.00 ▇▁▁▁▁
vehicle_code_fines_i_p 0 1.00 3.90 7.23 0.00 0.43 2.16 4.87 108.73 ▇▁▁▁▁
sheriff_incumb 807 0.21 0.75 0.44 0.00 0.25 1.00 1.00 1.00 ▃▁▁▁▇
pre_elec 0 1.00 1.61 1.11 0.00 1.00 2.00 3.00 3.00 ▆▆▁▇▇
no_incumb 163 0.84 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
incumb 55 0.95 0.17 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
sheriff_margin 918 0.10 0.24 0.17 0.00 0.10 0.23 0.32 0.97 ▇▇▂▁▁
rep_share 0 1.00 46.75 13.67 14.66 36.42 48.13 57.27 74.83 ▂▅▅▇▂
dem_share 0 1.00 49.60 13.48 21.13 39.06 47.57 59.62 82.33 ▂▇▆▅▂
otherparty_share 0 1.00 3.66 2.58 0.80 1.88 2.54 5.66 16.00 ▇▂▂▁▁
white_share 0 1.00 57.83 19.31 12.67 40.35 56.92 75.99 89.98 ▁▇▆▆▇
asian_share 0 1.00 6.03 6.70 0.21 1.27 3.58 7.04 32.82 ▇▁▁▁▁
black_share 0 1.00 3.04 3.19 0.00 0.81 1.82 3.50 14.65 ▇▂▁▁▁
hispanic_share 0 1.00 28.49 17.22 5.15 13.41 24.73 42.22 82.03 ▇▆▃▂▁
other_share 0 1.00 4.61 2.93 1.55 2.89 3.75 5.52 21.52 ▇▂▁▁▁
young_drivers 0 1.00 14.24 2.73 8.65 12.10 14.22 15.81 25.39 ▅▇▅▁▁
density 0 1.00 364.64 694.47 1.53 24.97 100.92 321.29 4070.64 ▇▁▁▁▁
areain_square_miles 0 1.00 2781.69 3096.29 440.00 1003.00 1598.00 3510.00 20164.00 ▇▁▁▁▁
med_inc 0 1.00 55598.49 17114.92 28533.00 43237.00 52078.00 63398.00 139462.00 ▇▇▂▁▁
unemp 0 1.00 8.72 3.99 2.10 5.70 8.00 10.70 28.90 ▇▇▂▁▁
own_source_share 0 1.00 43.67 16.07 15.17 32.46 41.09 49.33 97.96 ▃▇▃▁▁
emp_goods 0 1.00 43390.45 85018.15 0.00 2107.00 13093.00 47344.00 642230.00 ▇▁▁▁▁
emp_service 0 1.00 175746.56 442315.84 0.00 6374.00 34955.00 124505.00 3439959.00 ▇▁▁▁▁
pay_goods_i 0 1.00 39124.06 16100.25 0.00 29667.93 35001.89 43053.98 160212.25 ▅▇▁▁▁
pay_service_i 0 1.00 31512.13 11371.71 0.00 25082.33 28896.69 34273.00 116676.48 ▂▇▁▁▁
arte_share 56 0.95 1.75 2.30 0.00 0.60 1.11 1.83 17.49 ▇▁▁▁▁
collect_share 56 0.95 1.75 1.85 0.00 0.80 1.22 1.94 14.79 ▇▁▁▁▁
cnty_le_sworn_1000p 0 1.00 1.93 1.34 0.72 1.39 1.65 2.02 12.89 ▇▁▁▁▁
felony_tot_1000p 0 1.00 12.70 4.58 4.08 9.48 11.97 15.49 34.35 ▅▇▃▁▁
misdemeanor_tot_1000p 0 1.00 29.72 11.26 8.18 21.48 28.20 36.72 156.12 ▇▂▁▁▁
forfeitures_i_p 0 1.00 3.34 6.06 0.00 0.43 1.98 4.61 154.06 ▇▁▁▁▁
other_court_fines_i_p 0 1.00 10.81 16.44 0.00 2.79 6.36 11.62 154.86 ▇▁▁▁▁
delinquent_fines_i_p 0 1.00 11.42 14.56 0.00 5.09 8.59 13.24 190.04 ▇▁▁▁▁
i_trend_1 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_2 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_3 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_4 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_5 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_6 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_7 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_8 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_9 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_10 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_11 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_12 0 1.00 0.15 1.31 0.00 0.00 0.00 0.00 17.00 ▇▁▁▁▁
i_trend_13 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_14 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_15 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_16 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_17 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_18 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_19 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_20 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_21 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_22 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_23 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_24 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_25 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_26 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_27 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_28 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_29 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_30 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_31 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_32 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_33 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_34 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_35 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_36 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_37 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_38 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_39 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_40 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_41 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_42 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_43 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_44 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_45 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_46 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_47 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_48 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_49 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_50 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_51 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_52 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_53 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_54 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_55 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_56 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
i_trend_57 0 1.00 0.17 1.43 0.00 0.00 0.00 0.00 18.00 ▇▁▁▁▁
vic_margin 807 0.21 0.12 0.17 0.00 0.00 0.00 0.22 0.97 ▇▂▁▁▁

Review: why was the original model bad?

Recall the question we sought to answer last class: do politicians manipulate government policy during electoral years in an effort to gain an electoral advantage? We used a simple linear regression model to evaluate the hypotheses:

  • Null hypothesis: There is no linear relationship between whether or not it is an election year and per capita traffic fines revenue.

    \[H_0: \beta_1 = 0\]

  • Alternative hypothesis: There is some linear relationship between whether or not it is an election year and per capita traffic fines revenue.

    \[H_A: \beta_1 \neq 0\]

# fit the model
fines_elec_dummy_fit <- linear_reg() |>
  fit(vehicle_code_fines_i_p ~ elec_dummy, data = traffic_fines)

tidy(fines_elec_dummy_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     3.90       0.256   15.2    2.84e-47
2 elec_dummyYes   0.0171     0.543    0.0315 9.75e- 1
# evaluate the hypotheses
# calculate observed fit
obs_fit <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ elec_dummy) |>
  fit()

# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
# A tibble: 2 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes   0.928
2 intercept       0.928

When we checked the linearity assumptions for this model, we discovered some problems.

NoteRobustness tests

Recall the technical conditions for linear regression:

  • L: linear model
  • I: independent observations
  • N: points are normally distributed around the line
  • E: equal variability around the line for all values of the explanatory variable
# augment() allows us to extract observation-level statistics from a model object
fines_elec_dummy_aug <- augment(fines_elec_dummy_fit, new_data = traffic_fines)
fines_elec_dummy_aug
# A tibble: 1,025 × 97
   .pred .resid  year county_code county_name vehicle_code_fines
   <dbl>  <dbl> <dbl>       <dbl> <chr>                    <dbl>
 1  3.90 -0.507  2003           1 Alameda                4973766
 2  3.90 -0.107  2004           1 Alameda                5701454
 3  3.90 -0.899  2005           1 Alameda                4645572
 4  3.91 -0.639  2006           1 Alameda                5256212
 5  3.90 -0.591  2007           1 Alameda                5498544
 6  3.90 -0.388  2008           1 Alameda                6124421
 7  3.90 -0.276  2009           1 Alameda                6346925
 8  3.91 -0.713  2010           1 Alameda                5753802
 9  3.90 -1.15   2011           1 Alameda                5154709
10  3.90 -1.63   2012           1 Alameda                4409751
# ℹ 1,015 more rows
# ℹ 91 more variables: vehicle_code_fines_i_p <dbl>, elec_dummy <chr>,
#   sheriff_incumb <dbl>, pre_elec <dbl>, no_incumb <dbl>, incumb <dbl>,
#   sheriff_margin <dbl>, rep_share <dbl>, dem_share <dbl>,
#   otherparty_share <dbl>, white_share <dbl>, asian_share <dbl>,
#   black_share <dbl>, hispanic_share <dbl>, other_share <dbl>,
#   young_drivers <dbl>, density <dbl>, areain_square_miles <dbl>, …
# the linear regression model
ggplot(
  data = fines_elec_dummy_aug,
  mapping = aes(
    x = as.numeric(elec_dummy == "Yes"),
    y = vehicle_code_fines_i_p
  )
) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# distribution of the residuals
ggplot(data = fines_elec_dummy_aug, mapping = aes(x = .resid)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# use the .resid column to plot the predicted values vs. the residuals
# jitter because the explanatory variable only has 2 unique values
ggplot(data = fines_elec_dummy_aug, mapping = aes(x = .pred, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed")

  • L: linear model

    Doesn’t seem very linear in the relationship. The fact that we have a categorical variable as the predictor is not inherently a problem, but it doesn’t seem like there is a straight, monotonic relationship between the predictor and the response.

  • I: independent observations

    Absolutely no. There are many reasons why the observations are not independent. The obvious reason is that it is a time series cross-sectional (TSCS) panel structure. Each county is observed over multiple years, and the observations within a county are likely to be correlated. Alternatively, each year is observed over multiple counties, and the observations within a year are likely to be correlated.

  • N: points are normally distributed around the line

    No. The boxplot earlier shows there are many outliers in the data. The residuals are not normally distributed around the line.

  • E: equal variability around the line for all values of the explanatory variable

    No. The residuals are not equally variable around the line for all values of the explanatory variable. The residuals are more variable for counties in an election year than for counties not in an election year.

    fines_elec_dummy_aug |>
      group_by(elec_dummy) |>
      summarize(var = var(.resid))
    # A tibble: 2 × 2
      elec_dummy   var
      <chr>      <dbl>
    1 No          48.9
    2 Yes         64.2

Today we will build a series of more robust models to better evaluate the hypotheses.

Estimate a multiple variables model

One possible problem with our model yesterday was omitted variable bias. Perhaps we omitted a crucial variable that explains the differences in per capita traffic fine revenue. Since median household income seemed to be predictive of the outcome, let’s estimate a model that includes both whether or not it is an election year and median household income.

Demo: Estimate a model that includes both elec_dummy and med_inc.

fines_inc_elec_fit <- linear_reg() |>
  fit(vehicle_code_fines_i_p ~ med_inc + elec_dummy, data = traffic_fines)

tidy(fines_inc_elec_fit)
# A tibble: 3 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    6.93      0.770        8.99   1.16e-18
2 med_inc       -0.0000546 0.0000131   -4.17   3.37e- 5
3 elec_dummyYes  0.0400    0.539        0.0743 9.41e- 1

Your turn: Interpret the coefficients of the model above.

  • Intercept: Add response here.

  • Coefficient for median household income:

  • Coefficient for whether or not it is an election year:

Hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

  • Median household income

    • Null hypothesis: There is no linear relationship between median household income and per capita traffic fines revenue.

      \[H_0: \beta_1 = 0\]

    • Alternative hypothesis: There is some linear relationship between median household income and per capita traffic fines revenue.

      \[H_A: \beta_1 \neq 0\]

  • Whether or not it is an election year

    • Null hypothesis: There is no linear relationship between whether or not it is an election year and per capita traffic fines revenue.

      \[H_0: \beta_2 = 0\]

    • Alternative hypothesis: There is some linear relationship between whether or not it is an election year and per capita traffic fines revenue.

      \[H_A: \beta_2 \neq 0\]

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ med_inc + elec_dummy) |>
  fit()

# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(vehicle_code_fines_i_p ~ med_inc + elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()

# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 3 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes   0.842
2 intercept       0    
3 med_inc         0    

Your turn: Interpret the results of the hypothesis test. Use a significance level of 5%.

  • With respect to median household income
  • With respect to whether or not it is an election year

Transform a variable

Let’s examine the per capita traffic fines revenue.

Your turn: Run the code chunk below and create two separate plots. How are the two plots different than each other? Which plot’s better depicts an unskewed variable?

# Plot A
ggplot(
  data = traffic_fines,
  mapping = aes(x = vehicle_code_fines_i_p)
) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# Plot B
ggplot(
  data = traffic_fines,
  mapping = aes(x = vehicle_code_fines_i_p)
) +
  geom_histogram() +
  scale_x_log10()
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 87 rows containing non-finite outside the scale range
(`stat_bin()`).

Fit the model

Demo: Fit a linear regression model with the transformed outcome variable.

# calculate natural-log transformed outcome
traffic_fines <- traffic_fines |>
  mutate(
    log_vehicle_code_fines_i_p = log(vehicle_code_fines_i_p),
    # replace -Inf with NA - these were counties with $0 in traffic fine revenue
    log_vehicle_code_fines_i_p = if_else(
      log_vehicle_code_fines_i_p == -Inf,
      NA,
      log_vehicle_code_fines_i_p
    )
  )

fines_elec_log_fit <- linear_reg() |>
  fit(log_vehicle_code_fines_i_p ~ elec_dummy, data = traffic_fines)

tidy(fines_elec_log_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      0.544    0.0657     8.28  4.34e-16
2 elec_dummyYes   -0.117    0.139     -0.845 3.98e- 1

Your turn: Interpret the coefficients of the model above.

  • Intercept: Add response here.

  • Slope: Add response here.

Conduct a hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy) |>
  fit()
Warning: Removed 87 rows containing missing values.
# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()
Warning: Removed 87 rows containing missing values.
# visualize and calculate p-value
visualize(null_dist) +
  shade_p_value(obs_fit, direction = "both")

get_p_value(null_dist, obs_fit, direction = "both")
# A tibble: 2 × 2
  term          p_value
  <chr>           <dbl>
1 elec_dummyYes    0.35
2 intercept        0.35

Your turn: Interpret the \(p\)-value in context of the data and the research question. Use a significance level of 5%.

Add response here.

What about other differences?

So far we have only estimated a model with at most two explanatory variables. Yet there are many other variables in the data set that could explain the variation in per capita traffic fines revenue.

Recall that the data set is a panel structure - we observe multiple counties over multiple years. This violates the linear model assumption of independence of the observations. Consider the most obvious example - traffic fines revenue within a single county is likely to be correlated across years (e.g. knowing how much revenue was generated in Los Angeles County this year allows me to make a better guess for next year’s revenue).

Fixed effects

One way we can account for this by estimating a fixed effects model. This model includes a separate coefficient for each county, which allows us to account for the correlation in the data.

Demo: Let’s estimate a model that includes elec_dummy and a fixed effect for each county. We’ll also use the log-transformed outcome of interest.

fines_fe_fit <- linear_reg() |>
  fit(
    log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name),
    data = traffic_fines
  )

tidy(fines_fe_fit)
# A tibble: 57 × 5
   term                            estimate std.error statistic  p.value
   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                       0.963     0.214     4.50   7.65e- 6
 2 elec_dummyYes                    -0.0981    0.0708   -1.38   1.67e- 1
 3 factor(county_name)Alpine         1.07      0.385     2.77   5.69e- 3
 4 factor(county_name)Amador        -1.11      0.323    -3.44   6.03e- 4
 5 factor(county_name)Butte         -0.832     0.458    -1.82   6.95e- 2
 6 factor(county_name)Calaveras      0.402     0.302     1.33   1.84e- 1
 7 factor(county_name)Colusa         0.0297    0.302     0.0984 9.22e- 1
 8 factor(county_name)Contra Costa  -0.471     0.302    -1.56   1.19e- 1
 9 factor(county_name)Del Norte     -5.36      0.370   -14.5    7.97e-43
10 factor(county_name)El Dorado     -0.177     0.302    -0.585  5.58e- 1
# ℹ 47 more rows

Your turn: Interpret the coefficient for elec_dummy.

Add response here.

Conduct a hypothesis test

Demo: Use permutation-based methods to conduct the hypothesis test.

# calculate observed fit
obs_fit <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name)) |>
  fit()
Warning: Removed 87 rows containing missing values.
# generate permuted null distribution
null_dist <- traffic_fines |>
  specify(log_vehicle_code_fines_i_p ~ elec_dummy + factor(county_name)) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  fit()
Warning: Removed 87 rows containing missing values.
# calculate p-value
get_p_value(null_dist, obs_fit, direction = "both")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
# A tibble: 57 × 2
   term                            p_value
   <chr>                             <dbl>
 1 elec_dummyYes                     0.476
 2 factor(county_name)Alpine         0.158
 3 factor(county_name)Amador         0.072
 4 factor(county_name)Butte          0.298
 5 factor(county_name)Calaveras      0.496
 6 factor(county_name)Colusa         0.944
 7 factor(county_name)Contra Costa   0.402
 8 factor(county_name)Del Norte      0    
 9 factor(county_name)El Dorado      0.756
10 factor(county_name)Fresno         0.762
# ℹ 47 more rows

Your turn: Interpret the results of the hypothesis test in context of the data and the research question. Use a significance level of 5%.

Add response here.

Include everything

Finally, let’s estimate a model that includes all the variables in the data set.

# keep all variables used in original analysis - table 1 column 2
tf_replication <- traffic_fines |>
  select(
    log_vehicle_code_fines_i_p,
    elec_dummy,
    dem_share,
    otherparty_share,
    asian_share,
    black_share,
    hispanic_share,
    other_share,
    young_drivers,
    density,
    med_inc,
    unemp,
    own_source_share,
    emp_goods,
    emp_service,
    pay_goods_i,
    pay_service_i,
    arte_share,
    collect_share,
    cnty_le_sworn_1000p,
    felony_tot_1000p,
    misdemeanor_tot_1000p,
    county_name,
    # separate columns for each county to include the year trend
    starts_with("i_trend")
  )

fines_all_fit <- linear_reg() |>
  # use all remaining columns as predictors
  fit(log_vehicle_code_fines_i_p ~ ., data = tf_replication)

tidy(fines_all_fit)
# A tibble: 134 × 5
   term              estimate std.error statistic      p.value
   <chr>                <dbl>     <dbl>     <dbl>        <dbl>
 1 (Intercept)       3.13      10.7        0.294  0.769       
 2 elec_dummyYes    -0.0895     0.0505    -1.77   0.0767      
 3 dem_share        -0.0275     0.0128    -2.14   0.0327      
 4 otherparty_share -0.0285     0.0165    -1.73   0.0836      
 5 asian_share      -0.0298     0.182     -0.164  0.870       
 6 black_share      -1.05       0.346     -3.03   0.00252     
 7 hispanic_share    0.450      0.0795     5.66   0.0000000215
 8 other_share       1.30       0.414      3.14   0.00175     
 9 young_drivers     0.124      0.0620     2.00   0.0459      
10 density           0.000306   0.00448    0.0683 0.946       
# ℹ 124 more rows

Your turn: Interpret the coefficient for elec_dummy.

Add response here.

Conduct a hypothesis test

At this point it seems reasonable that we have met the assumptions of the linear model. Let’s conduct a hypothesis test to evaluate the relationship between whether or not it is an election year and per capita traffic fines revenue. Instead of relying on the permutation-based approach, we will use the standard \(t\)-test that was reported when we estimated the model above.

Your turn: Interpret the results of the hypothesis test in context of the data and the research question. Use a significance level of 5%.

Add response here.

Comparing across models

Let’s compare the models we estimated today. Remember that we are interested in the model simplest best model.

glance(fines_elec_dummy_fit)
# A tibble: 1 × 12
    r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
        <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1 0.000000968     -0.000977  7.23  0.000990   0.975     1 -3481. 6969. 6983.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_inc_elec_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0167        0.0148  7.17      8.67 0.000184     2 -3473. 6953. 6973.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_elec_log_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.000763     -0.000305  1.77     0.715   0.398     1 -1867. 3739. 3754.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_fe_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.754         0.739 0.906      48.3 7.62e-229    56 -1209. 2533. 2814.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(fines_all_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.896         0.878 0.610      49.5 4.38e-297   132  -750. 1769. 2411.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Your turn: Which model do you believe is most appropriate? Why?

Add response here.

Footnotes

  1. The codebook is available from Dataverse. The data set has been lightly cleaned for the application exercise.↩︎