Chapter 8 Regression Analysis

Regression analysis is the workhorse of empirical economics. It allows us to estimate relationships between variables, test hypotheses, and make predictions. This chapter introduces regression modeling in R, starting with ordinary least squares (OLS) for continuous outcomes, then moving to binary response models (logit and probit) for yes/no outcomes, and finally count data models (Poisson) for discrete counts.

library(tidyverse)

# Load all datasets
macro <- read_csv("data/us_macrodata.csv")
states <- read_csv("data/acs_state.csv")
pums <- read_csv("data/acs_pums.csv")
cafe <- read_csv("data/cafe_sales.csv")

8.1 Linear Regression with lm()

The lm() function fits linear models using ordinary least squares. The basic syntax is lm(y ~ x, data = df), where y is the dependent variable and x is the independent variable.

8.1.1 Simple Linear Regression

Let’s start with a simple regression examining how education levels relate to median household income across states.

# Simple linear regression: income as a function of education
model1 <- lm(median_household_income ~ bachelors_or_higher, data = states)

# View the results
summary(model1)
## 
## Call:
## lm(formula = median_household_income ~ bachelors_or_higher, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40859  -3531   -361   3573  19441 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            18807       6583   2.857  0.00622 ** 
## bachelors_or_higher   162940      19130   8.518 2.66e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9149 on 50 degrees of freedom
## Multiple R-squared:  0.592,  Adjusted R-squared:  0.5838 
## F-statistic: 72.55 on 1 and 50 DF,  p-value: 2.66e-11

The summary() output contains several important pieces of information:

  • Coefficients: The intercept and slope estimates with their standard errors, t-statistics, and p-values
  • R-squared: The proportion of variance in the dependent variable explained by the model
  • F-statistic: Tests whether the model as a whole is statistically significant

The coefficient on bachelors_or_higher is large because the variable is measured as a proportion, so a one-unit increase means going from 0% to 100%. A one-percentage-point increase (0.01) is therefore associated with approximately $1,630 (0.01 × 162,940) higher median household income.
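
A quick way to verify this arithmetic is to scale the estimated slope directly:

# Income change associated with a one-percentage-point (0.01) increase
coef(model1)["bachelors_or_higher"] * 0.01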

8.1.2 Extracting Model Components

R stores regression results as objects containing many useful components.

# Extract coefficients
coef(model1)
##         (Intercept) bachelors_or_higher 
##            18806.87           162940.12
# Extract fitted values (predicted y)
head(fitted(model1))
##        1        2        3        4        5        6 
## 63140.25 68908.16 70619.99 59062.12 77244.37 90061.73
# Extract residuals (y - predicted y)
head(residuals(model1))
##         1         2         3         4         5         6 
## -3531.249 17461.838  1961.008 -2727.124 14660.631 -2463.732
# Get confidence intervals for coefficients
confint(model1)
##                          2.5 %    97.5 %
## (Intercept)           5584.345  32029.39
## bachelors_or_higher 124517.006 201363.23
# Get just the R-squared
summary(model1)$r.squared
## [1] 0.5920048

8.1.3 Visualizing the Regression Line

ggplot(states, aes(x = bachelors_or_higher, y = median_household_income)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Education and Income Across U.S. States",
    x = "Percent with Bachelor's Degree or Higher",
    y = "Median Household Income"
  ) +
  theme_minimal()

8.2 Multiple Linear Regression

Most economic questions involve multiple explanatory variables. The formula syntax in R uses + to add variables.

# Multiple regression: income as a function of education, unemployment, and homeownership
model2 <- lm(median_household_income ~ bachelors_or_higher + unemployment_rate +
               homeownership_rate, data = states)

summary(model2)
## 
## Call:
## lm(formula = median_household_income ~ bachelors_or_higher + 
##     unemployment_rate + homeownership_rate, data = states)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17400.5  -4300.6   -624.6   3947.8  19605.7 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            70505      23678   2.978  0.00454 ** 
## bachelors_or_higher   142070      20292   7.001 7.32e-09 ***
## unemployment_rate    -325676      77262  -4.215  0.00011 ***
## homeownership_rate    -42013      25937  -1.620  0.11183    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7977 on 48 degrees of freedom
## Multiple R-squared:  0.7022, Adjusted R-squared:  0.6836 
## F-statistic: 37.74 on 3 and 48 DF,  p-value: 1.118e-12

8.2.1 Formula Syntax

R’s formula syntax is flexible and powerful:

# Main effects only
y ~ x1 + x2 + x3

# Interaction between x1 and x2
y ~ x1 + x2 + x1:x2

# Main effects plus interaction (shorthand)
y ~ x1 * x2

# All pairwise interactions
y ~ (x1 + x2 + x3)^2

# Polynomial terms
y ~ x + I(x^2)

# Log transformation
y ~ log(x)

# Exclude the intercept
y ~ x - 1
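
To see exactly which columns a formula generates, inspect the design matrix with model.matrix(). A quick illustration using the state data loaded earlier:

# Design matrix implied by an interaction formula (first few states)
model.matrix(~ bachelors_or_higher * unemployment_rate, data = head(states))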

8.2.2 Example: The Mincer Wage Equation

The Mincer equation is a classic specification in labor economics that relates wages to education and experience. Let’s estimate it using the PUMS data.

# Prepare the data
pums_workers <- pums %>%
  filter(wage_income > 0, !is.na(education), age >= 18, age <= 65) %>%
  mutate(
    log_wage = log(wage_income),
    # Create years of education from categories
    years_education = case_when(
      education == "Less than HS" ~ 10,
      education == "High school" ~ 12,
      education == "Some college" ~ 14,
      education == "Bachelors" ~ 16,
      education == "Masters" ~ 18,
      education == "Professional/Doctorate" ~ 20
    ),
    # Approximate experience as age - education - 6
    experience = age - years_education - 6,
    experience_sq = experience^2
  )

# Estimate the Mincer equation
mincer <- lm(log_wage ~ years_education + experience + experience_sq,
             data = pums_workers)

summary(mincer)
## 
## Call:
## lm(formula = log_wage ~ years_education + experience + experience_sq, 
##     data = pums_workers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8923 -0.4142  0.1706  0.6404  4.2614 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.048e+00  6.535e-02  107.85   <2e-16 ***
## years_education  1.738e-01  4.223e-03   41.15   <2e-16 ***
## experience       9.640e-02  2.811e-03   34.30   <2e-16 ***
## experience_sq   -1.688e-03  6.339e-05  -26.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.063 on 10759 degrees of freedom
## Multiple R-squared:  0.2477, Adjusted R-squared:  0.2475 
## F-statistic:  1181 on 3 and 10759 DF,  p-value: < 2.2e-16

Interpretation:

  • The coefficient on years_education is approximately the proportional increase in wages for each additional year of education, about 17% per year here (more precisely, exp(0.174) − 1 ≈ 19%)
  • The coefficients on experience and experience-squared capture the concave experience-earnings profile: wages rise with experience at a decreasing rate (see the short calculation below)
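
A useful implied quantity is the experience level at which predicted log wages peak. For a quadratic profile this is −(coefficient on experience) / (2 × coefficient on experience_sq):

# Experience level at which predicted log wages peak
b <- coef(mincer)
-b["experience"] / (2 * b["experience_sq"])   # about 28-29 years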

8.2.3 Categorical Variables (Factor Variables)

R automatically creates dummy variables for factors. Let’s include education categories and sex in the wage equation.

# Using categorical education
pums_workers <- pums_workers %>%
  mutate(education = factor(education, levels = c("Less than HS",
                                                   "High school",
                                                   "Some college",
                                                   "Bachelors",
                                                   "Masters",
                                                   "Professional/Doctorate")))

model_categorical <- lm(log_wage ~ education + experience + experience_sq + sex,
                        data = pums_workers)

summary(model_categorical)
## 
## Call:
## lm(formula = log_wage ~ education + experience + experience_sq + 
##     sex, data = pums_workers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8378 -0.3987  0.1763  0.6254  4.2056 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      8.625e+00  4.574e-02  188.58  < 2e-16 ***
## educationHigh school             3.405e-01  4.203e-02    8.10 6.09e-16 ***
## educationSome college            5.352e-01  4.104e-02   13.04  < 2e-16 ***
## educationBachelors               1.163e+00  4.195e-02   27.73  < 2e-16 ***
## educationMasters                 1.392e+00  4.789e-02   29.06  < 2e-16 ***
## educationProfessional/Doctorate  1.633e+00  6.184e-02   26.41  < 2e-16 ***
## experience                       9.450e-02  2.763e-03   34.21  < 2e-16 ***
## experience_sq                   -1.641e-03  6.224e-05  -26.37  < 2e-16 ***
## sexMale                          3.717e-01  2.026e-02   18.35  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.042 on 10754 degrees of freedom
## Multiple R-squared:  0.2776, Adjusted R-squared:  0.2771 
## F-statistic: 516.7 on 8 and 10754 DF,  p-value: < 2.2e-16

The first category (“Less than HS”) is the reference group. Coefficients represent the difference in log wages relative to this group.
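
If a different baseline is more natural, change the reference level with relevel() and refit. A brief sketch, done on a copy so later models in this chapter are unaffected:

# Make "High school" the omitted reference category instead
pums_relevel <- pums_workers %>%
  mutate(education = relevel(education, ref = "High school"))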

8.3 Diagnostic Plots

Regression assumptions should be checked using diagnostic plots.

# Base R diagnostic plots
par(mfrow = c(2, 2))
plot(model2)

par(mfrow = c(1, 1))

The four plots show:

  1. Residuals vs Fitted: Should show no pattern (tests linearity and homoskedasticity)
  2. Q-Q Plot: Points should follow the diagonal line (tests normality of residuals)
  3. Scale-Location: Should show horizontal band (tests homoskedasticity)
  4. Residuals vs Leverage: Identifies influential observations
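
These visual checks can be complemented by a formal test of homoskedasticity. One option is the Breusch-Pagan test from the lmtest package (loaded again in the next section):

library(lmtest)

# Breusch-Pagan test: the null hypothesis is homoskedastic errors
bptest(model2)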

8.3.1 Creating Custom Diagnostic Plots

# Add fitted values and residuals to data
states_diag <- states %>%
  filter(!is.na(median_household_income)) %>%
  mutate(
    fitted = fitted(model2),
    residuals = residuals(model2)
  )

# Residuals vs fitted plot with ggplot
ggplot(states_diag, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

8.4 Robust Standard Errors

In the presence of heteroskedasticity, OLS coefficient estimates remain unbiased, but the conventional standard errors are invalid, distorting t-statistics and p-values. The sandwich and lmtest packages provide heteroskedasticity-robust standard errors.

library(sandwich)
library(lmtest)

# Original results
coeftest(model2)
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)            70505      23678  2.9776 0.0045435 ** 
## bachelors_or_higher   142070      20292  7.0013 7.319e-09 ***
## unemployment_rate    -325676      77262 -4.2152 0.0001096 ***
## homeownership_rate    -42013      25937 -1.6198 0.1118263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Heteroskedasticity-robust standard errors (HC1, similar to Stata)
coeftest(model2, vcov = vcovHC(model2, type = "HC1"))
## 
## t test of coefficients:
## 
##                     Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)            70505      30194  2.3351   0.02377 *  
## bachelors_or_higher   142070      24853  5.7165 6.775e-07 ***
## unemployment_rate    -325676     124635 -2.6130   0.01195 *  
## homeownership_rate    -42013      35604 -1.1800   0.24381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
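
The same robust variance matrix can also feed confidence intervals, via coefci() from lmtest:

# 95% confidence intervals based on HC1 robust standard errors
coefci(model2, vcov = vcovHC(model2, type = "HC1"))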

8.5 Binary Response Models: Logit and Probit

When the dependent variable is binary (0 or 1), a linear model can produce predicted probabilities below 0 or above 1. Logit and probit models constrain predicted probabilities to lie between 0 and 1.

8.5.1 The glm() Function

The glm() function fits generalized linear models, including logit and probit.

# Prepare employment data
pums_employment <- pums %>%
  filter(age >= 25, age <= 55, !is.na(education), !is.na(sex), !is.na(employed)) %>%
  mutate(
    employed = as.integer(employed == 1),
    # Recode education to simpler categories
    education_cat = case_when(
      education == "Less than HS" ~ "Less than HS",
      education == "High school" ~ "High school",
      education == "Some college" ~ "Some college",
      education %in% c("Bachelors", "Masters",
                       "Professional/Doctorate") ~ "Bachelor's+"
    ),
    education_cat = factor(education_cat, levels = c("Less than HS",
                                                      "High school",
                                                      "Some college",
                                                      "Bachelor's+"))
  )

# Logit model: probability of employment
logit_model <- glm(employed ~ education_cat + age + sex,
                   family = binomial(link = "logit"),
                   data = pums_employment)

summary(logit_model)
## 
## Call:
## glm(formula = employed ~ education_cat + age + sex, family = binomial(link = "logit"), 
##     data = pums_employment)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -2.657e+01  2.132e+04  -0.001    0.999
## education_catHigh school   5.410e-13  1.400e+04   0.000    1.000
## education_catSome college  1.977e-14  1.365e+04   0.000    1.000
## education_catBachelor's+   3.454e-14  1.301e+04   0.000    1.000
## age                       -3.976e-18  4.136e+02   0.000    1.000
## sexMale                    2.156e-13  7.464e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.000e+00  on 9262  degrees of freedom
## Residual deviance: 5.374e-08  on 9257  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

8.5.2 Probit Model

# Probit model (same specification)
probit_model <- glm(employed ~ education_cat + age + sex,
                    family = binomial(link = "probit"),
                    data = pums_employment)

summary(probit_model)
## 
## Call:
## glm(formula = employed ~ education_cat + age + sex, family = binomial(link = "probit"), 
##     data = pums_employment)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -6.991e+00  4.430e+03  -0.002    0.999
## education_catHigh school   2.318e-13  2.908e+03   0.000    1.000
## education_catSome college  1.238e-14  2.836e+03   0.000    1.000
## education_catBachelor's+   1.647e-14  2.702e+03   0.000    1.000
## age                       -5.071e-19  8.591e+01   0.000    1.000
## sexMale                    8.978e-14  1.551e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 9262  degrees of freedom
## Residual deviance: 2.5241e-08  on 9257  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

8.5.3 Interpreting Coefficients

Logit coefficients are on the log-odds scale. To interpret them as odds ratios, exponentiate them:

# Log-odds (original coefficients)
coef(logit_model)
##               (Intercept)  education_catHigh school education_catSome college 
##             -2.656607e+01              5.409722e-13              1.976718e-14 
##  education_catBachelor's+                       age                   sexMale 
##              3.453892e-14             -3.976407e-18              2.156117e-13
# Odds ratios
exp(coef(logit_model))
##               (Intercept)  education_catHigh school education_catSome college 
##              2.900701e-12              1.000000e+00              1.000000e+00 
##  education_catBachelor's+                       age                   sexMale 
##              1.000000e+00              1.000000e+00              1.000000e+00
# Odds ratios with confidence intervals (using Wald CIs for stability)
ci <- confint.default(logit_model)
exp(cbind(OR = coef(logit_model), ci))
##                                     OR 2.5 % 97.5 %
## (Intercept)               2.900701e-12     0    Inf
## education_catHigh school  1.000000e+00     0    Inf
## education_catSome college 1.000000e+00     0    Inf
## education_catBachelor's+  1.000000e+00     0    Inf
## age                       1.000000e+00     0    Inf
## sexMale                   1.000000e+00     0    Inf

An odds ratio greater than 1 means higher odds of employment; less than 1 means lower odds.
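
To translate a log-odds value into a probability directly, apply the logistic CDF with plogis(). A small illustration with a made-up value, not model output:

# A hypothetical log-odds of 0.5 corresponds to a probability of about 0.62
plogis(0.5)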

8.5.4 Marginal Effects

For more intuitive interpretation, calculate marginal effects (the change in probability for a unit change in x). The margins package makes this easy.

library(margins)

# Average marginal effects
margins_logit <- margins(logit_model)
summary(margins_logit)
##                     factor    AME     SE      z      p   lower  upper
##                        age 0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000
##   education_catBachelor's+ 0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000
##   education_catHigh school 0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000
##  education_catSome college 0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000
##                    sexMale 0.0000 0.0000 0.0000 1.0000 -0.0000 0.0000

The marginal effect tells us the percentage point change in the probability of employment for a one-unit change in the explanatory variable.

8.5.5 Predicted Probabilities

# Create prediction data
new_data <- expand.grid(
  education_cat = levels(pums_employment$education_cat),
  age = 40,
  sex = c("Male", "Female")
)

# Predict probabilities
new_data$predicted_prob <- predict(logit_model, newdata = new_data, type = "response")

# Display
new_data %>%
  pivot_wider(names_from = sex, values_from = predicted_prob) %>%
  mutate(across(c(Male, Female), ~scales::percent(.x, accuracy = 0.1)))
## # A tibble: 4 × 4
##   education_cat   age Male  Female
##   <fct>         <dbl> <chr> <chr> 
## 1 Less than HS     40 0.0%  0.0%  
## 2 High school      40 0.0%  0.0%  
## 3 Some college     40 0.0%  0.0%  
## 4 Bachelor's+      40 0.0%  0.0%

8.5.6 Visualizing Predicted Probabilities

# Predict across age range
pred_data <- expand.grid(
  education_cat = c("High school", "Bachelor's+"),
  age = seq(25, 55, by = 1),
  sex = "Female"
)

pred_data$prob <- predict(logit_model, newdata = pred_data, type = "response")

ggplot(pred_data, aes(x = age, y = prob, color = education_cat)) +
  geom_line(linewidth = 1) +
  scale_y_continuous(labels = scales::percent, limits = c(0.5, 1)) +
  labs(
    title = "Predicted Probability of Employment by Age and Education",
    subtitle = "Female workers",
    x = "Age",
    y = "Probability of Employment",
    color = "Education"
  ) +
  theme_minimal()

8.5.7 Model Fit for Binary Models

# Log-likelihood
logLik(logit_model)
## 'log Lik.' -2.68692e-08 (df=6)
# AIC and BIC (lower is better)
AIC(logit_model)
## [1] 12
BIC(logit_model)
## [1] 54.8027
# Compare logit and probit
AIC(logit_model, probit_model)
##              df AIC
## logit_model   6  12
## probit_model  6  12
# Pseudo R-squared (McFadden's)
null_model <- glm(employed ~ 1, family = binomial, data = pums_employment)
1 - logLik(logit_model) / logLik(null_model)
## 'log Lik.' 8.263909e-09 (df=6)
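
Another rough fit measure is the share of observations correctly classified at a 0.5 probability cutoff. A simple sketch (the cutoff choice matters in unbalanced samples):

# Share of observations correctly classified at a 0.5 threshold
pred_class <- as.integer(fitted(logit_model) > 0.5)
mean(pred_class == pums_employment$employed)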

8.6 Count Data Models: Poisson Regression

When the dependent variable is a count (0, 1, 2, …), Poisson regression is appropriate. The model assumes the count follows a Poisson distribution whose mean depends on the explanatory variables through a log link: E[y | x] = exp(β₀ + β₁x₁ + ⋯ + βₖxₖ).

8.6.1 Preparing Count Data

Let’s analyze daily transaction counts at the cafe.

# Aggregate to daily counts
cafe_daily <- cafe %>%
  group_by(date, year, month, day_of_week) %>%
  summarize(
    transactions = n(),
    revenue = sum(price),
    .groups = "drop"
  ) %>%
  mutate(
    weekend = day_of_week %in% c("Sat", "Sun"),
    covid_period = date >= "2020-03-01" & date <= "2020-12-31",
    day_of_week = factor(day_of_week,
                         levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
  )

head(cafe_daily)
## # A tibble: 6 × 8
##   date        year month day_of_week transactions revenue weekend covid_period
##   <date>     <dbl> <dbl> <fct>              <int>   <dbl> <lgl>   <lgl>       
## 1 2019-01-01  2019     1 Tue                  307   1169  FALSE   FALSE       
## 2 2019-01-02  2019     1 Wed                  296   1122. FALSE   FALSE       
## 3 2019-01-03  2019     1 Thu                  257    964. FALSE   FALSE       
## 4 2019-01-04  2019     1 Fri                  240    894. FALSE   FALSE       
## 5 2019-01-05  2019     1 Sat                  368   1388. TRUE    FALSE       
## 6 2019-01-06  2019     1 Sun                  379   1473. TRUE    FALSE
# Summary of daily transactions
summary(cafe_daily$transactions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    53.0   235.2   269.0   268.3   303.0   469.0
# Distribution of counts
ggplot(cafe_daily, aes(x = transactions)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Daily Transaction Counts",
       x = "Number of Transactions", y = "Frequency") +
  theme_minimal()

8.6.2 Fitting a Poisson Model

# Poisson regression
poisson_model <- glm(transactions ~ day_of_week + year + covid_period,
                     family = poisson(link = "log"),
                     data = cafe_daily)

summary(poisson_model)
## 
## Call:
## glm(formula = transactions ~ day_of_week + year + covid_period, 
##     family = poisson(link = "log"), data = cafe_daily)
## 
## Coefficients:
##                    Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)      -35.093789   2.079157  -16.879   <2e-16 ***
## day_of_weekTue    -0.007864   0.005629   -1.397   0.1624    
## day_of_weekWed    -0.007279   0.005628   -1.293   0.1959    
## day_of_weekThu    -0.005266   0.005625   -0.936   0.3492    
## day_of_weekFri    -0.011505   0.005631   -2.043   0.0411 *  
## day_of_weekSat     0.316169   0.005222   60.547   <2e-16 ***
## day_of_weekSun     0.318675   0.005221   61.037   <2e-16 ***
## year               0.020121   0.001029   19.561   <2e-16 ***
## covid_periodTRUE  -0.598633   0.004986 -120.066   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 40698.3  on 1825  degrees of freedom
## Residual deviance:  9097.6  on 1817  degrees of freedom
## AIC: 22596
## 
## Number of Fisher Scoring iterations: 4

8.6.3 Interpreting Poisson Coefficients

Poisson coefficients represent the change in the log of expected counts. Exponentiating gives the multiplicative effect (incidence rate ratio).

# Coefficients (log scale)
coef(poisson_model)
##      (Intercept)   day_of_weekTue   day_of_weekWed   day_of_weekThu 
##    -35.093789108     -0.007864462     -0.007278799     -0.005266263 
##   day_of_weekFri   day_of_weekSat   day_of_weekSun             year 
##     -0.011504841      0.316169000      0.318675297      0.020121263 
## covid_periodTRUE 
##     -0.598632580
# Incidence rate ratios
exp(coef(poisson_model))
##      (Intercept)   day_of_weekTue   day_of_weekWed   day_of_weekThu 
##     5.740650e-16     9.921664e-01     9.927476e-01     9.947476e-01 
##   day_of_weekFri   day_of_weekSat   day_of_weekSun             year 
##     9.885611e-01     1.371862e+00     1.375305e+00     1.020325e+00 
## covid_periodTRUE 
##     5.495626e-01
# With confidence intervals
exp(cbind(IRR = coef(poisson_model), confint(poisson_model)))
##                           IRR        2.5 %       97.5 %
## (Intercept)      5.740650e-16 9.746230e-18 3.375990e-14
## day_of_weekTue   9.921664e-01 9.812804e-01 1.003173e+00
## day_of_weekWed   9.927476e-01 9.818570e-01 1.003759e+00
## day_of_weekThu   9.947476e-01 9.838405e-01 1.005776e+00
## day_of_weekFri   9.885611e-01 9.777101e-01 9.995324e-01
## day_of_weekSat   1.371862e+00 1.357897e+00 1.385979e+00
## day_of_weekSun   1.375305e+00 1.361307e+00 1.389454e+00
## year             1.020325e+00 1.018270e+00 1.022385e+00
## covid_periodTRUE 5.495626e-01 5.442124e-01 5.549533e-01

An IRR of 1.15 means the expected count is 15% higher; an IRR of 0.80 means 20% lower. In this model, the Saturday IRR of about 1.37 implies roughly 37% more transactions than the Monday baseline, while the COVID-period IRR of 0.55 implies a 45% drop.

8.6.4 Predicted Counts

# Predict expected counts for each day of week
pred_days <- data.frame(
  day_of_week = factor(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                       levels = levels(cafe_daily$day_of_week)),
  year = 2023,
  covid_period = FALSE
)

pred_days$expected_count <- predict(poisson_model, newdata = pred_days, type = "response")

pred_days %>%
  select(day_of_week, expected_count) %>%
  mutate(expected_count = round(expected_count, 0))
##   day_of_week expected_count
## 1         Mon            274
## 2         Tue            271
## 3         Wed            272
## 4         Thu            272
## 5         Fri            270
## 6         Sat            375
## 7         Sun            376

8.6.5 Overdispersion

A key assumption of Poisson regression is that the mean equals the variance. When the variance exceeds the mean (overdispersion), standard errors are underestimated. Check for this:

# Dispersion statistic (should be close to 1 for Poisson)
dispersion <- sum(residuals(poisson_model, type = "pearson")^2) / poisson_model$df.residual
dispersion
## [1] 4.871176
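
If you prefer a formal test to an informal comparison with 1, the AER package (assuming it is installed) implements the Cameron-Trivedi test:

library(AER)

# Test of H0: equidispersion against the alternative of overdispersion
dispersiontest(poisson_model)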

If dispersion is substantially greater than 1, consider using quasi-Poisson or negative binomial regression:

# Quasi-Poisson (adjusts standard errors for overdispersion)
quasi_model <- glm(transactions ~ day_of_week + year + covid_period,
                   family = quasipoisson(link = "log"),
                   data = cafe_daily)

summary(quasi_model)
## 
## Call:
## glm(formula = transactions ~ day_of_week + year + covid_period, 
##     family = quasipoisson(link = "log"), data = cafe_daily)
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -35.093789   4.588854  -7.648  3.3e-14 ***
## day_of_weekTue    -0.007864   0.012424  -0.633    0.527    
## day_of_weekWed    -0.007279   0.012421  -0.586    0.558    
## day_of_weekThu    -0.005266   0.012415  -0.424    0.671    
## day_of_weekFri    -0.011505   0.012429  -0.926    0.355    
## day_of_weekSat     0.316169   0.011525  27.433  < 2e-16 ***
## day_of_weekSun     0.318675   0.011523  27.655  < 2e-16 ***
## year               0.020121   0.002270   8.863  < 2e-16 ***
## covid_periodTRUE  -0.598633   0.011004 -54.400  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 4.871179)
## 
##     Null deviance: 40698.3  on 1825  degrees of freedom
## Residual deviance:  9097.6  on 1817  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

8.6.6 Negative Binomial Regression

For more severe overdispersion, negative binomial regression is preferred. It requires the MASS package. (Note that loading MASS masks dplyr’s select(), which is why dplyr::select() is called explicitly later in this chapter.)

library(MASS)

# Negative binomial regression
negbin_model <- glm.nb(transactions ~ day_of_week + year + covid_period,
                       data = cafe_daily)

summary(negbin_model)
## 
## Call:
## glm.nb(formula = transactions ~ day_of_week + year + covid_period, 
##     data = cafe_daily, init.theta = 52.9606772, link = log)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -34.101470   5.281566  -6.457 1.07e-10 ***
## day_of_weekTue    -0.007332   0.013340  -0.550    0.583    
## day_of_weekWed    -0.008286   0.013341  -0.621    0.535    
## day_of_weekThu    -0.004623   0.013339  -0.347    0.729    
## day_of_weekFri    -0.012644   0.013342  -0.948    0.343    
## day_of_weekSat     0.315146   0.013169  23.932  < 2e-16 ***
## day_of_weekSun     0.317045   0.013169  24.075  < 2e-16 ***
## year               0.019631   0.002613   7.512 5.80e-14 ***
## covid_periodTRUE  -0.598171   0.010377 -57.646  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(52.9607) family taken to be 1)
## 
##     Null deviance: 7529.7  on 1825  degrees of freedom
## Residual deviance: 1983.2  on 1817  degrees of freedom
## AIC: 18723
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  52.96 
##           Std. Err.:  2.29 
## 
##  2 x log-likelihood:  -18702.62

8.7 Comparing Models

8.7.1 Nested Model Comparison with Likelihood Ratio Tests

# Restricted model (no day of week effects)
poisson_restricted <- glm(transactions ~ year + covid_period,
                          family = poisson(link = "log"),
                          data = cafe_daily)

# Likelihood ratio test
anova(poisson_restricted, poisson_model, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: transactions ~ year + covid_period
## Model 2: transactions ~ day_of_week + year + covid_period
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1823    20486.6                          
## 2      1817     9097.6  6    11389 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.7.2 Information Criteria

AIC and BIC penalize model complexity, so lower values indicate a better fit-parsimony trade-off. Quasi-Poisson models have no true likelihood, so their AIC is undefined and reported as NA.

# Compare models using AIC
AIC(poisson_model, quasi_model, negbin_model)
##               df      AIC
## poisson_model  9 22595.56
## quasi_model   10       NA
## negbin_model  10 18722.62
# BIC comparison
BIC(poisson_model, negbin_model)
##               df      BIC
## poisson_model  9 22645.15
## negbin_model  10 18777.72

8.8 Presenting Regression Results

8.8.1 Creating Regression Tables

The broom package converts regression output to tidy data frames.

library(broom)

# Tidy coefficients table
tidy(model2)
## # A tibble: 4 × 5
##   term                estimate std.error statistic       p.value
##   <chr>                  <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)           70505.    23678.      2.98 0.00454      
## 2 bachelors_or_higher  142070.    20292.      7.00 0.00000000732
## 3 unemployment_rate   -325676.    77262.     -4.22 0.000110     
## 4 homeownership_rate   -42013.    25937.     -1.62 0.112
# Model summary statistics
glance(model2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.702         0.684 7977.      37.7 1.12e-12     3  -539. 1088. 1098.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Observation-level statistics (fitted values, residuals, etc.)
head(augment(model2))
## # A tibble: 6 × 10
##   median_household_income bachelors_or_higher unemployment_rate
##                     <dbl>               <dbl>             <dbl>
## 1                   59609               0.272            0.0512
## 2                   86370               0.307            0.0601
## 3                   72581               0.318            0.0533
## 4                   56335               0.247            0.0513
## 5                   91905               0.359            0.0636
## 6                   87598               0.437            0.0448
## # ℹ 7 more variables: homeownership_rate <dbl>, .fitted <dbl>, .resid <dbl>,
## #   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>

8.8.2 Publication-Quality Tables with gt

library(gt)

# Create a nice coefficient table
tidy(model2, conf.int = TRUE) %>%
  mutate(
    term = case_when(
      term == "(Intercept)" ~ "Constant",
      term == "bachelors_or_higher" ~ "Bachelor's Degree Rate",
      term == "unemployment_rate" ~ "Unemployment Rate",
      term == "homeownership_rate" ~ "Homeownership Rate"
    )
  ) %>%
  gt() %>%
  tab_header(
    title = "Determinants of State Median Household Income",
    subtitle = "OLS Regression Results"
  ) %>%
  fmt_number(columns = c(estimate, std.error, statistic, conf.low, conf.high),
             decimals = 2) %>%
  fmt_number(columns = p.value, decimals = 4) %>%
  cols_label(
    term = "Variable",
    estimate = "Coefficient",
    std.error = "Std. Error",
    statistic = "t-statistic",
    p.value = "p-value",
    conf.low = "95% CI Low",
    conf.high = "95% CI High"
  )
Determinants of State Median Household Income
OLS Regression Results

Variable                  Coefficient  Std. Error  t-statistic  p-value    95% CI Low   95% CI High
Constant                    70,504.77   23,678.50         2.98   0.0045     22,895.96    118,113.59
Bachelor's Degree Rate     142,070.15   20,291.97         7.00   0.0000    101,270.41    182,869.89
Unemployment Rate         −325,676.29   77,261.91        −4.22   0.0001   −481,021.78   −170,330.80
Homeownership Rate         −42,012.77   25,936.91        −1.62   0.1118    −94,162.42     10,136.89

8.8.3 Comparing Multiple Models

# Multiple models for comparison
model_a <- lm(median_household_income ~ bachelors_or_higher, data = states)
model_b <- lm(median_household_income ~ bachelors_or_higher + unemployment_rate, data = states)
model_c <- lm(median_household_income ~ bachelors_or_higher + unemployment_rate +
                homeownership_rate, data = states)

# Combine results
comparison_df <- bind_rows(
  tidy(model_a) %>% mutate(model = "Model 1"),
  tidy(model_b) %>% mutate(model = "Model 2"),
  tidy(model_c) %>% mutate(model = "Model 3")
) %>%
  filter(term != "(Intercept)")

comparison_df %>%
  dplyr::select(model, term, estimate, std.error, p.value) %>%
  pivot_wider(
    names_from = model,
    values_from = c(estimate, std.error, p.value)
  )
## # A tibble: 3 × 10
##   term                `estimate_Model 1` `estimate_Model 2` `estimate_Model 3`
##   <chr>                            <dbl>              <dbl>              <dbl>
## 1 bachelors_or_higher            162940.            160769.            142070.
## 2 unemployment_rate                  NA            -279394.           -325676.
## 3 homeownership_rate                 NA                 NA             -42013.
## # ℹ 6 more variables: `std.error_Model 1` <dbl>, `std.error_Model 2` <dbl>,
## #   `std.error_Model 3` <dbl>, `p.value_Model 1` <dbl>,
## #   `p.value_Model 2` <dbl>, `p.value_Model 3` <dbl>

8.9 Example: Phillips Curve Estimation

The Phillips curve describes the relationship between unemployment and inflation. Let’s estimate it using the macro data.

# Prepare data
macro_phillips <- macro %>%
  filter(year >= 1960) %>%
  mutate(
    inflation_pct = inflation * 100,
    unemployment_pct = unemployment * 100
  )

# Simple Phillips curve
phillips1 <- lm(inflation_pct ~ unemployment_pct, data = macro_phillips)
summary(phillips1)
## 
## Call:
## lm(formula = inflation_pct ~ unemployment_pct, data = macro_phillips)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3067 -1.5412 -0.5412  0.6502 10.0760 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        2.5843     1.3115   1.971   0.0532 .
## unemployment_pct   0.1914     0.2100   0.911   0.3657  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.738 on 63 degrees of freedom
## Multiple R-squared:  0.01301,    Adjusted R-squared:  -0.00266 
## F-statistic: 0.8302 on 1 and 63 DF,  p-value: 0.3657
# Plot
ggplot(macro_phillips, aes(x = unemployment_pct, y = inflation_pct)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "The Phillips Curve (1960-Present)",
    x = "Unemployment Rate (%)",
    y = "Inflation Rate (%)"
  ) +
  theme_minimal()

Contrary to the classic downward-sloping Phillips curve, the full-sample slope here is slightly positive and statistically insignificant: essentially no relationship. Let’s explore whether the relationship has changed over time.

# Phillips curve by era
macro_phillips <- macro_phillips %>%
  mutate(era = case_when(
    year < 1980 ~ "1960-1979",
    year < 2000 ~ "1980-1999",
    TRUE ~ "2000-Present"
  ))

ggplot(macro_phillips, aes(x = unemployment_pct, y = inflation_pct, color = era)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "The Phillips Curve by Era",
    x = "Unemployment Rate (%)",
    y = "Inflation Rate (%)",
    color = "Period"
  ) +
  theme_minimal()
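
One way to quantify what the plot suggests is to let the slope vary by era with an interaction term. A brief sketch using the era variable created above:

# Era-specific Phillips curve slopes
phillips_era <- lm(inflation_pct ~ unemployment_pct * era, data = macro_phillips)
summary(phillips_era)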

8.10 Example: Wage Determinants with PUMS

Let’s build a more complete wage equation including multiple factors.

# Comprehensive wage regression
wage_model <- lm(log_wage ~ education + experience + experience_sq + sex +
                   hours_worked + state,
                 data = pums_workers)

# Summary (showing just key coefficients)
tidy(wage_model) %>%
  filter(!str_starts(term, "state")) %>%
  gt() %>%
  tab_header(title = "Log Wage Regression Results") %>%
  fmt_number(columns = c(estimate, std.error, statistic), decimals = 3) %>%
  fmt_number(columns = p.value, decimals = 4)
Log Wage Regression Results

term                              estimate  std.error  statistic  p.value
(Intercept)                          7.527      0.049    154.546   0.0000
educationHigh school                 0.254      0.037      6.821   0.0000
educationSome college                0.452      0.036     12.450   0.0000
educationBachelors                   0.950      0.037     25.450   0.0000
educationMasters                     1.154      0.043     27.088   0.0000
educationProfessional/Doctorate      1.313      0.055     23.874   0.0000
experience                           0.066      0.002     26.467   0.0000
experience_sq                       −0.001      0.000    −19.732   0.0000
sexMale                              0.193      0.018     10.619   0.0000
hours_worked                         0.043      0.001     54.623   0.0000

8.11 Example: State Poverty Analysis

What factors predict poverty rates across states?

# Poverty rate determinants
poverty_model <- lm(poverty_rate ~ bachelors_or_higher + unemployment_rate +
                      median_age + homeownership_rate,
                    data = states %>% filter(state != "Puerto Rico"))

summary(poverty_model)
## 
## Call:
## lm(formula = poverty_rate ~ bachelors_or_higher + unemployment_rate + 
##     median_age + homeownership_rate, data = states %>% filter(state != 
##     "Puerto Rico"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.045224 -0.010126 -0.000937  0.010853  0.049406 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.2288309  0.0637471   3.590 0.000801 ***
## bachelors_or_higher -0.2467247  0.0485590  -5.081  6.7e-06 ***
## unemployment_rate    1.1450603  0.3034283   3.774 0.000459 ***
## median_age          -0.0004338  0.0013097  -0.331 0.742004    
## homeownership_rate  -0.0924357  0.0745499  -1.240 0.221294    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01824 on 46 degrees of freedom
## Multiple R-squared:  0.5583, Adjusted R-squared:  0.5199 
## F-statistic: 14.54 on 4 and 46 DF,  p-value: 9.52e-08
# Visualize partial relationships
states %>%
  filter(state != "Puerto Rico") %>%
  ggplot(aes(x = bachelors_or_higher, y = poverty_rate)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Education and Poverty Across States",
    x = "Percent with Bachelor's Degree",
    y = "Poverty Rate"
  ) +
  theme_minimal()

8.12 Exercises

8.12.1 Linear Regression

  1. Using the state data, regress median rent on median household income and homeownership rate. Interpret the coefficients and R-squared.

  2. Estimate a Mincer wage equation using the PUMS data that includes education, experience, experience-squared, sex, and state fixed effects. Calculate the return to an additional year of education.

  3. Using the macro data, estimate the relationship between real GDP per capita and the unemployment rate (Okun’s Law). Include a time trend. Plot the results.

  4. Test whether the relationship between education and income differs by region. Estimate a model with region dummy variables and interaction terms.

8.12.2 Binary Response Models

  1. Using the PUMS data, estimate a probit model for homeownership (tenure == “Owner”) as a function of age, household income, education, and marital status. Calculate marginal effects.

  2. Compare logit and probit models for the employment probability regression. Are the predicted probabilities similar?

  3. Create a visualization showing how the predicted probability of homeownership varies with household income for different education levels.

8.12.3 Count Data Models

  1. Using the cafe data, estimate a Poisson regression for daily transaction counts as a function of day of week, month, and year. Test for overdispersion.

  2. Compare Poisson, quasi-Poisson, and negative binomial models for the cafe transaction data. Which fits best?

  3. Using the PUMS data, model the number of children (num_children) as a function of age, education, marital status, and household income using Poisson regression.

8.12.4 Model Comparison and Diagnostics

  1. For the state income regression, create diagnostic plots and identify any outliers or influential observations. How do results change if you exclude them?

  2. Estimate three nested models predicting state median income with progressively more variables. Use likelihood ratio tests and information criteria to compare them.

  3. Calculate robust standard errors for the wage regression. How much do the standard errors change?

8.12.5 Advanced

  1. Estimate an augmented Phillips curve that includes lagged inflation (inflation expectations). Interpret the results.

  2. Create a publication-ready regression table comparing at least three model specifications for an outcome of your choice, including coefficient estimates, standard errors, and fit statistics.