Chapter 7 Summary Statistics and Tables

In an empirical economic analysis, we often want to compute various summary statistics and present them in clear, well-formatted tables. Summary statistics condense large datasets into interpretable numbers—means, medians, standard deviations, and other measures that characterize the central tendency, spread, and shape of distributions. Tables organize these statistics for presentation in reports, papers, and presentations. This chapter covers how to calculate summary statistics in R and how to create publication-quality tables.

library(tidyverse)

# Load all datasets
macro <- read_csv("data/us_macrodata.csv")
states <- read_csv("data/acs_state.csv")
pums <- read_csv("data/acs_pums.csv")
cafe <- read_csv("data/cafe_sales.csv")

7.1 Summary Statistics in Base R

R provides built-in functions for common summary statistics. These functions work on vectors and return single values.

7.1.1 Measures of Central Tendency

Central tendency describes where the “center” of a distribution lies. The mean (arithmetic average) is the most common measure, but it can be heavily influenced by outliers. The median (the middle value when data are sorted) is more robust to extreme observations, making it preferred for skewed distributions like income.

# Mean: arithmetic average
mean(macro$unemployment)
## [1] 0.06779221
# Median: middle value
median(macro$unemployment)
## [1] 0.06
# For data with missing values, use na.rm = TRUE
mean(macro$one_year_rate, na.rm = TRUE)
## [1] 4.560333

7.1.2 Measures of Spread

Spread (or dispersion) describes how variable the data are around the center. Standard deviation and variance measure average distance from the mean, while the interquartile range (IQR) measures the spread of the middle 50% of observations.

# Standard deviation
sd(macro$unemployment)
## [1] 0.03705228
# Variance (standard deviation squared)
var(macro$unemployment)
## [1] 0.001372872
# Range: minimum and maximum
range(macro$unemployment)
## [1] 0.01 0.21
# Interquartile range (75th percentile minus 25th percentile)
IQR(macro$unemployment)
## [1] 0.04
# Specific quantiles
quantile(macro$unemployment, probs = c(0.25, 0.50, 0.75))
##  25%  50%  75% 
## 0.04 0.06 0.08

7.1.3 Other Useful Functions

# Minimum and maximum
min(macro$unemployment)
## [1] 0.01
max(macro$unemployment)
## [1] 0.21
# Number of observations
length(macro$unemployment)
## [1] 154
# Number of non-missing observations
sum(!is.na(macro$one_year_rate))
## [1] 150
# Sum of values
sum(macro$population)
## [1] 25855795

7.1.4 The summary() Function

The summary() function provides a quick overview of a variable or entire data frame.

# Summary of a single variable
summary(macro$unemployment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01000 0.04000 0.06000 0.06779 0.08000 0.21000
# Summary of selected columns
macro %>%
  select(unemployment, inflation, realgdp_percap) %>%
  summary()
##   unemployment       inflation        realgdp_percap 
##  Min.   :0.01000   Min.   :-0.11000   Min.   : 3099  
##  1st Qu.:0.04000   1st Qu.: 0.00000   1st Qu.: 6317  
##  Median :0.06000   Median : 0.02000   Median :14782  
##  Mean   :0.06779   Mean   : 0.02247   Mean   :20326  
##  3rd Qu.:0.08000   3rd Qu.: 0.04000   3rd Qu.:32454  
##  Max.   :0.21000   Max.   : 0.20000   Max.   :60792

7.2 Summary Statistics with dplyr

While base R functions work on individual vectors, dplyr’s summarize() function lets you compute multiple statistics at once and organize them into a data frame. Combined with group_by(), this becomes a powerful tool for comparing statistics across groups.

7.2.1 Basic Summaries

# Multiple statistics in one call
macro %>%
  summarize(
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    median_unemployment = median(unemployment, na.rm = TRUE),
    sd_unemployment = sd(unemployment, na.rm = TRUE),
    min_unemployment = min(unemployment, na.rm = TRUE),
    max_unemployment = max(unemployment, na.rm = TRUE),
    n_observations = n()
  )
## # A tibble: 1 × 6
##   mean_unemployment median_unemployment sd_unemployment min_unemployment
##               <dbl>               <dbl>           <dbl>            <dbl>
## 1            0.0678                0.06          0.0371             0.01
## # ℹ 2 more variables: max_unemployment <dbl>, n_observations <int>

7.2.2 Grouped Summaries

Grouping allows you to calculate statistics separately for each level of a categorical variable. This is essential for comparing economic outcomes across time periods, countries, demographic groups, or policy regimes.

# Summary statistics by president
macro %>%
  filter(year >= 1950) %>%
  group_by(president) %>%
  summarize(
    years = n(),
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    mean_inflation = mean(inflation, na.rm = TRUE),
    mean_gdp_pc = mean(realgdp_percap, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_unemployment))
## # A tibble: 14 × 5
##    president  years mean_unemployment mean_inflation mean_gdp_pc
##    <chr>      <int>             <dbl>          <dbl>       <dbl>
##  1 Ford           2            0.085          0.075       25480.
##  2 Reagan         8            0.0775         0.0462      31137.
##  3 Obama          8            0.0738         0.0138      49146.
##  4 Carter         4            0.065          0.0975      28038 
##  5 GHWBush        4            0.065          0.0425      35668.
##  6 Kennedy        2            0.065          0.01        17741 
##  7 GWBush         8            0.055          0.0288      47114.
##  8 Nixon          6            0.0533         0.0583      24172 
##  9 Clinton        8            0.0525         0.0262      40167 
## 10 Eisenhower     8            0.05           0.015       16541.
## 11 Trump          4            0.05           0.0175      54066.
## 12 LJohnson       6            0.0467         0.0233      20800.
## 13 Biden          4            0.0425         0.05        58828.
## 14 Truman         3            0.0367         0.0367      15114.
# Summary by decade
macro %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s")) %>%
  group_by(decade) %>%
  summarize(
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    sd_unemployment = sd(unemployment, na.rm = TRUE),
    mean_inflation = mean(inflation, na.rm = TRUE),
    sd_inflation = sd(inflation, na.rm = TRUE)
  )
## # A tibble: 16 × 5
##    decade mean_unemployment sd_unemployment mean_inflation sd_inflation
##    <chr>              <dbl>           <dbl>          <dbl>        <dbl>
##  1 1870s              0.11          0.015          -0.0222      0.0205 
##  2 1880s              0.044         0.0126         -0.008       0.0162 
##  3 1890s              0.089         0.0345         -0.009       0.0129 
##  4 1900s              0.046         0.00843         0.008       0.0175 
##  5 1910s              0.053         0.0142          0.071       0.0758 
##  6 1920s              0.056         0.0143          0.002       0.0692 
##  7 1930s              0.167         0.0377         -0.019       0.0493 
##  8 1940s              0.053         0.0424          0.057       0.0481 
##  9 1950s              0.045         0.0143          0.021       0.0242 
## 10 1960s              0.051         0.0110          0.023       0.0142 
## 11 1970s              0.064         0.0126          0.07        0.0271 
## 12 1980s              0.074         0.0165          0.056       0.0366 
## 13 1990s              0.059         0.0120          0.03        0.00943
## 14 2000s              0.057         0.0134          0.026       0.0107 
## 15 2010s              0.062         0.0220          0.017       0.00823
## 16 2020s              0.05          0.0173          0.042       0.0259

The same approach works with any grouped data. Here are examples with our other datasets:

# Summary statistics by region (using state data)
# First, let's add region classifications
states %>%
  mutate(
    region = case_when(
      state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
                   "Rhode Island", "Vermont", "New Jersey", "New York",
                   "Pennsylvania") ~ "Northeast",
      state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan",
                   "Minnesota", "Missouri", "Nebraska", "North Dakota",
                   "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
      state %in% c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
                   "Kentucky", "Louisiana", "Maryland", "Mississippi",
                   "North Carolina", "Oklahoma", "South Carolina", "Tennessee",
                   "Texas", "Virginia", "West Virginia",
                   "District of Columbia") ~ "South",
      state %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada",
                   "New Mexico", "Utah", "Wyoming", "Alaska", "California",
                   "Hawaii", "Oregon", "Washington") ~ "West",
      TRUE ~ "Territory"
    )
  ) %>%
  filter(region != "Territory") %>%
  group_by(region) %>%
  summarize(
    n_states = n(),
    mean_income = mean(median_household_income),
    mean_poverty = mean(poverty_rate) * 100,
    mean_education = mean(bachelors_or_higher) * 100
  )
## # A tibble: 4 × 5
##   region    n_states mean_income mean_poverty mean_education
##   <chr>        <int>       <dbl>        <dbl>          <dbl>
## 1 Midwest         12      71604          11.6           32.3
## 2 Northeast        9      83653.         10.6           39.3
## 3 South           17      69203.         14.4           32.4
## 4 West            13      78960.         11.6           33.5
# Summary statistics by education level (using PUMS microdata)
pums %>%
  filter(!is.na(wage_income), wage_income > 0, !is.na(education)) %>%
  group_by(education) %>%
  summarize(
    n = n(),
    mean_wage = mean(wage_income),
    median_wage = median(wage_income),
    sd_wage = sd(wage_income)
  ) %>%
  arrange(desc(mean_wage))
## # A tibble: 6 × 5
##   education                  n mean_wage median_wage sd_wage
##   <chr>                  <int>     <dbl>       <dbl>   <dbl>
## 1 Professional/Doctorate   508   146061.      100000 151356.
## 2 Masters                 1308   101543.       80000 100578.
## 3 Bachelors               2913    84949.       60000  97290.
## 4 Some college            3352    47340.       37000  50614.
## 5 High school             2584    39886.       30000  42293.
## 6 Less than HS             910    30968.       25000  26666.
# Summary statistics by product (using cafe data)
cafe %>%
  group_by(product) %>%
  summarize(
    transactions = n(),
    total_revenue = sum(price),
    avg_price = mean(price),
    min_price = min(price),
    max_price = max(price)
  ) %>%
  arrange(desc(total_revenue))
## # A tibble: 5 × 6
##   product   transactions total_revenue avg_price min_price max_price
##   <chr>            <int>         <dbl>     <dbl>     <dbl>     <dbl>
## 1 Latte           169747       908444.      5.35      4.5       6.18
## 2 Croissant       113248       453530.      4.00      3.25      4.72
## 3 Cold Brew        72258       368071.      5.09      4.25      5.88
## 4 Espresso         85978       297890.      3.46      2.75      4.16
## 5 Chai Tea         48761       221682.      4.55      3.75      5.29

7.2.3 The across() Function

When you need to apply the same summary functions to multiple columns, across() reduces repetition.

# Apply multiple functions to multiple columns
macro %>%
  filter(year >= 1950) %>%
  summarize(
    across(
      c(unemployment, inflation),
      list(
        mean = ~mean(.x, na.rm = TRUE),
        sd = ~sd(.x, na.rm = TRUE),
        min = ~min(.x, na.rm = TRUE),
        max = ~max(.x, na.rm = TRUE)
      )
    )
  )
## # A tibble: 1 × 8
##   unemployment_mean unemployment_sd unemployment_min unemployment_max
##               <dbl>           <dbl>            <dbl>            <dbl>
## 1            0.0583          0.0168             0.03              0.1
## # ℹ 4 more variables: inflation_mean <dbl>, inflation_sd <dbl>,
## #   inflation_min <dbl>, inflation_max <dbl>
# Grouped across() summary
macro %>%
  filter(year >= 1950) %>%
  mutate(era = case_when(
    year < 1980 ~ "1950-1979",
    year < 2000 ~ "1980-1999",
    TRUE ~ "2000-Present"
  )) %>%
  group_by(era) %>%
  summarize(
    across(
      c(unemployment, inflation, realgdp_percap),
      list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))
    ),
    n = n()
  )
## # A tibble: 3 × 8
##   era          unemployment_mean unemployment_sd inflation_mean inflation_sd
##   <chr>                    <dbl>           <dbl>          <dbl>        <dbl>
## 1 1950-1979               0.0533          0.0147         0.038        0.0317
## 2 1980-1999               0.0665          0.0160         0.043        0.0292
## 3 2000-Present            0.0576          0.0179         0.0256       0.0164
## # ℹ 3 more variables: realgdp_percap_mean <dbl>, realgdp_percap_sd <dbl>,
## #   n <int>

7.3 Correlation

Correlation measures the linear relationship between two variables. The correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.

# Correlation between two variables
cor(macro$unemployment, macro$inflation, use = "complete.obs")
## [1] -0.3424419
# Correlation matrix for multiple variables
macro %>%
  filter(year >= 1950) %>%
  select(unemployment, inflation, realgdp_percap, sp_composite) %>%
  cor(use = "complete.obs")
##                unemployment  inflation realgdp_percap sp_composite
## unemployment     1.00000000  0.1315356     0.03854902   -0.2100319
## inflation        0.13153562  1.0000000    -0.10759600   -0.1149998
## realgdp_percap   0.03854902 -0.1075960     1.00000000    0.8372281
## sp_composite    -0.21003195 -0.1149998     0.83722809    1.0000000
# Store correlation matrix for later use
cor_matrix <- macro %>%
  filter(year >= 1950) %>%
  select(unemployment, inflation, ten_year_rate) %>%
  cor(use = "complete.obs")

cor_matrix
##               unemployment inflation ten_year_rate
## unemployment     1.0000000 0.1315356     0.4265915
## inflation        0.1315356 1.0000000     0.5739462
## ten_year_rate    0.4265915 0.5739462     1.0000000

Correlation analysis works well with cross-sectional data too. Let’s examine relationships across U.S. states:

# Correlation among state-level economic indicators
states %>%
  filter(state != "Puerto Rico") %>%
  select(median_household_income, poverty_rate,
         bachelors_or_higher, unemployment_rate) %>%
  cor(use = "complete.obs")
##                         median_household_income poverty_rate
## median_household_income              1.00000000   -0.7231950
## poverty_rate                        -0.72319502    1.0000000
## bachelors_or_higher                  0.82561730   -0.4881848
## unemployment_rate                    0.06019203    0.5017541
##                         bachelors_or_higher unemployment_rate
## median_household_income          0.82561730        0.06019203
## poverty_rate                    -0.48818481        0.50175406
## bachelors_or_higher              1.00000000        0.07847918
## unemployment_rate                0.07847918        1.00000000

This reveals strong relationships: states with higher education levels tend to have higher incomes and lower poverty rates.

7.4 Creating Basic Tables with knitr

The knitr package provides kable() for creating simple tables. This is often sufficient for quick summaries.

# Create a summary table
president_summary <- macro %>%
  filter(year >= 1950) %>%
  group_by(president) %>%
  summarize(
    Years = n(),
    `Avg Unemployment` = mean(unemployment, na.rm = TRUE),
    `Avg Inflation` = mean(inflation, na.rm = TRUE)
  ) %>%
  arrange(desc(`Avg Inflation`))

# Display with kable
knitr::kable(president_summary, digits = 3)
president Years Avg Unemployment Avg Inflation
Carter 4 0.065 0.098
Ford 2 0.085 0.075
Nixon 6 0.053 0.058
Biden 4 0.043 0.050
Reagan 8 0.078 0.046
GHWBush 4 0.065 0.043
Truman 3 0.037 0.037
GWBush 8 0.055 0.029
Clinton 8 0.053 0.026
LJohnson 6 0.047 0.023
Trump 4 0.050 0.018
Eisenhower 8 0.050 0.015
Obama 8 0.074 0.014
Kennedy 2 0.065 0.010
# Add caption and format
knitr::kable(
  president_summary,
  digits = 3,
  caption = "Economic Performance by President (Post-1950)",
  col.names = c("President", "Years in Office", "Unemployment Rate", "Inflation Rate")
)
Table 7.1: Economic Performance by President (Post-1950)
President Years in Office Unemployment Rate Inflation Rate
Carter 4 0.065 0.098
Ford 2 0.085 0.075
Nixon 6 0.053 0.058
Biden 4 0.043 0.050
Reagan 8 0.078 0.046
GHWBush 4 0.065 0.043
Truman 3 0.037 0.037
GWBush 8 0.055 0.029
Clinton 8 0.053 0.026
LJohnson 6 0.047 0.023
Trump 4 0.050 0.018
Eisenhower 8 0.050 0.015
Obama 8 0.074 0.014
Kennedy 2 0.065 0.010

7.5 Publication-Quality Tables with kableExtra

The kableExtra package extends kable() with extensive formatting options for both HTML and LaTeX/PDF output.

library(kableExtra)

7.5.1 Basic Styling

president_summary %>%
  kable(
    digits = 3,
    col.names = c("President", "Years", "Unemployment", "Inflation"),
    caption = "Average Economic Indicators by President"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 7.2: Table 7.3: Average Economic Indicators by President
President Years Unemployment Inflation
Carter 4 0.065 0.098
Ford 2 0.085 0.075
Nixon 6 0.053 0.058
Biden 4 0.043 0.050
Reagan 8 0.078 0.046
GHWBush 4 0.065 0.043
Truman 3 0.037 0.037
GWBush 8 0.055 0.029
Clinton 8 0.053 0.026
LJohnson 6 0.047 0.023
Trump 4 0.050 0.018
Eisenhower 8 0.050 0.015
Obama 8 0.074 0.014
Kennedy 2 0.065 0.010

7.5.2 Adding Headers and Formatting

# Create a more detailed table
decade_stats <- macro %>%
  filter(year >= 1950) %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s")) %>%
  group_by(decade) %>%
  summarize(
    n = n(),
    unemp_mean = mean(unemployment, na.rm = TRUE) * 100,
    unemp_sd = sd(unemployment, na.rm = TRUE) * 100,
    inf_mean = mean(inflation, na.rm = TRUE) * 100,
    inf_sd = sd(inflation, na.rm = TRUE) * 100
  )

decade_stats %>%
  kable(
    digits = 2,
    col.names = c("Decade", "N", "Mean", "SD", "Mean", "SD"),
    caption = "Unemployment and Inflation by Decade (Percent)"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Unemployment" = 2, "Inflation" = 2))
Table 7.4: Table 7.5: Unemployment and Inflation by Decade (Percent)
Unemployment
Inflation
Decade N Mean SD Mean SD
1950s 10 4.5 1.43 2.1 2.42
1960s 10 5.1 1.10 2.3 1.42
1970s 10 6.4 1.26 7.0 2.71
1980s 10 7.4 1.65 5.6 3.66
1990s 10 5.9 1.20 3.0 0.94
2000s 10 5.7 1.34 2.6 1.07
2010s 10 6.2 2.20 1.7 0.82
2020s 5 5.0 1.73 4.2 2.59

7.5.3 Conditional Formatting

# Highlight rows based on values
president_summary %>%
  kable(digits = 3, col.names = c("President", "Years", "Unemployment", "Inflation")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(president_summary$`Avg Inflation` > 0.05), background = "#ffcccc") %>%
  row_spec(which(president_summary$`Avg Inflation` < 0.03), background = "#ccffcc")
President Years Unemployment Inflation
Carter 4 0.065 0.098
Ford 2 0.085 0.075
Nixon 6 0.053 0.058
Biden 4 0.043 0.050
Reagan 8 0.078 0.046
GHWBush 4 0.065 0.043
Truman 3 0.037 0.037
GWBush 8 0.055 0.029
Clinton 8 0.053 0.026
LJohnson 6 0.047 0.023
Trump 4 0.050 0.018
Eisenhower 8 0.050 0.015
Obama 8 0.074 0.014
Kennedy 2 0.065 0.010

7.5.4 Footnotes and Notes

decade_stats %>%
  kable(
    digits = 2,
    col.names = c("Decade", "N", "Mean", "SD", "Mean", "SD"),
    caption = "Unemployment and Inflation Statistics by Decade"
  ) %>%
  kable_styling(bootstrap_options = c("striped"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Unemployment (%)" = 2, "Inflation (%)" = 2)) %>%
  footnote(
    general = "Data from Historical Statistics of the United States.",
    number = c("Unemployment and inflation expressed as percentages.",
               "SD = Standard Deviation.")
  )
Table 7.6: Table 7.7: Unemployment and Inflation Statistics by Decade
Unemployment (%)
Inflation (%)
Decade N Mean SD Mean SD
1950s 10 4.5 1.43 2.1 2.42
1960s 10 5.1 1.10 2.3 1.42
1970s 10 6.4 1.26 7.0 2.71
1980s 10 7.4 1.65 5.6 3.66
1990s 10 5.9 1.20 3.0 0.94
2000s 10 5.7 1.34 2.6 1.07
2010s 10 6.2 2.20 1.7 0.82
2020s 5 5.0 1.73 4.2 2.59
Note:
Data from Historical Statistics of the United States.
1 Unemployment and inflation expressed as percentages.
2 SD = Standard Deviation.

7.6 The gt Package

The gt package provides another approach to creating tables with a grammar similar to ggplot2. It offers fine-grained control over every aspect of table appearance.

library(gt)

7.6.1 Basic gt Table

president_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Performance by President",
    subtitle = "Average unemployment and inflation rates, 1950-2024"
  )
Economic Performance by President
Average unemployment and inflation rates, 1950-2024
president Years Avg Unemployment Avg Inflation
Carter 4 0.06500000 0.09750000
Ford 2 0.08500000 0.07500000
Nixon 6 0.05333333 0.05833333
Biden 4 0.04250000 0.05000000
Reagan 8 0.07750000 0.04625000
GHWBush 4 0.06500000 0.04250000
Truman 3 0.03666667 0.03666667
GWBush 8 0.05500000 0.02875000
Clinton 8 0.05250000 0.02625000
LJohnson 6 0.04666667 0.02333333
Trump 4 0.05000000 0.01750000
Eisenhower 8 0.05000000 0.01500000
Obama 8 0.07375000 0.01375000
Kennedy 2 0.06500000 0.01000000

7.6.2 Formatting Columns

president_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Performance by President",
    subtitle = "Post-World War II administrations"
  ) %>%
  fmt_percent(
    columns = c(`Avg Unemployment`, `Avg Inflation`),
    decimals = 1
  ) %>%
  cols_label(
    president = "President",
    Years = "Years in Office",
    `Avg Unemployment` = "Unemployment",
    `Avg Inflation` = "Inflation"
  )
Economic Performance by President
Post-World War II administrations
President Years in Office Unemployment Inflation
Carter 4 6.5% 9.8%
Ford 2 8.5% 7.5%
Nixon 6 5.3% 5.8%
Biden 4 4.2% 5.0%
Reagan 8 7.8% 4.6%
GHWBush 4 6.5% 4.2%
Truman 3 3.7% 3.7%
GWBush 8 5.5% 2.9%
Clinton 8 5.3% 2.6%
LJohnson 6 4.7% 2.3%
Trump 4 5.0% 1.8%
Eisenhower 8 5.0% 1.5%
Obama 8 7.4% 1.4%
Kennedy 2 6.5% 1.0%

7.6.3 Adding Color and Style

decade_stats %>%
  gt() %>%
  tab_header(
    title = "U.S. Economic Indicators by Decade",
    subtitle = "Unemployment and inflation statistics"
  ) %>%
  cols_label(
    decade = "Decade",
    n = "Years",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD"
  ) %>%
  tab_spanner(
    label = "Unemployment (%)",
    columns = c(unemp_mean, unemp_sd)
  ) %>%
  tab_spanner(
    label = "Inflation (%)",
    columns = c(inf_mean, inf_sd)
  ) %>%
  fmt_number(
    columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd),
    decimals = 2
  ) %>%
  data_color(
    columns = inf_mean,
    palette = c("white", "orange", "red"),
    domain = c(0, 10)
  )
U.S. Economic Indicators by Decade
Unemployment and inflation statistics
Decade Years
Unemployment (%)
Inflation (%)
Mean SD Mean SD
1950s 10 4.50 1.43 2.10 2.42
1960s 10 5.10 1.10 2.30 1.42
1970s 10 6.40 1.26 7.00 2.71
1980s 10 7.40 1.65 5.60 3.66
1990s 10 5.90 1.20 3.00 0.94
2000s 10 5.70 1.34 2.60 1.07
2010s 10 6.20 2.20 1.70 0.82
2020s 5 5.00 1.73 4.20 2.59

7.6.4 Source Notes and Footnotes

macro %>%
  filter(year >= 2015) %>%
  select(year, president, unemployment, inflation, realgdp_percap) %>%
  gt() %>%
  tab_header(
    title = "Recent U.S. Economic Data",
    subtitle = "Annual observations, 2015-2024"
  ) %>%
  fmt_percent(columns = c(unemployment, inflation), decimals = 1) %>%
  fmt_currency(columns = realgdp_percap, decimals = 0) %>%
  cols_label(
    year = "Year",
    president = "President",
    unemployment = "Unemployment",
    inflation = "Inflation",
    realgdp_percap = "Real GDP/Capita"
  ) %>%
  tab_source_note("Source: Historical Statistics of the United States") %>%
  tab_footnote(
    footnote = "Real GDP per capita in 2017 dollars",
    locations = cells_column_labels(columns = realgdp_percap)
  )
Recent U.S. Economic Data
Annual observations, 2015-2024
Year President Unemployment Inflation Real GDP/Capita1
2015 Obama 5.0% 0.0% $50,820
2016 Obama 5.0% 1.0% $51,563
2017 Trump 4.0% 2.0% $52,532
2018 Trump 4.0% 2.0% $53,988
2019 Trump 4.0% 2.0% $55,131
2020 Trump 8.0% 1.0% $54,612
2021 Biden 5.0% 5.0% $56,837
2022 Biden 4.0% 8.0% $58,253
2023 Biden 4.0% 4.0% $59,428
2024 Biden 4.0% 3.0% $60,792
1 Real GDP per capita in 2017 dollars
Source: Historical Statistics of the United States

7.7 Descriptive Statistics Tables

A common task in empirical research is creating a “Table 1” that summarizes the key variables in your dataset.

# Create comprehensive descriptive statistics
descriptive_stats <- macro %>%
  filter(year >= 1950) %>%
  summarize(
    across(
      c(unemployment, inflation, realgdp_percap, ten_year_rate),
      list(
        N = ~sum(!is.na(.x)),
        Mean = ~mean(.x, na.rm = TRUE),
        SD = ~sd(.x, na.rm = TRUE),
        Min = ~min(.x, na.rm = TRUE),
        Max = ~max(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(
    everything(),
    names_to = c("variable", "statistic"),
    names_sep = "_",
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  ) %>%
  mutate(
    variable = case_when(
      variable == "unemployment" ~ "Unemployment Rate",
      variable == "inflation" ~ "Inflation Rate",
      variable == "realgdp_percap" ~ "Real GDP per Capita",
      variable == "ten_year_rate" ~ "10-Year Treasury Rate"
    )
  )

descriptive_stats %>%
  gt() %>%
  tab_header(
    title = "Descriptive Statistics",
    subtitle = "U.S. Economic Indicators, 1950-2024"
  ) %>%
  fmt_number(columns = c(Mean, SD, Min, Max), decimals = 3) %>%
  fmt_number(columns = N, decimals = 0) %>%
  cols_label(variable = "Variable")
Descriptive Statistics
U.S. Economic Indicators, 1950-2024
Variable N Mean SD Min Max percap year
Unemployment Rate 75 0.05826667 0.01679554 0.03 0.1
Inflation Rate 75 0.0352 0.02737922 0 0.14
NA 75.00, 34411.41, 13645.89, 14398.00, 60792.00
NA 75.000000, 5.360933, 2.865575, 0.890000, 14.590000

7.8 Correlation Tables

Presenting correlation matrices is common in empirical research.

# Calculate correlation matrix
cor_data <- macro %>%
  filter(year >= 1950) %>%
  select(
    Unemployment = unemployment,
    Inflation = inflation,
    `GDP per Capita` = realgdp_percap,
    `10-Year Rate` = ten_year_rate
  ) %>%
  cor(use = "pairwise.complete.obs")

# Display as formatted table
cor_data %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  gt() %>%
  tab_header(title = "Correlation Matrix") %>%
  fmt_number(columns = -Variable, decimals = 3) %>%
  data_color(
    columns = -Variable,
    palette = c("#4575b4", "white", "#d73027"),
    domain = c(-1, 1)
  )
Correlation Matrix
Variable Unemployment Inflation GDP per Capita 10-Year Rate
Unemployment 1.000 0.132 0.039 0.427
Inflation 0.132 1.000 −0.108 0.574
GDP per Capita 0.039 −0.108 1.000 −0.224
10-Year Rate 0.427 0.574 −0.224 1.000

7.9 Exporting Tables

7.9.1 Saving to Files

# Save kable table to HTML file
president_summary %>%
  kable(digits = 3) %>%
  kable_styling() %>%
  save_kable("president_summary.html")

# Save gt table to various formats
my_table <- president_summary %>%
  gt() %>%
  tab_header(title = "Economic Performance by President")

# Save as HTML
gtsave(my_table, "president_table.html")

# Save as PNG image
gtsave(my_table, "president_table.png")

# Save as Word document
gtsave(my_table, "president_table.docx")

7.9.2 Exporting to CSV

For further analysis or use in other software, you can export your summary statistics as CSV files.

# Export summary statistics to CSV
decade_stats %>%
  write_csv("decade_summary_stats.csv")

# Export with better formatting
president_summary %>%
  mutate(
    `Avg Unemployment` = scales::percent(`Avg Unemployment`, accuracy = 0.1),
    `Avg Inflation` = scales::percent(`Avg Inflation`, accuracy = 0.1)
  ) %>%
  write_csv("president_summary.csv")

7.10 Example: Economic Report Table

Let’s create a comprehensive table suitable for an economic report.

# Prepare data
report_data <- macro %>%
  filter(year >= 1980) %>%
  mutate(
    decade = case_when(
      year < 1990 ~ "1980s",
      year < 2000 ~ "1990s",
      year < 2010 ~ "2000s",
      year < 2020 ~ "2010s",
      TRUE ~ "2020s"
    )
  ) %>%
  group_by(decade) %>%
  summarize(
    years = paste(min(year), max(year), sep = "-"),
    unemp_mean = mean(unemployment, na.rm = TRUE) * 100,
    unemp_sd = sd(unemployment, na.rm = TRUE) * 100,
    inf_mean = mean(inflation, na.rm = TRUE) * 100,
    inf_sd = sd(inflation, na.rm = TRUE) * 100,
    gdp_growth = (last(realgdp_percap) / first(realgdp_percap))^(1/n()) - 1,
    gdp_growth = gdp_growth * 100
  )

report_data %>%
  gt() %>%
  tab_header(
    title = md("**U.S. Economic Performance by Decade**"),
    subtitle = "Summary statistics for key macroeconomic indicators"
  ) %>%
  cols_label(
    decade = "Decade",
    years = "Period",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD",
    gdp_growth = "Avg. Annual"
  ) %>%
  tab_spanner(label = "Unemployment (%)", columns = c(unemp_mean, unemp_sd)) %>%
  tab_spanner(label = "Inflation (%)", columns = c(inf_mean, inf_sd)) %>%
  tab_spanner(label = "GDP Growth (%)", columns = gdp_growth) %>%
  fmt_number(columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd, gdp_growth), decimals = 2) %>%
  tab_source_note(md("*Source: Historical Statistics of the United States*")) %>%
  tab_footnote(
    footnote = "Average annual real GDP per capita growth rate",
    locations = cells_column_labels(columns = gdp_growth)
  ) %>%
  opt_row_striping()
U.S. Economic Performance by Decade
Summary statistics for key macroeconomic indicators
Decade Period
Unemployment (%)
Inflation (%)
GDP Growth (%)
Mean SD Mean SD Avg. Annual1
1980s 1980-1989 7.40 1.65 5.60 3.66 2.29
1990s 1990-1999 5.90 1.20 3.00 0.94 1.90
2000s 2000-2009 5.70 1.34 2.60 1.07 0.54
2010s 2010-2019 6.20 2.20 1.70 0.82 1.45
2020s 2020-2024 5.00 1.73 4.20 2.59 2.17
1 Average annual real GDP per capita growth rate
Source: Historical Statistics of the United States

7.11 Example: Comparison Table

When comparing groups, a well-structured table helps readers quickly identify differences.

# Compare pre- and post-1980 periods
era_comparison <- macro %>%
  filter(year >= 1950) %>%
  mutate(era = ifelse(year < 1980, "1950-1979", "1980-2024")) %>%
  group_by(era) %>%
  summarize(
    n_years = n(),
    unemp_mean = mean(unemployment, na.rm = TRUE),
    unemp_sd = sd(unemployment, na.rm = TRUE),
    inf_mean = mean(inflation, na.rm = TRUE),
    inf_sd = sd(inflation, na.rm = TRUE),
    rate_mean = mean(ten_year_rate, na.rm = TRUE),
    rate_sd = sd(ten_year_rate, na.rm = TRUE)
  )

era_comparison %>%
  gt() %>%
  tab_header(
    title = "The Great Moderation",
    subtitle = "Comparing economic volatility before and after 1980"
  ) %>%
  cols_label(
    era = "Period",
    n_years = "Years",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD",
    rate_mean = "Mean",
    rate_sd = "SD"
  ) %>%
  tab_spanner(label = "Unemployment", columns = c(unemp_mean, unemp_sd)) %>%
  tab_spanner(label = "Inflation", columns = c(inf_mean, inf_sd)) %>%
  tab_spanner(label = "10-Year Rate", columns = c(rate_mean, rate_sd)) %>%
  fmt_percent(columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd, rate_mean, rate_sd), decimals = 1) %>%
  tab_footnote(
    footnote = "Note the reduction in standard deviations after 1980",
    locations = cells_column_spanners("Inflation")
  )
The Great Moderation
Comparing economic volatility before and after 1980
Period Years
Unemployment
Inflation1
10-Year Rate
Mean SD Mean SD Mean SD
1950-1979 30 5.3% 1.5% 3.8% 3.2% 491.6% 197.9%
1980-2024 45 6.2% 1.7% 3.3% 2.4% 565.7% 331.7%
1 Note the reduction in standard deviations after 1980

7.12 Example: State Comparison Tables

Summary tables are especially useful for comparing geographic units. Let’s create a table of the top and bottom states by median income.

# Top 10 and bottom 10 states by median household income
top_bottom_states <- states %>%
  filter(state != "Puerto Rico", state != "District of Columbia") %>%
  arrange(desc(median_household_income)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 10 | rank > n() - 10) %>%
  select(rank, state, median_household_income, poverty_rate, bachelors_or_higher)

top_bottom_states %>%
  gt() %>%
  tab_header(
    title = "States by Median Household Income",
    subtitle = "Top 10 and bottom 10 states"
  ) %>%
  fmt_currency(columns = median_household_income, decimals = 0) %>%
  fmt_percent(columns = c(poverty_rate, bachelors_or_higher), decimals = 1) %>%
  cols_label(
    rank = "Rank",
    state = "State",
    median_household_income = "Median Income",
    poverty_rate = "Poverty Rate",
    bachelors_or_higher = "Bachelor's+"
  ) %>%
  tab_row_group(
    label = "Top 10 States",
    rows = rank <= 10
  ) %>%
  tab_row_group(
    label = "Bottom 10 States",
    rows = rank > 10
  ) %>%
  tab_source_note("Source: American Community Survey")
States by Median Household Income
Top 10 and bottom 10 states
Rank State Median Income Poverty Rate Bachelor's+
Bottom 10 States
41 Tennessee $64,035 14.0% 29.7%
42 South Carolina $63,623 14.4% 30.6%
43 Oklahoma $61,364 15.2% 27.3%
44 Kentucky $60,183 16.1% 26.5%
45 Alabama $59,609 15.7% 27.2%
46 New Mexico $58,722 18.3% 29.1%
47 Louisiana $57,852 18.7% 26.1%
48 Arkansas $56,335 16.2% 24.7%
49 West Virginia $55,217 16.8% 22.7%
50 Mississippi $52,985 19.2% 23.9%
Top 10 States
1 Maryland $98,461 9.3% 42.2%
2 New Jersey $97,126 9.7% 42.3%
3 Massachusetts $96,505 9.9% 45.9%
4 Hawaii $94,814 9.6% 34.7%
5 California $91,905 12.1% 35.9%
6 New Hampshire $90,845 7.3% 39.0%
7 Washington $90,325 9.9% 38.0%
8 Connecticut $90,213 10.1% 41.4%
9 Colorado $87,598 9.6% 43.7%
10 Virginia $87,249 10.0% 41.0%
Source: American Community Survey

Regional comparisons are also informative:

# Regional summary table
regional_summary <- states %>%
  filter(state != "Puerto Rico") %>%
  mutate(
    region = case_when(
      state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
                   "Rhode Island", "Vermont", "New Jersey", "New York",
                   "Pennsylvania") ~ "Northeast",
      state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan",
                   "Minnesota", "Missouri", "Nebraska", "North Dakota",
                   "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
      state %in% c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
                   "Kentucky", "Louisiana", "Maryland", "Mississippi",
                   "North Carolina", "Oklahoma", "South Carolina", "Tennessee",
                   "Texas", "Virginia", "West Virginia",
                   "District of Columbia") ~ "South",
      TRUE ~ "West"
    )
  ) %>%
  group_by(region) %>%
  summarize(
    states = n(),
    total_pop = sum(population),
    # Calculate population-weighted averages manually
    median_income = sum(median_household_income * population) / sum(population),
    poverty_rate = sum(poverty_rate * population) / sum(population),
    bachelors = sum(bachelors_or_higher * population) / sum(population)
  )

regional_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Indicators by Census Region",
    subtitle = "Population-weighted averages"
  ) %>%
  cols_label(
    region = "Region",
    states = "States",
    total_pop = "Population",
    median_income = "Median Income",
    poverty_rate = "Poverty Rate",
    bachelors = "Bachelor's+"
  ) %>%
  fmt_number(columns = total_pop, suffixing = TRUE) %>%
  fmt_currency(columns = median_income, decimals = 0) %>%
  fmt_percent(columns = c(poverty_rate, bachelors), decimals = 1)
Economic Indicators by Census Region
Population-weighted averages
Region States Population Median Income Poverty Rate Bachelor's+
Midwest 12 68.86M $71,682 12.0% 32.6%
Northeast 9 57.31M $84,294 11.6% 39.1%
South 17 126.50M $69,950 13.8% 32.3%
West 13 78.43M $85,924 11.7% 35.4%

7.13 Example: Demographic Tables from PUMS

Individual-level survey data like PUMS allows for detailed demographic analysis.

# Summary statistics by sex and education
pums %>%
  filter(!is.na(education), !is.na(sex),
         !is.na(wage_income), wage_income > 0) %>%
  group_by(sex, education) %>%
  summarize(
    n = n(),
    mean_wage = mean(wage_income),
    median_wage = median(wage_income),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = sex,
    values_from = c(n, mean_wage, median_wage)
  ) %>%
  gt() %>%
  tab_header(
    title = "Wage Income by Education and Sex",
    subtitle = "Working-age adults with positive wage income"
  ) %>%
  cols_label(
    education = "Education",
    n_Male = "N",
    mean_wage_Male = "Mean",
    median_wage_Male = "Median",
    n_Female = "N",
    mean_wage_Female = "Mean",
    median_wage_Female = "Median"
  ) %>%
  tab_spanner(label = "Male", columns = ends_with("Male")) %>%
  tab_spanner(label = "Female", columns = ends_with("Female")) %>%
  fmt_number(columns = starts_with("n_"), decimals = 0) %>%
  fmt_currency(columns = c(starts_with("mean_"), starts_with("median_")), decimals = 0)
Wage Income by Education and Sex
Working-age adults with positive wage income
Female
Female
Female
Education
Male
N N Mean Mean Median Median
Bachelors 1,499 1,414 $68,691 $102,184 $54,000 $75,000
High school 1,075 1,509 $31,845 $45,614 $26,000 $36,000
Less than HS 345 565 $24,238 $35,078 $20,800 $30,000
Masters 740 568 $81,291 $127,928 $70,000 $100,000
Professional/Doctorate 217 291 $114,953 $169,259 $90,000 $110,000
Some college 1,657 1,695 $38,877 $55,614 $30,000 $45,000

We can also create a “Table 1” style summary of the entire PUMS sample:

# Comprehensive sample description
pums_summary <- pums %>%
  summarize(
    `Sample Size` = n(),
    `Mean Age` = mean(age, na.rm = TRUE),
    `% Female` = mean(sex == "Female", na.rm = TRUE) * 100,
    `% Employed` = mean(employed == 1, na.rm = TRUE) * 100,
    `Mean Hours Worked` = mean(hours_worked[hours_worked > 0], na.rm = TRUE),
    `Median Wage Income` = median(wage_income[wage_income > 0], na.rm = TRUE),
    `Median Household Income` = median(household_income, na.rm = TRUE),
    `% Homeowner` = mean(tenure == "Owner", na.rm = TRUE) * 100
  ) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value")

pums_summary %>%
  gt() %>%
  tab_header(
    title = "Sample Characteristics",
    subtitle = "American Community Survey PUMS Data"
  ) %>%
  fmt_number(columns = Value, decimals = 1) %>%
  tab_source_note("Source: U.S. Census Bureau, ACS PUMS")
Sample Characteristics
American Community Survey PUMS Data
Variable Value
Sample Size 20,000.0
Mean Age 50.5
% Female 50.9
% Employed 0.0
Mean Hours Worked 38.4
Median Wage Income 44,000.0
Median Household Income 83,650.0
% Homeowner 0.0
Source: U.S. Census Bureau, ACS PUMS

7.14 Example: Business Analytics Tables

Transaction data enables operational analysis. Here’s a sales summary table for our cafe:

# Product performance table
cafe %>%
  group_by(product) %>%
  summarize(
    transactions = n(),
    revenue = sum(price),
    avg_price = mean(price),
    first_sale = min(date),
    last_sale = max(date)
  ) %>%
  arrange(desc(revenue)) %>%
  gt() %>%
  tab_header(
    title = "Product Performance Summary",
    subtitle = "2019-2023"
  ) %>%
  cols_label(
    product = "Product",
    transactions = "Transactions",
    revenue = "Total Revenue",
    avg_price = "Avg Price",
    first_sale = "First Sale",
    last_sale = "Last Sale"
  ) %>%
  fmt_number(columns = transactions, decimals = 0) %>%
  fmt_currency(columns = c(revenue, avg_price)) %>%
  tab_source_note("Note: Simulated data for educational purposes")
Product Performance Summary
2019-2023
Product Transactions Total Revenue Avg Price First Sale Last Sale
Latte 169,747 $908,444.22 $5.35 2019-01-01 2023-12-31
Croissant 113,248 $453,530.48 $4.00 2019-01-01 2023-12-31
Cold Brew 72,258 $368,071.42 $5.09 2019-01-01 2023-12-31
Espresso 85,978 $297,890.46 $3.46 2019-01-01 2023-12-31
Chai Tea 48,761 $221,681.59 $4.55 2019-01-01 2023-12-31
Note: Simulated data for educational purposes

Year-over-year comparisons are essential for business reporting:

# Year-over-year summary
cafe %>%
  group_by(year) %>%
  summarize(
    transactions = n(),
    revenue = sum(price),
    avg_price = mean(price),
    unique_customers = n_distinct(customer_id)
  ) %>%
  mutate(
    yoy_revenue = (revenue / lag(revenue) - 1) * 100,
    yoy_customers = (unique_customers / lag(unique_customers) - 1) * 100
  ) %>%
  gt() %>%
  tab_header(
    title = "Annual Sales Performance",
    subtitle = "Year-over-year analysis"
  ) %>%
  cols_label(
    year = "Year",
    transactions = "Transactions",
    revenue = "Revenue",
    avg_price = "Avg Price",
    unique_customers = "Customers",
    yoy_revenue = "Revenue Δ",
    yoy_customers = "Customers Δ"
  ) %>%
  fmt_number(columns = c(transactions, unique_customers), decimals = 0) %>%
  fmt_currency(columns = c(revenue, avg_price)) %>%
  fmt_number(columns = c(yoy_revenue, yoy_customers), decimals = 1) %>%
  sub_missing(missing_text = "—") %>%
  tab_footnote(
    footnote = "Note the COVID-19 impact in 2020",
    locations = cells_body(columns = year, rows = year == 2020)
  )
Annual Sales Performance
Year-over-year analysis
Year Transactions Revenue Avg Price Customers Revenue Δ Customers Δ
2019 104,139 $408,711.61 $3.92 2,000
1 2020 65,955 $278,485.58 $4.22 2,000 −31.9 0.0
2021 97,246 $443,548.89 $4.56 2,000 59.3 0.0
2022 110,639 $538,454.95 $4.87 2,000 21.4 0.0
2023 112,013 $580,417.14 $5.18 2,000 7.8 0.0
1 Note the COVID-19 impact in 2020

Day-of-week analysis helps with staffing and inventory decisions:

# Sales by day of week
cafe %>%
  mutate(day_of_week = factor(day_of_week,
                              levels = c("Mon", "Tue", "Wed",
                                        "Thu", "Fri", "Sat", "Sun"))) %>%
  group_by(day_of_week) %>%
  summarize(
    avg_daily_transactions = n() / n_distinct(date),
    avg_daily_revenue = sum(price) / n_distinct(date),
    pct_of_total = n() / nrow(cafe) * 100
  ) %>%
  gt() %>%
  tab_header(
    title = "Sales by Day of Week",
    subtitle = "Average daily performance"
  ) %>%
  cols_label(
    day_of_week = "Day",
    avg_daily_transactions = "Avg Transactions",
    avg_daily_revenue = "Avg Revenue",
    pct_of_total = "% of Total"
  ) %>%
  fmt_number(columns = avg_daily_transactions, decimals = 0) %>%
  fmt_currency(columns = avg_daily_revenue) %>%
  fmt_number(columns = pct_of_total, decimals = 1) %>%
  data_color(
    columns = avg_daily_revenue,
    palette = c("white", "#228B22"),
    domain = c(200, 350)
  )
Sales by Day of Week
Average daily performance
Day Avg Transactions Avg Revenue % of Total
Mon 243 $1,117.28 12.9
Tue 241 $1,105.58 12.9
Wed 242 $1,108.44 12.9
Thu 242 $1,109.63 12.9
Fri 241 $1,107.15 12.8
Sat 334 $1,537.45 17.8
Sun 335 $1,537.98 17.8

7.15 Exercises

7.15.1 Macro Data Exercises

  1. Calculate the mean, median, standard deviation, and range of real GDP per capita for the post-war period (1950-present). What do the differences between mean and median tell you about the distribution?

  2. Create a summary table showing average unemployment and inflation for each president since 1950. Format it with kableExtra, highlighting presidents with average inflation above 5%.

  3. Calculate the correlation matrix for unemployment, inflation, the S&P 500, and the 10-year Treasury rate. Which pairs of variables are most strongly correlated?

  4. Using the across() function, calculate the mean and standard deviation of all numeric variables in the macro dataset for the 1970s versus the 1990s. Present the results in a gt table.

7.15.2 State Data Exercises

  1. Create a table showing the 5 states with the highest and 5 states with the lowest poverty rates. Include median income and education levels. Use gt with row grouping to separate high and low poverty states.

  2. Build a correlation table for state-level variables: population, median income, poverty rate, unemployment rate, and homeownership rate. Add color coding to highlight strong correlations.

  3. Create a regional comparison table (Northeast, Midwest, South, West) showing weighted averages for key economic indicators. Add footnotes explaining the weighting methodology.

7.15.3 PUMS Data Exercises

  1. Create a “Table 1” style summary of the PUMS sample, showing characteristics separately for employed and unemployed individuals. Compare age, education distribution, and household income.

  2. Build a table showing median wage income by education level and state. Use pivot_wider() to put states in columns and education levels in rows.

  3. Calculate and present the gender wage gap (ratio of female to male median wages) by education level. Format as a gt table with conditional coloring.

7.15.4 Cafe Data Exercises

  1. Create a monthly revenue table for each year (2019-2023). Calculate year-over-year percentage changes and highlight months with negative growth in red.

  2. Build a product profitability table showing each product’s contribution to total revenue over time. Has the product mix changed across years?

  3. Create a customer segmentation table: calculate summary statistics (total spending, visit frequency, average transaction) for customers in different spending quintiles.

7.15.5 Advanced Exercises

  1. Export three tables (one from macro data, one from state data, one from PUMS) to a single HTML file using gt’s tab_source_note() to ensure consistent sourcing.

  2. Create a publication-ready table comparing economic indicators across the four datasets. This will require careful thought about what comparable statistics can be computed from each dataset.