Chapter 7 Summary Statistics and Tables

In an empirical economic analysis, we often want to compute various summary statistics and present them in clear, well-formatted tables. Summary statistics condense large datasets into interpretable numbers—means, medians, standard deviations, and other measures that characterize the central tendency, spread, and shape of distributions. Tables organize these statistics for presentation in reports, papers, and presentations. This chapter covers how to calculate summary statistics in R and how to create publication-quality tables.

library(tidyverse)

# Load all datasets
macro <- read_csv("data/us_macrodata.csv")
states <- read_csv("data/acs_state.csv")
pums <- read_csv("data/acs_pums.csv")
cafe <- read_csv("data/cafe_sales.csv")

7.1 Summary Statistics in Base R

R provides built-in functions for common summary statistics. These functions work on vectors and return single values.

7.1.1 Measures of Central Tendency

Central tendency describes where the “center” of a distribution lies. The mean (arithmetic average) is the most common measure, but it can be heavily influenced by outliers. The median (the middle value when data are sorted) is more robust to extreme observations, making it preferred for skewed distributions like income.

# Mean: arithmetic average
mean(macro$unemployment)

## [1] 0.06779221

# Median: middle value
median(macro$unemployment)

## [1] 0.06

# For data with missing values, use na.rm = TRUE
mean(macro$one_year_rate, na.rm = TRUE)

## [1] 4.560333

7.1.2 Measures of Spread

Spread (or dispersion) describes how variable the data are around the center. Standard deviation and variance measure average distance from the mean, while the interquartile range (IQR) measures the spread of the middle 50% of observations.

# Standard deviation
sd(macro$unemployment)

## [1] 0.03705228

# Variance (standard deviation squared)
var(macro$unemployment)

## [1] 0.001372872

# Range: minimum and maximum
range(macro$unemployment)

## [1] 0.01 0.21

# Interquartile range (75th percentile minus 25th percentile)
IQR(macro$unemployment)

## [1] 0.04

# Specific quantiles
quantile(macro$unemployment, probs = c(0.25, 0.50, 0.75))

##  25%  50%  75% 
## 0.04 0.06 0.08

7.1.3 Other Useful Functions

# Minimum and maximum
min(macro$unemployment)

## [1] 0.01

max(macro$unemployment)

## [1] 0.21

# Number of observations
length(macro$unemployment)

## [1] 154

# Number of non-missing observations
sum(!is.na(macro$one_year_rate))

## [1] 150

# Sum of values
sum(macro$population)

## [1] 25855795

7.1.4 The summary() Function

The summary() function provides a quick overview of a variable or entire data frame.

# Summary of a single variable
summary(macro$unemployment)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01000 0.04000 0.06000 0.06779 0.08000 0.21000

# Summary of selected columns
macro %>%
  select(unemployment, inflation, realgdp_percap) %>%
  summary()

##   unemployment       inflation        realgdp_percap 
##  Min.   :0.01000   Min.   :-0.11000   Min.   : 3099  
##  1st Qu.:0.04000   1st Qu.: 0.00000   1st Qu.: 6317  
##  Median :0.06000   Median : 0.02000   Median :14782  
##  Mean   :0.06779   Mean   : 0.02247   Mean   :20326  
##  3rd Qu.:0.08000   3rd Qu.: 0.04000   3rd Qu.:32454  
##  Max.   :0.21000   Max.   : 0.20000   Max.   :60792

7.2 Summary Statistics with dplyr

While base R functions work on individual vectors, dplyr’s summarize() function lets you compute multiple statistics at once and organize them into a data frame. Combined with group_by(), this becomes a powerful tool for comparing statistics across groups.

7.2.1 Basic Summaries

# Multiple statistics in one call
macro %>%
  summarize(
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    median_unemployment = median(unemployment, na.rm = TRUE),
    sd_unemployment = sd(unemployment, na.rm = TRUE),
    min_unemployment = min(unemployment, na.rm = TRUE),
    max_unemployment = max(unemployment, na.rm = TRUE),
    n_observations = n()
  )

## # A tibble: 1 × 6
##   mean_unemployment median_unemployment sd_unemployment min_unemployment
##               <dbl>               <dbl>           <dbl>            <dbl>
## 1            0.0678                0.06          0.0371             0.01
## # ℹ 2 more variables: max_unemployment <dbl>, n_observations <int>

7.2.2 Grouped Summaries

Grouping allows you to calculate statistics separately for each level of a categorical variable. This is essential for comparing economic outcomes across time periods, countries, demographic groups, or policy regimes.

# Summary statistics by president
macro %>%
  filter(year >= 1950) %>%
  group_by(president) %>%
  summarize(
    years = n(),
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    mean_inflation = mean(inflation, na.rm = TRUE),
    mean_gdp_pc = mean(realgdp_percap, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_unemployment))

## # A tibble: 14 × 5
##    president  years mean_unemployment mean_inflation mean_gdp_pc
##    <chr>      <int>             <dbl>          <dbl>       <dbl>
##  1 Ford           2            0.085          0.075       25480.
##  2 Reagan         8            0.0775         0.0462      31137.
##  3 Obama          8            0.0738         0.0138      49146.
##  4 Carter         4            0.065          0.0975      28038 
##  5 GHWBush        4            0.065          0.0425      35668.
##  6 Kennedy        2            0.065          0.01        17741 
##  7 GWBush         8            0.055          0.0288      47114.
##  8 Nixon          6            0.0533         0.0583      24172 
##  9 Clinton        8            0.0525         0.0262      40167 
## 10 Eisenhower     8            0.05           0.015       16541.
## 11 Trump          4            0.05           0.0175      54066.
## 12 LJohnson       6            0.0467         0.0233      20800.
## 13 Biden          4            0.0425         0.05        58828.
## 14 Truman         3            0.0367         0.0367      15114.

# Summary by decade
macro %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s")) %>%
  group_by(decade) %>%
  summarize(
    mean_unemployment = mean(unemployment, na.rm = TRUE),
    sd_unemployment = sd(unemployment, na.rm = TRUE),
    mean_inflation = mean(inflation, na.rm = TRUE),
    sd_inflation = sd(inflation, na.rm = TRUE)
  )

## # A tibble: 16 × 5
##    decade mean_unemployment sd_unemployment mean_inflation sd_inflation
##    <chr>              <dbl>           <dbl>          <dbl>        <dbl>
##  1 1870s              0.11          0.015          -0.0222      0.0205 
##  2 1880s              0.044         0.0126         -0.008       0.0162 
##  3 1890s              0.089         0.0345         -0.009       0.0129 
##  4 1900s              0.046         0.00843         0.008       0.0175 
##  5 1910s              0.053         0.0142          0.071       0.0758 
##  6 1920s              0.056         0.0143          0.002       0.0692 
##  7 1930s              0.167         0.0377         -0.019       0.0493 
##  8 1940s              0.053         0.0424          0.057       0.0481 
##  9 1950s              0.045         0.0143          0.021       0.0242 
## 10 1960s              0.051         0.0110          0.023       0.0142 
## 11 1970s              0.064         0.0126          0.07        0.0271 
## 12 1980s              0.074         0.0165          0.056       0.0366 
## 13 1990s              0.059         0.0120          0.03        0.00943
## 14 2000s              0.057         0.0134          0.026       0.0107 
## 15 2010s              0.062         0.0220          0.017       0.00823
## 16 2020s              0.05          0.0173          0.042       0.0259

The same approach works with any grouped data. Here are examples with our other datasets:

# Summary statistics by region (using state data)
# First, let's add region classifications
states %>%
  mutate(
    region = case_when(
      state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
                   "Rhode Island", "Vermont", "New Jersey", "New York",
                   "Pennsylvania") ~ "Northeast",
      state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan",
                   "Minnesota", "Missouri", "Nebraska", "North Dakota",
                   "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
      state %in% c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
                   "Kentucky", "Louisiana", "Maryland", "Mississippi",
                   "North Carolina", "Oklahoma", "South Carolina", "Tennessee",
                   "Texas", "Virginia", "West Virginia",
                   "District of Columbia") ~ "South",
      state %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada",
                   "New Mexico", "Utah", "Wyoming", "Alaska", "California",
                   "Hawaii", "Oregon", "Washington") ~ "West",
      TRUE ~ "Territory"
    )
  ) %>%
  filter(region != "Territory") %>%
  group_by(region) %>%
  summarize(
    n_states = n(),
    mean_income = mean(median_household_income),
    mean_poverty = mean(poverty_rate) * 100,
    mean_education = mean(bachelors_or_higher) * 100
  )

## # A tibble: 4 × 5
##   region    n_states mean_income mean_poverty mean_education
##   <chr>        <int>       <dbl>        <dbl>          <dbl>
## 1 Midwest         12      71604          11.6           32.3
## 2 Northeast        9      83653.         10.6           39.3
## 3 South           17      69203.         14.4           32.4
## 4 West            13      78960.         11.6           33.5

# Summary statistics by education level (using PUMS microdata)
pums %>%
  filter(!is.na(wage_income), wage_income > 0, !is.na(education)) %>%
  group_by(education) %>%
  summarize(
    n = n(),
    mean_wage = mean(wage_income),
    median_wage = median(wage_income),
    sd_wage = sd(wage_income)
  ) %>%
  arrange(desc(mean_wage))

## # A tibble: 6 × 5
##   education                  n mean_wage median_wage sd_wage
##   <chr>                  <int>     <dbl>       <dbl>   <dbl>
## 1 Professional/Doctorate   508   146061.      100000 151356.
## 2 Masters                 1308   101543.       80000 100578.
## 3 Bachelors               2913    84949.       60000  97290.
## 4 Some college            3352    47340.       37000  50614.
## 5 High school             2584    39886.       30000  42293.
## 6 Less than HS             910    30968.       25000  26666.

# Summary statistics by product (using cafe data)
cafe %>%
  group_by(product) %>%
  summarize(
    transactions = n(),
    total_revenue = sum(price),
    avg_price = mean(price),
    min_price = min(price),
    max_price = max(price)
  ) %>%
  arrange(desc(total_revenue))

## # A tibble: 5 × 6
##   product   transactions total_revenue avg_price min_price max_price
##   <chr>            <int>         <dbl>     <dbl>     <dbl>     <dbl>
## 1 Latte           169747       908444.      5.35      4.5       6.18
## 2 Croissant       113248       453530.      4.00      3.25      4.72
## 3 Cold Brew        72258       368071.      5.09      4.25      5.88
## 4 Espresso         85978       297890.      3.46      2.75      4.16
## 5 Chai Tea         48761       221682.      4.55      3.75      5.29

7.2.3 The across() Function

When you need to apply the same summary functions to multiple columns, across() reduces repetition.

# Apply multiple functions to multiple columns
macro %>%
  filter(year >= 1950) %>%
  summarize(
    across(
      c(unemployment, inflation),
      list(
        mean = ~mean(.x, na.rm = TRUE),
        sd = ~sd(.x, na.rm = TRUE),
        min = ~min(.x, na.rm = TRUE),
        max = ~max(.x, na.rm = TRUE)
      )
    )
  )

## # A tibble: 1 × 8
##   unemployment_mean unemployment_sd unemployment_min unemployment_max
##               <dbl>           <dbl>            <dbl>            <dbl>
## 1            0.0583          0.0168             0.03              0.1
## # ℹ 4 more variables: inflation_mean <dbl>, inflation_sd <dbl>,
## #   inflation_min <dbl>, inflation_max <dbl>

# Grouped across() summary
macro %>%
  filter(year >= 1950) %>%
  mutate(era = case_when(
    year < 1980 ~ "1950-1979",
    year < 2000 ~ "1980-1999",
    TRUE ~ "2000-Present"
  )) %>%
  group_by(era) %>%
  summarize(
    across(
      c(unemployment, inflation, realgdp_percap),
      list(mean = ~mean(.x, na.rm = TRUE), sd = ~sd(.x, na.rm = TRUE))
    ),
    n = n()
  )

## # A tibble: 3 × 8
##   era          unemployment_mean unemployment_sd inflation_mean inflation_sd
##   <chr>                    <dbl>           <dbl>          <dbl>        <dbl>
## 1 1950-1979               0.0533          0.0147         0.038        0.0317
## 2 1980-1999               0.0665          0.0160         0.043        0.0292
## 3 2000-Present            0.0576          0.0179         0.0256       0.0164
## # ℹ 3 more variables: realgdp_percap_mean <dbl>, realgdp_percap_sd <dbl>,
## #   n <int>

7.3 Correlation

Correlation measures the linear relationship between two variables. The correlation coefficient ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.

# Correlation between two variables
cor(macro$unemployment, macro$inflation, use = "complete.obs")

## [1] -0.3424419

# Correlation matrix for multiple variables
macro %>%
  filter(year >= 1950) %>%
  select(unemployment, inflation, realgdp_percap, sp_composite) %>%
  cor(use = "complete.obs")

##                unemployment  inflation realgdp_percap sp_composite
## unemployment     1.00000000  0.1315356     0.03854902   -0.2100319
## inflation        0.13153562  1.0000000    -0.10759600   -0.1149998
## realgdp_percap   0.03854902 -0.1075960     1.00000000    0.8372281
## sp_composite    -0.21003195 -0.1149998     0.83722809    1.0000000

# Store correlation matrix for later use
cor_matrix <- macro %>%
  filter(year >= 1950) %>%
  select(unemployment, inflation, ten_year_rate) %>%
  cor(use = "complete.obs")

cor_matrix

##               unemployment inflation ten_year_rate
## unemployment     1.0000000 0.1315356     0.4265915
## inflation        0.1315356 1.0000000     0.5739462
## ten_year_rate    0.4265915 0.5739462     1.0000000

Correlation analysis works well with cross-sectional data too. Let’s examine relationships across U.S. states:

# Correlation among state-level economic indicators
states %>%
  filter(state != "Puerto Rico") %>%
  select(median_household_income, poverty_rate,
         bachelors_or_higher, unemployment_rate) %>%
  cor(use = "complete.obs")

##                         median_household_income poverty_rate
## median_household_income              1.00000000   -0.7231950
## poverty_rate                        -0.72319502    1.0000000
## bachelors_or_higher                  0.82561730   -0.4881848
## unemployment_rate                    0.06019203    0.5017541
##                         bachelors_or_higher unemployment_rate
## median_household_income          0.82561730        0.06019203
## poverty_rate                    -0.48818481        0.50175406
## bachelors_or_higher              1.00000000        0.07847918
## unemployment_rate                0.07847918        1.00000000

This reveals strong relationships: states with higher education levels tend to have higher incomes and lower poverty rates.

7.4 Creating Basic Tables with knitr

The knitr package provides kable() for creating simple tables. This is often sufficient for quick summaries.

# Create a summary table
president_summary <- macro %>%
  filter(year >= 1950) %>%
  group_by(president) %>%
  summarize(
    Years = n(),
    `Avg Unemployment` = mean(unemployment, na.rm = TRUE),
    `Avg Inflation` = mean(inflation, na.rm = TRUE)
  ) %>%
  arrange(desc(`Avg Inflation`))

# Display with kable
knitr::kable(president_summary, digits = 3)

president	Years	Avg Unemployment	Avg Inflation
Carter	4	0.065	0.098
Ford	2	0.085	0.075
Nixon	6	0.053	0.058
Biden	4	0.043	0.050
Reagan	8	0.078	0.046
GHWBush	4	0.065	0.043
Truman	3	0.037	0.037
GWBush	8	0.055	0.029
Clinton	8	0.053	0.026
LJohnson	6	0.047	0.023
Trump	4	0.050	0.018
Eisenhower	8	0.050	0.015
Obama	8	0.074	0.014
Kennedy	2	0.065	0.010

# Add caption and format
knitr::kable(
  president_summary,
  digits = 3,
  caption = "Economic Performance by President (Post-1950)",
  col.names = c("President", "Years in Office", "Unemployment Rate", "Inflation Rate")
)

Table 7.1: Economic Performance by President (Post-1950)
President	Years in Office	Unemployment Rate	Inflation Rate
Carter	4	0.065	0.098
Ford	2	0.085	0.075
Nixon	6	0.053	0.058
Biden	4	0.043	0.050
Reagan	8	0.078	0.046
GHWBush	4	0.065	0.043
Truman	3	0.037	0.037
GWBush	8	0.055	0.029
Clinton	8	0.053	0.026
LJohnson	6	0.047	0.023
Trump	4	0.050	0.018
Eisenhower	8	0.050	0.015
Obama	8	0.074	0.014
Kennedy	2	0.065	0.010

7.5 Publication-Quality Tables with kableExtra

The kableExtra package extends kable() with extensive formatting options for both HTML and LaTeX/PDF output.

library(kableExtra)

7.5.1 Basic Styling

president_summary %>%
  kable(
    digits = 3,
    col.names = c("President", "Years", "Unemployment", "Inflation"),
    caption = "Average Economic Indicators by President"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )

Table 7.2: Table 7.3: Average Economic Indicators by President
President	Years	Unemployment	Inflation
Carter	4	0.065	0.098
Ford	2	0.085	0.075
Nixon	6	0.053	0.058
Biden	4	0.043	0.050
Reagan	8	0.078	0.046
GHWBush	4	0.065	0.043
Truman	3	0.037	0.037
GWBush	8	0.055	0.029
Clinton	8	0.053	0.026
LJohnson	6	0.047	0.023
Trump	4	0.050	0.018
Eisenhower	8	0.050	0.015
Obama	8	0.074	0.014
Kennedy	2	0.065	0.010

7.5.2 Adding Headers and Formatting

# Create a more detailed table
decade_stats <- macro %>%
  filter(year >= 1950) %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s")) %>%
  group_by(decade) %>%
  summarize(
    n = n(),
    unemp_mean = mean(unemployment, na.rm = TRUE) * 100,
    unemp_sd = sd(unemployment, na.rm = TRUE) * 100,
    inf_mean = mean(inflation, na.rm = TRUE) * 100,
    inf_sd = sd(inflation, na.rm = TRUE) * 100
  )

decade_stats %>%
  kable(
    digits = 2,
    col.names = c("Decade", "N", "Mean", "SD", "Mean", "SD"),
    caption = "Unemployment and Inflation by Decade (Percent)"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Unemployment" = 2, "Inflation" = 2))

Table 7.4: Table 7.5: Unemployment and Inflation by Decade (Percent)
		Unemployment		Inflation
Decade	N	Mean	SD	Mean	SD
1950s	10	4.5	1.43	2.1	2.42
1960s	10	5.1	1.10	2.3	1.42
1970s	10	6.4	1.26	7.0	2.71
1980s	10	7.4	1.65	5.6	3.66
1990s	10	5.9	1.20	3.0	0.94
2000s	10	5.7	1.34	2.6	1.07
2010s	10	6.2	2.20	1.7	0.82
2020s	5	5.0	1.73	4.2	2.59

7.5.3 Conditional Formatting

# Highlight rows based on values
president_summary %>%
  kable(digits = 3, col.names = c("President", "Years", "Unemployment", "Inflation")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(president_summary$`Avg Inflation` > 0.05), background = "#ffcccc") %>%
  row_spec(which(president_summary$`Avg Inflation` < 0.03), background = "#ccffcc")

President	Years	Unemployment	Inflation
Carter	4	0.065	0.098
Ford	2	0.085	0.075
Nixon	6	0.053	0.058
Biden	4	0.043	0.050
Reagan	8	0.078	0.046
GHWBush	4	0.065	0.043
Truman	3	0.037	0.037
GWBush	8	0.055	0.029
Clinton	8	0.053	0.026
LJohnson	6	0.047	0.023
Trump	4	0.050	0.018
Eisenhower	8	0.050	0.015
Obama	8	0.074	0.014
Kennedy	2	0.065	0.010

7.5.4 Footnotes and Notes

decade_stats %>%
  kable(
    digits = 2,
    col.names = c("Decade", "N", "Mean", "SD", "Mean", "SD"),
    caption = "Unemployment and Inflation Statistics by Decade"
  ) %>%
  kable_styling(bootstrap_options = c("striped"), full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Unemployment (%)" = 2, "Inflation (%)" = 2)) %>%
  footnote(
    general = "Data from Historical Statistics of the United States.",
    number = c("Unemployment and inflation expressed as percentages.",
               "SD = Standard Deviation.")
  )

Table 7.6: Table 7.7: Unemployment and Inflation Statistics by Decade
		Unemployment (%)		Inflation (%)
Decade	N	Mean	SD	Mean	SD
1950s	10	4.5	1.43	2.1	2.42
1960s	10	5.1	1.10	2.3	1.42
1970s	10	6.4	1.26	7.0	2.71
1980s	10	7.4	1.65	5.6	3.66
1990s	10	5.9	1.20	3.0	0.94
2000s	10	5.7	1.34	2.6	1.07
2010s	10	6.2	2.20	1.7	0.82
2020s	5	5.0	1.73	4.2	2.59
Note:
Data from Historical Statistics of the United States.
¹ Unemployment and inflation expressed as percentages.
² SD = Standard Deviation.

7.6 The gt Package

The gt package provides another approach to creating tables with a grammar similar to ggplot2. It offers fine-grained control over every aspect of table appearance.

library(gt)

7.6.1 Basic gt Table

president_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Performance by President",
    subtitle = "Average unemployment and inflation rates, 1950-2024"
  )

president	Years	Avg Unemployment	Avg Inflation
Economic Performance by President
Average unemployment and inflation rates, 1950-2024
Carter	4	0.06500000	0.09750000
Ford	2	0.08500000	0.07500000
Nixon	6	0.05333333	0.05833333
Biden	4	0.04250000	0.05000000
Reagan	8	0.07750000	0.04625000
GHWBush	4	0.06500000	0.04250000
Truman	3	0.03666667	0.03666667
GWBush	8	0.05500000	0.02875000
Clinton	8	0.05250000	0.02625000
LJohnson	6	0.04666667	0.02333333
Trump	4	0.05000000	0.01750000
Eisenhower	8	0.05000000	0.01500000
Obama	8	0.07375000	0.01375000
Kennedy	2	0.06500000	0.01000000

7.6.2 Formatting Columns

president_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Performance by President",
    subtitle = "Post-World War II administrations"
  ) %>%
  fmt_percent(
    columns = c(`Avg Unemployment`, `Avg Inflation`),
    decimals = 1
  ) %>%
  cols_label(
    president = "President",
    Years = "Years in Office",
    `Avg Unemployment` = "Unemployment",
    `Avg Inflation` = "Inflation"
  )

President	Years in Office	Unemployment	Inflation
Economic Performance by President
Post-World War II administrations
Carter	4	6.5%	9.8%
Ford	2	8.5%	7.5%
Nixon	6	5.3%	5.8%
Biden	4	4.2%	5.0%
Reagan	8	7.8%	4.6%
GHWBush	4	6.5%	4.2%
Truman	3	3.7%	3.7%
GWBush	8	5.5%	2.9%
Clinton	8	5.3%	2.6%
LJohnson	6	4.7%	2.3%
Trump	4	5.0%	1.8%
Eisenhower	8	5.0%	1.5%
Obama	8	7.4%	1.4%
Kennedy	2	6.5%	1.0%

7.6.3 Adding Color and Style

decade_stats %>%
  gt() %>%
  tab_header(
    title = "U.S. Economic Indicators by Decade",
    subtitle = "Unemployment and inflation statistics"
  ) %>%
  cols_label(
    decade = "Decade",
    n = "Years",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD"
  ) %>%
  tab_spanner(
    label = "Unemployment (%)",
    columns = c(unemp_mean, unemp_sd)
  ) %>%
  tab_spanner(
    label = "Inflation (%)",
    columns = c(inf_mean, inf_sd)
  ) %>%
  fmt_number(
    columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd),
    decimals = 2
  ) %>%
  data_color(
    columns = inf_mean,
    palette = c("white", "orange", "red"),
    domain = c(0, 10)
  )

Decade	Years	Unemployment (%)		Inflation (%)
U.S. Economic Indicators by Decade
Unemployment and inflation statistics
Decade	Years	Mean	SD	Mean	SD
1950s	10	4.50	1.43	2.10	2.42
1960s	10	5.10	1.10	2.30	1.42
1970s	10	6.40	1.26	7.00	2.71
1980s	10	7.40	1.65	5.60	3.66
1990s	10	5.90	1.20	3.00	0.94
2000s	10	5.70	1.34	2.60	1.07
2010s	10	6.20	2.20	1.70	0.82
2020s	5	5.00	1.73	4.20	2.59

7.6.4 Source Notes and Footnotes

macro %>%
  filter(year >= 2015) %>%
  select(year, president, unemployment, inflation, realgdp_percap) %>%
  gt() %>%
  tab_header(
    title = "Recent U.S. Economic Data",
    subtitle = "Annual observations, 2015-2024"
  ) %>%
  fmt_percent(columns = c(unemployment, inflation), decimals = 1) %>%
  fmt_currency(columns = realgdp_percap, decimals = 0) %>%
  cols_label(
    year = "Year",
    president = "President",
    unemployment = "Unemployment",
    inflation = "Inflation",
    realgdp_percap = "Real GDP/Capita"
  ) %>%
  tab_source_note("Source: Historical Statistics of the United States") %>%
  tab_footnote(
    footnote = "Real GDP per capita in 2017 dollars",
    locations = cells_column_labels(columns = realgdp_percap)
  )

Year	President	Unemployment	Inflation	Real GDP/Capita¹
Recent U.S. Economic Data
Annual observations, 2015-2024
2015	Obama	5.0%	0.0%	$50,820
2016	Obama	5.0%	1.0%	$51,563
2017	Trump	4.0%	2.0%	$52,532
2018	Trump	4.0%	2.0%	$53,988
2019	Trump	4.0%	2.0%	$55,131
2020	Trump	8.0%	1.0%	$54,612
2021	Biden	5.0%	5.0%	$56,837
2022	Biden	4.0%	8.0%	$58,253
2023	Biden	4.0%	4.0%	$59,428
2024	Biden	4.0%	3.0%	$60,792
¹ Real GDP per capita in 2017 dollars
Source: Historical Statistics of the United States

7.7 Descriptive Statistics Tables

A common task in empirical research is creating a “Table 1” that summarizes the key variables in your dataset.

# Create comprehensive descriptive statistics
descriptive_stats <- macro %>%
  filter(year >= 1950) %>%
  summarize(
    across(
      c(unemployment, inflation, realgdp_percap, ten_year_rate),
      list(
        N = ~sum(!is.na(.x)),
        Mean = ~mean(.x, na.rm = TRUE),
        SD = ~sd(.x, na.rm = TRUE),
        Min = ~min(.x, na.rm = TRUE),
        Max = ~max(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(
    everything(),
    names_to = c("variable", "statistic"),
    names_sep = "_",
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  ) %>%
  mutate(
    variable = case_when(
      variable == "unemployment" ~ "Unemployment Rate",
      variable == "inflation" ~ "Inflation Rate",
      variable == "realgdp_percap" ~ "Real GDP per Capita",
      variable == "ten_year_rate" ~ "10-Year Treasury Rate"
    )
  )

descriptive_stats %>%
  gt() %>%
  tab_header(
    title = "Descriptive Statistics",
    subtitle = "U.S. Economic Indicators, 1950-2024"
  ) %>%
  fmt_number(columns = c(Mean, SD, Min, Max), decimals = 3) %>%
  fmt_number(columns = N, decimals = 0) %>%
  cols_label(variable = "Variable")

Variable	N	Mean	SD	Min	Max	percap	year
Descriptive Statistics
U.S. Economic Indicators, 1950-2024
Unemployment Rate	75	0.05826667	0.01679554	0.03	0.1
Inflation Rate	75	0.0352	0.02737922	0	0.14
NA						75.00, 34411.41, 13645.89, 14398.00, 60792.00
NA							75.000000, 5.360933, 2.865575, 0.890000, 14.590000

7.8 Correlation Tables

Presenting correlation matrices is common in empirical research.

# Calculate correlation matrix
cor_data <- macro %>%
  filter(year >= 1950) %>%
  select(
    Unemployment = unemployment,
    Inflation = inflation,
    `GDP per Capita` = realgdp_percap,
    `10-Year Rate` = ten_year_rate
  ) %>%
  cor(use = "pairwise.complete.obs")

# Display as formatted table
cor_data %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  gt() %>%
  tab_header(title = "Correlation Matrix") %>%
  fmt_number(columns = -Variable, decimals = 3) %>%
  data_color(
    columns = -Variable,
    palette = c("#4575b4", "white", "#d73027"),
    domain = c(-1, 1)
  )

Variable	Unemployment	Inflation	GDP per Capita	10-Year Rate
Correlation Matrix
Unemployment	1.000	0.132	0.039	0.427
Inflation	0.132	1.000	−0.108	0.574
GDP per Capita	0.039	−0.108	1.000	−0.224
10-Year Rate	0.427	0.574	−0.224	1.000

7.9 Exporting Tables

7.9.1 Saving to Files

# Save kable table to HTML file
president_summary %>%
  kable(digits = 3) %>%
  kable_styling() %>%
  save_kable("president_summary.html")

# Save gt table to various formats
my_table <- president_summary %>%
  gt() %>%
  tab_header(title = "Economic Performance by President")

# Save as HTML
gtsave(my_table, "president_table.html")

# Save as PNG image
gtsave(my_table, "president_table.png")

# Save as Word document
gtsave(my_table, "president_table.docx")

7.9.2 Exporting to CSV

For further analysis or use in other software, you can export your summary statistics as CSV files.

# Export summary statistics to CSV
decade_stats %>%
  write_csv("decade_summary_stats.csv")

# Export with better formatting
president_summary %>%
  mutate(
    `Avg Unemployment` = scales::percent(`Avg Unemployment`, accuracy = 0.1),
    `Avg Inflation` = scales::percent(`Avg Inflation`, accuracy = 0.1)
  ) %>%
  write_csv("president_summary.csv")

7.10 Example: Economic Report Table

Let’s create a comprehensive table suitable for an economic report.

# Prepare data
report_data <- macro %>%
  filter(year >= 1980) %>%
  mutate(
    decade = case_when(
      year < 1990 ~ "1980s",
      year < 2000 ~ "1990s",
      year < 2010 ~ "2000s",
      year < 2020 ~ "2010s",
      TRUE ~ "2020s"
    )
  ) %>%
  group_by(decade) %>%
  summarize(
    years = paste(min(year), max(year), sep = "-"),
    unemp_mean = mean(unemployment, na.rm = TRUE) * 100,
    unemp_sd = sd(unemployment, na.rm = TRUE) * 100,
    inf_mean = mean(inflation, na.rm = TRUE) * 100,
    inf_sd = sd(inflation, na.rm = TRUE) * 100,
    gdp_growth = (last(realgdp_percap) / first(realgdp_percap))^(1/n()) - 1,
    gdp_growth = gdp_growth * 100
  )

report_data %>%
  gt() %>%
  tab_header(
    title = md("**U.S. Economic Performance by Decade**"),
    subtitle = "Summary statistics for key macroeconomic indicators"
  ) %>%
  cols_label(
    decade = "Decade",
    years = "Period",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD",
    gdp_growth = "Avg. Annual"
  ) %>%
  tab_spanner(label = "Unemployment (%)", columns = c(unemp_mean, unemp_sd)) %>%
  tab_spanner(label = "Inflation (%)", columns = c(inf_mean, inf_sd)) %>%
  tab_spanner(label = "GDP Growth (%)", columns = gdp_growth) %>%
  fmt_number(columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd, gdp_growth), decimals = 2) %>%
  tab_source_note(md("*Source: Historical Statistics of the United States*")) %>%
  tab_footnote(
    footnote = "Average annual real GDP per capita growth rate",
    locations = cells_column_labels(columns = gdp_growth)
  ) %>%
  opt_row_striping()

Decade	Period	Unemployment (%)		Inflation (%)		GDP Growth (%)
U.S. Economic Performance by Decade
Summary statistics for key macroeconomic indicators
Decade	Period	Mean	SD	Mean	SD	Avg. Annual¹
1980s	1980-1989	7.40	1.65	5.60	3.66	2.29
1990s	1990-1999	5.90	1.20	3.00	0.94	1.90
2000s	2000-2009	5.70	1.34	2.60	1.07	0.54
2010s	2010-2019	6.20	2.20	1.70	0.82	1.45
2020s	2020-2024	5.00	1.73	4.20	2.59	2.17
¹ Average annual real GDP per capita growth rate
Source: Historical Statistics of the United States

7.11 Example: Comparison Table

When comparing groups, a well-structured table helps readers quickly identify differences.

# Compare pre- and post-1980 periods
era_comparison <- macro %>%
  filter(year >= 1950) %>%
  mutate(era = ifelse(year < 1980, "1950-1979", "1980-2024")) %>%
  group_by(era) %>%
  summarize(
    n_years = n(),
    unemp_mean = mean(unemployment, na.rm = TRUE),
    unemp_sd = sd(unemployment, na.rm = TRUE),
    inf_mean = mean(inflation, na.rm = TRUE),
    inf_sd = sd(inflation, na.rm = TRUE),
    rate_mean = mean(ten_year_rate, na.rm = TRUE),
    rate_sd = sd(ten_year_rate, na.rm = TRUE)
  )

era_comparison %>%
  gt() %>%
  tab_header(
    title = "The Great Moderation",
    subtitle = "Comparing economic volatility before and after 1980"
  ) %>%
  cols_label(
    era = "Period",
    n_years = "Years",
    unemp_mean = "Mean",
    unemp_sd = "SD",
    inf_mean = "Mean",
    inf_sd = "SD",
    rate_mean = "Mean",
    rate_sd = "SD"
  ) %>%
  tab_spanner(label = "Unemployment", columns = c(unemp_mean, unemp_sd)) %>%
  tab_spanner(label = "Inflation", columns = c(inf_mean, inf_sd)) %>%
  tab_spanner(label = "10-Year Rate", columns = c(rate_mean, rate_sd)) %>%
  fmt_percent(columns = c(unemp_mean, unemp_sd, inf_mean, inf_sd, rate_mean, rate_sd), decimals = 1) %>%
  tab_footnote(
    footnote = "Note the reduction in standard deviations after 1980",
    locations = cells_column_spanners("Inflation")
  )

Period	Years	Unemployment		Inflation¹		10-Year Rate
The Great Moderation
Comparing economic volatility before and after 1980
Period	Years	Mean	SD	Mean	SD	Mean	SD
1950-1979	30	5.3%	1.5%	3.8%	3.2%	491.6%	197.9%
1980-2024	45	6.2%	1.7%	3.3%	2.4%	565.7%	331.7%
¹ Note the reduction in standard deviations after 1980

7.12 Example: State Comparison Tables

Summary tables are especially useful for comparing geographic units. Let’s create a table of the top and bottom states by median income.

# Top 10 and bottom 10 states by median household income
top_bottom_states <- states %>%
  filter(state != "Puerto Rico", state != "District of Columbia") %>%
  arrange(desc(median_household_income)) %>%
  mutate(rank = row_number()) %>%
  filter(rank <= 10 | rank > n() - 10) %>%
  select(rank, state, median_household_income, poverty_rate, bachelors_or_higher)

top_bottom_states %>%
  gt() %>%
  tab_header(
    title = "States by Median Household Income",
    subtitle = "Top 10 and bottom 10 states"
  ) %>%
  fmt_currency(columns = median_household_income, decimals = 0) %>%
  fmt_percent(columns = c(poverty_rate, bachelors_or_higher), decimals = 1) %>%
  cols_label(
    rank = "Rank",
    state = "State",
    median_household_income = "Median Income",
    poverty_rate = "Poverty Rate",
    bachelors_or_higher = "Bachelor's+"
  ) %>%
  tab_row_group(
    label = "Top 10 States",
    rows = rank <= 10
  ) %>%
  tab_row_group(
    label = "Bottom 10 States",
    rows = rank > 10
  ) %>%
  tab_source_note("Source: American Community Survey")

Rank	State	Median Income	Poverty Rate	Bachelor's+
States by Median Household Income
Top 10 and bottom 10 states
Bottom 10 States
41	Tennessee	$64,035	14.0%	29.7%
42	South Carolina	$63,623	14.4%	30.6%
43	Oklahoma	$61,364	15.2%	27.3%
44	Kentucky	$60,183	16.1%	26.5%
45	Alabama	$59,609	15.7%	27.2%
46	New Mexico	$58,722	18.3%	29.1%
47	Louisiana	$57,852	18.7%	26.1%
48	Arkansas	$56,335	16.2%	24.7%
49	West Virginia	$55,217	16.8%	22.7%
50	Mississippi	$52,985	19.2%	23.9%
Top 10 States
1	Maryland	$98,461	9.3%	42.2%
2	New Jersey	$97,126	9.7%	42.3%
3	Massachusetts	$96,505	9.9%	45.9%
4	Hawaii	$94,814	9.6%	34.7%
5	California	$91,905	12.1%	35.9%
6	New Hampshire	$90,845	7.3%	39.0%
7	Washington	$90,325	9.9%	38.0%
8	Connecticut	$90,213	10.1%	41.4%
9	Colorado	$87,598	9.6%	43.7%
10	Virginia	$87,249	10.0%	41.0%
Source: American Community Survey

Regional comparisons are also informative:

# Regional summary table
regional_summary <- states %>%
  filter(state != "Puerto Rico") %>%
  mutate(
    region = case_when(
      state %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
                   "Rhode Island", "Vermont", "New Jersey", "New York",
                   "Pennsylvania") ~ "Northeast",
      state %in% c("Illinois", "Indiana", "Iowa", "Kansas", "Michigan",
                   "Minnesota", "Missouri", "Nebraska", "North Dakota",
                   "Ohio", "South Dakota", "Wisconsin") ~ "Midwest",
      state %in% c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia",
                   "Kentucky", "Louisiana", "Maryland", "Mississippi",
                   "North Carolina", "Oklahoma", "South Carolina", "Tennessee",
                   "Texas", "Virginia", "West Virginia",
                   "District of Columbia") ~ "South",
      TRUE ~ "West"
    )
  ) %>%
  group_by(region) %>%
  summarize(
    states = n(),
    total_pop = sum(population),
    # Calculate population-weighted averages manually
    median_income = sum(median_household_income * population) / sum(population),
    poverty_rate = sum(poverty_rate * population) / sum(population),
    bachelors = sum(bachelors_or_higher * population) / sum(population)
  )

regional_summary %>%
  gt() %>%
  tab_header(
    title = "Economic Indicators by Census Region",
    subtitle = "Population-weighted averages"
  ) %>%
  cols_label(
    region = "Region",
    states = "States",
    total_pop = "Population",
    median_income = "Median Income",
    poverty_rate = "Poverty Rate",
    bachelors = "Bachelor's+"
  ) %>%
  fmt_number(columns = total_pop, suffixing = TRUE) %>%
  fmt_currency(columns = median_income, decimals = 0) %>%
  fmt_percent(columns = c(poverty_rate, bachelors), decimals = 1)

Region	States	Population	Median Income	Poverty Rate	Bachelor's+
Economic Indicators by Census Region
Population-weighted averages
Midwest	12	68.86M	$71,682	12.0%	32.6%
Northeast	9	57.31M	$84,294	11.6%	39.1%
South	17	126.50M	$69,950	13.8%	32.3%
West	13	78.43M	$85,924	11.7%	35.4%

7.13 Example: Demographic Tables from PUMS

Individual-level survey data like PUMS allows for detailed demographic analysis.

# Summary statistics by sex and education
pums %>%
  filter(!is.na(education), !is.na(sex),
         !is.na(wage_income), wage_income > 0) %>%
  group_by(sex, education) %>%
  summarize(
    n = n(),
    mean_wage = mean(wage_income),
    median_wage = median(wage_income),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = sex,
    values_from = c(n, mean_wage, median_wage)
  ) %>%
  gt() %>%
  tab_header(
    title = "Wage Income by Education and Sex",
    subtitle = "Working-age adults with positive wage income"
  ) %>%
  cols_label(
    education = "Education",
    n_Male = "N",
    mean_wage_Male = "Mean",
    median_wage_Male = "Median",
    n_Female = "N",
    mean_wage_Female = "Mean",
    median_wage_Female = "Median"
  ) %>%
  tab_spanner(label = "Male", columns = ends_with("Male")) %>%
  tab_spanner(label = "Female", columns = ends_with("Female")) %>%
  fmt_number(columns = starts_with("n_"), decimals = 0) %>%
  fmt_currency(columns = c(starts_with("mean_"), starts_with("median_")), decimals = 0)

	Female		Female		Female
Wage Income by Education and Sex
Working-age adults with positive wage income
Education	Male
Education	N	N	Mean	Mean	Median	Median
Bachelors	1,499	1,414	$68,691	$102,184	$54,000	$75,000
High school	1,075	1,509	$31,845	$45,614	$26,000	$36,000
Less than HS	345	565	$24,238	$35,078	$20,800	$30,000
Masters	740	568	$81,291	$127,928	$70,000	$100,000
Professional/Doctorate	217	291	$114,953	$169,259	$90,000	$110,000
Some college	1,657	1,695	$38,877	$55,614	$30,000	$45,000

We can also create a “Table 1” style summary of the entire PUMS sample:

# Comprehensive sample description
pums_summary <- pums %>%
  summarize(
    `Sample Size` = n(),
    `Mean Age` = mean(age, na.rm = TRUE),
    `% Female` = mean(sex == "Female", na.rm = TRUE) * 100,
    `% Employed` = mean(employed == 1, na.rm = TRUE) * 100,
    `Mean Hours Worked` = mean(hours_worked[hours_worked > 0], na.rm = TRUE),
    `Median Wage Income` = median(wage_income[wage_income > 0], na.rm = TRUE),
    `Median Household Income` = median(household_income, na.rm = TRUE),
    `% Homeowner` = mean(tenure == "Owner", na.rm = TRUE) * 100
  ) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value")

pums_summary %>%
  gt() %>%
  tab_header(
    title = "Sample Characteristics",
    subtitle = "American Community Survey PUMS Data"
  ) %>%
  fmt_number(columns = Value, decimals = 1) %>%
  tab_source_note("Source: U.S. Census Bureau, ACS PUMS")

Variable	Value
Sample Characteristics
American Community Survey PUMS Data
Sample Size	20,000.0
Mean Age	50.5
% Female	50.9
% Employed	0.0
Mean Hours Worked	38.4
Median Wage Income	44,000.0
Median Household Income	83,650.0
% Homeowner	0.0
Source: U.S. Census Bureau, ACS PUMS

7.14 Example: Business Analytics Tables

Transaction data enables operational analysis. Here’s a sales summary table for our cafe:

# Product performance table
cafe %>%
  group_by(product) %>%
  summarize(
    transactions = n(),
    revenue = sum(price),
    avg_price = mean(price),
    first_sale = min(date),
    last_sale = max(date)
  ) %>%
  arrange(desc(revenue)) %>%
  gt() %>%
  tab_header(
    title = "Product Performance Summary",
    subtitle = "2019-2023"
  ) %>%
  cols_label(
    product = "Product",
    transactions = "Transactions",
    revenue = "Total Revenue",
    avg_price = "Avg Price",
    first_sale = "First Sale",
    last_sale = "Last Sale"
  ) %>%
  fmt_number(columns = transactions, decimals = 0) %>%
  fmt_currency(columns = c(revenue, avg_price)) %>%
  tab_source_note("Note: Simulated data for educational purposes")

Product	Transactions	Total Revenue	Avg Price	First Sale	Last Sale
Product Performance Summary
2019-2023
Latte	169,747	$908,444.22	$5.35	2019-01-01	2023-12-31
Croissant	113,248	$453,530.48	$4.00	2019-01-01	2023-12-31
Cold Brew	72,258	$368,071.42	$5.09	2019-01-01	2023-12-31
Espresso	85,978	$297,890.46	$3.46	2019-01-01	2023-12-31
Chai Tea	48,761	$221,681.59	$4.55	2019-01-01	2023-12-31
Note: Simulated data for educational purposes

Year-over-year comparisons are essential for business reporting:

# Year-over-year summary
cafe %>%
  group_by(year) %>%
  summarize(
    transactions = n(),
    revenue = sum(price),
    avg_price = mean(price),
    unique_customers = n_distinct(customer_id)
  ) %>%
  mutate(
    yoy_revenue = (revenue / lag(revenue) - 1) * 100,
    yoy_customers = (unique_customers / lag(unique_customers) - 1) * 100
  ) %>%
  gt() %>%
  tab_header(
    title = "Annual Sales Performance",
    subtitle = "Year-over-year analysis"
  ) %>%
  cols_label(
    year = "Year",
    transactions = "Transactions",
    revenue = "Revenue",
    avg_price = "Avg Price",
    unique_customers = "Customers",
    yoy_revenue = "Revenue Δ",
    yoy_customers = "Customers Δ"
  ) %>%
  fmt_number(columns = c(transactions, unique_customers), decimals = 0) %>%
  fmt_currency(columns = c(revenue, avg_price)) %>%
  fmt_number(columns = c(yoy_revenue, yoy_customers), decimals = 1) %>%
  sub_missing(missing_text = "—") %>%
  tab_footnote(
    footnote = "Note the COVID-19 impact in 2020",
    locations = cells_body(columns = year, rows = year == 2020)
  )

Year	Transactions	Revenue	Avg Price	Customers	Revenue Δ	Customers Δ
Annual Sales Performance
Year-over-year analysis
2019	104,139	$408,711.61	$3.92	2,000	—	—
¹ 2020	65,955	$278,485.58	$4.22	2,000	−31.9	0.0
2021	97,246	$443,548.89	$4.56	2,000	59.3	0.0
2022	110,639	$538,454.95	$4.87	2,000	21.4	0.0
2023	112,013	$580,417.14	$5.18	2,000	7.8	0.0
¹ Note the COVID-19 impact in 2020

Day-of-week analysis helps with staffing and inventory decisions:

# Sales by day of week
cafe %>%
  mutate(day_of_week = factor(day_of_week,
                              levels = c("Mon", "Tue", "Wed",
                                        "Thu", "Fri", "Sat", "Sun"))) %>%
  group_by(day_of_week) %>%
  summarize(
    avg_daily_transactions = n() / n_distinct(date),
    avg_daily_revenue = sum(price) / n_distinct(date),
    pct_of_total = n() / nrow(cafe) * 100
  ) %>%
  gt() %>%
  tab_header(
    title = "Sales by Day of Week",
    subtitle = "Average daily performance"
  ) %>%
  cols_label(
    day_of_week = "Day",
    avg_daily_transactions = "Avg Transactions",
    avg_daily_revenue = "Avg Revenue",
    pct_of_total = "% of Total"
  ) %>%
  fmt_number(columns = avg_daily_transactions, decimals = 0) %>%
  fmt_currency(columns = avg_daily_revenue) %>%
  fmt_number(columns = pct_of_total, decimals = 1) %>%
  data_color(
    columns = avg_daily_revenue,
    palette = c("white", "#228B22"),
    domain = c(200, 350)
  )

Day	Avg Transactions	Avg Revenue	% of Total
Sales by Day of Week
Average daily performance
Mon	243	$1,117.28	12.9
Tue	241	$1,105.58	12.9
Wed	242	$1,108.44	12.9
Thu	242	$1,109.63	12.9
Fri	241	$1,107.15	12.8
Sat	334	$1,537.45	17.8
Sun	335	$1,537.98	17.8

7.15 Exercises

7.15.1 Macro Data Exercises

Calculate the mean, median, standard deviation, and range of real GDP per capita for the post-war period (1950-present). What do the differences between mean and median tell you about the distribution?
Create a summary table showing average unemployment and inflation for each president since 1950. Format it with kableExtra, highlighting presidents with average inflation above 5%.
Calculate the correlation matrix for unemployment, inflation, the S&P 500, and the 10-year Treasury rate. Which pairs of variables are most strongly correlated?
Using the across() function, calculate the mean and standard deviation of all numeric variables in the macro dataset for the 1970s versus the 1990s. Present the results in a gt table.

7.15.2 State Data Exercises

Create a table showing the 5 states with the highest and 5 states with the lowest poverty rates. Include median income and education levels. Use gt with row grouping to separate high and low poverty states.
Build a correlation table for state-level variables: population, median income, poverty rate, unemployment rate, and homeownership rate. Add color coding to highlight strong correlations.
Create a regional comparison table (Northeast, Midwest, South, West) showing weighted averages for key economic indicators. Add footnotes explaining the weighting methodology.

7.15.3 PUMS Data Exercises

Create a “Table 1” style summary of the PUMS sample, showing characteristics separately for employed and unemployed individuals. Compare age, education distribution, and household income.
Build a table showing median wage income by education level and state. Use pivot_wider() to put states in columns and education levels in rows.
Calculate and present the gender wage gap (ratio of female to male median wages) by education level. Format as a gt table with conditional coloring.

7.15.4 Cafe Data Exercises

Create a monthly revenue table for each year (2019-2023). Calculate year-over-year percentage changes and highlight months with negative growth in red.
Build a product profitability table showing each product’s contribution to total revenue over time. Has the product mix changed across years?
Create a customer segmentation table: calculate summary statistics (total spending, visit frequency, average transaction) for customers in different spending quintiles.

7.15.5 Advanced Exercises

Export three tables (one from macro data, one from state data, one from PUMS) to a single HTML file using gt’s tab_source_note() to ensure consistent sourcing.
Create a publication-ready table comparing economic indicators across the four datasets. This will require careful thought about what comparable statistics can be computed from each dataset.