Chapter 5 Data Wrangling
Raw data rarely comes in the exact form you need for analysis. Data wrangling—also called data manipulation or data cleaning—is the process of transforming raw data into a format suitable for analysis. This chapter introduces the tidyverse approach to data wrangling, which provides a consistent and readable set of tools for common data tasks.
5.1 The Tidyverse and the Pipe
The tidyverse is a collection of R packages designed for data science. They share a common philosophy and syntax, making them work well together.
Loading the tidyverse with library(tidyverse) attaches several core packages, including dplyr (data manipulation), tidyr (reshaping data), ggplot2 (visualization), readr (reading files), and stringr (string manipulation).
The pipe operator %>% takes the output from one function and passes it as the first argument to the next function. This makes code more readable:
# Without pipe: nested functions (read inside-out)
round(mean(c(1.5, 2.7, 3.2)), 1)
# With pipe: sequential operations (read left-to-right)
c(1.5, 2.7, 3.2) %>%
mean() %>%
round(1)
You can read %>% as “and then.” R 4.1+ also has a native pipe |> that works similarly.
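For comparison, here is the same calculation written with the native pipe; it needs no packages and mirrors the %>% example above:
# Native pipe (R 4.1+): same result as the %>% version
c(1.5, 2.7, 3.2) |>
  mean() |>
  round(1)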
5.2 Core dplyr Verbs
Let’s work with several datasets to demonstrate the main data manipulation functions. We’ll use macroeconomic time series data, state-level Census data, individual-level survey data, and cafe transaction data.
library(tidyverse)
# Load our example datasets
macro <- read_csv("data/us_macrodata.csv") # Time series
states <- read_csv("data/acs_state.csv") # Cross-sectional
pums <- read_csv("data/acs_pums.csv") # Microdata
cafe <- read_csv("data/cafe_sales.csv") # Transaction data
5.2.1 select() — Choose Columns
# Select specific columns from macro data
macro %>% select(year, unemployment, inflation)
# Select columns from state data
states %>% select(state, median_household_income, poverty_rate)
# Select a range of columns
states %>% select(state:per_capita_income)
# Remove columns with minus sign
pums %>% select(-puma, -weight)
# Select columns by pattern
states %>% select(state, starts_with("median"))
states %>% select(state, contains("rate"))
pums %>% select(state, ends_with("income"))
5.2.2 filter() — Choose Rows
# Filter macro data to recent years
macro %>% filter(year >= 2000)
# Filter states with high poverty
states %>% filter(poverty_rate > 0.15)
# Filter PUMS data: college-educated workers in California
pums %>% filter(
state == "California",
education %in% c("Bachelors", "Masters", "Professional/Doctorate"),
employed == "Employed"
)
# Multiple conditions (AND)
states %>% filter(
median_household_income > 70000,
poverty_rate < 0.12
)
# OR condition
pums %>% filter(education == "Bachelors" | education == "Masters")
# Match multiple values with %in%
macro %>% filter(president %in% c("Obama", "Trump", "Biden"))
pums %>% filter(state %in% c("California", "Texas", "New York"))
# Filter cafe sales: espresso and latte orders in 2023
cafe %>% filter(
product %in% c("Espresso", "Latte"),
year == 2023
)
5.2.3 mutate() — Create or Modify Columns
# Create new columns in macro data
macro %>%
mutate(
real_rate = ten_year_rate - inflation,
decade = floor(year / 10) * 10
)
# Create income categories in state data
states %>%
mutate(
income_tier = case_when(
median_household_income >= 80000 ~ "High",
median_household_income >= 60000 ~ "Middle",
TRUE ~ "Lower"
),
pop_millions = population / 1000000
)
# Create age groups in PUMS data
pums %>%
mutate(
age_group = case_when(
age < 25 ~ "18-24",
age < 35 ~ "25-34",
age < 45 ~ "35-44",
age < 55 ~ "45-54",
age < 65 ~ "55-64",
TRUE ~ "65+"
),
has_degree = education %in% c("Bachelors", "Masters", "Professional/Doctorate")
)
# Calculate revenue in cafe data
cafe %>%
mutate(
revenue = price, # Each row is one sale
is_weekend = day_of_week %in% c("Sat", "Sun")
)
5.2.4 arrange() — Sort Rows
# Sort states by income (descending)
states %>% arrange(desc(median_household_income))
# Sort PUMS by income (highest earners)
pums %>%
filter(total_income > 0) %>%
arrange(desc(total_income)) %>%
head(20)
# Find states with lowest unemployment
states %>% arrange(unemployment_rate) %>% head(10)
# Sort cafe data by date
cafe %>% arrange(date, product)
5.3 Summarizing and Grouping
Use summarize() to compute summary statistics and group_by() to calculate statistics by group:
# Overall summary of state data
states %>%
summarize(
avg_income = mean(median_household_income),
avg_poverty = mean(poverty_rate),
total_pop = sum(population),
n_states = n()
)
# Summary of PUMS data
pums %>%
filter(employed == "Employed", wage_income > 0) %>%
summarize(
avg_wage = mean(wage_income),
median_wage = median(wage_income),
avg_hours = mean(hours_worked, na.rm = TRUE),
n_workers = n()
)
# Group by: Income by state in PUMS data
pums %>%
filter(employed == "Employed", wage_income > 0) %>%
group_by(state) %>%
summarize(
avg_wage = mean(wage_income),
median_wage = median(wage_income),
n_workers = n()
) %>%
arrange(desc(median_wage))
# Group by: Wages by education level
pums %>%
filter(employed == "Employed", wage_income > 0) %>%
group_by(education) %>%
summarize(
avg_wage = mean(wage_income),
median_wage = median(wage_income),
n = n()
) %>%
arrange(desc(median_wage))
# Group by: Cafe sales by product
cafe %>%
group_by(product) %>%
summarize(
total_sales = n(),
total_revenue = sum(price),
avg_price = mean(price)
) %>%
arrange(desc(total_revenue))
# Multiple grouping: Cafe sales by year and product
cafe %>%
group_by(year, product) %>%
summarize(
total_sales = n(),
avg_price = mean(price),
.groups = "drop"
)
You can also use mutate() with summary functions, optionally combined with group_by(), to create calculations relative to the whole dataset or to each group:
# Calculate each state's deviation from the national average
states %>%
mutate(
national_avg_income = mean(median_household_income),
income_vs_national = median_household_income - national_avg_income,
pct_of_national = median_household_income / national_avg_income
)
# Calculate within-state income percentiles in PUMS data
pums %>%
filter(total_income > 0) %>%
group_by(state) %>%
mutate(
state_median = median(total_income),
income_pctile = percent_rank(total_income)
) %>%
ungroup()
# Daily sales as percent of product's average
cafe %>%
group_by(product, date) %>%
summarize(daily_sales = n(), .groups = "drop") %>%
group_by(product) %>%
mutate(
avg_daily_sales = mean(daily_sales),
pct_of_avg = daily_sales / avg_daily_sales * 100
) %>%
ungroup()
5.4 Combining Operations
The real power of the tidyverse comes from chaining multiple operations:
# Find high-income, low-poverty states
states %>%
filter(population > 1000000) %>% # Exclude small states
mutate(
income_rank = rank(desc(median_household_income)),
poverty_rank = rank(poverty_rate)
) %>%
filter(income_rank <= 15, poverty_rank <= 15) %>%
select(state, median_household_income, poverty_rate, income_rank, poverty_rank) %>%
arrange(income_rank)
# Gender wage gap by state
pums %>%
filter(employed == "Employed", wage_income > 0, hours_worked >= 35) %>%
group_by(state, sex) %>%
summarize(median_wage = median(wage_income), .groups = "drop") %>%
pivot_wider(names_from = sex, values_from = median_wage) %>%
mutate(
wage_gap = Female / Male,
gap_pct = (1 - wage_gap) * 100
) %>%
arrange(desc(gap_pct))
# Cafe: Weekend vs weekday sales by product
cafe %>%
mutate(is_weekend = day_of_week %in% c("Sat", "Sun")) %>%
group_by(product, is_weekend) %>%
summarize(
avg_daily_sales = n() / n_distinct(date),
.groups = "drop"
) %>%
pivot_wider(
names_from = is_weekend,
values_from = avg_daily_sales,
names_prefix = "weekend_"
) %>%
mutate(weekend_boost = weekend_TRUE / weekend_FALSE - 1)
# Education premium: comparing wages by education within age groups
pums %>%
filter(employed == "Employed", wage_income > 0, age >= 25, age <= 54) %>%
mutate(
age_group = case_when(
age < 35 ~ "25-34",
age < 45 ~ "35-44",
TRUE ~ "45-54"
),
has_ba = education %in% c("Bachelors", "Masters", "Professional/Doctorate")
) %>%
group_by(age_group, has_ba) %>%
summarize(median_wage = median(wage_income), .groups = "drop") %>%
pivot_wider(names_from = has_ba, values_from = median_wage) %>%
mutate(ba_premium = `TRUE` / `FALSE` - 1)
5.5 Reshaping Data
Sometimes you need to change the structure of your data between “wide” and “long” formats.
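To make the two shapes concrete, here is a minimal sketch with an invented table of two cities measured in two years (the names wide_tbl and city and all values are made up for illustration):
# Wide: one column per year
wide_tbl <- tibble(
  city = c("Springfield", "Shelbyville"),
  `2022` = c(100, 80),
  `2023` = c(110, 85)
)
# Long: one row per city-year combination
wide_tbl %>%
  pivot_longer(cols = -city, names_to = "year", values_to = "value")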
# Wide to long: State economic indicators
states_long <- states %>%
select(state, unemployment_rate, poverty_rate, homeownership_rate) %>%
pivot_longer(
cols = -state,
names_to = "indicator",
values_to = "rate"
)
# Long to wide: Cafe prices over time
cafe_prices <- cafe %>%
group_by(year, product) %>%
summarize(avg_price = mean(price), .groups = "drop")
cafe_prices_wide <- cafe_prices %>%
pivot_wider(
names_from = product,
values_from = avg_price
)
# PUMS: Create education-by-sex summary, then reshape
pums %>%
filter(employed == "Employed", wage_income > 0) %>%
group_by(education, sex) %>%
summarize(median_wage = median(wage_income), .groups = "drop") %>%
pivot_wider(
names_from = sex,
values_from = median_wage
)
5.6 Joining Data
Combine data from multiple sources with join functions:
# Join state-level data with regional classifications
regions <- tibble(
state = c("California", "Texas", "Florida", "New York", "Illinois"),
region = c("West", "South", "South", "Northeast", "Midwest")
)
states %>%
filter(state %in% regions$state) %>%
left_join(regions, by = "state")
# Join PUMS summary back to individuals
state_medians <- pums %>%
filter(total_income > 0) %>%
group_by(state) %>%
summarize(state_median_income = median(total_income))
pums_with_state_median <- pums %>%
left_join(state_medians, by = "state") %>%
mutate(above_state_median = total_income > state_median_income)
# Join macro indicators to cafe data by year
yearly_macro <- macro %>%
filter(year >= 2019) %>%
select(year, inflation, unemployment)
cafe_with_macro <- cafe %>%
left_join(yearly_macro, by = "year")
Types of joins:
left_join(): Keep all rows from left table, match from right
right_join(): Keep all rows from right table, match from left
inner_join(): Keep only rows that match in both tables
full_join(): Keep all rows from both tables
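To see how the four joins differ, here is a minimal sketch using two toy tibbles; the names left_tbl and right_tbl and their contents are invented for illustration:
# Toy tables: "B" appears in both, "A" only on the left, "C" only on the right
left_tbl  <- tibble(id = c("A", "B"), x = c(1, 2))
right_tbl <- tibble(id = c("B", "C"), y = c(10, 20))
left_join(left_tbl, right_tbl, by = "id")   # rows A and B; y is NA for A
right_join(left_tbl, right_tbl, by = "id")  # rows B and C; x is NA for C
inner_join(left_tbl, right_tbl, by = "id")  # row B only
full_join(left_tbl, right_tbl, by = "id")   # rows A, B, and C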
5.7 Handling Missing Values
# Check for missing values in PUMS
pums %>%
summarize(
missing_wage = sum(is.na(wage_income)),
missing_hours = sum(is.na(hours_worked)),
missing_children = sum(is.na(num_children))
)
# Filter to complete cases for wage analysis
pums %>%
filter(!is.na(wage_income), !is.na(hours_worked)) %>%
filter(wage_income > 0, hours_worked > 0)
# Replace missing values
pums %>%
mutate(
num_children = replace_na(num_children, 0),
hours_worked = replace_na(hours_worked, 0)
)
5.8 Example: Analyzing Cafe Sales Trends
Let’s use the cafe data to analyze business patterns over time.
library(tidyverse)
cafe <- read_csv("data/cafe_sales.csv")
# Daily sales summary
daily_sales <- cafe %>%
group_by(date, year, month, day_of_week) %>%
summarize(
total_sales = n(),
revenue = sum(price),
.groups = "drop"
)
# Monthly trends
monthly_sales <- cafe %>%
group_by(year, month) %>%
summarize(
total_sales = n(),
revenue = sum(price),
avg_price = mean(price),
.groups = "drop"
) %>%
mutate(
year_month = paste(year, sprintf("%02d", month), sep = "-")
)
# Year-over-year growth by product
yoy_growth <- cafe %>%
group_by(year, product) %>%
summarize(sales = n(), .groups = "drop") %>%
arrange(product, year) %>%
group_by(product) %>%
mutate(
prev_year_sales = lag(sales),
yoy_growth = (sales / prev_year_sales - 1) * 100
) %>%
filter(!is.na(yoy_growth))
# Seasonal patterns: average daily sales by month
seasonal <- cafe %>%
group_by(month, product) %>%
summarize(
total_sales = n(),
n_days = n_distinct(date),
avg_daily = total_sales / n_days,
.groups = "drop"
) %>%
group_by(product) %>%
mutate(
product_avg = mean(avg_daily),
seasonal_index = avg_daily / product_avg
)
# Customer analysis: identify top customers
top_customers <- cafe %>%
group_by(customer_id) %>%
summarize(
total_purchases = n(),
total_spent = sum(price),
first_purchase = min(date),
last_purchase = max(date),
favorite_product = names(which.max(table(product)))
) %>%
arrange(desc(total_spent)) %>%
head(100)
5.9 Example: State-Level Economic Analysis
Let’s analyze the relationship between education and economic outcomes across states.
library(tidyverse)
states <- read_csv("data/acs_state.csv")
# Education and income relationship
states %>%
filter(state != "Puerto Rico") %>%
mutate(
college_pct = bachelors_or_higher * 100,
income_thousands = median_household_income / 1000
) %>%
select(state, college_pct, income_thousands, poverty_rate, unemployment_rate) %>%
arrange(desc(college_pct))
# Correlation between education and outcomes
states %>%
filter(state != "Puerto Rico") %>%
summarize(
edu_income_cor = cor(bachelors_or_higher, median_household_income),
edu_poverty_cor = cor(bachelors_or_higher, poverty_rate),
edu_unemp_cor = cor(bachelors_or_higher, unemployment_rate)
)
# Categorize states by education level
states %>%
filter(state != "Puerto Rico") %>%
mutate(
edu_tier = case_when(
bachelors_or_higher >= 0.35 ~ "High (35%+)",
bachelors_or_higher >= 0.30 ~ "Medium (30-35%)",
TRUE ~ "Lower (<30%)"
)
) %>%
group_by(edu_tier) %>%
summarize(
n_states = n(),
avg_income = mean(median_household_income),
avg_poverty = mean(poverty_rate),
avg_unemployment = mean(unemployment_rate)
)
5.10 Exercises
Using the PUMS data, find the median wage for full-time workers (35+ hours) by education level. Which education level has the highest median wage?
Using the cafe data, calculate the average daily revenue by day of the week. Which day generates the most revenue?
Using the state data, create a new variable that categorizes states as “High Cost” (median rent > $1200), “Medium Cost” ($900-$1200), or “Low Cost” (< $900). How many states fall into each category?
Using the PUMS data, calculate the employment rate (employed / (employed + unemployed)) by state and sex. Which state has the largest gender gap in employment?
Using the cafe data, identify the month with the highest Cold Brew sales. Does this align with what you’d expect seasonally?
Join the state data with a region classification of your choice. Calculate average income and poverty rate by region.
Using the PUMS data, find the 10 most common occupations among workers with a Bachelor’s degree versus those with a high school diploma.
Analyze price changes in the cafe data: calculate the average price of each product by year and compute the cumulative price increase from 2019 to 2023.