MethodAtlas
Replication lab · 120 minutes

Replication Lab: Are Emily and Greg More Employable?

Replicate the key findings from Bertrand and Mullainathan's landmark audit study on racial discrimination in hiring. Simulate data matching their published summary statistics, estimate callback differentials by race, and explore heterogeneity by resume quality.

Overview

In this replication lab, you will reproduce the main results from one of the most influential papers in labor economics:

Bertrand, Marianne, and Sendhil Mullainathan. 2004. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94(4): 991–1013.

Bertrand and Mullainathan sent nearly 5,000 fictitious resumes to help-wanted ads in Boston and Chicago. Each resume was randomly assigned either a "white-sounding" or "African-American-sounding" name. The headline finding: resumes with white-sounding names received about 50 percent more callbacks than identical resumes with African-American-sounding names.

Why this paper matters: It provided clean experimental evidence of racial discrimination in the U.S. labor market, bypassing the usual concerns about unobserved differences between applicants. The paper has been cited thousands of times and spawned a large literature on audit and correspondence studies.

What you will do:

  • Learn why simulation is used when original microdata are restricted and how matching summary statistics enables pedagogical replication
  • Simulate data that matches the published summary statistics in Table 1
  • Estimate the racial gap in callback rates using OLS
  • Add resume-quality controls and test for heterogeneity
  • Compare your results to the published findings

Step 1: Simulate the Audit Study Data

The original study sent 4,870 resumes in response to 1,300 job ads. Each ad received four resumes: two high-quality and two low-quality, with one white-sounding and one African-American-sounding name in each quality pair.

library(estimatr)
library(modelsummary)

# Simulate data matching Bertrand & Mullainathan (2004) Table 1
set.seed(2004)
n <- 4870

black <- rbinom(n, 1, 0.5)
high_quality <- rbinom(n, 1, 0.5)
chicago <- rbinom(n, 1, 0.53)
college <- rbinom(n, 1, 0.72)
years_exp <- pmax(round(rnorm(n, 7.8, 5.0), 1), 0)
military <- rbinom(n, 1, 0.10)
email <- rbinom(n, 1, 0.48)
honors <- high_quality * rbinom(n, 1, 0.35)
computer_skills <- rbinom(n, 1, 0.82)
special_skills <- rbinom(n, 1, 0.33)

# Latent-index model. These coefficients are on the logit scale, chosen so
# the simulated callback rates land near the published 9.65% (white) and
# 6.45% (black), and so resume quality raises white callbacks more than black
callback_latent <- -3.0 + 0.38 * (1 - black) +
  (0.30 * (1 - black) + 0.05 * black) * high_quality +
  0.10 * college + 0.01 * years_exp + 0.15 * honors +
  0.05 * email + 0.10 * computer_skills + rlogis(n)
callback <- as.integer(callback_latent > 0)

df <- data.frame(callback, black, high_quality, chicago, college,
               years_exp, military, email, honors,
               computer_skills, special_skills)

cat("=== Callback Rates (Published: White=9.65%, Black=6.45%) ===\n")
tapply(df$callback, df$black, mean)
cat("N =", nrow(df), "\n")
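Before estimating anything, it is worth checking how closely the simulated shares track the targets they were drawn from. A standalone sanity check (the targets repeat the probabilities used in the draws above; the 0.03 tolerance is illustrative):

```r
# Monte Carlo check: with n = 4870 Bernoulli draws, how far can a
# simulated share drift from its target probability?
set.seed(1)
n <- 4870
targets <- c(black = 0.50, high_quality = 0.50, chicago = 0.53,
             college = 0.72, military = 0.10, email = 0.48,
             computer_skills = 0.82, special_skills = 0.33)
sim_means <- sapply(targets, function(p) mean(rbinom(n, 1, p)))
print(round(cbind(target = targets, simulated = sim_means), 3))
# Binomial SE at p = 0.5 is sqrt(0.25 / 4870), about 0.007, so every
# share should land within roughly +/- 0.02 of its target
stopifnot(all(abs(sim_means - targets) < 0.03))
```

This is why matching published first moments works at n = 4,870: sampling noise in each share is an order of magnitude smaller than the differences the lab studies.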

Step 2: Replicate Table 1 — Callback Rates by Race

The central result in the paper is a simple comparison of means: the callback rate for white-sounding names versus African-American-sounding names.

# Table 1: Simple difference in callback rates
white_rate <- mean(df$callback[df$black == 0])
black_rate <- mean(df$callback[df$black == 1])

cat("=== Replicating Table 1 ===\n")
cat("White callback rate:", round(white_rate * 100, 2), "%\n")
cat("Black callback rate:", round(black_rate * 100, 2), "%\n")
cat("Difference:", round((white_rate - black_rate) * 100, 2), "pp\n")
cat("Ratio:", round(white_rate / black_rate, 2), "\n")

# OLS: callback on race indicator
m1 <- lm_robust(callback ~ black, data = df, se_type = "HC1")
summary(m1)
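As a cross-check on the LPM estimate, the same gap can be tested as a difference in two proportions with base R's `prop.test`. A standalone sketch that simulates callback counts directly at the published rates (9.65% and 6.45%) rather than reusing `df`:

```r
# Two-sample proportion test on callback counts by race,
# simulated at the published rates so the block runs on its own
set.seed(2004)
n_per_group <- 2435                        # half of the 4,870 resumes
white_calls <- rbinom(1, n_per_group, 0.0965)
black_calls <- rbinom(1, n_per_group, 0.0645)
pt <- prop.test(c(white_calls, black_calls), rep(n_per_group, 2))
print(pt)
gap <- pt$estimate[1] - pt$estimate[2]
cat("Callback gap:", round(gap * 100, 2), "pp\n")
```

`prop.test` reports a continuity-corrected chi-square statistic; the estimated gap should sit close to the published 3.2 percentage points.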

Concept Check

In this experiment, why can we interpret the coefficient on 'black' as a causal effect of perceived race on callbacks?


Step 3: Add Resume Controls

Bertrand and Mullainathan show that the racial gap persists even after controlling for resume characteristics. Since names were randomly assigned, these controls should not change the coefficient much but may improve precision.

# Model 2: Add resume characteristics
m2 <- lm_robust(callback ~ black + high_quality + college + years_exp +
               military + email + honors + computer_skills + special_skills,
               data = df, se_type = "HC1")

# Model 3: Add city FE
m3 <- lm_robust(callback ~ black + high_quality + college + years_exp +
               military + email + honors + computer_skills +
               special_skills + chicago,
               data = df, se_type = "HC1")

# Compare
modelsummary(list("No controls" = m1, "+ Resume" = m2, "+ City" = m3),
           coef_map = c("black" = "Black name"),
           stars = c('*' = 0.1, '**' = 0.05, '***' = 0.01),
           gof_map = c("nobs", "r.squared"))
Requires: modelsummary
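Why shouldn't the controls move the coefficient? Because the name is randomized, any control is (in expectation) independent of it, so there is no omitted-variable bias to remove. This can be demonstrated compactly with base `lm` on a fresh minimal simulation (the 0.44 and 0.20 latent coefficients here are hypothetical):

```r
# Randomization check: an independent control barely moves the treatment coef
set.seed(2004)
n <- 4870
black <- rbinom(n, 1, 0.5)
high_quality <- rbinom(n, 1, 0.5)   # assigned independently of the name
callback <- rbinom(n, 1, plogis(-2.5 + 0.44 * (1 - black) + 0.20 * high_quality))
b_raw <- coef(lm(callback ~ black))["black"]
b_ctl <- coef(lm(callback ~ black + high_quality))["black"]
cat("black coef, no controls:", round(b_raw, 4), "\n")
cat("black coef, + control:  ", round(b_ctl, 4), "\n")
# Under random assignment the two estimates differ only by sampling noise
stopifnot(abs(b_raw - b_ctl) < 0.01)
```

If adding controls did move the coefficient substantially, that would be evidence that randomization had failed.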

Step 4: Test for Heterogeneity by Resume Quality

A striking finding in the paper is that higher resume quality increases callbacks for white names but not for black names: the "return to skills" itself differs by perceived race.

# Interaction model
m4 <- lm_robust(callback ~ black * high_quality, data = df, se_type = "HC1")
summary(m4)

# 2x2 table
cat("\n=== 2x2 Table: Callback Rates ===\n")
tapply(df$callback, list(Race = ifelse(df$black, "Black", "White"),
                        Quality = ifelse(df$high_quality, "High", "Low")),
     mean)
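Reading the interaction model: the coefficient on `high_quality` is the return to quality for white names, and adding the interaction coefficient to it gives the return for black names. A standalone sketch with hypothetical cell rates chosen to mimic the published pattern:

```r
# 2x2 simulation: quality raises white callback rates but barely moves black
set.seed(42)
n <- 4870
black <- rbinom(n, 1, 0.5)
high_quality <- rbinom(n, 1, 0.5)
p <- ifelse(black == 1,
            ifelse(high_quality == 1, 0.067, 0.062),   # hypothetical rates
            ifelse(high_quality == 1, 0.108, 0.085))
callback <- rbinom(n, 1, p)
b <- coef(lm(callback ~ black * high_quality))
cat("Return to quality, white names:", round(b["high_quality"], 4), "\n")
cat("Return to quality, black names:",
    round(b["high_quality"] + b["black:high_quality"], 4), "\n")
```

A negative interaction coefficient is exactly the "lower return to skills for black names" pattern described above.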

Concept Check

Bertrand and Mullainathan find that better resumes lead to more callbacks for white names but not for black names. What does this imply about the nature of discrimination?


Step 5: Compare with Published Results

cat("============================================\n")
cat("COMPARISON: Our Replication vs. Published\n")
cat("============================================\n")
cat("White callback rate:  Published=9.65%  Ours=",
  round(mean(df$callback[df$black == 0]) * 100, 2), "%\n")
cat("Black callback rate:  Published=6.45%  Ours=",
  round(mean(df$callback[df$black == 1]) * 100, 2), "%\n")
cat("Ratio:                Published=1.50   Ours=",
  round(mean(df$callback[df$black == 0]) /
        mean(df$callback[df$black == 1]), 2), "\n")

Summary

Our replication confirms the core findings of Bertrand and Mullainathan (2004):

  1. Racial discrimination in callbacks is large and statistically significant. White-sounding names receive roughly 50% more callbacks than African-American-sounding names, even on identical resumes.

  2. The gap persists across specifications. Adding resume controls and city fixed effects does not eliminate the racial gap, exactly as expected under random assignment of names.

  3. Higher resume quality helps white applicants more than black applicants. This heterogeneity suggests that discrimination may operate as a screening barrier early in the evaluation process.

  4. Differences between our results and the published results arise from data simulation. With the original data, all estimates would match exactly. The key qualitative findings are robust.


Extension Exercises

  1. Probit/Logit model. Since the outcome is binary (callback or not), re-estimate the main specification using probit or logit. Compute marginal effects and compare to the linear probability model.

  2. By-city analysis. Estimate the racial gap separately for Boston and Chicago. Is discrimination worse in one city?

  3. Gender of name. The original paper also varied whether names were male or female. Add a gender dimension to the simulation and test for race-by-gender interactions.

  4. Power analysis. Given the effect size you estimated, how many resumes would you need to detect the interaction between race and resume quality at 80% power?

  5. Multiple hypothesis testing. The paper tests many subgroup comparisons. Apply Bonferroni or Benjamini-Hochberg corrections and discuss which results survive.
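
A possible starting point for Exercise 1, sketched on a fresh minimal simulation rather than the lab's `df` (with the real data you would swap in the full control set): fit a logit with base `glm`, then compute the average marginal effect (AME) of `black` by contrasting predicted probabilities with the indicator flipped.

```r
# Logit with an average marginal effect (AME) for the race indicator
set.seed(7)
n <- 4870
black <- rbinom(n, 1, 0.5)
high_quality <- rbinom(n, 1, 0.5)
callback <- rbinom(n, 1, plogis(-2.67 + 0.44 * (1 - black) + 0.10 * high_quality))
d <- data.frame(callback, black, high_quality)
fit <- glm(callback ~ black + high_quality, family = binomial(), data = d)
# AME: average change in predicted probability when black flips 0 -> 1
p1 <- predict(fit, transform(d, black = 1), type = "response")
p0 <- predict(fit, transform(d, black = 0), type = "response")
cat("AME of black-sounding name:", round(mean(p1 - p0), 4), "\n")
```

Compare the AME to the LPM coefficient from Step 2; with an outcome this rare the two are usually close. The `margins` package can automate this computation.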