Lab: Shift-Share (Bartik) Instruments
Construct and validate a Bartik shift-share instrument for estimating the causal effect of local employment shocks on wages. Learn to decompose the instrument, test identifying assumptions, and apply the Borusyak-Hull-Jaravel (2022) diagnostics.
Overview
In this lab you will build a Bartik-style shift-share instrument from scratch. The classic application estimates how exogenous national industry demand shocks affect local labor market outcomes by interacting pre-period local industry employment shares with national industry growth rates.
What you will learn:
- How to construct a shift-share (Bartik) instrument from industry shares and national shifts
- How to estimate 2SLS using the Bartik instrument
- The two identifying strategies: exogeneity of shares vs. exogeneity of shifts
- How to implement Borusyak et al. (2022) diagnostics
- How to test for relevance (first stage) and interpret reduced-form results
Prerequisites: Familiarity with IV/2SLS estimation and the concept of endogeneity. Completion of the IV tutorial lab is recommended.
Step 1: Simulate Local Labor Market Data
We simulate 200 commuting zones with employment across 20 industries. National demand shocks drive local employment changes.
library(fixest)
library(MASS)
set.seed(42)
L <- 200; K <- 20 # 200 commuting zones (CZs), 20 industries
# Base-period industry employment shares via Dirichlet (gamma draws normalized)
# Each row is a CZ; each column is an industry; rows sum to 1
raw_shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
raw_shares <- raw_shares / rowSums(raw_shares)
# National industry growth rates — the exogenous "shifts"
national_growth <- rnorm(K, 0.02, 0.05)
# Local confounder: amenities that affect both employment and wages (the endogeneity source)
local_amenity <- rnorm(L)
# Bartik instrument: B_l = sum_k(s_lk * g_k) — share-weighted national shocks
bartik <- raw_shares %*% national_growth
# Endogenous local employment growth: driven by Bartik + confounded by local amenity
emp_growth <- bartik + 0.3 * local_amenity + rnorm(L, 0, 0.02)
# Wage growth: true causal effect of emp_growth is 0.5; amenity also affects wages directly
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * local_amenity + rnorm(L, 0, 0.03)
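df <- data.frame(cz = 1:L, wage_growth = as.numeric(wage_growth),
                 emp_growth = as.numeric(emp_growth),
                 bartik = as.numeric(bartik), amenity = local_amenity)
summary(df[, c("wage_growth", "emp_growth", "bartik", "amenity")])
head(df, 5)
# Endogeneity and exogeneity checks (possible only because amenity is simulated)
cat("Correlation(emp_growth, amenity):", cor(df$emp_growth, df$amenity), "\n")
cat("Correlation(bartik, amenity):", cor(df$bartik, df$amenity), "\n")
Expected output (values are illustrative):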
| Variable | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| wage_growth | 0.024 | 0.055 | -0.12 | 0.18 |
| emp_growth | 0.022 | 0.040 | -0.08 | 0.13 |
| bartik | 0.010 | 0.020 | -0.04 | 0.06 |
| amenity | 0.00 | 1.00 | -2.8 | 3.1 |
First five rows of df:
| cz | wage_growth | emp_growth | bartik | amenity |
|---|---|---|---|---|
| 1 | 0.032 | 0.040 | 0.015 | 0.50 |
| 2 | -0.015 | -0.010 | -0.008 | -0.22 |
| 3 | 0.058 | 0.055 | 0.022 | 1.10 |
| 4 | 0.010 | 0.018 | 0.005 | 0.35 |
| 5 | 0.045 | 0.035 | 0.018 | 0.60 |
Correlation(emp_growth, amenity): ~0.85
Correlation(bartik, amenity): ~0.02
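The high correlation between emp_growth and amenity is the source of the endogeneity problem: amenity also enters the wage equation directly, so an OLS regression of wage growth on employment growth picks up its effect. The near-zero correlation between the Bartik instrument and amenity is consistent with instrument exogeneity. With real data the confounder is unobserved, so this direct check is possible only in a simulation.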
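Because `local_amenity` has unit variance and enters `emp_growth` with a coefficient of 0.3, it may account for most of the variation in employment growth. The sketch below re-creates the Step 1 draws in base R so that it runs on its own (the names `shares`, `g`, `noise`, and `component_sd` are chosen here for illustration). It compares the scale of each additive component and reports the implied first-stage R-squared. If the Bartik component turns out to be small relative to the others, the first stage in Step 3 will be weaker than the illustrative figures suggest. Shrinking the amenity coefficient would be one possible adjustment, though that is an assumption rather than part of the lab's design.

```r
# Re-create the Step 1 draws, keeping the noise term separate for the decomposition
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)          # rows sum to 1
g <- rnorm(K, 0.02, 0.05)                   # national industry growth rates
amenity <- rnorm(L)                         # local confounder
bartik <- as.numeric(shares %*% g)          # B_l = sum_k s_lk * g_k
noise <- rnorm(L, 0, 0.02)
emp_growth <- bartik + 0.3 * amenity + noise

# Standard deviation of each additive component of emp_growth
component_sd <- c(bartik = sd(bartik),
                  amenity_part = sd(0.3 * amenity),
                  noise = sd(noise))
print(round(component_sd, 4))

# With a single regressor, the first-stage R-squared is the squared correlation
first_stage_r2 <- cor(emp_growth, bartik)^2
cat("First-stage R-squared:", round(first_stage_r2, 4), "\n")
```

A small R-squared here is an early warning that the instrument explains little of the endogenous variable, whatever the point estimate of the first-stage slope.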
Step 2: Construct the Bartik Instrument
The Bartik instrument is $B_l = \sum_k s_{lk} g_k$, where $s_{lk}$ is location $l$'s employment share in industry $k$ at baseline, and $g_k$ is the national growth rate of industry $k$.
# Manual construction: B_l = sum_k(s_lk * g_k), loop over industries
bartik_manual <- rep(0, L)
for (k in 1:K) {
  bartik_manual <- bartik_manual + raw_shares[, k] * national_growth[k]
}
# Verify manual construction matches the matrix product
cat("Max difference:", max(abs(bartik_manual - bartik)), "\n")
# Decompose: which industries contribute most to the Bartik instrument?
# Contribution_k = avg_share_k * national_growth_k
avg_shares <- colMeans(raw_shares)
contributions <- avg_shares * national_growth
top5 <- order(-abs(contributions))[1:5]
cat("\nTop 5 contributing industries:\n")
for (k in top5) {
  cat(sprintf("  Industry %d: avg share = %.3f, growth = %.4f, contribution = %.5f\n",
              k, avg_shares[k], national_growth[k], contributions[k]))
}
Expected output:
Max difference: ~0 (floating-point rounding at most)
Top 5 contributing industries:
| Industry | Avg Share | National Growth | Contribution |
|---|---|---|---|
| Industry 3 | 0.055 | 0.085 | 0.00468 |
| Industry 12 | 0.062 | -0.070 | -0.00434 |
| Industry 7 | 0.048 | 0.072 | 0.00346 |
| Industry 15 | 0.051 | -0.055 | -0.00281 |
| Industry 1 | 0.058 | 0.045 | 0.00261 |
The industry contributions (average share times national growth) sum to the cross-CZ mean of the Bartik instrument. Industries with both large average shares and large (positive or negative) growth rates contribute most. Positive contributions come from growing industries with a large average local presence; negative contributions come from declining industries.
Step 3: First Stage and Reduced Form
Estimate the first stage (employment growth on Bartik) and reduced form (wages on Bartik) separately before running 2SLS.
# First stage: emp_growth = alpha + gamma * bartik + epsilon
# Tests relevance: does the instrument predict the endogenous variable?
first_stage <- feols(emp_growth ~ bartik, data = df, vcov = "hetero")
cat("=== First Stage ===\n")
print(summary(first_stage))
# First-stage F-statistic via fitstat()
print(fitstat(first_stage, "f"))
# Reduced form: wage_growth = alpha + pi * bartik + epsilon
# Shows the total (causal) effect of the instrument on the outcome
reduced_form <- feols(wage_growth ~ bartik, data = df, vcov = "hetero")
cat("\n=== Reduced Form ===\n")
print(summary(reduced_form))
# Wald estimate = reduced form / first stage = IV estimate in the just-identified case
wald <- coef(reduced_form)["bartik"] / coef(first_stage)["bartik"]
cat("\nWald estimate:", wald, "\n")
Expected output:
| Regression | Variable | Coefficient | SE | F-stat |
|---|---|---|---|---|
| First stage (emp ~ bartik) | bartik | ~1.00 | ~0.10 | ~85 |
| Reduced form (wage ~ bartik) | bartik | ~0.50 | ~0.08 | — |
=== First Stage ===
Coefficient on Bartik: ~1.00
F-statistic: ~85
R-squared: ~0.25
=== Reduced Form ===
Coefficient on Bartik: ~0.50
Wald (IV) estimate: ~0.50
True effect: 0.5
Suppose the first-stage F-statistic is well above 10 (say, F = 85). What does this tell you about the relevance of the instrument?
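One way to see what the first-stage F-statistic and the Wald ratio measure is to compute them by hand. The sketch below re-creates the Step 1 data and uses base R `lm()` rather than `feols()` so that it is self-contained. It reports the classical (homoskedastic) F-statistic, which can differ from a heteroskedasticity-robust one, and checks that the ratio of reduced-form to first-stage slopes equals Cov(bartik, wage_growth) / Cov(bartik, emp_growth). Variable names such as `shares` and `g` are illustrative.

```r
# Re-create the Step 1 data
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)
g <- rnorm(K, 0.02, 0.05)
amenity <- rnorm(L)
bartik <- as.numeric(shares %*% g)
emp_growth <- bartik + 0.3 * amenity + rnorm(L, 0, 0.02)
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * amenity + rnorm(L, 0, 0.03)

# First stage and reduced form by OLS
fs <- lm(emp_growth ~ bartik)
rf <- lm(wage_growth ~ bartik)

# Classical F-statistic; with one instrument it equals the squared t-statistic
f_stat <- summary(fs)$fstatistic["value"]
t_stat <- summary(fs)$coefficients["bartik", "t value"]

# Wald ratio and the equivalent covariance-ratio form of the IV estimator
wald <- coef(rf)["bartik"] / coef(fs)["bartik"]
iv_cov <- cov(bartik, wage_growth) / cov(bartik, emp_growth)

cat("First-stage F:", f_stat, " (t^2 =", t_stat^2, ")\n")
cat("Wald ratio:", wald, " Covariance ratio:", iv_cov, "\n")
```

If the F-statistic printed here is small, the Wald ratio divides by a noisy first-stage slope and can land far from the true effect of 0.5.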
Step 4: 2SLS Estimation
# OLS: biased because emp_growth is confounded by local_amenity
ols <- feols(wage_growth ~ emp_growth, data = df, vcov = "hetero")
# 2SLS: fixest IV syntax without fixed effects is Y ~ exog | endog ~ instruments
# (with fixed effects: Y ~ exog | FE | endog ~ instruments); "1" = intercept only
iv <- feols(wage_growth ~ 1 | emp_growth ~ bartik, data = df, vcov = "hetero")
cat("=== Comparison ===\n")
cat("OLS:", coef(ols)["emp_growth"], "\n")
cat("2SLS:", coef(iv)["fit_emp_growth"], "\n") # fixest labels IV-fitted variable "fit_..."
cat("True:", 0.5, "\n")
# Side-by-side table for publication-style comparison
etable(ols, iv, headers = c("OLS", "2SLS"))
Expected output:
| Estimator | Coefficient | SE | True Effect |
|---|---|---|---|
| OLS (biased) | ~0.75 | ~0.04 | 0.50 |
| 2SLS (Bartik IV) | ~0.50 | ~0.10 | 0.50 |
=== Comparison ===
OLS coefficient: ~0.75 (SE: ~0.04)
2SLS coefficient: ~0.50 (SE: ~0.10)
True effect: 0.50
OLS is biased upward because emp_growth is correlated with amenity
2SLS corrects by using only Bartik-driven variation
In the illustrative output, OLS overestimates the true effect by about 50% (0.75 vs. 0.50) because local amenities raise both employment growth and wages. The 2SLS estimate, which uses only the variation in employment growth driven by national shocks, is much closer to the truth. Its standard error is larger, reflecting the efficiency cost of discarding the endogenous variation.
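The direction of the OLS bias follows from the simulated equations in Step 1. As a sketch, write $x_l$ for `emp_growth`, $y_l$ for `wage_growth`, $a_l$ for `local_amenity`, and $u_l$, $e_l$ for the two noise terms (this notation is introduced here for illustration only):

```latex
\begin{aligned}
x_l &= B_l + 0.3\,a_l + u_l, \qquad
y_l = 0.01 + 0.5\,x_l + 0.2\,a_l + e_l, \\
\operatorname{plim}\hat\beta_{\mathrm{OLS}}
  &= \frac{\operatorname{Cov}(x_l, y_l)}{\operatorname{Var}(x_l)}
   = 0.5 + 0.2\,\frac{\operatorname{Cov}(x_l, a_l)}{\operatorname{Var}(x_l)}
   = 0.5 + \frac{0.2 \times 0.3\,\operatorname{Var}(a_l)}{\operatorname{Var}(x_l)}, \\
\operatorname{plim}\hat\beta_{\mathrm{IV}}
  &= \frac{\operatorname{Cov}(B_l, y_l)}{\operatorname{Cov}(B_l, x_l)}
   = 0.5 + 0.2\,\frac{\operatorname{Cov}(B_l, a_l)}{\operatorname{Cov}(B_l, x_l)}
   = 0.5 .
\end{aligned}
```

The second line uses the fact that $a_l$ is drawn independently of $B_l$ and $u_l$, so $\operatorname{Cov}(x_l, a_l) = 0.3\,\operatorname{Var}(a_l)$; the third uses $\operatorname{Cov}(B_l, a_l) = 0$. The OLS bias term is positive, and its size depends on how much of the variance of employment growth comes from amenities.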
Step 5: Borusyak-Hull-Jaravel Diagnostics
Borusyak et al. (2022) show that with many industries, identification can come from the exogeneity of the national shocks $g_k$ rather than the shares $s_{lk}$. The key diagnostic is the "shock-level" regression.
# BHJ shock-level regression: rewrite the IV at the industry (shock) level
# For each industry k, compute exposure-weighted averages of outcomes and regressors
Y_k <- numeric(K); X_k <- numeric(K); weight_k <- numeric(K)
for (k in 1:K) {
  s_k <- raw_shares[, k]                           # industry k's shares across all CZs
  weight_k[k] <- sum(s_k)                          # total exposure to industry k
  Y_k[k] <- sum(s_k * df$wage_growth) / sum(s_k)   # share-weighted outcome
  X_k[k] <- sum(s_k * df$emp_growth) / sum(s_k)    # share-weighted regressor
}
shock_df <- data.frame(Y_k, X_k, g_k = national_growth, weight_k)
# Exposure-weighted IV at the shock level, instrumenting X_k with the shocks g_k.
# (Weighted OLS of Y_k on X_k would not use the shocks and would not match 2SLS.)
shock_reg <- feols(Y_k ~ 1 | X_k ~ g_k, data = shock_df, weights = ~weight_k)
cat("Shock-level coefficient:", coef(shock_reg)["fit_X_k"], "\n")
# Balance test: are national shocks correlated with share-weighted local confounders?
# High p-value supports the identifying assumption that shocks are quasi-random
amenity_k <- sapply(1:K, function(k) {
sum(raw_shares[, k] * df$amenity) / sum(raw_shares[, k])
})
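# Weight by total exposure so the balance regression matches the shock-level IV weighting
balance <- lm(amenity_k ~ national_growth, weights = weight_k)
cat("Balance test p-value:", summary(balance)$coefficients[2, 4], "\n")
Expected output: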
=== Shock-Level Regression (BHJ) ===
Coefficient: ~0.50
This should match the location-level 2SLS estimate
Balance: correlation of shocks with amenity: ~0.05 (p = ~0.85)
| BHJ Diagnostic | Result | Interpretation |
|---|---|---|
| Shock-level coefficient | ~0.50 | Matches location-level 2SLS |
| Balance test (shocks vs. amenity) | p ~0.85 | No evidence shocks are correlated with local confounders |
| Number of shocks (K) | 20 | Modest; the quasi-random shock interpretation relies on many shocks with dispersed exposure |
The balance test is the key BHJ diagnostic: it checks whether the national growth rates are correlated with the share-weighted local confounders (here, amenities). A large p-value (well above 0.05) means the test fails to reject balance, which is consistent with quasi-random shock assignment but does not prove it, particularly with only 20 shocks.
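The claim that the shock-level regression reproduces the location-level estimate can be checked numerically. The sketch below re-creates the Step 1 data in base R and computes both estimators from their covariance formulas instead of calling `feols()`, an illustrative simplification with names such as `w_k` and `beta_shock` chosen here. The equivalence relies on the shares summing to one in every location and on the shock-level regression using `g` as the instrument with exposure weights.

```r
# Re-create the Step 1 data
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)
g <- rnorm(K, 0.02, 0.05)
amenity <- rnorm(L)
bartik <- as.numeric(shares %*% g)
emp_growth <- bartik + 0.3 * amenity + rnorm(L, 0, 0.02)
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * amenity + rnorm(L, 0, 0.03)

# Location level: just-identified IV with an intercept
beta_location <- cov(bartik, wage_growth) / cov(bartik, emp_growth)

# Shock level: exposure weights and exposure-weighted averages per industry
w_k <- colSums(shares)
Y_k <- as.numeric(crossprod(shares, wage_growth)) / w_k
X_k <- as.numeric(crossprod(shares, emp_growth)) / w_k

# Weighted IV of Y_k on X_k with g as the instrument (intercept included)
g_bar <- weighted.mean(g, w_k)
beta_shock <- sum(w_k * (g - g_bar) * Y_k) / sum(w_k * (g - g_bar) * X_k)

cat("Location-level IV:", beta_location, "\n")
cat("Shock-level IV:   ", beta_shock, "\n")
```

The two numbers agree up to floating-point error. Replacing the instrument `g` with `X_k` itself (that is, running weighted OLS) would break the equality.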
In the BHJ framework, what is the key identifying assumption when there are many industries (K is large)?
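The "many shocks" requirement concerns how dispersed exposure is across industries, not only the raw count K. Borusyak et al. summarize this with the inverse Herfindahl index of the average exposure weights, which can be read as an effective number of shocks. A minimal sketch, assuming the Step 1 share matrix (re-created here; `w_k`, `hhi`, and `effective_shocks` are illustrative names):

```r
# Re-create the Step 1 share matrix
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)

# Average exposure to each industry, normalized to sum to 1
w_k <- colSums(shares) / sum(shares)

# Herfindahl index of exposure and its inverse (effective number of shocks)
hhi <- sum(w_k^2)
effective_shocks <- 1 / hhi

cat("HHI of exposure weights:", round(hhi, 4), "\n")
cat("Effective number of shocks:", round(effective_shocks, 2), "of K =", K, "\n")
```

The effective number can never exceed K and equals K only when exposure is spread evenly. A value far below K would suggest that a handful of industries dominate the identifying variation.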
Step 6: Rotemberg Weights
Goldsmith-Pinkham et al. (2020) provide an alternative decomposition based on the shares. Rotemberg weights reveal which industries are driving the estimate.
# Rotemberg weights (Goldsmith-Pinkham et al. 2020): decompose 2SLS by industry
# alpha_k measures how much industry k contributes to the overall IV estimate
rotemberg <- numeric(K)
for (k in 1:K) {
  s_k <- raw_shares[, k]
  # Unnormalized weight = g_k * sum_l s_lk * (x_l - mean(x)), where x is the
  # endogenous regressor (emp_growth), not the outcome
  rotemberg[k] <- national_growth[k] * sum(s_k * (df$emp_growth - mean(df$emp_growth)))
}
rotemberg <- rotemberg / sum(rotemberg) # normalize to sum to 1
top5_rot <- order(-abs(rotemberg))[1:5]
cat("Top 5 Rotemberg weights:\n")
for (k in top5_rot) {
cat(sprintf(" Industry %d: weight = %.4f, growth = %.4f\n",
k, rotemberg[k], national_growth[k]))
}
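# Negative weights mean the estimate is not a convex combination of the
# industry-specific IV estimates, which complicates interpretation
cat("Number of negative weights:", sum(rotemberg < 0), "\n")
cat("Sum of negative weights:", sum(rotemberg[rotemberg < 0]), "\n")
Expected output: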
| Industry | Rotemberg Weight | National Growth |
|---|---|---|
| Industry 3 | ~0.25 | 0.085 |
| Industry 7 | ~0.18 | 0.072 |
| Industry 1 | ~0.15 | 0.045 |
| Industry 12 | ~-0.04 | -0.070 |
| Industry 15 | ~-0.03 | -0.055 |
Number of negative weights: ~5
Sum of negative weights: ~-0.08
The concentration of Rotemberg weights in a few industries means the 2SLS estimate is effectively identified by demand shocks in those key sectors. If those particular shocks are not exogenous, the estimate may be biased.
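The sense in which Rotemberg weights "decompose" the estimate can be verified directly: the Bartik IV coefficient equals the weighted sum of just-identified IV estimates that each use a single industry share as the instrument. The sketch below re-creates the Step 1 data in base R; `alpha_k` and `beta_k` follow the notation of Goldsmith-Pinkham et al., and the other names are illustrative.

```r
# Re-create the Step 1 data
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)
g <- rnorm(K, 0.02, 0.05)
amenity <- rnorm(L)
bartik <- as.numeric(shares %*% g)
emp_growth <- bartik + 0.3 * amenity + rnorm(L, 0, 0.02)
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * amenity + rnorm(L, 0, 0.03)

# Demean to partial out the intercept
x_dm <- emp_growth - mean(emp_growth)
y_dm <- wage_growth - mean(wage_growth)

# Share-by-share cross products with the regressor and the outcome
sx <- as.numeric(crossprod(shares, x_dm))   # sum_l s_lk * x_l, one value per industry
sy <- as.numeric(crossprod(shares, y_dm))   # sum_l s_lk * y_l, one value per industry

alpha_k <- g * sx / sum(g * sx)   # Rotemberg weights (sum to 1)
beta_k <- sy / sx                 # IV estimate using share k alone as the instrument

beta_bartik <- cov(bartik, wage_growth) / cov(bartik, emp_growth)
cat("Bartik IV:            ", beta_bartik, "\n")
cat("sum(alpha_k * beta_k):", sum(alpha_k * beta_k), "\n")
```

Because the weights are built from the regressor and the shocks, an industry can receive a large weight without its share being a strong or valid instrument on its own. Inspecting `beta_k` for the top-weighted industries is a natural follow-up.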
Exercises
- Vary the number of industries. Re-run with K = 5 and K = 100. How does the precision of the 2SLS estimate and the BHJ balance test change?
- Add correlated shocks. Make some national growth rates correlated with the average amenity in CZs where that industry is concentrated. Does the 2SLS estimate become biased? Does the BHJ balance test detect it?
- Leave-one-out industry. Drop each industry one at a time from the Bartik instrument and re-estimate 2SLS. Plot the distribution of estimates. Is the result robust?
- Pre-trends test. Add a lagged dependent variable as a 'pre-trend' and test whether the Bartik instrument predicts it. This regression is a falsification test for the exogeneity assumption.
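As a possible starting point for the leave-one-out exercise, the sketch below removes one industry's term from the instrument at a time and recomputes the IV estimate from the covariance ratio. It assumes that "dropping" an industry means subtracting its share-times-shock term without renormalizing the remaining shares, which is one interpretation among several. The helper `iv_ratio` and the other names are hypothetical, and the data are re-created in base R so the block runs on its own.

```r
# Re-create the Step 1 data
set.seed(42)
L <- 200; K <- 20
shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
shares <- shares / rowSums(shares)
g <- rnorm(K, 0.02, 0.05)
amenity <- rnorm(L)
bartik <- as.numeric(shares %*% g)
emp_growth <- bartik + 0.3 * amenity + rnorm(L, 0, 0.02)
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * amenity + rnorm(L, 0, 0.03)

# Just-identified IV with an intercept: Cov(z, y) / Cov(z, x)
iv_ratio <- function(z, y, x) cov(z, y) / cov(z, x)

# Leave-one-out instruments: remove industry k's contribution s_lk * g_k
loo_estimates <- sapply(seq_len(K), function(k) {
  z_minus_k <- bartik - shares[, k] * g[k]
  iv_ratio(z_minus_k, wage_growth, emp_growth)
})

full_estimate <- iv_ratio(bartik, wage_growth, emp_growth)
cat("Full-instrument estimate:", full_estimate, "\n")
print(summary(loo_estimates))
```

A histogram of `loo_estimates` (for example with `hist()`) then shows whether any single industry moves the estimate substantially.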
Summary
In this lab you learned:
- The Bartik shift-share instrument combines local industry composition (shares) with national industry trends (shifts) to isolate exogenous variation in local employment
- The first stage must be strong for 2SLS to be reliable; F > 10 is a conventional rule of thumb, not a guarantee
- Borusyak et al. (2022) show that with many industries, identification can come from quasi-random shocks; balance regressions at the shock level can probe the plausibility of this assumption
- Rotemberg weights reveal which industries drive the 2SLS estimate; negative weights mean the estimate is not a convex combination of industry-specific estimates, which complicates interpretation when effects are heterogeneous
- Report the first stage, reduced form, and the full set of diagnostics alongside the 2SLS coefficient