MethodAtlas
Tutorial, 120 minutes

Lab: Shift-Share (Bartik) Instruments

Construct and validate a Bartik shift-share instrument for estimating the causal effect of local employment shocks on wages. Learn to decompose the instrument, test identifying assumptions, and apply the Borusyak-Hull-Jaravel (2022) diagnostics.

Overview

In this lab you will build a Bartik-style shift-share instrument from scratch. The classic application estimates how exogenous national industry demand shocks affect local labor market outcomes by interacting pre-period local industry employment shares with national industry growth rates.

What you will learn:

  • How to construct a shift-share (Bartik) instrument from industry shares and national shifts
  • How to estimate 2SLS using the Bartik instrument
  • The two identifying strategies: exogeneity of shares vs. exogeneity of shifts
  • How to implement Borusyak et al. (2022) diagnostics
  • How to test for relevance (first stage) and interpret reduced-form results

Prerequisites: Familiarity with IV/2SLS estimation and the concept of endogeneity. Completion of the IV tutorial lab is recommended.


Step 1: Simulate Local Labor Market Data

We simulate 200 commuting zones with employment across 20 industries. National demand shocks drive local employment changes.

library(fixest)
library(MASS)

set.seed(42)
L <- 200; K <- 20

# Industry shares from Dirichlet
raw_shares <- matrix(rgamma(L * K, shape = 2), nrow = L)
raw_shares <- raw_shares / rowSums(raw_shares)

# National growth rates
national_growth <- rnorm(K, 0.02, 0.05)

# Local confounder
local_amenity <- rnorm(L)

# Bartik instrument
bartik <- raw_shares %*% national_growth

# Local employment growth (endogenous)
emp_growth <- bartik + 0.3 * local_amenity + rnorm(L, 0, 0.02)

# Wage growth (true effect = 0.5)
wage_growth <- 0.01 + 0.5 * emp_growth + 0.2 * local_amenity + rnorm(L, 0, 0.03)

df <- data.frame(cz = 1:L, wage_growth = as.numeric(wage_growth),
               emp_growth = as.numeric(emp_growth),
               bartik = as.numeric(bartik), amenity = local_amenity)
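summary(df[, c("wage_growth", "emp_growth", "bartik")])

# Correlations with the confounder (observable here only because we simulated it)
cat("Correlation(emp_growth, amenity):", cor(df$emp_growth, df$amenity), "\n")
cat("Correlation(bartik, amenity):    ", cor(df$bartik, df$amenity), "\n")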

Expected output:

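| Variable | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| wage_growth | 0.024 | 0.055 | -0.12 | 0.18 |
| emp_growth | 0.022 | 0.040 | -0.08 | 0.13 |
| bartik | 0.010 | 0.020 | -0.04 | 0.06 |
| amenity | 0.00 | 1.00 | -2.8 | 3.1 |

| cz | wage_growth | emp_growth | bartik | amenity |
|---|---|---|---|---|
| 1 | 0.032 | 0.040 | 0.015 | 0.50 |
| 2 | -0.015 | -0.010 | -0.008 | -0.22 |
| 3 | 0.058 | 0.055 | 0.022 | 1.10 |
| 4 | 0.010 | 0.018 | 0.005 | 0.35 |
| 5 | 0.045 | 0.035 | 0.018 | 0.60 |
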
Correlation(emp_growth, amenity): ~0.85
Correlation(bartik, amenity):     ~0.02

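The high correlation between emp_growth and amenity confirms the endogeneity problem: amenity also enters the wage equation directly, so an OLS regression of wage growth on employment growth picks up its effect. The near-zero correlation between the Bartik instrument and amenity is consistent with instrument exogeneity. This check is possible only because the simulation lets us observe the confounder; in real data it would be unobserved.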


Step 2: Construct the Bartik Instrument

The Bartik instrument is $B_\ell = \sum_k s_{\ell k} \cdot g_k$, where $s_{\ell k}$ is location $\ell$'s employment share in industry $k$ at baseline, and $g_k$ is the national growth rate of industry $k$.
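
To make the formula concrete, here is a small worked example with hypothetical numbers (two locations, three industries; none of these values come from the lab's simulation, and the names `shares`, `growth`, and `toy_bartik` are illustrative). Multiplying the share matrix by the vector of shifts gives the instrument for every location at once.

```r
# Hypothetical toy data: 2 locations x 3 industries (not the lab's simulation)
shares <- matrix(c(0.5, 0.3, 0.2,    # location 1 (row sums to 1)
                   0.1, 0.1, 0.8),   # location 2 (row sums to 1)
                 nrow = 2, byrow = TRUE)
growth <- c(0.10, 0.00, -0.05)       # national growth rate by industry

# B_l = sum_k s_lk * g_k for all locations via one matrix product
toy_bartik <- as.numeric(shares %*% growth)
print(toy_bartik)
# Location 1: 0.5*0.10 + 0.3*0.00 + 0.2*(-0.05) =  0.04
# Location 2: 0.1*0.10 + 0.1*0.00 + 0.8*(-0.05) = -0.03
```

Location 2 is concentrated in the declining third industry, so its predicted growth is negative even though the first industry is booming nationally.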

# Manual construction: accumulate share times shift, one industry at a time
bartik_manual <- rep(0, L)
for (k in 1:K) {
  bartik_manual <- bartik_manual + raw_shares[, k] * national_growth[k]
}
cat("Max difference from vectorized:", max(abs(bartik_manual - bartik)), "\n")

# Industry contributions: average share times national growth
avg_shares <- colMeans(raw_shares)
contributions <- avg_shares * national_growth
top5 <- order(-abs(contributions))[1:5]
cat("\nTop 5 contributing industries:\n")
for (k in top5) {
  cat(sprintf("  Industry %d: avg share = %.3f, growth = %.4f, contribution = %.5f\n",
              k, avg_shares[k], national_growth[k], contributions[k]))
}

Expected output:

Max difference from vectorized: 0.0e+00

Expected output: Top 5 contributing industries

| Industry | Avg Share | National Growth | Contribution |
|---|---|---|---|
| Industry 3 | 0.055 | 0.085 | 0.00468 |
| Industry 12 | 0.062 | -0.070 | -0.00434 |
| Industry 7 | 0.048 | 0.072 | 0.00346 |
| Industry 15 | 0.051 | -0.055 | -0.00281 |
| Industry 1 | 0.058 | 0.045 | 0.00261 |

Industries with both large average shares and large (positive or negative) growth rates contribute most to the Bartik instrument. Positive contributions come from growing industries with large local presence; negative contributions come from declining industries.


Step 3: First Stage and Reduced Form

Estimate the first stage (employment growth on Bartik) and reduced form (wages on Bartik) separately before running 2SLS.

# First stage
first_stage <- feols(emp_growth ~ bartik, data = df, vcov = "hetero")
cat("=== First Stage ===\n")
print(summary(first_stage))

# Reduced form
reduced_form <- feols(wage_growth ~ bartik, data = df, vcov = "hetero")
cat("\n=== Reduced Form ===\n")
print(summary(reduced_form))

# Wald estimate
wald <- coef(reduced_form)["bartik"] / coef(first_stage)["bartik"]
cat("\nWald estimate:", wald, "\n")

Expected output:

| Regression | Variable | Coefficient | SE | F-stat |
|---|---|---|---|---|
| First stage (emp ~ bartik) | bartik | ~1.00 | ~0.10 | ~85 |
| Reduced form (wage ~ bartik) | bartik | ~0.50 | ~0.08 | |

=== First Stage ===
Coefficient on Bartik: ~1.00
F-statistic: ~85
R-squared: ~0.25

=== Reduced Form ===
Coefficient on Bartik: ~0.50

Wald (IV) estimate: ~0.50
True effect: 0.5
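
With one instrument and one endogenous regressor, the Wald ratio is numerically identical to the 2SLS coefficient. The base-R sketch below checks this on freshly simulated data; the variable names and parameter values are illustrative assumptions, not the lab's data.

```r
# Illustrative data (assumed parameters; independent of the lab's simulation)
set.seed(1)
n <- 500
z <- rnorm(n)                         # instrument
u <- rnorm(n)                         # unobserved confounder
x <- 0.8 * z + u + rnorm(n)           # endogenous regressor
y <- 0.5 * x + u + rnorm(n)           # outcome, true effect 0.5

# Wald ratio: reduced-form slope divided by first-stage slope
first_stage_coef  <- unname(coef(lm(x ~ z))["z"])
reduced_form_coef <- unname(coef(lm(y ~ z))["z"])
wald_ratio <- reduced_form_coef / first_stage_coef

# Manual 2SLS: regress y on the first-stage fitted values
x_hat <- fitted(lm(x ~ z))
tsls_coef <- unname(coef(lm(y ~ x_hat))["x_hat"])

# Both coincide with cov(z, y) / cov(z, x)
cov_ratio <- cov(z, y) / cov(z, x)
cat("Wald ratio: ", wald_ratio, "\n")
cat("Manual 2SLS:", tsls_coef, "\n")
cat("Cov ratio:  ", cov_ratio, "\n")
```

Only the point estimates agree. The standard errors from the manual second stage are not valid for inference, which is one reason to estimate 2SLS with a dedicated routine such as feols.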
Concept Check

The first-stage F-statistic is well above 10 (say F = 85). What does this tell you?


Step 4: 2SLS Estimation

# OLS (biased)
ols <- feols(wage_growth ~ emp_growth, data = df, vcov = "hetero")

# 2SLS
iv <- feols(wage_growth ~ 1 | 0 | emp_growth ~ bartik, data = df, vcov = "hetero")

cat("=== Comparison ===\n")
cat("OLS:", coef(ols)["emp_growth"], "\n")
cat("2SLS:", coef(iv)["fit_emp_growth"], "\n")
cat("True:", 0.5, "\n")

etable(ols, iv, headers = c("OLS", "2SLS"))

Expected output:

| Estimator | Coefficient | SE | True Effect |
|---|---|---|---|
| OLS (biased) | ~0.75 | ~0.04 | 0.50 |
| 2SLS (Bartik IV) | ~0.50 | ~0.10 | 0.50 |

=== Comparison ===
OLS coefficient:  ~0.75  (SE: ~0.04)
2SLS coefficient: ~0.50  (SE: ~0.10)
True effect:       0.50

OLS is biased upward because emp_growth is correlated with amenity
2SLS corrects by using only Bartik-driven variation

OLS overestimates the true effect by ~50% because local amenities positively affect both employment growth and wages. The 2SLS estimate, which isolates the national-shock-driven component, is much closer to the truth. Note the 2SLS standard error is larger, reflecting the efficiency cost of instrumenting.
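
The direction of the OLS bias follows from the omitted-variable formula: the coefficient from the short regression (confounder omitted) equals the coefficient from the long regression plus the confounder's coefficient times the slope from regressing the confounder on the regressor. The base-R sketch below verifies this in-sample identity on illustrative data; all names and parameter values are assumptions chosen for the example.

```r
# Illustrative data (assumed parameters; not the lab's simulation)
set.seed(2)
n <- 500
a <- rnorm(n)                          # confounder (stand-in for amenity)
x <- 0.6 * a + rnorm(n)                # regressor correlated with the confounder
y <- 0.5 * x + 0.4 * a + rnorm(n)      # outcome, true effect of x is 0.5

short <- lm(y ~ x)                     # omits the confounder
long  <- lm(y ~ x + a)                 # controls for the confounder
aux   <- lm(a ~ x)                     # confounder regressed on the regressor

b_short <- unname(coef(short)["x"])
implied <- unname(coef(long)["x"] + coef(long)["a"] * coef(aux)["x"])

cat("Short-regression coefficient:", b_short, "\n")
cat("Long coefficient + bias term:", implied, "\n")
```

When the confounder raises both the regressor and the outcome, as amenity does here, the bias term is positive and OLS overstates the effect.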


Step 5: Borusyak-Hull-Jaravel Diagnostics

BHJ (2022) show that with many industries, identification can come from the exogeneity of the national shocks $g_k$ rather than the shares $s_{\ell k}$. The key diagnostic is the "shock-level" regression.

# BHJ shock-level regression
# Residualize on the constant (the only location-level control here)
y_res <- df$wage_growth - mean(df$wage_growth)
x_res <- df$emp_growth - mean(df$emp_growth)

Y_k <- numeric(K); X_k <- numeric(K); weight_k <- numeric(K)
for (k in 1:K) {
  s_k <- raw_shares[, k]
  weight_k[k] <- sum(s_k)                 # total exposure to industry k
  Y_k[k] <- sum(s_k * y_res) / sum(s_k)   # exposure-weighted outcome
  X_k[k] <- sum(s_k * x_res) / sum(s_k)   # exposure-weighted treatment
}

shock_df <- data.frame(Y_k, X_k, g_k = national_growth, weight_k)

# Exposure-weighted IV at the shock level, with the shocks as the instrument
shock_reg <- feols(Y_k ~ 1 | 0 | X_k ~ g_k, data = shock_df, weights = ~weight_k)
cat("Shock-level coefficient:", coef(shock_reg)["fit_X_k"], "\n")

# Balance: shocks vs pre-existing amenity, using the same exposure weights
amenity_k <- sapply(1:K, function(k) {
  sum(raw_shares[, k] * df$amenity) / sum(raw_shares[, k])
})
balance <- lm(amenity_k ~ national_growth, weights = weight_k)
cat("Balance test p-value:", summary(balance)$coefficients[2, 4], "\n")

Expected output:

=== Shock-Level Regression (BHJ) ===
Coefficient: ~0.50
This should match the location-level 2SLS estimate

Balance: correlation of shocks with amenity: ~0.05 (p = ~0.85)

| BHJ Diagnostic | Result | Interpretation |
|---|---|---|
| Shock-level coefficient | ~0.50 | Matches location-level 2SLS |
| Balance test (shocks vs. amenity) | p ~0.85 | No evidence shocks are correlated with local confounders |
| Number of shocks (K) | 20 | Sufficient for quasi-random shock interpretation |

The balance test is the key BHJ diagnostic: it checks whether the national growth rates are correlated with the share-weighted local confounders (amenities). A high p-value (well above 0.05) means the test finds no evidence against the identifying assumption that shocks are quasi-randomly assigned; it does not prove that the assumption holds.
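
BHJ's equivalence result says that the location-level shift-share IV coefficient can be recovered from an exposure-weighted IV regression at the shock level, with the shocks themselves as the instrument. The base-R sketch below checks this numerically on its own illustrative simulation (all names and parameter values are assumptions). Because the shares sum to one and the outcome and treatment are demeaned, the weighted shock-level IV reduces to a simple ratio of weighted cross-products.

```r
# Illustrative simulation (assumed parameters; not the lab's data)
set.seed(3)
n_loc <- 100; n_ind <- 15
shares <- matrix(rgamma(n_loc * n_ind, shape = 2), nrow = n_loc)
shares <- shares / rowSums(shares)            # rows sum to one
g <- rnorm(n_ind, 0.02, 0.05)                 # national shocks
z <- as.numeric(shares %*% g)                 # shift-share instrument
u <- rnorm(n_loc, 0, 0.01)                    # local confounder
x <- z + u + rnorm(n_loc, 0, 0.01)            # endogenous treatment
y <- 0.5 * x + u + rnorm(n_loc, 0, 0.01)      # outcome

# Location-level just-identified IV
beta_location <- cov(z, y) / cov(z, x)

# Shock-level aggregates of the demeaned outcome and treatment
y_res <- y - mean(y); x_res <- x - mean(x)
s_k   <- colSums(shares)                      # exposure weights
y_bar <- as.numeric(t(shares) %*% y_res) / s_k
x_bar <- as.numeric(t(shares) %*% x_res) / s_k

# Exposure-weighted shock-level IV with g as the instrument
beta_shock <- sum(s_k * g * y_bar) / sum(s_k * g * x_bar)

cat("Location-level IV:", beta_location, "\n")
cat("Shock-level IV:   ", beta_shock, "\n")
```

The two numbers agree up to floating-point error. The shock-level view only reorganizes the same estimate; what it changes is inference and the way exogeneity is argued.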

Concept Check

In the BHJ framework, what is the key identifying assumption when there are many industries (K is large)?


Step 6: Rotemberg Weights

Goldsmith-Pinkham et al. (2020) provide an alternative decomposition based on the shares. Rotemberg weights reveal which industries are driving the estimate.

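# Rotemberg weights (simplified: no controls beyond the constant)
# alpha_k is proportional to g_k * s_k' x_res, where x is the endogenous regressor
x_res <- df$emp_growth - mean(df$emp_growth)
rotemberg <- numeric(K)
for (k in 1:K) {
  s_k <- raw_shares[, k]
  rotemberg[k] <- national_growth[k] * sum(s_k * x_res)
}
rotemberg <- rotemberg / sum(rotemberg)   # normalize so the weights sum to one

top5_rot <- order(-abs(rotemberg))[1:5]
cat("Top 5 Rotemberg weights:\n")
for (k in top5_rot) {
  cat(sprintf("  Industry %d: weight = %.4f, growth = %.4f\n",
              k, rotemberg[k], national_growth[k]))
}
cat("Number of negative weights:", sum(rotemberg < 0), "\n")
cat("Sum of negative weights:   ", sum(rotemberg[rotemberg < 0]), "\n")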

Expected output:

| Industry | Rotemberg Weight | National Growth |
|---|---|---|
| Industry 3 | ~0.25 | 0.085 |
| Industry 7 | ~0.18 | 0.072 |
| Industry 1 | ~0.15 | 0.045 |
| Industry 12 | ~-0.04 | -0.070 |
| Industry 15 | ~-0.03 | -0.055 |

Number of negative weights: ~5
Sum of negative weights:    ~-0.08

The concentration of Rotemberg weights in a few industries means the 2SLS estimate is effectively identified by demand shocks in those key sectors. If those particular shocks are not exogenous, the estimate may be biased.
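
The sense in which Rotemberg weights "drive" the estimate can be made precise: the Bartik 2SLS coefficient equals the weighted sum of the just-identified IV estimates that use each industry's share alone as the instrument, with weights built from the shifts and the covariance between each share and the endogenous regressor. The base-R sketch below verifies that decomposition on its own illustrative simulation, with the constant as the only control (all names and parameter values are assumptions).

```r
# Illustrative simulation (assumed parameters; not the lab's data)
set.seed(4)
n_loc <- 100; n_ind <- 10
shares <- matrix(rgamma(n_loc * n_ind, shape = 2), nrow = n_loc)
shares <- shares / rowSums(shares)
g <- rnorm(n_ind, 0.02, 0.05)                 # national shifts
z <- as.numeric(shares %*% g)                 # Bartik instrument
u <- rnorm(n_loc, 0, 0.01)                    # local confounder
x <- z + u + rnorm(n_loc, 0, 0.01)            # endogenous treatment
y <- 0.5 * x + u + rnorm(n_loc, 0, 0.01)      # outcome

x_res <- x - mean(x); y_res <- y - mean(y)

# Cross-products of each industry's share with demeaned treatment and outcome
sx <- as.numeric(t(shares) %*% x_res)
sy <- as.numeric(t(shares) %*% y_res)

# Rotemberg weight: alpha_k proportional to g_k * s_k' x_res
alpha <- g * sx / sum(g * sx)

# Just-identified IV using industry k's share alone as the instrument
beta_k <- sy / sx

beta_bartik <- cov(z, y) / cov(z, x)
cat("Sum of weights:        ", sum(alpha), "\n")
cat("Weighted sum of beta_k:", sum(alpha * beta_k), "\n")
cat("Bartik 2SLS:           ", beta_bartik, "\n")
```

Industries with large weights are the ones whose share exogeneity matters most for the overall estimate.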


Exercises

  1. Vary the number of industries. Re-run with K = 5 and K = 100. How does the precision of the 2SLS estimate and the BHJ balance test change?

  2. Add correlated shocks. Make some national growth rates correlated with the average amenity in CZs where that industry is concentrated. Does the 2SLS estimate become biased? Does the BHJ balance test detect it?

  3. Leave-one-out industry. Drop each industry one at a time from the Bartik instrument and re-estimate 2SLS. Plot the distribution of estimates. Is the result robust?

  4. Pre-trends test. Add a lagged dependent variable as a 'pre-trend' and test whether the Bartik instrument predicts it. This regression is a falsification test for the exogeneity assumption.
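
As a possible starting point for Exercise 3, the sketch below drops one industry at a time, rebuilds the instrument from the remaining industries, and re-estimates the just-identified IV coefficient. It simulates its own illustrative data in base R (assumed names and parameter values), and the helper `iv_coef` is hypothetical. Whether to renormalize the remaining shares after dropping an industry is a design choice left open here; this version does not.

```r
# Illustrative simulation (assumed parameters; not the lab's data)
set.seed(5)
n_loc <- 200; n_ind <- 20
shares <- matrix(rgamma(n_loc * n_ind, shape = 2), nrow = n_loc)
shares <- shares / rowSums(shares)
g <- rnorm(n_ind, 0.02, 0.05)
z <- as.numeric(shares %*% g)
u <- rnorm(n_loc, 0, 0.005)
x <- z + u + rnorm(n_loc, 0, 0.005)
y <- 0.5 * x + u + rnorm(n_loc, 0, 0.005)

# Hypothetical helper: just-identified IV coefficient
iv_coef <- function(z, x, y) cov(z, y) / cov(z, x)

# Leave-one-out: rebuild the instrument without industry k, then re-estimate
loo <- sapply(seq_len(n_ind), function(k) {
  z_minus_k <- as.numeric(shares[, -k, drop = FALSE] %*% g[-k])
  iv_coef(z_minus_k, x, y)
})

cat("Full-instrument estimate:", iv_coef(z, x, y), "\n")
print(summary(loo))
```

Plotting `loo` (for example with `hist(loo)`) shows how much any single industry moves the estimate.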


Summary

In this lab you learned:

  • The Bartik shift-share instrument combines local industry composition (shares) with national industry trends (shifts) to isolate exogenous variation in local employment
  • The first stage must be strong for 2SLS to be reliable; F > 10 is a common rule of thumb
  • BHJ (2022) show that with many industries, identification can come from quasi-random shocks, an assumption that can be probed with balance regressions at the shock level
  • Rotemberg weights reveal which industries drive the 2SLS estimate; negative weights signal that the estimate may be sensitive to heterogeneous effects across industries
  • Report the first stage, the reduced form, and the full set of diagnostics alongside the 2SLS coefficient