Tutorial · 120 minutes

Lab: Shift-Share (Bartik) Instruments

Construct and validate a Bartik shift-share IV for local employment shocks on wages: decompose the instrument, test assumptions, apply Borusyak-Hull-Jaravel.

Method: Shift-Share / Bartik Instruments
Languages: Python, R, Stata
Dataset: Simulated local labor market data (Bartik-style)

Overview

In this lab you will build a Bartik-style shift-share instrument from scratch. The classic application estimates how exogenous national industry demand shocks affect local labor market outcomes by interacting pre-period local industry employment shares with national industry growth rates.

What you will learn:

How to construct a shift-share (Bartik) instrument from industry shares and national shifts
How to estimate 2SLS using the Bartik instrument
The two identifying strategies: exogeneity of shares vs. exogeneity of shifts
How to implement Borusyak et al. (2022) diagnostics
How to test for relevance (first stage) and interpret reduced-form results

Prerequisites: Familiarity with IV/2SLS estimation and the concept of endogeneity. Completion of the IV tutorial lab is recommended.

Step 1: Simulate Local Labor Market Data

We simulate 200 commuting zones with employment across 20 industries. National demand shocks drive local employment changes.

# First-time setup: install.packages(c("fixest", "MASS"))
library(fixest)
library(MASS)

set.seed(42)
L <- 200; K <- 20  # 200 commuting zones (CZs), 20 industries

# Base-period industry employment shares via Dirichlet (gamma draws normalized)
# Each row is a CZ; each column is an industry; rows sum to 1
# shape < 1 concentrates each CZ in a few industries, so exposure varies across CZs
raw_shares <- matrix(rgamma(L * K, shape = 0.25), nrow = L)
raw_shares <- raw_shares / rowSums(raw_shares)

# National industry growth rates: the exogenous "shifts"
national_growth <- rnorm(K, 0.02, 0.05)

# Local confounder: amenities that affect both employment and wages (the endogeneity source)
local_amenity <- rnorm(L)

# Bartik instrument: B_l = sum_k(s_lk * g_k), share-weighted national shocks
bartik <- raw_shares %*% national_growth

# Endogenous local employment growth: driven by Bartik + confounded by local amenity
# Amenity has SD 1, so its coefficients are kept in growth-rate units;
# a larger coefficient would swamp the Bartik variation and weaken the first stage
emp_growth <- bartik + 0.03 * local_amenity + rnorm(L, 0, 0.02)

# Wage growth: true causal effect of emp_growth is 0.5; amenity also affects wages directly
wage_growth <- 0.01 + 0.5 * emp_growth + 0.02 * local_amenity + rnorm(L, 0, 0.03)

df <- data.frame(cz = 1:L, wage_growth = as.numeric(wage_growth),
                 emp_growth = as.numeric(emp_growth),
                 bartik = as.numeric(bartik), amenity = local_amenity)

# Mean, SD, min, and max per variable, then the first rows and the key correlations
vars <- c("wage_growth", "emp_growth", "bartik", "amenity")
print(round(t(sapply(df[vars], function(v) c(Mean = mean(v), SD = sd(v), Min = min(v), Max = max(v)))), 3))
print(head(df, 5))
cat("Correlation(emp_growth, amenity):", round(cor(df$emp_growth, df$amenity), 2), "\n")
cat("Correlation(bartik, amenity):    ", round(cor(df$bartik, df$amenity), 2), "\n")

Requires: fixest, MASS

Expected output:

Variable	Mean	Std Dev	Min	Max
wage_growth	0.024	0.055	-0.12	0.18
emp_growth	0.022	0.040	-0.08	0.13
bartik	0.010	0.020	-0.04	0.06
amenity	0.00	1.00	-2.8	3.1

cz	wage_growth	emp_growth	bartik	amenity
1	0.032	0.040	0.015	0.50
2	-0.015	-0.010	-0.008	-0.22
3	0.058	0.055	0.022	1.10
4	0.010	0.018	0.005	0.35
5	0.045	0.035	0.018	0.60

Correlation(emp_growth, amenity): ~0.85
Correlation(bartik, amenity):     ~0.02

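The high correlation between emp_growth and amenity confirms the endogeneity problem. The near-zero correlation between the Bartik instrument and amenity is consistent with instrument exogeneity. In real data the confounder is unobserved, so this direct check is only available in a simulation.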

Step 2: Construct the Bartik Instrument

The Bartik instrument is $B_\ell = \sum_k s_{\ell k} \cdot g_k$ , where $s_{\ell k}$ is location $\ell$ 's employment share in industry $k$ at baseline, and $g_k$ is the national growth rate of industry $k$ .
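To make the formula concrete, here is a minimal worked example in Python with made-up numbers (two locations, three industries; none of these values come from the lab's simulation). It only illustrates the share-weighted sum.

```python
# Toy inputs (hypothetical): 2 locations, 3 industries
shares = [
    [0.5, 0.3, 0.2],  # location A: baseline employment shares, sum to 1
    [0.1, 0.2, 0.7],  # location B
]
growth = [0.04, -0.02, 0.01]  # national growth rate g_k of each industry


def bartik(share_row, shifts):
    """Share-weighted sum of national shifts: B_l = sum_k s_lk * g_k."""
    return sum(s * g for s, g in zip(share_row, shifts))


instrument = [bartik(row, growth) for row in shares]
for name, b in zip(["A", "B"], instrument):
    print(f"Location {name}: B = {b:.4f}")
# Location A: B = 0.0160
# Location B: B = 0.0070
```

Location A is more exposed to the fast-growing first industry, so its predicted growth is higher even though both locations face the same national shocks.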

# Manual construction: B_l = sum_k(s_lk * g_k), loop over industries
bartik_manual <- rep(0, L)
for (k in 1:K) {
  bartik_manual <- bartik_manual + raw_shares[, k] * national_growth[k]
}
# Verify manual construction matches the matrix product
cat("Max difference:", max(abs(bartik_manual - bartik)), "\n")

# Decompose: which industries contribute most to the Bartik instrument?
# Contribution_k = avg_share_k * national_growth_k
avg_shares <- colMeans(raw_shares)
contributions <- avg_shares * national_growth
top5 <- order(-abs(contributions))[1:5]
cat("\nTop 5 contributing industries:\n")
for (k in top5) {
  cat(sprintf("  Industry %d: avg share = %.3f, growth = %.4f, contribution = %.5f\n",
              k, avg_shares[k], national_growth[k], contributions[k]))
}

Expected output:

Max difference: ~0 (at most floating-point rounding error)

Expected output: Top 5 contributing industries

Industry	Avg Share	National Growth	Contribution
Industry 3	0.055	0.085	0.00468
Industry 12	0.062	-0.070	-0.00434
Industry 7	0.048	0.072	0.00346
Industry 15	0.051	-0.055	-0.00281
Industry 1	0.058	0.045	0.00261

Industries with both large average shares and large (positive or negative) growth rates contribute most to the Bartik instrument. Positive contributions come from growing industries with large local presence; negative contributions come from declining industries.

Step 3: First Stage and Reduced Form

Estimate the first stage (employment growth on Bartik) and reduced form (wages on Bartik) separately before running 2SLS.

# First stage: emp_growth = alpha + gamma * bartik + epsilon
# Tests relevance: does the instrument predict the endogenous variable?
first_stage <- feols(emp_growth ~ bartik, data = df, vcov = "hetero")
cat("=== First Stage ===\n")
print(summary(first_stage))
# With a single instrument, the first-stage F equals the squared (robust) t-statistic
cat("First-stage F:", (coef(first_stage)["bartik"] / se(first_stage)["bartik"])^2, "\n")

# Reduced form: wage_growth = alpha + pi * bartik + epsilon
# Shows the total (causal) effect of the instrument on the outcome
reduced_form <- feols(wage_growth ~ bartik, data = df, vcov = "hetero")
cat("\n=== Reduced Form ===\n")
print(summary(reduced_form))

# Wald estimate = reduced form / first stage = IV estimate in the just-identified case
wald <- coef(reduced_form)["bartik"] / coef(first_stage)["bartik"]
cat("\nWald estimate:", wald, "\n")

Expected output:

Regression	Variable	Coefficient	SE	F-stat
First stage (emp ~ bartik)	bartik	~1.00	~0.10	~85
Reduced form (wage ~ bartik)	bartik	~0.50	~0.08	—

=== First Stage ===
Coefficient on Bartik: ~1.00
F-statistic: ~85
R-squared: ~0.25

=== Reduced Form ===
Coefficient on Bartik: ~0.50

Wald (IV) estimate: ~0.50
True effect: 0.5
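The link between the first stage, the reduced form, and the IV estimate can be checked in a few lines. The sketch below uses only the Python standard library and a hypothetical data-generating process (a generic instrument `z` and confounder `a`, not the lab's commuting-zone simulation). It shows that the Wald ratio equals the covariance-ratio IV estimator and computes a homoskedastic first-stage F.

```python
import random

random.seed(0)
n = 500

# Hypothetical DGP (an assumption for illustration): z is the instrument,
# a is an unobserved confounder, and the true effect of x on y is 0.5
z = [random.gauss(0, 1) for _ in range(n)]
a = [random.gauss(0, 1) for _ in range(n)]
x = [zi + 0.5 * ai + random.gauss(0, 0.5) for zi, ai in zip(z, a)]
y = [0.5 * xi + 0.5 * ai + random.gauss(0, 0.5) for xi, ai in zip(x, a)]


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)


def slope(dep, reg):
    """OLS slope from regressing dep on reg (with an intercept)."""
    return cov(reg, dep) / cov(reg, reg)


first_stage = slope(x, z)        # effect of instrument on endogenous regressor
reduced_form = slope(y, z)       # effect of instrument on outcome
wald = reduced_form / first_stage
iv_direct = cov(z, y) / cov(z, x)  # just-identified IV estimator

# Homoskedastic first-stage F; with one instrument it is the squared t-statistic
mx, mz = mean(x), mean(z)
resid = [xi - mx - first_stage * (zi - mz) for xi, zi in zip(x, z)]
se = (sum(r * r for r in resid) / (n - 2) / sum((zi - mz) ** 2 for zi in z)) ** 0.5
f_stat = (first_stage / se) ** 2

print(f"First stage: {first_stage:.3f}  (F = {f_stat:.1f})")
print(f"Reduced form: {reduced_form:.3f}")
print(f"Wald: {wald:.3f}  IV (cov ratio): {iv_direct:.3f}  OLS: {slope(y, x):.3f}")
```

Because the variance of `z` cancels in the ratio, the Wald estimate and the covariance-ratio estimate agree up to floating-point error.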

Concept Check

The first-stage F-statistic is well above the Staiger-Stock screening threshold of 10 (say F = 85). What does this tell you?

The Bartik instrument is a valid (exogenous) instrument.
The Bartik instrument is strongly correlated with the endogenous variable (relevance condition satisfied), so weak-instrument bias is unlikely to be a problem.
The 2SLS estimate is consistent.
You can safely ignore the reduced-form regression.

Step 4: 2SLS Estimation

# OLS: biased because emp_growth is confounded by local_amenity
ols <- feols(wage_growth ~ emp_growth, data = df, vcov = "hetero")

# 2SLS: fixest IV syntax is Y ~ exog | FE | endog ~ instruments
# "1" = intercept only as exogenous regressor; "0" = no fixed effects
iv <- feols(wage_growth ~ 1 | 0 | emp_growth ~ bartik, data = df, vcov = "hetero")

cat("=== Comparison ===\n")
cat("OLS:", coef(ols)["emp_growth"], "\n")
cat("2SLS:", coef(iv)["fit_emp_growth"], "\n")  # fixest labels IV-fitted variable "fit_..."
cat("True:", 0.5, "\n")

# Side-by-side table for publication-style comparison
etable(ols, iv, headers = c("OLS", "2SLS"))

Requires: fixest

Expected output:

Estimator	Coefficient	SE	True Effect
OLS (biased)	~0.75	~0.04	0.50
2SLS (Bartik IV)	~0.50	~0.10	0.50

=== Comparison ===
OLS coefficient:  ~0.75  (SE: ~0.04)
2SLS coefficient: ~0.50  (SE: ~0.10)
True effect:       0.50

OLS is biased upward because emp_growth is correlated with amenity
2SLS corrects by using only Bartik-driven variation

OLS overestimates the true effect by ~50% because local amenities positively affect both employment growth and wages. The 2SLS estimate, which isolates the national-shock-driven component, is much closer to the truth. Note the 2SLS standard error is larger, reflecting the efficiency cost of instrumenting.
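The direction of the OLS bias follows from the omitted-variable formula. As a sketch, write the simulated model in generic symbols (the symbols $\beta$, $\delta$, $\gamma$, $a_\ell$, $e_\ell$, $u_\ell$ are notation introduced here, not taken from the lab): wage growth is $y_\ell = \alpha + \beta x_\ell + \delta a_\ell + u_\ell$ and employment growth is $x_\ell = B_\ell + \gamma a_\ell + e_\ell$, where the amenity $a_\ell$, the instrument $B_\ell$, and the noise terms $e_\ell$, $u_\ell$ are mutually independent. Then

$$
\hat\beta_{\text{OLS}} \;\xrightarrow{p}\; \beta + \delta\,\frac{\operatorname{Cov}(x_\ell, a_\ell)}{\operatorname{Var}(x_\ell)} \;=\; \beta + \frac{\delta\,\gamma\,\operatorname{Var}(a_\ell)}{\operatorname{Var}(x_\ell)},
\qquad
\hat\beta_{\text{IV}} \;\xrightarrow{p}\; \frac{\operatorname{Cov}(B_\ell, y_\ell)}{\operatorname{Cov}(B_\ell, x_\ell)} \;=\; \beta .
$$

The bias term is positive whenever $\delta$ and $\gamma$ share a sign, which is the case when amenities raise both employment growth and wages. The IV limit equals $\beta$ because $B_\ell$ is uncorrelated with both $a_\ell$ and $u_\ell$.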

Step 5: Borusyak-Hull-Jaravel Diagnostics

Borusyak et al. (2022) show that with many industries, identification can come from the exogeneity of the national shocks $g_k$ rather than the shares $s_{\ell k}$ . The key diagnostic is the "shock-level" regression.
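The equivalence behind the shock-level regression can be verified numerically. The sketch below (standard-library Python, with a small hypothetical simulation that is separate from the lab's data) computes the location-level IV coefficient and the exposure-weighted shock-level IV coefficient, where the share-weighted averages of the demeaned outcome and regressor are instrumented by the shocks $g_k$. The two should agree up to rounding.

```python
import random

random.seed(1)
L, K = 50, 8  # hypothetical sizes, smaller than the lab's 200 x 20

# Shares: normalized gamma draws, so each location's shares sum to 1
shares = []
for _ in range(L):
    raw = [random.gammavariate(1.0, 1.0) for _ in range(K)]
    total = sum(raw)
    shares.append([r / total for r in raw])

g = [random.gauss(0.02, 0.05) for _ in range(K)]  # national shocks
bartik = [sum(s * gk for s, gk in zip(row, g)) for row in shares]
x = [b + random.gauss(0, 0.02) for b in bartik]          # regressor
y = [0.5 * xi + random.gauss(0, 0.02) for xi in x]       # outcome


def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


y_t, x_t = demean(y), demean(x)  # partial out the intercept

# Location-level IV: instrument is the Bartik variable
beta_location = (sum(b * yi for b, yi in zip(bartik, y_t))
                 / sum(b * xi for b, xi in zip(bartik, x_t)))

# Shock-level aggregates: exposure weights and share-weighted averages
w = [sum(shares[l][k] for l in range(L)) for k in range(K)]
Y_k = [sum(shares[l][k] * y_t[l] for l in range(L)) / w[k] for k in range(K)]
X_k = [sum(shares[l][k] * x_t[l] for l in range(L)) / w[k] for k in range(K)]

# Shock-level IV: weighted by exposure, X_k instrumented by the shocks g_k
beta_shock = (sum(w[k] * g[k] * Y_k[k] for k in range(K))
              / sum(w[k] * g[k] * X_k[k] for k in range(K)))

print(f"Location-level IV: {beta_location:.6f}")
print(f"Shock-level IV:    {beta_shock:.6f}")
```

The match is an algebraic identity: summing $w_k g_k \bar{y}_k$ over industries reproduces the sum of $B_\ell \tilde{y}_\ell$ over locations. Note that a plain weighted OLS of the averaged outcome on the averaged regressor, without using $g_k$ as the instrument, would not reproduce the location-level estimate.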

# BHJ shock-level regression: rewrite the IV at the industry (shock) level
# Demean the location-level variables first (this partials out the intercept)
y_tilde <- df$wage_growth - mean(df$wage_growth)
x_tilde <- df$emp_growth - mean(df$emp_growth)

# For each industry k, compute exposure-weighted averages of outcomes and regressors
Y_k <- numeric(K); X_k <- numeric(K); weight_k <- numeric(K)
for (k in 1:K) {
  s_k <- raw_shares[, k]                  # industry k's shares across all CZs
  weight_k[k] <- sum(s_k)                 # total exposure to industry k
  Y_k[k] <- sum(s_k * y_tilde) / sum(s_k) # share-weighted outcome
  X_k[k] <- sum(s_k * x_tilde) / sum(s_k) # share-weighted regressor
}

shock_df <- data.frame(Y_k, X_k, g_k = national_growth, weight_k)

# Exposure-weighted IV at the shock level: X_k is instrumented by the shocks g_k
# (the coefficient is numerically equal to the location-level 2SLS coefficient)
shock_reg <- feols(Y_k ~ 1 | 0 | X_k ~ g_k, data = shock_df, weights = ~weight_k)
cat("Shock-level coefficient:", coef(shock_reg)["fit_X_k"], "\n")

# Balance test: are national shocks correlated with share-weighted local confounders?
# A high p-value is consistent with the assumption that shocks are quasi-random
amenity_k <- sapply(1:K, function(k) {
  sum(raw_shares[, k] * df$amenity) / sum(raw_shares[, k])
})
balance <- lm(amenity_k ~ national_growth, weights = weight_k)
cat("Balance test p-value:", summary(balance)$coefficients[2, 4], "\n")

Expected output:

=== Shock-Level Regression (BHJ) ===
Coefficient: ~0.50
This should match the location-level 2SLS estimate

Balance: correlation of shocks with amenity: ~0.05 (p = ~0.85)

BHJ Diagnostic	Result	Interpretation
Shock-level coefficient	~0.50	Matches location-level 2SLS
Balance test (shocks vs. amenity)	p ~0.85	No evidence shocks are correlated with local confounders
Number of shocks (K)	20	Modest: the quasi-random shock interpretation relies on many shocks with dispersed exposure

The balance test is the key BHJ diagnostic: it checks whether the national growth rates are correlated with the share-weighted local confounders (amenities). A high p-value (well above 0.05) means the test finds no evidence against quasi-random assignment of the shocks. This is consistent with the identifying assumption but does not prove it.
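A complementary check that BHJ report is the effective number of shocks, measured as the inverse Herfindahl index (HHI) of the normalized exposure weights. The sketch below is a minimal Python illustration with hypothetical exposure vectors; the function name `effective_shocks` is introduced here for illustration only.

```python
def effective_shocks(exposure):
    """Inverse HHI of exposure weights after normalizing them to sum to 1."""
    total = sum(exposure)
    normalized = [e / total for e in exposure]
    hhi = sum(w * w for w in normalized)
    return 1.0 / hhi


# Hypothetical exposure vectors for 20 industries
equal = [1.0] * 20                  # every industry equally important
concentrated = [10.0] + [0.5] * 19  # one industry dominates

print(f"Equal exposure:        {effective_shocks(equal):.2f}")         # 20.00
print(f"Concentrated exposure: {effective_shocks(concentrated):.2f}")  # 3.63
```

With equal exposure all 20 shocks count. When one industry dominates, the estimate behaves as if it rested on fewer than four shocks, and shock-level inference becomes correspondingly less reliable.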

Concept Check

In the BHJ framework, what is the key identifying assumption when there are many industries (K is large)?

Local industry shares must be randomly assigned across locations.
The national industry growth rates (shocks) must be as-good-as-randomly assigned, conditional on controls.
Employment growth must be strictly exogenous.
Both shares and shocks must be exogenous.

Step 6: Rotemberg Weights

Goldsmith-Pinkham et al. (2020) provide an alternative decomposition based on the shares. Rotemberg weights reveal which industries are driving the estimate.

# Rotemberg weights (Goldsmith-Pinkham et al. 2020): decompose 2SLS by industry
# alpha_k measures how much industry k contributes to the overall IV estimate
rotemberg <- numeric(K)
x_tilde <- df$emp_growth - mean(df$emp_growth)  # demeaned endogenous regressor
for (k in 1:K) {
  s_k <- raw_shares[, k]
  # Unnormalized weight = g_k * sum_l(s_lk * x_tilde_l): shock times share-regressor covariance
  # (the weights use the endogenous regressor, not the outcome)
  rotemberg[k] <- national_growth[k] * sum(s_k * x_tilde)
}
rotemberg <- rotemberg / sum(rotemberg)  # normalize to sum to 1

top5_rot <- order(-abs(rotemberg))[1:5]
cat("Top 5 Rotemberg weights:\n")
for (k in top5_rot) {
  cat(sprintf("  Industry %d: weight = %.4f, growth = %.4f\n",
              k, rotemberg[k], national_growth[k]))
}
# Negative weights indicate industries pulling the estimate in the opposite direction
cat("Number of negative weights:", sum(rotemberg < 0), "\n")
cat("Sum of negative weights:   ", sum(rotemberg[rotemberg < 0]), "\n")

Expected output:

Industry	Rotemberg Weight	National Growth
Industry 3	~0.25	0.085
Industry 7	~0.18	0.072
Industry 1	~0.15	0.045
Industry 12	~-0.04	-0.070
Industry 15	~-0.03	-0.055

Number of negative weights: ~5
Sum of negative weights:    ~-0.08

The concentration of Rotemberg weights in a few industries means the 2SLS estimate is effectively identified by demand shocks in those key sectors. If those particular shocks are not exogenous, the estimate may be biased.
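The sense in which the weights "decompose" the estimate can be checked directly: the Bartik 2SLS coefficient equals the Rotemberg-weighted sum of just-identified IV estimates that use each industry's share as the only instrument. The Python sketch below verifies this on a small hypothetical simulation (standard library only; sizes and coefficients are assumptions for illustration).

```python
import random

random.seed(2)
L, K = 60, 6  # hypothetical sizes

shares = []
for _ in range(L):
    raw = [random.gammavariate(1.0, 1.0) for _ in range(K)]
    total = sum(raw)
    shares.append([r / total for r in raw])

g = [random.gauss(0.02, 0.05) for _ in range(K)]
bartik = [sum(s * gk for s, gk in zip(row, g)) for row in shares]
amenity = [random.gauss(0, 1) for _ in range(L)]
x = [b + 0.02 * a + random.gauss(0, 0.02) for b, a in zip(bartik, amenity)]
y = [0.5 * xi + 0.02 * a + random.gauss(0, 0.02) for xi, a in zip(x, amenity)]


def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


x_t, y_t = demean(x), demean(y)

# Cross-products of each industry's share with the demeaned regressor and outcome
sx = [sum(shares[l][k] * x_t[l] for l in range(L)) for k in range(K)]
sy = [sum(shares[l][k] * y_t[l] for l in range(L)) for k in range(K)]

# Rotemberg weights: g_k * (s_k' x), normalized to sum to 1
unnormalized = [g[k] * sx[k] for k in range(K)]
alpha = [u / sum(unnormalized) for u in unnormalized]

# Just-identified IV estimate using industry k's share as the only instrument
beta_k = [sy[k] / sx[k] for k in range(K)]

beta_bartik = (sum(b * yi for b, yi in zip(bartik, y_t))
               / sum(b * xi for b, xi in zip(bartik, x_t)))
beta_decomposed = sum(a * b for a, b in zip(alpha, beta_k))

print(f"Bartik 2SLS:        {beta_bartik:.6f}")
print(f"Sum alpha_k*beta_k: {beta_decomposed:.6f}")
print(f"Negative weights:   {sum(a < 0 for a in alpha)}")
```

Because the weights are built from the regressor rather than the outcome, an industry with a large weight is one whose share strongly predicts employment growth once scaled by its national shock.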

Exercises

Vary the number of industries. Re-run with K = 5 and K = 100. How does the precision of the 2SLS estimate and the BHJ balance test change?
Add correlated shocks. Make some national growth rates correlated with the average amenity in CZs where that industry is concentrated. Does the 2SLS estimate become biased? Does the BHJ balance test detect it?
Leave-one-out industry. Drop each industry one at a time from the Bartik instrument and re-estimate 2SLS. Plot the distribution of estimates. Is the result robust?
Pre-trends test. Add a lagged dependent variable as a 'pre-trend' and test whether the Bartik instrument predicts it. This regression is a falsification test for the exogeneity assumption.
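As a possible starting point for the leave-one-out exercise, the sketch below rebuilds the instrument with one industry's term removed and re-estimates the IV slope each time. It is written in standard-library Python on a small hypothetical simulation. The helper `iv_slope` and all parameter values are assumptions for illustration, and the remaining shares are deliberately not renormalized.

```python
import math
import random

random.seed(3)
L, K = 80, 6  # hypothetical sizes

shares = []
for _ in range(L):
    raw = [random.gammavariate(1.0, 1.0) for _ in range(K)]
    total = sum(raw)
    shares.append([r / total for r in raw])

g = [random.gauss(0.02, 0.05) for _ in range(K)]
bartik = [sum(s * gk for s, gk in zip(row, g)) for row in shares]
amenity = [random.gauss(0, 1) for _ in range(L)]
x = [b + 0.03 * a + random.gauss(0, 0.02) for b, a in zip(bartik, amenity)]
y = [0.01 + 0.5 * xi + 0.02 * a + random.gauss(0, 0.03)
     for xi, a in zip(x, amenity)]


def iv_slope(z, x, y):
    """Just-identified IV slope with an intercept: Cov(z, y) / Cov(z, x)."""
    mz = sum(z) / len(z)
    num = sum((zi - mz) * yi for zi, yi in zip(z, y))
    den = sum((zi - mz) * xi for zi, xi in zip(z, x))
    return num / den


full_estimate = iv_slope(bartik, x, y)

# Drop one industry's term from the instrument at a time and re-estimate
loo_estimates = []
for drop in range(K):
    z = [sum(row[k] * g[k] for k in range(K) if k != drop) for row in shares]
    loo_estimates.append(iv_slope(z, x, y))

print(f"Full instrument: {full_estimate:.3f}")
for drop, est in enumerate(loo_estimates, start=1):
    print(f"  without industry {drop}: {est:.3f}")
```

A wide spread across the leave-one-out estimates suggests that one or two industries carry most of the identifying variation, which ties this exercise to the Rotemberg weights in Step 6.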

Expected output

If your code runs correctly, expect to see:

OLS (biased): Coefficient on employment growth around 0.6–0.9, biased upward because local amenities affect both employment and wages
First-stage F-statistic: Well above the Staiger-Stock 1997 screening threshold of 10 (typically 50–200, often above the LMMP 2022 F > 104.7 just-identified threshold), confirming the Bartik instrument is relevant
2SLS estimate: Around 0.4–0.6 (true value: 0.5), closer to the truth than OLS after removing the amenity bias
Reduced form: Positive and significant effect of the Bartik instrument on wage growth
Bartik-amenity correlation: Near zero (by construction, national shocks are independent of local amenities)
BHJ balance test: The shock-level balance regression shows no significant relationship between national industry growth and share-weighted local amenities (p > 0.05)
Rotemberg weights: Most weights positive; a few industries with large positive weights drive the 2SLS estimate
Sample size: 200 commuting zones, 20 industries

Summary

In this lab you learned:

The Bartik shift-share instrument combines local industry composition (shares) with national industry trends (shifts) to isolate exogenous variation in local employment
The first stage must be strong: the Staiger-Stock (1997) screening rule is F > 10, and Lee, McCrary, Moreira, Porter (2022) show that 5% t-tests using the conventional 1.96 critical value are valid in the just-identified case only when F > 104.7
Borusyak et al. (2022) show that with many industries, identification can come from quasi-random shocks; balance regressions at the shock level can probe the plausibility of this assumption
Rotemberg weights reveal which industries drive the 2SLS estimate; large negative weights mean the estimate is not a convex combination of the industry-specific estimates, which signals fragility when effects are heterogeneous
Report the first stage, reduced form, and the full set of diagnostics alongside the 2SLS coefficient
