MethodAtlas
Tutorial · 2 hours

Lab: Regression Discontinuity Design from Scratch

Implement a sharp regression discontinuity design step by step. Simulate data with a cutoff, estimate the treatment effect using local linear regression, choose optimal bandwidth, and test for manipulation.

Overview

Regression Discontinuity Design (RDD) exploits situations where treatment is assigned based on whether a running variable (also called a forcing variable or score) crosses a known cutoff. Near the cutoff, treatment assignment is essentially random, giving us a credible causal estimate.

What you will learn:

  • How to set up and visualize an RD design
  • How to estimate the treatment effect using local linear regression
  • How to choose bandwidth and assess sensitivity
  • How to test the no-manipulation assumption (McCrary density test)
  • How to run validity checks (covariate balance at the cutoff)

Prerequisites: OLS regression (see the OLS lab).


Step 1: The Setting

Imagine a university that places students on academic probation if their GPA falls below 2.0. We want to estimate whether academic probation affects graduation rates. The running variable is GPA, the cutoff is 2.0, and the outcome is whether the student eventually graduates.

Students just below 2.0 are placed on probation; students just above 2.0 are not. The key insight: students at 1.99 GPA and 2.01 GPA are expected to be very similar on observed and unobserved characteristics — the primary difference is probation status.
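Formally, the sharp RD estimand is the jump in the conditional expectation of the outcome at the cutoff c (here c = 0 after centering GPA at 2.0). Because treatment in this example is assigned *below* the cutoff, the treated limit is the one from below:

```latex
\tau_{\text{SRD}} \;=\; \lim_{x \uparrow c} \mathbb{E}[Y_i \mid X_i = x] \;-\; \lim_{x \downarrow c} \mathbb{E}[Y_i \mid X_i = x]
```

Continuity of the potential-outcome regressions at c is what lets this one-sided limit comparison identify the causal effect of probation for students right at the threshold.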


Step 2: Simulate the Data

library(rdrobust)
library(rddensity)
library(ggplot2)

set.seed(42)
n <- 5000

# Running variable: GPA centered at 2.0
gpa_raw <- pmin(pmax(rnorm(n, 2.5, 0.7), 0), 4.0)
X <- gpa_raw - 2.0

# Treatment: probation if GPA < 2.0 (X < 0)
D <- as.integer(X < 0)

# Potential outcomes
tau <- 0.08
Y0 <- 0.5 + 0.15 * X + 0.02 * X^2 + rnorm(n, 0, 0.15)
Y1 <- Y0 + tau
Y <- D * Y1 + (1 - D) * Y0
Y <- pmin(pmax(Y, 0), 1)

df <- data.frame(gpa = gpa_raw, X = X, probation = D, graduated = Y)

cat("Sample size:", n, "\n")
cat("On probation:", sum(D), "(", round(mean(D)*100,1), "%)\n")
cat("Mean graduation:", round(mean(Y), 3), "\n")

Expected output (first rows of df):

  gpa     X        probation  graduated
  2.733   0.733    0          0.614
  2.403   0.403    0          0.549
  3.154   1.154    0          0.684
  3.567   1.567    0          0.751
  1.862   -0.138   1          0.574

Summary statistics:

  Statistic               Value
  Sample size             5,000
  Students on probation   ~1,100 (approximately 22%)
  Mean graduation rate    ~0.56
  GPA range               [0.0, 4.0]
  Mean GPA                ~2.50

Step 3: Visualize the Discontinuity

The first and most important step in any RD analysis is to plot the data. Bin the running variable and plot mean outcomes within each bin.

# RD plot using rdrobust package
rdplot(y = df$graduated, x = df$X, c = 0,
     title = "RD Plot: Academic Probation and Graduation",
     x.label = "GPA - 2.0 (Running Variable)",
     y.label = "Graduation Rate")
Requiresrdrobust
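If you want to see what rdplot is doing under the hood, here is a minimal hand-rolled binned-means sketch with ggplot2, assuming 40 evenly spaced bins (the bin count is an illustrative choice, not the package's data-driven selection):

```r
library(ggplot2)

# Bin the running variable and compute the mean outcome within each bin
n_bins <- 40
df$bin <- cut(df$X,
              breaks = seq(min(df$X), max(df$X), length.out = n_bins + 1),
              include.lowest = TRUE)
binned <- aggregate(cbind(graduated = df$graduated, X = df$X),
                    by = list(bin = df$bin), FUN = mean)

# Binned scatter with a dashed line at the cutoff
ggplot(binned, aes(x = X, y = graduated)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "GPA - 2.0 (Running Variable)", y = "Graduation Rate")
```

The discontinuity at zero should be visible to the eye before any regression is run; if it is not, no amount of estimation will make the design convincing.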

Step 4: Estimate the Treatment Effect

The standard RD estimator uses local linear regression: fit a separate linear regression on each side of the cutoff, using only observations within a bandwidth h of the cutoff.

# Using rdrobust (the standard approach)
rd_result <- rdrobust(y = df$graduated, x = df$X, c = 0)
summary(rd_result)

# The default uses MSE-optimal bandwidth selection
# and bias-corrected confidence intervals
Requires: rdrobust

Expected output: Local linear RD estimate (manual, bandwidth = 0.5)

  Variable    Coeff    SE      t       p
  Intercept   0.498    0.007   71.1    0.000
  probation   0.079    0.017   4.65    0.000
  X           0.154    0.022   7.00    0.000
  X_D         -0.018   0.044   -0.41   0.684

  Detail             Value
  Method             Local linear regression, HC1 robust SEs
  Bandwidth (h)      0.5 GPA points
  N in bandwidth     ~3,200
  Treatment effect   ~0.079 (true value: 0.08)
  95% CI             [0.045, 0.113]
  Kernel             Uniform (rectangular)

The rdrobust estimate with data-driven MSE-optimal bandwidth will produce a point estimate near 0.08, with a bandwidth typically between 0.3 and 0.8 GPA points and bias-corrected confidence intervals that cover the true value.
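The manual estimates in the table above can be reproduced with a plain lm() fit restricted to the bandwidth; a sketch, assuming h = 0.5 and HC1 robust standard errors via the sandwich and lmtest packages (not loaded earlier in this lab):

```r
library(sandwich)
library(lmtest)

h <- 0.5
local <- subset(df, abs(X) <= h)

# Separate intercepts and slopes on each side of the cutoff;
# the coefficient on `probation` is the RD treatment effect,
# and probation:X (X_D in the table) lets the slope differ below the cutoff
fit <- lm(graduated ~ probation * X, data = local)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```

This transparency is useful pedagogically, but rdrobust remains the recommended tool in practice because it also handles bandwidth selection and bias correction.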

Concept Check

Why do we use local linear regression (fitting lines near the cutoff) rather than a global polynomial (fitting a curve through all the data)?


Step 5: Bandwidth Sensitivity

The choice of bandwidth h involves a bias-variance tradeoff. A smaller bandwidth means less bias (more "local" comparison) but fewer observations and more variance. A larger bandwidth means more data but potentially more bias.

# Bandwidth sensitivity
bandwidths <- c(0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5)

for (h in bandwidths) {
  rd <- rdrobust(y = df$graduated, x = df$X, c = 0, h = h)
  cat(sprintf("h = %.1f: estimate = %.4f, SE = %.4f, N = %d\n",
              h, rd$coef[1], rd$se[3], rd$N_h[1] + rd$N_h[2]))
}
cat("\nTrue effect: 0.08\n")
cat("Estimate should be stable across bandwidths.\n")

Requires: rdrobust

Expected output: Bandwidth sensitivity

  Bandwidth (h)   Estimate   SE      N in bandwidth
  0.2             0.082      0.029   ~1,300
  0.3             0.080      0.023   ~1,950
  0.4             0.078      0.019   ~2,600
  0.5             0.079      0.017   ~3,200
  0.7             0.076      0.014   ~3,900
  1.0             0.074      0.012   ~4,500
  1.5             0.070      0.010   ~4,900

The true treatment effect is 0.08. Estimates are stable across bandwidths from 0.2 to 1.0, ranging between approximately 0.07 and 0.08. At larger bandwidths (1.0+), estimates drift slightly downward as the linear specification picks up curvature from the quadratic term in the DGP. This stability is reassuring and characteristic of a well-behaved RD design.


Step 6: Test for Manipulation (McCrary Test)

The key RD assumption is that units cannot precisely manipulate the running variable to sort above or below the cutoff. If students can manipulate their GPA to avoid probation, there will be an unusual density of students just above 2.0 and a gap just below 2.0.

# McCrary-style density test
library(rddensity)
density_test <- rddensity(X = df$X, c = 0)
summary(density_test)

# Visual
rdplotdensity(density_test, df$X,
              title = "Density Test at Cutoff",
              xlabel = "GPA - 2.0")

Requires: rddensity

Expected output: McCrary density test

  Test                                 Result
  McCrary-style density test p-value   > 0.05 (not significant)
  Interpretation                       No evidence of manipulation at the cutoff

Since the data is simulated from a smooth normal distribution with no manipulation, the p-value should be well above 0.05, confirming that the density of the running variable is continuous through the cutoff.
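A complementary eyeball check (a sketch, not a formal test) is a histogram of the running variable with bin edges aligned at the cutoff; under no manipulation there should be no spike just above zero or gap just below it. The bin width of 0.05 here is an illustrative choice:

```r
library(ggplot2)

# Histogram with a bin boundary exactly at the cutoff (boundary = 0)
ggplot(df, aes(x = X)) +
  geom_histogram(binwidth = 0.05, boundary = 0,
                 fill = "grey70", color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "GPA - 2.0 (Running Variable)", y = "Count")
```

If students could retake courses to nudge a 1.98 up to a 2.01, this plot would show a visible pile-up in the first bin to the right of the dashed line.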


Step 7: Covariate Balance at the Cutoff

If the RD design is valid, pre-determined covariates should be balanced at the cutoff. We can test this by running the RD regression with covariates as the outcome.

# Add covariates
df$female <- rbinom(n, 1, 0.5)
df$age <- 18 + rpois(n, 2)
df$sat_score <- 1000 + 100 * (df$gpa - 2.5) + rnorm(n, 0, 80)

# Balance tests
covariates <- c("female", "age", "sat_score")
for (cov in covariates) {
  rd <- rdrobust(y = df[[cov]], x = df$X, c = 0)
  cat(sprintf("%s: coef = %.3f, p = %.3f\n",
              cov, rd$coef[1], rd$pv[3]))
}
cat("\nAll p-values should be large (no discontinuities)\n")

Requires: rdrobust

Expected output: Covariate balance tests at cutoff (bandwidth = 0.5)

  Covariate   Coeff   SE     p-value   Balanced?
  female      -0.01   0.04   0.75      Yes
  age         0.05    0.12   0.69      Yes
  sat_score   1.20    3.50   0.73      Yes

All p-values are large (well above 0.05), indicating no statistically significant discontinuities in pre-determined covariates at the cutoff. This finding is expected: female and age are generated independently of GPA, and sat_score is correlated with GPA but the local linear specification absorbs the smooth relationship, leaving no jump at the cutoff. These results support the validity of the RD design.

Concept Check

You find a statistically significant jump in SAT scores at the GPA cutoff. What does this suggest about your RD design?


Step 8: Exercises

  1. Fuzzy RD. Modify the simulation so that only 70% of students below 2.0 actually receive probation (imperfect compliance). Estimate the treatment effect using fuzzy RD (essentially IV at the cutoff). Compare with the sharp RD estimate.

  2. Donut hole RD. Drop observations very close to the cutoff (e.g., within 0.05 of 2.0) and re-estimate. A substantial change in the estimate suggests manipulation right at the cutoff.

  3. Placebo cutoffs. Estimate the "treatment effect" at fake cutoffs (e.g., GPA = 2.5, GPA = 3.0). You should find no discontinuity at these placebo cutoffs.

  4. Quadratic specification. Instead of local linear, try local quadratic regression. Compare the estimates. The results should be similar if the bandwidth is not too large.
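As a starting point for Exercise 1, here is a hedged sketch of the fuzzy design using rdrobust's fuzzy argument. It reuses n, tau, and Y0 from Step 2; the 70% compliance rate and the variable names (complier, D_actual, Y_fuzzy) are illustrative assumptions:

```r
# Imperfect compliance: only 70% of students below the cutoff receive probation
set.seed(123)
complier <- rbinom(n, 1, 0.7)
D_actual <- as.integer(df$X < 0 & complier == 1)

# Outcomes respond to actual treatment received, not to assignment
Y_fuzzy <- pmin(pmax(Y0 + tau * D_actual, 0), 1)

# Fuzzy RD: assignment is X < 0, actual treatment is passed via `fuzzy`;
# rdrobust scales the outcome jump by the jump in treatment take-up (IV logic)
rd_fuzzy <- rdrobust(y = Y_fuzzy, x = df$X, c = 0, fuzzy = D_actual)
summary(rd_fuzzy)
```

Compare the resulting estimate with the sharp RD estimate from Step 4; with 70% compliance, the first-stage jump in treatment probability at the cutoff is about 0.7 rather than 1.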


Summary

In this lab you learned:

  • RD exploits a known cutoff in the running variable to estimate causal effects for units near the threshold
  • The RD plot is the most important piece of evidence — include it in every RD analysis
  • Local linear regression with optimal bandwidth (via rdrobust) is the standard estimation approach
  • Check bandwidth sensitivity by reporting estimates across a range of bandwidths
  • Test the no-manipulation assumption using the McCrary density test
  • Verify covariate balance at the cutoff as additional evidence of design validity
  • The RD estimate is local: it applies to units near the cutoff and may not generalize to other parts of the running variable distribution