MethodAtlas
Tutorial · 2 hours

Lab: Regression Discontinuity Design from Scratch

Implement a sharp regression discontinuity design step by step. Simulate data with a cutoff, estimate the treatment effect using local linear regression, choose optimal bandwidth, and test for manipulation.

Overview

Regression Discontinuity Design (RDD) exploits situations where treatment is assigned based on whether a running variable (also called a forcing variable or score) crosses a known cutoff. Near the cutoff, treatment assignment is essentially random, giving us a credible causal estimate.

What you will learn:

  • How to set up and visualize an RD design
  • How to estimate the treatment effect using local linear regression
  • How to choose bandwidth and assess sensitivity
  • How to test the no-manipulation assumption (McCrary density test)
  • How to run validity checks (covariate balance at the cutoff)

Prerequisites: OLS regression (see the OLS lab).


Step 1: The Setting

Imagine a university that places students on academic probation if their GPA falls below 2.0. We want to estimate whether academic probation affects graduation rates. The running variable is GPA, the cutoff is 2.0, and the outcome is whether the student eventually graduates.

Students just below 2.0 are placed on probation; students just above 2.0 are not. The key insight: students at 1.99 GPA and 2.01 GPA are expected to be very similar on observed and unobserved characteristics — the primary difference is probation status.
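Formally, the sharp RD estimand is the jump in the conditional expectation of the outcome at the cutoff c (here c = 0 after centering GPA at 2.0). Because treatment in this example is assigned *below* the cutoff, the treated limit is the one from below:

```latex
\tau_{\text{SRD}} \;=\; \lim_{x \uparrow c} \mathbb{E}[Y_i \mid X_i = x] \;-\; \lim_{x \downarrow c} \mathbb{E}[Y_i \mid X_i = x]
```

Continuity of the potential-outcome regressions at c is what lets this one-sided limit comparison identify the causal effect of probation for students right at the threshold.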


Step 2: Simulate the Data

library(rdrobust)
library(rddensity)
library(ggplot2)

set.seed(42)
n <- 5000

# Running variable: GPA centered at 2.0
gpa_raw <- pmin(pmax(rnorm(n, 2.5, 0.7), 0), 4.0)
X <- gpa_raw - 2.0

# Treatment: probation if GPA < 2.0 (X < 0)
D <- as.integer(X < 0)

# Potential outcomes
tau <- 0.08
Y0 <- 0.5 + 0.15 * X + 0.02 * X^2 + rnorm(n, 0, 0.15)
Y1 <- Y0 + tau
Y <- D * Y1 + (1 - D) * Y0
Y <- pmin(pmax(Y, 0), 1)

df <- data.frame(gpa = gpa_raw, X = X, probation = D, graduated = Y)

cat("Sample size:", n, "\n")
cat("On probation:", sum(D), "(", round(mean(D)*100,1), "%)\n")
cat("Mean graduation:", round(mean(Y), 3), "\n")

Expected output (first rows of df):

  gpa     X        probation  graduated
  2.733   0.733    0          0.614
  2.403   0.403    0          0.549
  3.154   1.154    0          0.684
  3.567   1.567    0          0.751
  1.862   -0.138   1          0.574

Summary statistics:

  Statistic               Value
  Sample size             5,000
  Students on probation   ~1,100 (approximately 22%)
  Mean graduation rate    ~0.56
  GPA range               [0.0, 4.0]
  Mean GPA                ~2.50

Step 3: Visualize the Discontinuity

The first and most important step in any RD analysis is to plot the data. Bin the running variable and plot mean outcomes within each bin.

# RD plot using rdrobust package
rdplot(y = df$graduated, x = df$X, c = 0,
     title = "RD Plot: Academic Probation and Graduation",
     x.label = "GPA - 2.0 (Running Variable)",
     y.label = "Graduation Rate")
Requiresrdrobust
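If you want to see what rdplot is doing under the hood, here is a minimal hand-rolled binned-means sketch with ggplot2, assuming 40 evenly spaced bins (the bin count is an illustrative choice, not the package's data-driven selection):

```r
library(ggplot2)

# Bin the running variable and compute the mean outcome within each bin
n_bins <- 40
df$bin <- cut(df$X,
              breaks = seq(min(df$X), max(df$X), length.out = n_bins + 1),
              include.lowest = TRUE)
binned <- aggregate(cbind(graduated = df$graduated, X = df$X),
                    by = list(bin = df$bin), FUN = mean)

# Binned scatter with a dashed line at the cutoff
ggplot(binned, aes(x = X, y = graduated)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "GPA - 2.0 (Running Variable)", y = "Graduation Rate")
```

The discontinuity at zero should be visible to the eye before any regression is run; if it is not, no amount of estimation will make the design convincing.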

Step 4: Estimate the Treatment Effect

The standard RD estimator uses local linear regression: fit a separate linear regression on each side of the cutoff, using only observations within a bandwidth h of the cutoff.

# Using rdrobust (the standard approach)
rd_result <- rdrobust(y = df$graduated, x = df$X, c = 0)
summary(rd_result)

# The default uses MSE-optimal bandwidth selection
# and bias-corrected confidence intervals
Requires: rdrobust

Expected output: Local linear RD estimate (manual, bandwidth = 0.5)

  Variable    Coeff    SE      t       p
  Intercept   0.498    0.007   71.1    0.000
  probation   0.079    0.017   4.65    0.000
  X           0.154    0.022   7.00    0.000
  X_D         -0.018   0.044   -0.41   0.684

  Detail             Value
  Method             Local linear regression, HC1 robust SEs
  Bandwidth (h)      0.5 GPA points
  N in bandwidth     ~3,200
  Treatment effect   ~0.079 (true value: 0.08)
  95% CI             [0.045, 0.113]
  Kernel             Uniform (rectangular)

The rdrobust estimate with data-driven MSE-optimal bandwidth will produce a point estimate near 0.08, with a bandwidth typically between 0.3 and 0.8 GPA points and bias-corrected confidence intervals that cover the true value.
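The manual estimates in the table above can be reproduced with a plain lm() fit restricted to the bandwidth; a sketch, assuming h = 0.5 and HC1 robust standard errors via the sandwich and lmtest packages (not loaded earlier in this lab):

```r
library(sandwich)
library(lmtest)

h <- 0.5
local <- subset(df, abs(X) <= h)

# Separate intercepts and slopes on each side of the cutoff;
# the coefficient on `probation` is the RD treatment effect,
# and probation:X (X_D in the table) lets the slope differ below the cutoff
fit <- lm(graduated ~ probation * X, data = local)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```

This transparency is useful pedagogically, but rdrobust remains the recommended tool in practice because it also handles bandwidth selection and bias correction.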

Concept Check

Why do we use local linear regression (fitting lines near the cutoff) rather than a global polynomial (fitting a curve through all the data)?


Step 5: Bandwidth Sensitivity

The choice of bandwidth h involves a bias-variance tradeoff. A smaller bandwidth means less bias (more "local" comparison) but fewer observations and more variance. A larger bandwidth means more data but potentially more bias.

# Bandwidth sensitivity
bandwidths <- c(0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5)

for (h in bandwidths) {
  rd <- rdrobust(y = df$graduated, x = df$X, c = 0, h = h)
  cat(sprintf("h = %.1f: estimate = %.4f, SE = %.4f, N = %d\n",
              h, rd$coef[1], rd$se[3], rd$N_h[1] + rd$N_h[2]))
}
cat("\nTrue effect: 0.08\n")
cat("Estimate should be stable across bandwidths.\n")

Requires: rdrobust

Expected output: Bandwidth sensitivity

  Bandwidth (h)   Estimate   SE      N in bandwidth
  0.2             0.082      0.029   ~1,300
  0.3             0.080      0.023   ~1,950
  0.4             0.078      0.019   ~2,600
  0.5             0.079      0.017   ~3,200
  0.7             0.076      0.014   ~3,900
  1.0             0.074      0.012   ~4,500
  1.5             0.070      0.010   ~4,900

The true treatment effect is 0.08. Estimates are stable across bandwidths from 0.2 to 1.0, ranging between approximately 0.07 and 0.08. At larger bandwidths (1.0+), estimates drift slightly downward as the linear specification picks up curvature from the quadratic term in the DGP. This stability is reassuring and characteristic of a well-behaved RD design.


Step 6: Test for Manipulation (McCrary Test)

The key RD assumption is that units cannot precisely manipulate the running variable to sort above or below the cutoff. If students can manipulate their GPA to avoid probation, there will be an unusual density of students just above 2.0 and a gap just below 2.0.

# McCrary-style density test
library(rddensity)
density_test <- rddensity(X = df$X, c = 0)
summary(density_test)

# Visual
rdplotdensity(density_test, df$X,
              title = "Density Test at Cutoff",
              xlabel = "GPA - 2.0")

Requires: rddensity

Expected output: McCrary density test

  Test                                 Result
  McCrary-style density test p-value   > 0.05 (not significant)
  Interpretation                       No evidence of manipulation at the cutoff

Since the data is simulated from a smooth normal distribution with no manipulation, the p-value should be well above 0.05, confirming that the density of the running variable is continuous through the cutoff.
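A complementary eyeball check (a sketch, not a formal test) is a histogram of the running variable with bin edges aligned at the cutoff; under no manipulation there should be no spike just above zero or gap just below it. The bin width of 0.05 here is an illustrative choice:

```r
library(ggplot2)

# Histogram with a bin boundary exactly at the cutoff (boundary = 0)
ggplot(df, aes(x = X)) +
  geom_histogram(binwidth = 0.05, boundary = 0,
                 fill = "grey70", color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "GPA - 2.0 (Running Variable)", y = "Count")
```

If students could retake courses to nudge a 1.98 up to a 2.01, this plot would show a visible pile-up in the first bin to the right of the dashed line.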


Step 7: Covariate Balance at the Cutoff

If the RD design is valid, pre-determined covariates should be balanced at the cutoff. We can test this by running the RD regression with covariates as the outcome.

# Add covariates
df$female <- rbinom(n, 1, 0.5)
df$age <- 18 + rpois(n, 2)
df$sat_score <- 1000 + 100 * (df$gpa - 2.5) + rnorm(n, 0, 80)

# Balance tests
covariates <- c("female", "age", "sat_score")
for (cov in covariates) {
  rd <- rdrobust(y = df[[cov]], x = df$X, c = 0)
  cat(sprintf("%s: coef = %.3f, p = %.3f\n",
              cov, rd$coef[1], rd$pv[3]))
}
cat("\nAll p-values should be large (no discontinuities)\n")

Requires: rdrobust

Expected output: Covariate balance tests at cutoff (bandwidth = 0.5)

  Covariate   Coeff   SE     p-value   Balanced?
  female      -0.01   0.04   0.75      Yes
  age         0.05    0.12   0.69      Yes
  sat_score   1.20    3.50   0.73      Yes

All p-values are large (well above 0.05), indicating no statistically significant discontinuities in pre-determined covariates at the cutoff. This finding is expected: female and age are generated independently of GPA, and sat_score is correlated with GPA but the local linear specification absorbs the smooth relationship, leaving no jump at the cutoff. These results support the validity of the RD design.

Concept Check

You find a statistically significant jump in SAT scores at the GPA cutoff. What does this suggest about your RD design?


Step 8: Exercises

  1. Fuzzy RD. Modify the simulation so that only 70% of students below 2.0 actually receive probation (imperfect compliance). Estimate the treatment effect using fuzzy RD (essentially IV at the cutoff). Compare with the sharp RD estimate.

  2. Donut hole RD. Drop observations very close to the cutoff (e.g., within 0.05 of 2.0) and re-estimate. A substantial change in the estimate suggests manipulation right at the cutoff.

  3. Placebo cutoffs. Estimate the "treatment effect" at fake cutoffs (e.g., GPA = 2.5, GPA = 3.0). You should find no discontinuity at these placebo cutoffs.

  4. Quadratic specification. Instead of local linear, try local quadratic regression. Compare the estimates. The results should be similar if the bandwidth is not too large.
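As a starting point for Exercise 1, here is a hedged sketch of the fuzzy design using rdrobust's fuzzy argument. It reuses n, tau, and Y0 from Step 2; the 70% compliance rate and the variable names (complier, D_actual, Y_fuzzy) are illustrative assumptions:

```r
# Imperfect compliance: only 70% of students below the cutoff receive probation
set.seed(123)
complier <- rbinom(n, 1, 0.7)
D_actual <- as.integer(df$X < 0 & complier == 1)

# Outcomes respond to actual treatment received, not to assignment
Y_fuzzy <- pmin(pmax(Y0 + tau * D_actual, 0), 1)

# Fuzzy RD: assignment is X < 0, actual treatment is passed via `fuzzy`;
# rdrobust scales the outcome jump by the jump in treatment take-up (IV logic)
rd_fuzzy <- rdrobust(y = Y_fuzzy, x = df$X, c = 0, fuzzy = D_actual)
summary(rd_fuzzy)
```

Compare the resulting estimate with the sharp RD estimate from Step 4; with 70% compliance, the first-stage jump in treatment probability at the cutoff is about 0.7 rather than 1.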


Summary

In this lab you learned:

  • RD exploits a known cutoff in the running variable to estimate causal effects for units near the threshold
  • The RD plot is the most important piece of evidence — include it in every RD analysis
  • Local linear regression with optimal bandwidth (via rdrobust) is the standard estimation approach
  • Check bandwidth sensitivity by reporting estimates across a range of bandwidths
  • Test the no-manipulation assumption using the McCrary density test
  • Verify covariate balance at the cutoff as additional evidence of design validity
  • The RD estimate is local: it applies to units near the cutoff and may not generalize to other parts of the running variable distribution