Here’s the paper we will be discussing today.
Introduction: The “Problem” with Standard Errors
Prof. Bruce starts the paper by discussing the different techniques for estimating standard errors. Just a reminder, standard errors quantify the uncertainty in our estimated coefficients. They show how much our estimates (β̂) would vary if we were to repeatedly sample from the population. When we use econometric methods, like DiD, we often work with data that don’t always follow the "ideal" conditions set out in the Gauss-Markov theorem. Under a set of specific assumptions (linearity, no perfect multicollinearity, zero conditional mean of errors, homoskedasticity, and no autocorrelation), the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE).
Challenges in Real-World Data (Clustering, Heteroskedasticity)
However, real-world data almost always violate these assumptions. For example, data might be grouped into clusters (e.g., students within schools, patients within hospitals) where outcomes within a cluster are more similar to each other than to outcomes in other clusters. This clustering creates correlation within groups which, if ignored, results in underestimated standard errors and overly optimistic significance levels.
In a DiD setting, it is common to cluster standard errors at the treatment level, that is, at the level at which the treatment varies (e.g., state level if the treatment is applied uniformly across states). However, we need to keep a few things in mind:
Granularity: If the treatment varies at a more granular level than the group (e.g., individuals within a state are treated, but the treatment varies across individuals), clustering at the treatment level might not fully capture the correlation structure in the data. In such cases, clustering at a more granular level (e.g., individual or sub-group level) may be necessary.
Cluster Size: If the number of clusters is small (e.g., fewer than 10 states), clustering at the treatment level can lead to downward-biased standard errors, even with clustering. This is because the asymptotic approximations used in standard clustering methods break down with a small number of clusters. In such cases, we often use bias-corrected methods, such as the wild cluster bootstrap or jackknife, to address this issue.
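To make this concrete, here’s a minimal sketch of the conventional cluster-robust approach in R (the standard sandwich/lmtest route, not Prof. Bruce’s jackknife). The simulated DiD-style data (states, a treated indicator, and a post indicator) is purely hypothetical and stands in for whatever data you actually have:
library(sandwich)   # vcovCL() for the cluster-robust covariance matrix
library(lmtest)     # coeftest() for the adjusted coefficient table
# Simulated stand-in for a DiD panel: 20 "states", half of them treated,
# with a state-level shock in the errors
set.seed(123)
df <- data.frame(state = rep(1:20, each = 50))
df$treated <- as.integer(df$state <= 10)
df$post <- rep(rep(0:1, each = 25), 20)
df$y <- 1 + 0.5 * df$treated * df$post + rnorm(20)[df$state] + rnorm(nrow(df))
fit <- lm(y ~ treated * post, data = df)
# Standard errors clustered at the state level (the level at which treatment varies)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ state))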
The variability in the data (the errors) might change across observations, a phenomenon known as heteroskedasticity. This violates the constant variance assumption of the Gauss-Markov theorem. Heteroskedasticity does not bias the coefficient estimates but can lead to incorrect standard errors. To address this, we use heteroskedasticity-robust standard errors (e.g., White standard errors).
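Similarly, here’s a minimal sketch of White (heteroskedasticity-robust) standard errors, again using sandwich and lmtest on made-up data:
library(sandwich)  # vcovHC() for the White/robust covariance matrix
library(lmtest)
set.seed(1)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = exp(x / 2))  # error variance depends on x
fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # "HC1" is the common small-sample variant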
These error adjustments are very important in methods like DiD, where failing to account for clustering or heteroskedasticity can lead to incorrect inferences. For example, we might mistakenly conclude that a policy is more effective than it truly is simply because we didn’t account for the structure of the data. This can have serious consequences, leading to poor policy decisions in the real world. Therefore, we need to properly estimate standard errors in order to draw reliable conclusions from econometric analyses.
The Jackknife Method
So, what is Prof. Bruce proposing?
He’s proposing a solution to these longstanding problems using the jackknife method, which, as you may know, is like the bootstrap’s “leaner, older sibling”. The jackknife provides a linear approximation of the bootstrap by systematically leaving out one observation (or cluster) at a time and recalculating the statistic of interest. It’s computationally simpler than the bootstrap, yet for a long time it wasn’t everyone’s favorite tool.
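To fix ideas, here’s a toy sketch of the textbook delete-one jackknife for an OLS slope (the generic idea only, not Prof. Bruce’s cluster version):
# Textbook delete-one jackknife for an OLS slope (toy, made-up data)
set.seed(42)
n <- 50
x <- rnorm(n); y <- 2 * x + rnorm(n)
slope_full <- coef(lm(y ~ x))[2]
# Re-estimate the slope n times, leaving out one observation each time
slope_loo <- sapply(1:n, function(i) coef(lm(y[-i] ~ x[-i]))[2])
# Jackknife variance: ((n - 1) / n) * sum of squared deviations from their mean
jack_se <- sqrt((n - 1) / n * sum((slope_loo - mean(slope_loo))^2))
c(estimate = unname(slope_full), jackknife_se = jack_se)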
But why is the jackknife gaining traction now? Early econometricians weren’t running these methods on powerful computers; they were working with limited resources. Back then, inverting matrices meant serious work. Imagine doing that by hand! This computational bottleneck meant that methods requiring multiple recalculations, like the jackknife, were a bit too ambitious for the time. People stuck to less computationally intensive methods, even if they weren’t perfect.
Fast-forward to today. With modern computing, we’re not just solving 100×100 matrices, we’re tackling datasets with thousands (or millions) of observations across hundreds of variables. Add to that the increasing complexity of statistical models and you have a recipe for needing something more robust than the old-school clustered standard errors. The jackknife now feels like the right balance: less demanding than the bootstrap, yet more reliable in small-sample or asymmetric settings.
It’s worth noting, though, that Prof. Bruce isn’t just “dusting off” the jackknife: his work refines it for modern problems, like cluster-robust settings, ensuring it avoids the pitfalls of downward bias and maintains coverage even in extreme cases (like a single treated cluster). So, we’re actually circling back to a simpler method, not because it’s trendy but because computing power and complex datasets now let us unlock its potential. He notes that “the calculations demonstrate that the proposed jackknife methods are computationally reasonable to implement on standard office computers, at least for data sets up to 200,000 observations and 200 regressors”.

Prof Bruce’s proposal is particularly important for the following DiD settings: small numbers of treated clusters, heterogeneous treatment effects, unbalanced cluster sizes, and high-leverage observations.
Implementation in R
How to implement it?
Stata and R implementations of his recommended jackknife method, jregress, are available on his website, along with the data and code needed to fully replicate all numerical results reported in the paper.
Prof Bruce's jackknife implementation consists of three main functions that work together: jregress, jreg, and Ginv.
jregress is the user-friendly interface, similar to R's lm() function. It takes a formula specification and handles all the data preparation.
jreg is where the actual computations happen (core implementation).
Ginv calculates the Moore-Penrose pseudoinverse for cases where matrices might not be fully invertible.
Both jregress and jreg produce identical outputs, but they accept different input formats: jregress takes a formula and a data frame, while jreg takes the matrices directly. Here’s his example in R:
# Function Formats:
jregress(formula, data, clusters, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE)
# Or:
jreg(X, Y, clusterid, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE)
# The jregress function handles all the data preparation:
# Parses formula and data inputs
# Checks for errors (e.g., non-numeric cluster variables)
# Creates model matrix (X) and response vector (Y)
# Handles fixed effects if specified
# Calls jreg for core calculations
# Arguments for jregress:
# formula regression formula, similar to lm function
# data data file
# clusters cluster variable name (if omitted, treated as non-clustered)
# Arguments for jreg:
# X nxk regressor matrix
# if fe=FALSE, should include intercept column
# if fe=TRUE, should not include intercept column
# Y nx1 dependent variable
# clusterid nx1 cluster variable
# Common arguments for both functions:
# fe TRUE to include cluster-level fixed effects (default: FALSE)
# df.adjust TRUE for degree-of-freedom adjustment (default)
# FALSE for t(G-1) or t(n-k) distribution
# df.display TRUE to display adjustment coefficients K and a (default: FALSE)
# level confidence level (default: .95)
# tolerance tolerance for singularity of X'X (default: 1e-8)
# print TRUE to print results (default)
# digits number of digits for display (default: 4L)
# signif.stars TRUE to display significance stars (default)
# Function outputs:
# coefficients least squares coefficient estimates
# se jackknife standard errors
# pvalue p-values for coefficients
# L lower bounds of confidence intervals
# U upper bounds of confidence intervals
# cov.mat estimated covariance matrix
# residuals least squares residuals
# loo.residuals leave-cluster-out residuals
# K adjustment degree-of-freedom coefficients
# a scale adjustment coefficients
# CV leave-cluster-out cross-validation criterion
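Before diving into the internals, here’s what a call might look like in practice. This is only a usage sketch: the data frame mydata, the variables y, x1, and x2, and the numeric cluster variable state are hypothetical placeholders, and the functions must first be sourced from the file on his website.
# source("jregress.R")  # the R file downloaded from his website (hypothetical file name)
# Formula interface, clustering at the state level:
fit <- jregress(y ~ x1 + x2, data = mydata, clusters = state)
# Equivalent matrix interface (include an intercept column when fe = FALSE):
X <- cbind(1, mydata$x1, mydata$x2)
fit2 <- jreg(X, mydata$y, mydata$state)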
The next part of the code takes a formula-based input (like lm()) and converts it into the matrices used by the core estimation function (jreg). In steps: it parses the R formula notation (like y ~ x1 + x2), builds the response vector (Y) and model matrix (X), handles fixed effects if specified, performs error checking, calls jreg for the core calculations, and returns the results in a format familiar to R users.
First it defines the function and the initial parsing:
# Define function with arguments
jregress <- function(formula, data, clusters, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE){
# Capture the function call
cl <- match.call()
# Find positions of core arguments (formula, data, clusters) in the call
m <- match(c("formula","data","clusters"), names(cl), 0L)
# Extract these arguments
mf <- cl[c(1L,m)]
# Set first element to model.frame function
mf[[1L]] <- quote(stats::model.frame)
Then it creates the model frame:
# Evaluate the model frame expression
mf <- eval(mf, envir = parent.frame())
# Extract the terms structure
mt <- attr(mf, "terms")
# Add cluster information to terms
attr(mt,"clustervar") <- cl$clusters
# Extract cluster variable if specified
clustervar <- mf$`(clusters)`
And check for errors:
# Check if cluster variable is numeric
if (!is.null(clustervar) & !is.numeric(clustervar)){
stop("Non-numeric cluster variable")
}
# Check fixed effects specification
if (fe == TRUE){
# Must have clusters for fixed effects
if (is.null(clustervar)){
stop("Fixed effects option specified without a cluster variable")
}
# Cluster variable can't be a regressor with fixed effects
cluster_id <- as.character(attr(mt,"clustervar"))
varlist <- all.vars(attr(mt,"variables"))
if (match(cluster_id,varlist,0L) > 0L){
stop("Cluster variable cannot be included as an (in)dependent variable when fixed effects option is specified.")
}
}
Then it creates the response vector and the design matrix:
# Extract response variable (Y)
Y <- model.response(mf, type = "numeric")
# Create design matrix (X)
X <- model.matrix(mt,mf)
# Handle fixed effects by removing intercept if needed
if (fe == TRUE){
attr(mt,"intercept") <- 0
X <- X[,-1] # Remove first column (intercept)
}
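To see what these lines produce, here’s a tiny illustration (with a made-up data frame) of model.response() and model.matrix():
d <- data.frame(y = c(1, 2, 3, 4), x1 = c(0, 1, 0, 1))
mf <- model.frame(y ~ x1, data = d)
model.response(mf, type = "numeric")  # the response vector: 1 2 3 4
model.matrix(attr(mf, "terms"), mf)   # a 4x2 matrix with "(Intercept)" and "x1" columns
# When fe = TRUE, the intercept column is dropped because the cluster-level
# fixed effects (handled later by within-cluster demeaning) absorb it.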
Finally, it generates the output:
# Print function call if requested
if (print == TRUE)
cat("\nCall:\n",paste(deparse(cl),sep="\n",collapse="\n"),"\n",sep="")
# Call core estimation function (jreg)
out <- jreg(X,Y,clustervar,fe,df.adjust,df.display,level,tolerance,print,digits,signif.stars)
# Add call and terms to output
out$call <- cl
out$terms <- mt
return(out)
}
The third function Ginv calculates the Moore-Penrose pseudoinverse for cases where matrices might not be fully invertible. This is an important step because standard matrix inversion can fail in certain cases.
Definition of function and tolerance:
Ginv <- function(A, tolerance=1e-15){
# Calculate adjusted tolerance based on matrix size and values
tol <- tolerance * max(A) * nrow(A)
Eigendecomposition:
# Get eigenvalues and eigenvectors
ei <- eigen(A, symmetric=TRUE)
d <- ei$values # eigenvalues
H <- ei$vectors # eigenvectors
Handling small eigenvalues:
# Identify "too small" eigenvalues
n <- (d < tol)
# Create inverse eigenvalues, adding n to avoid division by zero
di <- 1/(d+n)
# Set inverse of small eigenvalues to zero
di[n] <- 0
Computing the pseudoinverse:
# Compute the pseudoinverse
B <- H %*% (di*t(H))
# Calculate the reciprocal condition number (min/max eigenvalue, as in rcond)
c <- min(d)/max(d)
# The pseudoinverse is constructed using the formula H * D⁺ * H', where:
# H is the matrix of eigenvectors
# D⁺ is the diagonal matrix of inverse eigenvalues (with small ones set to zero)
# The reciprocal condition number c measures how close the matrix is to singular (values near zero flag numerical instability)
Return results:
return(list(inverse=B, rcond=c))
}
# The function returns both the pseudoinverse and the reciprocal condition number
In summary, this code: takes a symmetric matrix A and a tolerance level, adjusts the tolerance based on matrix size and magnitude, decomposes A into eigenvalues (d) and eigenvectors (H), identifies eigenvalues smaller than the tolerance, creates inverse eigenvalues carefully to avoid division by zero, sets the inverse of small eigenvalues to zero, constructs the pseudoinverse as B = H * D⁺ * H', calculates the reciprocal condition number as min(eigenvalues)/max(eigenvalues), and returns the pseudoinverse matrix (B) and the condition number (c) to assess numerical stability.
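As a quick sanity check of Ginv, consider a rank-deficient matrix that ordinary solve() would reject (the comparison assumes the MASS package is installed):
A <- matrix(c(2, 2, 2, 2), 2, 2)  # rank 1, not invertible by solve()
Ginv(A)$inverse                   # Moore-Penrose pseudoinverse: 0.125 in every cell
MASS::ginv(A)                     # agrees with the result above
Ginv(A)$rcond                     # 0: flags the singularity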
The code for the jreg function was divided into a few steps. It takes the input data (the X and Y matrices) and sets up clustering if specified (grouping observations that are likely correlated). It applies the fixed effects transformation if requested (removing time-invariant differences between clusters). It performs basic OLS regression, which gives the initial estimates of the treatment effects, and calculates the initial coefficients and R-squared (model fit), checking for matrix singularity issues. Then, for each cluster (or observation), it removes that cluster, re-estimates the model, and stores how the estimates change; these changes are used to compute robust standard errors. It calculates degrees-of-freedom adjustments (which prevent overconfidence when clusters are few) and computes adjusted t-statistics (to give more accurate p-values). Finally, it produces confidence intervals and p-values and returns the regression results, including coefficient estimates, jackknife standard errors, adjusted confidence intervals, p-values, and diagnostic information.
Function setup and initial processing:
jreg <- function(X,Y,clusterid,fe = FALSE,df.adjust = TRUE,df.display = FALSE,
level=.95,tolerance=1e-15,print = TRUE,digits = 4L,
signif.stars = TRUE){
# Get dimensions
n <- length(Y) # number of observations
k <- kF <- ncol(X) # number of variables
# Process clustering info if provided
if (!is.null(clusterid)){
c_id <- unique(clusterid) # get unique clusters
G <- length(c_id) # count clusters
G_tab <- table(clusterid) # observations per cluster
}
Fixed effects transformation (if requested):
if (fe == TRUE){
# Combine Y and X
YX <- cbind(Y,X)
# Demean within clusters
YX <- do.call(rbind, by(YX,clusterid,function(x){
sweep(x,2,colMeans(x),FUN = "-")
}))
# Split back into Y and X
Y <- YX[,1]
X <- as.matrix(YX[,-1])
clusterid <- sort(clusterid)
kF <- kF + G
}
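The demeaning step is the usual within transformation: subtracting cluster means is equivalent to including a full set of cluster dummies. Here’s a quick check with simulated (made-up) data:
set.seed(7)
d <- data.frame(g = rep(1:5, each = 20), x = rnorm(100))
d$y <- 2 * d$x + d$g + rnorm(100)
# Slope from explicit cluster dummies
b_dummies <- coef(lm(y ~ x + factor(g), data = d))["x"]
# Slope from within-cluster demeaning (no intercept, as in the fe = TRUE branch)
d$y_dm <- ave(d$y, d$g, FUN = function(v) v - mean(v))
d$x_dm <- ave(d$x, d$g, FUN = function(v) v - mean(v))
b_within <- coef(lm(y_dm ~ 0 + x_dm, data = d))["x_dm"]
all.equal(unname(b_dummies), unname(b_within))  # TRUE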
Basic OLS estimation:
# Compute basic matrices
XX <- crossprod(X) # X'X
if (rcond(XX) < tolerance)
stop("Singular design matrix...") # Check singularity
Q <- solve(XX) # (X'X)^(-1)
XY <- crossprod(X,Y) # X'Y
betahat <- Q%*%XY # β̂ = (X'X)^(-1)X'Y
ehat <- Y - X%*%betahat # residuals
etilde <- ehat
r.squared <- 1 - var(ehat)/var(Y) # R²
Jackknife VCE estimation:
if (!is.null(clusterid)){ # Clustered case
beta_loo <- matrix(NA,G,k) # Store leave-one-out estimates
rcond_loo <- matrix(NA,G,1) # Store condition numbers
# For each cluster
for (g in 1:G) {
gi <- (clusterid==c_id[g]) # Get cluster observations
Yg <- matrix(Y[gi],nrow=sum(gi))
Xg <- matrix(X[gi,],nrow=sum(gi),ncol=k)
# Leave-one-out estimation
XXg <- crossprod(Xg)
Ig <- Ginv(XX - XXg) # Use pseudoinverse
Qg <- Ig$inverse
beta_g <- Qg%*%(XY - crossprod(Xg,Yg))
beta_loo[g,] <- beta_g - betahat
etilde[gi] <- Yg - Xg%*%beta_g
}
}
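Note that the loop never refits the model from scratch: it relies on the identity that the leave-cluster-out coefficients equal (X'X - Xg'Xg)^(-1)(X'Y - Xg'Yg). A small check with made-up data that this update matches a brute-force refit:
set.seed(3)
X <- cbind(1, rnorm(40)); Y <- rnorm(40)
cluster <- rep(1:8, each = 5)
drop <- (cluster == 1)  # leave out cluster 1
b_refit  <- solve(crossprod(X[!drop, ]), crossprod(X[!drop, ], Y[!drop]))
b_update <- solve(crossprod(X) - crossprod(X[drop, ]),
                  crossprod(X, Y) - crossprod(X[drop, ], Y[drop]))
all.equal(b_refit, b_update)  # TRUE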
Degrees-of-freedom adjustment (note that the arrays S, U, and V used below, as well as tstat, CV, and noninvertible that appear later in the output steps, are built in parts of the full code omitted from this excerpt; see the complete file on his website):
Vhat <- crossprod(beta_loo) # Variance matrix
se <- matrix(sqrt(diag(Vhat))) # Standard errors
if (df.adjust == FALSE) {
# Simple adjustment
if (!is.null(clusterid)) K <- G-1 else K <- n-kF
a <- 1
} else {
# Hansen's more sophisticated adjustment
a <- rep(0,k)
K <- rep(0,k)
for (j in 1:k) {
Sj <- S[,j]
Uj <- U[,,j]
Vj <- V[,,j]
Wj <- sweep(Uj,1,Sj,FUN="*")
UU <- crossprod(Uj)
UV <- crossprod(Uj,Vj)
VV <- crossprod(Vj)
UUXX <- UU%*%XX
trL <- sum(Sj) - sum(diag(UV))
# The following uses the matrix fact: sum(diag(A%*%B)) = sum(t(A)*B)
trLL <- sum(Sj^2) + sum(t(UUXX)*UUXX) + 2*sum(t(UV)*UV) -
2*sum(Wj*Vj) - 4*sum(t(UUXX)*UV) + 2*sum(UU*VV)
K[j] <- (trL^2)/trLL
a[j] <- sqrt(trL/Q[j,j])
}
}
# This section computes the variance matrix and standard errors, then either
# applies a simple degrees of freedom adjustment (if df.adjust = FALSE)
# or Hansen's more sophisticated adjustment that accounts for leverage and
# heterogeneity in the data.
Final calculations and output:
# Calculate confidence intervals and p-values
c <- qt((1+level)/2,K)/a
Lb <- betahat - c*se
Ub <- betahat + c*se
pvalue <- 1 - pf((a*tstat)^2,1,K)
# Print results if requested
# If printing is requested (print=TRUE), it displays:
# Number of observations
# Number of regressors
# If clustered: cluster information (number of clusters, observations per cluster)
# A coefficient table showing:
# Estimates, standard errors, t-ratios, confidence intervals, p-values
# If requested, degrees of freedom and scale adjustments
# Whether standard errors are from delete-cluster jackknife or regular
# jackknife, which distribution was used for inference, R-squared value
# Cross-validation criterion, warnings about any non-invertible cases
if (print == TRUE){
cat("\nNumber of observations:", n,"\n")
cat("Number of regressors:", k,"\n")
if (!is.null(clusterid)){
G.tab = G_tab
G.summary = c(min(G.tab),mean(G.tab),median(G.tab),max(G.tab))
G.info = structure(G.summary, names = c("Min","Mean","Median","Max"))
cat("Number of clusters:", G,"\n")
cat("\nNumber of observations per cluster:\n")
print(G.info, digits = digits)
}
coef.table <- cbind(betahat, se, tstat, Lb, Ub, pvalue)
if (df.adjust == TRUE & df.display == TRUE) {
coef.table <- cbind(coef.table[,1:5], K, a, coef.table[,6])
colnames(coef.table) <- c("Estimate", "Std. Error", "t-ratio",
"Lower","Upper","df","scale", "Pr(>|t|)")
} else colnames(coef.table) <- c("Estimate", "Std. Error", "t-ratio",
"Lower","Upper","Pr(>|t|)")
cat("\nCoefficients:\n")
printCoefmat(coef.table, digits = digits, signif.stars = signif.stars)
if (!is.null(clusterid))
cat("\nStandard errors calculated by the delete-cluster jackknife.")
else cat("\nStandard errors calculated by the jackknife.")
if (df.adjust == TRUE)
cat("\nConfidence intervals and p-values calculated using adjusted student t distribution.\n")
else {
if (!is.null(clusterid))
cat("\nConfidence intervals and p-values calculated using t(G-1) distribution.\n")
else cat("\nConfidence intervals and p-values calculated using t(n-k) distribution.\n")
}
cat("\nR-squared: ", formatC(r.squared, digits = digits))
if (!is.null(clusterid))
cat("\nCluster Cross-Validation Criterion: ", formatC(CV, digits = digits, format = "f"))
else cat("\nCross-Validation Criterion: ", formatC(CV, digits = digits, format = "f"))
cat("\n")
if (noninvertible > 0) {
cat("\nWarning: Jackknife coefficients not identified in",
formatC(noninvertible,digits=0,format = "f"))
cat("/")
if (!is.null(clusterid))
cat(formatC(G,digits=0,format = "f"),"clusters.")
else cat(formatC(n,digits=0,format = "f"),"iterations.")
cat("\nJackknife standard errors may be conservative. Recommend respecification of regression model.")
cat("\n")
}
}
# Return results
out <- list(
coefficients = betahat, # Point estimates
se = se, # Standard errors
pvalue = pvalue, # P-values
cov.mat = Vhat, # Covariance matrix
residuals = ehat, # Regular residuals
loo.residuals = etilde, # Leave-one-out residuals
K = K, # Degrees of freedom adjustment
a = a, # Scale adjustment
CV = CV, # Cross-validation criterion
L = Lb, # Lower confidence bounds
U = Ub # Upper confidence bounds
)
return(out)
}
For the Stata implementation, you can check the code available on his website (it should be a bit easier).
Why This Matters
To wrap up, Prof. Bruce’s jackknife implementation represents a significant step forward in how we handle uncertainty in DiD analysis. His method provides more reliable standard errors and inference, especially in tricky situations like having few clusters, heterogeneous treatment effects, or unbalanced data. What makes this approach so valuable is its practicality: it’s computationally efficient enough to handle large modern datasets while being more reliable than traditional clustering methods.
The fact that the code is freely available and integrates nicely with R’s existing regression framework makes it easy for researchers to adopt. This isn’t just a theoretical improvement, it’s a tool that can be immediately applied to real-world problems, ensuring more accurate assessments of policy impacts. Prof. Bruce’s work helps us better understand statistical uncertainty in DiD estimates, ultimately leading to more trustworthy conclusions and better-informed decisions.
In a world where data complexity is only increasing, circling back to simpler, yet refined methods like the jackknife feels both timely and necessary. It’s a reminder that sometimes, the best solutions aren’t the most complicated ones, but the ones that strike the right balance between simplicity, reliability, and practicality.
Didn't expect to meet Roger Penrose here!