Here’s the paper we will be discussing today.
Introduction: The “Problem” with Standard Errors
Prof. Bruce starts the paper by discussing the different techniques for estimating standard errors. Just a reminder, standard errors quantify the uncertainty in our estimated coefficients. They show how much our estimates (β̂) would vary if we were to repeatedly sample from the population. When we use econometric methods, like DiD, we often work with data that don’t always follow the "ideal" conditions set out in the Gauss-Markov theorem. Under a set of specific assumptions (linearity, no perfect multicollinearity, zero conditional mean of errors, homoskedasticity, and no autocorrelation), the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE).
Challenges in Real-World Data (Clustering, Heteroskedasticity)
However, real-world data almost always violate these assumptions. For example, data might be grouped into clusters (e.g., students within schools, patients within hospitals) where outcomes within a cluster are more similar to each other than to outcomes in other clusters. This clustering creates correlation within groups which, if ignored, results in underestimated standard errors and overly optimistic significance levels.
In a DiD setting, it is common to cluster standard errors at the treatment level, that is, at the level at which the treatment varies (e.g., state level if the treatment is applied uniformly across states). However, we need to keep a few things in mind:
Granularity: If the treatment varies at a more granular level than the group (e.g., individuals within a state are treated, but the treatment varies across individuals), clustering at the treatment level might not fully capture the correlation structure in the data. In such cases, clustering at a more granular level (e.g., individual or sub-group level) may be necessary.
Cluster Size: If the number of clusters is small (e.g., fewer than 10 states), clustering at the treatment level can lead to downward-biased standard errors, even with clustering. This is because the asymptotic approximations used in standard clustering methods break down with a small number of clusters. In such cases, we often use bias-corrected methods, such as the wild cluster bootstrap or jackknife, to address this issue.
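To make this concrete, here’s a minimal sketch of the conventional cluster-robust approach in R (the standard sandwich/lmtest route, not Prof. Bruce’s jackknife). The simulated DiD-style data (states, a treated indicator, and a post indicator) is purely hypothetical and stands in for whatever data you actually have:
library(sandwich)   # vcovCL() for the cluster-robust covariance matrix
library(lmtest)     # coeftest() for the adjusted coefficient table
# Simulated stand-in for a DiD panel: 20 "states", half of them treated,
# with a state-level shock in the errors
set.seed(123)
df <- data.frame(state = rep(1:20, each = 50))
df$treated <- as.integer(df$state <= 10)
df$post <- rep(rep(0:1, each = 25), 20)
df$y <- 1 + 0.5 * df$treated * df$post + rnorm(20)[df$state] + rnorm(nrow(df))
fit <- lm(y ~ treated * post, data = df)
# Standard errors clustered at the state level (the level at which treatment varies)
coeftest(fit, vcov = vcovCL(fit, cluster = ~ state))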
The variability in the data (the errors) might change across observations, a phenomenon known as heteroskedasticity. This violates the constant variance assumption of the Gauss-Markov theorem. Heteroskedasticity does not bias the coefficient estimates but can lead to incorrect standard errors. To address this, we use heteroskedasticity-robust standard errors (e.g., White standard errors).
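Similarly, here’s a minimal sketch of White (heteroskedasticity-robust) standard errors, again using sandwich and lmtest on made-up data:
library(sandwich)  # vcovHC() for the White/robust covariance matrix
library(lmtest)
set.seed(1)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = exp(x / 2))  # error variance depends on x
fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # "HC1" is the common small-sample variant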
These error adjustments are very important in methods like DiD, where failing to account for clustering or heteroskedasticity can lead to incorrect inferences. For example, we might mistakenly conclude that a policy is more effective than it truly is simply because we didn’t account for the structure of the data. This can have serious consequences, leading to poor policy decisions in the real world. Therefore, we need to properly estimate standard errors in order to draw reliable conclusions from econometric analyses.
The Jackknife Method
So, what is Prof. Bruce proposing?
He’s proposing a solution to these longstanding problems using the jackknife method, which, as you may know, is like the bootstrap’s “leaner, older sibling”. The jackknife provides a linear approximation of the bootstrap by systematically leaving out one observation (or cluster) at a time and recalculating the statistic of interest. It’s computationally simpler than the bootstrap, yet for a long time it wasn’t everyone’s favorite tool.
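To fix ideas, here’s a toy sketch of the textbook delete-one jackknife for an OLS slope (the generic idea only, not Prof. Bruce’s cluster version):
# Textbook delete-one jackknife for an OLS slope (toy, made-up data)
set.seed(42)
n <- 50
x <- rnorm(n); y <- 2 * x + rnorm(n)
slope_full <- coef(lm(y ~ x))[2]
# Re-estimate the slope n times, leaving out one observation each time
slope_loo <- sapply(1:n, function(i) coef(lm(y[-i] ~ x[-i]))[2])
# Jackknife variance: ((n - 1) / n) * sum of squared deviations from their mean
jack_se <- sqrt((n - 1) / n * sum((slope_loo - mean(slope_loo))^2))
c(estimate = unname(slope_full), jackknife_se = jack_se)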
But why is the jackknife gaining traction now? Early econometricians weren’t running these methods on powerful computers; they were working with limited resources. Back then, inverting matrices meant serious work. Imagine doing that by hand! This computational bottleneck meant that methods requiring multiple recalculations, like the jackknife, were a bit too ambitious for the time. People stuck to less computationally intensive methods, even if they weren’t perfect.
Fast-forward to today. With modern computing, we’re not just solving 100×100 matrices, we’re tackling datasets with thousands (or millions) of observations across hundreds of variables. Add to that the increasing complexity of statistical models and you have a recipe for needing something more robust than the old-school clustered standard errors. The jackknife now feels like the right balance: less demanding than the bootstrap, yet more reliable in small-sample or asymmetric settings.
It’s worth noting, though, that Prof. Bruce isn’t just “dusting off” the jackknife: his work refines it for modern problems, like cluster-robust settings, ensuring it avoids the pitfalls of downward bias and maintains coverage even in extreme cases (like a single treated cluster). So, we’re actually circling back to a simpler method, not because it’s trendy but because computing power and complex datasets now let us unlock its potential. He notes that “the calculations demonstrate that the proposed jackknife methods are computationally reasonable to implement on standard office computers, at least for data sets up to 200,000 observations and 200 regressors”.

Prof Bruce’s proposal is particularly important for the following DiD settings: small numbers of treated clusters, heterogeneous treatment effects, unbalanced cluster sizes, and high-leverage observations.
Implementation in R
How to implement it?
Stata and R implementations of his recommended jackknife method, jregress, are available on his website, along with the data and code needed to fully replicate all numerical results reported in the paper.
Prof Bruce's jackknife implementation consists of three main functions that work together: jregress, jreg, and Ginv.
jregress is the user-friendly interface, similar to R's lm() function. It takes a formula specification and handles all the data preparation.
jreg is where the actual computations happen (core implementation).
Ginv calculates the Moore-Penrose pseudoinverse for cases where matrices might not be fully invertible.
Both jregress and jreg produce identical outputs, but they accept different input formats: jregress takes a formula and a data frame, while jreg takes the matrices directly. Here’s his example in R:
# Function Formats:
jregress(formula, data, clusters, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE)
# Or:
jreg(X, Y, clusterid, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE)
# The jregress function handles all the data preparation:
# Parses formula and data inputs
# Checks for errors (e.g., non-numeric cluster variables)
# Creates model matrix (X) and response vector (Y)
# Handles fixed effects if specified
# Calls jreg for core calculations
# Arguments for jregress:
# formula regression formula, similar to lm function
# data data file
# clusters cluster variable name (if omitted, treated as non-clustered)
# Arguments for jreg:
# X nxk regressor matrix
# if fe=FALSE, should include intercept column
# if fe=TRUE, should not include intercept column
# Y nx1 dependent variable
# clusterid nx1 cluster variable
# Common arguments for both functions:
# fe TRUE to include cluster-level fixed effects (default: FALSE)
# df.adjust TRUE for degree-of-freedom adjustment (default)
# FALSE for t(G-1) or t(n-k) distribution
# df.display TRUE to display adjustment coefficients K and a (default: FALSE)
# level confidence level (default: .95)
# tolerance tolerance for singularity of X'X (default: 1e-8)
# print TRUE to print results (default)
# digits number of digits for display (default: 4L)
# signif.stars TRUE to display significance stars (default)
# Function outputs:
# coefficients least squares coefficient estimates
# se jackknife standard errors
# pvalue p-values for coefficients
# L lower bounds of confidence intervals
# U upper bounds of confidence intervals
# cov.mat estimated covariance matrix
# residuals least squares residuals
# loo.residuals leave-cluster-out residuals
# K adjustment degree-of-freedom coefficients
# a scale adjustment coefficients
# CV leave-cluster-out cross-validation criterion
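Before diving into the internals, here’s what a call might look like in practice. This is only a usage sketch: the data frame mydata, the variables y, x1, and x2, and the numeric cluster variable state are hypothetical placeholders, and the functions must first be sourced from the file on his website.
# source("jregress.R")  # the R file downloaded from his website (hypothetical file name)
# Formula interface, clustering at the state level:
fit <- jregress(y ~ x1 + x2, data = mydata, clusters = state)
# Equivalent matrix interface (include an intercept column when fe = FALSE):
X <- cbind(1, mydata$x1, mydata$x2)
fit2 <- jreg(X, mydata$y, mydata$state)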
The next part of the code takes a formula-based input (like lm()) and converts it into the matrices used by the core estimation function (jreg). In steps: it parses the R formula notation (like y ~ x1 + x2), builds the response vector (Y) and model matrix (X), handles fixed effects if specified, performs error checking, calls jreg for the core calculations, and returns the results in a format familiar to R users.
First it defines the function and the initial parsing:
# Define function with arguments
jregress <- function(formula, data, clusters, fe = FALSE, df.adjust = TRUE,
df.display = FALSE, level = .95, tolerance = 1e-8,
print = TRUE, digits = 4L, signif.stars = TRUE){
# Capture the function call
cl <- match.call()
# Find positions of core arguments (formula, data, clusters) in the call
m <- match(c("formula","data","clusters"), names(cl), 0L)
# Extract these arguments
mf <- cl[c(1L,m)]
# Set first element to model.frame function
mf[[1L]] <- quote(stats::model.frame)
Then it creates the model frame:
# Evaluate the model frame expression
mf <- eval(mf, envir = parent.frame())
# Extract the terms structure
mt <- attr(mf, "terms")
# Add cluster information to terms
attr(mt,"clustervar") <- cl$clusters
# Extract cluster variable if specified
clustervar <- mf$`(clusters)`
And check for errors:
# Check if cluster variable is numeric
if (!is.null(clustervar) & !is.numeric(clustervar)){
stop("Non-numeric cluster variable")
}
# Check fixed effects specification
if (fe == TRUE){
# Must have clusters for fixed effects
if (is.null(clustervar)){
stop("Fixed effects option specified without a cluster variable")
}
# Cluster variable can't be a regressor with fixed effects
cluster_id <- as.character(attr(mt,"clustervar"))
varlist <- all.vars(attr(mt,"variables"))
if (match(cluster_id,varlist,0L) > 0L){
stop("Cluster variable cannot be included as an (in)dependent variable when fixed effects option is specified.")
}
}
Then it creates the response vector and the design matrix:
# Extract response variable (Y)
Y <- model.response(mf, type = "numeric")
# Create design matrix (X)
X <- model.matrix(mt,mf)
# Handle fixed effects by removing intercept if needed
if (fe == TRUE){
attr(mt,"intercept") <- 0
X <- X[,-1] # Remove first column (intercept)
}
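To see what these lines produce, here’s a tiny illustration (with a made-up data frame) of model.response() and model.matrix():
d <- data.frame(y = c(1, 2, 3, 4), x1 = c(0, 1, 0, 1))
mf <- model.frame(y ~ x1, data = d)
model.response(mf, type = "numeric")  # the response vector: 1 2 3 4
model.matrix(attr(mf, "terms"), mf)   # a 4x2 matrix with "(Intercept)" and "x1" columns
# When fe = TRUE, the intercept column is dropped because the cluster-level
# fixed effects (handled later by within-cluster demeaning) absorb it.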
Finally, it generates the output:
# Print function call if requested
if (print == TRUE)
cat("\nCall:\n",paste(deparse(cl),sep="\n",collapse="\n"),"\n",sep="")
# Call core estimation function (jreg)
out <- jreg(X,Y,clustervar,fe,df.adjust,df.display,level,tolerance,print,digits,signif.stars)
# Add call and terms to output
out$call <- cl
out$terms <- mt
return(out)
}
The third function Ginv calculates the Moore-Penrose pseudoinverse for cases where matrices might not be fully invertible. This is an important step because standard matrix inversion can fail in certain cases.
Definition of function and tolerance:
Ginv <- function(A, tolerance=1e-15){
# Calculate adjusted tolerance based on matrix size and values
tol <- tolerance * max(A) * nrow(A)
Eigendecomposition:
# Get eigenvalues and eigenvectors
ei <- eigen(A, symmetric=TRUE)
d <- ei$values # eigenvalues
H <- ei$vectors # eigenvectors
Handling small eigenvalues:
# Identify "too small" eigenvalues
n <- (d < tol)
# Create inverse eigenvalues, adding n to avoid division by zero
di <- 1/(d+n)
# Set inverse of small eigenvalues to zero
di[n] <- 0
Computing the pseudoinverse:
# Compute the pseudoinverse
B <- H %*% (di*t(H))
# Calculate the reciprocal condition number (min/max eigenvalue, as in rcond)
c <- min(d)/max(d)
# The pseudoinverse is constructed using the formula H * D⁺ * H', where:
# H is the matrix of eigenvectors
# D⁺ is the diagonal matrix of inverse eigenvalues (with small ones set to zero)
# The reciprocal condition number c measures how close the matrix is to singular (values near zero flag numerical instability)
Return results:
return(list(inverse=B, rcond=c))
}
# The function returns both the pseudoinverse and the reciprocal condition number
In summary, this code: takes a symmetric matrix A and a tolerance level, adjusts the tolerance based on matrix size and magnitude, decomposes A into eigenvalues (d) and eigenvectors (H), identifies eigenvalues smaller than the tolerance, creates inverse eigenvalues carefully to avoid division by zero, sets the inverse of small eigenvalues to zero, constructs the pseudoinverse as B = H * D⁺ * H', calculates the reciprocal condition number as min(eigenvalues)/max(eigenvalues), and returns the pseudoinverse matrix (B) and the condition number (c) to assess numerical stability.
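As a quick sanity check of Ginv, consider a rank-deficient matrix that ordinary solve() would reject (the comparison assumes the MASS package is installed):
A <- matrix(c(2, 2, 2, 2), 2, 2)  # rank 1, not invertible by solve()
Ginv(A)$inverse                   # Moore-Penrose pseudoinverse: 0.125 in every cell
MASS::ginv(A)                     # agrees with the result above
Ginv(A)$rcond                     # 0: flags the singularity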
The code for the jreg function was divided into a few steps. It takes the input data (the X and Y matrices) and sets up clustering if specified (grouping observations that are likely correlated). It applies the fixed effects transformation if requested (removing time-invariant differences between clusters). It performs basic OLS regression, which gives the initial estimates of the treatment effects, and calculates the initial coefficients and R-squared (model fit), checking for matrix singularity issues. Then, for each cluster (or observation), it removes that cluster, re-estimates the model, and stores how the estimates change; these changes are used to compute robust standard errors. It calculates degrees-of-freedom adjustments (which prevent overconfidence when clusters are few) and computes adjusted t-statistics (to give more accurate p-values). Finally, it produces confidence intervals and p-values and returns the regression results, including coefficient estimates, jackknife standard errors, adjusted confidence intervals, p-values, and diagnostic information.
Function setup and initial processing:
jreg <- function(X,Y,clusterid,fe = FALSE,df.adjust = TRUE,df.display = FALSE,
level=.95,tolerance=1e-15,print = TRUE,digits = 4L,
signif.stars = TRUE){
# Get dimensions
n <- length(Y) # number of observations
k <- kF <- ncol(X) # number of variables
# Process clustering info if provided
if (!is.null(clusterid)){
c_id <- unique(clusterid) # get unique clusters
G <- length(c_id) # count clusters
G_tab <- table(clusterid) # observations per cluster
}
Fixed effects transformation (if requested):
if (fe == TRUE){
# Combine Y and X
YX <- cbind(Y,X)
# Demean within clusters
YX <- do.call(rbind, by(YX,clusterid,function(x){
sweep(x,2,colMeans(x),FUN = "-")
}))
# Split back into Y and X
Y <- YX[,1]
X <- as.matrix(YX[,-1])
clusterid <- sort(clusterid)
kF <- kF + G
}
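The demeaning step is the usual within transformation: subtracting cluster means is equivalent to including a full set of cluster dummies. Here’s a quick check with simulated (made-up) data:
set.seed(7)
d <- data.frame(g = rep(1:5, each = 20), x = rnorm(100))
d$y <- 2 * d$x + d$g + rnorm(100)
# Slope from explicit cluster dummies
b_dummies <- coef(lm(y ~ x + factor(g), data = d))["x"]
# Slope from within-cluster demeaning (no intercept, as in the fe = TRUE branch)
d$y_dm <- ave(d$y, d$g, FUN = function(v) v - mean(v))
d$x_dm <- ave(d$x, d$g, FUN = function(v) v - mean(v))
b_within <- coef(lm(y_dm ~ 0 + x_dm, data = d))["x_dm"]
all.equal(unname(b_dummies), unname(b_within))  # TRUE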
Basic OLS estimation:
# Compute basic matrices
XX <- crossprod(X) # X'X
if (rcond(XX) < tolerance)
stop("Singular design matrix...") # Check singularity
Q <- solve(XX) # (X'X)^(-1)
XY <- crossprod(X,Y) # X'Y
betahat <- Q%*%XY # β̂ = (X'X)^(-1)X'Y
ehat <- Y - X%*%betahat # residuals
etilde <- ehat
r.squared <- 1 - var(ehat)/var(Y) # R²
Jackknife VCE estimation:
if (!is.null(clusterid)){ # Clustered case
beta_loo <- matrix(NA,G,k) # Store leave-one-out estimates
rcond_loo <- matrix(NA,G,1) # Store condition numbers
# For each cluster
for (g in 1:G) {
gi <- (clusterid==c_id[g]) # Get cluster observations
Yg <- matrix(Y[gi],nrow=sum(gi))
Xg <- matrix(X[gi,],nrow=sum(gi),ncol=k)
# Leave-one-out estimation
XXg <- crossprod(Xg)
Ig <- Ginv(XX - XXg) # Use pseudoinverse
Qg <- Ig$inverse
beta_g <- Qg%*%(XY - crossprod(Xg,Yg))
beta_loo[g,] <- beta_g - betahat
etilde[gi] <- Yg - Xg%*%beta_g
}
}
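Note that the loop never refits the model from scratch: it relies on the identity that the leave-cluster-out coefficients equal (X'X - Xg'Xg)^(-1)(X'Y - Xg'Yg). A small check with made-up data that this update matches a brute-force refit:
set.seed(3)
X <- cbind(1, rnorm(40)); Y <- rnorm(40)
cluster <- rep(1:8, each = 5)
drop <- (cluster == 1)  # leave out cluster 1
b_refit  <- solve(crossprod(X[!drop, ]), crossprod(X[!drop, ], Y[!drop]))
b_update <- solve(crossprod(X) - crossprod(X[drop, ]),
                  crossprod(X, Y) - crossprod(X[drop, ], Y[drop]))
all.equal(b_refit, b_update)  # TRUE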
Degrees-of-freedom adjustment (note that the arrays S, U, and V used below, as well as tstat, CV, and noninvertible that appear later in the output steps, are built in parts of the full code omitted from this excerpt; see the complete file on his website):
Vhat <- crossprod(beta_loo) # Variance matrix
se <- matrix(sqrt(diag(Vhat))) # Standard errors
if (df.adjust == FALSE) {
# Simple adjustment
if (!is.null(clusterid)) K <- G-1 else K <- n-kF
a <- 1
} else {
# Hansen's more sophisticated adjustment
a <- rep(0,k)
K <- rep(0,k)
for (j in 1:k) {
Sj <- S[,j]
Uj <- U[,,j]
Vj <- V[,,j]
Wj <- sweep(Uj,1,Sj,FUN="*")
UU <- crossprod(Uj)
UV <- crossprod(Uj,Vj)
VV <- crossprod(Vj)
UUXX <- UU%*%XX
trL <- sum(Sj) - sum(diag(UV))
# The following uses the matrix fact: sum(diag(A%*%B)) = sum(t(A)*B)
trLL <- sum(Sj^2) + sum(t(UUXX)*UUXX) + 2*sum(t(UV)*UV) -
2*sum(Wj*Vj) - 4*sum(t(UUXX)*UV) + 2*sum(UU*VV)
K[j] <- (trL^2)/trLL
a[j] <- sqrt(trL/Q[j,j])
}
}
# This section computes the variance matrix and standard errors, then either
# applies a simple degrees of freedom adjustment (if df.adjust = FALSE)
# or Hansen's more sophisticated adjustment that accounts for leverage and
# heterogeneity in the data.
Final calculations and output:
# Calculate confidence intervals and p-values
c <- qt((1+level)/2,K)/a
Lb <- betahat - c*se
Ub <- betahat + c*se
pvalue <- 1 - pf((a*tstat)^2,1,K)
# Print results if requested
# If printing is requested (print=TRUE), it displays:
# Number of observations
# Number of regressors
# If clustered: cluster information (number of clusters, observations per cluster)
# A coefficient table showing:
# Estimates, standard errors, t-ratios, confidence intervals, p-values
# If requested, degrees of freedom and scale adjustments
# Whether standard errors are from delete-cluster jackknife or regular
# jackknife, which distribution was used for inference, R-squared value
# Cross-validation criterion, warnings about any non-invertible cases
if (print == TRUE){
cat("\nNumber of observations:", n,"\n")
cat("Number of regressors:", k,"\n")
if (!is.null(clusterid)){
G.tab = G_tab
G.summary = c(min(G.tab),mean(G.tab),median(G.tab),max(G.tab))
G.info = structure(G.summary, names = c("Min","Mean","Median","Max"))
cat("Number of clusters:", G,"\n")
cat("\nNumber of observations per cluster:\n")
print(G.info, digits = digits)
}
coef.table <- cbind(betahat, se, tstat, Lb, Ub, pvalue)
if (df.adjust == TRUE & df.display == TRUE) {
coef.table <- cbind(coef.table[,1:5], K, a, coef.table[,6])
colnames(coef.table) <- c("Estimate", "Std. Error", "t-ratio",
"Lower","Upper","df","scale", "Pr(>|t|)")
} else colnames(coef.table) <- c("Estimate", "Std. Error", "t-ratio",
"Lower","Upper","Pr(>|t|)")
cat("\nCoefficients:\n")
printCoefmat(coef.table, digits = digits, signif.stars = signif.stars)
if (!is.null(clusterid))
cat("\nStandard errors calculated by the delete-cluster jackknife.")
else cat("\nStandard errors calculated by the jackknife.")
if (df.adjust == TRUE)
cat("\nConfidence intervals and p-values calculated using adjusted student t distribution.\n")
else {
if (!is.null(clusterid))
cat("\nConfidence intervals and p-values calculated using t(G-1) distribution.\n")
else cat("\nConfidence intervals and p-values calculated using t(n-k) distribution.\n")
}
cat("\nR-squared: ", formatC(r.squared, digits = digits))
if (!is.null(clusterid))
cat("\nCluster Cross-Validation Criterion: ", formatC(CV, digits = digits, format = "f"))
else cat("\nCross-Validation Criterion: ", formatC(CV, digits = digits, format = "f"))
cat("\n")
if (noninvertible > 0) {
cat("\nWarning: Jackknife coefficients not identified in",
formatC(noninvertible,digits=0,format = "f"))
cat("/")
if (!is.null(clusterid))
cat(formatC(G,digits=0,format = "f"),"clusters.")
else cat(formatC(n,digits=0,format = "f"),"iterations.")
cat("\nJackknife standard errors may be conservative. Recommend respecification of regression model.")
cat("\n")
}
}
# Return results
out <- list(
coefficients = betahat, # Point estimates
se = se, # Standard errors
pvalue = pvalue, # P-values
cov.mat = Vhat, # Covariance matrix
residuals = ehat, # Regular residuals
loo.residuals = etilde, # Leave-one-out residuals
K = K, # Degrees of freedom adjustment
a = a, # Scale adjustment
CV = CV, # Cross-validation criterion
L = Lb, # Lower confidence bounds
U = Ub # Upper confidence bounds
)
return(out)
}
For the Stata implementation, you can check the code available on his website (it should be a bit easier).
Why This Matters
To wrap up, Prof. Bruce’s jackknife implementation represents a significant step forward in how we handle uncertainty in DiD analysis. His method provides more reliable standard errors and inference, especially in tricky situations like having few clusters, heterogeneous treatment effects, or unbalanced data. What makes this approach so valuable is its practicality: it’s computationally efficient enough to handle large modern datasets while being more reliable than traditional clustering methods.
The fact that the code is freely available and integrates nicely with R’s existing regression framework makes it easy for researchers to adopt. This isn’t just a theoretical improvement, it’s a tool that can be immediately applied to real-world problems, ensuring more accurate assessments of policy impacts. Prof. Bruce’s work helps us better understand statistical uncertainty in DiD estimates, ultimately leading to more trustworthy conclusions and better-informed decisions.
In a world where data complexity is only increasing, circling back to simpler, yet refined methods like the jackknife feels both timely and necessary. It’s a reminder that sometimes, the best solutions aren’t the most complicated ones, but the ones that strike the right balance between simplicity, reliability, and practicality.
Didn't expect to meet Roger Penrose here!