Parallel Trends… Conditionally: Covariates, Violations, and More Robust DiD Methods
The devil in the ~DiD details~
Hi there! Following a friend’s suggestion, I decided to add some quick explanations of concepts that eventually show up in the main text. I don’t want to overcrowd it, so they will appear as footnotes whenever it’s necessary. I hope this will make things clearer :) For example, the first paper talks about the covariate balancing propensity score, which is a method for estimating propensity scores in a way that automatically balances covariates between treated and control groups (rather than estimating the score first and then checking balance later). I’ll no longer assume everyone is already familiar with these “specifics”. Any other suggestions are welcome, just send me an email (b[dot]gietner[at]gmail[dot]com) or comment in the comment box at the end of the post.
Let’s goooo then. Here are the papers:
A difference-in-differences estimator by covariate balancing propensity score, by Junjie Li and Yukitoshi Matsushita
Good Controls Gone Bad: Difference-in-Differences with Covariates, by Sunny Karim and Matthew D. Webb
Bayesian Sensitivity Analyses for Policy Evaluation with Difference-in-Differences under Violations of Parallel Trends, by Seong Woo Han, Nandita Mitra, Gary Hettinger and Arman Oganisian
The Labor Market Effects of Generative AI: A Difference-in-Differences Analysis of AI Exposure, by Andrew Johnston and Christos Makridis (this is an applied paper that I judged to be super relevant to the current discussion we are having about the role of AI in the labour market: is it good, is it bad, does it generate consumer surplus, who’s benefitting from it, who’s being left out, is the generative AI revolution fundamentally different from past technological shifts (like the industrial revolution or the rise of computers), and if generative AI is so powerful, why aren't we seeing a massive, economy-wide productivity boom? etc. The narratives are swinging between utopian optimism and dystopian fear, and it doesn’t have to be like that. I think you should check it out for a couple of reasons: it provides evidence that AI appears to be both creating and destroying opportunities, but in different parts of the labour market; and it directly tackles the central question of whether AI will help or replace human workers. I’m not telling you more, go read it :))
A difference-in-differences estimator by covariate balancing propensity score
TL;DR: Professors Li and Matsushita adapt the Covariate Balancing Propensity Score (CBPS) to estimate treatment effects in DiD designs. Their proposed CBPS-DiD estimator enforces covariate balance while offering double robustness for both the point estimate and its statistical inference. Tested in simulations and on the classic LaLonde dataset, it often delivers lower bias and more reliable confidence intervals than standard estimators like regression adjustment, IPW or AIPW. Its advantage is most pronounced when the underlying models are slightly misspecified (a common challenge in applied policy evaluation).
What is this paper about?
This paper by Li and Matsushita takes the Covariate Balancing Propensity Score (CBPS)1 method introduced by Imai and Ratkovic (2014) and adapts it to the DiD framework. Their focus is on estimating the ATT in situations where the PTA is unlikely to hold unconditionally but may be more plausible after conditioning on covariates2.
The authors’ contribution is to integrate CBPS into a DiD setup by showing that their “CBPS-DiD” estimator inherits many “desirable” properties. When both the outcome model and the propensity score model are specified correctly, the estimator approaches the semiparametric efficiency bound3 (meaning it makes full use of the available information in the data). When only one of the models is correct, the estimator remains consistent (doubly-robust4), and unlike the augmented inverse probability weighting (AIPW) DiD estimator, it is also doubly robust in terms of *inference* (so confidence intervals remain valid even if one model is wrong and you’re not being misled about how precise your estimates are).
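To make the double robustness concrete, here is one standard way of writing an AIPW-type DiD estimand for the two-period panel case (my notation for the generic benchmark the authors compare against, not their CBPS-DiD estimator):

$$
\tau_{\text{ATT}} = \mathbb{E}\left[\left(\frac{D}{\mathbb{E}[D]} - \frac{\frac{\pi(X)(1-D)}{1-\pi(X)}}{\mathbb{E}\left[\frac{\pi(X)(1-D)}{1-\pi(X)}\right]}\right)\left(\Delta Y - m_0(X)\right)\right]
$$

where $\Delta Y$ is the pre-to-post change in the outcome, $\pi(X)$ is the propensity score model and $m_0(X)=\mathbb{E}[\Delta Y \mid D=0, X]$ is the model for the control group’s trend. The ATT is recovered if either $\pi$ or $m_0$ is correctly specified; CBPS-DiD keeps that property, but it estimates $\pi$ so that the implied weights balance the covariates by construction.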
In the paper’s simulation exercises they also find that CBPS-DiD converges faster to the true ATT than AIPW-DiD when both models are only “slightly” misspecified (it’s a subtle but practically relevant advantage since exact model correctness is very rare in applied work).
What do the authors do?
To examine how their CBPS-DiD estimator performs, the authors run Monte Carlo simulations that let them control the “truth” and test the method under different conditions. They construct artificial datasets in which they vary whether the outcome model and the propensity score model are both correct, only one is correct, both are wrong or both are “slightly” wrong. These scenarios are important because they mimic the realities of applied work: rarely do we specify every model perfectly, and small deviations from the truth are common even with good theory AND data. For each scenario they compare CBPS-DiD to standard alternatives (outcome regression, IPW and AIPW), tracking bias, coverage probability and confidence interval length (checking whether point estimates are close to the truth and whether the measures of uncertainty are reliable).
They then move from simulations to an empirical illustration using LaLonde’s (1986) dataset, which is a “benchmark”5 in program evaluation where the true average treatment effect is known to be zero. It’s like a tough test: the dataset is well known for “tripping up” estimators, often producing spurious effects due to covariate imbalance. By applying CBPS-DiD alongside the other estimators to this data they can see how much bias each method produces when estimating an effect that should, IN THEORY, be absent. They run these comparisons both with a small covariate set and a larger one, displaying how performance changes as the dimensionality of the adjustment set grows.
Why is this important?
The PTA is one of the pillars of DiD, but in many applications it is unlikely to hold unless you adjust for pre-treatment differences in observed characteristics. The problem is that some traditional approaches (like regression adjustment or IPW) often struggle here for a variety of reasons, such as poor covariate balance or relying too much on the model being specified correctly. CBPS-DiD addresses both of these problems by a) building covariate balance into the estimation process and b) keeping the “protection” of double robustness. It also offers double robustness for inference, meaning confidence intervals remain valid even when one of the models is wrong. This matters because in observational data the covariates that matter for trends are often measured with error or modeled imperfectly. If balance is poor the ATT estimate can be biased. If inference is wrong you can be misled about the effect’s precision. CBPS-DiD reduces the risk of both problems without requiring perfect knowledge of the true data-generating process6. It is designed for the realities of applied policy evaluation where theory guides model choice but can’t guarantee correctness.
Who should care?
Applied researchers who rely on DiD in settings where the PTA is unlikely to hold without covariate adjustment and where model misspecification is a concern. It is especially relevant if you work with observational data where balance between treated and control units is hard to achieve or if you want more reliable inference when model assumptions are only partly correct. Policy evaluators, labour economists, education researchers and public health analysts working with staggered or selective treatment adoption can all benefit from a method that strengthens both balance and inference without requiring perfect models.
Do we have code?
No, the paper doesn’t provide replication files or package info. If you want to try CBPS-DiD you would need to adapt the existing CBPS framework from Imai and Ratkovic (2014) (available in the CBPS package for R) and embed it into a DiD setup following the steps in the paper. The CBPS package itself only covers cross-sectional and basic panel weighting, so this is not a “plug and play” situation for DiD. It’s doable if you’re comfortable working with both propensity score weighting and DiD estimation, but there’s no one-line function for it yet.
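If you want something to start from, here is a minimal sketch (mine, not the authors’) of the weighting logic on first-differenced outcomes in Python. It uses a plain logistic propensity score; the whole point of CBPS-DiD is to replace those fitted probabilities with covariate-balancing ones (e.g. from the CBPS package in R), but the surrounding DiD arithmetic stays the same. All names and numbers below are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                          # pre-treatment covariates
p_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
D = rng.binomial(1, p_true)                          # treatment indicator
att_true = 2.0
# first-differenced outcome: trends depend on X, plus the effect for treated units
dy = X @ np.array([1.0, -1.0, 0.5]) + att_true * D + rng.normal(size=n)

# propensity score from a plain logit; this is where CBPS weights would go instead
ps = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
w_treat = D / D.mean()                               # averages dy over treated units
w_ctrl = (1 - D) * ps / (1 - ps)                     # odds weights pull controls toward the treated covariate distribution
w_ctrl = w_ctrl / w_ctrl.mean()
att_hat = np.mean(w_treat * dy) - np.mean(w_ctrl * dy)
print(f"weighted DiD ATT estimate: {att_hat:.2f} (true effect is {att_true})")
```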
In summary, this paper extends CBPS to the DiD framework, producing an estimator that builds covariate balance into the weighting stage while maintaining double robustness, and extends that robustness to inference as well. It’s a method built for real-world policy evaluation where covariates matter for trends, models are rarely perfectly specified and you want both credible point estimates and trustworthy CIs. The simulations and the LaLonde application show it can outperform standard regression adjustment, IPW and AIPW in both bias and precision, with the advantage being most pronounced when model misspecification is mild but inevitable.
Good Controls Gone Bad: Difference-in-Differences with Covariates
(We have plots, DAGs AND flowcharts in this paper, which I love. Also lots of things in this paper follow from the previous one - check the footnotes)
TL;DR: many DiD studies add time-varying covariates assuming they help, but this only works if the CCC assumption holds (meaning covariate effects are stable across groups and time). When CCC fails, popular estimators like TWFE, CS-DiD, imputation and FLEX can produce biased treatment effects, even if PT hold without covariates. This paper formalizes CCC, shows how violations cause bias and introduces DID-INT, a new estimator that remains unbiased under CCC violations and can recover PT that are hidden by misspecified covariate adjustments.
What is this paper about?
This paper is about what happens when the relationship between your covariates and your outcome isn’t stable across time or across groups, a situation the authors formalize as the Common Causal Covariates (CCC) assumption7.
In most DiD studies, we add covariates to make the PTA more plausible or to control for other influences, but when the effect of such covariates changes over time or differs between treated and control groups, “standard” estimators (from the familiar TWFE model to modern options like Callaway–Sant’Anna, imputation and FLEX) can produce biased treatment effect estimates. Karim and Webb show that this problem is widespread, formalize three versions of the CCC assumption and demonstrate how violations cause bias in widely used estimators. They also introduce a new estimator, the Intersection Difference-in-Differences (DID-INT), which remains unbiased even when CCC fails and works in settings with staggered treatment rollout and heterogeneous treatment effects.
What do the authors do?
They do lots of stuff, so I separated the contributions into four subsections.
First, they formalize the CCC assumption
They introduce three explicit versions of the CCC assumption: two-way, state-invariant, and time-invariant, and explain how each shapes the way covariates should be handled in DiD. This step is super important because until now most methods implicitly assumed two-way CCC without stating it.
Second, they show the problem with existing estimators
Through theory and Monte Carlo simulations, they show that when CCC fails, widely used estimators (TWFE8, CS-DiD9, imputation10, FLEX11) can produce biased ATT estimates. This bias appears even if you have correctly specified the rest of your model and your data meets other standard DiD assumptions.
Third, they propose the DID-INT estimator
DID-INT is designed to adjust for covariates in a way that allows CCC to be violated, recover parallel trends hidden by misspecified covariate adjustments, work with staggered adoption and heterogeneous treatment effects, and avoid “forbidden comparisons” and negative weighting issues12. DID-INT runs in five steps, starting with a model selection algorithm that visually checks pre-trends under different ways of interacting covariates with group and time. This identifies the correct functional form *before* estimating the ATT.
Fourth, they develop a model selection algorithm
The authors provide a structured sequence to decide how to model covariates: 1) start without covariates and plot pre-trends; 2) if they fail, try covariates under the two-way CCC assumption; 3) if they still fail, interact covariates with group (state-varying DID-INT) or time (time-varying DID-INT); 4) if that fails, interact with both (two-way DID-INT); 5) if pre-trends still look implausible, no DiD method is recommended. This approach is meant to replace the common “give up when pre-trends fail” practice with a systematic check for hidden parallel trends.
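To make the “interact covariates with group and/or time” idea concrete, here is a toy regression illustration (mine, not the authors’ Julia/Stata implementation of DID-INT) of what steps 2 and 4 amount to. The simulated data deliberately violate two-way CCC: the covariate’s level and effect both shift by group and period, so the common-coefficient specification is misspecified while the fully interacted one matches the data-generating process. All variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame([(i, t) for i in range(200) for t in range(6)], columns=["unit", "time"])
df["g"] = (df["unit"] < 100).astype(int)             # treated group indicator
df["post"] = (df["time"] >= 3).astype(int)
df["d"] = df["g"] * df["post"]                       # treatment indicator
# covariate whose level AND effect differ by group and period -> two-way CCC violated
df["x"] = rng.normal(size=len(df)) + 0.5 * df["g"] + 0.2 * df["time"]
df["y"] = (1 + 0.5 * df["g"] + 0.3 * df["time"]) * df["x"] + 2.0 * df["d"] + rng.normal(size=len(df))
df["cell"] = df["g"].astype(str) + "_" + df["time"].astype(str)

# step 2: covariates under two-way CCC (a single coefficient on x) -- misspecified here
m_ccc = smf.ols("y ~ d + x + C(unit) + C(time)", data=df).fit()
# step 4: let the coefficient on x vary by group-by-period cell -- matches the DGP
m_int = smf.ols("y ~ d + x:C(cell) + C(unit) + C(time)", data=df).fit()
print("common-coefficient d:", round(m_ccc.params["d"], 2))
print("fully interacted d:  ", round(m_int.params["d"], 2))
```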
Why is this important?
Many of us include time-varying covariates in our DiD models without thinking too much about whether the relationship between those covariates and the outcome is actually “stable”. If that stability (aka the CCC assumption) fails, the resulting ATT estimates can be biased even if everything else about the design looks good13.
The problem is that CCC violations are not rare. In real datasets, covariate effects often change over time (due to shifting macroeconomic conditions, policy changes or industry composition) or vary between groups (because of structural differences in demographics, institutions or markets), and in some cases both happen.
This paper’s message is that CCC should be treated as a first-order identification issue rather than a minor technical detail. It also offers a way forward when CCC fails: DID-INT broadens the set of situations where PT can plausibly hold, which will probably reduce the number of projects that get abandoned just because pre-trends “don’t look parallel” under a misspecified covariate adjustment.
Who should care?
Lots of people (pretty much anyone running DiD models with covariates should care). Applied researchers running DiD models with covariates in repeated cross-sections or administrative datasets (this includes anyone working with survey microdata, labour force stats, education records or health data where both outcomes and covariates change over time); policy evaluators working with interventions that roll out across locations or institutions at different times, where standard estimators might be biased by shifting covariate effects; data scientists in government agencies and research institutes who produce official impact evaluations and must navigate both methodological correctness and practical constraints on variable selection.
Do we have code?
Yes, the authors say: “to ease in the implementation of the DID-INT estimator we have a package available in Julia. We also have a wrapper for Stata which calls the Julia program to perform the calculations, using the approach in Roodman (2025). The Stata program is available [here]. A wrapper in R is forthcoming. The software package allows for cluster robust inference using both a cluster-jackknife and randomization inference. The details of these routines, and their finite sample performance is discussed in the companion paper Karim et al. (2025).”
In summary, this paper reframes something many of us treat as a minor “specification choice” (adding covariates) as a core identification concern for DiD. By making the CCC assumption explicit, it shows that the stability of covariate effects is not a given and that violations can bias even modern estimators. This bias can appear whether or not PT hold unconditionally, meaning that covariates can sometimes turn a valid design into an invalid one. So rather than abandoning a project when pre-trends look implausible, the authors show how re-specifying the covariate adjustment can uncover “hidden” PT. Their DID-INT estimator (combined with a model selection algorithm) provides a structured way to test for and adjust to CCC violations while avoiding forbidden comparisons in staggered adoption settings14. For applied work, the contribution is twofold: a warning that covariates can do harm if misused and a practical method to recover unbiased estimates when a key stability assumption fails15.
Bayesian Sensitivity Analyses for Policy Evaluation with Difference-in-Differences under Violations of Parallel Trends
(Moving away from our frequentist framework…)
TL;DR: this paper presents a Bayesian sensitivity analysis for DiD when the PTA is likely violated. The approach models the size and persistence of violations with an AR(1) process and allows these parameters to be fixed, fully estimated within a Bayesian model or calibrated from pre-treatment data. The authors demonstrate how treatment effect estimates change under different assumptions about violations using beverage sales data from Philadelphia and Baltimore.
What is this paper about?
In this paper the authors examine how to use DiD when the PTA is likely violated. They propose a Bayesian sensitivity analysis that introduces a formal parameter for the size and persistence of these violations, modeled with an AR(1)16 process to capture temporal correlation. They present three ways to set the priors for this process: fixed values, fully Bayesian estimation and empirical Bayes calibration using pre-treatment data. The approach is illustrated by estimating the effect of Philadelphia’s sweetened beverage tax using Baltimore as a control city, showing how treatment effect estimates change under different assumptions about the violation.
What do the authors do?
The authors start by extending the standard DiD setup to include a term that measures how much the treated group’s trend could diverge from the control group after treatment. This deviation term is modeled with an AR(1) process, which captures both the average size of the violation and how persistent it is over time. They then specify three strategies for setting the priors on the AR(1) parameters: 1) fixed values, where the level of persistence is set in advance and only the size of the deviations varies; 2) fully Bayesian estimation, where the model learns both persistence and deviation size from the data within chosen prior ranges; and 3) empirical Bayes, where these parameters are first estimated from pre-treatment data and then used as priors in the post-treatment model. They then apply these models to monthly beverage sales data from Philadelphia (treated) and Baltimore (control) before and after the 2017 sweetened beverage tax. They also test how large the violation of PT would need to be to make the beverage tax’s effect statistically insignificant. For most of their models the required violation was *implausibly* large (e.g. for supermarkets, one model required a 980-fold increase in counterfactual sales), which strengthens their conclusion that the tax did have an effect.
Why is this important?
In the two previous papers we went on and on about violations of the PTA. It is super important, but less discussed in Bayesian settings17. When the PTA does not hold, the estimated treatment effect can reflect underlying differences between treated and control groups rather than the impact of the policy. Many studies check for PT by looking at pre-treatment data, but these tests are often underpowered and can’t guarantee that trends would have remained aligned after treatment. The Bayesian framework in this paper gives us a more structured way to relax the assumption, quantify the size and persistence of violations and see how the results change under different plausible scenarios. This makes it possible to present policy conclusions alongside a transparent assessment of how much they depend on the PTA.
Who should care?
I feel like I am repeating myself here, but this paper will interest applied researchers who use DiD and worry that the PTA may not hold, particularly in policy evaluations where treated and control groups have different pre-treatment dynamics. It is also relevant for statisticians and econometricians working on sensitivity analysis methods and for policy analysts who need to present results with a clear statement of how robust they are to key assumptions. Anyone who works with short panels, noisy outcomes or limited pre-treatment data will find the discussion of prior choices and empirical Bayes calibration really useful (I did).
Do we have code?
The paper does not share replication files, but it states that the models were implemented in PyMC (version 4.0) using the No-U-Turn Sampler, a form of Hamiltonian Monte Carlo. Since the main addition to a standard Bayesian DiD is the AR(1) process for violations, the method could be reproduced by combining a DiD setup with an AR(1) prior on the deviation term. PyMC has public examples of AR(1) time-series modeling that can serve as a starting point for implementing this framework.
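To give a sense of the structure, here is a minimal sketch (mine, not the authors’ code) of a Bayesian DiD with an AR(1) deviation term in PyMC. The data, priors and variable names are made up, and API details may differ across PyMC versions; the point is just the shape of the model: a standard DiD mean plus an AR(1) process that absorbs post-treatment violations of parallel trends.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
T_pre, T_post = 24, 12                               # months before/after the policy
time = np.arange(T_pre + T_post)
post = (time >= T_pre).astype(int)
# toy monthly outcomes for a control and a treated city
y_ctrl = 10 + 0.05 * time + rng.normal(0, 0.3, size=time.size)
y_treat = 11 + 0.05 * time - 1.5 * post + rng.normal(0, 0.3, size=time.size)

with pm.Model() as did_ar1:
    alpha = pm.Normal("alpha", 0, 5)                 # control-city level
    gamma = pm.Normal("gamma", 0, 5)                 # treated-city level shift
    beta_t = pm.Normal("beta_t", 0, 1)               # common linear trend
    tau = pm.Normal("tau", 0, 5)                     # treatment effect of interest
    sigma = pm.HalfNormal("sigma", 1)

    # AR(1) deviation from parallel trends in the post period
    rho = pm.Uniform("rho", 0, 1, shape=1)           # persistence of the violation
    sigma_delta = pm.HalfNormal("sigma_delta", 0.5)  # size of the violation
    delta = pm.AR("delta", rho=rho, sigma=sigma_delta,
                  init_dist=pm.Normal.dist(0, 0.5), shape=T_post)

    mu_ctrl = alpha + beta_t * time
    # tau and delta are only separated by the priors on rho and sigma_delta,
    # which is exactly why the prior choice drives the sensitivity analysis
    mu_treat = (alpha + gamma + beta_t * time + tau * post
                + pm.math.concatenate([np.zeros(T_pre), delta]))

    pm.Normal("obs_ctrl", mu_ctrl, sigma, observed=y_ctrl)
    pm.Normal("obs_treat", mu_treat, sigma, observed=y_treat)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)  # NUTS by default
```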
In summary, this paper adapts DiD to situations where the PTA is unlikely to hold by modeling violations directly in a Bayesian framework. The AR(1) process for the deviation term allows both the size and persistence of violations to be estimated or calibrated from pre-treatment data. The authors show that different prior choices (fixed, fully Bayesian and empirical Bayes) can lead to different conclusions about the policy effect, making sensitivity analysis an essential part of interpretation. The application to Philadelphia’s sweetened beverage tax demonstrates how this approach can make the robustness of DiD results explicit, rather than leaving it as an untested assumption.
CBPS is an alternative way of estimating the propensity score (the probability that a unit receives treatment given its observed characteristics) that directly enforces covariate balance between treated and control groups at the estimation stage, unlike the standard approach, which estimates the propensity score first (often using a logit or probit model) and then checks balance afterward. CBPS produces weights that better align the pre-treatment characteristics of treated and untreated units by making balance a “target” rather than an “afterthought”, which in turn improves the plausibility of the conditional PTA when covariates are important.
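For the curious, the core idea can be written as a moment condition. In Imai and Ratkovic’s setup the propensity score parameters $\beta$ are chosen (typically by GMM) so that weighted covariate means match across groups; for ATE-type weights the condition is

$$
\mathbb{E}\left[\frac{T_i \tilde{X}_i}{\pi_\beta(X_i)} - \frac{(1-T_i)\tilde{X}_i}{1-\pi_\beta(X_i)}\right] = 0,
$$

where $\tilde{X}_i$ is a vector of covariate functions you want balanced and $\pi_\beta$ is the parametric propensity score (the ATT version leaves the treated group unweighted). Balance is imposed by construction rather than hoped for after fitting a logit.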
For example, treated units may differ systematically from controls in ways that affect how outcomes evolve over time, such as having younger populations, higher baseline income or different industry structures; the PTA is then unlikely to hold unconditionally but may be more plausible once these differences are accounted for through covariates. In these cases, simply comparing pre and post-treatment changes between treated and untreated groups would risk conflating treatment effects with the influence of those underlying characteristics, whereas conditioning on covariates can “help” isolate the policy’s impact. Let’s think of a concrete example related to my area (Econ of Education): consider a policy that expands access to after-school tutoring but is rolled out first in schools serving lower-income communities (let’s not get into what the policymaker might have in mind such as their specific goals with the policy, whether there is even a real problem this policy would solve, or whether they have a detailed plan for measuring its effectiveness). If we just compared changes in test scores at these schools to changes in more affluent schools without the program, the trends might differ even without the policy because achievement growth often follows different trajectories across socioeconomic groups. But if we adjust for covariates (such as baseline test scores, parents’ education or school resources) the assumption that treated and control schools would have evolved in parallel becomes more plausible, though not guaranteed. Remember that misspecification is a huge issue that can arise from omitted variable bias, wrong functional form, mismeasured variables, incorrect distributional assumptions or an inappropriate fixed effects structure. If any other important factors remain unobserved (or if they can’t be measured and theory offers no guidance - aka the unobserved confounders), even conditioning on a “good” set of variables won’t guarantee that the PTA holds. The PTA is an assumption about the true data generating process. Misspecification means your model doesn’t correctly represent that process, and if it leaves out factors that actually drive differences in trends, the *conditional* PTA will still be violated even if you think you’ve adjusted for “enough” covariates. This is such a good topic, I can recommend some reading: “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature” (Roth, Sant’Anna, Bilinski, Poe, 2023); “Difference-in-differences when parallel trends holds conditional on covariates” (Caetano, Callaway, 2024); “Nothing to see here? Non-inferiority approaches to parallel trends and other model assumptions” (Bilinski, Hatfield, 2019).
In simpler terms, the semiparametric efficiency bound is the “smallest possible variance an estimator can achieve in a given setting without making strong, unrealistic assumptions” (e.g. knowing the exact functional form of the model, assuming perfectly normally distributed errors or ruling out heteroskedasticity). Hitting that bound means you’re extracting the maximum precision possible from your data under credible assumptions. The type of modeling used depends on how much information is available about the form of the relationship between the response variable and the explanatory variables, and about the distribution of the random error. Remember the difference: in a parametric model you fully specify the functional form AND the distribution of the errors (e.g. linear regression with normally distributed, homoskedastic errors), and all parameters are finite-dimensional; in a nonparametric model you make NO functional form assumptions and let the data determine the shape of the relationship (often at the cost of efficiency, since you need large samples to get precise estimates); and in a semiparametric model you specify a finite-dimensional parameter of interest (like a treatment effect) BUT leave part of the model (it can be the error distribution or the functional form of covariates) unspecified. Going back to our education examples, let’s consider estimating the effect of reducing class size on test scores (you can think of a couple of mechanisms in place). In a parametric model, you might assume a linear (up or down) relationship between class size and scores, control for other factors in a fixed way and assume the errors are normally distributed and homoskedastic. In a nonparametric model, you would make no assumption about the shape of the relationship, letting the data reveal it (by having lots of fun plotting the data), but you would need a much larger sample to get a precise answer (which we in applied micro often don’t have). In a semiparametric model, you could focus on estimating the ATE of small classes while leaving the relationship between other covariates (like parents’ income or teacher experience) and test scores unspecified, which gives you a bit more flexibility while retaining more precision than a fully nonparametric approach.
“In a missing data model [where the missingness mechanism can be MCAR, MAR or MNAR], an estimator is doubly robust (DR) or doubly protected if it remains consistent when either a model for the missingness mechanism or a model for the distribution of the complete data is correctly specified. In a causal inference model, an estimator is DR if it remains consistent when either a model for the treatment assignment mechanism or a model for counterfactual data is correctly specified. Because of the frequency and near inevitability of model misspecification, double robustness is a highly desirable property”. (Bang, Robins, 2005)
Check “LaLonde (1986) after Nearly Four Decades: Lessons Learned” (Imbens, Xu, 2024) for a look at the LaLonde dataset.
Even with better balance and double robustness, CBPS-DiD (like any DiD estimator) can’t account for unobserved confounders that affect trends differently between treated and control groups. The conditional PTA must still hold for the set of observed covariates you include.
This assumption says that the effect of each covariate on the outcome is constant either across time, across groups or both. There are three forms: 1) two-way CCC, which is constant across time and groups; 2) state-invariant CCC, which is constant across groups but can vary over time; and 3) time-invariant CCC, which is, as you can guess by the name, constant over time but can vary between groups. Violation means that the “adjustment” provided by covariates is not the same for all observations, which leads to bias. Think of it as a recipe for a cake that works the same way and produces the same result whether you bake it in California or New York (it’s state-invariant), and whether you bake it today or in five years (it’s time-invariant). The recipe is “stable”.
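A hedged way to write this down (my notation, not necessarily the paper’s): suppose the outcome model includes covariates whose coefficients may depend on group $g$ and period $t$,

$$
\mathbb{E}[Y_{igt} \mid X_{igt}] = \alpha_g + \lambda_t + \tau D_{gt} + X_{igt}'\beta_{gt}.
$$

Two-way CCC imposes $\beta_{gt} = \beta$ for all $g$ and $t$; state-invariant CCC imposes $\beta_{gt} = \beta_t$ (common across groups, free over time); time-invariant CCC imposes $\beta_{gt} = \beta_g$ (common over time, free across groups). Standard covariate-adjusted estimators implicitly set $\beta_{gt} = \beta$.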
TWFE “suffers” from negative weighting and forbidden comparisons in staggered designs (Goodman-Bacon, 2021).
Callaway–Sant’Anna avoids forbidden comparisons by estimating group-by-time ATTs and aggregating, and it defaults to a doubly robust DiD approach that still assumes CCC when covariates are time-varying.
Imputation is an approach that predicts untreated counterfactual outcomes for treated units using a model fit on controls, then takes differences. It can use time-varying covariates but still assumes CCC holds.
FLEX is a flexible regression specification that allows for time-varying covariates but, like the others, does not address CCC violations.
In staggered adoption some treated units serve as controls for others, which introduces bias when treatment effects are heterogeneous. Negative weights mean some group-time effects get subtracted in the aggregation.
Even in designs where PT unconditionally hold, adding covariates that violate CCC can create bias that wouldn’t otherwise exist. Covariates are not always “safe” to include and bad covariates can turn a valid design into an invalid one.
The paper makes it clear that while DID-INT is unbiased even when CCC fails, this robustness comes at the cost of efficiency. The estimates from DID-INT will have higher variance (i.e. wider CIs) than a standard estimator. We are familiar with this trade-off: gaining accuracy at the cost of some precision.
Given the bias-variance trade-off, the authors suggest a practical path forward: researchers can either use the model selection algorithm to find the most efficient (parsimonious) model that satisfies PT or simply default to the two-way DID-INT estimator, which is unbiased across all potential CCC violation scenarios - albeit at the cost of statistical power.
Macro friends will get this one hehe ;) AR(1) stands for autoregressive process of order 1: it’s a way to model how a value at one time point depends on its own value in the previous period plus some random noise. Here it means that deviations from PT are assumed to change gradually over time rather than jumping around randomly. Broader explanations here and here.
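In symbols (my notation): if $\delta_t$ is the deviation from parallel trends in post-treatment period $t$, an AR(1) says

$$
\delta_t = \rho\,\delta_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma_\delta^2),
$$

so $\rho$ close to 1 means a violation that persists and compounds over time, while $\rho$ near 0 means deviations that fade out almost immediately.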
Bayesian methods are used here because they make it straightforward to incorporate prior beliefs about the size and persistence of violations into the model and to quantify uncertainty about those violations in a coherent way. In a frequentist framework, sensitivity analyses require separate runs for each assumed violation level and they *do not* produce a single posterior distribution that integrates over uncertainty about these parameters. The Bayesian approach can treat the violation parameters as random variables (which are neither variable nor random, iykyk), update their distributions with the data and directly propagate that uncertainty into the treatment effect estimates.