DiD+: Handling Data Constraints, Timing Issues and TWFE Limitations
What to do when data can't be pooled, timing gets messy, and TWFE isn't enough
(Claude creates the image prompt based on the post and it gets crazier by the day. I appreciate it tho)
Hi!
Before we start today’s post, I’d like to recommend “Omitted Variables versus Replicating the RCT” by prof Scott. For those of you who don’t know, the prof Rubin discussed there is the Rubin behind the Rubin causal model1. He’s one of the founding figures of modern causal inference and introduced the potential outcomes framework that now shows up in every discussion of treatment effects, from RCTs to DiD models. Also, we got a mention on my friend Sam’s post, thanks! To conclude the introduction: I know this is not the purpose of this newsletter, so I avoid adding links not directly related to the post (for useful ones I have an entire section on my website), but I found, via LinkedIn, professor Vladislav Morozov’s Advanced Econometrics course, and it is reeeeally good and worth checking out :)
The content we will be covering today is:
Difference-in-Differences with Unpoolable Data, by Sunny Karim, Matthew D. Webb, Nichole Austin, and Erin Strumpf
Staggered Adoption DiD Designs with Misclassification and Anticipation, by Clara Augustin, Daniel Gutknecht, and Cenchen Liu
Interactive, Grouped and Non-separable Fixed Effects: A Practitioner's Guide to the New Panel Data Econometrics, by Jan Ditzen and Yiannis Karavias
And there are three “applied” papers I won’t go through but that might be interesting for you to read:
Using did_multiplegt_dyn in Stata to Estimate Event-Study Effects in Complex Designs: Four Examples Based on Real Datasets, by Clément de Chaisemartin and Bingxue Li (I found this on LinkedIn; it’s a guide that shows you how to estimate dynamic treatment effects in complex DiD designs using the `did_multiplegt_dyn` Stata command, with real examples involving staggered, continuous, and multi-treatment settings)
Re-Assessing the Impact of Brexit on British Fertility Using Difference-in-Difference Estimation, by Ross Macmillan and Carmel Hannan (this paper offers a forensic re-assessment of a published study, showing how a statistically significant “Brexit effect” on fertility is entirely an artifact of inappropriate control group selection. It provides a practical template for how rigorously checking pre-treatment parallel trends can completely reverse a study’s conclusions. Read it if you, like me, love a good discussion on internal validity).
Emerging Techniques for Causal Inference in Transportation: Integrating Synthetic Difference-in-Differences and Double/Debiased Machine Learning to Evaluate Japan’s Shinkansen, by Jingyuan Wang, Shintaro Terabe and Hideki Yaginuma (this paper has all the buzzwords I like + it’s about Japan and trains :-) What to do when your treated units are unique megacities like Tokyo or Osaka? The authors show how traditional DiD can give unstable or biased answers and instead use both Synthetic DiD and Double/Debiased Machine Learning to evaluate Japan’s bullet train network. They argue that the real power comes from seeing these two completely different, advanced methods point to the same conclusion, which is a great way to build confidence in your findings).
Difference-in-Differences with Unpoolable Data
(This one hit close to home)
TL;DR: standard DiD often requires merging datasets, but privacy rules sometimes forbid this. This paper introduces UN-DID, a method that works around this by calculating changes within each secure dataset first. You then combine just the summary results (not the sensitive data) to get a valid treatment effect. The authors provide code to do it.
What is this paper about?
This paper is about how to estimate treatment effects using DiD when data from treatment and control groups cannot be combined. In many applications, we have to work with administrative data that is stored in separate secure environments, such as different provinces, agencies or countries. Legal or privacy restrictions may prevent pooling these datasets into a single file for analysis. The authors propose a method called UN-DID that allows us to estimate DiD models without needing to combine datasets. It works by estimating pre-post differences separately within each “silo”, then combining these differences externally to recover the ATT. The method supports covariates, multiple groups, staggered adoption and cluster-robust inference, making it suitable for a wide range of applied settings where conventional DiD is infeasible due to data restrictions.
What do the authors do?
They start with the simple 2x2 case and show that UN-DID gives the same estimate as conventional DiD when there are no covariates; they then extend the method to allow for covariates, staggered treatment timing and multiple jurisdictions.
UN-DID works by running separate regressions in each data silo to estimate within-group pre-post differences. These differences, along with their standard errors and covariances, are extracted and combined outside the silos to calculate the treatment effect. The method remains valid even when the effect of covariates differs across silos, a setting where conventional DiD may be biased.
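To make the mechanics concrete, here’s a minimal sketch of the logic in Python (a toy of my own with simulated data and made-up numbers, not the authors’ packages): each silo exports only a pre/post difference and its variance, and the ATT is assembled outside the silos.

```python
# Toy sketch of the UN-DID logic (not the authors' implementation): each silo
# computes its own pre/post mean difference and variance, shares only those
# two summary numbers, and the ATT is assembled outside. Data are simulated.
import numpy as np

rng = np.random.default_rng(42)

def silo_summary(pre, post):
    """Runs inside a silo: returns only the pre/post difference and its variance."""
    diff = post.mean() - pre.mean()
    var = post.var(ddof=1) / len(post) + pre.var(ddof=1) / len(pre)
    return diff, var

# Simulated outcomes: the treated silo gets a true effect of 3 in the post period
treated_pre  = rng.normal(10, 2, 1000)
treated_post = rng.normal(10 + 1 + 3, 2, 1000)   # common shock (+1) plus effect (+3)
control_pre  = rng.normal(8, 2, 1000)
control_post = rng.normal(8 + 1, 2, 1000)        # common shock only

# Each summary could be computed behind its own firewall and exported as two numbers
d_treat, v_treat = silo_summary(treated_pre, treated_post)
d_ctrl,  v_ctrl  = silo_summary(control_pre, control_post)

att = d_treat - d_ctrl              # difference of the within-silo differences
se  = np.sqrt(v_treat + v_ctrl)     # silos are independent, so variances add
print(f"ATT = {att:.2f} (SE {se:.2f})")   # close to the true effect of 3
```

With covariates, the within-silo step becomes a regression and you export coefficients and their covariances instead of raw means, which is the direction the paper takes.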
They support their method with formal proofs, Monte Carlo simulations and two empirical applications using real-world data. In both examples they take pooled datasets and artificially treat them as unpoolable to show that UN-DID produces results that closely match those from standard DiD even in the presence of covariates and staggered adoption.
Why is this important?
Many administrative datasets like health records, tax files, or student registers are stored in secure environments with privacy protections that prohibit exporting or combining data across jurisdictions. These rules are designed to prevent the risk of re-identifying individuals, and as a result, we can’t pool treatment and control data into a single file, which is what conventional DiD methods require.
UN-DID makes it possible to estimate DiD effects in these settings without violating data-sharing rules. UN-DID works within the boundaries of legal and ethical data use while still enabling credible policy evaluation. The method also handles situations where covariates affect outcomes differently across silos, which helps avoid bias that can affect standard DiD models. For example, the authors note that researchers have been unable to evaluate policies like Canada’s national cannabis legalization because they couldn’t combine provincial data. UN-DID is designed for precisely this kind of setting.
Who should care?
You, me, and anyone who works with administrative data/records behind a government “paywall”: health policy analysts, education researchers, and anyone else using secure data that can’t be exported or merged. UN-DID also matters for applied researchers dealing with staggered adoption or treatment heterogeneity where credible counterfactuals lie in datasets they can’t directly combine. It gives them a way to recover treatment effects without weakening their design or dropping research questions entirely due to data restrictions.
Do we have code?
To everybody’s happiness, yes. The authors provide UN-DID software packages in R, Python, Stata, AND Julia (!) which are described in Section 7.3 of the paper. These tools are designed to work within the constraints of siloed data environments, meaning each analyst can run their part locally, and only summary estimates (like pre-post differences and standard errors) need to be exported. A separate user guide with a full empirical example using real-world siloed data is forthcoming.
In summary, UN-DID is a very practical extension of DiD for settings where legal or privacy rules prevent combining datasets across jurisdictions. It produces valid treatment effect estimates by working within each silo and then combining results externally. The method supports covariates, staggered timing, and heterogeneous effects, and performs well in both simulations and real data. For researchers working with administrative data that can’t be pooled, it offers a way to keep using DiD without compromising identification (or breaking the law).
Staggered Adoption DiD Designs with Misclassification and Anticipation
TL;DR: when treatment dates are misclassified or people anticipate a policy before it officially starts, standard staggered DiD estimators produce biased results. This paper offers a solution by introducing new, bias-corrected estimators that account for these issues and provides a diagnostic test to detect the presence and timing of such misspecification in the data.
What is this paper about?
This paper studies two issues that many of us face when using staggered DiD: misclassification of treatment timing and anticipation effects. Such problems aren’t new or even overlooked, but in practice we often do the best we can with the data we have, knowing damn well that policy implementation may not be cleanly observed or that people can react to treatment announcements before the official start date. What’s been missing is a formal framework to understand how these issues threaten identification and how to adjust for them.
Staggered adoption designs are now widely used in policy evaluation. They extend the basic DiD setup to cases where groups are treated at different times. Recent papers, like Callaway and Sant’Anna (2021) and de Chaisemartin and D’Haultfoeuille (2020), have improved how we deal with treatment effect heterogeneity and timing, but they assume that treatment dates are correct and that outcomes don’t change before treatment starts.
This paper builds on that work and shows what goes wrong when those assumptions don’t hold. It explains how misclassification and anticipation can distort the comparisons made in staggered DiD and how that leads to biased estimates. It also connects to other research on errors in treatment status, like Lewbel (2007) and Denteh and Kedagni (2022), and to recent work on checking pre-trends before running DiD, such as Rambachan and Roth (2023).
In short, the paper shows that when treatment is misclassified or anticipated, even the best available DiD estimators may not be reliable2 unless we adjust for those issues, and it proposes ways to both detect and correct the resulting bias.
What do the authors do?
The authors make three main contributions. First, they formally characterize the bias introduced by misclassification and anticipation in staggered DiD. They show that TWFE and other estimators aren’t “reliable” when the timing or incidence of treatment is misspecified (even under homogeneous treatment effects). This happens because the estimators end up using units as untreated controls when they are in fact already responding to treatment (or vice versa), creating what the authors call “forbidden comparisons”.
Second, they propose bias-corrected estimators that recover causal effects under these conditions. One version adjusts an existing staggered estimator (from de Chaisemartin and D’Haultfoeuille, 2020) to account for misclassified timing. Another targets the ATE among the actually treated units, even when actual treatment dates aren’t observed, by imposing a weak homogeneity assumption on misclassification probabilities; this lets them estimate how many units were really treated at each point in time and then adjust accordingly.
Third, they introduce a two-step testing procedure to detect the presence, extent, and timing of misspecification. The first step checks whether the parallel trends assumption (PTA) holds using standard pre-trend comparisons. If it passes, the second step compares outcomes across switchers and not-yet-switched groups in earlier periods to flag potential misclassification or anticipation. These tests help determine when the bias corrections are needed and whether the assumptions behind them are plausible.
They support all of this with simulation evidence and an application to the rollout of a computer-based testing system in Indonesian schools. Interestingly, their results suggest that schools were adjusting behaviour even before officially adopting the system, and that failing to account for this leads to underestimating the true treatment effect.
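Just to see the mechanics, here’s a toy simulation (entirely mine, with made-up numbers; not the authors’ estimators or data): units respond partially one period before the official date, so a DiD dated at the official start underestimates the effect, while a pre-period comparison in the spirit of their second test step flags the problem.

```python
# Toy anticipation example (mine, not from the paper): the treated group starts
# responding one period before the official adoption date. A DiD dated at the
# official start then underestimates the effect; a pre-period placebo flags it.
import numpy as np

rng = np.random.default_rng(0)
n, periods = 2000, 6
official_start = 4                              # policy officially starts in period 4
full_effect, anticipation_effect = 2.0, 1.0     # partial response one period early

unit_fe = rng.normal(0, 1, size=(n, 1))
time_fe = 0.5 * np.arange(periods)
treated = rng.random(n) < 0.5

# True treated path: nothing before t=3, partial effect at t=3, full effect from t=4
effect_path = np.zeros(periods)
effect_path[official_start - 1] = anticipation_effect
effect_path[official_start:] = full_effect

y = unit_fe + time_fe + rng.normal(0, 1, size=(n, periods))
y[treated] += effect_path

def did(y, treated, pre, post):
    """Simple 2x2 DiD: treated-minus-control change from period `pre` to `post`."""
    delta = y[:, post] - y[:, pre]
    return delta[treated].mean() - delta[~treated].mean()

naive   = did(y, treated, pre=3, post=4)   # dated at the official start: ~1.0, too small
placebo = did(y, treated, pre=2, post=3)   # pre-period check: ~1.0, should be ~0
wider   = did(y, treated, pre=2, post=4)   # skip the anticipation period: ~2.0

print(f"naive DiD at official date : {naive:.2f}")
print(f"pre-period placebo         : {placebo:.2f}  (nonzero -> anticipation)")
print(f"DiD skipping the anticipation window: {wider:.2f}")
```

The last line is not their bias correction, just a reminder that once you know where the anticipation window sits, you can date the comparison around it; the paper’s estimators do this in a much more general staggered setting.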
Why is this important?
Staggered adoption designs are popular in applied economics for estimating treatment effects when policies roll out at different times. They handle heterogeneity and exploit timing variation, but rely on hard-to-verify assumptions.
Two key challenges are misclassification and anticipation. Misclassification happens when recorded treatment timing doesn’t match actual implementation due to measurement error or delayed adoption. Anticipation occurs when units respond before treatment formally begins, perhaps due to advance announcements. Both distort the comparisons staggered DiD depends on, biasing even sophisticated estimators designed for treatment heterogeneity.
These problems appear often in practice (you might have come across them yourself). Studies show institutional behaviour shifting before formal implementation (Bindler and Hjalmarsson, 2018), misalignment between administrative eligibility and actual responses (Goldschmidt and Schmieder, 2017), and anticipatory adjustments before official adoption (Berkhout et al., 2024). In each case, recorded dates poorly reflect actual exposure or behavioural change. So it’s welcome to see this paper address how timing misspecification creates biased comparisons, and then go further by providing diagnostic tools and corrected estimators. Rather than treating timing as a secondary issue, it integrates these concerns directly into identification and estimation, and this is particularly valuable for researchers using administrative or policy data where treatment definitions are inherently noisy (aren’t they all, to some extent?).
Who should care?
Anyone using staggered DiD designs with administrative or policy data should look this one up. That includes applied researchers studying policy reforms, education interventions, labour regulations or any other setting where treatment doesn’t arrive cleanly and uniformly. If your treatment variable comes from government records, institutional reports, or eligibility rules (and not from a randomized rollout) then the risk of misclassification or anticipation is probably already in your data. This is also relevant if you’re working with event study plots and interpreting pre-trends. The paper shows that violations from misclassified timing or early behavioral responses can sneak into the pre-treatment window and look like non-parallel trends even when the actual causal structure is sound. So if you’ve ever seen a bumpy pre-trend and wondered whether to throw out your design, this gives you a reason to pause and test before you abandon the setup.
Do we have code?
Not yet. As of the first version of the arXiv post (July 2025), there’s no replication repository linked, and the paper doesn’t mention any software package or implementation details. The estimators and tests are well defined and nothing looks too difficult to implement, but it would be helpful to have code or even a worked example. If that shows up in a later version or companion repo, I’ll update this post.
In summary, this paper strengthens the foundation of staggered DiD by tackling something most of us have seen but rarely model directly: the gap between recorded treatment timing and actual exposure. It formalizes how misclassification and anticipation distort group comparisons, shows that even advanced estimators are vulnerable and offers tools to detect and correct those problems. The framework is useful, the diagnostics are intuitive and the estimators are practical. If you work with policy data where timing is a bit messy (which is most of it) this is worth your attention.
Interactive, Grouped and Non-separable Fixed Effects: A Practitioner's Guide to the New Panel Data Econometrics
(Speaking of TWFE, prof Jeffrey was on about it today here, and prof Scott’s 3 latest posts also discuss it here and here and here. You have a lot of reading to do, I don’t make the rules :-)).
TL;DR: this paper is a practical guide arguing that the standard TWFE model is often too restrictive for modern panel data, which can lead to biased results when unobserved characteristics have complex, time-varying impacts. It provides an accessible overview of more flexible methods like interactive, grouped, and non-separable fixed effects, and shows through empirical examples that these newer estimators often fit the data better and can significantly change your conclusions.
What is this paper about?
This is not a “paper” per se but a practical guide to the new generation of panel data models that better capture unobserved heterogeneity. The standard TWFE approach assumes that unobserved traits are either fixed over time or shared across units in a given period. But this often misses how units might respond differently to common shocks or how the value of unobserved characteristics can evolve.
The authors walk us through 3 more flexible frameworks: interactive fixed effects (IFE), which let unobserved unit traits interact with time-varying shocks; grouped fixed effects (GFE), which cluster units into latent groups that share patterns over time; and non-separable two-way (NSTW) models, which allow for nonlinear interactions between units and periods. So rather than introducing new theory, the paper focuses on helping us understand when and how to use these models. It covers the main estimators, lays out key assumptions, compares performance and discusses how to test which model fits best. It also includes two empirical examples and a software appendix to make the methods easier to apply.
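Roughly, and in my own simplified notation rather than anything lifted from the guide, the three setups relax the additive TWFE error structure like this:

```latex
\begin{aligned}
\text{TWFE:} \quad & y_{it} = x_{it}'\beta + \alpha_i + \delta_t + \varepsilon_{it} \\
\text{IFE:}  \quad & y_{it} = x_{it}'\beta + \lambda_i' f_t + \varepsilon_{it} \\
\text{GFE:}  \quad & y_{it} = x_{it}'\beta + \alpha_{g(i),t} + \varepsilon_{it}, \qquad g(i) \in \{1, \dots, G\} \\
\text{NSTW:} \quad & y_{it} = x_{it}'\beta + h(\alpha_i, \delta_t) + \varepsilon_{it}
\end{aligned}
```

TWFE is the special case where the interactive term λ_i′f_t collapses to α_i + δ_t, and that additivity is exactly the restriction the newer estimators relax.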
What do the authors do?
(They do a lot and I learned so many new terms by reading this paper)
They provide a very structured overview of the main estimation methods for modeling unobserved heterogeneity in panel data using interactive, grouped and non-separable FE. Again, they don’t propose a new estimator but consolidate and explain a wide set of recent contributions (many of which are *technically* demanding) and translate them into accessible terms for applied people. They cover how to estimate models with IFE, where unobserved unit characteristics interact with time-varying common shocks. For that, they walk through estimation techniques like:
iterative least squares (ILS), which alternates between estimating factors and coefficients (there’s a bare-bones sketch of this one below);
penalized least squares (PLS), which uses adaptive LASSO to select the number of factors;
nuclear norm regularization (NNR), which reframes factor estimation as a convex optimization problem;
IV approaches, which include two-step IV estimators and GMM estimators that avoid estimating incidental parameters;
and common correlated effects (CCE), which “sidestep” direct factor estimation by using cross-sectional averages as proxies.
They also explain how GFE can be estimated by clustering units into latent groups, reducing dimensionality and avoiding bias when the number of time periods is limited. Finally, they discuss non-separable models that go beyond linear IFE structures, allowing for flexible, possibly nonlinear interactions between units and time effects.
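Here’s that bare-bones ILS sketch (my own simplification with simulated data, not the paper’s code or any of the packages mentioned later): OLS for the slope, then an SVD of the residuals to extract the factor structure, and repeat.

```python
# Bare-bones iterative least squares (ILS) for interactive fixed effects:
# alternate between (1) extracting lambda_i' f_t from the residuals via SVD and
# (2) re-estimating beta by OLS on the defactored outcome. A sketch only; real
# implementations add bias corrections and ways to choose the number of factors.
import numpy as np

def ils_ife(Y, X, n_factors, n_iter=200, tol=1e-8):
    """Y: (N, T) outcomes; X: (N, T, K) regressors. Returns (beta, factor_part)."""
    N, T, K = X.shape
    Xmat = X.reshape(N * T, K)
    beta = np.linalg.lstsq(Xmat, Y.reshape(-1), rcond=None)[0]   # pooled-OLS start
    factor_part = np.zeros((N, T))
    for _ in range(n_iter):
        # Step 1: given beta, estimate the factor structure from the residuals
        resid = Y - X @ beta
        U, s, Vt = np.linalg.svd(resid, full_matrices=False)
        factor_new = (U[:, :n_factors] * s[:n_factors]) @ Vt[:n_factors]
        # Step 2: given the factor structure, re-estimate beta by OLS
        beta_new = np.linalg.lstsq(Xmat, (Y - factor_new).reshape(-1), rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta, factor_part = beta_new, factor_new
        if converged:
            break
    return beta, factor_part

# Simulated panel where the single regressor is correlated with the factors,
# so ignoring the factor structure biases the slope (true value = 1.0)
rng = np.random.default_rng(1)
N, T = 100, 40
lam, f = rng.normal(size=(N, 2)), rng.normal(size=(T, 2))
factors = lam @ f.T
X = (0.5 * factors + rng.normal(size=(N, T)))[:, :, None]
Y = 1.0 * X[:, :, 0] + factors + rng.normal(size=(N, T))

beta_pooled = np.linalg.lstsq(X.reshape(-1, 1), Y.reshape(-1), rcond=None)[0]
beta_ife, _ = ils_ife(Y, X, n_factors=2)
print(beta_pooled, beta_ife)   # pooled OLS is biased upward; ILS lands near 1.0
```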
The good thing is that throughout the paper they emphasize the trade-offs: whether you need to estimate the number of factors, how estimation bias behaves in small samples, and which estimators work best when regressors are endogenous, when N and T are large (I never had this issue hehe) or when models include lags.
They back this up with two empirical applications, one looking at the inflation-growth nexus and another reassessing the Feldstein-Horioka puzzle in international finance, to show that newer models like NSTW can give quite different (and more plausible) results than standard TWFE.
Why is this important?
This is an important guide because most applied people still rely on (or just really like) TWFE even when the assumptions behind it don’t hold (not our fault!). TWFE assumes that unobserved characteristics are either constant over time (like geography or ability) or affect all units the same way in a given year (like a national shock), but in many panels (like when T is large) these assumptions are too restrictive. Unobserved factors often evolve over time, and different units may respond to the same shock in different ways. If we *ignore* this (sometimes we don’t have an option) and keep using TWFE, the estimates can be biased and misleading, which means we might be drawing the wrong conclusions from the data. And in many cases, even using IFE isn’t enough: the paper shows that nonlinear and group-based models often fit the data better. This guide matters because it lowers the barrier to using these newer methods: it explains what each estimator assumes, when it “breaks down” and how to choose the right “model” (prof Jeffrey might have an issue with this nomenclature). It also shows that applying these methods doesn’t have to be intimidating! There are diagnostics, workflows and open-source tools that make it manageable.
Who should care?
Anyone working with panel data where unobserved heterogeneity might evolve over time or affect units differently, and that includes applied micro people using survey or firm-level data, macro people modeling country or regional dynamics and financial people studying markets or portfolios over time.
If you’re running panel regressions with many time periods or even just worried that your unobserved variables aren’t neatly time-invariant or uniform across units, this guide gives you the tools to do better. It’s also useful if you’re dealing with potential cross-sectional dependence, incidental parameter bias or unclear model fit and want diagnostics that go beyond residual plots. Even if you don’t plan to use these estimators like right now, understanding them helps you interpret existing empirical work more critically and helps you avoid relying on TWFE out of “habit” (or convenience).
Do we have code?
The paper includes a super helpful appendix listing open-source implementations for most of the methods discussed, available in both Stata and R, with some support in Matlab as well. For example: for CCE, you can use the popular `xtdcce2` command in Stata or the `plm` package in R; for the iterative IFE estimator, there's `regife` in Stata and `INTERFE` in R; for GFE, the authors point to code available on Stéphane Bonhomme's webpage; and for the newer NSTW models, you can find `PCLUSTER` in R. Some of the very latest estimators may require more hands-on coding, but the paper points to companion codebases and replication files for their applications. It’s definitely worth checking out!
In summary, this is a practical and well-organized guide to modern fixed effects models for panel data. It doesn’t introduce new estimators, but it brings together a wide set of tools that applied people should know about, more so if they’re still relying on TWFE.
For years researchers used TWFE as the default “DiD” estimator, even in staggered settings. It wasn’t until Goodman-Bacon (2021), Sun and Abraham (2021) and others showed it could mix already-treated units into control groups (the negative weights!) that we realized it was doing more than we thought. Now most new work tries to avoid these comparisons but legacy papers and habits still linger.
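And since the negative weights come up every time TWFE is mentioned, here’s the smallest staggered example I could put together (mine, deliberately noise-free, not taken from any of the papers above) of what that mixing does:

```python
# Two cohorts, treatment effects that grow with time since adoption, and every
# true effect strictly positive. TWFE still returns a negative coefficient,
# because the already-treated early cohort ends up serving as a "control".
import numpy as np

periods = 6

def outcome(adopt):
    """Outcome path for a cohort adopting at `adopt`; effect is 1 + 2k at event time k."""
    return np.array([1.0 + 2 * (t - adopt) if t >= adopt else 0.0
                     for t in range(periods)])

Y = np.r_[outcome(1), outcome(4)]                   # early cohort, then late cohort
D = np.r_[[float(t >= 1) for t in range(periods)],
          [float(t >= 4) for t in range(periods)]]  # treatment indicators
unit = np.repeat([0, 1], periods)
time = np.tile(np.arange(periods), 2)

# TWFE design: treatment dummy plus unit and time dummies
X = np.column_stack([D,
                     (unit[:, None] == np.arange(2)).astype(float),
                     (time[:, None] == np.arange(periods)).astype(float)])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
print("TWFE coefficient:", round(beta[0], 2))       # -1.0, even though all effects > 0

# A clean building block in the spirit of the newer estimators: the early cohort
# at adoption, with the not-yet-treated late cohort as the control
clean = (outcome(1)[1] - outcome(1)[0]) - (outcome(4)[1] - outcome(4)[0])
print("Clean 2x2 at adoption:", clean)              # 1.0, the true effect at event time 0
```

Tiny and artificial, but it’s the quickest way I know to see why the newer estimators (and all three papers above, in their own ways) insist on being careful about who gets compared to whom.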