<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[DiD Digest]]></title><description><![CDATA[A curated digest of the latest developments in Difference-in-Differences methodology, featuring new papers, applications and insights from leading researchers. Suggestions welcome!]]></description><link>https://www.diddigest.xyz</link><image><url>https://substackcdn.com/image/fetch/$s_!qf1q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cb58cbd-6c9d-47bc-9e14-ed1a2a92c226_538x538.png</url><title>DiD Digest</title><link>https://www.diddigest.xyz</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 08:20:19 GMT</lastBuildDate><atom:link href="https://www.diddigest.xyz/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Beatriz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[diddigest@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[diddigest@substack.com]]></itunes:email><itunes:name><![CDATA[Beatriz Gietner]]></itunes:name></itunes:owner><itunes:author><![CDATA[Beatriz Gietner]]></itunes:author><googleplay:owner><![CDATA[diddigest@substack.com]]></googleplay:owner><googleplay:email><![CDATA[diddigest@substack.com]]></googleplay:email><googleplay:author><![CDATA[Beatriz Gietner]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The estimate is in the design ]]></title><description><![CDATA[On stacking, discrete outcomes, matched inference, and data-driven controls]]></description><link>https://www.diddigest.xyz/p/the-estimate-is-in-the-design</link><guid isPermaLink="false">https://www.diddigest.xyz/p/the-estimate-is-in-the-design</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 30 Mar 2026 14:36:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iVR_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iVR_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iVR_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!iVR_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!iVR_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!iVR_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iVR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2366185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.diddigest.xyz/i/191847251?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iVR_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!iVR_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!iVR_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!iVR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb923c8e-b365-4284-bda7-3667e3ddffdf_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! Apologies for the disappearance! My thesis is due in a month so I was/am busy with that. There are 10 papers to catch up on, so we will start with four of them today and hopefully I can write another post next week.</p><p>I picked these first four because I see an overall theme in them: DiD looks fairly straightforward in theory, but irl applications tend to make the assumptions do a lot of work, and these papers all show, in different ways, how design, outcome structure, and inference determine what your estimates are identifying in practice. So today&#8217;s post is based on:</p><ol><li><p><a href="https://www.nber.org/papers/w32054">Stacked Difference-in-Differences</a>, by Coady Wing, Seth M. Freedman and Alex Hollingsworth</p></li><li><p><a href="http://arxiv.org/abs/2603.07914">Event-Study Designs for Discrete Outcomes under Transition Independence</a>, by Young Ahn and Hiroyuki Kasahara</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6151287">Inference for Matched Difference-in-Differences</a>, by Mijeong Kim and Mingue Park</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5956311">Parallel Trends Forest: Data-Driven Control Sample Selection in Difference-in-Differences</a>, by Yesol Huh and Matthew Vanderpool Kling</p></li></ol><div><hr></div><h3>Stacked Difference-in-Differences </h3><p><em>( This is for Prof Khoa :) Also there are lots of info in the footnotes because this topic is important - not that the others aren&#8217;t but I feel like applied people may come across this sort of approach/design-based implementation or want to apply it themselves more often than some of the other methods we covered. So just for you to remember, stacked DiD is a regression-based approach used in staggered adoption settings, where different units receive treatment at different times<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. It became popular as researchers moved away from standard two-way fixed effects (TWFE) in settings with heterogeneous treatment effects, since TWFE can compare already-treated units with later-treated units in ways that don&#8217;t recover a clear causal average<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.)</em></p><h5>TLD;DR: stacked DiD became popular in staggered adoption settings because it avoids the late-versus-early treated comparisons that caused problems for standard TWFE. This paper shows that the usual unweighted stacked regression still doesn&#8217;t have a clear causal interpretation because treatment and control trends are combined with different implicit weights across sub-experiments. The authors define a target parameter - the trimmed aggregate ATT - and show that a weighted stacked estimator lines up with it. The result is a more defensible stacked DiD approach for applied work, with code provided for the weighting procedure.</h5><p><em><strong>What is this paper about?</strong></em></p><p>This paper studies the performance of stacked DiD estimators<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> in staggered adoption settings. 
The authors take a trimmed aggregate ATT<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> as the target parameter and ask whether stacked regressions recover it. Their main point is that the most basic stacked estimator doesn&#8217;t identify that target because it places different implicit weights on treatment and control trends across sub-experiments. As a result, even if each individual sub-experiment satisfies the standard DiD assumptions, the pooled stacked regression<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> can&#8217;t be given a clean causal interpretation as an ATT. The paper then derives corrective sample weights and proposes a weighted stacked estimator that does identify the trimmed aggregate ATT.</p><p><em><strong>What do the authors do?</strong></em></p><p>The paper moves in a fairly clear sequence. First, the authors define the parameter they want to study: the trimmed aggregate ATT, which is a weighted average of group-time treatment effects over a trimmed set of adoption cohorts. The trimming step is there to keep the composition of the event-study sample stable across the chosen pre- and post-treatment window so that movements in the estimates over event time are not driven by different cohorts entering and leaving the average.</p><p>They then turn to the standard unweighted stacked regression and ask whether it recovers that target parameter. Their answer is no :) The problem is not that the individual sub-experiments are invalid on their own. The problem is that, once they are pooled into one stacked regression, treatment and control trends are combined with different implicit weights across sub-experiments. Because of that the basic stacked estimator doesn&#8217;t identify the trimmed aggregate ATT, or any other average causal effect in general.</p><p>The next step is the main constructive part of the paper. The authors derive corrective sample weights and use them to build a weighted stacked DID estimator. With those weights in place, the stacked regression does recover the trimmed aggregate ATT. They show that this can be implemented through weighted least squares, either in a simple DID setup or in an event-study specification. They also note that the same logic can be adapted if the researcher wants a different aggregate such as a population-weighted version or a sample-share-weighted version.</p><p>After that, the paper looks at stacked fixed-effects specifications that had already been used in applied work. This part is useful because many of you will have seen exactly those regressions in practice. The authors show that these fixed-effects versions run into the same basic problem: in general, the unweighted specification doesn&#8217;t identify the target aggregate either. Once the corrective weights are added, that problem is resolved and the fixed effects themselves become unnecessary for identification. One of the paper&#8217;s broader points is that a simpler weighted event-study specification is enough.</p><p>The final step is inference. Because stacked datasets can reuse the same groups across multiple sub-experiments, the paper discusses how standard errors should be handled and compares clustering approaches. 
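</p><p><em>(Not from the paper - just to fix ideas, here is a minimal Python sketch of the stacked workflow: build one sub-experiment per cohort with clean controls, stack them, and run weighted least squares with clustered standard errors. Everything below is toy data with made-up names, and the <code>stack_weight</code> column is a placeholder; the actual corrective weights come from the paper and the repo linked under &#8220;Do we have code?&#8221; below.)</em></p><pre><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def build_sub_experiment(panel, cohort, window=(-3, 3)):
    """One sub-experiment: the cohort first treated in `cohort`, plus 'clean'
    controls that are still untreated through the end of the event window."""
    lo, hi = cohort + window[0], cohort + window[1]
    df = panel[panel["year"].between(lo, hi)].copy()
    keep = (df["first_treat"] == cohort) | df["first_treat"].isna() | (df["first_treat"] > hi)
    df = df[keep].copy()
    df["treat"] = (df["first_treat"] == cohort).astype(int)
    df["post"] = (df["year"] >= cohort).astype(int)
    df["sub_exp"] = cohort
    return df

# Toy panel: 40 groups, 2000-2012, cohorts adopt in 2005/2007/2009 or never.
rng = np.random.default_rng(0)
panel = pd.DataFrame([(g, t) for g in range(40) for t in range(2000, 2013)],
                     columns=["group", "year"])
panel["first_treat"] = panel["group"].map({g: [2005, 2007, 2009, np.nan][g % 4] for g in range(40)})
panel["y"] = rng.normal(size=len(panel)) + 0.5 * (panel["year"] >= panel["first_treat"])

# Stack one sub-experiment per (trimmed) adoption cohort into a single dataset.
stacked = pd.concat([build_sub_experiment(panel, c) for c in (2005, 2007, 2009)],
                    ignore_index=True)
stacked["stack_weight"] = 1.0   # placeholder: the corrective weights come from the paper/repo

# Weighted least squares, clustered at the group-by-sub-experiment level.
stacked["cluster_id"] = stacked["group"].astype(str) + "_" + stacked["sub_exp"].astype(str)
fit = smf.wls("y ~ treat * post", data=stacked, weights=stacked["stack_weight"]).fit(
    cov_type="cluster", cov_kwds={"groups": stacked["cluster_id"]})
print(fit.params["treat:post"], fit.bse["treat:post"])
</code></pre><p>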
The authors run a small Monte Carlo exercise and find that clustering at the group level and clustering at the group-by-sub-experiment level both perform reasonably well when the number of clusters is not too small. So the paper is not only about identification but also gives us some guidance on implementation, which is always welcome.</p><p><em><strong>Why is this important?</strong></em></p><p>This paper is important because stacked DiD was already being used in applied work yet the literature hadn&#8217;t clearly established what the standard stacked regression identified. That is a problem if researchers are treating the coefficient from a stacked regression as a causal average without being able to say what average it is (isn&#8217;t this always a problem?). One of the paper&#8217;s main contributions is to show that the usual unweighted stacked regression doesn&#8217;t, in general, identify the target aggregate, or any other convex combination of causal effects.</p><p>It&#8217;s also important because the paper deals directly with a common practical use of event studies. We often want to look at treatment effects over time and use the pre-period to check for possible violations of the DiD assumptions. The authors show why that can go wrong if the composition of the sample changes across event time. Their trimmed aggregate ATT is built to avoid that problem so that changes in the event-study line reflect treatment dynamics in the post-period or differential pre-trends in the pre-period, rather than different cohorts entering and leaving the average at different horizons.</p><p>There is also a practical gain for applied researchers. The paper&#8217;s weighted stacked approach is still regression-based, which makes it relatively easy to implement and explain. The authors are quite explicit on this point. They argue that stacked estimation is attractive for applied work because it&#8217;s regression-based, keeps attention on the underlying research design and doesn&#8217;t rely on extra modelling assumptions beyond the standard DiD setup. They also show that the weighted estimator gives coefficients that correspond to a well-defined average of group-time ATT parameters, which gives us a clearer rationale for using the method.</p><p>Finally, this paper is useful because it speaks directly to what people were already doing. The authors discuss earlier applied papers using stacked DiD and show that the stacked fixed-effects versions used there run into the same general identification problem when left unweighted. They show that a simpler weighted event-study specification is sufficient to identify the trimmed aggregate ATT. By contrast, the more complicated fixed-effects setup does not solve the core issue on its own. For readers who may want to use stacked DiD in their own work, that is a very practical takeaway.</p><p><em><strong>Who should care?</strong></em></p><p>Applied researchers may want to use stacked DiD in staggered adoption settings, where different units are treated at different times and the aim is to compare each treated cohort with clean controls that remain untreated over the relevant event window. This can come up in labour, health, public, education and other applied fields where policies are rolled out in stages across places or organisations.</p><p><em><strong>Do we have code?</strong></em></p><p>Yes. 
The authors provide a <a href="https://github.com/hollina/stacked-did-weights">GitHub repo</a> with example code for calculating and implementing the weights used in their stacked DiD approach.</p><p>In summary, this paper gives stacked DiD a much clearer foundation. Wing, Freedman and Hollingsworth show that the standard unweighted stacked regression doesn&#8217;t have a clear causal interpretation in staggered adoption settings even when the underlying sub-experiments satisfy the usual DiD assumptions. They then define a target parameter, the trimmed aggregate ATT, and show that a weighted stacked estimator can recover it. For applied researchers, the takeaway is fairly straightforward: stacking the data is only part of the job. You also need to think really carefully about which cohorts enter the sample, which units count as clean controls and how the regression weights are constructed. That makes this a very useful paper for anyone who may want to use stacked DiD in practice, or who wants a clearer link between the regression they run and the parameter they claim to estimate.</p><div><hr></div><h3>Event-Study Designs for Discrete Outcomes under Transition Independence</h3><p><em>(<a href="https://youngahn.com/">Young Ahn </a>is a 5th-year Ph.D. Candidate at UPenn, but he will soon start as a postdoc at Brown. All the best!)</em></p><h5>TL;DR: this paper asks how to do DiD or event-study analysis when the outcome is discrete, such as employment status, complaints, or patenting. The authors argue that standard PT can be a poor fit in these settings, so they propose transition independence and then add a latent-type Markov structure to handle unobserved heterogeneity and short panels. In their applications, this leads to estimates that differ markedly from conventional DiD.</h5><p><em><strong>What is this paper about?</strong></em></p><p>This paper is trying to solve a basic identification problem in DiD with discrete outcomes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Standard DiD relies on PT, but that logic can be a poor fit when outcomes are bounded and evolve through transitions across categories<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. The authors&#8217; point is that this isn&#8217;t a minor technical issue. With discrete outcomes, differences in baseline distributions can generate mean reversion even in the absence of treatment, binary outcomes can imply impossible counterfactual probabilities and for multi-category outcomes the notion of a single trend becomes hard to pin down. The paper proposes transition independence as an alternative identification strategy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, and then adds a latent-type Markov structure to deal with unobserved heterogeneity and short panels<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. The broader aim is to give researchers a way to do event-study or DiD analysis with discrete outcomes without leaning on a PTA that may be &#8220;incoherent&#8221; in this setting.</p><p><em><strong>What do the authors do?</strong></em></p><p>They set up a potential-outcomes framework for discrete outcomes and show how the ATT can be identified under transition independence. 
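</p><p><em>(A toy illustration of what that means for a binary outcome, with simulated data and made-up column names - not the authors&#8217; estimator, which also handles multi-category outcomes, longer pre-treatment histories and latent types.)</em></p><pre><code>import numpy as np
import pandas as pd

# Toy two-period data: columns treated (0/1), y_pre, y_post, all binary.
rng = np.random.default_rng(1)
n = 2000
treated = rng.integers(0, 2, n)
y_pre = rng.binomial(1, 0.4 + 0.2 * treated)                  # different baselines
y_post = rng.binomial(1, 0.3 + 0.4 * y_pre + 0.05 * treated)  # persistence + small effect
df = pd.DataFrame({"treated": treated, "y_pre": y_pre, "y_post": y_post})

ctrl, trt = df[df["treated"] == 0], df[df["treated"] == 1]

# Control-group transition probabilities P(y_post = 1 | y_pre = s).
p1_given = ctrl.groupby("y_pre")["y_post"].mean()

# Counterfactual: apply the control transitions to the treated group's own baseline mix.
pre_shares = trt["y_pre"].value_counts(normalize=True)
cf_rate = sum(pre_shares[s] * p1_given[s] for s in pre_shares.index)
att_transition = trt["y_post"].mean() - cf_rate

# Mean-trend DiD counterfactual, for comparison.
att_did = (trt["y_post"].mean() - trt["y_pre"].mean()) - (ctrl["y_post"].mean() - ctrl["y_pre"].mean())
print(att_transition, att_did)
</code></pre><p>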
The basic idea is to construct the treated group&#8217;s counterfactual using control-group transition probabilities rather than mean outcome trends. They then relate this to conventional DiD, show that transition independence is equivalent to a version of conditional PT based on the full pre-treatment outcome history, and derive the bias of DiD under their framework. To make the approach workable in practice, they introduce a latent-type Markov structure, establish identification of latent-type-specific treatment effects and the overall ATT, and then develop an estimator. Finally, they apply the method to the Dodd-Frank Act, Norway&#8217;s patent reform, and the ADA. In all three cases their results differ substantially from conventional DiD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p><p><em><strong>Why is this important?</strong></em></p><p>This paper is important because discrete outcomes are everywhere in applied work, yet DiD is still often used as though the usual PT logic carries over without much trouble. The authors show that this can lead to badly misspecified counterfactuals and, in turn, misleading treatment effects. Their alternative gives researchers a way to build counterfactuals that respects the bounded, transition-based nature of discrete outcomes. The applications make the stakes clear: in one case DiD implies complaint rates below zero, in another it overstates a negative effect, and in the ADA application the transition-based framework also shows which channels are driving the employment effect. So the paper is useful both methodologically and empirically, because it presents a different identification strategy for a common class of outcomes and shows that the choice of strategy can change the substantive conclusion.</p><p><em><strong>Who should care?</strong></em></p><p>This paper should be useful for researchers working with outcomes like employment status, unemployment, labour-force participation, occupational choice, complaint filing, patenting, disability status, or other bounded categorical variables. It is directly relevant in settings where the outcome moves across states over time and researchers might still be tempted to use a standard DiD or event-study design on binarised outcomes. </p><p><em><strong>Do we have code?</strong></em></p><p>Yes. The authors provide a standalone R package with replication codes, available as <a href="https://github.com/bayesiahn/ak">bayesiahn/ak</a>.</p><p>In summary, Ahn and Kasahara argue that standard DiD can be a poor fit for discrete outcomes because the counterfactual is built from mean trends in settings where outcomes are bounded and evolve through transitions across states. Their solution is to replace PT with transition independence and then use a latent-type Markov structure to make the framework workable with unobserved heterogeneity and short panels. The paper gives applied researchers a different way to study treatment effects with discrete outcomes, and the empirical applications show that it can lead to very different conclusions from conventional DiD.</p><div><hr></div><h3>Inference for Matched Difference-in-Differences</h3><h5>TL;DR: this paper studies inference in matched DiD and shows that standard variance estimators can be too conservative once matching induces dependence within treated-control pairs. 
Kim and Park propose a projection-based variance estimator that removes the variation explained by the matching covariates and targets the correct design-based variance instead. In their simulations it performs much better than standard unit-level or pair-level procedures in informative matching designs - but watch out: when matching is uninformative (R&#178;_Z &#8776; 0), the projection step overcorrects and the estimator undershoots, so standard pair-cluster inference is the safer default in that setting.</h5><p><em><strong>What is this paper about?</strong></em></p><p>This paper is about inference in matched DiD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. Kim and Park start from a simple point: matching is often used in DiD to improve covariate balance, but once we match treated units to controls, the design itself creates dependence within matched pairs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. Standard variance estimators usually ignore that feature, so they target the wrong variance. In this setting, the consequence is systematic overestimation of standard errors<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. The paper&#8217;s goal is to fix that. More specifically, the authors study the variance implied by the matched design and show why standard unit-level and cluster-level inference can be too conservative once matching is built into the estimator. They then propose a projection-based variance estimator that removes the component of variation explained by the matching covariates and is designed to recover the relevant design-based variance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>. The broader aim is to show that, in matched DiD, getting the point estimate is only part of the job. You also need a standard error that reflects the dependence structure created by the matching step.</p><p><em><strong>What do the authors do?</strong></em></p><p>Kim and Park begin by setting up a matched DiD framework in which each treated unit is matched one-to-one, without replacement, to a control unit using pre-treatment or time-invariant covariates. They then write the matched DiD estimator in an asymptotic linear form and use that setup to show what variance target the matched design implies. The next step is to explain why standard inference goes wrong. The paper shows that matching creates positive covariance within matched pairs because treated units and their matched controls share a common component linked to the matching covariates. Once the estimator is differenced, that covariance reduces the true variance. Standard unit-level CRVE<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> ignores this term, so it overstates uncertainty, and pair-level clustering only partly fixes the problem because it still doesn&#8217;t use the covariate structure that generated the dependence.</p><p>Their solution is a projection-based variance estimator. 
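</p><p><em>(A rough sketch of the objects involved, on simulated matched pairs with made-up names - the exact variance formula is in the paper.)</em></p><pre><code>import numpy as np
import statsmodels.api as sm

# Toy matched sample: one row per pair; Z are the matching covariates, and the
# treated/control first-differenced outcomes share a Z-driven component.
rng = np.random.default_rng(2)
n = 500
Z = rng.normal(size=(n, 2))
shared = Z @ np.array([0.6, -0.4])
dY_t = shared + rng.normal(scale=0.5, size=n) + 0.3   # true effect 0.3
dY_c = shared + rng.normal(scale=0.5, size=n)

d = dY_t - dY_c                   # pair-level DiD contrasts
att = d.mean()
V_pair = d.var(ddof=1) / n        # standard pair-cluster variance of the ATT

# Within-pair sum of first-differenced residuals, projected onto Z.
S = (dY_t - dY_t.mean()) + (dY_c - dY_c.mean())
r2_z = sm.OLS(S, sm.add_constant(Z)).fit().rsquared

# The projection-based estimator strips out the Z-explained part, shrinking the
# reported variance relative to V_pair as r2_z grows (exact formula in the paper;
# as the TL;DR warns, with r2_z near zero pair-cluster inference is the safer default).
print(att, V_pair, r2_z)
</code></pre><p>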
The basic idea is to take the within-pair sum of first-differenced residuals, project it onto the matching covariates and remove the variation explained by those covariates. This yields an adjusted variance estimator designed to recover the design-based variance implied by the matched sample. They also show that the estimator can be written as a scaled version of the pair-cluster variance estimator where the scaling factor depends on how much of the within-pair variation is explained by the matching covariates.</p><p>They then study finite-sample performance in Monte Carlo simulations. The results show that standard unit-level and pair-level variance estimators are too conservative in informative matching designs, while the projection-based estimator comes much closer to the Monte Carlo benchmark.</p><p><em><strong>Why is this important?</strong></em></p><p>This paper is important because matched DiD is widely used to improve comparability between treated and control units, yet the matching step also changes the dependence structure of the estimator. Kim and Park show that standard variance estimators can then become systematically too conservative because they miss the variance reduction created by matching-induced covariance within pairs. It is also important because the paper offers a very concrete correction. The projection-based estimator adjusts inference using the same matching covariates that created the dependence in the first place. The simulation results show that this performs much better than standard unit-level or pair-level procedures in informative matching designs.</p><p><em><strong>Who should care?</strong></em></p><p>This paper should be useful for researchers using matched DiD to improve covariate balance before estimation. It&#8217;s really relevant in settings where one-to-one matching is used and inference is reported with standard cluster-robust standard errors without much discussion of what variance is being estimated<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>. More broadly, it should interest applied researchers who use matching and then run DiD as though the inference problem were unchanged by the design step.</p><p><em><strong>Do we have code?</strong></em></p><p>Not yet. Will update if this changes.</p><p>In summary, Kim and Park show that, in matched DiD, the matching step changes the variance structure of the estimator in a way that standard inference usually ignores. Because matching induces positive covariance within matched pairs, standard unit-level and pair-level variance estimators can become too conservative. Their solution is a projection-based variance estimator that adjusts for the dependence created by the matching covariates. The paper&#8217;s main message is simple: once matching is part of the design, the standard errors need to reflect that design too.</p><div><hr></div><h3>Parallel Trends Forest: Data-Driven Control Sample Selection in Difference-in-Differences</h3><h5>TL;DR: this paper proposes parallel trends forest, a machine-learning method for choosing control units in DiD when treatment assignment has little randomness and there are many plausible controls. 
Instead of fixing the control sample by hand, it uses pre-treatment data to build a weighted control for each treated unit that moves as closely in parallel as possible.</h5><p><em><strong>What is this paper about?</strong></em></p><p>This paper is trying to solve a very practical problem in DiD: when treatment assignment has little randomness and there are many possible control units, how do we choose the control sample in a disciplined way? The authors&#8217; answer is parallel trends forest, a data-driven method that uses pre-treatment data and machine learning to construct, for each treated unit, an optimal weighted control sample<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> that moves as closely in parallel as possible before treatment<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>. The broader concern is familiar. In many DiD applications, we don&#8217;t have random treatment assignment, so we choose a control group that looks &#8220;reasonable&#8221;, inspect the pre-treatment series, and hope the PTA is credible. Huh and Kling are trying to improve on that workflow. Their method uses random forests<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a> to search over a large set of covariates and control units rather than relying on researcher judgment alone or on a standard TWFE regression with a fixed control sample<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a>.</p><p>The paper is also aimed at a specific type of setting: relatively long panels with noisy, granular and sometimes non-normal data. The authors argue that existing methods such as SC, synthetic DiD and matrix completion can struggle there, either because they fit poorly, overfit in-sample<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a> or don&#8217;t adapt well to a large and chaotic donor pool<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a> (which is true). Parallel trends forest is meant to work better in exactly those cases. The main goal is to give us a more reliable way to build a control sample before running DiD, especially when the design is useful but far from clean.</p><p><em><strong>What do the authors do?</strong></em></p><p>They start by formalising the problem: for each treated unit, they want to find a weighted combination of control units that would have moved in parallel with it in the absence of treatment. That means the method constructs an optimal weighted control for each treated unit using pre-treatment data.</p><p>They then define a new measure of deviation from parallel trends<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a>. Instead of relying on the usual sum of squared errors, they use a placebo-style criterion based on how large the estimated treatment effect would be if treatment were falsely assigned at different dates in the pre-treatment period. This is meant to work better with noisy, granular and non-normal data.</p><p>The next step is the machine-learning part. 
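</p><p><em>(Before the trees, here is a toy version of that placebo-style criterion - my own illustration with made-up names, not the authors&#8217; exact objective: falsely assign treatment at each pre-treatment date and record how large the spurious &#8220;effect&#8221; would be.)</em></p><pre><code>import numpy as np
import pandas as pd

def placebo_deviation(y_treated: pd.Series, y_control: pd.Series) -> float:
    """Average absolute placebo 'effect' over all fake treatment dates in the
    pre-period; larger values mean a worse parallel-trends fit."""
    gap = (y_treated - y_control).to_numpy()
    devs = [abs(gap[k:].mean() - gap[:k].mean()) for k in range(1, len(gap))]
    return float(np.mean(devs))

# Toy check: a control that tracks the treated series scores lower than one that doesn't.
rng = np.random.default_rng(3)
dates = pd.period_range("2015-01", periods=24, freq="M")
y_treated = pd.Series(rng.normal(size=24).cumsum(), index=dates)
y_close = y_treated + rng.normal(scale=0.2, size=24)
y_far = pd.Series(rng.normal(size=24).cumsum(), index=dates)
print(placebo_deviation(y_treated, y_close), placebo_deviation(y_treated, y_far))
</code></pre><p>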
They build trees using this deviation measure, split units only along the unit dimension and then combine many trees into a parallel trends forest. From that forest, they recover weights for each treated-control pair based on how often they end up in the same leaf. Those weights are then used to construct the counterfactual for each treated unit and, from there, the ATT.</p><p>After laying out the method, they compare it to SC, synthetic DiD and matrix completion using a placebo exercise and Monte Carlo simulations. In their setting, they find that the parallel trends forest tracks the treated sample more closely and performs better than the existing methods. They then apply the method to the rollout of post-trade transparency in the corporate bond market. They use it to study the effect of TRACE Phase 2 on bond turnover, compare the results to TWFE estimates using several different control samples and then also show how the method can be combined with the Rambachan-Roth framework to allow for bounded deviations from parallel trends<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a>. They also introduce an &#8220;honest&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a> version of the forest to reduce overfitting.</p><p><em><strong>Why is this important?</strong></em></p><p>A few reasons. This paper is important because control-sample choice is often one of the weakest parts of applied DiD. In many settings we don&#8217;t have much randomness in treatment assignment, so the credibility of the design depends heavily on which controls are chosen. Huh and Kling turn that choice into an explicit data-driven step rather than leaving it to researcher judgment alone. It is also important because the paper is aimed at a setting where many existing methods struggle: long panels with many candidate controls, noisy outcomes and granular data. The empirical application makes the point clearly. In their TRACE setting, different &#8220;reasonable&#8221; control samples produce quite different TWFE estimates, sometimes even with different signs. Their method gives a smaller estimate of the effect of transparency on turnover, and once they allow for bounded deviations from parallel trends, the effect is not statistically significant. </p><p><em><strong>Who should care?</strong></em></p><p>Anyone working in settings where there are many plausible controls, little randomness in treatment assignment and enough pre-treatment data to compare how units move over time. That makes it relevant for work on phased policy rollouts, regulatory changes, market-design reforms and other settings where the main practical difficulty is choosing a credible control sample rather than writing down the DiD regression itself.</p><p><em><strong>Do we have code?</strong></em></p><p>Not yet. Will update if anything changes. The method is described well enough to attempt a reconstruction, but not enough to call it fully reproducible out of the box.</p><p>In summary, Huh and Kling take one of the messiest parts of applied DiD - choosing the control sample - and turn it into an explicit data-driven step. Their method is built for settings with long panels, large donor pools and noisy data, where standard tools can fit badly or become very sensitive to which controls the researcher picked.
The broader contribution is practical: rather than asking readers to accept a &#8220;reasonable&#8221; control group on faith, the paper offers a structured way to search for one using the pre-treatment data itself.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In stacked DiD, we construct a separate sub-experiment for each treatment cohort using that cohort as the treated group and a set of clean controls, then pool those sub-experiments into one stacked dataset for estimation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The appeal of this approach is that it avoids the late-versus-early treated comparisons that created problems for conventional TWFE in staggered adoption designs under treatment effect heterogeneity. In the paper the authors note that versions of stacked DiD were already being used in applied work before this formal treatment.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Plural because there are 3: the basic unweighted stacked estimator, the weighted stacked estimator they propose, and the stacked fixed-effects specification used in earlier applied work, which they also evaluate.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The trimmed aggregate ATT is a weighted average of cohort-specific treatment effects at each event time in a staggered adoption design, computed after trimming the sample so that the same cohorts contribute across the chosen event window. This keeps the event-study graph from being driven by cohorts entering and leaving the average at different horizons.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>A pooled stacked regression is just the regression you run after you have built all the sub-experiments and stacked them into one dataset, so instead of estimating a separate DiD for each treatment cohort, you create a sub-experiment for each valid adoption cohort, then append those sub-experiments into one stacked file, and finally run one regression on the pooled stacked dataset.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>By &#8220;discrete outcomes&#8221; the authors mean outcomes that take a limited set of categories, such as employed, unemployed or out of the labour force, or which occupation a worker is in.
These are settings where we often still use DiD, usually by turning categories into binary indicators.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The paper argues that PT can fail in discrete settings for three reasons: mean reversion can create different trends when groups start from different baseline distributions, binary outcomes can imply impossible counterfactual probabilities (P &lt; 0 or P &gt; 1), and for multi-category outcomes the notion of a single &#8220;trend&#8221; isn&#8217;t well defined.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Transition independence means that, absent treatment, treated and control units with the same pre-treatment outcome path would have the same transition dynamics. So the counterfactual is built from transition probabilities rather than from average outcome trends.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>To allow for unobserved heterogeneity, the authors assume that units belong to latent types with different transition dynamics, and that outcomes follow type-specific Markov processes. This is what lets them recover both latent-type-specific and aggregate treatment effects from short panels.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Funnily enough, it&#8217;s all due to the identification strategy :) The two methods are building different counterfactuals, and in these applications the usual DiD counterfactual is a poor fit for the DGP. The paper&#8217;s general argument is that with discrete outcomes, conventional DiD can go wrong for 3 main reasons: baseline differences create mean reversion, binary outcomes can generate impossible probabilities, and multi-category outcomes are driven by transitions across states rather than a single smooth trend. In the 3 applications, each one highlights a different version of that problem. In the Dodd-Frank case, the DiD counterfactual complaint rate falls below zero, so the estimates differ because DiD is extrapolating linearly in a setting where probabilities are bounded. Their transition-based method avoids that by construction since it works with transition probabilities rather than mean trends. In the Norway patent reform case, the treated group starts with a much higher patenting rate than the control group, so there is more room for patenting to fall &#8220;mechanically&#8221;. DiD reads that decline as treatment, while the authors argue that much of it is mean reversion. Once they model the transition dynamics directly, the estimated effect moves much closer to zero. In the ADA case, the issue is that level comparisons hide what is happening in the underlying transitions.
The paper says DiD misses statistically significant employment effects because pre-treatment level differences mask the transition dynamics, whereas their method can trace flows across labour-force states and shows that the negative employment effect is operating through specific exit channels (mainly transitions from employment into out-of-labour-force status). The estimates are so different because the paper replaces the identification strategy. DiD uses PT in outcome levels. Their method uses transition independence and constructs counterfactuals from transition probabilities conditional on pre-treatment histories. When the DiD counterfactual is badly misspecified, the resulting treatment effects can move a lot, and sometimes even flip sign.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Matched DiD combines matching with a DiD setup: treated units are first matched to similar control units using observed covariates and the DiD estimator is then computed on the matched sample. The attraction is better covariate balance and - in principle! - a more credible comparison group.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>The key point is that matched treated and control units are deliberately chosen to be similar on the matching covariates, which then creates positive correlation within matched pairs since both units share a common component related to those covariates.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Here &#8220;conservative&#8221; means the standard errors are too large. The paper&#8217;s argument is that standard variance estimators miss the variance reduction coming from the positive covariance within matched pairs, so they systematically overstate uncertainty.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p><a href="https://onlinelibrary.wiley.com/doi/abs/10.3982/ecta6474?casa_token=8mZ0UHMNFJQAAAAA%3AYxdELHf72yEFPIFGKh9i3tDLDsvaGHOSKrpioUwgDzBJpxv1TLFlrSvNTtOwzLxmESXq6l6qmljOT0BG">Abadie and Imbens (2008)</a> identified the same mechanism in paired experiments and proposed a resampling-based correction. Kim and Park&#8217;s contribution is a closed-form fix that targets the same problem specifically in matched DiD designs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>The proposed estimator works by projecting the within-pair error sum onto the matching covariates and removing the part explained by them. 
Intuitively, this filters out the variation induced by the matching design and leaves the idiosyncratic component relevant for variance estimation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Standard unit-level CRVE means the usual cluster-robust variance estimator applied at the unit level, treating each unit as independent from the others and allowing correlation only within the unit over time. In this paper the problem is that this estimator ignores the extra dependence created by the matching step, so it misses the covariance within matched pairs and ends up overstating uncertainty.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>The framing here draws on prof Wooldridge (2023)&#8217;s &#8220;<a href="https://id.elsevier.com/as/authorization.oauth2?platSite=SD%2Fscience&amp;additionalPlatSites=GH%2Fgeneralhospital%2CLS%2FLS%2CMDY%2Fmendeley%2CSC%2Fscopus%2CRX%2Freaxys&amp;scope=openid%20email%20profile%20els_auth_info%20els_idp_info%20els_idp_analytics_attrs%20urn%3Acom%3Aelsevier%3Aidp%3Apolicy%3Aproduct%3Ainst_assoc&amp;response_type=code&amp;redirect_uri=https%3A%2F%2Fwww.sciencedirect.com%2Fuser%2Fidentity%2Flanding&amp;authType=SINGLE_SIGN_IN&amp;prompt=none&amp;client_id=SDFE-v4&amp;state=retryCounter%3D0%26csrfToken%3Ddabbbca6-c4bd-4cad-a431-3be2152925b1%26idpPolicy%3Durn%253Acom%253Aelsevier%253Aidp%253Apolicy%253Aproduct%253Ainst_assoc%26returnUrl%3D%252Fscience%252Farticle%252Fabs%252Fpii%252FS0304407623002336%26prompt%3Dnone%26cid%3Datn-1800ecb4-516c-4a72-9a3e-ae52a1bd5076">What is a standard error?</a>&#8221;, which argues that the validity of any standard error depends on correctly specifying the variance target relative to the underlying source of randomness. Worth reading alongside this paper.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Instead of picking one control group for the whole treatment sample, the method builds a weighted control for each treated unit. The weights are based on which control units end up in the same leaves as the treated unit across many trees in the forest.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>Parallel trends forest is a machine-learning method that builds an optimal control sample for each treated unit using only pre-treatment data. It assigns weights to control units so that the weighted control outcome moves as closely in parallel as possible with the treated unit before treatment.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Random forests let the method work with many covariates and many potential control units. 
They also automate covariate selection so the algorithm can use the variables that really help predict which units move in parallel instead of relying only on what the researcher guessed would matter.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>The paper is motivated by settings with little or no randomness in treatment assignment. In those cases, we often choose controls by hand, compare pre-treatment trends visually, and then run DiD. Huh and Kling are trying to replace that partly <em>ad hoc</em> step with a data-driven selection rule. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>Synthetic control actually does the opposite of fitting poorly, it overfits in-sample very closely but then performs badly out-of-sample. Synthetic DiD and matrix completion fail for a different reason: they assign near-equal weights across the whole control pool rather than upweighting the units that actually track the treated series.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>In the paper the authors include earlier-treated units - bonds that became transparent before the sample window - in the donor pool. They argue these would organically receive near-zero weight if their trends have already diverged from the newly-treated bonds by the start of the sample period, which serves as a data-driven answer to the Goodman-Bacon contamination concern.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>The paper doesn&#8217;t use the usual sum of squared pre-treatment errors as its objective. Instead, it defines a placebo-style measure based on how large the estimated treatment effect would be if treatment were assigned at different dates in the pre-treatment period.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>The paper doesn&#8217;t derive analytical standard errors for the parallel trends forest estimator, and this is explicitly flagged as outside scope. The convergence analysis (Figure 8 in the paper) shows the weight estimates stabilise quickly across trees, but that addresses estimation variance, not inferential uncertainty more broadly. Do keep this in mind before reaching for the method.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>The &#8220;honest&#8221; version of the forest follows the logic in Wager and Athey: one sample is used to grow the tree and another is used to estimate the weights. 
The aim is to reduce overfitting to pre-treatment noise.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The outcomes you measure and the assumptions holding them up]]></title><description><![CDATA[New papers on compositional shifts, coarsened outcomes, pre-trend testing and event-study identification]]></description><link>https://www.diddigest.xyz/p/the-outcomes-you-measure-and-the</link><guid isPermaLink="false">https://www.diddigest.xyz/p/the-outcomes-you-measure-and-the</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 12 Jan 2026 16:16:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kVgJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kVgJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kVgJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!kVgJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!kVgJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!kVgJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kVgJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3291361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.diddigest.xyz/i/177726641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kVgJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!kVgJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!kVgJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!kVgJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500aa814-eb20-4c89-b7ca-d755a20dfe70_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! </p><p>Thank you (all 1553 of you) for being here for one year now :)</p><p>I wanted to close 2025 with a &#8220;banger&#8221; - meaning lots of banger papers - but I got distracted by the only 2 weeks I have of free-time in the year. Apologies. Also I am set to graduate in May while also teaching a course this term, so probably the rate of posts will come down just a bit, but after May it should resume to normal.</p><p>Here&#8217;s to many more posts in 2026 :)</p><p>Also before we get to the papers, I&#8217;d like to thank my friend <a href="https://samenright.substack.com/">Sam Enright</a> who picked 3 posts from this newsletter for <a href="https://www.thefitzwilliam.com/p/join-the-fitzwilliam-reading-group">The Fitzwilliam Reading Group</a> monthly meetup a couple of weeks ago. Special thanks to those who attended and who are reading this now :)</p><p>As always, please let me know if you would like for me to cover a specific paper. I will do my best to go through it.</p><p>Today&#8217;s papers are:</p><ul><li><p><a href="https://arxiv.org/abs/2304.13925">Difference-in-Differences with Compositional Changes</a>, by<strong> </strong>Pedro H. C. 
Sant&#8217;Anna and Qi Xu</p></li><li><p><a href="https://arxiv.org/abs/2512.08759">Difference-in-Differences with Interval Data</a>, by<strong> </strong>Daisuke Kurisu, Yuta Okamoto and Taisuke Otsu</p></li><li><p><a href="https://arxiv.org/abs/2310.15796">Testing for equivalence of pre-trends in Difference-in-Differences estimation</a>, by<strong> </strong>Holger Dette and Martin Schumann</p></li><li><p><a href="https://www.nber.org/system/files/working_papers/w34550/w34550.pdf">Harvesting Difference-in-Differences and Event-Study evidence</a>, by Alberto Abadie, Joshua Angrist, Brigham Frandsen and J&#246;rn-Steffen Pischke</p></li></ul><p>Honorable mention:</p><ul><li><p><a href="https://arxiv.org/abs/2511.05870">Synthetic Parallel Trends</a>, by Yiqi Liu</p></li></ul><div><hr></div><h3>Difference-in-Differences with Compositional Changes</h3><h5>TL;DR: standard DiD with repeated cross-sections relies on a strong stationarity assumption that often fails in practice. Sant&#8217;Anna and Xu show that the ATT remains identifiable when composition shifts, derive the correct efficiency bound and provide robust estimators and a diagnostic test to assess whether stationarity is empirically relevant. The paper formalises the bias&#8211;variance trade-off and gives applied researchers a clear framework for working with moving samples.</h5><p><em>What is this paper about?</em></p><p>This paper is about a quite strong and often unrealistic (in practice) assumption in DiD with repeated cross-sections<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>: no compositional changes over time, which means assuming that the treated and control groups come from the same underlying population before and after treatment. In many real applications this is straight implausible: people enter and leave samples, early adopters differ from late adopters, firms exit, migrants move, and survey frames rotate. This is the default setting for many labour, health, education and dev applications that rely on repeated survey data.</p><p>A classic example cited by the authors is the study of Napster&#8217;s effect on music sales. Over time, the composition of internet users changed substantially: early adopters were typically younger and wealthier while later adopters were more demographically diverse. If a researcher ignores these shifts, the negative effect of Napster on sales might be overestimated because the &#8220;post-Napster&#8221; group naturally includes more households with lower reservation prices for music, regardless of the technology&#8217;s impact.</p><p>Most DiD methods simply impose stationarity<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> and move on. Sant&#8217;Anna and Xu do the opposite. They study DiD when composition is allowed to change and show that the ATT remains identifiable under conditional PT<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. They develop a general framework that makes clear what is identified, at what cost in efficiency, and how standard approaches break down when compositional changes are ignored.</p><p><em>What do the authors do?</em></p><p>They start by dropping the usual stationarity assumption and allowing the joint distribution of <strong>(D,X)</strong> to change over time. 
In this setup, they show that the ATT is still identified under conditional PT, no anticipation, and overlap. However, the &#8220;math&#8221; changes: because you can no longer assume the population is stable, you have to track four distinct groups (treated/control in both periods) using a generalized propensity score. </p><p>They derive the efficient influence function (EIF) and efficiency bound<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> (essentially the &#8220;gold standard&#8221; for precision) for the ATT when composition is allowed to shift, and build nonparametric doubly robust<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> estimators that remain valid under these compositional shifts. Unlike classical DR, which relies on getting one of two models &#8220;correct&#8221;, their estimators enjoy a rate doubly robust property. This means your estimate is reliable even if one part of your model (like the outcome regression) is very complex, provided the other part (the propensity score) is simple enough to compensate. They prove that the widely used <a href="https://www.sciencedirect.com/science/article/pii/S0304407620301901">Sant&#8217;Anna&#8211;Zhao (2020)</a> estimator can be biased because it wasn&#8217;t designed to handle these moving populations.</p><p>They then characterize the bias&#8211;variance trade-off. If you wrongly impose stationarity, you can get biased ATT estimates: you might be attributing a change in the data to the treatment when it&#8217;s actually just a result of the &#8220;mix&#8221; of people in your sample changing. However, robustness isn&#8217;t free. If stationarity really holds but you choose to ignore it and use the robust estimator, you lose efficiency (precision). Proposition 2.2 identifies three factors that determine how much precision you lose: the period ratio (the balance between the size of your pre-treatment and post-treatment samples); the group ratio (the proportion of comparison units relative to treated units); and treatment heterogeneity (aka how much the treatment effect varies across different types of people). In a world where everyone responds to a policy in exactly the same way (homogeneous effects), the robust estimator is just as precise as the standard one. In the real world, where effects usually vary, the test they propose becomes your essential &#8220;GPS&#8221; for navigating this trade-off. </p><p>To operationalise this, they propose a <a href="https://www.jstor.org/stable/1913827">Hausman-type test</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> that compares the robust estimator (valid under compositional changes) with the stationarity-based estimator. The idea is simple: if composition doesn&#8217;t matter for the ATT, the two should coincide; if they differ, stationarity is empirically relevant. This concentrates power exactly in the directions that matter for the target parameter.</p><p>Then they show how to implement all of this in practice. 
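</p><p>Before the implementation details, here is a toy numerical sketch of the Hausman-type comparison just described. It uses the generic Hausman logic with made-up estimates and standard errors; it is not the paper&#8217;s exact test statistic.</p><pre><code># Toy Hausman-style comparison between a composition-robust DiD estimate
# and a stationarity-based one. All numbers are made up for illustration.
from scipy import stats

att_robust, se_robust = 0.042, 0.015   # hypothetical robust estimate (allows composition shifts)
att_stat, se_stat = 0.021, 0.009       # hypothetical estimate that imposes stationarity

diff = att_robust - att_stat
var_diff = se_robust**2 - se_stat**2   # classic Hausman variance of the contrast
h_stat = diff**2 / var_diff            # ~ chi-squared with 1 df under the null
p_value = 1 - stats.chi2.cdf(h_stat, df=1)

print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
# A small p-value says the two estimates differ by more than sampling noise,
# i.e. stationarity is empirically relevant: stick with the robust estimator.
# A large p-value says the two estimands coincide and the more precise one is fine.
</code></pre><p>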
They use local polynomial methods for outcome regressions and local multinomial logit for generalized propensity scores, allow for both continuous and discrete covariates, discuss bandwidth selection, extend the framework to cross-fitting and mixed panel/cross-section designs, and illustrate the methods in simulations and an application.</p><p><em>Why is this important?</em></p><p>Many applied DiD papers rely on repeated cross-sections and proceed under an implicit stationarity assumption. This paper shows that this choice has real consequences for identification and precision. When composition changes, standard DiD estimators can be biased even if conditional PT holds. This affects the credibility of causal claims in a wide range of applications, as researchers might mistake a demographic shift for a policy effect.</p><p>The contribution here is diagnostic as much as methodological. Sant&#8217;Anna and Xu show exactly where the bias comes from, how large it can be and which assumptions remove it. This replaces hand-waving about &#8220;changing samples&#8221; with a precise link between assumptions, estimands and estimators.</p><p>The bias&#8211;variance trade-off is very important for practice. Imposing stationarity &#8220;buys&#8221; precision at the cost of bias when the assumption fails. Dropping it &#8220;buys&#8221; robustness at the cost of efficiency when the assumption holds. The paper formalises this trade-off and provides a test to decide which side you&#8217;re on. This then gives applied researchers a way to justify their design choices rather than rely on convention.</p><p>More broadly, the paper aligns DiD theory with how data are generated. Repeated cross-sections aren&#8217;t panels. Populations evolve. Treating moving samples as if they were fixed is convenient, but often inaccurate. By building the theory for the setting we actually face, this paper closes an important gap between methodological assumptions and empirical practice.</p><p><em>Who should care?</em></p><p>Applied researchers using DiD with survey or administrative data should pay attention. This includes work in labour, health, education, development and public policy where repeated cross-sections are the norm. Those using repeated cross-sectional data from major national sources like the Current Population Survey (CPS) or the Consumer Expenditure Survey (CEX). If your identification strategy relies on treating moving samples as stable groups, this paper speaks directly to your design choices.</p><p>Policy evaluators working on national reforms, programme expansions or regulatory changes are another core audience. In many of these settings, the underlying population evolves at the same time as the policy environment. The framework here clarifies when standard DiD logic carries through and when different estimators are needed, which is directly relevant for government analysts and policy units producing causal evidence under real-world constraints.</p><p>Methodologically, this paper will be useful for researchers building or using modern DiD estimators. It tightens the link between assumptions, estimands and efficiency, and shows how robustness to compositional change interacts with precision. If you work with DR methods, semiparametric efficiency or ML in DiD, the technical contributions here are relevant.</p><p>Referees and editors should also care. Many papers implicitly rely on stationarity without stating it. 
This paper provides language and tools to assess that assumption rather than take it on faith which raises the bar for what we can reasonably claim in applied work.</p><p><em>Do we have code?</em></p><p>Not in the paper itself. While no dedicated software package is listed, the authors provide the exact mathematical estimators in Section 3. They recommend using local polynomial and multinomial logit tools, which are available in general-purpose R packages like <code>np</code> or <code>npsf</code>, and then combining them according to the ATT formula derived in the paper. Check <a href="https://psantanna.com/files/DiD_CC_Appendix.pdf">Supplemental Appendix</a> for implementation, where they give a practical recipe (local polynomial outcome regressions + local multinomial logit generalised propensity scores + cross-validation/bandwidth choices), but you&#8217;d code it yourself (or adapt existing nonparametric/kernel + multinomial logit tooling).</p><p>In summary, this great paper revisits DiD with repeated cross-sections. Sant&#8217;Anna and Xu show that the ATT remains identifiable under conditional PT, but the estimand, efficiency bound and appropriate estimator change once stationarity is dropped. They derive the efficient influence function, build rate DR estimators that remain valid under shifting composition and make the bias&#8211;variance trade-off explicit. The Hausman-type test gives applied researchers a way to assess whether stationarity is empirically relevant rather than assume it. Overall, the paper replaces a convenient convention with a clear design framework and provides tools that align DiD practice with how samples evolve in reality.</p><div><hr></div><h3>Difference-in-Differences with Interval Data</h3><p>(<a href="https://sites.google.com/view/yutaokamoto">Yuta</a> is a 3rd-year PhD student at the Graduate School of Economics, Kyoto University)</p><h5>TL;DR: traditional DiD assumes scalar (exact) outcomes, but real-world data is often &#8220;coarsened&#8221; into intervals through rounding or binned reporting. The authors demonstrate that instead of naively extending &#8220;parallel trends&#8221;, a &#8220;parallel shifts&#8221; assumption is necessary to ensure the resulting bounds are both mathematically valid and intuitively sensible<strong>.</strong> Through their reanalysis of the Card and Krueger (1994) study, they show that this new method provides a more disciplined and informative way to handle rounding and heaping in employment data.</h5><p><em>What is this paper about?</em></p><p>This paper is about a problem almost nobody talks about in DiD: what happens when your outcome isn&#8217;t a number, but an interval. In a lot of survey and administrative data, outcomes are reported in brackets, bins or rounded chunks. Income, e.g., is a range. Hours worked are heaped at multiples of five. Headcounts are approximate. Even when datasets look scalar, they often aren&#8217;t, because rounding, heaping and reporting conventions mean the &#8220;true&#8221; value sits somewhere inside an interval. In this paper, Kurisu, Okamoto, and Otsu tell us to reconsider our choices. They ask: how should we do difference-in-differences when outcomes are interval-valued rather than point-valued? 
And more importantly: what does &#8220;parallel trends&#8221; even mean in that world?</p><p>Their first result is quite uncomfortable: if you naively apply the PT assumption to interval data, you can get bounds that are either so wide they&#8217;re useless, or that move in directions that make no sense. You can satisfy &#8220;parallel trends&#8221; and still get counterintuitive or uninformative results.</p><p>So the paper&#8217;s core contribution is conceptual. It shows that extending PT to interval outcomes is not without a cost, and that you need a different way of thinking about how treated and control groups should evolve over time. The authors propose an alternative assumption - parallel shifts - that&#8217;s designed to respect how intervals move and scale, and to avoid the pathologies that show up under na&#239;ve extensions.</p><p><em>What do the authors do?</em></p><p>They set up DiD properly for interval data. Instead of observing a single outcome, each unit is observed as a lower and an upper bound. The target is still the ATT, but now both the treated mean and the counterfactual mean are only partially identified.</p><p>They then study three ways of bringing PT into this setting. First, they apply standard PT to the unobserved true outcome. This is the most direct extension of textbook DiD. It performs badly. The bounds are extremely wide and often dominated by worst-case combinations of lower and upper bounds across groups.</p><p>Second, they apply PT directly to the interval bounds, that is, they track how the control group&#8217;s lower and upper bounds change over time and project those changes onto the treated group. This looks intuitive, but it can behave in strange ways. You can end up with treated bounds moving in the opposite direction to control bounds, even though &#8220;parallel trends&#8221; is imposed.</p><p>They then propose a different assumption: parallel shifts. Instead of matching trends over time, it matches shifts between groups. The idea is that whatever mapping takes the control group&#8217;s interval to the treated group&#8217;s interval before treatment is assumed to also apply after treatment. Under this assumption, the bounds move in the same direction across groups, interval widths behave sensibly, and the identified sets remain well-behaved. They derive closed-form bounds for the ATT so the method is easy to implement.</p><p>Finally, they apply the framework to the C&amp;K minimum wage study, treating employment as interval-valued because of rounding and heaping, and show how the different assumptions lead to very different conclusions.</p><p><em>Why is this important?</em></p><p>Because the way outcomes are recorded affects what DiD can identify.</p><p>In many survey and administrative datasets, outcomes are reported in brackets, rounded, or subject to heaping. Income is often binned. Hours worked cluster at focal values. Headcounts are approximate. In these settings, the outcome is a range rather than a point.</p><p>The standard DiD framework treats outcomes as exact and applies PT to their means. This paper shows that once outcomes are interval-valued, that logic does not carry over in a clear way. Some natural ways of extending PT lead to bounds that are very wide or that move in ways that are difficult to justify, which has direct consequences for interpretation. With the same data, different extensions of PT can produce very different identified sets. 
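</p><p>A toy calculation makes the point. The numbers below are mine, and they only show the naive worst-case extension of PT to the latent outcome; the paper&#8217;s parallel-shifts bounds are tighter, and I won&#8217;t try to reproduce their formula here.</p><pre><code># Why naively extending parallel trends to interval outcomes gives very wide
# bounds. Each pair is a hypothetical sample mean of the observed lower and
# upper bounds for one group-period cell.
treated_pre = (18.0, 22.0)    # e.g. employment reported in bins
treated_post = (19.0, 23.0)
control_pre = (20.0, 24.0)
control_post = (21.0, 25.0)

# Under PT on the (unobserved) true outcome:
# ATT = E[Y_T,post] - (E[Y_T,pre] + E[Y_C,post] - E[Y_C,pre]).
# Worst-case interval arithmetic then gives:
att_lower = treated_post[0] - treated_pre[1] - control_post[1] + control_pre[0]
att_upper = treated_post[1] - treated_pre[0] - control_post[0] + control_pre[1]

print(f"naive PT bounds on the ATT: [{att_lower:.1f}, {att_upper:.1f}]")
# [-8.0, 8.0]: a 16-point-wide identified set even though every observed
# interval is only 4 points wide. This is the pathology the authors start from;
# parallel shifts is designed to shrink it.
</code></pre><p>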
The identifying assumptions are doing more work than is usually acknowledged.</p><p>The contribution of the paper is to make that structure explicit. It shows that transporting information from control to treated is not neutral when outcomes are coarsened, and that some ways of doing it behave better than others. The parallel shifts assumption is proposed as a disciplined way of mapping intervals across groups and over time.</p><p>More broadly, the paper sits in the partial identification space. Once outcomes are coarsened, point identification is no longer guaranteed. Bounds are often the appropriate object. That is less convenient, but it reflects the information in the data.</p><p><em>Who should care?</em></p><p>Researchers using DiD with survey or administrative data where outcomes are coarsened. This includes work on income, wages, hours worked, employment, firm size, time use, and test scores when these are reported in brackets, rounded, or subject to heaping. It is common in labour, health, education, and public policy applications.</p><p>It is also relevant for people working with repeated cross-sections. In these settings, binning and rounding are standard, and the choice to treat midpoints as exact values is widespread. This paper shows that those choices have identifying consequences.</p><p>More broadly, anyone interested in identification in DiD, rather than only estimation, will find this useful. The paper makes clear that different ways of extending PT are not equivalent once outcomes are interval-valued. It is also relevant for readers working on partial identification and bounded inference. The framework fits well in settings where point identification is not credible.</p><p><em>Do we have code?</em></p><p>The paper derives closed-form bounds, so implementation should be more or less straightforward. Everything is based on sample means of observed lower and upper bounds, with no optimisation or simulation required. Reproducing the results should be easy in R, Stata, or Python. The authors don&#8217;t provide replication code with the paper. The C&amp;K reanalysis uses standard data and simple transformations, so I think we could reconstruct it. If you are planning to use this in practice, you will need to code it yourself.</p><p>In summary, this paper asks what DiD is really doing when outcomes are interval-valued rather than exact. It shows that standard ways of extending PT can produce wide or poorly behaved bounds. The authors propose an alternative assumption, parallel shifts, that maps intervals across groups in a more disciplined way and leads to well-behaved identified sets. The contribution is conceptual. It makes clear that once outcomes are coarsened, identification depends heavily on how structure is imposed. Different choices lead to different answers. If your outcomes are rounded, binned, or heaped, this paper is directly relevant.</p><div><hr></div><h3>Testing for equivalence of pre-trends in Difference-in-Differences estimation</h3><h5>TL;DR: we test the wrong thing when we check pre-trends in DiD. The usual null is &#8220;no difference&#8221;. Failing to reject is then treated as evidence for PT. The authors say that logic is weak: it often means the data say nothing. In large samples it can reject for tiny, irrelevant differences. This paper proposes equivalence tests instead. You test whether pre-trend differences are small enough to ignore. The null is that they are too large. 
Rejection gives actual support for the identifying assumption.</h5><p><em>What is this paper about?</em></p><p>This paper is about how we test the PTA in DiD. In applied work, the standard approach is to test whether pre-treatment differences between treated and control groups are statistically zero. If you fail to reject, you proceed. The authors argue this logic is backwards. Failing to reject zero doesn&#8217;t mean trends are similar. It often means the test has low power. In large samples, the same tests can reject for tiny, irrelevant differences. Either way, the usual pre-tests do not answer the question practitioners actually care about.</p><p>The paper proposes replacing &#8220;no difference&#8221; tests with equivalence tests. Instead of asking whether pre-trends are exactly the same, it asks whether any differences are small enough to be considered negligible. The null becomes &#8220;differences are non-negligible&#8221;. Rejection then gives positive evidence in favour of PT.</p><p>The core object here is the size of pre-treatment trend differences, and whether the data support treating them as effectively zero for identification. The paper is about formalising that logic and giving tools to implement it.</p><p><em>What do the authors do?</em></p><p>They formalise pre-trend checking as an equivalence testing problem. Instead of testing whether each pre-treatment coefficient is zero, they define null hypotheses where pre-trend differences are at least as large as a user-chosen threshold. The alternative is that all differences are smaller than that threshold. Rejection then gives evidence that deviations from PT are negligible.</p><p>They propose three ways to summarise pre-trend differences: the maximum deviation across periods, the average deviation, and the root mean square deviation. Each corresponds to a different notion of what &#8220;small enough&#8221; means. The choice is left to the researcher and should be justified in context. They develop test statistics for each case, show their asymptotic properties and discuss implementation. In practice this means estimating the usual pre-treatment coefficients, then testing whether their joint behaviour stays within the chosen bound. They also show how to recover the smallest threshold for which equivalence would be accepted, if the researcher does not want to commit to a bound ex ante.</p><p>They extend the framework to staggered adoption and heterogeneous treatment effects using regression-based DiD setups. The idea is the same: define placebo effects in pre-treatment periods and test whether those are jointly small.</p><p>They back this up with simulations and an empirical illustration. The simulations compare equivalence tests to standard pre-tests under different violations of PT. The application re-examines a well-known DiD study and shows that, under their approach, the data give weak support for treating pre-trends as negligible.</p><p><em>Why is this important?</em></p><p>This is important because the usual pre-trend checks do not answer the identification question we are pretending they answer. In DiD, PT is an assumption, we never observe the counterfactual. Pre-trend tests are meant to give reassurance that the assumption is plausible, but the standard null is &#8220;no difference&#8221;. Failing to reject that null is often read as evidence of similarity. That is a logical error. It usually means the data are uninformative. This then creates two problems. 
In small samples, you can have meaningful differences in trends that go undetected. In large samples, you can reject for tiny differences that are irrelevant for the treatment effect. In both cases, the test result is hard to interpret for identification.</p><p>Equivalence testing flips the burden of proof. The null is that differences are large enough to worry about. To proceed, the data must show that differences are smaller than a threshold you consider acceptable. That aligns the test with the actual identifying assumption. You are no longer asking whether trends are identical, but rather whether deviations are small enough that treating them as zero is defensible.</p><p>This also forces structure. You have to say what &#8220;small enough&#8221; means in your application. That makes the identifying assumptions explicit rather than implicit. If you can&#8217;t justify any threshold, you can report the smallest one the data would support. That gives a direct sense of how much non-parallelism is compatible with your design. From an applied perspective, this changes interpretation. Instead of &#8220;we don&#8217;t see pre-trends&#8221;, you get &#8220;the data support ruling out pre-trend differences larger than X&#8221;. That is a statement about identification strength, not about p-values.</p><p><em>Who should care?</em></p><p>Everyone. We all do pre-trend checks. They sit at the start of almost every DiD design. They are usually the first thing people look at and the first thing readers use to judge credibility. This paper is about that step. The first one in the default workflow.</p><p>If you run event studies, plot leads, or report placebo regressions to justify PT, this applies to you. The argument is that the usual logic behind those checks is weak and often misleading. Equivalence testing is a way to make that step mean what we pretend it means.</p><p>This includes work with small samples, short panels, survey data, administrative data, and everything in between. It includes clean policy designs and messy real-world ones. If PT is doing identifying work in your paper, this framework is relevant.</p><p><em>Do we have code?</em></p><p>In a narrow sense, yes. The simulations are implemented in R and the empirical illustration is fully reproducible. The paper describes the procedures in enough detail that replication is straightforward, and the authors indicate that code is available. There is no standalone package or user-friendly wrapper. This is not a plug-and-play tool. That is consistent with the paper&#8217;s aim. The contribution is conceptual and inferential. If you&#8217;re comfortable writing your own pre-trend regressions and working with bootstrap routines, you can implement these tests without much friction. If you are not, there is no ready-made function you can drop into your workflow so maybe wait a little?</p><p>In summary, this paper is about aligning pre-trend checks with what PT means for identification. The standard workflow tests whether pre-treatment differences are zero, but that&#8217;s not the question we care about. We care whether differences are small enough to ignore. The authors show how to formalise that directly, using equivalence tests that put the burden of proof where it belongs. Their contribution is a way of being honest about assumptions: you either justify a threshold for negligible differences or you report how large differences could be before the design breaks. In both cases, interpretation improves. 
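</p><p>To make the flipped burden of proof concrete, here is a minimal sketch of an equivalence-style check on a single pre-trend coefficient (a simple two one-sided tests bound with made-up numbers; the paper&#8217;s tests are joint across pre-periods and use max, mean or RMS summaries).</p><pre><code># Minimal equivalence-style (TOST) check on one pre-trend coefficient.
# The null is that the pre-trend difference is at least delta in absolute value.
from scipy import stats

beta_lead, se_lead = 0.010, 0.020   # hypothetical lead coefficient and its SE
delta = 0.05                        # researcher-chosen "negligible" threshold

# Reject the null only if the estimate is significantly above -delta AND
# significantly below +delta.
t_low = (beta_lead + delta) / se_lead
t_high = (beta_lead - delta) / se_lead
p_equiv = max(1 - stats.norm.cdf(t_low), stats.norm.cdf(t_high))

print(f"equivalence p-value = {p_equiv:.3f}")
# A small p-value supports treating deviations smaller than delta as negligible.
# Failing to reject a zero-difference null would never license that statement.
</code></pre><p>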
If you take PT seriously as an identifying restriction, this is a cleaner way to assess it.</p><div><hr></div><h3>Harvesting Difference-in-Differences and Event-Study evidence</h3><h5>TL;DR: this paper dissects what modern DiD and event-study workflows are really identifying, using a single policy setting to show how normalisation, heterogeneity and design choices shape the estimand. It shows that dynamic paths, exposure coefficients and pre-trend checks carry more identifying content than many of us realise, and that regression output is often easier to compute than to interpret. The message is that standard implementations impose structure that is rarely acknowledged. If you use event studies or exposure designs, this paper sharpens the link between your code and your causal claims.</h5><p><em>What is this paper about?</em></p><p>This paper is a guided walk through modern DiD and event-study practice, it&#8217;s a synthesis and critique of what people are actually doing with staggered adoption, dynamic effects, exposure designs and pre-trend checks.</p><p>The authors use the unilateral divorce reforms in US states as a running example. They use it to show how static DiD, event studies and newer heterogeneity-robust approaches behave in practice. Their focus is on interpretation rather than technique, what is being identified under which assumptions, and how easy it is to get something that looks clean but is hard to interpret.</p><p>Three themes run through the paper. First, normalisation isn&#8217;t harmless in event studies because when and how you anchor the coefficients can change the shape of the estimated dynamics. Second, heterogeneous and time-varying effects complicate both static DiD and event-study estimates. The paper is explicit about where regression averages do and do not correspond to meaningful causal objects. Third, common workflow steps like pre-trend testing and log specifications carry stronger identifying content than many of us realise.</p><p>The paper&#8217;s goal is rather practical. It&#8217;s a great attempt at trying to discipline how we read DiD and event-study output, what object is this coefficient estimating, which assumptions are doing the work, and where the usual visual and statistical checks can mislead.</p><p><em>What do the authors do?</em></p><p>They take a single policy setting and run it through the full modern DiD toolkit: static two-way fixed effects, event studies with leads and lags, alternative normalisations, heterogeneity-robust approaches, exposure designs, pre-trend testing, and logs versus levels. The unilateral divorce reforms are the running example, used to illustrate how identification and interpretation change across specifications.</p><p>They start from the canonical TWFE model and make the identifying assumptions explicit: additive state and time effects +PT in untreated potential outcomes + no anticipation. They then introduce event-time indicators to allow for dynamic effects and show how this immediately creates normalisation problems. With staggered adoption and no never-treated units, at least two coefficients must be set to zero. Which ones you choose changes the shape of the estimated path.</p><p>They then move to treatment effect heterogeneity. First in simple two-period settings, where TWFE still recovers an average effect under clear conditions, then in staggered designs with dynamic effects, where already-treated units act as controls and regression averages can become hard to interpret. 
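</p><p>In the same spirit as the paper&#8217;s numerical examples, here is a tiny simulation of my own (not the divorce application) in which the static TWFE coefficient sits far below the average effect among treated observations, because effects grow with event time and already-treated units end up serving as controls.</p><pre><code>import numpy as np

# Two cohorts, no never-treated units, effects that grow with time since treatment.
n_per_cohort, T = 20, 8
first_treat = {0: 2, 1: 6}              # cohort 0 treated from t=2, cohort 1 from t=6

rows = []
for cohort, g in first_treat.items():
    for i in range(n_per_cohort):
        u = cohort * n_per_cohort + i
        for t in range(1, T + 1):
            d = int(t >= g)
            tau = (t - g + 1) * d        # effect grows in event time
            y = 0.5 * u + 0.3 * t + tau  # unit FE + time FE + effect, no noise
            rows.append((u, t, d, y, tau))

unit, time, d, y, tau = np.array(rows, dtype=float).T

def dummies(x):
    vals = np.unique(x)
    return (x[:, None] == vals[None, :]).astype(float)[:, 1:]   # drop one level

# Static TWFE regression: y on a treatment dummy plus unit and time dummies.
X = np.column_stack([d, dummies(unit), dummies(time), np.ones_like(d)])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"static TWFE coefficient:           {beta[0]:.2f}")
print(f"average effect among treated obs.: {tau[d == 1].mean():.2f}")
# With growing effects, the TWFE coefficient lands well below the 3.40 average
# effect among treated cells, because the late cohort is partly compared
# against the already-treated early cohort.
</code></pre><p>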
Small numerical examples are used to show when static DiD collapses and when event studies produce misleading dynamics.</p><p>For exposure designs, they write down a model where treatment is continuous rather than binary and allow effects to vary with exposure intensity. They show that the usual regression coefficient identifies an average marginal effect, not an average of underlying treatment effects. This distinction is carried through using the potato example.</p><p>They also implement the BJS imputation estimator on the divorce data and compare it to standard event-study regressions. This is used to assess how much cross-sectional heterogeneity actually distorts regression-based dynamics in a real application.</p><p>At the end they examine pre-trend testing and inference, discuss why lead coefficients are a pre-test, why this can create problems for inference and why it is still informative in practice. They also simulate clustered standard errors in staggered event studies to show how leverage and limited treated observations at long horizons can understate uncertainty.</p><p><em>Why is this important?</em></p><p>A large share of applied DiD and event-study work relies on regression output that looks intuitive but is tightly tied to modelling choices that are rarely spelled out, and this paper shows in a very concrete way where structure is being imposed and how that structure shapes what is being identified. Normalisation choices in event studies are not cosmetic. Using already-treated units as controls is not neutral. Exposure designs are not estimating the same object as binary DiD. Lead coefficients are not just informal diagnostics. Each of these decisions changes the estimand, often in ways that are easy to miss when reading a table or a plot.</p><p>The paper is also important because it forces some discipline into how we talk about dynamics and heterogeneity. Many applied papers present event-time paths, average effects and pre-trend plots as if they were direct summaries of the data, when in fact they are products of models that restrict untreated potential outcomes, restrict how treatment effects evolve over time and restrict how heterogeneity enters. Once those restrictions are written down, it becomes much harder to treat the output as self-explanatory.</p><p>The exposure design discussion makes this very clear. A single regression coefficient is often read as &#8220;the effect of exposure&#8221;, but in the presence of heterogeneity it is an average marginal effect, not an average of underlying treatment effects, and those two objects diverge as soon as effects vary with exposure intensity. That is an interpretation problem and it applies directly to a large class of shift-share and intensity-based designs.</p><p>The same logic applies to event studies. When there are no never-treated units, at least two coefficients are pinned down by assumption rather than data, and with heterogeneous effects the estimated dynamics can reflect changing composition of contributing units rather than changes in causal impacts. None of this requires pathological designs. It arises in settings that look standard.</p><p>On pre-trends and inference, the paper is explicit about trade-offs that are often waved away. Lead coefficients are pre-tests. Pre-tests affect post-test interpretation. Clustered standard errors behave poorly when only a few units identify long leads and lags. 
These are features of the designs we actually use.</p><p>Overall, the value of the paper is that it tightens the link between regression output and causal interpretation in a way that is directly relevant for applied work. It keeps bringing the reader back to the same questions: what is being identified, under which assumptions, and whether that object matches the story being told.</p><p><em>Who should care? </em></p><p>Anyone running DiD or event studies in applied work, which in practice means most people doing policy evaluation with panel or repeated cross-section data. If you use staggered adoption, plot leads and lags, average dynamic effects, or rely on pre-trend checks to support identification, this paper is written for you.</p><p>This includes people working with clean administrative panels and people working with short, noisy surveys. It includes settings with clear policy shocks and settings with fuzzy timing. It includes binary treatments and exposure designs. The issues the paper raises are not niche. They show up in default workflows.</p><p>It is also relevant for readers and referees. If you are judging whether an event-study plot support PT, whether a dynamic path is meaningful, or whether an exposure coefficient has a sensible interpretation, the paper gives you a sharper set of questions to ask. Where is the normalisation? Which units are acting as controls? What object is this coefficient estimating? If DiD is doing identifying work in your paper, or in papers you read and referee, this applies.</p><p><em>Do we have code?</em></p><p>The paper is accompanied by replication code for the divorce application, and the examples are implemented in standard Stata-style workflows. The authors are explicit about how the event-study specifications, alternative normalisations, and the BJS imputation estimator are implemented, which makes it easy to map the discussion to actual code. This paper is not a software contribution and it is not packaged as a toolkit. The value is that the code mirrors what people already do, which makes the points about normalisation, heterogeneity, and inference easy to check in practice. You can reproduce the figures, change the normalisation, drop never-treated units, and see the mechanics for yourself. If you have used event-study code from recent DiD packages, nothing here will look exotic. That is part of the point. </p><p>In summary, this paper is a reality check on modern DiD practice. It does not propose a new estimator or a new identification strategy. It shows how the tools people already use behave once you take their assumptions seriously. The unifying message is simple: DiD and event studies remain useful, but their output is only as interpretable as the assumptions underneath. This paper helps make those assumptions visible, and it shows where common workflows quietly add identifying content. If you want your coefficients to mean what you think they mean, you need to be clear about what is being identified and why.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>A repeated cross-section is when you observe different units at each point in time rather than following the same units over time, which is common in surveys, firm data, and rotating samples, and is precisely why compositional changes become a problem. 
When you write &#8220;the treated group&#8217;s outcome changed over time&#8221;, what you&#8217;re really saying is &#8220;the average outcome of whoever is in the treated group sample at time 1 is different from the average outcome of whoever was in the treated group sample at time 0&#8221;. <em>Those aren&#8217;t the same individuals</em>. If the composition changes, part of the difference can come from who is being observed, not from the treatment. That is why standard DiD methods impose no compositional changes/stationarity. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>By stationarity, the DiD literature typically means that the joint distribution of treatment status and covariates doesn&#8217;t change over time, i.e., treatment status and covariates are independent of time:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; (D,X)&#8869;T&quot;,&quot;id&quot;:&quot;FFREWBOWOD&quot;}" data-component-name="LatexBlockToDOM"></div><p>which rules out compositional changes in who is observed before and after treatment, and is different from stationarity in time-series analysis. The kinds of people (or firms, schools, etc.) you observe before treatment and after treatment come from the same population, with the same mix of characteristics and the same treated/control composition. With repeated cross-sections, you aren&#8217;t following the same units over time, so you need something like &#8220;even though we observe different people each period, they are drawn from the same underlying population&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We spoke about this before. Conditional parallel trends means that, after conditioning on observed covariates <strong>X</strong>, treated and untreated units would have followed the same average outcome trends in the absence of treatment. This allows different groups to have different raw trends, as long as these differences are explained by <strong>X</strong>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The efficient influence function characterizes the best possible (lowest variance) regular estimator under the assumed model. The associated efficiency bound is the variance lower bound that no estimator can beat under those assumptions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Here &#8220;doubly robust&#8221; is meant in the rate sense (how fast the estimation error shrinks as sample size grows). This is different from classical DR based on correct specification of parametric models. In DiD (and many causal methods), you usually have to estimate two auxiliary pieces before you get your treatment effect: an outcome model (how outcomes relate to covariates), and a propensity/selection model (how treatment or group membership relates to covariates). 
These are called nuisance components because they aren&#8217;t your target, but you need them to get the target. Classical DR means: if either the outcome model is correctly specified or the propensity model is correctly specified, the final estimate is still consistent. In other words: you can mess up one, as long as the other is right you still get the right answer. In this paper DR means something different and weaker: the final estimator converges to the truth at the usual speed (root-n) as long as one of the two nuisance pieces is estimated well enough, even if the other is a bit messy or slow. in summary, classical DR &#8594; about correctness of models, whereas rate DR &#8594; about speed of convergence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>A Hausman-type test is a consistency check between two estimators that rely on different assumptions. One estimator is robust: it remains valid even when the data are messy, but it&#8217;s less precise. The other is more efficient: it has lower variance, but it only works if a strong assumption holds. The test asks whether the two estimates are statistically indistinguishable. If they are, the strong assumption isn&#8217;t empirically relevant and you can safely use the more efficient estimator. If they aren&#8217;t, the efficient estimator is biased and you should stick with the robust one. In this paper, the robust estimator allows for compositional changes, while the efficient one assumes stationarity. The Hausman test tells you whether stationarity matters for the ATT in your data.</p></div></div>]]></content:encoded></item><item><title><![CDATA[On Finding Causes vs. 
Finding Levers]]></title><description><![CDATA[What are you actually trying to say?]]></description><link>https://www.diddigest.xyz/p/on-finding-causes-vs-finding-levers</link><guid isPermaLink="false">https://www.diddigest.xyz/p/on-finding-causes-vs-finding-levers</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Tue, 18 Nov 2025 14:28:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IbDJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IbDJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IbDJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IbDJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IbDJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IbDJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IbDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Best binoculars for stargazing 2025: Spot stars and galaxies | Live Science&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Best binoculars for stargazing 2025: Spot stars and galaxies | Live Science" title="Best binoculars for stargazing 2025: Spot stars and galaxies | Live Science" srcset="https://substackcdn.com/image/fetch/$s_!IbDJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!IbDJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IbDJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IbDJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d4e7d06-b3f3-4b17-811d-aea9bda59b43_2560x1450.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there!</p><p>This post is a bit different from the others, but I felt compelled to write it as someone who does research on education and runs this newsletter on a specific causal inference technique. Nothing I will say in this post is new in the sense that it hasn&#8217;t been discussed before, but I do hope to provide more perspective to those who are not &#8220;in the trenches&#8221; of these sorts of nuanced topics.</p><p>It all started with a post on Twitter/X about the effects of school spending on educational outcomes (heterogenous treatment effects&#8217; discussion: almost absent, except maybe for <a href="https://x.com/unboxpolitics/status/1990578814645137474">this</a>). On one side, some users pointed to specific RCTs in developing countries as evidence that increasing school funding increases test scores. On the other, some pointed to meta-analyses finding mixed results in developed countries. The biggest point of discussion was on effect size and statistical-significance.</p><p>I&#8217;m not going to weigh in on that specific example since it&#8217;s not the point of this post (also Twitter/X is a better place for this type of discussion). Instead, I want to talk about the &#8220;causal inference revolution&#8221;, how its legacy might be amplified - positively or negatively - by the problems we will (already?) 
have (for example, when it comes to outsourcing the reviewing process to LLMs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>), and circle back to the discussion on effect size and statistical-significance. Admittedly, the referee-example is a niche concern directly affecting a small circle of researchers. But since many of us go on to teach the present and next generations (and maybe some are out there reading this!), a nuanced view here could do wonders to spark their interest in what actually drives variation. </p><p>The debate also struck a chord because it touches on something I love to explore (thanks prof Frank for asking me a very relevant question at my first ever PhD seminar): the mechanisms and channels behind a causal claim. They are the pathways through which the cause creates an effect. The &#8220;cause&#8221; (e.g., spending) doesn&#8217;t &#8220;occur&#8221; through the mechanism; it &#8220;operates&#8221; through the mechanism. The mechanism can be of all sorts, like increasing teacher training or providing meals at no direct cost to the students. It&#8217;s one thing to say &#8220;spending is a cause&#8221; (identification). It is entirely another to ask if spending is the primary driver of student success (decomposition). We need to distinguish between identifying a causal link and finding a policy &#8220;lever&#8221;, a factor that really accounts for the variation we see in the world.</p><h3>How &#8220;credibility&#8221; became the new gold-standard</h3><p>I&#8217;m not sure if it was an &#8220;coordinated revolution&#8221; with a defined plan<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, but something undeniably changed in empirical economics in the early 00&#8217;s. The &#8220;credibility revolution&#8221; of the 90&#8217;s and 2000&#8217;s completely shifted our methodological goalposts. I would argue it wasn&#8217;t so much a planned success as it was a fundamental shift in standards.</p><p>Before, running a kitchen-sink OLS<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> and hand-waving about selection bias and omitted variables was common (first thing you teach the undergrads *not* to do). After, the new gold standard became clean identification<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The &#8220;Big Four&#8221; methods - RCTs, DiD, IV, and RDD - became the &#8220;gatekeepers&#8221; of truth<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. In this sense, the movement won. It won the argument over what &#8220;credible&#8221; empirical work looks like, and it gave us a powerful, trusted toolkit to find &#8220;cause&#8221;, and therefore has changed the actual output of the profession<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. </p><p>This new toolkit also created a very powerful new incentive structure. We now know that &#8220;causal novelty&#8221; (finding a new relationship) and &#8220;causal narrative complexity&#8221; (using these sophisticated methods) are strong predictors of getting published in a Top 5 journal. 
But, anecdotally, it goes further down the line. Every week I get a few friends and colleagues asking me &#8220;how can I DiD this?&#8221;, &#8220;what should be my treatment group?&#8221;, &#8220;is it ok if I have around 20 treated units, shifting in and out of treatment, and 400 control units in some period?&#8221;. I find these questions fascinating because they make us all think deeper about the limits of research questions and proper study design. But notice what is missing from these conversations.</p><p>We spend hours debating the feasibility of the identification strategy (&#8220;can we get the standard errors right?&#8221;, &#8220;is this even worth publishing?&#8221;) but we rarely stop to ask: if this ~does~ work, will the effect size be large enough to matter?</p><p>And that&#8217;s where the problem starts. The revolution succeeded in teaching us how to find a cause. But in our quest to publish novel, statistically significant causal chains, we often forget to ask if we&#8217;ve found a lever<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><h3>Why a cause is not a lever</h3><p>So we have the toolkit. We can &#8220;confidently&#8221; say that X causes Y. But this brings us back to the Twitter debate and that 0.056 SD effect (a number which I made up, by the way, and got smart pushback on). The problem happens when we assume finding a cause is the same as finding a lever. It isn&#8217;t.</p><p>Prof Cochrane - just a few days ago - cleverly articulated this in his essay &#8220;<a href="https://www.grumpy-economist.com/p/causation-does-not-imply-variation">Causation Does Not Imply Variation</a>&#8221;. From my understanding, his point was that you can rigorously prove a policy causes a change in an outcome, but that cause might account for almost none of the variation between success and failure in the real world. You can prove that school funding causes test scores to move (p &lt; 0.05), but if the effect is small, it explains almost none of the difference between a struggling school and a thriving one.</p><p>Prof Galiani wrote a <a href="https://sebastiangaliani.substack.com/p/causality-causes-and-what-our-methods">post</a> after Prof Cochrane&#8217;s in which he takes this further, explaining why our methods fail us here. He reiterates that the &#8220;causal revolution&#8221; tools (DiD, RCTs, etc.) are built to answer a specific question - &#8220;what happens to Y when we induce an exogenous change in X?&#8221; - and that they are ~not~ built to answer THE broader question: &#8220;what are the main causes of variation in Y?&#8221;.</p><p>Prof Galiani points out a couple of limitations in treating these &#8220;causes&#8221; as &#8220;levers&#8221;. In the first one, which is a &#8220;battle&#8221; between &#8220;local vs structural&#8221;, he says that our methods identify a total effect in a specific context rather than a universal structural parameter. This focus on the &#8220;snapshot&#8221; is likely to obscure dynamic levers: a small effect size might look dismissible in a static regression, but if it captures a structural parameter that compounds over time (i.e., a daily learning gain), it can be massive. We learn what happened there, not necessarily how the machine works over time and everywhere. In the second one, characterized as &#8220;the SUTVA trap&#8221;, he says that these methods rely on SUTVA, which essentially assumes no spillovers. 
But true &#8220;levers&#8221; - like macro policies or institutional changes - are defined by their equilibrium effects and spillovers. By assuming them away to get a clean estimate, we often strip the &#8220;cause&#8221; of the very mechanism that would make it a &#8220;lever&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. </p><p>Statisticians and econometricians have been warning about this for decades, often using a vocabulary that is painfully absent from our current seminars. Profs McCloskey and Ziliak coined the term &#8220;<a href="http://valjhun.fmf.uni-lj.si/~mihael/ef/ps/pdfpapers/mccloskey.pdf">Oomph</a>&#8221; to describe &#8220;the difference a treatment makes&#8221;, arguing that our profession suffers from a disproportionate focus on &#8220;precision&#8221; (statistical significance) at the expense of &#8220;Oomph&#8221; (magnitude).</p><p>This outcome shouldn&#8217;t surprise us. Nearly 40 years ago, sociologist Peter Rossi formulated the &#8220;<a href="https://gwern.net/doc/sociology/1987-rossi">Stainless Steel Law of Evaluation</a>&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>: &#8220;The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero&#8221;. The credibility revolution gave us the stainless steel tools Rossi predicted; we shouldn&#8217;t be shocked that they are doing exactly what he said they would: revealing that many well-meaning interventions don&#8217;t work.</p><p>Prof Kr&#228;mer formalized this critique. He <a href="https://www.econstor.eu/bitstream/10419/292347/1/schm.131.3.455.pdf">argued</a> that our obsession with precision has birthed a new category of scientific failure: the Type C Error (his nomenclature).</p><p>We are familiar with Type A errors (finding an effect where there is none) and Type B errors (missing a large effect because of noise). But a Type C Error occurs when there is a &#8220;small effect (no &#8220;oomph&#8221;), but due to precision it is highly &#8220;significant&#8221; and therefore taken seriously&#8221;. This is the (abstract) 0.056 SD problem. It is a Type C error masquerading as a scientific victory. We used our sophisticated &#8220;credibility&#8221; toolkit to maximize precision, driving our standard errors down to zero, which allowed us to put three stars next to a number that, for all practical purposes, is zero.</p><p>We found a cause. We did not find a lever. A &#8220;lever&#8221; is a confirmed cause that possesses &#8220;Oomph&#8221;, meaning it explains a substantial portion of the actual variation in the outcome. But we need to be precise. A &#8220;lever&#8221; isn&#8217;t a large coefficient in isolation. Building on <a href="https://onlinelibrary.wiley.com/doi/pdf/10.1111/joes.12330">Prof Sterck (2019)</a>, I would argue that a true lever is a variable that drives the variation we see in the real world. It is a factor that explains a meaningful percentage of the deviations in our outcome. So a 0.056 SD effect might be a &#8220;cause&#8221;, but if it explains only 0.1% of the variation in test scores, it is certainly not a lever.</p><p>This distinction seems clear &#8220;enough&#8221; in theory, but in practice - as when we are staring at a regression table - it gets muddy. Where&#8217;s the &#8220;lever&#8221; column? 
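</p><p>To make the gap concrete, here is a tiny simulated example (all numbers made up, in the spirit of the fictional 0.056 SD effect): with a huge sample, a trivially small effect gets an enormous t-statistic, yet it explains well under 1% of the variation in the outcome. The second number is the closest thing to a &#8220;lever column&#8221; I can think of, and it never prints by default.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
n = 500_000                           # a huge sample, as in the fertility example

x = rng.normal(size=n)                # a standardized "cause"
y = 0.05 * x + rng.normal(size=n)     # tiny true effect, lots of other variation

# OLS slope, standard error and t-statistic for a single (mean-zero) regressor
beta = np.cov(x, y)[0, 1] / np.var(x)
resid = y - beta * x
se = np.sqrt(np.var(resid) / (n * np.var(x)))
t_stat = beta / se

# the would-be "lever column": share of the variance of y that x explains
share_explained = (beta ** 2) * np.var(x) / np.var(y)

print(f"beta = {beta:.3f}, t = {t_stat:.1f}")                   # three stars, easily
print(f"share of variance explained = {share_explained:.2%}")   # well under 1%
</code></pre><p>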
Instead, we fall back on the standard metrics we were trained to trust, a lot of the time without realizing they aren&#8217;t answering the question we think they are.</p><h3>What do we even mean by &#8220;important&#8221;?</h3><p>In the back of our heads there&#8217;s always a voice that goes &#8220;check N, R^2, effect size, and statistical significance&#8221;. I don&#8217;t think that&#8217;s bad, but it is superficial. </p><p>When we find a &#8220;cause&#8221;, how do we decide if it matters? We play around with terms like &#8220;economically significant&#8221;, but as Prof. Sterck (2019) argued, the literature is surprisingly vague on what that really means<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. We usually rely on standard tools - like standardized coefficients (the 0.056 SD!) or R^2 decompositions - but Prof Sterck shows these are often &#8220;flawed and misused&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> (standardized coefficients, for instance, are sensitive to sample variance, making them unreliable for comparing effects across different populations). If our tools for measuring &#8220;importance&#8221; are not ideal, it&#8217;s no wonder we are left to rely on one tool that gives us a clear binary answer and that all scientists (I am being charitable) understand: the p-value.</p><p>But this is where the &#8220;stars&#8221; can &#8220;blind&#8221; us. To see why, look at the fertility case study in Prof Sterck&#8217;s paper. He analyzes the determinants of fertility rates (specifically the number of births per woman) using a massive dataset (N &gt; 490,000). Because the sample is so huge, almost everything is statistically significant. If you look at the regression output, you see stars next to ln(GDP/capita), land quality, distance to water, child mortality, age, and education. If you are hunting for &#8220;causes&#8221; (stars), you get lost in the sauce. The regression output shows that all of the aforementioned variables are &#8220;significant&#8221; causes. This makes it incredibly easy to engage in HARKing (Hypothesizing After the Results are Known). A researcher looking at those stars could pick any variable (e.g., distance to water) and justify a paper on it. They are all &#8220;causes&#8221;. It&#8217;s not even &#8220;data mining&#8221; or &#8220;p-hacking&#8221; in the traditional sense because no manipulation was necessary. The significance was guaranteed by the sample size - which is a feature, not a bug, as it gives us precise estimates (though, as I&#8217;ve <a href="https://www.diddigest.xyz/p/in-defense-of-machine-learning-in">argued before</a>, modern ML tools are better suited to handle this &#8220;star-gazing&#8221; problem in high-dimensional data than standard OLS); the error is in pretending that significance equals importance. Of course, context dictates the threshold for &#8220;Oomph&#8221;. If the outcome is &#8220;survival rates&#8221;, a 0.056 SD effect *is* a massive victory. 
But for continuous variables like test scores or wages, where we are trying to explain inequality or performance gaps, a 0.056 SD effect without a cost-benefit argument should be taken with a grain of salt.</p><p>But&#8230; if you shift your thinking to look for &#8220;levers&#8221;, i.e., by asking both &#8220;is it non-zero?&#8221; and &#8220;how much of the deviation does this actually explain?&#8221;, the picture changes completely. To go back to his example: ln(GDP/capita) explains only 0.19% of the deviation (a cause, but not a lever), distance to water explains 0.73%, age explains 45.6%, and woman&#8217;s education explains 10.0%. The &#8220;stars&#8221; told us all these variables were significant. But only by asking the right question - &#8220;does this variable actually drive the variation?&#8221; - do we see that GDP is a distraction in this context, while education is a massive policy lever.</p><p>This is the trap we fall into. We use the precision of our causal tools to find the 0.19% drivers, and because they have stars next to them, we treat them as if they matter just as much as the 10% drivers.</p><h3>What we publish vs what we cite</h3><p>This brings us back to the incentives. If we know that &#8220;levers&#8221; (like education in the example above) are what actually drive outcomes, why do we spend so much time hunting for and publishing tiny &#8220;causes&#8221;?</p><p>Because that is what the market *rewards*.</p><p><a href="https://arxiv.org/pdf/2501.06873">Garg and Fetzer (2025)</a> find a stark divergence between what gets published (causes) and what creates impact (levers). Top 5 journals reward &#8220;novelty&#8221;: finding a new causal link, even a tiny, obscure one, is the path to acceptance. The long-term impact (citations), though, comes from engaging with &#8220;central, widely recognized concepts&#8221;. </p><p>We are incentivized to hunt for the 0.056 SD effect because it&#8217;s novel. But we cite the papers that discuss the big, central questions. I think both are fair ways to advance the science, and they are not mutually exclusive. We need architects to design the building (structural/levers) and plumbers to fix the pipes (causal/identification). The problem arises when we only reward the plumbers but expect them to explain the architecture. Prof. Megan Stevenson (2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> argues that this expectation stems from a mistaken &#8220;Engineer&#8217;s View&#8221; of the world - the belief that society is a machine where we can isolate and pull specific levers for predictable results. But the social world is actually full of &#8220;stabilizers&#8221;, which are forces that push people back onto their original trajectories after a small, limited-in-scope intervention. A small &#8220;cause&#8221; (like a job training program) rarely triggers a &#8220;cascade&#8221; of success because structural forces dampen the effect. Finding a statistically significant effect in a vacuum ignores the stabilizing forces that make that effect irrelevant in the aggregate.</p><p>And as I briefly mentioned, this problem is likely to get worse. As AI-assisted tools are increasingly used in peer review, we risk amplifying this exact incentive structure. 
An LLM can be trained to find &#8220;novel causal claims&#8221; and check for the &#8220;right&#8221; methods, but it cannot *easily* be trained to understand &#8220;economic importance&#8221; (or can it? I won&#8217;t bet against it), a concept we humans sometimes struggle to agree on, since it&#8217;s highly context-dependent. An AI reviewer seems to be the ultimate &#8220;cause&#8221; hunter, as of today. I think it will likely reward the 0.056 SD effect as long as it&#8217;s novel and significant. It has no intuition for finding the &#8220;lever&#8221;. LLMs are trained on the corpus of published literature - a literature that systematically selects for significance over importance. The AI will likely entrench, rather than correct, this bias. But I want to be proven wrong.</p><p>As Prof. Kr&#228;mer wrote, &#8220;Cheap t-tests... have in equilibrium a marginal scientific product equal to their cost&#8221;. AI makes those tests almost free. My point isn&#8217;t that we should abandon the p-value, but that we must be far more rigorous about the questions we ask during the design phase. Instead of stopping at &#8220;can I identify this?&#8221; or &#8220;is this novel?&#8221;, we need to ask: &#8220;does this variable explain the variation?&#8221;. If it works, how much of the problem does it actually solve? Practically, this means reporting variance decompositions alongside our regression tables, or justifying small effects with explicit cost-benefit or compounding arguments. If we don&#8217;t start valuing the answers to these questions, we are going to drown in statistically significant noise.</p><p>Further readings:</p><ul><li><p>Angrist, J. D., &amp; Pischke, J. S. (2010). <a href="https://www.researchgate.net/publication/43296056_The_Credibility_Revolution_in_Empirical_Economics_How_Better_Research_Design_Is_Taking_the_Con_out_of_Econometrics">The credibility revolution in empirical economics: How better research design is taking the con out of econometrics.</a> <em>Journal of Economic Perspectives</em>, 24(2), 3-30.</p></li><li><p>Deaton, A. (2010). <a href="https://www.researchgate.net/publication/46554038_Instruments_Randomization_and_Learning_about_Development">Instruments, randomization, and learning about development.</a> <em>Journal of Economic Literature</em>, 48(2), 424-455.</p></li><li><p>Imbens, G. W., &amp; Angrist, J. D. (1994). <a href="https://www.gsb.stanford.edu/faculty-research/publications/identification-estimation-local-average-treatment-effects#:~:text=Angrist,Issue%202%20Pages%20467%E2%80%93475.">Identification and Estimation of Local Average Treatment Effects.</a> <em>Econometrica</em>, 62(2), 467&#8211;475.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Already happening in <a href="https://x.com/iclr_conf/status/1990204431959470099">other sciences</a>. 
Good point <a href="https://x.com/alexolegimas/status/1990424516112056351?s=20">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Anyone wanting to discuss this case against Kuhn&#8217;s definition, e-mail me!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>&#8220;OLS&#8221; (Ordinary Least Squares) is the standard method for drawing a line of best fit through data. A &#8220;kitchen-sink&#8221; regression is when a researcher throws every available variable into the model (control variables) in hopes of isolating an effect. Endogeneity is a fatal flaw here: it happens when there is something else driving your result that you didn&#8217;t (or couldn&#8217;t) put in the &#8220;sink&#8221;. For example, if you find that &#8220;tutoring causes higher test scores&#8221; but you didn&#8217;t control for &#8220;parental motivation&#8221; your result is <a href="https://x.com/Jackbmeyer/status/1989809239842570666/photo/1">endogenous</a>, meaning it&#8217;s biased because motivated parents are likely the ones hiring the tutors.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In econometrics, &#8220;identification&#8221; refers to the strategy used to try to strip away bias and identify the true causal link. A &#8220;clean&#8221; identification strategy mimics a laboratory experiment. For example, an RCT flips a coin to assign treatment; a Regression Discontinuity (RDD) compares people just above and just below an arbitrary cutoff (like a test score threshold for a scholarship); and DiD compares the change in a treated group to the change in a control group over time.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://arxiv.org/pdf/2501.06873">Garg and Fetzer (2025)</a> note that &#8220;leading journals now prioritize studies employing these methods over traditional correlational approaches&#8221;, and they point out that these four specific methods are the ones that have seen &#8220;substantial growth&#8221; as the discipline shifted toward rigorous identification.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://arxiv.org/pdf/2501.06873">Garg and Fetzer (2025)</a> show that the share of claims supported by these rigorous causal methods skyrocketed from about 4% in 1990 to nearly 28% in 2020. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>A &#8220;cause&#8221; answers the question &#8220;does X affect Y?&#8221; (precision); a &#8220;lever&#8221; answers the question &#8220;how much of the difference between success and failure does X actually explain?&#8221; (importance). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>He argues that we need both: the local identification of causal effects <em>and</em> models (comparative statics) to understand the &#8220;causal architecture&#8221; of a phenomenon. He also notes in an addendum that low partial R^2 isn&#8217;t bad per se, provided the intervention is cost-effective - which, by extension, aligns with our definition of a &#8220;lever&#8221;: it has to be worth pulling.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Thank you <a href="https://x.com/lu_sichu/status/1990883038826225987?s=20">Sichu</a> for flagging this.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Standardized coefficients are tricky (not as straightforward as logs) because they depend heavily on the variance of your sample, making comparisons across studies a literal nightmare (good luck on your meta-analyses). R^2 measures (like Shapley values) are often computationally heavy and not intuitive, and they can attribute importance to irrelevant variables just because they are correlated with relevant ones (that is also one of the criticisms regarding the use of ML in Econ, specifically for large samples).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>And &#8220;with large N you don&#8217;t need randomization&#8221;&#8230; :P</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Another suggestion by <a href="https://x.com/lu_sichu/status/1990882868600422473?s=20">Sichu</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Hidden Bias, Pie Charts and Spillovers]]></title><description><![CDATA[What to consider when reality has stopped cooperating]]></description><link>https://www.diddigest.xyz/p/hidden-bias-pie-charts-and-spillovers</link><guid isPermaLink="false">https://www.diddigest.xyz/p/hidden-bias-pie-charts-and-spillovers</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 03 Nov 2025 14:18:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AW3t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AW3t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AW3t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AW3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2740786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.diddigest.xyz/i/176027761?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AW3t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!AW3t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ea862e-6215-4617-9953-f2146e10e18a_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hello! Today we have 3 theory papers, 1 &#8220;software&#8221;/guide paper, and 3 applied ones I found (the Japanese one is behind a paywall, so email me if you would like a copy). All of them are super cool.<br>Buckle up because we have many concepts from areas we might not be entirely familiar with&#8230; check the footnotes!</p><ul><li><p><a href="https://arxiv.org/abs/2510.09064">Sensitivity Analysis for Treatment Effects in Difference-in-Differences Models using Riesz Representation</a>, by Philipp Bach, Sven Klaassen, Jannis Kueck, Mara Mattes and Martin Spindler</p></li><li><p><a href="https://onilboussim.github.io/files/JMP.pdf">Compositional Difference-in-Differences for Categorical Outcomes</a>, by Onil Boussim (on the JM!)</p></li><li><p><a href="https://arxiv.org/abs/2502.03414">Efficient nonparametric estimation with difference-in-differences in the presence of network dependence and interference</a>, by Michael Jetsupphasuk*, Didong Li and Michael G. Hudgens (* on the JM!)</p><div><hr></div></li></ul><p>Software/guide:</p><ul><li><p><a href="https://arxiv.org/pdf/2510.19426">Using did_multiplegt_dyn to Estimate Event-Study Effects in Complex Designs: Overview, and Four Examples Based on Real Datasets</a>, by Cl&#233;ment de Chaisemartin, Diego Ciccia, Felix Knau, M&#233;litine Mal&#233;zieux, Doulo Sow, David Arboleda, Romain Angotti, Xavier D&#8217;Haultf&#339;uille, Bingxue Li, Henri Fabre and Anzony Quispe <em>(De Chaisemartin et al. extend their 2024 event-study estimators into a new command for Stata and R. The command handles on/off, continuous and multivalued treatments, and it replaces did_multiplegt for this use. The new command is faster because the variance is computed with analytic formulas rather than the bootstrap, as they show in the paper. It&#8217;s a neat and practical guide: they go through the estimators, validate them in simulations and show four real-data applications).</em></p><div><hr></div></li></ul><p>Applied:</p><ul><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5375017">The Labor Market Effects of Generative AI: A Difference-in-Differences Analysis of AI Exposure</a>, by Andrew C. Johnston and Christos A. 
Makridis (USA, DiD with continuous treatment intensity in an event study framework) &#8594; <em>I was seriously considering starting a Substack focused on Econ and AI, but I don&#8217;t have time. If you feel like you could do this, you should do it. It&#8217;s an area in development, at its early stages, and we would all benefit from learning more about it. This paper is the first comprehensive study providing economy-wide evidence of real labour market impacts from AI exposure. What I particularly enjoyed was the analysis at both extensive (e.g., whether workers remain employed or lose jobs, whether firms enter or exit sectors, whether establishments expand or contract their workforce) and intensive (e.g., how AI changes the composition of tasks workers perform within their jobs, changes in productivity/output per worker, changes in hours worked or work intensity) margins. They also decompose AI exposure into augmenting (where AI complements human work) versus displacing (where AI substitutes for it) components, finding that augmenting exposure drives employment growth, while displacing exposure causes contraction. Because the analysis was done using administrative data that covered the universe of U.S. employers, they were able to show that high-skilled workers see larger gains, which represents a departure from previous automation waves.</em></p></li><li><p><a href="https://link.springer.com/article/10.1007/s42973-025-00214-8">The role of human interaction in innovation: evidence from the 1918 influenza pandemic in Japan</a>, by Hiroyasu Inoue, Kentaro Nakajima, Tetsuji Okazaki, and Yukiko U. Saito (Japan, tech-class x year DiD with common shock and differential exposure; event-study; PPML for counts) &#8594; <em>I&#8217;m a sucker for natural experiments, even though there were lots of deaths involved. I&#8217;m also a sucker for historical data collection, which I would love to do if I spoke Japanese. This paper mixes both things. The 1918 flu in Japan had three waves (Oct 1918&#8211;Mar 1919, Dec 1919&#8211;Mar 1920 and Dec 1920&#8211;Mar 1921), with about 23.8 million infections and 390k deaths out of a 55 million population, which is crazy. Transport and factories were disrupted, mortality peaked among working-age adults and face-to-face contact became costly for inventors. This setting allowed the authors to test the claim that tacit knowledge and spillovers travel through direct interaction. We know that collaboration drives the flow of ideas, which has a direct effect on the quality of output, so the test was straightforward: if tacit knowledge moves through direct contact, then tech classes that collaborated more before 1918 should stumble more when contact gets costly. The authors proxy collaboration needs with pre-1918 co-inventing shares and track patents by class over 1911&#8211;1930, so you can watch the pandemic hit play out. The drop is sharper in collaboration-intensive classes in 1919&#8211;1921, and it comes from fewer new inventors rather than incumbents retreating. The key message then becomes: raise interaction costs and innovation slows most where collaboration is the input. This paper made me think of Covid as the modern analogue: a common shock that raised interaction costs, with differential exposure by how much collaboration must be in person. 
Unlike in 1918, we had Zoom and cloud tools, which softened the blow, so we should expect the effects to be smaller where remote work was feasible and larger in wet-lab, field or hardware R&amp;D (if anyone wants to write about it).</em></p></li><li><p><a href="https://onlinelibrary.wiley.com/doi/full/10.1002/soej.12793">Forgoing Nuclear: Nuclear Power Plant Closures and Carbon Emissions in the United States</a>, by Luke Petach (USA, TWFE DiD with staggered treatment timing; robustness checks with Callaway-Sant&#8217;Anna and Borusyak et al. estimators) &#8594; <em>We were discussing on X/Twitter the other day the difficulty of using DiD to quantify the effect of new data centres on energy prices due to several threats to identification: data centre planners select locations based on energy characteristics that independently affect price trends (selection bias), their arrival coincides with other infrastructure investments and policy changes (confounders), energy markets are interconnected across regions (spillovers) and prices may react to announcements before construction begins (anticipation effects). Fundamentally, it&#8217;s hard to find a credible counterfactual for what would have happened to energy prices without the data centre. Luke has addressed lots of those concerns in his paper, but more importantly, a plant closing is more exogenous than a data centre (or plant) opening. The &#8220;treatment&#8221; happens to the state rather than being chosen by strategic actors based on local conditions. Luke finds that nuclear closures increased state-level CO_2 emissions by 6-8%, with the gap filled almost entirely by coal rather than cleaner natural gas. Event studies showed no pre-trends and the results hold when analyzed at the regional electricity market (RTO/ISO) level to account for cross-state spillovers, meaning that the effect is real and it&#8217;s carbon-intensive. This is an important example that showcases how identification strategy matters as much as the question itself. When you can&#8217;t randomize treatment, the next best thing is finding settings where treatment is plausibly exogenous, even if that means studying the problem in reverse. </em></p></li></ul><div><hr></div><h3>Sensitivity Analysis for Treatment Effects in Difference-in-Differences Models using Riesz Representation</h3><h5>TL;DR: this paper shows how to measure how much hidden bias could affect DiD estimates when the PTA might not hold. Using the Riesz representation, the authors express the treatment effect as a weighted average and show how bias from unobserved factors links to familiar quantities. The result is a practical way to report DiD effects with clear bounds instead of relying on untestable assumptions.</h5><p><em>What is this paper about?</em></p><p>When analyzing a treatment in order to quantify a causal effect, we have to consider a lot of things. But we never really &#8220;see&#8221; the full picture. People, firms or regions decide to take up a policy for reasons we can&#8217;t fully observe, and those same reasons often affect how their outcomes change over time. In DiD we &#8220;deal&#8221; (with the best of intentions) with that by conditioning on covariates and assuming that, after controlling for them, treated and control units would have followed similar trends in the absence of treatment (the PTA). 
The problem is that this assumption of parallel trends given covariates can still fail if we&#8217;re missing something important that we can&#8217;t quantify or account for.</p><p>This paper is about measuring how much such missing information &#8220;could&#8221; matter. Instead of treating &#8220;unobserved confounders&#8221; as a black box, the authors show how to express their influence in terms of quantities we can describe and argue about. They build on a mathematical idea called the Riesz representation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, which expresses the treatment effect as an average of predicted outcome changes, each multiplied by a specific weight. Once written that way, we can study how much the estimate would move if some unobserved factors were left out of either the prediction or the weighting step.</p><p>The motivation is clear: the goal with a DiD design is to estimate the ATT, but this estimate is only as good as the assumption that the PTA holds once we condition on covariates. Since that can never be fully verified, the question becomes how to describe and measure the possible bias rather than ignore it.</p><p><em>What do the authors do?</em></p><p>They start off by rewriting the DiD effect as an average of predicted outcome changes taken with a particular set of weights. That step (which comes from the Riesz representation) makes it easier to see what happens when some parts of the data are missing. Once written that way, the estimate depends on two things: the model used to predict outcome changes and the weights that compare treated and control units. If either of these leaves out important factors, the bias can be expressed in terms of how much those missing variables would help explain the outcome or the likelihood of treatment.</p><p>This is where the paper becomes useful for applied work. It shows that the size of the bias depends on how much the unobserved variables would improve the chosen model&#8217;s fit for outcomes *and* for treatment assignment, and on how correlated those two pieces are. Each of these quantities can be linked back to measures we are familiar with, such as the partial R^2<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. With that, the authors turn &#8220;how sensitive is my DiD estimate to hidden bias?&#8221; into something measurable.</p><p>They go on to extend the same idea to staggered adoption designs, showing how to bound both point estimates and confidence intervals when some confounders are missing. Instead of producing a single number that might rest on unrealistic assumptions, the paper shows how to report an effect that comes with well-defined bounds, or how large the unobserved bias would need to be to make the result disappear/insignificant. The estimation itself follows a DML approach, where modern ML methods are used to learn both the prediction and weighting steps without having to assume a specific functional form. This keeps the method practical for high-dimensional data while preserving standard large-sample inference.</p><p><em>Why is this important?</em></p><p>A couple of reasons, both related to real-world problems and to the econometrics itself. The first one is straightforward: the PTA does half of the work in DiD, and it comes directly from the design. 
Because we never observe the treated group&#8217;s counterfactual path after treatment, we need to assume it would have followed the same trend as the control group once we condition on covariates. The issue is that this path is unobserved by construction, so we can&#8217;t ever prove that the assumption holds. That&#8217;s what makes DiD possible, but also what makes it fragile. In practice we check pre-trends, add controls, and then try to justify the identification assumptions, but there is always a gap between what we observe and what drives selection into treatment and changes in outcomes. This paper gives a way to describe that gap and put numbers on it, so we are not asked to take the key assumption on faith.</p><p>It also changes how we present results. Instead of a single estimate that depends on untestable judgment calls, you can say: here is my best estimate and here is how far it would move if unobserved factors were strong enough to improve prediction by this much, which then makes claims more honest without turning them into hand-waving. It also helps avoid overclaiming and it gives a common language for authors, referees, and seminar audiences to discuss what &#8220;plausible&#8221; unobserved bias means.</p><p>For applied work, the paper ties sensitivity to the tools we already use: pre-trend placebos, leave-one-covariate-out checks and familiar fit measures. If you have multiple pre-periods, you can calibrate scenarios from the size of the placebos. If you don&#8217;t, you can benchmark against observed covariates and ask whether any truly unobserved factor would be as powerful as the ones you already &#8220;see&#8221;. Either way, you get clear bounds for the point estimate and the confidence interval and the same logic carries over to staggered adoption settings.</p><p>There&#8217;s a policy angle too. When a result is sensitive, you can show how and by how much rather than pretending the issue isn&#8217;t there. When it&#8217;s robust, you can document that claim with numbers. In both cases the takeaway becomes easier to trust, which is the point of methods in the first place.</p><p><em>Who should care?</em></p><p>Applied economists using DiD in labour, education, health, development or policy evaluation, anyone who has ever been told by a referee to &#8220;test robustness to hidden confounders&#8221;, researchers who rely on DiD but worry about unobserved bias they can&#8217;t rule out, and methodologists interested in linking ML-based estimation with identification theory.</p><p><em>Do we have code?</em></p><p>There&#8217;s an open-source implementation in <a href="https://docs.doubleml.org/stable/index.html">DoubleML</a> for Python, with a user guide that covers this DiD sensitivity setup (2&#215;2 and staggered), examples, and the Riesz-based formulas. You can run the DML estimator, set sensitivity scenarios (from pre-trends or benchmarking) and get point and CI bounds straight from the package.</p><p>In summary, this paper focuses on the most fragile part of DiD: the fact that the PTA can&#8217;t be proved. By rewriting the effect using the Riesz representation, the authors show how to express bias from unobserved factors in terms of familiar quantities like partial R^2. The estimation uses DML, which keeps it workable when there are many covariates or flexible models. The idea is simple: instead of assuming the PTA holds, measure how far it would have to fail for the result to change. 
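</p><p>To make the &#8220;weights&#8221; language a bit less abstract, here is a rough sketch of the simplest version of the idea - the 2&#215;2 ATT written as a weighted average of outcome changes, with weights built from a propensity score (Abadie-style IPW DiD). This is my own toy code, not the authors&#8217; estimator and not the DoubleML API; all names are made up, and the leave-one-covariate-out loop is only a crude stand-in for the benchmarking exercise described above.</p><pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression

def att_ipw_did(delta_y, d, X):
    """delta_y: post-minus-pre outcome change; d: 0/1 treatment; X: covariates."""
    e = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]
    p = d.mean()
    # Riesz-style weights: +1/p for treated units, -e/((1-e)*p) for controls
    w = d / p - (1 - d) * e / ((1 - e) * p)
    return np.mean(w * delta_y)

def benchmark(delta_y, d, X):
    """How far does the estimate move if each observed covariate had been unobserved?"""
    full = att_ipw_did(delta_y, d, X)
    moves = {j: att_ipw_did(delta_y, d, np.delete(X, j, axis=1)) - full
             for j in range(X.shape[1])}
    return full, moves
</code></pre><p>If dropping an observed covariate barely moves the estimate, that is the kind of evidence the formal sensitivity bounds make precise.</p><p>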
It&#8217;s a way to be clearer about what we can learn from the data and what still depends on judgment.</p><div><hr></div><h3>Compositional Difference-in-Differences for Categorical Outcomes</h3><p><em>(<a href="https://onilboussim.github.io/">Onil</a> is a PhD candidate at PSU, he&#8217;s on the JM this year and this is his JMP! Good luck, Onil :) I also liked this paper because the tables are well-structured, the plots and diagrams are pretty and it&#8217;s well formatted. See, a better world is possible)</em></p><h5>TL;DR: this paper develops a DiD framework for categorical outcomes where shares must stay positive and sum to one. By replacing additive changes with proportional growth and redefining the PTA, it yields valid counterfactuals and interpretable effects for outcomes that are distributions rather than single values.</h5><p><em>What is this paper about?</em></p><p>One of the first things you learn in data analysis is the types of data. This is gonna be important here, as we deal with categorical vars<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. These are vars that take on labels instead of numbers (e.g., vote choice, occupation, education level) and what we often analyse are the shares of each category. The issue is that those shares form a composition: they&#8217;re all positive and must add up to one. If one category&#8217;s share increases, at least one other must decrease. We can&#8217;t handle this with our standard DiD, which relies on additive changes that can move freely in either direction. In a compositional setting, those additive differences can produce impossible results (negative shares, totals above one, and most importantly changes that have no behavioural meaning). Onil then introduces Compositional DiD (CoDiD), a framework that fixes this by working with proportional (rather than additive) changes.</p><p>The idea is to replace the PTA with a parallel growths assumption: in the absence of treatment, each category&#8217;s share would have grown (or shrunk) at the same *proportional rate* in treated and control groups. This keeps the counterfactual composition valid (shares stay positive and sum to one) and lines up with how we usually model choices across categories in econ. Onil then uses this framework to define new treatment effects that capture how the overall distribution changes and how probability mass shifts between categories<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. He extends it further to cases with staggered adoption and to a version that builds synthetic counterfactuals for compositional data (did anyone say synthetic DiD?) that works when the outcome is a distribution instead of a single number.</p><p><em>What does the author do?</em></p><p>Onil does *a lot*. He starts with the simplest possible setup (our canonical 2&#215;2 case) to lay out the idea of parallel growths and show both its economic and geometric meaning. From there, he generalises the framework to multiple periods, which is what most of us actually deal with in applied work. 
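</p><p>To fix ideas in the 2&#215;2 case, here is a toy sketch of how I read the parallel growths idea (my own construction with made-up shares, not Onil&#8217;s estimator): apply the control group&#8217;s proportional growth to the treated group&#8217;s baseline shares, then renormalize so the counterfactual is still a valid composition.</p><pre><code>import numpy as np

# made-up shares over three categories; each vector sums to one
control_pre  = np.array([0.50, 0.30, 0.20])
control_post = np.array([0.40, 0.35, 0.25])
treated_pre  = np.array([0.60, 0.25, 0.15])
treated_post = np.array([0.45, 0.30, 0.25])   # observed under treatment

growth = control_post / control_pre           # proportional growth among controls
cf = treated_pre * growth
cf = cf / cf.sum()                            # counterfactual: positive, sums to one

effect = treated_post - cf                    # reallocation of probability mass
print(np.round(cf, 3), np.round(effect, 3))   # the effects sum to (about) zero
</code></pre><p>Note that the &#8220;effects&#8221; here necessarily sum to zero: whatever one category gains, the others lose, which is exactly the constraint additive DiD ignores.</p><p>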
He then connects this new approach to familiar methods like standard DiD and Synthetic DiD, showing where they overlap and where the compositional logic changes the interpretation.</p><p>To make it concrete, the paper includes two empirical examples: one on how early voting reforms affected turnout and party vote shares and another on how the Regional Greenhouse Gas Initiative (RGGI) shifted the composition of electricity generation. Both illustrate how treating the outcome as a distribution rather than a single variable changes the estimated effect and keeps the counterfactuals within the realm of what&#8217;s possible.</p><p><em>Why is this important?</em></p><p>The core issue is how to think about the PTA when the outcome is categorical. A na&#239;ve way would be to run a standard DiD separately on each category&#8217;s raw share, predict what those shares would have been without treatment and then normalize everything so the shares add up to one. That might sound reasonable, but it isn&#8217;t. Doing this ignores how categories relate to each other (e.g., an increase in one share &#8220;mechanically&#8221; means a decrease in another) and it messes with the link to any behavioural or structural model of how people make choices. In other words, you&#8217;d be imposing linear trends on something that may not evolve linearly.</p><p>From an econometrics point of view, this causes several issues: it violates the probability constraint (shares might turn negative or exceed one), it treats each category as independent even though they&#8217;re jointly determined and it produces counterfactuals that have no theoretical justification in how choices across categories actually adjust. The bigger point is that additive DiD logic doesn&#8217;t fit categorical data. What Onil does is rebuild the entire framework so the counterfactual distribution is coherent (i.e., the total probability mass stays fixed, categories remain linked and the treatment effect can be interpreted in economic terms rather than as an artifact of normalization).</p><p><em>Who should care?</em></p><p>Beyond the ones Onil talks about, pretty much anyone studying outcomes where the composition is what you&#8217;re interested in. Migration (domestic, international, return), industry shifts (manufacturing, services, tech), language use (home language shares), transportation modes (car, public transport, walking) or household spending patterns (food, housing, leisure) all fit this setup. Development work often tracks how populations move across employment types, informal versus formal work, or agricultural crops. If you can think of a pie chart when modeling the var at stake, you should consider CoDiD.</p><p><em>Do we have code?</em></p><p>No, and I tried to think of a way to code it but couldn&#8217;t (#skillissue). If code is released later, I&#8217;ll link it here.</p><p>In summary, CoDiD provides a framework for analysing categorical outcomes within a coherent probabilistic structure. It replaces additive comparisons with proportional growth, ensuring that counterfactuals remain consistent with the underlying composition and that estimated effects reflect economically meaningful reallocation across categories. 
Onil&#8217;s contribution is both conceptual and practical: it formalises how to conduct DiD when the object of interest is a distribution, preserving internal consistency and interpretability across empirical settings.</p><div><hr></div><h3>Efficient nonparametric estimation with difference-in-differences in the presence of network dependence and interference</h3><p><em>(<a href="mailto:jetsupphasuk@unc.edu">Michael</a> is a Ph.D. Candidate in UNC&#8217;s Department of Biostats, he&#8217;s on the JM this year and this is his JMP! Good luck!)</em></p><h5>TL;DR: this paper builds a DiD framework for settings where one unit&#8217;s treatment affects others through networks. It replaces the usual no-interference assumption with a setup that measures both direct and spillover effects, using an estimator that stays reliable even when some parts of the model are wrong.</h5><p><em>What is this paper about?</em></p><p>In lots of DiD setups, units are assumed to be &#8220;isolated&#8221; (no contamination, SUTVA anyone?): one unit&#8217;s treatment doesn&#8217;t affect another&#8217;s outcome. That assumption is broken the moment there&#8217;s a network (e.g., people, firms and/or regions connected in ways that let effects spill over). When a factory installs a &#8220;pollution scrubber&#8221;, nearby counties benefit too. When a vaccination campaign starts in one area, infection risk falls for neighbours as well. These are classic cases of interference: what happens to you depends not only on your own treatment but also on others&#8217;.</p><p>This paper deals with that. The authors build a Difference-in-Differences framework that can handle both network dependence (correlated outcomes across connected units) and interference (treatment spillovers). Instead of assuming independent, neatly separated units, they let each unit&#8217;s exposure depend on its *position* in the network and on how treatment spreads through its connections.</p><p>They then show how to estimate average treatment effects in this environment efficiently and without bias, even when nuisance functions like treatment probabilities or outcome models are learned flexibly with ML. This gives us a doubly robust, semiparametric estimator that remains valid under complex dependence structures (a kind of &#8220;network-aware&#8221; DiD that accounts for who is connected to whom and how that matters for identification and inference).</p><p><em>What do the authors do?</em></p><p>They start by defining what &#8220;treatment&#8221; means when spillovers exist. Each unit isn&#8217;t just treated or untreated; it also has <em>exposure</em> through its neighbours. The authors formalise this by introducing exposure mappings, which describe how each unit&#8217;s treatment and its neighbours&#8217; treatments combine to determine potential outcomes.</p><p>From there, they extend the DiD framework to this network setting. They define a conditional parallel trends assumption: in the absence of treatment, both directly treated and indirectly exposed units would have followed the same <em>expected trend</em>, conditional on their covariates and network position. That assumption plays the same role as PTA in standard DiD but now accounts for dependence across connected units.</p><p>They then build a doubly robust + semiparametric estimator for the average treatment effect under interference. 
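</p><p>As a toy illustration of what an exposure mapping can look like (my own construction, not the authors&#8217; specification): take an adjacency matrix and let each unit&#8217;s exposure be the share of its neighbours that are treated.</p><pre><code>import numpy as np

A = np.array([[0, 1, 1, 0],        # who is connected to whom
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
d = np.array([1, 0, 0, 1])         # own treatment status

degree = A.sum(axis=1)
exposure = (A @ d) / np.maximum(degree, 1)   # share of treated neighbours

print(d, np.round(exposure, 2))    # units then fall into cells like (treated, exposed)
</code></pre><p>The estimator itself is then layered on top of an exposure definition like this. 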
It combines an outcome model (predicting changes) with a treatment and exposure model (predicting who gets treated and how that treatment propagates through the network). Either model can be misspecified, but as long as one is correct, the estimator remains consistent. The authors also derive its asymptotic efficiency bound and show that the estimator achieves it, even when using flexible ML methods to estimate the nuisance functions (we talked all about these terms before).</p><p>Finally, they run simulations to check performance under &#8220;realistic&#8221; (IRL) network structures and apply the method to US county-level data, studying how the adoption of scrubbers in coal power plants affected cardiovascular mortality, both in treated counties and in neighbouring ones that received indirect benefits.</p><p><em>Why does this matter?</em></p><p>We don&#8217;t exist in isolation; we are all part of a network. Factories share &#8220;air sheds&#8221;, counties share labour markets and hospitals, schools share catchment areas, firms share suppliers. In these settings, treatment in one place changes exposure in nearby places. If we ignore that structure, the DiD contrast can mix up direct effects with spillovers and the policy story becomes blurred (and incredibly difficult to justify).</p><p>The paper gives a way to define and estimate effects that respect our reality. By writing outcomes in terms of own treatment and exposure through neighbours, and by stating a parallel-trends condition that conditions on network position, we get an estimand that matches the question policymakers ask: what changed for treated units, and what changed for those connected to them?</p><p>There is also a precision and credibility gain. The estimator is doubly robust (either the outcome model or the treatment/exposure model can be off and consistency is kept) and it reaches the efficiency bound while letting nuisance pieces be learned with modern ML. That combination is super important when treatment is rare, networks are sparse or irregular and/or the signal is relatively small.</p><p>Finally, it changes how we report results. Instead of one average with vague caveats about &#8220;contamination,&#8221; you can report direct and spillover effects with valid inference under dependence. For environmental regulation, vaccination campaigns, transport investments or school reforms, that is the difference between a credible evaluation and one that misses how benefits spread across the network.</p><p><em>Who should care?</em></p><p>Environmental and health economists studying pollution or disease spread across regions, labour and urban economists dealing with commuting zones and shared markets, education researchers looking at peer or school-network effects, and development economists evaluating geographically clustered programmes. It also matters for applied researchers who suspect &#8220;contamination&#8221; but don&#8217;t want to drop affected observations or pretend it away. If treatments diffuse through networks (physical, social or economic), this framework gives a way to formalise that dependence and still estimate interpretable causal effects. And for methodologists, it extends the efficiency theory of DiD to a frontier problem: identification and inference under interference.<em><br>Do we have code?</em></p><p>No, the paper does not appear to provide a public code repository or package for the estimator itself. 
The authors mention using other R packages in their application (such as <code>disperseR</code> to calculate the interference matrices and various ML packages like BART and HAL for the nuisance functions) but they do not link to their own code that implements the proposed network-based doubly robust DiD estimator. I will update the post if anything changes.</p><p>In summary, this paper extends our DiD to the networked world. It formalises what &#8220;treatment&#8221; means when outcomes depend on neighbours&#8217; exposure, defines a conditional PTA that accounts for those links and derives an efficient, DR estimator for both direct and spillover effects. This results in a DiD framework that stays valid when SUTVA is frail. It treats interference as part of the design rather than a violation, which lets us measure how policies spread through connected units instead of assuming an isolation that rarely exists.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Some of the papers in this newsletter can be summarized as &#8220;econometricians writing papers using ML methods based on obscure Maths from the early 1900s&#8221;. The Riesz representation is a concept from Maths that says you can always rewrite a linear estimate (e.g., a treatment effect) as an average of predictions with the &#8220;right&#8221; weights.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Partial R^2 measures how much additional variation a variable explains once other covariates are already included in the model.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Not all outcomes behave like &#8220;heights&#8221; or &#8220;incomes&#8221;. Some are categories. If we&#8217;re looking at vote choice, we don&#8217;t measure a number for each person; we record a label (Democrat, Republican, Independent). At the group level we talk about shares across those labels. Those shares are probabilities, they&#8217;re all non-negative and they must add up to one. Think of a fixed pie cut into slices: make one slice bigger and at least one other slice must get smaller. That constraint is always there. There&#8217;s also a difference between categories and ordered categories. Letter grades are ordered (A above B above C), but the jump from B to A is not the same thing as the jump from C to B. If we map A=3, B=2, C=1, we&#8217;re pretending the steps are evenly spaced when they aren&#8217;t. That can push a linear DiD to say things that don&#8217;t make sense for the underlying scale. Standard DiD is built for additive, nearly continuous outcomes where plus/minus changes behave well. With categorical or compositional outcomes, additive changes can mess with the basic rules (negative &#8220;probabilities&#8221;, totals above one) or hide the fact that gains in one category must come from losses elsewhere. The fix is to work with the whole distribution and with changes that respect the &#8220;pie-chart constraint&#8221;. 
That&#8217;s the problem this paper takes on: how to do DiD when the outcome is a set of category shares rather than a single continuous number.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>As Onil says, his framework is particularly suited to settings with discrete, unordered outcomes (e.g., employment status, voting choice, other health categories) where the policy question goes beyond an average estimate, where the entire composition of categories was reshaped. We can think of labour-market reforms that shift people between employment, unemployment and out-of-labour-force states; or health interventions that move patients across diagnostic categories rather than changing a single health index. Even education and migration policies often work this way: they change who ends up where. In all these cases, what matters is the redistribution of probability mass across categories, i.e., how treatment changes the mix of outcomes beyond their mean. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Missing and Messy]]></title><description><![CDATA[A DiD horror story?]]></description><link>https://www.diddigest.xyz/p/missing-and-messy</link><guid isPermaLink="false">https://www.diddigest.xyz/p/missing-and-messy</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Wed, 08 Oct 2025 11:55:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-7yH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-7yH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-7yH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-7yH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!-7yH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!-7yH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F984fdc28-9877-4b11-90a7-9c461b3db8b8_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hello! We have three new papers to cover :)</p><p>Here they are:</p><ol><li><p><a href="https://arxiv.org/pdf/2509.25009">Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random</a>, by Lorenzo Testa, Edward H. 
Kennedy and Matthew Reimherr</p></li><li><p><a href="https://arxiv.org/pdf/2508.21536">Triply Robust Panel Estimators</a>, by Susan Athey, Guido Imbens, Zhaonan Qu and  Davide Viviano</p></li><li><p><a href="https://arxiv.org/abs/2509.01829">Cohort-Anchored Robust Inference for Event-Study with Staggered Adoption</a>, by Ziyi Liu</p></li></ol><div><hr></div><p>Before we jump to those papers, I will recommend some DiD applied papers I found (actually went looking for them) online. </p><p><strong>&#8594; In the next post I want to focus on DiD papers by </strong><em><strong>JOB MARKET CANDIDATES</strong></em><strong>, both theory and applied. If you know of someone&#8217;s paper (or you wrote one yourself), please let me know (send me an email at b.gietner@gmail.com) :) &#8592; </strong></p><ul><li><p>Let&#8217;s begin with these ones, but there are many more on NBER:<br><a href="https://www.nber.org/papers/w34304">Maternity Leave Extensions and Gender Gaps: Evidence from an Online Job Platform</a>, by Hanming Fang, Jiayin Hu and Miao Yu (DDD, China, unintended consequences of maternity leave extension on gender gaps in the labour market)</p></li><li><p><a href="https://www.nber.org/papers/w34303">Price and Volume Divergence in China&#8217;s Real Estate Markets: The Role of Local Governments</a>, by Jeffery (Jinfan) Chang, Yuheng Wang, and Wei Xiong (dynamic DiD, China, divergence between price and volume in residential land and new housing transactions across Chinese cities during the Covid-19 pandemic)</p></li><li><p><a href="https://www.nber.org/papers/w34270">Moving for Good: Educational Gains from Leaving Violence Behind</a>, by Mar&#237;a Padilla-Romo and Cecilia Peluffo (stacked DiD, Mexico, effects of moving away from violent environments into safer areas on migrants&#8217; academic achievement)</p></li></ul><div><hr></div><h3>Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random</h3><p><em>(There&#8217;s a considerable amount of stats lingo in this paper, which I tried to explain in the footnotes, but would recommend having this <a href="https://www.stat.berkeley.edu/~stark/SticiGui/Text/gloss.htm">Glossary</a> saved somewhere. Would appreciate if someone could provide a &#8220;Stats lingo for economists&#8221; *wink*)</em></p><h5>TL;DR: when outcomes are missing in panel data, simply discarding those cases can result in a biased sample. This paper sets out a framework for DiD under missing-at-random assumptions and develops estimators that achieve efficiency while remaining robust to some model misspecification.</h5><p><em>What is this paper about?</em></p><p>This paper speaks to all of us who ever had to drop observations in longitudinal/panel data because of incomplete information on covariates and/or attrition<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. This practice, in one way or another, introduces selection bias<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. How, then, should we conduct DiD analysis when pre-treatment outcome data are missing for a subset of the sample? This paper provides a formal framework for identifying and efficiently estimating the ATT when pre-treatment outcomes are missing at random (MAR)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
In the Appendix, they also extend it to two scenarios: one where all pre-treatment outcomes are observed, but post-treatment outcomes can be MAR, and the other in which outcomes can be missing both before and after treatment.</p><p><em>What do the authors do?</em></p><p>They start with setting up the problem by assuming a 2x2 DiD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, then provide sets of assumptions for identifiability (including two different sets of assumptions for the missingness mechanism in the pretreatment outcome<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, one where baseline outcomes are missing conditional on covariates and treatment, and another where missingness can also depend on the post-treatment outcome) and semiparametric efficiency bounds<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, which set the benchmark for the lowest possible estimation variance.</p><p>Building on this, they then construct estimators based on influence functions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> and cross-fitting<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. These estimators achieve the efficiency bound and are multiply robust, which means consistency is guaranteed if *at least one* of several combinations of nuisance functions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> is correctly specified. They also develop a way to estimate the &#8220;nested regression&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> needed under the more complex MAR assumption using regression or conditional density methods and show that augmented approaches can recover efficiency even with misspecification.</p><p>Finally, they validate their estimators through a very extensive simulation study. By systematically varying whether nuisance models are correctly or incorrectly specified, they demonstrate that the estimators perform as theory predicts: bias and RMSE<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> are negligible when the robustness conditions are met, and performance deteriorates only when the assumptions are violated.</p><p><em>Why is this important?</em></p><p>Dropping observations/units with missing baseline outcomes is common practice, but it introduces selection bias and can distort treatment effect estimates. This paper shows how to formally identify and estimate the ATT even when pre-treatment or post-treatment outcomes are missing at random. By deriving efficiency bounds, the authors also tell us how well any estimator could possibly do. Their proposed estimators are not only efficient but also multiply robust, meaning they remain consistent as long as certain subsets of nuisance models are specified correctly. 
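</p><p>To make the cross-fitting and influence-function machinery a bit more concrete, here is a minimal sketch (my own, not the paper&#8217;s code) of the generic doubly robust DiD recipe these estimators build on, in the complete-data case. The paper&#8217;s actual contribution layers the missingness weights and the nested regression on top of this, which I am leaving out.</p><pre><code># A generic cross-fitted doubly robust DiD sketch for the ATT (no missing data).
# This is scaffolding for intuition, not the authors' estimator.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def dr_did_att(y_pre, y_post, d, x, n_folds=5, seed=0):
    """Cross-fitted doubly robust DiD estimate of the ATT."""
    dy = y_post - y_pre
    psi = np.zeros_like(dy)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(x):
        # nuisance 1: propensity score P(D=1 | X), fit on the training folds only
        ps_model = GradientBoostingClassifier().fit(x[train], d[train])
        ps = np.clip(ps_model.predict_proba(x[test])[:, 1], 0.01, 0.99)
        # nuisance 2: outcome-change regression E[Y_post - Y_pre | X, D=0]
        ctrl = train[d[train] == 0]
        m = GradientBoostingRegressor().fit(x[ctrl], dy[ctrl]).predict(x[test])
        # doubly robust score: either nuisance can be wrong, as long as one is right
        w_treated = d[test]
        w_control = (1 - d[test]) * ps / (1 - ps)
        psi[test] = (w_treated - w_control) * (dy[test] - m)
    return psi.mean() / d.mean()

# Toy data just to show the call signature (true ATT is 1)
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 3))
d = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))
y_pre = x[:, 0] + rng.normal(size=n)
y_post = y_pre + 1.0 * d + 0.5 * x[:, 1] + rng.normal(size=n)
print(round(dr_did_att(y_pre, y_post, d, x), 2))
</code></pre><p>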
This kind of multiple robustness is a strong safeguard for applied work, where model misspecification is almost inevitable.</p><p><em>Who should care?</em></p><p>Anyone using DiD with survey or administrative data where attrition, item nonresponse or incomplete records are common, which includes applied researchers in labour economics, education, and health, as well as methodologists developing new DiD estimators.</p><p><em>Do we have code?</em></p><p>The paper itself does not release replication code, but the estimators are influence-function based and can be implemented with standard tools for cross-fitting and doubly robust estimation in R or Python. The design closely parallels DR-Learner and targeted ML routines, so adapting existing code should be straightforward. The Monte Carlo simulations in the paper provide guidance on implementation.</p><p>In summary, the authors extend the DiD framework to situations where pre- or post-treatment outcomes are not fully observed. They derive efficiency bounds that show the best precision researchers can hope for and propose estimators that reach these bounds while retaining robustness if certain models are misspecified. This gives applied researchers a principled way to handle incomplete outcome data, a problem that arises frequently in labour, education and health applications.</p><div><hr></div><h3>Triply Robust Panel Estimators</h3><p><em>(This one is also about panel data, but you should remember some concepts from linear algebra and matrix theory</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a><em>)</em></p><h5>TL;DR: when evaluating policies with panel data, choosing the wrong estimation method can completely change your answer, but there&#8217;s no way to test which method is right. This paper introduces TROP, an estimator that protects you from choosing wrong by combining unit matching, time weighting and outcome modeling, so if any one approach is valid, you get the right answer. It consistently outperforms traditional methods across diverse real-world settings.</h5><p><em>What is this paper about?</em></p><p>The core problem in causal panel data estimation is estimating what would have happened (the counterfactual) to a treated unit had it not been treated, given that unobserved factors likely drive both the outcome and the treatment decision. The challenge with panel data and a binary intervention is not necessarily the lack of causal inference methods but more like the abundance of them. There&#8217;s a myriad to choose from: TWFE DiD, SC, MC, or a hybrid like SDiD, and each comes with its own identifying story (PT, factor models, or pre-treatment fit). None of these assumptions can be tested against one another, so choosing between them often feels arbitrary.</p><p>Even when you settle on a method, the weighting &#8220;rules&#8221; raise problems. DiD spreads weight evenly across all periods and units. SC fixes a single set of unit weights and applies them to every post-treatment period. Both approaches ignore that some pre-treatment periods are more predictive than others. Also, SC itself is not built for modern settings where multiple units receive treatment at different times or where treatment can switch on and off. Its balancing logic does not extend naturally to these cases.
Messy landscape much?</p><p>The paper steps in with TROP (Triply RObust Panel), an estimator designed to work across these situations rather than being tied to one fragile set of assumptions. Think of it as a general, unifying framework of which all the others are &#8220;special cases&#8221;.</p><p><em>What do the authors do?</em></p><p>The key idea of their estimator is to unify existing approaches rather than picking one. TROP builds counterfactuals for treated units by combining three things:</p><ol><li><p>A flexible outcome model &#8211; a low-rank factor structure<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> layered on top of unit and time fixed effects &#8594; this captures broad common trends and latent dynamics that DiD alone would miss.</p></li><li><p>Unit weights &#8211; like in SC, treated units are matched with controls that had similar pre-treatment trajectories, but the weights are &#8220;learned&#8221; from the data and can vary.</p></li><li><p>Time weights &#8211; unlike DiD or SC, TROP doesn&#8217;t treat all pre-periods as equally useful. It can down-weight very old data and put more weight on periods closer to treatment.</p></li></ol><p>All three components are tuned jointly using cross-validation, so the estimator learns from the data which combination best predicts untreated outcomes. Because of this structure, TROP nests existing methods: if you shut down the factors, or set all weights equal, you recover DiD, SC, SDiD, or MC as special cases.</p><p>More importantly, this combination gives TROP its superior theoretical guarantee: the property of triple robustness. TROP is asymptotically unbiased<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> if any one of three conditions holds: a) perfect balance over unit loadings, b) perfect balance over time factors, or c) correct specification of the regression adjustment. This structure means the estimator&#8217;s final bias is tightly bounded by the product of the errors in these three components, making it more robust than any existing doubly-robust estimator.</p><p>The authors then put TROP through a series of semi-synthetic simulations calibrated to classic datasets (minimum wage, Basque GDP, German reunification, smoking, boatlift). Across 21 designs, TROP outperforms existing methods in 20 of them. In the one exception (the Boatlift treated unit case), standard SC performs better by 24%. They also show why: the ability to combine unit weights, time weights and factor modeling makes it more robust when PT fail, when pre-treatment fit is imperfect or when the assignment is more complex.</p><p><em>Why is this important?</em></p><p>Those of us doing applied work are often stuck with panel data that looks nothing like the textbook case, e.g., interventions don&#8217;t arrive all at once, different units get treated at different times, and PT are hard to justify. In those settings, the choice of estimator can leave us confused, and this conundrum is laid out in the paper. The authors show that DiD might work well in one dataset and fail miserably in another, while SC can look great in some cases and terrible in others. There&#8217;s no single safe bet.</p><p>This is exactly the situation policy evaluations run into. A government wants to know whether to keep, expand, or cut back a programme.
The answer depends on estimates that are only as good as the assumptions behind them. If you can&#8217;t be sure whether PT hold, or whether a factor model really captures the dynamics, you&#8217;re taking a gamble.</p><p>Here&#8217;s why TROP is so useful: it doesn&#8217;t force you to pick one story. It combines unit weights, time weights and an outcome model in a way that means if one assumption fails but another holds, the estimator still works. For applied economists, that triple robustness acts as a safeguard when the data don&#8217;t really fit into any single framework<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>.</p><p><em>Who should care?</em></p><p>Applied econometricians and empirical researchers should care most, as TROP directly addresses the daily challenges with data complexity. More specifically, anyone working with panel data (for estimating treatment effects, e.g., policy impact, firm-level interventions) with staggered treatment timing or multiple treated units, where traditional DiD doesn&#8217;t work very well. Researchers who have to deal with interactive confounding should care too, since TROP provides a defense against the interactive fixed effects (unobserved trends that affect units heterogeneously) that the authors find are both common in real data and fatal to the simpler DiD model. Anyone involved in high-stakes policy analysis should care (but I don&#8217;t believe this paper will reach them) since TROP &#8220;protects&#8221; against uncertain assumptions. Also data scientists, since they often work with large panel or time-series cross-section data in tech, finance or marketing.</p><p><em>Do we have code?</em></p><p>The paper provides detailed algorithms (Algorithm 1, 2, 3) with step-by-step procedures. While there may not be a published software package, the algorithms are complete enough that implementation is straightforward.</p><p>In summary, TROP is best thought of as a unifying method for causal inference with panel data. Instead of choosing between DiD, SC, or MC, it combines their core ideas: balance units, weight time and model outcomes flexibly. That design gives it a &#8220;triple robustness&#8221;, meaning that if any one of these channels is valid, the estimator delivers. The simulations make clear how unstable results can be if you commit to a single method. For applied economists, the value of TROP is that it consistently performs well across very different settings. It&#8217;s a practical safeguard when you face uncertainty about which assumptions your data can plausibly support.</p><div><hr></div><h3>Cohort-Anchored Robust Inference for Event-Study with Staggered Adoption</h3><p><em>(<a href="https://x.com/liuziyi233">Ziyi</a> is a third-year PhD student at UC Berkeley, Haas School of Business. Keep an eye out for him! This paper is quite long but it&#8217;s well-written and Ziyi explains the questions really well)</em></p><h5>TL;DR: Ziyi introduces a cohort-anchored framework for robust inference in staggered DiD event-studies, using block bias to replace unreliable aggregated pre-trend checks and yielding confidence sets that stay valid, and are often more informative, when cohorts differ.</h5><p><em>What is this paper about?</em><br>Event studies with staggered adoption lean on the PTA. Since PT cannot be tested directly, we usually eyeball the pre-trend plot and/or test whether pre-treatment coefficients are jointly zero.
But these approaches are quite limited: the tests have low power, and conditioning on them can distort inference. <a href="https://academic.oup.com/restud/article-pdf/90/5/2555/51356029/rdad018.pdf">Rambachan and Roth (2023)</a> proposed a more robust approach: rather than treating PT as all-or-nothing, use the observed pre-trend to bound how large the post-treatment bias could plausibly be. This delivers confidence sets that remain valid even when PT doesn&#8217;t hold &#8220;exactly&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>.</p><p>That works fine when treatment happens at one point in time. With staggered adoption, things get messier because inference usually relies on event-study coefficients aggregated across cohorts. This creates three issues: 1) relative-period coefficients from TWFE can be contaminated by weighted averages of heterogeneous effects, sometimes with negative weights<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>; 2) cohort composition shifts over relative time, so pre- and post-trends are not directly comparable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>; and 3) for estimators that use not-yet-treated controls, the control group itself changes as cohorts adopt<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>.</p><p>Let me illustrate the issue. Let&#8217;s imagine two states raising the minimum wage, but at different times. State A does it in 2005, State B waits until 2007. A standard event-study lines them up by &#8220;time since adoption&#8221; and averages the results. That looks ok on a plot, but it hides a problem: the early pre-period is based only on State B, while the long post-period is based only on State A. If A and B had different pre-trends, then the &#8220;average pre-trend&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a> you&#8217;re using as a benchmark has nothing to do with the post-treatment cohorts you care about.</p><p>This is where Ziyi&#8217;s paper comes in. Instead of averaging across everyone, it anchors inference at the cohort level<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a>. Each group of adopters is always compared to the same set of controls (here defined as the units that were untreated when that cohort first adopted)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a>. Ziyi calls the resulting comparison a block<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a>, and the difference in trends between the cohort and its fixed control group is the block bias.
Because the block bias has the same definition before and after treatment, the pre-period provides a credible benchmark<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a> for what could happen post-treatment (meaning it has a consistent interpretation across all periods)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a>. The paper then shows how to use this setup to build robust inference<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a> in event-studies even when treatment timing is staggered and cohorts look different from each other.</p><p><em>What does the author do?</em></p><p>Ziyi does a lot. As I said, he develops a cohort-anchored robust inference framework that operates at the cohort&#8211;period level. His framework is designed to address the three aforementioned methodological complications that arise under staggered adoption: negative weighting from TWFE event-studies, shifting cohort composition across relative periods and changing control groups in estimators that use not-yet-treated units.</p><p>The central building block is the block bias (&#916;), which is the difference in trends between a treated cohort and its fixed initial control group. Because the block bias is defined consistently before and after treatment, the observed pre-treatment block biases provide a valid benchmark for bounding the unobserved post-treatment ones. The author then establishes a bias decomposition. The overall estimation bias (&#948;) contaminating any cohort-period treatment effect estimate can be written as an invertible linear transformation of block biases &#948;=W&#916;. Here, each cohort&#8217;s overall bias equals its own block bias plus a weighted sum of block biases from cohorts that adopt later, with the weights determined by relative cohort sizes and adoption timing. This lets us impose transparent restrictions directly on the interpretable block biases and then translate those restrictions into bounds on the overall bias. The result is a valid confidence set for the treatment effect.</p><p>He then implements two specific types of restrictions (adapted from Rambachan and Roth, 2023): relative magnitudes (RM), which bounds how much a cohort&#8217;s block bias can change from one period to the next, with the benchmark coming from pre-treatment variation; and second differences (SD), which bounds the change in the slope of the block bias path (it&#8217;s well-suited for settings where pre-trends look approximately linear). These restrictions can be applied globally (using the largest observed pre-trend variation across cohorts) or cohort-specifically (using each cohort&#8217;s own pre-trends).</p><p>His framework is illustrated in two simulation exercises with heterogeneous pre-trends. Compared to the aggregated approach, the cohort-anchored method delivers confidence sets that are better centered on the truth and, in some cases, narrower. He then finishes by revisiting Callaway and Sant&#8217;Anna&#8217;s (2021) study of minimum wages and teen employment. Under the cohort-anchored framework, the confidence set remains centered well below zero even after accounting for cohort-specific linear pre-trends (which means the negative employment effect is robust).
The aggregated approach, by contrast, centers around zero and kinda obscures this conclusion.</p><p><em>Why is this important?</em></p><p>Ziyi&#8217;s paper addresses the issues left unresolved by the major innovations in the DiD literature between 2018 and 2021. Traditional TWFE estimators are invalid under heterogeneous treatment effects. New HTE-robust estimators were developed to fix this, but they introduced complications for the robust inference framework of Rambachan and Roth (2023), which aggregates results across cohorts. The key problems are dynamic treated composition (the average pre-trend is irrelevant for the cohorts driving the post-trends) and dynamic control groups (the definition of parallel trends shifts as adoption unfolds).</p><p>By introducing the concept of block bias, Ziyi provides a coherent way to conduct robust inference in this setting. Block bias solves the dynamic control group problem by anchoring comparisons to a fixed baseline, and the cohort-anchored framework demonstrates that the aggregated approach can distort results when cohorts have heterogeneous pre-trends. The cohort-anchored method instead produces confidence sets that are better centered on the true effect.</p><p><em>Who should care?</em></p><p>Anyone running event-studies with staggered adoption, which includes researchers studying policies that roll out at different times across states, firms or schools, as well as applied economists working with treatment programs that expand gradually. If you rely on modern HTE-robust estimators and want to report confidence sets that remain valid without assuming exact PT, check this paper out. It is especially relevant when pre-trends differ across cohorts since the standard aggregated approach can distort inference in that case.</p><p><em>Do we have code?</em></p><p>Ziyi told me he&#8217;s working on the package that accompanies the paper, and that he hopes he can make it public soon.</p><p>In summary, this paper pushes robust inference in DiD one step further. Rambachan and Roth (2023) showed how to move beyond binary PT by using pre-trends to bound post-treatment violations, but their framework was limited when treatment timing was staggered. Ziyi builds on that insight by introducing the concept of block bias and anchoring inference at the cohort&#8211;period level. The result is a framework that works with modern HTE-robust estimators, avoids the distortions of aggregated event-studies and produces confidence sets that remain valid even when cohorts look very different from each other. 
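</p><p>To see the mechanics of the &#948;=W&#916; decomposition in the simplest possible case, here is a toy two-cohort example (all numbers are invented; in the paper the weights come from relative cohort sizes and adoption timing):</p><pre><code># Toy numeric illustration of delta = W * Delta with two cohorts (made-up weights).
# The early cohort's overall bias mixes in the later cohort's block bias,
# while the late cohort's overall bias is just its own block bias.
import numpy as np

w = 0.4                                    # illustrative mixing weight, not from the paper
W = np.array([[1.0, w],                    # early cohort
              [0.0, 1.0]])                 # late cohort
Delta = np.array([0.2, -0.1])              # hypothetical block biases
delta = W @ Delta                          # overall biases contaminating the estimates
print(delta)                               # roughly [0.16, -0.10]

# Restricting each block bias to magnitude at most M translates, via W,
# into bounds on the overall bias that feed the robust confidence set
M = 0.3
worst_case = np.abs(W) @ np.full(2, M)
print(worst_case)                          # roughly [0.42, 0.30]
</code></pre><p>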
The simulations and the minimum wage reanalysis show clearly that this matters: once you anchor inference properly, robust negative employment effects emerge that the standard approach washes out.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For an interesting discussion on this, see <a href="https://www.sciencedirect.com/science/article/pii/S0304407624001295">Bell&#233;go, Benatia and Dortet-Bernadet</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>As the authors point out, &#8220;if the mechanism that causes the data to be missing is related to treatment or other characteristics that influence the outcome, the remaining sample is no longer representative of the population of interest and the resulting estimates will be biased&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Ok, some house cleaning first. This classification was first proposed by <a href="https://academic.oup.com/biomet/article-pdf/63/3/581/756166/63-3-581.pdf">Rubin in 1976</a>. He formalized the three categories (MCAR, MAR, and MNAR) and provided the statistical framework for understanding how different missing data mechanisms affect inference. This work built on earlier ideas but Rubin really crystallized the taxonomy and its implications.</p><ol><li><p>Missing Completely At Random (MCAR): the missingness here has no relationship to any variables in the dataset (observed or unobserved, e.g., a survey page that randomly fails to load for some users due to some technical glitch). It is the least problematic type because the missing data is essentially a random sample of all data.</p></li><li><p>Missing At Random (MAR): here the missingness is related to observed variables, but not to the missing values themselves (e.g., younger people might be less likely to report their income, but among people of the same age, whether income is missing is random). It is &#8220;random&#8221; CONDITIONAL on other observed variables. Not so problematic seeing that most statistical methods can handle MAR if you account for the related variables.</p></li><li><p>Missing Not At Random (MNAR): the missingness is related to the unobserved (missing) values themselves (e.g., people with very high incomes are less likely to report their income specifically because their income is high). MNAR is the most problematic type because the missing data mechanism is related to what you&#8217;re trying to measure, and thus requires special modeling approaches or sensitivity analyses. </p></li></ol></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>&#8220;We want to emphasize that our framework differs from both the balanced panel data and the repeated cross-section data analyzed by <a href="https://www.sciencedirect.com/science/article/pii/S0304407620301901">Sant&#8217;Anna and Zhao (2020)</a>. 
The former postulates that each sample is observed both before and after treatment: the latter assumes that each sample is observed either before or after treatment. Instead, our setup mirrors the so-called unbalanced, or partially missing, panel data framework, where the outcomes of some samples are observed both before and after the treatment, while some other samples miss pre-treatment outcomes.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>&#8220;These assumptions are novel to our setting and are equivalent to a MAR missingness design. Notice that we let the missingness pattern depend on *both* covariates X and the treatment A, and eventually on the post-treatment outcome. In other words, we admit the possibility that pre-treatment outcomes can be missing due to covariates, the treatment that will be administered, and the post-treatment outcome values.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The (theoretical) lowest possible variance that an estimator can achieve in a given model. It&#8217;s a benchmark to judge whether an estimator is statistically optimal. If your estimator hits this bound, you cannot do better in large samples.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>A mathematical way of describing how each individual observation affects an estimator. Think of it as a linear &#8220;approximation&#8221; that shows the sensitivity of your estimate to each data point. We don&#8217;t need the formula, just the idea that it underpins efficiency and robustness proofs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>A technique where the sample is split into folds (subsets of the data, e.g. 1,000 observations split into 5 folds gives 5 groups of 200). Nuisance functions (like propensity scores) are estimated on some folds and then evaluated on a different fold. Each observation is used for training and evaluation, but never at the same time. This rotation avoids overfitting and preserves efficiency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>These are not the main parameter of interest (the ATT), but rather functions that must be estimated from the data to construct the efficient estimator and guarantee its desirable properties. 
The primary nuisance functions in the paper are:</p><ol><li><p>Regression Functions (&#956;&#8727;): Models for the expected values of the outcomes conditional on other variables (e.g., the expected post-treatment outcome for the control group, E[Y1&#8203;&#8739;X,A=0]).</p></li><li><p>Propensity Score (&#960;&#8727;): The probability of a unit being in the treatment group conditional on their covariates (P[A=1&#8739;X]).</p></li><li><p>Missingness Probability (&#947;&#8727;): The probability of an outcome being observed (not missing) conditional on other observed variables.</p></li><li><p>Nested Regression (&#951;&#8727;): A specialized conditional expectation needed under the most complex missing-at-random assumption.</p></li></ol><p>Correctly modeling just a subset of these functions is what gives the estimators their multiple robustness property.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Nested Regression (&#951;0&#8727;&#8203;(X,0)): it&#8217;s a specific and complex type of conditional expectation. It&#8217;s called &#8220;nested&#8221; because it involves taking an expected value of an expected value. It&#8217;s needed for the efficient estimator when missingness depends on the post-treatment outcome (Assumption 2.4).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Root Mean Squared Error, the standard measure combining bias and variance into a single metric. Lower RMSE means better estimator performance.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>The key person bridging advanced matrix theory (specifically the low-rank assumption and PCA) into modern econometrics and setting the stage for its application in both macro (dynamic factor models) and micro (interactive fixed effects) panel settings is Jushan Bai (who&#8217;s now at Columbia). While the foundation for factor models was laid in macro, the move to rigorous estimation of coefficients in panel data - which is the context of this paper - is his central contribution.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>The low-rank factor structure is a statistical modeling assumption which says that the observed outcomes of many units over time can be approximated by a small number of unobserved, common factors or trends. It&#8217;s used to simplify and understand complex, high-dimensional data, such as the panel data (many units observed over many time periods) studied in the paper.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>An estimator is asymptotically unbiased if, as the sample size grows infinitely large, the expected value of the estimator converges to the true value of the parameter it is trying to estimate. 
Unbiased means the estimator&#8217;s average guess is exactly the true value, regardless of sample size, and asymptotically means it might be biased in small samples, but that bias completely vanishes as you get more and more data (i.e., as the number of units N and/or the number of time periods T goes to infinity).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>However&#8230; researchers should still verify performance in their specific context, as the one exception in the simulations shows that no single method dominates universally.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Or, in the authors&#8217; words, this allows the average treatment effect to be set-identified, and one can construct a confidence set for it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Also known as heterogeneous treatment effect contamination. This problem can be addressed by HTE-robust estimators (e.g., <a href="https://www.sciencedirect.com/science/article/pii/S030440762030378X">Sun and Abraham, 2021</a>; <a href="https://academic.oup.com/restud/article-pdf/91/6/3253/60441633/rdae007.pdf">Borusyak et al., 2024</a>), which estimate cohort-period level effects. <a href="https://www.aeaweb.org/conference/2024/program/paper/YebdZQ3S">Borusyak and Hull (2024)</a> is a good read if you&#8217;re worried.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Even with HTE-robust estimators, the set of treated cohorts contributing to the aggregated coefficients changes across relative periods, and as a result, aggregated pre-treatment and post-treatment coefficients are not directly comparable, since they are based on different treated cohort compositions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>Aka, dynamic control group problem. This issue arises for estimators using not-yet-treated cohorts as controls (like what Ziyi refers to as the <a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948">CS-NYT estimator</a>).
A shifting control group means the underlying definition of the PTA changes, making it difficult to impose credible restrictions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Aggregation can mask cohort-level heterogeneity and produce a distorted benchmark; differences then reflect shifts in cohort composition.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>The cohort-anchored framework bases inference, as the name suggests, on cohort-period level coefficients.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>The block bias compares a treated cohort to its fixed initial control group (units untreated when the cohort first adopts).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>The comparison operates within a block-adoption structure, hence &#8220;block bias&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>Block biases are &#8220;anchored&#8221; to a fixed control group, enabling observable pre-treatment block biases to serve as valid benchmarks for their post-treatment counterparts.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>The block bias concept enables robust inference through a bias decomposition. This decomposition shows that the overall estimation bias (&#948;) can be expressed as an invertible linear transformation of the block biases (&#948;=W&#916;). 
By imposing restrictions on the consistent block bias (&#916;), the framework can translate those bounds to the overall estimation bias (&#948;) and successfully construct a valid confidence set.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>The entire paper proposes a cohort-anchored framework for robust inference, solving issues like dynamic control group and cohort heterogeneity.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Where Standard Assumptions Go To Die ]]></title><description><![CDATA[Universal treatments without clean controls, continuous policies without untreated groups, reversible reforms over time and statistical inference with handful of cases]]></description><link>https://www.diddigest.xyz/p/where-standard-assumptions-go-to</link><guid isPermaLink="false">https://www.diddigest.xyz/p/where-standard-assumptions-go-to</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 22 Sep 2025 12:47:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kpuj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kpuj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kpuj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kpuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated 
image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!kpuj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!kpuj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1229a10d-d55b-4217-9fdb-e268ed5ad8cf_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! I didn&#8217;t forget about you folks, but things have been pretty quiet so I was pilling up the papers as to have a meaningful post.<br>Let&#8217;s then check the latest ones:</p><p><a href="https://arxiv.org/abs/2407.11937">Factorial Difference-in-Differences</a>, by Yiqing Xu, Anqi Zhao and Peng Ding <em>(professor Yiqing told me this is not new - it was originally uploaded last year - but we will hear about it in soon so I thought about including it here. He was also very kind and shared <a href="https://yiqingxu.org/papers/2025_fdid/fdid_slides.pdf">slides</a> so we could understand the paper better. Thank you, prof! 
There&#8217;s also a talk available <a href="https://www.youtube.com/watch?v=NzFFlXxkloE">here</a>)</em></p><p><a href="https://arxiv.org/pdf/2405.04465">Difference-in-Differences Estimators When No Unit Remains Untreated</a>, by Cl&#233;ment de Chaisemartin, Diego Ciccia, Xavier D&#8217;Haultf&#339;uille and Felix Knau <em>(this was also first uploaded last year)</em></p><p><a href="https://arxiv.org/abs/2508.07808">Treatment-Effect Estimation in Complex Designs under a Parallel-trends Assumption</a>, by Cl&#233;ment de Chaisemartin and Xavier D&#8217;Haultf&#339;uille <em>(if you&#8217;re in Europe and want to learn more DiD from prof D&#8217;Haultf&#339;uille, he will be at the ISEG Winter School in January. You can sign up for it <a href="https://www.iseg.ulisboa.pt/en/event/iseg-winter-school-2026/">here</a>)</em></p><p><a href="https://arxiv.org/pdf/2504.19841">Inference With Few Treated Units</a>, by Luis Alvarez, Bruno Ferman, and Kaspar W&#252;thrich<em> (first draft: April 26, 2025; this draft: June 26, 2025)</em></p><h3>Factorial Difference-in-Differences</h3><h5>TL;DR: what should we do when there&#8217;s no untreated group? Events like famines, wars, or nationwide reforms hit everyone at once, leaving us without the clean control group that canonical DiD relies on. Xu, Zhao and Ding formalise what applied researchers have long been doing in these cases: using baseline differences across units to recover interpretable estimates. They call this Factorial DiD (FDiD). The framework clarifies what the DiD estimator identifies, what extra assumptions are needed to make causal claims, and how canonical DiD emerges as a special case.</h5><p><em>What is this paper about?</em></p><p>This paper is about a situation many of us applied people face: what to do when a big event affects everyone? In canonical DiD we need a treated group and a clean control group. But in practice there are cases where no such control exists (e.g., a famine, a war, or a nationwide policy reform) and no one escapes exposure. What we often do is still run a DiD regression, but instead of comparing treated and untreated units, the idea is to interact the event with some baseline factor, which is a characteristic that units already have before the event happens and that doesn&#8217;t change because of it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. The authors call this approach Factorial DiD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, and the name reflects the fact that the research design involves two factors: the event itself (before versus after) and the baseline factor (e.g., high versus low social capital). </p><p>The same DiD estimator is used, but the <em>research design</em> is different. The estimator no longer identifies the standard treatment effect, but instead it captures how the event&#8217;s impact differs across groups with different baseline factors, a quantity they call effect modification. The authors go further and show that moving from this descriptive contrast to a causal claim about the baseline factor (what they call causal moderation) requires stronger assumptions. They also show that canonical DiD is a special case of FDiD, but only if you impose an exclusion restriction assuming some group is exposed yet unaffected by the event.
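</p><p>To see what this looks like in practice, here is a minimal sketch with simulated data (my own illustration, not code from the paper): the event hits everyone, we difference outcomes and regress the difference on the baseline-factor indicator. The slope is the effect-modification contrast, not the ATT of the event.</p><pre><code># Toy FDiD-style illustration: everyone is exposed, invented numbers throughout.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
g = rng.binomial(1, 0.5, size=n)           # baseline factor: high (1) vs low (0)

# The event hits everyone between the two periods; its impact differs by baseline group
y_pre = 1.0 + 0.3 * g + rng.normal(0, 1, n)
y_post = y_pre - 2.0 + 1.5 * g + rng.normal(0, 1, n)   # effect: -2 if g=0, -0.5 if g=1

dy = y_post - y_pre
ols = sm.OLS(dy, sm.add_constant(g)).fit()
print(ols.params)   # slope near 1.5: effect modification across groups, not the event's ATT
</code></pre><p>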
This paper gives a language, a framework and clear identification results for a class of designs that are already widely used but rarely well defined in the methodological literature.</p><p><em>What do the authors do?</em></p><p>They begin with the simplest possible case: 2 groups, 2 periods + one event that hits everyone. They show that with universal exposure + no anticipation + the usual PT, the DiD estimator doesn&#8217;t give us the &#8220;standard&#8221; treatment effect but instead identifies effect modification, which is the descriptive contrast in how much the event changes outcomes for high-baseline versus low-baseline units (similar to the idea of heterogeneous treatment effects). They then introduce three related quantities<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>: </p><p>1) Causal moderation: the stronger claim that the baseline factor itself changes how the event matters (e.g., that social capital causes the famine to be less deadly);</p><p>2) An ATT analogue: the average effect of the event on the high-baseline group (which looks like the &#8220;treated group effect&#8221; in canonical DiD);</p><p>3) And the baseline factor&#8217;s effect given exposure: how the baseline characteristic matters once the event has happened. </p><p>The key contribution is to line up which assumptions are needed for each one. With only canonical DiD assumptions, you can get effect modification. To move to causal moderation, you need their new factorial PTA. Canonical DiD then emerges as a special case if you add an exclusion restriction that one group is exposed but unaffected. A different exclusion restriction, assuming the baseline factor has no impact in the absence of the event, links the design to the baseline-given-exposure quantity.</p><p>After working through these four cases, they move on to show how the framework can be stretched further. The authors show how to add baseline covariates, work with repeated cross-sections rather than panels and extend to continuous baseline factors. They also clarify what this means for practice: when TWFE regressions with interactions are &#8220;coherent&#8221; and when they are not. The paper closes by revisiting the famine and social capital example, showing how the framework applies to a real case that has already been studied with this kind of design.</p><p><em>Why is this important?</em></p><p>I think giving names to things matters. We&#8217;ve been doing FDiD, we just don&#8217;t call it that. We run a TWFE regression with an interaction term, call it DiD and then interpret the coefficient as if it were the standard ATT. The problem is that the target estimand is *not* the ATT. </p><p>This paper makes the implicit explicit. It 1) names the design and distinguishes it from canonical DiD, 2) spells out what the estimator is actually identifying under different assumptions, and 3) clarifies what extra assumptions we need if we want to go beyond descriptive contrasts. This obviously matters in practice. It gives us a diagnostic tool by stating plainly which estimand we are after and which assumptions justify it. I think it also sort of prevents over-claiming. With only canonical PT, we&#8217;re talking about effect modification. To get causal moderation, we need factorial PT. To recover an ATT-like interpretation, we need an exclusion restriction. </p><p>This paper also ties regression practice back to design. 
TWFE with interactions is ok if the assumptions are &#8220;right&#8221;; if not, we are not estimating what we think we are. And because the authors extend the framework to covariates, repeated cross-sections and continuous baseline factors, their approach covers the kinds of data we actually use.</p><p>At a broader level, the contribution is about transparency. FDiD gives us language that makes clear to authors, referees (and seminar audiences) what is being identified and under what conditions. It &#8220;builds a bridge&#8221; between factorial designs in stats and DiD in econ, and in doing so makes a very common empirical strategy much easier to defend (and critique).</p><p><em>Who should care?</em></p><p>Anyone in the social sciences doing causal work. Applied people<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> working on labour, development, education, political economy, trade or even finance will recognise the setup (wars, famines, nationwide reforms, regulatory changes or financial crises all fit the bill). PolSci folks dealing with regime shifts, revolutions and/or national elections face the same issue. I think it&#8217;s also useful for referees, editors and authors, and for us graduate students learning causal inference.</p><p><em>Do we have code?</em></p><p>No, for many reasons. FDiD is not a new estimator, but a design. The DiD estimator is standard; what changes is the estimand and the assumptions. Implementation here is context-specific. Which quantity you estimate (effect modification, causal moderation, ATT-analogue, baseline-given-exposure) depends on which assumptions you&#8217;re willing to make. A &#8220;one-click&#8221; function would hide that choice. The same regression (TWFE with an interaction or &#8710;Y on G) can map to different targets under different designs. Packaging it would risk misinterpretation. Extensions vary (covariates, repeated cross-sections, continuous baseline factor). Too many branches for a single simple API. If you want a practical recipe, it&#8217;s just: 1) difference outcomes, 2) run &#8710;Y on baseline-group indicator(s) with covariates as needed, and 3) interpret the coefficient as effect modification under standard PT (add stronger assumptions if you want causal moderation or the other targets).</p><p>In summary, FDiD gives a name and a framework to something we were already doing when there was no untreated group. The estimator is the same as in canonical DiD but the design is different (and so is the interpretation). What you get by default is effect modification, which is a descriptive contrast in how the event matters across baseline groups. To turn that into causal moderation, or to recover an ATT-like interpretation, you need extra assumptions. The paper&#8217;s value is in making these distinctions clear, mapping assumptions to estimands and also showing how regression practice fits within the framework. For anyone who has ever interacted an event with a baseline factor and called it DiD, please take a moment and think about rewriting it a bit, hehe.</p><h3>Difference-in-Differences Estimators When No Unit Remains Untreated</h3><h5>TL;DR: what if a policy affects everyone, but by different amounts? Minimum wage hikes, China&#8217;s WTO entry, Medicare Part D, etc.: no group is left fully untreated. The authors call this a heterogeneous adoption design with no untreated group. Standard DiD fails here and TWFE with treatment intensity can mislead. 
The paper shows how to recover effects if quasi-untreated units exist, proves impossibility without them, and offers minimal assumptions for partial identification. It also gives tests for when TWFE is &#8220;defensible&#8221;, with code in Stata and R.</h5><p><em>What is this paper about?</em></p><p>What happens when a policy affects everyone, but not in the same way? A <a href="https://academic.oup.com/qje/advance-article-pdf/doi/10.1093/qje/qjab028/41929667/qjab028.pdf">nationwide minimum wage hike</a> raises pay everywhere, yet the bite is bigger in some regions than others. <a href="https://www.nber.org/system/files/working_papers/w18655/w18655.pdf">China&#8217;s entry into the WTO</a> reduced tariff uncertainty for every U.S. industry, but some faced larger potential tariff spikes. <a href="https://www.nber.org/system/files/working_papers/w13917/w13917.pdf">Medicare Part D</a> applied to all drugs, but exposure varied with each drug&#8217;s Medicare market share. </p><p>In each of these examples, there&#8217;s no group left *entirely* untouched by the policy, every unit is affected to some degree. The authors call this setup a heterogeneous adoption design (HAD) with no untreated group. It turns out this situation is not rare at all. Many policy changes are universal by design and even when a small untreated group exists we often exclude it (drop if) because it looks too different from the rest of the sample. The problem is that standard DiD relies on having a control group that stays at zero treatment. If everyone is affected, that assumption collapses. And the default fix in applied work (a TWFE regression with treatment intensity as the regressor) doesn&#8217;t solve the problem. As the authors show, TWFE can give misleading results in this setting when treatment effects vary across units. </p><p>This paper then &#8220;attacks&#8221; the issue directly. It shows when effects are identified and when they are not. With quasi-untreated units (exposure near zero), they identify a weighted average of slopes (WAS) using a local-linear, optimal-bandwidth estimator adapted from RDD, with valid bias-corrected inference. When no quasi-untreated units exist, they prove an impossibility result for WAS without extra assumptions; they then give minimal assumptions to learn the sign of the effect or an alternative parameter relative to the lowest observed dose. Because TWFE is widely used, they also supply tests: a tuning-free test for the existence of quasi-untreated units and a nonparametric Stute test of the TWFE linearity restriction, alongside pre-trend checks.</p><p><em>What do the authors do?</em></p><p>First the authors set up the framework. They define what they call a heterogeneous adoption design (HAD) with no untreated group: every unit is untreated in period one and in period two everyone gets some positive treatment dose, but the size of that dose differs across units. They show why TWFE regression, which regresses outcome changes on treatment intensity, can be misleading in this setup when effects differ across units.</p><p>They then ask: what can we still learn? Their first target is a parameter they call the weighted average of slopes (WAS), which weights each unit&#8217;s slope (change in potential outcomes between zero and actual treatment) by the unit&#8217;s treatment dose relative to the average dose. 
If there is a set of quasi-untreated units (those with exposure close to zero), WAS can be identified by comparing outcome changes in the full sample to outcome changes in this group. To estimate it, they import tools from the RD literature (local linear regression at the boundary + optimal bandwidth choice + bias-corrected confidence intervals). </p><p>But what if even quasi-untreated units don&#8217;t exist? Here the paper proves a complete impossibility result: without extra assumptions, WAS has an identification set that is the entire real line - meaning the data can rationalize any value of WAS. To make progress, the authors then propose 2 minimal routes forward. One option is to assume enough structure to at least pin down the sign of WAS, by comparing the least-treated units to the rest. Another is to define a slightly different parameter, WAS_d&#8203;, which compares outcomes to the lowest observed dose rather than to zero. Under an additional assumption, that parameter can be identified and estimated.</p><p>Because the whole strategy depends on whether a quasi-untreated group exists, they also build a tuning-free statistical test for its presence based on simple order statistics of the treatment variable.</p><p>After laying out these new estimators, they circle back to TWFE. Since we often use it anyway, they ask: under what conditions could TWFE actually be valid here? They show that if treatment effects are homogeneous and linear, then TWFE delivers the average slope. This leads to a key insight: TWFE implies that the conditional mean of outcome changes is linear in treatment dose, and this is testable. They adapt a nonparametric specification test (the Stute test) to check this linearity. Combined with standard pre-trend tests and their quasi-untreated test, this gives a three-part testing procedure: if all tests pass, TWFE may be valid. For very large samples, they also provide a faster alternative (the yatchew_test).</p><p>Finally, they show the methods in action. In the case of bonus depreciation in the U.S. (2002), they find positive employment effects (often larger than the original TWFE estimates) with pre-trends largely consistent. In the case of China&#8217;s PNTR entry, their nonparametric estimates are noisy and mostly insignificant. However, when they control for industry-specific linear trends, the pre-trend and linearity tests are no longer rejected, making TWFE with trends defensible (which points to negative employment effects).</p><p><em>Why is this important?</em></p><p>Many real-world policies look exactly like this. Minimum wage hikes, trade reforms, new health programmes&#8230; these are designed to be universal, and they end up affecting everyone to some extent. Untreated groups either don&#8217;t exist or they&#8217;re so small and different that we usually drop them.</p><p>That means the standard DiD design doesn&#8217;t apply. Without a zero-treatment group, the whole logic of comparing treated and untreated units makes no sense. Yet in practice, we often still run a TWFE regression with treatment intensity as the regressor, hoping it will stand in for DiD. The problem is that with heterogeneous effects, TWFE can produce misleading results.</p><p>This paper matters because it tells us exactly what is and isn&#8217;t identifiable in these designs. If quasi-untreated units exist, treatment effects can still be recovered in a transparent way. 
If they don&#8217;t, the paper shows that the data itself can&#8217;t tell you the answer (an impossibility result that sets a clear boundary for empirical work).</p><p>The authors also bring new tools to the table. Their nonparametric estimators adapt ideas from RD to this setting, and their tuning-free tests give us a way to check assumptions rather than relying on hope (:P). At the same time, instead of throwing TWFE in the bin, they provide guidance on when it might still be valid, and a concrete test for its key assumption.</p><p>The parameters they focus on (WAS and WAS_d) also have direct policy meaning. They tell us whether a universal reform was beneficial compared to no reform, or compared to the minimal observed treatment, which ties the methods back to cost&#8211;benefit analysis.</p><p><em>Who should care?</em></p><p>First and foremost, applied microeconomists who study policies that are universal but uneven in their bite (labour market reforms, trade liberalisation, health programmes, education policies). These are exactly the settings where untreated groups don&#8217;t exist, yet exposure varies. Policy evaluators and government analysts should also pay attention. When reforms roll out nationwide, the usual evaluation strategies no longer apply. Having a framework that makes clear what can and can&#8217;t be identified is very important if you&#8217;re tasked with producing credible estimates for policy decisions. Researchers who rely on TWFE in continuous-treatment settings also have a lot at stake. The paper doesn&#8217;t say &#8220;don&#8217;t use TWFE&#8221;. The authors rather show when it is misleading and when, with the right tests, it might still be defensible. That&#8217;s a message many of us need to hear. Methodologists and students who are expanding the DiD toolkit will also find this work useful. </p><p><em>Do we have code?</em></p><p>Yes, lots. The did_had package (available in both <a href="https://ideas.repec.org/c/boc/bocode/s459331.html">Stata</a> and <a href="https://www.rdocumentation.org/packages/DIDHAD/versions/1.0.1/topics/did_had">R</a>) estimates the WAS and can also generate pre-trend and event-study estimators. It implements the nonparametric local-linear estimator with optimal bandwidth selection and bias-corrected confidence intervals, borrowing routines from the <a href="https://cran.r-project.org/web/packages/nprobust/index.html">nprobust</a> package. For checking the assumptions behind TWFE, they provide the stute_test package (also in <a href="https://ideas.repec.org/c/boc/bocode/s459349.html">Stata</a> and <a href="https://cran.r-project.org/web/packages/StuteTest/StuteTest.pdf">R</a>). This runs the nonparametric specification test of whether outcome changes are linear in treatment dose using a wild-bootstrap implementation. It works fast for moderate datasets, though memory limits kick in as the sample size grows large. For very large datasets, the authors also offer <a href="https://github.com/chaisemartinPackages/yatchew_test">yatchew_test</a>, a faster alternative that scales up to tens of millions of observations.</p><p>In summary, this paper shows what to do when every unit is treated. With quasi-untreated units, treatment effects can still be identified using a new nonparametric estimator. Without them, effects are not identified unless you add assumptions (an impossibility result that sets clear limits). The authors also develop tests to decide when TWFE is valid and provide ready-to-use code. 
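</p><p>If you want to see the quasi-untreated logic in the simplest possible form, here is a deliberately naive R sketch (my own toy illustration, not the packages above): it proxies the counterfactual trend with a linear fit among the smallest doses and scales the remaining outcome change by the average dose. did_had does this properly, with data-driven bandwidths and bias-corrected inference.</p><pre><code># dy = Y_post - Y_pre per unit, d = treatment dose in period 2 (everyone treated)
was_sketch = function(dy, d, n_near_zero = 100) {
  qu  = head(order(d), n_near_zero)                 # ad hoc "quasi-untreated" set: smallest doses
  fit = lm(dy ~ d, data = data.frame(dy = dy[qu], d = d[qu]))
  trend_at_zero = predict(fit, newdata = data.frame(d = 0))   # counterfactual trend at dose zero
  (mean(dy) - trend_at_zero) / mean(d)              # outcome change net of trend, per unit of dose
}

set.seed(1)
n  = 1000
d  = rbeta(n, 1, 4)                                 # positive doses, many close to zero
dy = 0.2 + (1 + rnorm(n, 0, 0.3)) * d + rnorm(n, 0, 0.1)
was_sketch(dy, d)                                   # roughly 1 by construction
</code></pre><p>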
For universal reforms, this is now the reference point for what we can and can&#8217;t infer and learn.</p><h3>Treatment-Effect Estimation in Complex Designs under a Parallel-trends Assumption</h3><h5>TL;DR: what to do when treatments are non-binary, reversible or vary at the baseline? This paper introduces AVSQ event-study effects, which compare actual outcomes to a status-quo where each unit keeps its initial treatment. The framework makes it possible to evaluate dynamic and multi-dose policies, while offering a cost&#8211;benefit measure (ACE), and shows why distributed-lag TWFE regressions can mislead. The authors also provide software (<code>did_multiplegt_dyn</code>) and illustrate the method with US newspapers and turnout.</h5><p><em>What is this paper about?</em></p><p>In this paper the authors look at how to estimate treatment effects with DiD when the treatment is more complex than a one-time, binary switch. Many policies can vary in intensity, be reversed, or differ in their baseline levels across units. The usual staggered adoption design doesn&#8217;t capture these cases, even though they are common in applied work. The authors propose a way forward by comparing outcomes under the actual treatment path to a counterfactual &#8220;status-quo path&#8221; where each unit simply keeps its initial treatment. The resulting actual-versus-status-quo (AVSQ) effects generalise the ATT from staggered designs and provide a framework for analysing dynamic treatments beyond the standard setup.</p><p><em>What do the authors do?</em></p><p>The authors begin by setting up a framework for complex DiD designs, where treatments are not just binary switches but can vary in intensity, be reversed, or differ across units at baseline. To deal with this, they propose comparing each unit&#8217;s actual treatment path to a counterfactual &#8220;status-quo path&#8221; where the unit simply keeps its initial treatment level. The difference between these two paths defines what they call AVSQ event-study effects, which extend the familiar ATT from staggered adoption designs.</p><p>They then show how these AVSQ effects can be identified under two assumptions: no anticipation and PT conditional on baseline treatment. This conditional PTA is very important because it requires that units with the same initial treatment level have similar outcome trends in the absence of treatment changes, but allows units with different baseline treatments to have different trends. Because raw AVSQ effects can be difficult to interpret, they also introduce a normalised version that rescales by the change in treatment dose, turning them into weighted averages of marginal effects of current and lagged treatments. The AVSQ framework also includes a &#8220;no-crossing&#8221; condition to ensure interpretable results (this means a unit&#8217;s treatment path doesn&#8217;t zigzag above and below its initial level, which would make it unclear whether treatment increases or decreases are driving the effects). On top of this, they develop a statistic called the Average Cumulative Effect (ACE), which aggregates AVSQ effects into a cost&#8211;benefit style measure of whether the realised treatment path improved outcomes per unit of treatment compared to the status quo.</p><p>Beyond defining these new estimands, the authors take a closer look at what happens when researchers run distributed-lag TWFE regressions in these settings. 
They show that TWFE coefficients do not reliably estimate causal effects when treatment effects are heterogeneous: they end up mixing together different lags and in some cases can even reverse sign. To provide a cleaner alternative, they introduce a random-coefficients distributed-lag model which assumes that treatment effects are constant over time but can vary across units. Under this model they show how to identify and estimate average effects of current and lagged treatments. Estimation is straightforward when treatments take on discrete values, but becomes more complex with continuous treatments which require truncation procedures to handle cases where some units have extreme influence on the estimates.</p><p>At the end, they bring these ideas to an empirical application. Revisiting <a href="https://www.nber.org/system/files/working_papers/w15544/w15544.pdf">Gentzkow, Shapiro, and Sinkinson</a>&#8217;s study of newspapers and voter turnout in the United States, they extend the original analysis by allowing for dynamic effects of current and past newspaper exposure. Their AVSQ estimates show that newspapers increased turnout and that the effect of current exposure was larger than the effect of lagged exposure. A conventional TWFE regression suggests the opposite, which illustrates the practical importance of their framework.</p><p><em>Why is this important?</em></p><p>Many policies that we care about don&#8217;t look like simple one-off treatments. In Kenya, whether a district shares the president&#8217;s ethnicity changes with each election, so exposure to favouritism in public spending can <a href="https://www.nber.org/system/files/working_papers/w19398/w19398.pdf">switch on and off</a>. In the US, states deregulated financial markets during the 1990&#8217;s at <a href="https://www.jstor.org/stable/pdf/43495408.pdf">different times and with different intensities</a> and some even made multiple changes. In Germany, municipalities have local business tax rates that <a href="https://www.jstor.org/stable/pdf/26527909.pdf">vary across space from the start and move up or down over time</a>. Even the number of newspapers in circulation across US counties (which is a continuous measure) has long been used to study voter turnout.</p><p>None of these cases fit perfectly into a staggered binary design. If we force them into that framework or fall back on TWFE regressions, we risk estimates that don&#8217;t mean what we think they mean. This paper is important because it provides a framework that lets us evaluate these policies as they actually happened. The AVSQ approach measures the effect of departing from the status quo baseline, the normalisation step makes the results interpretable in terms of current and lagged effects, and the ACE parameter links the findings to a cost&#8211;benefit view of policy.</p><p>It also matters that the paper shows how and why distributed-lag TWFE can give the wrong answer in these settings. It&#8217;s more than just a theoretical point: in their application to newspapers and turnout, the AVSQ framework points to strong effects from current exposure, while TWFE suggests the opposite. 
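</p><p>To fix ideas on what the status-quo comparison is doing, here is a toy R sketch (mine, not the paper&#8217;s estimator): within groups that share the same baseline treatment, compare the outcome changes of units that move away from that baseline (&#8220;switchers&#8221;) to units that keep it, then average across groups. did_multiplegt_dyn implements the real thing, with dynamic effects, placebos and proper inference.</p><pre><code># Toy status-quo comparison behind AVSQ-type effects (two periods, one possible switch)
set.seed(1)
n  = 600
d0 = sample(0:2, n, replace = TRUE)        # baseline treatment level
sw = rbinom(n, 1, 0.4)                     # switcher: treatment moves away from baseline in period 2
dy = 0.3 * d0 + 1.5 * sw + rnorm(n)        # trends may differ by baseline level; switching has an effect

by_group = split(data.frame(dy, sw), d0)   # compare switchers to stayers within each baseline group
avsq = sapply(by_group, function(g) mean(g$dy[g$sw == 1]) - mean(g$dy[g$sw == 0]))
n_sw = sapply(split(sw, d0), sum)          # weight groups by their number of switchers
sum(avsq * n_sw) / sum(n_sw)               # roughly 1.5 by construction
</code></pre><p>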
For applied researchers, the message is clear: dynamic and non-binary treatments need methods designed for them, not workarounds.</p><p><em>Who should care?</em></p><p>Anyone working with DiD in situations where treatment is not a clean one-off adoption, which includes researchers studying policies that can be introduced and repealed, vary in strength or differ across units from the very beginning. As the authors point out, examples range from political economy questions like ethnic alignment with leaders, to local fiscal policy where tax rates shift up and down, to historical work on newspapers and turnout. It also matters for people still relying on distributed-lag TWFE regressions to study dynamic effects. The paper shows those regressions can give sign-reversed estimates when effects differ across units. If your design involves intensity, reversals, or lagged exposure, this framework offers more credible alternatives.</p><p>In short, applied researchers in labour, public finance, political economy and economic history all have reasons to pay attention (basically anywhere policies don&#8217;t come as simple binary treatments).</p><p><em>Do we have code?</em><br>We have commands for both Stata and R, called <a href="https://github.com/chaisemartinPackages/did_multiplegt_dyn">did_multiplegt_dyn</a>, which compute the AVSQ event-study effects and related estimators. These commands also implement the placebo tests and the diagnostics for dynamic effects. The paper itself is theory-heavy, but the software is designed to be practical for applied work.</p><p>In summary, this paper shows how to use DiD when treatments don&#8217;t look like neat one-time switches. The AVSQ approach compares what actually happened to a status-quo where nothing changed and the normalised version tells us how current and past exposure matter. The ACE parameter turns this into a cost-benefit measure. The authors also explain why standard distributed-lag TWFE regressions can mislead and give a simple alternative. In their application to newspapers and turnout, the new method finds current exposure matters more, while TWFE says the opposite.</p><h3>Inference With Few Treated Units</h3><h5>TL;DR: what if only one or a few units get treated, like a single country&#8217;s reform or a couple of states changing policy? Clustered or robust errors break down because they assume many treated clusters and badly understate uncertainty. This paper surveys the fixes: borrowing from controls, pre-trends, or randomisation inference. Each comes with trade-offs. The key point is that the count of treated units drives inference, and much of what we see in practice is invalid.</h5><p><em>What is this paper about?</em></p><p>This paper is about a situation many applied researchers find themselves in: what if your treatment only applies to one unit, or just a handful of them? Think of German reunification (only one country &#8220;treated&#8221;) or a state-level reform adopted by 2 or 3 states. In these cases, the usual inference machinery is a bit more tricky and difficult to justify. We know from practice that standard cluster-robust or heteroskedasticity-robust errors rely on having many treated clusters. With only a few, the treated units have outsized leverage, standard errors are biased down and rejection rates can be way off. What the authors do is to step back and survey the growing literature on this problem. 
They classify methods into two big &#8220;families&#8221;:</p><ul><li><p>Model-based inference, which treats uncertainty as coming from potential outcomes (sampling from a super-population); and</p></li><li><p>Design-based inference, which treats uncertainty as coming from random assignment of treatment.</p></li></ul><p>They then go through the different approaches available when treated units are scarce. Some methods lean on cross-sectional variation (borrowing information from controls), others on time-series variation (borrowing from pre-treatment periods). Each approach makes trade-offs: cross-sectional methods require many controls but allow unrestricted serial correlation; time-series methods require many untreated periods but allow cross-sectional dependence. They also examine extreme cases, like when there&#8217;s *literally* only one treated unit and one treated period. Here, inference is only possible under strong assumptions, like assuming treatment effects are homogeneous or restricting how errors behave. As more treated units, periods, or within-cluster observations become available, assumptions can be relaxed somewhat, but the problem never fully goes away.</p><p>The main message is that the inferential challenge is about the number of treated units. Even if you have a thousand controls, one treated unit means asymptotics won&#8217;t save you. The survey pulls together existing fixes (wild bootstrap variants, randomization inference, conformal methods, etc.), shows how some heuristics can be justified theoretically and proposes small modifications to improve finite-sample performance.</p><p>This is a good paper that provides us with a map of what can and can&#8217;t (shouldn&#8217;t?) be done when &#8220;few treated&#8221; is the reality, and links econometrics back to the practical settings (DiD, SC, matching) where these problems show up most often.</p><p><em>What do the authors do?</em></p><p>They begin with a simple example: estimating a treatment effect when only one unit is treated. With just one treated unit, heteroskedasticity-robust standard errors collapse (what variance would be left to estimate anyway?) and cluster-robust errors in a DiD setting massively understate uncertainty. They use this to show why conventional inference fails when the number of treated units is small (regardless of how many controls you have).</p><p>From there, they organise the literature into categories:</p><ol><li><p>Cross-sectional approaches: methods like <a href="https://direct.mit.edu/rest/article-pdf/93/1/113/1919185/rest_a_00049.pdf">Conley and Taber (2011)</a>, which assume the error distribution for treated and control units is comparable and typically require homogeneous treatment effects or strong distributional assumptions. With many controls you can use the control residuals to approximate the treated distribution. Other refinements handle heteroskedasticity (e.g. state size differences) or variation in treatment timing.</p></li><li><p>Time-series approaches: methods that rely on long pre-treatment periods and require stationarity and weak dependence assumptions on the errors. Here the idea is to compare the treated unit&#8217;s residuals after treatment to its residuals before treatment, testing whether the last observation looks unusually large. 
In exchange for these assumptions, they allow for cross-sectional correlation.</p></li><li><p>Allowing for stochastic treatment effect heterogeneity: they discuss when inference can be framed around testing sharp nulls (no effect whatsoever) or around realised treatment effects conditional on shocks. This shifts the interpretation of what we&#8217;re learning but opens up additional strategies.</p></li><li><p>Design-based approaches: when treatment is randomly assigned but only to a handful of units, the focus shifts. They show how randomisation inference can still be valid, but imbalances between treated and controls become much more likely with small treated groups.</p></li><li><p>Sharp null testing: many methods test whether there is no effect whatsoever (sharp nulls) rather than testing average treatment effects, which changes how results should be interpreted.</p></li></ol><p>Along the way, they show connections between methods (e.g., how a wild bootstrap with the null imposed can be asymptotically equivalent to a randomisation test) and they propose tweaks that improve finite-sample performance.</p><p>The unifying theme is that each method trades one type of assumption for another. Cross-sectional methods need many controls and comparability. Time-series methods need long pre-trends and stationarity. Design-based methods hinge on how treatment was assigned. The survey doesn&#8217;t offer a &#8220;best&#8221; choice, but it does clarify what&#8217;s available, what each method buys you and where each one fails. </p><p><em>Why is this important?</em></p><p>I&#8217;m preaching to the choir here, but situations like these are everywhere. Many influential DiD and SC papers study events like a single country&#8217;s reform, one state-level policy or a handful of treated clusters. Standard inference procedures underestimate uncertainty because they rely on asymptotic theory that fails when the number of treated units is small, regardless of the total sample size, and this can lead to massive over-rejection.</p><p>This paper gives us both structure and clarity on a messy corner of applied practice. Instead of a patchwork of one-off fixes, it provides a taxonomy: which methods work when you only have one treated unit, when you have a few, when you have many controls or when you have long pre-treatment histories. It also explains the assumptions you&#8217;re implicitly making when you pick one approach over another.</p><p>There&#8217;s also a bigger point: forget about the share of treated units and look at the count. Even if 1% of a thousand units are treated, that&#8217;s still just ten treated units, and asymptotics won&#8217;t help much. Knowing this helps us avoid the false comfort of &#8220;large N&#8221; when what actually matters is the number of treated clusters.</p><p>The survey also connects different strands of work to demonstrate that many of the inferential problems are variations of the same underlying issue. For practitioners, that means better tools and clearer diagnostics. For theorists, it ties together a set of results that were previously all over the place. 
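</p><p>As a flavour of the design-based logic with a single treated unit, here is a toy permutation-style sketch in R (my own illustration of the &#8220;borrow the distribution from the controls&#8221; idea, not any specific procedure from the survey):</p><pre><code># One treated unit, many controls: build a placebo distribution from the controls
set.seed(1)
n_units = 50
dy      = rnorm(n_units)               # outcome changes (post minus pre) for each unit
treated = 1                            # only unit 1 is treated
dy[treated] = dy[treated] + 2          # true effect of 2

est = dy[treated] - mean(dy[-treated]) # simple DiD-style estimate

# reassign the "treated" label to each control in turn and recompute the estimate
placebo = sapply(seq_len(n_units)[-treated],
                 function(j) dy[j] - mean(dy[-c(treated, j)]))
mean(abs(placebo) >= abs(est))         # permutation-style p-value
</code></pre><p>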
It also reveals that methods designed for single treated units may be more powerful than methods requiring multiple treated units, even when more treated units are available, which is an important practical consideration for researchers.</p><p><em>Who should care?</em></p><p>Applied researchers working with one or a few treated units (country case studies, state-level reforms or costly RCTs). Referees and editors since many papers still report invalid clustered errors in these settings.</p><p><em>Do we have code?</em></p><p>Code doesn&#8217;t make sense here.</p><p>In summary, this paper reviews what to do when only a handful of units are treated. Standard inference fails in these cases, and the number of treated units - not their share - drives the problem. The authors map out model-based, time-series, cross-sectional, and design-based approaches, show how existing heuristics can be justified and suggest tweaks to improve small-sample performance. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It took me a good half an hour to understand this because the justification for what classifies as a baseline factor is context-dependent. How do we know if something is a baseline factor? Apparently the simplest rules are: 1) it must already be in place before the event, and 2) it must not *plausibly* (oh no my identification assumption!) be changed by the event. We should be able to look at it and say: this is a characteristic of the unit that existed beforehand and won&#8217;t flip suddenly just because the event occurred. From what I understand, there are three practical checks and they relate to timing, persistence and isolation (to a certain extent). Timing refers to: was the factor measured before the event? If not, it can&#8217;t be baseline. Persistence relates to the factor being something like slow-moving or fixed (e.g., geography, long-standing institutions or cultural traits). And finally, isolation: can the event directly or indirectly alter it in the short run? If yes, it&#8217;s not safe to treat it as baseline. Going back to what the authors use as an example, social capital measured in the 1950s works in a famine study because kinship networks don&#8217;t suddenly disappear during a few &#8220;bad years&#8221;. But soil quality would not work if the &#8220;event&#8221; were a soil restoration reform, since the policy directly targets the factor itself.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>&#8220;Factorial&#8221; comes from stats. <a href="https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf">Fisher</a> (also <a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1980.10482701">here</a>) and <a href="https://repository.rothamsted.ac.uk/item/98765/the-design-and-analysis-of-factorial-experiments">Yates</a> developed factorial experiments to study the main effects of two factors and their interaction. FDiD borrows from this logic: treat the event (pre/post) and a baseline factor (e.g., high/low social capital) as the two factors, then interpret what DiD identifies under stated assumptions. 
For background, the authors recommend these: <a href="https://journals.lww.com/epidem/fulltext/2009/11000/on_the_distinction_between_interaction_and_effect.16.aspx">VanderWeele (2009)</a>; <a href="https://ideas.repec.org/a/bla/jorssb/v77y2015i4p727-753.html">Dasgupta et al (2015)</a>; <a href="https://academic.oup.com/jrsssa/article/184/1/65/7056364">Bansak (2020)</a>; <a href="https://arxiv.org/abs/2101.02400">Zhao and Ding (2021)</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Effect modification: identified with the usual DiD assumptions: universal exposure, no anticipation, and canonical parallel trends. This is descriptive, not causal. Causal moderation: to interpret the contrast as causal, you need a stronger assumption, which the authors call <em>factorial PT</em>. ATT analogue: if you add an exclusion restriction that one group is exposed but unaffected, then the design &#8220;collapses&#8221; to the familiar ATT interpretation from canonical DiD. Baseline effect given exposure: if instead you assume the baseline factor has no effect in the absence of the event, then the estimator captures how much the baseline factor matters once the event has happened.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In the Concluding Remarks, the authors make 3 recommendations for applied researchers:</p><ol><li><p>Be explicit about the design. When using FDID, say so. Don&#8217;t call it a standard DiD, because the estimand is different.</p></li><li><p>State clearly what assumptions justify your claim. With only canonical PT, you can talk about effect modification; to claim causal moderation, you need factorial PT; for ATT-like interpretations, you need an exclusion restriction.</p></li><li><p>Check robustness with alternative assumptions and specifications. 
For example, examine whether your baseline factor is truly unaffected by the event, and test sensitivity to different ways of grouping or measuring it.</p></li></ol><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Parallel Trends… Conditionally: Covariates, Violations, and More Robust DiD Methods]]></title><description><![CDATA[The devil in the ~DiD details~]]></description><link>https://www.diddigest.xyz/p/parallel-trends-conditionally-covariates</link><guid isPermaLink="false">https://www.diddigest.xyz/p/parallel-trends-conditionally-covariates</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 11 Aug 2025 16:33:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!frLA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!frLA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!frLA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!frLA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!frLA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!frLA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!frLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!frLA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!frLA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!frLA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!frLA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe1bc44-5d96-4641-926b-02df80700db9_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! Following a friend&#8217;s suggestion, I decided to add some quick explanations of concepts that eventually show up in the main text. I don&#8217;t want to overcrowd it, so they will appear as footnotes whenever it&#8217;s necessary. I hope this will make things clearer :) For example, the first paper talks about the covariate balancing propensity score, which is a method for estimating propensity scores in a way that automatically balances covariates between treated and control groups (rather than estimating the score first and then checking balance later). I&#8217;ll no longer assume everyone is already familiar with these &#8220;specifics&#8221;. Any other suggestions are welcome, just send me an email (b[dot]gietner[at]gmail[dot]com) or comment in the comment box at the end of the post.</p><p>Let&#8217;s goooo then. Here are the papers:</p><ul><li><p><a href="https://arxiv.org/abs/2508.02097">A difference-in-differences estimator by covariate balancing propensity score</a>, by Junjie Li and Yukitoshi Matsushita </p></li><li><p><a href="https://arxiv.org/abs/2412.14447">Good Controls Gone Bad: Difference-in-Differences with Covariates</a>, by Sunny Karim and Matthew D. 
Webb</p></li><li><p><a href="https://arxiv.org/abs/2508.02970">Bayesian Sensitivity Analyses for Policy Evaluation with Difference-in-Differences under Violations of Parallel Trends</a>, by Seong Woo Han, Nandita Mitra, Gary Hettinger and Arman Oganisian</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5375017">The Labor Market Effects of Generative AI: A Difference-in-Differences Analysis of AI Exposure</a>, by Andrew Johnston and Christos Makridis <em>(this is an applied paper that I judged to be super relevant to the current discussion we are having regarding the role of AI on the labour market: is it good, is it bad, does it generate consumer surplus, who&#8217;s benefitting from it, who&#8217;s being left out, is the generative AI revolution fundamentally different from past technological shifts (like the industrial revolution or the rise of computers), if generative AI is so powerful, why aren't we seeing a massive, economy-wide productivity boom? etc. The narratives are swinging between utopian optimism and dystopian fear, and it doesn&#8217;t need to be like that. I think you should check it out for a couple of reasons: it provides evidence that AI appears to be both creating and destroying opportunities, but in different parts of the labour market; and it directly tackles the central question of whether AI will help or replace human workers. I&#8217;m not telling you more, go read it :))</em></p></li></ul><h3>A difference-in-differences estimator by covariate balancing propensity score</h3><h5>TL;DR: Professors Li and Matsushita adapt the Covariate Balancing Propensity Score (CBPS) to estimate treatment effects in DiD designs. Their proposed CBPS-DiD estimator enforces covariate balance while offering double robustness for both the point estimate and its statistical inference. Tested in simulations and on the classic LaLonde dataset, it often delivers lower bias and more reliable confidence intervals than standard estimators like regression adjustment, IPW or AIPW. Its advantage is most pronounced when the underlying models are slightly misspecified (a common challenge in applied policy evaluation).</h5><p><em>What is this paper about?</em></p><p>This paper by Li and Matsushita takes the Covariate Balancing Propensity Score (CBPS)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> method introduced by <a href="https://academic.oup.com/jrsssb/article-abstract/76/1/243/7075938">Imai and Ratkovic (2014)</a> and adapts it to the DiD framework. Their focus is on estimating the ATT in situations where the PTA is unlikely to hold unconditionally but may be more plausible after conditioning on covariates<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>The authors&#8217; contribution is to integrate CBPS into a DiD setup by showing that their &#8220;CBPS-DiD&#8221; estimator inherits many &#8220;desirable&#8221; properties. When both the outcome model and the propensity score model are specified correctly, the estimator approaches the semiparametric efficiency bound<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> (meaning it makes full use of the available information in the data). 
When only one of the models is correct, the estimator remains consistent (doubly-robust<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>), and unlike the augmented inverse probability weighting (AIPW) DiD estimator, it is also doubly robust in terms of *inference* (so confidence intervals remain valid even if one model is wrong and you&#8217;re not being misled about how precise your estimates are). </p><p>In the paper&#8217;s simulation exercises they also find that CBPS-DiD converges faster to the true ATT than AIPW-DiD when both models are only &#8220;slightly&#8221; misspecified (it&#8217;s a subtle but practically relevant advantage since exact model correctness is very rare in applied work). </p><p><em>What do the authors do?</em></p><p>To examine how their CBPS-DiD estimator performs, the authors perform Monte Carlo simulations that let them control the &#8220;truth&#8221; and test the method under different conditions. They construct artificial datasets in which they vary whether the outcome model and the propensity score model are both correct, only one is correct, both are wrong or both are &#8220;slightly&#8221; wrong. These scenarios are important because they mimic the realities of applied work: rarely do we specify every model perfectly and small deviations from the truth are very common even with good theory AND data. For each of the proposed scenarios they compare CBPS-DiD to standard alternatives (outcome regression, IPW and AIPW) tracking bias, coverage probability and confidence interval length (checking if point estimates are close to the truth and whether the measures of uncertainty are reliable).</p><p>They then move from simulations to an empirical illustration using LaLonde&#8217;s (1986) dataset, which is a &#8220;benchmark&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> in program evaluation where the true average treatment effect is known to be zero. It&#8217;s like a tough test: the dataset is well known for &#8220;tripping up&#8221; estimators, often producing spurious effects due to covariate imbalance. By applying CBPS-DiD alongside the other estimators to this data they can see how much bias each method produces when estimating an effect that should, IN THEORY, be absent. They run these comparisons both with a small covariate set and a larger one, displaying how performance changes as the dimensionality of the adjustment set grows.</p><p><em>Why is this important?</em></p><p>The PTA is one of the pillars of DiD, but in many applications it is unlikely to hold unless you adjust for pre-treatment differences in observed characteristics. The problem is that some traditional approaches (like regression adjustment or IPW) often struggle here for a variety of reasons, such as poor covariate balance or relying too much on the model being specified correctly. CBPS-DiD addresses both of these problems by a) building covariate balance into the estimation process and b) keeping the &#8220;protection&#8221; of double robustness. It also offers double robustness for inference, meaning confidence intervals remain valid even when one of the models is wrong. This matters because in observational data the covariates that matter for trends are often measured with error or modeled imperfectly. If balance is poor the ATT estimate can be biased. 
If inference is wrong you can be misled about the effect&#8217;s precision. CBPS-DiD reduces the risk of both problems without requiring perfect knowledge of the true data-generating process<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. It is designed for the realities of applied policy evaluation where theory guides model choice but can&#8217;t guarantee correctness.</p><p><em>Who should care?</em></p><p>Applied researchers who rely on DiD in settings where the PTA is unlikely to hold without covariate adjustment and where model misspecification is a concern. It is especially relevant if you work with observational data where balance between treated and control units is hard to achieve or if you want more reliable inference when model assumptions are only partly correct. Policy evaluators, labour economists, education researchers and public health analysts working with staggered or selective treatment adoption can all benefit from a method that strengthens both balance and inference without requiring perfect models. </p><p><em>Do we have code?</em></p><p>No, the paper doesn&#8217;t provide replication files or a package info. If you want to try CBPS-DiD you would need to adapt the existing CBPS framework from Imai and Ratkovic (2014) (available in the <code>CBPS</code> package for R) and embed it into a DiD setup following the steps in the paper. The package itself only covers cross-sectional and basic panel weighting so this is not a &#8220;plug and play&#8221; situation for DiD. It&#8217;s doable if you&#8217;re comfortable working with both propensity score weighting and DiD estimation but there&#8217;s no one-line function for it yet.</p><p>In summary, this paper extends the CBPS to the DiD framework, producing an estimator that binds covariate balance into the weighting stage while maintaining double robustness, and even extends it to inference. It&#8217;s a method built for real-world policy evaluation where covariates matter for trends, models are rarely perfectly specified and you want both credible point estimates and trustworthy CIs. The simulations and the LaLonde application show it can outperform standard regression adjustment, IPW and AIPW in both bias and precision, with the advantage being more pronounced when model misspecification is mild but inevitable.</p><h3>Good Controls Gone Bad: Difference-in-Differences with Covariates</h3><p><em>(We have plots, DAGs AND flowcharts in this paper, which I love. Also lots of things in this paper follow from the previous one - check the footnotes)</em></p><h5>TL;DR; many DiD studies add time-varying covariates assuming they help, but this only works if the CCC assumption holds (meaning covariate effects are stable across groups and time). When CCC fails, popular estimators like TWFE, CS-DiD, imputation and FLEX can produce biased treatment effects, even if PT hold without covariates. 
This paper formalizes CCC, shows how violations cause bias and introduces DiD-INT, a new estimator that remains unbiased under CCC violations and can recover PT that are hidden by misspecified covariate adjustments.</h5><p><em>What is this paper about?</em></p><p>This paper is about what happens when the relationship between your covariates and your outcome isn&#8217;t stable across time or across groups, a situation the authors formalize as the Common Causal Covariates (CCC) assumption<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>In most DiD studies, we add covariates to make the PTA more plausible or to control for other influences, but when the effect of such covariates changes over time or differs between treated and control groups, &#8220;standard&#8221; estimators (from the familiar TWFE model to modern options like Callaway&#8211;Sant&#8217;Anna, imputation and FLEX) can produce biased treatment effect estimates. Karim and Webb show that this problem is widespread, formalize three versions of the CCC assumption and demonstrate how violations cause bias in widely used estimators. They also introduce a new estimator, the Intersection Difference-in-Differences (DID-INT) which remains unbiased even when CCC fails and works in settings with staggered treatment rollout and heterogeneous treatment effects.</p><p><em>What do the authors do?</em></p><p>They do lots of stuff, so I separated the contributions into four subsections.</p><ol><li><p>First, they formalize the CCC assumption</p><p>They introduce three explicit versions of the CCC assumption: two-way, state-invariant, and time-invariant, and explain how each shapes the way covariates should be handled in DiD. This step is super important because until now most methods implicitly assumed two-way CCC without stating it.</p></li><li><p>Second, they show the problem with existing estimators</p><p>Through theory and Monte Carlo simulations, they show that when CCC fails, widely used estimators (TWFE<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, CS-DiD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, imputation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>, FLEX<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>) can produce biased ATT estimates. This bias appears even if you have correctly specified the rest of your model and your data meets other standard DiD assumptions. </p></li><li><p>Third, they propose the DiD-INT</p><p>DiD-INT is designed to adjust for covariates in a way that allows CCC to be violated, recover parallel trends hidden by misspecified covariate adjustments, work with staggered adoption and heterogeneous treatment effects, and avoid &#8220;forbidden comparisons&#8221; and negative weighting issues<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. DiD-INT runs in five steps starting with a model selection algorithm that visually checks pre-trends under different ways of interacting covariates with group and time. 
This identifies the correct functional form *before* estimating the ATT.</p></li><li><p>Fourth, they develop a model selection algorithm</p><p>The authors provide a structured sequence to decide how to model covariates: 1) start without covariates and plot pre-trends; 2) if they fail, try covariates under the two-way CCC assumption; 3) if they still fail, interact covariates with group (state-varying DiD-INT) or time (time-varying DiD-INT); 4) if that fails, interact with both (two-way DiD-INT); 5) if pre-trends still look implausible, no DiD method is recommended. This approach is meant to replace the common &#8220;give up when pre-trends fail&#8221; practice with a systematic check for hidden parallel trends.</p></li></ol><p><em>Why is it important?</em></p><p>Many of us include time-varying covariates in our DiD models without thinking too much about whether the relationship between those covariates and the outcome is actually &#8220;stable&#8221;. If that stability (aka the CCC assumption) fails, the resulting ATT estimates can be biased even if everything else about the design looks good<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><p>The problem is that CCC violations are not rare. In real datasets, covariate effects often change over time (due to shifting macroeconomic conditions, policy changes or industry composition) or vary between groups (because of structural differences in demographics, institutions or markets), and in some cases both happen.</p><p>This paper&#8217;s message is that CCC should be treated as a first-order identification issue rather than a minor technical detail. It also offers a way forward when CCC fails: DiD-INT broadens the set of situations where PT can plausibly hold, which should reduce the number of projects that get abandoned just because pre-trends &#8220;don&#8217;t look parallel&#8221; under a misspecified covariate adjustment.</p><p><em>Who should care?</em></p><p>Lots of people (pretty much anyone running DiD models with covariates should care). Applied researchers running DiD models with covariates in repeated cross-sections or administrative datasets (this includes anyone working with survey microdata, labour force stats, education records or health data where both outcomes and covariates change over time); policy evaluators working with interventions that roll out across locations or institutions at different times, where standard estimators might be biased by shifting covariate effects; data scientists in government agencies and research institutes who produce official impact evaluations and must navigate both methodological correctness and practical constraints on variable selection.</p><p><em>Do we have code?</em></p><p>Yes, the authors say: &#8220;to ease in the implementation of the DID-INT estimator we have a package available in <a href="https://github.com/ebjamieson97/DiDInt.jl">Julia</a>. We also have a wrapper for Stata which calls the Julia program to perform the calculations, using the approach in Roodman (<a href="https://journals.sagepub.com/doi/pdf/10.1177/1536867X251341105">2025</a>). The Stata program is available [<a href="https://github.com/ebjamieson97/didintjl">here</a>]. A wrapper in R is forthcoming. The software package allows for cluster robust inference using both a cluster-jackknife and randomization inference. 
The details of these routines, and their finite sample performance is discussed in the companion paper Karim et al. (2025).&#8221;</p><p>In summary, this paper turns something many of us treat as a minor &#8220;specification choice&#8221; (adding covariates) into a core identification concern for DiD. By making the CCC assumption explicit, it shows that the stability of covariate effects is not a given and that violations can bias even modern estimators. This bias can appear whether or not PT hold unconditionally, meaning that covariates can sometimes turn a valid design into an invalid one. So rather than abandoning a project when pre-trends look implausible, the authors show how re-specifying the covariate adjustment can uncover &#8220;hidden&#8221; PT. Their DID-INT estimator (combined with a model selection algorithm) provides a structured way to test for and adjust to CCC violations while avoiding forbidden comparisons in staggered adoption settings<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. For applied work, the contribution is twofold: a warning that covariates can do harm if misused and a practical method to recover unbiased estimates when a key stability assumption fails<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>.</p><h3>Bayesian Sensitivity Analyses for Policy Evaluation with Difference-in-Differences under Violations of Parallel Trends</h3><p><em>(Moving away from our frequentist framework&#8230;)</em></p><h5>TL;DR: this paper presents a Bayesian sensitivity analysis for DiD when the PTA is likely violated. The approach models the size and persistence of violations with an AR(1) process and allows these parameters to be fixed, fully estimated within a Bayesian model or calibrated from pre-treatment data. The authors demonstrate how treatment effect estimates change under different assumptions about violations by using the example of beverage sales data from Philadelphia and Baltimore.</h5><p><em>What is this paper about?</em></p><p>In this paper the authors examine how to use DiD when the PTA is likely violated. They propose a Bayesian sensitivity analysis that introduces a formal parameter for the size and persistence of these violations, modeled with an AR(1)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> process to capture temporal correlation. They present three ways to set the priors for this process: fixed values, fully Bayesian estimation and empirical Bayes calibrated from pre-treatment data. The approach is illustrated by estimating the effect of Philadelphia&#8217;s sweetened beverage tax using Baltimore as a control city, showing how treatment effect estimates change under different assumptions about the violation.</p><p><em>What do the authors do?</em></p><p>The authors start by extending the standard DiD setup to include a term that measures how much the treated group&#8217;s trend could diverge from the control group after treatment. This deviation term is modeled with an AR(1) process, which captures both the average size of the violation and how persistent it is over time. 
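</p><p>To make that concrete, here is a minimal sketch of what such a model could look like in PyMC, on simulated two-city monthly data. This is my own toy version based on the description above, not the authors&#8217; code, and every variable name and prior here is a placeholder:</p><pre><code># Toy Bayesian DiD with an AR(1) deviation from parallel trends.
# Simulated data: 24 months for a treated city and a control city,
# policy starts at month 12. Nothing here is taken from the paper itself.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
T, T0 = 24, 12
months = np.arange(T)
treat = np.repeat([1, 0], T)                        # treated city, then control city
post = np.tile((months >= T0).astype(int), 2)
t_post = np.tile(np.clip(months - T0, 0, None), 2)  # 0, 1, 2, ... in the post period
y = rng.normal(10 + 2 * treat + 1 * post - 1.5 * treat * post, 1)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)               # baseline level
    beta = pm.Normal("beta", 0, 10)                 # treated-city level shift
    gamma = pm.Normal("gamma", 0, 10)               # common post-period shift
    tau = pm.Normal("tau", 0, 10)                   # treatment effect of interest

    # AR(1) deviation from parallel trends in the post period
    rho = pm.Uniform("rho", 0, 1)                   # persistence of the violation
    sigma_d = pm.HalfNormal("sigma_d", 1)           # size of the violation
    eps = pm.Normal("eps", 0, sigma_d, shape=T - T0)
    dev = [eps[0]]
    for s in range(1, T - T0):
        dev.append(rho * dev[-1] + eps[s])
    delta = pm.math.stack(dev)

    mu = alpha + beta * treat + gamma * post + (tau + delta[t_post]) * treat * post
    sigma_y = pm.HalfNormal("sigma_y", 1)
    pm.Normal("y_obs", mu=mu, sigma=sigma_y, observed=y)

    idata = pm.sample()                             # NUTS by default
    # inspect tau's posterior under different choices for rho and sigma_d
</code></pre><p>The interesting choices are all in how you treat rho and sigma_d. 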
They then specify three strategies for setting the priors on the AR(1) parameters: 1) fixed values, where the level of persistence is set in advance and only the size of the deviations varies; 2) fully Bayesian estimation, where the model learns both persistence and deviation size from the data within chosen prior ranges; and 3) empirical Bayes, where these parameters are first estimated from pre-treatment data and then used as priors in the post-treatment model. Next, they apply these models to monthly beverage sales data from Philadelphia (treated) and Baltimore (control) before and after the 2017 sweetened beverage tax. The authors test how large the violation of PT would need to be to make the beverage tax&#8217;s effect statistically insignificant. For most of their models the required violation was *implausibly* large (e.g. for supermarkets, one model required a 980-fold increase in counterfactual sales), which strengthens their conclusion that the tax did have an effect.</p><p><em>Why is this important?</em></p><p>In the two previous papers we went on and on about violations of the PTA. It is super important, but less discussed in Bayesian settings<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>. When the PTA does not hold, the estimated treatment effect can reflect underlying differences between treated and control groups rather than the impact of the policy. Many studies check for PT by looking at pre-treatment data but these tests are often underpowered and can&#8217;t guarantee that trends would have remained aligned after treatment. The Bayesian framework in this paper gives us a more structured way to relax the assumption, quantify the size and persistence of violations and see how the results change under different plausible scenarios. This makes it possible to present policy conclusions alongside a transparent assessment of how much they depend on the PTA.</p><p><em>Who should care?</em></p><p>I feel like I am repeating myself here, but this paper will interest applied researchers who use DiD and worry that the PTA may not hold, particularly in policy evaluations where treated and control groups have different pre-treatment dynamics. It is also relevant for statisticians and econometricians working on sensitivity analysis methods and for policy analysts who need to present results with a clear statement of how robust they are to key assumptions. Anyone who works with short panels, noisy outcomes or limited pre-treatment data will find the discussion of prior choices and empirical Bayes calibration really useful (I did).</p><p><em>Do we have code?</em></p><p>The paper does not share replication files, but it states that the models were implemented in PyMC (version 4.0) using the No-U-Turn Sampler, a form of Hamiltonian Monte Carlo. Since the main addition to a standard Bayesian DiD is the AR(1) process for violations, the method could be reproduced by combining a DiD setup with an AR(1) prior on the deviation term. PyMC has <a href="https://www.pymc.io/projects/examples/en/latest/time_series/AR.html">public</a> examples of AR(1) time-series modeling that can serve as a starting point for implementing this framework.</p><p>In summary, this paper adapts DiD to situations where the PTA is unlikely to hold by modeling violations directly in a Bayesian framework. 
The AR(1) process for the deviation term allows both the size and persistence of violations to be estimated or calibrated from pre-treatment data. The authors show that different prior choices (fixed, fully Bayesian and empirical Bayes) can lead to different conclusions about the policy effect, making sensitivity analysis an essential part of interpretation. The application to Philadelphia&#8217;s sweetened beverage tax demonstrates how this approach can make the robustness of DiD results explicit, rather than leaving it as an untested assumption.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>CBPS is an alternative way of estimating the propensity score (the probability that a unit receives treatment given its observed characteristics) that directly enforces covariate balance between treated and control groups at the estimation stage, which is different to the standard approach which estimates the propensity score first (often using a logit or probit model) and then checks balance afterward. CBPS produces weights that better align the pre-treatment characteristics of treated and untreated units by making balance a &#8220;target&#8221; rather than an &#8220;afterthought&#8221;, which in turn improves the plausibility of the conditional PTA when covariates are important.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, when treated units differ systematically from controls in ways that affect how outcomes evolve over time, such as having younger populations, higher baseline income, or different industry structures, but may be more plausible once these differences are accounted for through covariates. In these cases, simply comparing pre and post-treatment changes between treated and untreated groups would risk conflating treatment effects with the influence of those underlying characteristics, whereas conditioning on covariates can &#8220;help&#8221; isolate the policy&#8217;s impact. Let&#8217;s think of a concrete example related to my area (Econ of Education): consider a policy that expands access to after-school tutoring but is rolled out first in schools serving lower-income communities (let&#8217;s not get into what the policymaker might have in mind such as their specific goals with the policy, whether there is even a real problem this policy would solve, or whether they have a detailed plan for measuring its effectiveness). If we just compared changes in test scores at these schools to changes in more affluent schools without the program, the trends might differ even without the policy because achievement growth often follows different trajectories across socioeconomic groups. But if we adjust for covariates (such as baseline test scores, parents&#8217; education or school resources) the assumption that treated and control schools would have evolved in parallel becomes more plausible, but not completely. Remember that misspecification is a huge issue that can arise from omitted variable bias, wrong functional form, mismeasured variables, incorrect distributional assumptions or inappropriate fixed effects structure. 
If any other important factors remain unobserved (or if they can&#8217;t be measured and theory offers no guidance - aka the unobserved confounders), even conditioning on a &#8220;good&#8221; set of variables won&#8217;t guarantee that the PTA holds. The PTA is an assumption about the true data generating process. Misspecification means your model doesn&#8217;t correctly represent that process, and if it leaves out factors that actually drive differences in trends, the *conditional* PTA will still be violated even if you think you&#8217;ve adjusted for &#8220;enough&#8221; covariates. This is such a good topic, I can recommend some reading: &#8220;What&#8217;s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature&#8221; <a href="https://arxiv.org/pdf/2201.01194">(Roth, Sant&#8217;Anna, Bilinski, Poe, 2023)</a>; &#8220;Difference-in-differences when parallel trends holds conditional on covariates&#8221; (<a href="https://arxiv.org/pdf/2406.15288">Caetano, Callaway, 2024</a>); &#8220;Nothing to see here? Non-inferiority approaches to parallel trends and other model assumptions&#8221; (<a href="https://arxiv.org/pdf/1805.03273">Bilinski, Hatfield, 2019</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In simpler terms, the semiparametric efficiency bound is the &#8220;smallest possible variance an estimator can achieve in a given setting without making strong, unrealistic assumptions&#8221; (e.g. knowing the exact functional form of the model, assuming perfectly normally distributed errors or ruling out heteroskedasticity). Hitting that bound means you&#8217;re extracting the maximum precision possible from your data under credible assumptions. <a href="https://arxiv.org/pdf/1906.10221">The type of modeling used is based on how much information is available about the form of the relationship between response variable and explanatory variables, and the random error distribution</a>. Remember the difference: in a parametric model you fully specify the functional form AND distribution of the errors (e.g. linear regression with normally distributed, homoskedastic errors), and all parameters are finite-dimensional; in a nonparametric model you make NO functional form assumptions, you let the data determine the shape of the relationship (often at the cost of efficiency since you need large samples to get precise estimates); and in a semiparametric model you specify a finite-dimensional parameter of interest (like a treatment effect) BUT leave part of the model (it can be the error distribution or the functional form of covariates) unspecified. Going back to our education examples, let&#8217;s consider estimating the effect of reducing class size on test scores (you can think of a couple of mechanisms in place). In a parametric model, you might assume a linear (up or down) relationship between class size and scores, control for other factors in a fixed way and assume the errors are normally distributed and homoskedastic. In a nonparametric model, you would make no assumption about the shape of the relationship, letting the data reveal it (by having lots of fun plotting the data), but you would need a much larger sample to get a precise answer (which, as applied micro people, we often don&#8217;t have). 
In a semiparametric model, you could focus on estimating the ATE of small classes while leaving the relationship between other covariates (like parents&#8217; income or teacher experience) and test scores unspecified, which gives you a bit more flexibility while retaining more precision than a fully nonparametric approach.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>&#8220;In a missing data model [where the missingness mechanism can be MCAR, MAR or MNAR], an estimator is doubly robust (DR) or doubly protected if it remains consistent when either a model for the missingness mechanism or a model for the distribution of the complete data is correctly specified. In a causal inference model, an estimator is DR if it remains consistent when either a model for the treatment assignment mechanism or a model for counterfactual data is correctly specified. Because of the frequency and near inevitability of model misspecification, double robustness is a highly desirable property&#8221;. (<a href="https://academic.oup.com/biometrics/article/61/4/962/7296220">Bang, Robins, 2005</a>)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Check &#8220;LaLonde (1986) after Nearly Four Decades: Lessons Learned&#8221; (<a href="https://yiqingxu.org/papers/2024_lalonde/imbens_xu24.pdf">Imbens, Xu, 2024</a>) for a look at the LaLonde dataset. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Even with better balance and double robustness, CBPS-DiD (like any DiD estimator) can&#8217;t account for unobserved confounders that affect trends differently between treated and control groups. The conditional PTA must still hold for the set of observed covariates you include.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This is an assumption which says that the effect of each covariate on the outcome is constant either across time, across groups or both. There are three forms: 1) two-way CCC, which is constant across time and groups; 2) state-invariant CCC, which is constant across groups but can vary over time; and 3) time-invariant CCC, which is - as you can guess - constant over time but can vary between groups. Violation means that the &#8220;adjustment&#8221; provided by covariates is not the same for all observations, which leads to bias. Think of it as a recipe for a cake that works the same way and produces the same result whether you bake it in California or New York (it&#8217;s state-invariant), and whether you bake it today or in five years (it&#8217;s time-invariant). 
The recipe is &#8220;stable&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>TWFE &#8220;suffers&#8221; from negative weighting and forbidden comparisons in staggered designs (Goodman-Bacon, 2021).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Callaway&#8211;Sant&#8217;Anna avoids forbidden comparisons by estimating group-by-time ATTs and aggregating, and it defaults to a doubly robust DiD approach that still assumes CCC when covariates are time-varying.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Imputation is an approach that predicts untreated counterfactual outcomes for treated units using a model fit on controls, then takes differences. Can use time-varying covariates but still assumes CCC holds.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>A flexible regression specification allowing for time-varying covariates, but like others, does not address CCC violations.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>In staggered adoption some treated units serve as controls for others which introduces bias when treatment effects are heterogeneous. Negative weights mean some group-time effects get subtracted in aggregation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Even in designs where PT unconditionally hold, adding covariates that violate CCC can create bias that wouldn&#8217;t otherwise exist. Covariates are not always &#8220;safe&#8221; to include and bad covariates can turn a valid design into an invalid one.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>The paper makes it clear that while DiD-INT is unbiased even when CCC fails, this robustness comes at the cost of efficiency. The estimates from DiD-INT will have higher variance (i.e. wider CIs) than a standard estimator. 
We are familiar with this trade-off: gaining accuracy at the cost of some precision.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Given the bias-variance trade-off, the authors suggest a practical path forward: researchers can either use the model selection algorithm to find the most efficient (parsimonious) model that satisfies PT or simply default to the two-way DiD-INT estimator, which is unbiased across all potential CCC violation scenarios - albeit at the cost of <a href="https://x.com/JohnHolbein1/status/1954250222043005225">statistical power</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Macro friends will get this one hehe ;) AR(1) stands for autoregressive process of order 1, it&#8217;s a way to model how a value at one time point depends on its own value in the previous time period + some random noise. Here it means that deviations from PT are assumed to change gradually over time rather than jumping around randomly. Broader explanations <a href="https://econweb.rutgers.edu/ctamayo/teaching/AR(1)_process.pdf">here</a> and <a href="https://julia.quantecon.org/introduction_dynamics/ar1_processes.html">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Bayesian methods are used here because they make it straightforward to incorporate prior beliefs about the size and persistence of violations into the model and to quantify uncertainty about those violations in a coherent way. In a frequentist framework, sensitivity analyses require separate runs for each assumed violation level and they *do not* produce a single posterior distribution that integrates over uncertainty about these parameters. 
The Bayesian approach can treat the violation parameters as random variables (which are neither variable nor random, iykyk), update their distributions with the data and directly propagate that uncertainty into the treatment effect estimates.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[DiD+: Handling Data Constraints, Timing Issues and TWFE Limitations]]></title><description><![CDATA[What to do when data can't be pooled, timing gets messy, and TWFE isn't enough]]></description><link>https://www.diddigest.xyz/p/did-handling-data-constraints-timing</link><guid isPermaLink="false">https://www.diddigest.xyz/p/did-handling-data-constraints-timing</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 01 Aug 2025 15:30:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gult!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gult!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gult!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Gult!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gult!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gult!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gult!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gult!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!Gult!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gult!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gult!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f4a40fa-86ee-45ef-9bad-82ad33686d15_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>(Claude creates the image prompt based on the post and it gets crazier by the day. I appreciate it tho)</em></p><p>Hi! </p><p>Before we start today&#8217;s post, I&#8217;d like to recommend this one &#8220;<a href="https://causalinf.substack.com/p/omitted-variables-versus-replicating">Omitted Variables versus Replicating the RCT</a>&#8221; by prof Scott. For those of you who don&#8217;t know, prof Rubin is the Rubin behind the Rubin causal model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. He&#8217;s one of the founding figures of modern causal inference, who introduced the potential outcomes framework that is now in all discussions of treatment effects, from RTs to DiD models. Also we got a mention on my friend Sam&#8217;s <a href="https://samenright.substack.com/p/links-for-july">post</a>, thanks! 
To conclude the introduction, I know this is not the purpose of this newsletter so I avoid adding links not directly related to the post (for useful ones I have an entire <a href="https://beagietner.github.io/webpage/LearningResources.html">section on my website</a>), but I found a <a href="https://www.linkedin.com/posts/vladislavvmorozov_datascience-statistics-econometrics-activity-7356292489034588160-R0Ul">link</a> on LinkedIn by professor <a href="https://www.linkedin.com/in/vladislavvmorozov/overlay/about-this-profile/">Vladislav Morozov</a> on his Advanced Econometrics course and it is reeeeally good and worth checking out :)</p><p>The content we will be covering today is:</p><ul><li><p><a href="https://arxiv.org/abs/2403.15910">Difference-in-Differences with Unpoolable Data</a>, by Sunny Karim, Matthew D. Webb, Nichole Austin, and Erin Strumpf</p></li><li><p><a href="https://arxiv.org/html/2507.20415v1">Staggered Adoption DiD Designs with Misclassification and Anticipation</a>, by Clara Augustin, Daniel Gutknecht, and Cenchen Liu</p></li><li><p><a href="https://arxiv.org/abs/2507.19099">Interactive, Grouped and Non-separable Fixed Effects: A Practitioner's Guide to the New Panel Data Econometrics</a>, by Jan Ditzen and Yiannis Karavias</p></li></ul><p>And there are three &#8220;applied&#8221; papers I won&#8217;t go through but that might be interesting for you to read:</p><ul><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5337463">Using did multiplegt dyn in Stata to Estimate Event-Study Effects in Complex Designs: Four Examples Based on Real Datasets</a>, by Cl&#233;ment de Chaisemartin and Bingxue Li (I found this on <a href="https://www.linkedin.com/posts/clement-de-chaisemartin-b076135_using-did-multiplegt-dyn-in-stata-to-estimate-activity-7356326827105165313-CkBH">LinkedIn</a>, it&#8217;s a guide that shows you how to estimate dynamic treatment effects in complex DiD designs using the <code>did_multiplegt_dyn</code> Stata command, with real examples involving staggered, continuous, and multi-treatment settings)</p></li></ul><ul><li><p><a href="https://link.springer.com/epdf/10.1007/s11113-025-09968-w?sharing_token=aV9-kUdX2bT7sQzc7BJD5_e4RwlQNchNByi7wbcMAY5isHukhWD934n8c8jM5iu3ZUrNBVx0lwkEvxksRN8cELRkzgOOSWzSy4DVD3ntvPwrHxaUoWF9SwUixTry_rKLxfyuOsxzU-bN-7w80yIqZ6b078Y28sifWNl2uCapIQo%3D">Re-Assessing the Impact of Brexit on British Fertility Using Difference-in-Difference Estimation</a>, by Ross Macmillan and Carmel Hannan (this paper offers a forensic re-assessment of a published study, showing how a statistically significant &#8220;Brexit effect&#8221; on fertility is entirely an artifact of inappropriate control group selection. It provides a practical template for how rigorously ensuring pre-treatment PT can completely reverse a study&#8217;s conclusions. Read it if you, like me, love a good discussion on internal validity).</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5338512">Emerging Techniques for Causal Inference in Transportation: Integrating Synthetic Difference-indifferences and Double/Debiased Machine Learning to Evaluate Japan&#8217;s Shinkansen</a>, by Jingyuan Wang, Shintaro Terabe and Hideki Yaginuma (this paper has all the buzzwords I like + it&#8217;s about Japan and trains :-) What to do when your treated units are unique megacities like Tokyo or Osaka? 
The authors show how traditional DiD can give unstable or biased answers and instead use both Synthetic DiD and Double/Debiased Machine Learning to evaluate Japan&#8217;s bullet train network. They argue that the real power comes from seeing these two completely different, advanced methods point to the same conclusion, which is a great way to build confidence in your findings).</p></li></ul><h3>Difference-in-Differences with Unpoolable Data</h3><p><em>(This one hit close to home)</em></p><h5>TL;DR: <em>s</em>tandard DiD often requires merging datasets, but privacy rules sometimes forbid this. This paper introduces UN-DID, a method that works around this by calculating changes <em>within</em> each secure dataset first. You then combine just the summary results (not the sensitive data) to get a valid treatment effect. The authors provide code to do it.</h5><p><em>What is this paper about?</em></p><p>This paper is about how to estimate treatment effects using DiD when data from treatment and control groups cannot be combined. In many applications, we have to work with administrative data that is stored in separate secure environments such as in different provinces, agencies or countries. Legal or privacy restrictions may prevent pooling these datasets into a single file for analysis. The authors here propose a method called UN-DID that would allow us to estimate DiD models without needing to combine datasets. It works by estimating pre-post differences separately within each &#8220;silo&#8221;, then combining these differences externally to recover the ATT. The method supports covariates, multiple groups, staggered adoption and cluster-robust inference, making it suitable for a wide range of applied settings where conventional DiD is infeasible due to data restrictions.</p><p><em>What do the authors do?</em></p><p>They start with the simple 2x2 case and show that UN-DID gives the same estimate as conventional DiD when there are no covariates, then they extend the method to allow for covariates, staggered treatment timing and multiple jurisdictions.</p><p>UN-DID works by running separate regressions in each data silo to estimate within-group pre-post differences. These differences, along with their standard errors and covariances, are extracted and combined outside the silos to calculate the treatment effect. The method remains valid even when the effect of covariates differs across silos which is a setting where conventional DiD may be biased.</p><p>They support their method with formal proofs, Monte Carlo simulations and two empirical applications using real-world data. In both examples they take pooled datasets and artificially treat them as unpoolable to show that UN-DID produces results that closely match those from standard DiD even in the presence of covariates and staggered adoption.</p><p><em>Why is this important?</em></p><p>Many administrative datasets like health records, tax files, or student registers are stored in secure environments with privacy protections that prohibit exporting or combining data across jurisdictions. These rules are designed to prevent the risk of re-identifying individuals, and as a result, we can&#8217;t pool treatment and control data into a single file, which is what conventional DiD methods require.</p><p>UN-DID makes it possible to estimate DiD effects in these settings without violating data-sharing rules. UN-DID works within the boundaries of legal and ethical data use while still enabling credible policy evaluation. 
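</p><p>Just to make the mechanics concrete, here&#8217;s a rough sketch of the 2x2 logic in Python (simulated data and my own variable names; for real work you&#8217;d use the authors&#8217; packages below rather than this):</p><pre><code># Toy version of the UN-DID idea in the simple 2x2 case: each silo only
# shares a pre/post difference and its standard error, never the microdata.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def fake_silo(extra_jump):
    # stand-in for the data sitting inside one secure environment
    post = np.repeat([0, 1], 200)
    y = 5 + 0.5 * post + extra_jump * post + rng.normal(0, 1, 400)
    return pd.DataFrame({"y": y, "post": post})

def silo_summary(df):
    # run inside the silo: outcome on a post indicator, export only summaries
    fit = smf.ols("y ~ post", data=df).fit(cov_type="HC1")
    return fit.params["post"], fit.bse["post"]

# these two calls would happen in separate silos
gamma_t, se_t = silo_summary(fake_silo(extra_jump=1.0))   # treated jurisdiction
gamma_c, se_c = silo_summary(fake_silo(extra_jump=0.0))   # control jurisdiction

# combined outside the silos; with no covariates this matches conventional DiD
att = gamma_t - gamma_c
se_att = np.sqrt(se_t**2 + se_c**2)   # the silos are independent samples
print(f"ATT = {att:.2f} (se {se_att:.2f})")
</code></pre><p>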
The method also handles situations where covariates affect outcomes differently across silos, which helps avoid bias that can affect standard DiD models. For example, the authors note that researchers have been unable to evaluate policies like Canada&#8217;s national cannabis legalization because they couldn&#8217;t combine provincial data. UN-DID is designed for precisely this kind of setting.</p><p><em>Who should care?</em></p><p>You, me, and anyone who works with administrative data/records behind a government &#8220;paywall&#8221;, which includes even health policy analysts, education researchers, and anyone using secure data that can&#8217;t be exported or merged. UN-DID also matters for applied researchers dealing with staggered adoption or treatment heterogeneity where credible counterfactuals lie in datasets they can&#8217;t directly combine. It gives them a way to recover treatment effects without weakening their design or dropping research questions entirely due to data restrictions.</p><p><em>Do we have code?</em></p><p>To everybody&#8217;s happiness, yes. The authors provide UN-DID software packages in <a href="https://cran.r-project.org/web/packages/undidR/index.html">R</a>, <a href="https://github.com/ebjamieson97/undidPyjl">Python</a>, <a href="https://github.com/ebjamieson97/undid">Stata</a>, AND <a href="https://github.com/ebjamieson97/Undid.jl">Julia</a> (!) which are described in Section 7.3 of the paper. These tools are designed to work within the constraints of siloed data environments, meaning each analyst can run their part locally, and only summary estimates (like pre-post differences and standard errors) need to be exported. A separate user guide with a full empirical example using real-world siloed data is forthcoming.</p><p>In summary, UN-DID is a very practical extension of DiD for settings where legal or privacy rules prevent combining datasets across jurisdictions. It produces valid treatment effect estimates by working within each silo and then combining results externally. The method supports covariates, staggered timing, and heterogeneous effects, and performs well in both simulations and real data. For researchers working with administrative data that can&#8217;t be pooled, it offers a way to keep using DiD without compromising identification (or breaking the law).</p><h3>Staggered Adoption DiD Designs with Misclassification and Anticipation</h3><h5>TL;DR:<em> </em>when treatment dates are misclassified or people anticipate a policy before it officially starts, standard staggered DiD estimators produce biased results. This paper offers a solution by introducing new, bias-corrected estimators that account for these issues and provides a diagnostic test to detect the presence and timing of such misspecification in the data.</h5><p><em>What is this paper about?</em></p><p>This paper studies two issues that many of us face when using staggered DiD: misclassification of treatment timing and anticipation effects. Such problems aren&#8217;t new or even overlooked, but in practice we often do the best we can with the data we have, knowing damn well that policy implementation may not be cleanly observed or that people can react to treatment announcements before the official start date. What&#8217;s been missing is a formal framework to understand how these issues threaten identification and how to adjust for them.</p><p>Staggered adoption designs are now widely used in policy evaluation. 
They extend the basic DiD setup to cases where groups are treated at different times. Recent papers, like <a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948?casa_token=siMgZNpx3aUAAAAA:H1eUmDW-_hg_u1Y3_I2dFL7YLT1NEimYIwxyuhtVFVWHUA6szH2clMS5ukcbd2HLwmZSpekwmg">Callaway and Sant&#8217;Anna (2021)</a> and <a href="https://www.nber.org/system/files/working_papers/w25904/w25904.pdf">De Chaisemartin and D'Haultfoeuille (2020)</a> have improved how we deal with treatment effect heterogeneity and timing, but they assume that treatment dates are correct and that outcomes don&#8217;t change before treatment starts.</p><p>This paper builds on that work and shows what goes wrong when those assumptions don&#8217;t hold. It explains how misclassification and anticipation can distort the comparisons made in staggered DiD and how that leads to biased estimates. It also connects to other research on errors in treatment status, like <a href="https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1468-0262.2006.00756.x?casa_token=q12aufpW1ugAAAAA:4pIHbmyPU0fyII9svBZyUETNewNF0AfN52Ff-Gt87mL-RjkutEXYNtbhwSSXtI6IX4OhgCPMF_6d1LU-">Lewbel (2007)</a> and <a href="https://arxiv.org/pdf/2207.11890">Denteh and Kedagni (2022)</a> and to recent work on checking pre-trends before running DiD such as <a href="https://academic.oup.com/restud/article-pdf/90/5/2555/51356029/rdad018.pdf?casa_token=3_QLmDFhKswAAAAA:AVwP0GDEhVUocpmdeEoLFFHxzBku7wVKczLY_sQMBtygBQIjJ3DGxuqusnkbA--jwJBipYJoKXYEVGY">Rambachan and Roth (2023)</a>.</p><p>In short, the paper shows that when treatment is misclassified or anticipated even the best available DiD estimators may not be reliable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> unless we adjust for those issues, and proposes ways to both detect and correct the resulting bias.</p><p><em>What do the authors do?</em></p><p>The authors make three main contributions. First, they formally characterize the bias introduced by misclassification and anticipation in staggered DiD. They show that TWFE and other estimators aren&#8217;t &#8220;reliable&#8221; when the timing or incidence of treatment is misspecified (even under homogeneous treatment effects). This happens because these estimators end up comparing units that shouldn&#8217;t be treated as untreated, or vice versa, thus creating what the authors call &#8220;forbidden comparisons&#8221;. </p><p>Second, they propose bias-corrected estimators that recover causal effects under these conditions. One version adjusts an existing staggered estimator (from De Chaisemartin and D&#8217;Haultfoeuille, 2020) to account for misclassified timing, while another version targets the ATE among actually treated units (even when actual treatment dates aren&#8217;t observed) by imposing a weak homogeneity assumption on misclassification probabilities, which lets them estimate how many units were really treated at each point in time and then adjust accordingly.</p><p>Third, they introduce a 2-step testing procedure to detect the presence, extent, and timing of misspecification. The first step checks whether the PTA holds using standard pre-trend comparisons. If it passes, then the second step compares outcomes across switchers and not-yet-switched groups in earlier periods to flag potential misclassification or anticipation. 
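</p><p>The paper has its own formal test statistics, but the flavour of that second step is easy to sketch: within the recorded pre-treatment window, check whether units that are about to switch already look different from units whose switch is still further away. Something like this (a generic placebo-style check on simulated data, with placeholder column names, not the authors&#8217; procedure):</p><pre><code># Generic illustration of the idea, not the exact test from the paper:
# look for outcome gaps just before the *recorded* adoption date, using units
# whose recorded adoption is still further away as the comparison.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
units, periods = np.arange(60), np.arange(10)
first_treat = rng.choice([4, 6, 8], size=units.size)      # recorded adoption period
df = pd.DataFrame([(u, t, first_treat[u]) for u in units for t in periods],
                  columns=["unit", "period", "first_treat"])
df["y"] = rng.normal(0, 1, len(df)) + 0.5 * (df["period"] == df["first_treat"] - 1)

pre = df[df["first_treat"] > df["period"]].copy()          # recorded pre-periods only
pre["lead1"] = (pre["period"] == pre["first_treat"] - 1).astype(int)

fit = smf.ols("y ~ lead1 + C(unit) + C(period)", data=pre).fit(
    cov_type="cluster", cov_kwds={"groups": pre["unit"]})

# a clearly non-zero coefficient on lead1 hints at anticipation or a misdated
# treatment variable, rather than (or on top of) a parallel-trends failure
print(fit.params["lead1"], fit.pvalues["lead1"])
</code></pre><p>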
These tests help determine when the bias corrections are needed and whether the assumptions behind them are plausible.</p><p>They support all of this with simulation evidence and an application to the rollout of a computer-based testing system in Indonesian schools. Interestingly, their results suggest that schools were adjusting behaviour even before officially adopting the system, and that failing to account for this leads to underestimating the true treatment effect.</p><p><em>Why is this important?</em></p><p>Staggered adoption designs are popular in applied economics for estimating treatment effects when policies roll out at different times. They handle heterogeneity and exploit timing variation, but rely on hard-to-verify assumptions.</p><p>Two key challenges are misclassification and anticipation. Misclassification happens when recorded treatment timing doesn&#8217;t match actual implementation due to measurement error or delayed adoption. Anticipation occurs when units respond before treatment formally begins, perhaps due to advance announcements. Both distort the comparisons staggered DiD depends on, biasing even sophisticated estimators designed for treatment heterogeneity.</p><p>These problems appear often in practice (you might have come across them yourself). Studies show institutional behaviour shifting before formal implementation (<a href="https://assets.aeaweb.org/assets/production/files/6526.pdf">Bindler and Hjalmarsson, 2018</a>), misalignment between administrative eligibility and actual responses (<a href="https://academic.oup.com/qje/article-pdf/132/3/1165/30646702/qjx008.pdf?casa_token=fSrro-kMLtsAAAAA:3MeYtdO5CDYDWHn_e3KOV4BuZEcByMDBVFkg21B0Luib0e1cV_po8fZMw6NWo_AEeT7yM0wUnXPlCsQ">Goldschmidt and Schmieder, 2017</a>), and anticipatory adjustments before official adoption (<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11216028/">Berkhout et al., 2024</a>). In each case, recorded dates poorly reflect actual exposure or behavioural change. So it&#8217;s welcome to see this paper addressing how timing misspecification creates biased comparisons and further providing diagnostic tools and corrected estimators. Rather than treating timing as a secondary issue, it integrates these concerns directly into identification and estimation, and this is particularly valuable for researchers using administrative or policy data where treatment definitions are inherently noisy (aren&#8217;t they all, to some extent?).</p><p><em>Who should care?</em></p><p>Anyone using staggered DiD designs with administrative or policy data should look this one up. That includes applied researchers studying policy reforms, education interventions, labour regulations or any other setting where treatment doesn&#8217;t arrive cleanly and uniformly. If your treatment variable comes from government records, institutional reports, or eligibility rules (and not from a randomized rollout) then the risk of misclassification or anticipation is probably already in your data. This is also relevant if you&#8217;re working with event study plots and interpreting pre-trends. The paper shows that violations from misclassified timing or early behavioral responses can sneak into the pre-treatment window and look like non-parallel trends even when the actual causal structure is sound. So if you&#8217;ve ever seen a bumpy pre-trend and wondered whether to throw out your design, this gives you a reason to pause and test before you abandon the setup.</p><p><em>Do we have code?</em></p><p>Not yet. 
As of the first version of the arXiv post (July 2025), there&#8217;s no replication repository linked, and the paper doesn&#8217;t mention any software package or implementation details. The estimators and tests are well defined and nothing looks too difficult to implement, but it would be helpful to have code or even a worked example. If that shows up in a later version or companion repo, I&#8217;ll update this post.</p><p>In summary, this paper strengthens the foundation of staggered DiD by tackling something most of us have seen but rarely model directly: the gap between recorded treatment timing and actual exposure. It formalizes how misclassification and anticipation distort group comparisons, shows that even advanced estimators are vulnerable and offers tools to detect and correct those problems. The framework is useful, the diagnostics are intuitive and the estimators are practical. If you work with policy data where timing is a bit messy (which is most of it) this is worth your attention.</p><h3>Interactive, Grouped and Non-separable Fixed Effects: A Practitioner's Guide to the New Panel Data Econometrics</h3><p><em>(Speaking of TWFE, prof Jeffrey was on about it today <a href="https://x.com/jmwooldridge/status/1951269173692092435">her</a>e, and prof Scott&#8217;s 3 latest posts also discuss it <a href="https://causalinf.substack.com/p/which-is-easier-to-interpret-canonical">here</a> and <a href="https://causalinf.substack.com/p/a-design-argument-to-not-use-twfe">here</a> and <a href="https://causalinf.substack.com/p/dropping-treated-units-from-the-control">here</a>. You have a lot of reading to do, I don&#8217;t make the rules :-)).</em></p><h5>TL;DR: this paper is a practical guide arguing that the standard TWFE model is often too restrictive for modern panel data which can lead to biased results when unobserved characteristics have complex, time-varying impacts. It provides an accessible overview of more flexible methods like interactive, grouped, and non-separable fixed effects, and shows through empirical examples that these newer estimators often fit the data better and can significantly change your conclusions.</h5><p><em>What is this paper about?</em></p><p>This is not a &#8220;paper&#8221; per se but a practical guide to the new generation of panel data models that better capture unobserved heterogeneity. The standard TWFE approach assumes that unobserved traits are either fixed over time or shared across units in a given period. But this often misses how units might respond differently to common shocks or how the value of unobserved characteristics can evolve.</p><p>The authors walk us through 3 more flexible frameworks: interactive fixed effects (IFE), which let unobserved unit traits interact with time-varying shocks; grouped fixed effects (GFE), which cluster units into latent groups that share patterns over time; and non-separable two-way (NSTW) models, which allow for nonlinear interactions between units and periods. So rather than introducing new theory, the paper focuses on helping us understand when and how to use these models. It covers the main estimators, lays out key assumptions, compares performance and discusses how to test which model fits best. 
It also includes two empirical examples and a software appendix to make the methods easier to apply.</p><p><em>What do the authors do?</em></p><p>(They do a lot and I learned so many new terms by reading this paper)</p><p>They provide a very structured overview of the main estimation methods for modeling unobserved heterogeneity in panel data using interactive, grouped and non-separable FE. Again, they don&#8217;t propose a new estimator but consolidate and explain a wide set of recent contributions (many of which are *technically* demanding), and translate them into accessible terms for applied people. They cover how to estimate models with IFE, where unobserved unit characteristics interact with time-varying common shocks. For that, they walk through estimation techniques like: iterative least squares (ILS), which alternates between estimating factors and coefficients; penalized least squares (PLS), which uses adaptive LASSO to select the number of factors; nuclear norm regularization (NNR), which reframes factor estimation as a convex optimization problem; IV approaches, which include two-step IV estimators and GMM estimators that avoid estimating incidental parameters; and Common Correlated Effects (CCE), which &#8220;sidestep&#8221; direct factor estimation by using cross-sectional averages as proxies. They also explain how GFE can be estimated by clustering units into latent groups, reducing dimensionality and avoiding bias when the number of time periods is limited. Finally, they discuss non-separable models that go beyond linear IFE structures, allowing for flexible, possibly nonlinear interactions between units and time effects.</p><p>The good thing is that throughout the paper they emphasize the trade-offs: whether you need to estimate the number of factors, how estimation bias behaves in small samples, and which estimators work best when regressors are endogenous, when N and T are large (I never had this issue hehe) or when models include lags.</p><p>They back this up with two empirical applications, one looking at the inflation-growth nexus and another reassessing the Feldstein-Horioka puzzle in international finance, to show that newer models like NSTW can give quite different (and more plausible) results than standard TWFE.</p><p><em>Why is this important?</em></p><p>This is an important guide because most applied people still rely on (or just really like) TWFE even when the assumptions behind it don&#8217;t hold (not our fault!). TWFE assumes that unobserved characteristics are either constant over time (like geography or ability) or affect all units the same way in a given year (like a national shock), but in many panels (like when T is large) these assumptions are too restrictive. Unobserved factors often evolve over time and different units may respond to the same shock in different ways. If we *ignore* this (sometimes we don&#8217;t have an option) and keep using TWFE, the estimates can be biased and misleading, which means we might be drawing the wrong conclusions from the data. And in many cases, even using IFE isn&#8217;t enough: the paper shows that nonlinear and group-based models often fit the data better. This guide matters because it lowers the barrier to using these newer methods: it explains what each estimator assumes, when it &#8220;breaks down&#8221; and how to choose the right &#8220;model&#8221; (prof Jeffrey might have an issue with this nomenclature). It also shows that applying these methods doesn&#8217;t have to be intimidating! 
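</p><p>And to show just how un-intimidating some of it is, here&#8217;s a bare-bones sketch of the CCE idea from the list above (a Pesaran-style mean-group version on simulated data; in practice you&#8217;d reach for the packages in the appendix rather than rolling your own):</p><pre><code># Bare-bones CCE mean-group sketch: proxy the unobserved common factor with
# cross-sectional averages of y and x, run unit-by-unit regressions, then
# average the slopes across units. Toy simulated panel, placeholder names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
N, T = 30, 40
f = rng.normal(size=T)                               # unobserved common factor
rows = []
for i in range(N):
    lam_y, lam_x = rng.normal(size=2)                # unit-specific factor loadings
    x = rng.normal(size=T) + lam_x * f
    y = 1.0 * x + lam_y * f + rng.normal(size=T)     # true slope on x is 1.0
    rows.append(pd.DataFrame({"unit": i, "period": np.arange(T), "y": y, "x": x}))
df = pd.concat(rows, ignore_index=True)

# cross-sectional averages at each period stand in for the unobserved factor
df[["y_bar", "x_bar"]] = df.groupby("period")[["y", "x"]].transform("mean")

slopes = []
for _, g in df.groupby("unit"):
    fit = smf.ols("y ~ x + y_bar + x_bar", data=g).fit()
    slopes.append(fit.params["x"])

beta_mg = np.mean(slopes)                            # CCE mean-group estimate
se_mg = np.std(slopes, ddof=1) / np.sqrt(len(slopes))
print(round(beta_mg, 3), round(se_mg, 3))
</code></pre><p>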
There are diagnostics, workflows and open-source tools that make it manageable.</p><p><em>Who should care?</em></p><p>Anyone working with panel data where unobserved heterogeneity might evolve over time or affect units differently, and that includes applied micro people using survey or firm-level data, macro people modeling country or regional dynamics and financial people studying markets or portfolios over time.</p><p>If you&#8217;re running panel regressions with many time periods or even just worried that your unobserved variables aren&#8217;t neatly time-invariant or uniform across units, this guide gives you the tools to do better. It&#8217;s also useful if you&#8217;re dealing with potential cross-sectional dependence, incidental parameter bias or unclear model fit and want diagnostics that go beyond residual plots. Even if you don&#8217;t plan to use these estimators like right now, understanding them helps you interpret existing empirical work more critically and helps you avoid relying on TWFE out of &#8220;habit&#8221; (or convenience).</p><p><em>Do we have code?</em></p><p>The paper includes a super helpful appendix listing open-source implementations for most of the methods discussed, available in both Stata and R, with some support in Matlab as well. For example: for CCE, you can use the popular <code>XTDCCE2</code> command in Stata or the <code>plm</code> package in R; for the iterative IFE estimator, there's <code>REGIFE</code> in Stata and <code>INTERFE</code> in R; for GFE, the authors point to code available on St&#233;phane Bonhomme's webpage; and for the newer NSTW models, you can find <code>PCLUSTER</code> in R. Some of the very latest estimators may require more hands-on coding, but the paper points to companion codebases and replication files for their applications. It&#8217;s definitely worth checking out!</p><p>In summary, this is a practical and well-organized guide to modern fixed effects models for panel data. It doesn&#8217;t introduce new estimators, but it brings together a wide set of tools that applied people should know about, more so if they&#8217;re still relying on TWFE.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For a chapter about it written by him and prof Imbens, email me or access <a href="https://link.springer.com/chapter/10.1057/9780230280816_28">here</a>. I also highly recommend this <a href="https://projecteuclid.org/journals/statistical-science/volume-29/issue-3/A-Conversation-with-Donald-B-Rubin/10.1214/14-STS489.pdf">interview</a> he did with profs Fan Li and Fabrizia Mealli. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For years researchers used TWFE as the default &#8220;DiD&#8221; estimator, even in staggered settings. It wasn&#8217;t until Goodman-Bacon (2021), Sun and Abraham (2021) and others showed it could mix already-treated units into control groups (the negative weights!) that we realized it was doing more than we thought. 
Now most new work tries to avoid these comparisons but legacy papers and habits still linger.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Heterogeneity, Small Sample Inference, Anticipation, and Local Projections]]></title><description><![CDATA[On uniform confidence bands for conditional effects, a simple approach for small-N inference, refining the "no anticipation" assumption, and the LP-DiD estimator for staggered designs.]]></description><link>https://www.diddigest.xyz/p/heterogeneity-small-sample-inference</link><guid isPermaLink="false">https://www.diddigest.xyz/p/heterogeneity-small-sample-inference</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Tue, 22 Jul 2025 14:55:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!prUs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!prUs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!prUs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!prUs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!prUs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!prUs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!prUs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" 
srcset="https://substackcdn.com/image/fetch/$s_!prUs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!prUs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!prUs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!prUs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1391507b-ad9a-4e59-a888-9e17d8ae6b66_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! I am back from <a href="https://www.linkedin.com/posts/beatrizgietner_some-asked-me-what-i-was-doing-in-china-activity-7351251238140764161-1Wkq?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACTHmhoBpQRxz0hUy8fhU1GuRQuRyKVjdd0">China</a> now, and I have a few <a href="https://x.com/BeatrizGietner/status/1946134194960183653/photo/1">posts</a> lined up on ML and CI in general, but first we will go through the ones related to DiD. </p><p>A bit of housekeeping first. 
We got two package updates when I was away.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ai_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ai_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 424w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 848w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 1272w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ai_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png" width="997" height="954" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:954,&quot;width&quot;:997,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5ai_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 424w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 848w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 1272w, https://substackcdn.com/image/fetch/$s_!5ai_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb131fbe7-d5b3-4bca-ab53-9db351c2c83c_997x954.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Link to post <a href="https://www.linkedin.com/posts/brantly-callaway-b5297092_difference-in-differences-with-a-continuous-activity-7348331291672539139-e__K/">here</a> </figcaption></figure></div><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FH9p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FH9p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 424w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 848w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 1272w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FH9p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png" width="900" height="897" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a231842-3913-4875-bc7e-200328098cbd_900x897.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:897,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/168625655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FH9p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 424w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 848w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 1272w, https://substackcdn.com/image/fetch/$s_!FH9p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a231842-3913-4875-bc7e-200328098cbd_900x897.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Link to post <a href="https://x.com/pedrohcgs/status/1944797761540391026">here</a></figcaption></figure></div><p>We spoke about these papers <a href="https://diddigest.substack.com/p/coming-soon">here</a> and <a href="https://diddigest.substack.com/p/ddd-estimators-distributional-effects">here</a> :) </p><p>And today&#8217;s post is about:</p><ul><li><p><a href="https://arxiv.org/pdf/2305.02185">Doubly Robust 
Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences</a>, by Shunsuke Imai, Lei Qin, and Takahide Yanagi.</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5325686">Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes</a>, by Soo Jeong Lee and Jeffrey M. Wooldridge.</p></li><li><p><a href="https://arxiv.org/abs/2507.12891">Refining the Notion of No Anticipation in Difference-in-Differences Studies</a>, by Marco Piccininni, Eric J. Tchetgen Tchetgen, and Mats J. Stensrud.</p></li><li><p><a href="https://onlinelibrary.wiley.com/doi/10.1002/jae.70000">A Local Projections Approach to Difference-in-Differences</a>, by Arindrajit Dube, Daniele Girardi, &#210;scar Jord&#224;, and Alan M. Taylor (this is the JAE published version - congrats!! - of the WP that <a href="https://x.com/arindube/status/1653046276790054914">went around</a> in 2023. Also for Stata users, there&#8217;s an update to the locproj package <a href="https://www.bbvaresearch.com/en/publicaciones/locproj-and-lpgraph-stata-commands-to-estimate-local-projections/">here</a>).</p></li></ul><h3>Doubly Robust Uniform Confidence Bands for Group-Time Conditional Average Treatment Effects in Difference-in-Differences</h3><p><em>(<a href="https://sites.google.com/view/shunsuke-imai/home">Shunsuke</a> is a second-year PhD student at Kyoto University)</em></p><h5>TL;DR: this paper shows how to study treatment effect heterogeneity in staggered DiD designs when you care about a continuous covariate, like the pre-treatment poverty rate. The authors build on Callaway and Sant'Anna (2021) to estimate group-time Conditional Average Treatment Effects (CATTs) that vary with that covariate. They construct uniform confidence bands so we can see which parts of the curve are meaningful. The method is doubly robust and works under standard identification conditions. The procedure combines parametric estimation for nuisance functions with nonparametric methods for the main parameter of interest. The authors provide an R package (<code>didhetero</code>) to help implement the methods.</h5><p><em>What is this paper about?</em></p><p>This paper is about how to study treatment effect heterogeneity in staggered DiD designs. The goal is to move beyond group-time averages and see how treatment effects vary depending on the value of a continuous pre-treatment covariate, like the poverty rate. The authors focus on estimating group-time Conditional Average Treatment effects on the Treated (CATTs). This tells us, for each group and period, how the effect changes across the distribution of a pre-treatment variable. Instead of asking whether a policy worked, we can ask for whom it worked, and how strongly .</p><p>The second part of the paper is about inference. It is one thing to estimate a curve, but we also want to say which parts of that curve are statistically meaningful. The authors construct uniform confidence bands for the CATT function. These bands help us see where effects are large, where they are noisy, and where they are indistinguishable from zero. The method adapts the doubly robust estimand from <a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948">Callaway and Sant&#8217;Anna (2021)</a> to a more granular, conditional setting. It works under standard assumptions and uses a three-step procedure that combines both parametric and nonparametric estimation. 
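</p><p>As a toy illustration of what such a conditional DiD curve means (simulated data and a crude local-smoothing stand-in, with made-up variable names, not the paper&#8217;s doubly robust three-step estimator), you can smooth the before/after outcome changes of treated and control units against the covariate and take the difference at each covariate value:</p><pre><code>import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(0, 1, size=n)                 # e.g. pre-treatment poverty rate
d = rng.integers(0, 2, size=n)                # treated indicator
tau = 1.0 + 2.0 * z                           # true effect rises with z
dy = 0.5 + d * tau + rng.normal(scale=0.5, size=n)   # before/after change in the outcome

grid = np.linspace(0.05, 0.95, 10)
fit_t = lowess(dy[d == 1], z[d == 1], xvals=grid)    # smoothed change, treated
fit_c = lowess(dy[d == 0], z[d == 0], xvals=grid)    # smoothed change, controls
print(np.round(fit_t - fit_c, 2))                    # roughly 1 + 2 * grid
</code></pre><p>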
The result is a tool that can handle covariate heterogeneity, treatment timing, and proper inference at the same time.</p><p><em>What do the authors do?</em></p><p>They start with the doubly robust estimand from Callaway and Sant&#8217;Anna (2021) and build on it to estimate conditional treatment effects. The main innovation is extending this framework to recover how effects vary with a continuous covariate. To do this, they propose a three-step procedure that combines parametric estimation of nuisance components (the outcome regression and generalized propensity score) in the first stage with nonparametric local polynomial regressions in the second and third stages. This setup captures nonlinearity without forcing a specific shape onto the CATT function.</p><p>For inference, they develop two ways to construct uniform confidence bands: one based on an analytical approximation and another using weighted or multiplier bootstrapping. This part is technically demanding. They show how nonparametric smoothing and the presence of estimated nuisance terms affect the distribution of their estimator, and they prove that the bands have the correct asymptotic coverage. The statistical theory extends results from recent work on conditional average treatment effects in unconfoundedness setups to the staggered DiD setting. The paper includes simulations and discusses practical aspects like how to pick bandwidths and estimate standard errors. It also defines several summary measures based on the CATTs, which can help with interpretation when there are many groups and time periods.</p><p><em>Why is this important?</em></p><p>When we employ DiD, we often end up reporting some average effect for a group or a time period and move on. But averages aren&#8217;t always helpful because they smooth over interesting variation, and sometimes that variation is the whole point. This paper shows how we can move beyond average effects. Instead of asking if a policy worked on average, we can ask who it worked for. Did high-poverty counties see a benefit? This paper shows how to estimate treatment effects that change with covariates like the poverty rate and then build confidence bands around them so we know what we&#8217;re seeing is real.</p><p>It&#8217;s useful because this kind of question comes up all the time in applied work. Think about the minimum wage. The extent to which minimum wage increases reduce poverty depends on the structural relationship between wage gains and job losses. A group average can&#8217;t tell you that, but this method can help assess it. And what&#8217;s nice is that it works with staggered treatment timing, which is the norm in many real-world applications. You don&#8217;t have to pretend that everyone was treated at once or that effects are constant across space and time. You get a picture that&#8217;s both more honest and more informative. That&#8217;s the kind of thing you want when the stakes are high and the outcomes matter.</p><p><em>Who should care?</em></p><p>If you&#8217;re working with staggered treatment and you already use group-time ATT methods like Callaway and Sant&#8217;Anna (2021), you should check this paper. It adds another layer: instead of just estimating how effects vary by group and time, you can now see how they shift across something continuous, like the poverty rate. It&#8217;s helpful if you&#8217;re trying to answer questions like: does this policy work better in richer or poorer counties? Is the effect stronger where unemployment was already high? 
This paper gives you a way to answer those kinds of questions without needing a separate model for every subgroup. And it gives you proper confidence bands so you can see where the heterogeneity is real and where it&#8217;s just noise. The authors also mention that even though their main example is minimum wage and poverty, this method is general and you can apply it to any staggered DiD setup where treatment effects might depend on a baseline variable.</p><p><em>Do we have code?</em></p><p>We have a package: <a href="https://tkhdyanagi.github.io/didhetero/">didhetero</a> (Treatment Effect Heterogeneity in Staggered Difference-in-Differences). It &#8220;provides tools to construct doubly robust uniform confidence bands (UCB) for the group-time conditional average treatment effect (CATT) function given a pre-treatment covariate of interest and a variety of useful summary parameters in the staggered difference-in-differences setup of Callaway and Sant&#8217;Anna (2021)&#8221;.</p><p>In summary, this paper is a nice reminder that treatment effects are more than numbers, they&#8217;re functions, and once we start thinking about them that way, we need the right tools to estimate and interpret them. The authors take a standard DiD setup and give us a way to ask questions like how effects shift with poverty or unemployment, while still keeping the core assumptions intact. You don&#8217;t need to change your identification strategy or build a different model for every subgroup, this method fits into what you already do and just makes it more informative.</p><h3>Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes</h3><p><em>(<a href="https://econ.msu.edu/about/directory/Lee-Soo-Jeong-grad-2019">Soo-jeong</a> is Professor Jeffrey&#8217;s soon-to-be-former student. <a href="https://x.com/jmwooldridge/status/1940979034168742186">Here</a>&#8217;s his thread about this paper)</em></p><h5>TL;DR: standard statistical methods for DiD studies are unreliable when you have a small number of treated or control groups (e.g., a single state or a few hospitals). This paper proposes a simple solution: for each group, calculate the average outcome before the policy and the average outcome after. Then, run a basic cross-sectional regression on these averages. This simple trick provides statistically valid confidence intervals and t-tests, even with extremely small samples, such as one treated unit and two control units. It offers a straightforward and easy-to-implement alternative to more complicated methods like Synthetic Control, giving us a reliable tool for policy evaluation when data is limited.</h5><p><em>What is this paper about? </em></p><p>This paper looks at a familiar setup (tracking outcomes for treated and control units before and after some intervention) and asks what to do when you don&#8217;t have many units on either side. Usually DiD works fine if you have lots of treated and control units since you can lean on large&#8209;sample theory and cluster&#8209;robust errors. Here the authors show a simple trick: collapse each unit&#8217;s time series into two numbers (the pre&#8209;intervention average and the post&#8209;intervention average) then run an ordinary cross&#8209;sectional regression of that time&#8209;collapsed outcome on a treatment indicator. Under the usual linear&#8208;model assumptions (normal errors, constant variance) you get exact inference even if you have just one treated unit and two controls. 
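</p><p>Here is a minimal sketch of that collapse-then-regress idea (toy data, hypothetical column names, classical OLS standard errors; a sketch of the idea, not the authors&#8217; code):</p><pre><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def small_n_did(df, policy_year):
    post_mask = df["year"] >= policy_year
    post = df[post_mask].groupby("unit")["y"].mean()       # post-period average per unit
    pre = df[~post_mask].groupby("unit")["y"].mean()       # pre-period average per unit
    cross = pd.DataFrame({
        "change": post - pre,                              # one observation per unit
        "treated": df.groupby("unit")["treated"].first(),
    })
    # classical (homoskedastic) t-test with n_units - 2 degrees of freedom
    return smf.ols("change ~ treated", data=cross).fit()

# toy example: 1 treated unit, 2 controls, 10 years, policy in year 6
rng = np.random.default_rng(0)
rows = [{"unit": u, "year": t, "treated": int(u == "A"),
         "y": rng.normal() + 2.0 * (u == "A" and t >= 6)}
        for u in ["A", "B", "C"] for t in range(1, 11)]
print(small_n_did(pd.DataFrame(rows), policy_year=6).summary())
</code></pre><p>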
They then extend that to remove unit&#8209;specific trends (fitting a little time trend in the pre&#8209;period, subtracting it off, then averaging), and show we can still do exact t&#8209;tests on that transformed data. Finally, they compare this to synthetic control and synthetic DiD approaches, run simulations, and apply the method to California&#8217;s 1989 smoking restrictions (one treated state) and to staggered &#8220;castle law&#8221; rollouts. The upshot is a very easy&#8208;to&#8208;implement alternative when sample sizes in the cross&#8209;section are small.</p><p><em>What do the authors do? </em></p><p>The authors take a hands&#8209;on look at how to get valid confidence intervals when you have very few treated or control units. They start by showing that you can collapse each unit&#8217;s entire time path into just two numbers (its average outcome before the policy and its average outcome after). You then run an ordinary cross&#8209;sectional regression of those two&#8209;period averages on a simple indicator for whether a unit was treated. Under the familiar assumptions of a linear model with normal, homoskedastic errors, that single regression gives you exact t&#8209;tests and confidence intervals even if you have as few as one treated unit alongside two controls. Next, they make the method more flexible by carving out linear trends at the unit level. They fit a straight&#8209;line trend to each unit&#8217;s pre&#8209;treatment history, subtract it from the full series, and then average the residuals before and after treatment. You can then plug those de&#8209;trended averages into the same cross&#8209;sectional regression and still get exact inference. After laying out these core ideas, they run through simulations that confirm the small&#8209;sample accuracy of their approach, and then they demonstrate it on two real cases: California&#8217;s 1989 smoking ban and a staggered rollout of castle&#8209;law enactments. That empirical work shows the method is fast to implement and gives results that line up well with more elaborate techniques.</p><p><em>Why is this important? </em></p><p>A lot of times we applied researchers don&#8217;t have the benefit of working with large samples due to the lack of proper data. For example, if you&#8217;re studying a law change in one state or a corporate rollout in a handful of markets, the usual cluster&#8209;robust approach can give wildly misleading confidence intervals when you only have a few clusters. This paper offers a way to get honest measures of uncertainty in those settings without complicated bootstrap schemes or heavy-duty synthetic control machinery. By boiling each unit&#8217;s history down to a before&#8209;and&#8209;after average (or its de&#8209;trended version), you end up with a tiny cross&#8209;sectional regression where the standard t&#8209;statistic really follows a t&#8209;distribution. That means you can report confidence intervals you can trust, even with just one treated unit. It gives us a straightforward fallback when we lack large samples or when more elaborate methods are hard to justify in small&#8209;n contexts. </p><p><em>Who should care? </em></p><p>This paper will matter to anyone running a policy evaluation where you don&#8217;t have the luxury of dozens of treated and control units. 
If you&#8217;re an applied economist looking at a single state&#8217;s law change or a public health researcher tracking a handful of hospitals before and after an intervention, you&#8217;ll run into the limits of cluster&#8209;robust inference with small samples. It&#8217;s also useful for consultants or corporate analysts who roll out a pilot program in just a few markets and need reliable uncertainty measures without wrestling with heavy bootstraps or complex synthetic controls. In those cases this simple before&#8209;and&#8209;after averaging trick gives you a clear way to get honest confidence intervals even when your cross&#8209;section is tiny.</p><p><em>Do we have code? </em></p><p>For this paper there isn&#8217;t a dedicated R or Python library you need to install. Once you&#8217;ve collapsed each unit&#8217;s time series into a pre&#8209; and post&#8209;treatment average (or its de&#8209;trended counterpart), you just run an OLS of that two&#8209;period outcome on your treatment indicator. In Stata you could type something like &#8220;reg post_pre_diff treated, robust&#8221; and the built&#8209;in t&#8209;statistic is exact under the paper&#8217;s normal&#8209;error assumptions. If you want an exact p&#8209;value without leaning on normality, you can use the user&#8209;written <code>ritest</code> command for randomization inference. To compare against synthetic DiD you&#8217;d load the <code>sdid</code> package in Stata&#8239;18, but you don&#8217;t need any fancy routines for the core approach. A few lines of code in R or Python (compute the averages, run <code>lm()</code> or <code>statsmodels.OLS</code>) and you&#8217;re done.</p><p>In summary, this paper serves as a reminder that sometimes the simplest solution is the most effective. In a field that often trends toward more complex estimators, the authors show how a clever data transformation combined with foundational econometric principles can solve a difficult and common inferential problem. Here we can see a &#8220;DiD problem&#8221; through a different lens. Soo-jeong and Prof Jeffrey provide a method that is transparent, easy to implement and statistically sound even in small samples. It empowers us to make credible claims in the exact settings where evidence is often most needed but hardest to analyze.</p><h3>Refining the Notion of No Anticipation in Difference-in-Differences Studies</h3><p><em>(I really enjoyed reading this paper - it did have more words than the others)</em></p><h5>TL;DR: this paper resolves a common point of confusion about the &#8220;no anticipation&#8221; assumption in studies that employ DiD. We often worry that people might change their behaviour in anticipation of a policy, but the formal assumption simply states that future treatments can&#8217;t affect past outcomes, which is a trivial point about the arrow of time. The authors argue this mismatch stems from conflating the policy&#8217;s implementation with the decision to implement it. By introducing a separate variable for the &#8220;plan&#8221; or &#8220;decision&#8221; (P), they provide a clearer framework. This clarifies what &#8220;no anticipation&#8221; really means, shows how the standard DiD estimator can be biased if people react to the plan (as expected) and helps us specify whether we are estimating the effect of the plan itself or its ultimate implementation.</h5><p><em>What is this paper about?</em></p><p>This paper tackles a subtle but important ambiguity at the heart of DiD methodology. 
Many DiD guides and papers state that a no anticipation assumption is required for the method to be valid. Formally, this assumption is often written in a way that says an intervention at a future time point does not affect an outcome in the past. The problem is that in any standard causal model the future can&#8217;t affect the past &#8220;by definition&#8221;. This makes the no anticipation assumption seem either trivially true or completely unnecessary, which may be why some foundational DiD papers don&#8217;t even mention it. This has led to widespread confusion: is this an important identifying assumption we need to worry about, or a redundant statement about time travel? </p><p>The authors argue that this confusion arises because the formal assumption fails to capture what researchers are really concerned about. When researchers talk about anticipation, they don&#8217;t mean that the policy itself reached back in time. They mean that knowledge of a future policy (the plan, the announcement, the waiver application) caused people or firms to change their behaviour before the policy was officially implemented. For example, in a study of Medicaid expansion, insurers might change their premiums as soon as a state announces its plan, well before the expansion actually begins. This paper&#8217;s goal is to resolve this ambiguity by providing a new, expanded causal model that formally separates the policy&#8217;s implementation from the prior decision to implement it. </p><p><em>What do the authors do?</em></p><p>Ok, let&#8217;s go in parts. The authors&#8217; main contribution is to clarify the causal model underlying DiD. They do this by introducing a new variable, P, which represents the plan or decision to implement a policy. This decision P occurs at a specific point in time and is distinct from the policy&#8217;s actual implementation, A_2&#8203;, which it causes. The key is that the decision P can happen before the pre-treatment outcome Y_1&#8203; is measured. With this expanded model, they propose a new, more meaningful no anticipation assumption (Assumption 5): the plan P has no average effect on the pre-treatment outcome Y_1&#8203; for the group that plans to adopt the policy. Unlike the standard assumption, this is a non-trivial statement about behaviour that could be violated in the real world. This framework has a few consequences. </p><p>They show that if a researcher (perhaps implicitly?) assumes parallel trends with respect to the plan (P=0) but there are anticipation effects (their new assumption is violated), then the standard DiD estimator is biased. Proposition 1 demonstrates that the DiD estimator identifies the true Average Treatment Effect on the Treated (ATT_A_2&#8203;&#8203;) minus a bias term, &#968;, which is exactly the effect of the plan on the pre-treatment outcome. They also argue that in some cases, the researcher might be more interested in the effect of the decision itself (ATT_P&#8203;) rather than the effect of the implementation (ATT_A_2&#8203;&#8203;). For example, an announcement of a car recall may cause people to stop driving the faulty cars, meaning the decision had a huge effect even if the subsequent act of seizing the cars had none. 
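</p><p>To see the bias from Proposition 1 in numbers, here is a toy simulation (all values invented, not from the paper): the plan shifts the treated group&#8217;s pre-treatment outcome by psi, so the simple 2x2 DiD recovers the implementation effect minus psi.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # large n so sampling noise is negligible
att, psi = 2.0, 0.5              # implementation effect and anticipation (plan) effect

# control group: mean 0 before, mean 1 after (a common trend of +1)
y1_c = rng.normal(loc=0.0, size=n)
y2_c = rng.normal(loc=1.0, size=n)

# treated group: the plan raises the pre-period outcome by psi,
# implementation raises the post-period outcome by att on top of the +1 trend
y1_t = rng.normal(loc=0.0 + psi, size=n)
y2_t = rng.normal(loc=1.0 + att, size=n)

did = (y2_t.mean() - y1_t.mean()) - (y2_c.mean() - y1_c.mean())
print(did)                       # approximately att - psi = 1.5, not att = 2.0
</code></pre><p>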
They show that under their new no-anticipation assumption, the standard DiD functional identifies ATT_P&#8203;.</p><p>Finally, they extend this logic from the simple two-period case to the more complex staggered adoption setting, showing how their framework can clarify the limited anticipation assumptions used in modern DiD estimators like Callaway and Sant&#8217;Anna (2021).</p><p><em>Why is this important?</em></p><p>This paper provides essential clarity on a foundational assumption that has been a source of confusion for decades. It closes the gap between the informal, intuitive meaning of anticipation and its flawed mathematical formulation. This matters for applied work for two main reasons. </p><p>First, it makes the potential for bias explicit. If you are studying a policy that was announced well in advance, you must now seriously consider whether that announcement itself changed behaviour. If it did, this paper shows that your standard DiD estimate of the implementation&#8217;s effect is likely biased, and it formalizes the source and structure of that bias.</p><p>Second, it forces us to be more precise about what causal question we are asking. Is the goal to measure the effect of a law actually being implemented (ATT_A_2&#8203;&#8203;)? Or is it to measure the total effect of the policy process, starting from its public announcement (ATT_P&#8203;)? These are different questions with different answers and the choice has real consequences for interpretation. This paper provides the formal language needed to distinguish between them and to state the assumptions required to identify each one.</p><p><em>Who should care?</em></p><p>This paper is a must-read for any applied researcher who uses or teaches DiD. This includes economists, political scientists, epidemiologists, and other social scientists who evaluate policies or interventions. It is especially important for those studying policies that are announced publicly before they take effect, which covers the vast majority of new laws and regulations. If your research involves a scenario where people could plausibly react to the knowledge of a future treatment (whether it&#8217;s a minimum wage increase, a new environmental rule, or a coming tax change) this paper provides the conceptual tools to handle it correctly. It fundamentally clarifies the assumptions discussed in popular practitioner guides and recent econometric surveys, making it essential reading for both students and experts in the field.</p><p><em>Do we have code?</em></p><p>No, this is a conceptual and theoretical paper, not a new estimator. It does not come with a new R or Stata package because its purpose is to refine the thinking that precedes the coding. The paper clarifies the causal model, the estimand of interest, and the identifying assumptions. The tools used to calculate the DiD functional (e.g., packages like <code>did</code> and <code>fixest</code> in R, or <code>csdid</code> and <code>reghdfe</code> in Stata) are unchanged. The contribution of this paper is to help you decide which estimand you are targeting (ATT_A_2&#8203;&#8203; vs ATT_P&#8203;) and to be explicit about the no anticipation assumption that your identification strategy relies on.</p><p>In summary, this paper solves a long-running methodological puzzle: why does the no anticipation assumption in DiD appear to be a self-evident statement about the impossibility of time travel? The authors show that researchers have simply been using the wrong formal language for the right intuition. 
The real concern is not about the policy itself affecting the past, but about the plan for the policy affecting behaviour in the present. By formally separating the policy&#8217;s implementation from the decision to implement it, the paper provides a clear and coherent framework for thinking about anticipation effects. It&#8217;s a nice reminder that before we rush to estimate, we must first be precise about what it is we are trying to estimate and under what assumptions. The paper&#8217;s contribution is not a new command to type, but a new clarity of thought.</p><h3>A Local Projections Approach to Difference-in-Differences</h3><h5>TL;DR: this paper offers a solution to the &#8220;negative weighting&#8221; problem that biases standard DiD estimates in settings with staggered treatment adoption. The authors propose an approach based on local projections (LP), a method common in macroeconomics. The &#8220;LP-DiD&#8221; estimator runs a separate, simple regression for each post-treatment period. The innovation is to restrict the sample in each regression to only include newly treated units and &#8220;clean controls&#8221; (units that have not yet been treated). This transparently avoids the problematic comparisons that cause bias, is computationally fast, highly flexible, and can replicate the results of more complex modern DiD estimators.</h5><p><em>What is this paper about?</em></p><p>This paper enters the ongoing conversation about how to properly conduct DiD analysis when treatment is &#8220;staggered&#8221;. A recent wave of econometric research has shown that the traditional two-way fixed-effects (TWFE) regression fails in this setting. The core issue is often called the &#8220;negative weighting&#8221; problem. In a staggered design the TWFE estimator implicitly uses already-treated units as controls for more recently treated units. For example a state that adopted a policy in 2005 might be used as part of the control group for a state that adopts the same policy in 2010. This is a &#8220;forbidden comparison&#8221; because the 2005 state may still be experiencing its own dynamic treatment effects, making it a contaminated or &#8220;unclean&#8221; control. This can lead to the TWFE estimate being a weighted average of the true effects where some weights are negative, sometimes producing an average effect that is nonsensical or even has the wrong sign. In response, this paper proposes an intuitive framework called LP-DiD. It leverages the Local Projections (LP) method to estimate dynamic effects and combines it with a straightforward &#8220;clean control&#8221; condition to sidestep the negative weighting problem entirely. The result is a regression-based tool that is easy to implement and understand, yet powerful enough to stand alongside other recently developed DiD estimators.</p><p><em>What do the authors do?</em></p><p>The authors&#8217; approach breaks the problem down by estimating the treatment effect for each post-treatment period, or horizon, separately. For each horizon, say, two years after treatment, they run a distinct regression. The outcome variable in this regression is the change in the outcome from the period just before treatment up to that two-year mark. The key variable of interest is an indicator that flags units at the exact moment they become treated.</p><p>The real innovation lies in how they construct the sample for each of these regressions. 
To get a clean estimate they only include two groups of units: the newly treated units (at the moment they switch into treatment) and the clean controls. A clean control is a unit that remains completely untreated all the way through the specific horizon being estimated. For the two-year effect, for example, the controls are units that are still untreated two years later. This simple rule is powerful because it guarantees that already-treated units are never used as controls, which is the source of the negative weighting bias in older methods.</p><p>The authors show that this basic procedure estimates a variance-weighted average of the effects across different treatment cohorts. They then demonstrate how to easily recover the more standard, equally-weighted average effect using familiar techniques like weighted least squares or a two-step regression adjustment. This flexible framework is also shown to easily accommodate covariates, situations where treatment isn&#8217;t permanent and different ways of defining the pre-treatment baseline.</p><p><em>Why is this important?</em></p><p>The clean control condition is highly intuitive. It makes it easy to understand and explain exactly which comparisons are being made to identify the treatment effect, in contrast to more &#8220;black-box&#8221; estimators. At its core, the method is just a series of OLS regressions on different subsamples of the data. This makes it computationally fast, which is an advantage when working with very large panel datasets where more complex estimators can be slow. The LP-DiD approach helps demystify the new DiD literature. The authors show that their estimator, with different weighting schemes, is numerically equivalent or very similar to other leading methods, such as those proposed by Callaway and Sant&#8217;Anna (2021) and Borusyak, Jaravel, and Spiess (2024). This reveals the common logic underlying these different approaches. The framework is easily adapted to different empirical contexts. Researchers can modify the clean control definition for non-absorbing treatments, include covariates, or pool estimates over various horizons, all within a straightforward regression setup.</p><p><em>Who should care?</em></p><p>Any applied researcher using DiD with panel data, especially in settings with staggered treatment adoption, should be aware of this paper. It will be particularly useful for:</p><ol><li><p>Economists, political scientists, and public health researchers looking for a robust, easy-to-implement alternative to traditional TWFE.</p></li><li><p>Researchers who value transparency and want to be able to clearly articulate the identifying assumptions and comparisons underlying their estimates.</p></li><li><p>Analysts working with large datasets (e.g., using administrative or worker-level panel data) who would benefit from a computationally efficient estimation method.</p></li><li><p>Instructors of econometrics courses, as LP-DiD provides a very clear and teachable example of how to solve the negative weighting problem in modern DiD.</p></li></ol><p><em>Do we have code?</em></p><p>Yes. The authors have released a Stata command, <code>lpdid</code>, which implements the estimators discussed in the paper. The paper also provides STATA example files on GitHub. While the method is simple enough to be implemented manually with basic regression commands, the dedicated package streamlines the process of estimation, reweighting, and inference. 
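</p><p>If you want to see the mechanics outside Stata, here is a minimal Python sketch of the clean-control regression for a single horizon, assuming an absorbing treatment and a balanced panel with hypothetical column names (unit, year, y, and first_treat, set to infinity for never-treated units). It is a sketch of the idea, not a substitute for <code>lpdid</code>:</p><pre><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def lpdid_horizon(df, h):
    wide = df.pivot(index="unit", columns="year", values="y")
    first = df.groupby("unit")["first_treat"].first()
    rows = []
    for t in sorted(df["year"].unique()):
        if (t - 1) in wide.columns and (t + h) in wide.columns:
            newly_treated = first == t               # units switching into treatment at t
            clean_control = first > t + h            # still untreated through t + h
            keep = newly_treated | clean_control
            rows.append(pd.DataFrame({
                "dy": wide.loc[keep, t + h] - wide.loc[keep, t - 1],   # change from t-1 to t+h
                "D": newly_treated[keep].astype(int),
                "t": t,
            }))
    sample = pd.concat(rows)
    # time effects plus the treatment-switch indicator; cluster by unit
    fit = smf.ols("dy ~ D + C(t)", data=sample).fit(
        cov_type="cluster", cov_kwds={"groups": pd.factorize(sample.index)[0]})
    return fit.params["D"]

# e.g. [lpdid_horizon(df, h) for h in range(5)] traces out the post-treatment path
</code></pre><p>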
The paper also provides a clear example of how to use the built-in <code>teffects ra</code> command in Stata to obtain the reweighted estimates.</p><p>In summary, this paper provides an elegant and intuitive bridge from the world of traditional DiD to the modern literature on robust estimation with staggered treatment. By framing the problem through the lens of local projections, a familiar tool from macroeconomics, the authors show that the notorious &#8220;negative weighting&#8221; bias that plagues TWFE can be solved with a simple and intuitive sample selection rule. The LP-DiD approach estimates dynamic effects by running a series of simple regressions, each focused on a specific post-treatment horizon and using only clean comparisons between newly treated units and those not yet treated. This approach demystifies what many newer DiD methods are doing under the hood, offering us a tool that is not only robust but also exceptionally flexible, transparent, and computationally fast.  </p>]]></content:encoded></item><item><title><![CDATA[Efficient Estimation, Sequential Synthetic Control DiD, and the TWFE Debate]]></title><description><![CDATA[Some practical insights for a more precise and robust estimation.]]></description><link>https://www.diddigest.xyz/p/efficient-estimation-sequential-synthetic</link><guid isPermaLink="false">https://www.diddigest.xyz/p/efficient-estimation-sequential-synthetic</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 27 Jun 2025 14:02:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SwxZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SwxZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SwxZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SwxZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!SwxZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!SwxZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc75444b-3a1b-459c-a63f-61eb225e8aee_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! We are back to talking about DiD in specific :) Here are the latest papers we will discuss:</p><ul><li><p><a href="https://arxiv.org/pdf/2506.17729">Efficient Difference-in-Differences and Event Study Estimators</a>, by Xiaohong Chen, Pedro H. C. 
Sant&#8217;Anna, and Haitian Xie</p></li><li><p><a href="https://arxiv.org/abs/2404.00164">Sequential Synthetic Difference in Differences</a>, by Dmitry Arkhangelsky and Aleksei Samkov</p></li><li><p><a href="https://arxiv.org/abs/2402.09928">When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators</a>, by Tobias R&#252;ttenauer and Ozan Aksoy</p></li></ul><h3>Efficient Difference-in-Differences and Event Study Estimators</h3><p><em>(After all, who never heard someone saying &#8220;this DiD should&#8217;ve been an event study&#8221; at a seminar&#8230;)</em></p><h5>TL;DR: modern DiD estimators handle heterogeneity but often ignore efficiency. This paper builds estimators that hit the semiparametric efficiency bound under standard assumptions, with no functional form restrictions, no extra modelling, just better use of the data. Confidence intervals then shrink, power goes up, and all the pieces are there for anyone working with short panels and staggered treatment.</h5><p><em>What is this paper about?</em></p><p>This paper is about improving the precision of DiD and Event Study estimators. Modern methods account for treatment effect heterogeneity and staggered adoption, but they tend to handle pre-treatment periods and control groups in ad hoc ways. Many estimators either drop baseline periods or assign equal weights to them, even though these choices are rarely supported by theory or data.</p><p>The authors offer a formal solution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>: they build a framework for efficient estimation of DiD and ES parameters that works with short panels, staggered treatment timing, and standard identification assumptions like PT and no anticipation. They show how to characterise these assumptions as conditional moment restrictions and use that structure to derive semiparametric efficiency bounds.</p><p>The main insight is that some pre-treatment periods and control groups carry more information than others. By using optimal weights based on the conditional covariance of outcome changes, the proposed estimators deliver tighter confidence intervals and smaller root mean squared error with no added assumptions and no loss in consistency, which then gives us a principled way to use all available data more effectively.</p><p><em>What do the authors do?</em></p><p>This paper is dense, so we&#8217;ll go section by section (each answers one question). </p><p>After the Introduction, section 2 (Framework, causal parameters, and estimands) answers &#8220;what are we estimating?&#8221;. The authors formalise the setup: short panels, staggered treatment, absorbing states, and potential outcomes indexed by treatment timing. They define the causal parameters of interest (group-time ATTs and event-study summaries) and lay out standard identification assumptions such as random sampling, overlap, no anticipation, and PT. They allow for both post-treatment-only and full-period parallel trends (PT), with or without covariates. </p><p>In section 3 (Semiparametric Efficiency Bound for DiD and ES), they answer &#8220;what&#8217;s the best we can hope for?&#8221;. This is the theoretical core of the paper. They derive semiparametric efficiency bounds for both ATT and ES estimators under the two types of PT. These bounds represent the lowest possible asymptotic variance for any estimator that satisfies the DiD identification conditions. 
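</p><p>As a toy illustration of where such gains come from (made-up numbers, independent noise, not the paper&#8217;s estimator): if several pre-treatment baselines give unbiased 2x2 comparisons with different noise levels, weighting them by their precision beats both equal weighting and keeping only the last pre-period.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
att, n_reps = 1.0, 5000
sd_by_baseline = np.array([2.0, 1.0, 0.5])   # noise of the DiD built on each pre-period
est_last, est_equal, est_ivw = [], [], []
for _ in range(n_reps):
    # an unbiased DiD estimate from each baseline period, with period-specific noise
    did = att + rng.normal(scale=sd_by_baseline)
    w = 1 / sd_by_baseline**2                 # precision (inverse-variance) weights
    est_last.append(did[-1])
    est_equal.append(did.mean())
    est_ivw.append(np.sum(w * did) / np.sum(w))

for name, est in [("last pre-period only", est_last),
                  ("equal weights", est_equal),
                  ("inverse-variance weights", est_ivw)]:
    print(name, round(float(np.std(est)), 3))
</code></pre><p>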
The result is that most DiD estimators are inefficient because they either ignore or misweight pre-treatment periods and comparison groups. They also provide closed-form expressions for the efficient influence function (EIF). These formulas make clear how to combine periods and groups using weights based on the conditional covariance structure of outcome changes. </p><p>In section 4 (Semiparametric efficient estimation and inference), they translate the theory into practice by proposing two-step estimators that use plug-in methods for nuisance components (like outcome regressions and group probabilities) and then apply the EIF weights to get point estimates. They show that this approach can be implemented using either flexible nonparametric methods (like forests or splines) or simple parametric working models. They also explain how to stabilise the procedure by estimating ratios of group probabilities directly. </p><p>In section 5 (Monte Carlo simulations), they calibrate simulations using CPS and Compustat data and compare their efficient estimators to popular alternatives (TWFE, Callaway-Sant&#8217;Anna, Sun-Abraham, Borusyak et al.). Across designs, the efficient estimators reduce root mean squared error and confidence interval width (often by more than 40%) without increasing bias. </p><p>In section 6 (Empirical Illustration), they reanalyse data from <a href="https://www.aeaweb.org/articles?id=10.1257%2Faer.20161038&amp;utm_source=TrendMD&amp;utm_medium=cpc&amp;utm_campaign=American_Economic_Review_TrendMD_1">Dobkin et al. (2018)</a> on the effect of hospitalisation on out-of-pocket medical spending. Using the publicly available Health and Retirement Study (HRS), they estimate treatment effects based on variation in the timing of hospitalisation. Their efficient estimators produce tighter confidence intervals than existing methods. To match the same level of precision with standard DiD approaches, researchers would need about 30% more data. Their analysis also showcases a fascinating result of the method: the ability to use a post-treatment period as a baseline by &#8220;bridging&#8221; through the never-treated group. They also use a form of specification curve analysis to visually assess the stability of the estimates across all valid modelling choices, thus providing evidence for the plausibility of the parallel trends assumption (PTA). </p><p>They conclude by providing practical advice and future directions. They suggest that we should test whether some pre-treatment periods or control groups are less informative or potentially invalid. To support this, they introduce Hausman-type overidentification tests and offer visual tools to assess sensitivity to different weighting schemes. They also point to extensions: nonlinear DiD models, switching treatments, unbalanced panels, and setups without PT. The framework is designed to be flexible, but it&#8217;s built for the standard large&#8209;n, fixed&#8209;T world. When that changes, so should the estimator.</p><p><em>Why is this important?</em></p><p>I love everything about efficiency, but sometimes that&#8217;s too much to ask. Most DiD papers focus on identification and stay there. Estimators are consistent, robust to heterogeneity, and safe, but they leave a lot of precision on the table. In a refreshing way, this paper argues that we can have both: credible identification and minimal variance. By working directly with the structure implied by the PTA, the authors show how to build estimators that are semiparametrically efficient. 
No extra assumptions, no black boxes, no functional form restrictions, and no need for large samples, just better use of the data we already have. One of the key insights of this paper is that DiD setups are nonparametrically overidentified (which means there&#8217;s more information available than most estimators use). That&#8217;s what makes these efficiency gains possible. I also have to note that while negative weights in some DiD estimators can be problematic, here they are not a concern: they arise naturally from the covariance structure and are a feature of optimal estimation, not a bug. Efficiency matters in this case because it gives us tighter confidence intervals, more stable estimates, and more power by using the design more intelligently. The fact that the estimators are Neyman orthogonal adds to their credibility, as it makes them less sensitive to misspecification of the preliminary estimation steps.</p><p><em>Who should care?</em></p><p>Anyone working with DiD or ES designs using short panels, staggered treatment or small samples. If you&#8217;ve ever looked at your standard errors and thought &#8220;this feels bigger than it should be&#8221;, you should read this one carefully. It&#8217;s also useful for methodologists and software developers who want to understand what makes an estimator efficient, and how to design tools that make full use of available data without relying on strong assumptions. Even if you don&#8217;t implement the estimators right away, this paper gives you a benchmark: what is the best you could be doing, given your design?</p><p><em>Do we have code?</em></p><p>Professor Pedro <a href="https://x.com/pedrohcgs/status/1937509086968578419">said</a> &#8220;we will work on packaging everything so you can adopt all this seamlessly&#8221;. The estimators are presented in closed form and the implementation relies on standard tools (regression adjustments, propensity scores and conditional covariances), so while there&#8217;s no package, the ingredients are all there for anyone comfortable with two-step semiparametric estimation (not sure how many of us, to be honest). The simulation and empirical application sections hint at how to put the pieces together, and the estimators can be implemented in R or Python with off-the-shelf tools. </p><p>In summary, this paper takes the designs we already trust and shows how to push them further. The authors prove that standard DiD setups contain more information than we usually use, and they show how to recover it with estimators that are efficient, transparent, and grounded in the assumptions we already make.</p><h3>Sequential Synthetic Difference-in-Differences</h3><h5>TL;DR: this paper introduces the Sequential Synthetic Difference-in-Differences (Sequential SDiD) estimator, a new method for event studies with staggered treatment adoption (particularly useful when you doubt the PTA). It works by applying synthetic control (SC) principles sequentially to cohort-aggregated data; it estimates effects for early-adopting groups, replaces their treated outcomes with imputed counterfactuals, and uses those updated values when estimating effects for later-adopting groups. The authors&#8217; key theoretical result is that this estimator is asymptotically equivalent to an ideal but infeasible &#8220;oracle&#8221; OLS estimator that has access to the unobserved factors driving the violation of PT. 
</h5><p><em>What is this paper about?</em></p><p>The rise of synthetic control (SC) methods, starting with Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, offered applied researchers a new way to construct counterfactuals when PT are unlikely to hold. In 2021, Arkhangelsky, Athey, Hirshberg, Imbens, and Wager gave us the combination of DiD and SC (marrying the structure of SC with the intuition of DiD) in their seminal <a href="https://www.aeaweb.org/articles?id=10.1257/aer.20190159">paper</a>, and now in this one, Arkhangelsky and Samkov extend that framework to settings with staggered treatment adoption (the setting behind many modern policy evaluations).</p><p>They propose a method, Sequential Synthetic DiD, that builds counterfactuals <em>cohort by cohort</em>, using the information from early adopters to inform the estimation for later ones. It operates on cohort-aggregated data, estimating treatment effects one horizon at a time, and updating the data after each step to reflect imputed counterfactuals. This structure helps correct for violations of PT caused by unobserved time-varying confounders while preserving the transparency we value in DiD.</p><p>A key theoretical result is that this estimator behaves like an &#8220;oracle&#8221; (a benchmark) OLS regression that &#8220;knows&#8221; the unobserved interactive fixed effects. This connection provides both valid inference and formal efficiency guarantees. The method handles staggered timing, treatment effect heterogeneity, and violations of PT, and it includes standard DiD and recent imputation estimators (like <a href="https://academic.oup.com/restud/article-abstract/91/6/3253/7601390">Borusyak et al. 2024</a>) as special cases, depending on how the weights are chosen.</p><p><em>What do the authors do?</em></p><p>They introduce a new estimator, Sequential Synthetic Difference-in-Differences (Sequential SDiD), designed for event studies with staggered treatment adoption, where PT may fail due to unobserved time-varying confounders. The method builds on Synthetic DiD (Arkhangelsky et al., 2021)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, but adapts it for aggregated cohort-level data and implements it sequentially. They formalise causality by interpreting the observed outcomes using potential outcomes (Neyman, 1990; Rubin, 1974; Imbens and Rubin, 2015).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>For each adoption cohort, they estimate treatment effects at each horizon and update the observed outcomes with imputed counterfactuals. These updated values are then used in the estimation for later cohorts. This recursive structure is the paper&#8217;s core innovation: it reduces the risk of bias accumulation by preventing treated observations from contaminating the control pool in subsequent steps. 
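</p><p>If it helps to see the recursion, here is a rough schematic of that loop as I read it (my own sketch, not the authors&#8217; Algorithm 1; <em>fit_weights</em> is a hypothetical stand-in for their SDiD-style weighting step, and the uniform placeholder below collapses the scheme towards a plain imputation-style DiD):</p><pre><code>import numpy as np

def sequential_imputation(Y, cohorts, fit_weights):
    """Schematic cohort-by-cohort recursion on cohort-aggregated data.
    Y: (n_cohorts x T) array of cohort-average outcomes.
    cohorts: dict mapping row index to adoption period (None = never treated).
    fit_weights: hypothetical helper returning (pre-period weights, donor weights)."""
    Y = Y.copy()
    effects = {}
    order = sorted((k for k, a in cohorts.items() if a is not None),
                   key=lambda k: cohorts[k])          # earliest adopters first
    processed = []
    for g in order:
        adopt = cohorts[g]
        donors = [k for k, a in cohorts.items() if a is None] + processed
        w_time, w_donor = fit_weights(Y, g, donors, adopt)
        for t in range(adopt, Y.shape[1]):            # one horizon at a time
            # SDiD-style synthetic untreated outcome for cohort g at period t
            y0 = (w_donor @ Y[donors, t]
                  + Y[g, :adopt] @ w_time
                  - w_donor @ (Y[donors, :adopt] @ w_time))
            effects[(g, t)] = Y[g, t] - y0
            Y[g, t] = y0        # impute, so g joins the control pool cleanly
        processed.append(g)
    return effects, Y

# Placeholder weighting: uniform weights over pre-periods and donors.
# The actual method fits regularised SDiD-style weights instead.
def uniform_weights(Y, g, donors, adopt):
    return np.full(adopt, 1 / adopt), np.full(len(donors), 1 / len(donors))

Y = np.array([[1.0, 1.1, 1.2, 2.4, 2.6],   # cohort adopting at t = 3
              [1.0, 1.0, 1.1, 1.2, 2.0],   # cohort adopting at t = 4
              [0.9, 1.0, 1.0, 1.1, 1.2]])  # never treated
print(sequential_imputation(Y, {0: 3, 1: 4, 2: None}, uniform_weights)[0])
</code></pre>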
<p>Also, to make the method directly useful for applied work (you might have asked yourself by now &#8220;but how do I handle control variables?&#8221;), the authors provide significant practical guidance on how to incorporate covariates (in Section 4<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>). They focus on time-invariant characteristics and outline three different strategies: full stratification, a recommended hybrid model that allows for direct application of their algorithm, and a specific approach for cases with group-level treatment assignment.</p><p>They formalise this estimator by connecting it to a theoretical benchmark: an &#8220;oracle&#8221; OLS regression that has access to the true unobserved interactive fixed effects. This oracle represents an idealised model of what applied researchers try to do when including unit-specific trends or unobserved factor structures. The authors then prove that their estimator is asymptotically equivalent to this oracle under &#8220;mild conditions&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Their results provide a strong theoretical backing, which includes valid inference, asymptotic normality, and the first formal efficiency guarantees for an SC-type method.</p><p>A key intermediate result of the paper, presented in Proposition 3.1, is that the infeasible oracle OLS estimator can be represented and computed by a sequential algorithm (Algorithm 2) that mirrors the main Sequential SDiD estimator. The authors note this is a &#8220;result of independent interest&#8221; because it reveals the underlying mechanics of OLS in staggered settings and provides &#8220;new insight into the mechanics of modern imputation estimators&#8221; like Borusyak et al.</p><p>Inference is conducted using the Bayesian bootstrap, and the authors also propose a placebo-style validation check by artificially shifting adoption dates backward (a way to assess model fit in the absence of PT).</p><p>They demonstrate the method&#8217;s performance through both an empirical application (on Community Health Centers, using <a href="https://www.aeaweb.org/articles?id=10.1257/aer.20120070">Bailey and Goodman-Bacon 2015</a>) and two simulation exercises. In settings where standard DiD performs well, Sequential SDiD yields similar results, but in scenarios with unobserved confounding, it remains stable while DiD becomes severely biased.</p><p>At the end, they outline two key limitations. The first is that the method requires reasonably large adoption cohorts to ensure that aggregation averages out idiosyncratic noise. Second, it assumes that idiosyncratic errors are independent across units so that residuals concentrate around zero when averaged. They say that while these assumptions are common in modern DiD applications, they may be restrictive in contexts where individual-level shocks have strong aggregate effects.</p><p><em>Why is this important?</em></p><p>Most modern DiD estimators still depend on some version of the PTA, even those designed for staggered adoption and heterogeneous treatment effects. The PTA means that untreated units can serve as a valid counterfactual for treated ones, which often doesn&#8217;t hold up when treatment is timed by factors we cannot observe or measure. 
In practice, we would try to deal with this by adding unit-specific trends or other proxies for unobservables, but these adjustments only go so far. </p><p>Sequential SDiD offers an alternative for us because it addresses violations of PT by working with aggregated cohort data and estimating effects sequentially, using SC-style weighting to build credible counterfactuals one step at a time. This setup is far more robust when unobserved time-varying confounders drive both outcomes and treatment timing, which is exactly the kind of threat that undermines &#8220;traditional&#8221; DiD.</p><p>Their method is also grounded in familiar econometric foundations. The authors work in an asymptotic regime with many units and a fixed number of time periods, connecting their framework to the classic moment-based panel data literature (like <a href="https://www.sciencedirect.com/science/article/pii/S1573441284020146">Chamberlain, 1984</a>). They operate in a low-noise environment, similar to the conditions studied in the SC literature (by averaging within large cohorts), which makes the method both practically feasible and theoretically sound. </p><p>Sequential SDiD also connects back to simpler methods. If the regularisation is set very high, the weights collapse to uniform ones, and the estimator reduces to a version of imputation-based DiD, closely related to Borusyak et al. (2024). This makes it easy to interpret and benchmark against more familiar designs.</p><p>Finally, the paper fills a gap in the SC literature by establishing formal efficiency guarantees<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. The authors prove that Sequential SDiD is consistent, unbiased, and achieves first-order efficiency by showing it mimics an oracle OLS estimator that knows the unobserved factors. This is the first time an SC-type method has been shown to reach this level of statistical performance. It offers us a principled and robust estimator for settings with staggered adoption and unobserved confounding.</p><p><em>Who should care?</em></p><p>Anyone working with event studies, staggered adoption or policy evaluations where PT are unreliable should pay attention to this paper. If your treatment timing is potentially related to unobserved factors, or if you&#8217;re worried that units are changing at different underlying rates, Sequential SDiD gives you a way to build more credible counterfactuals. This is useful in applications like health, labour, and education policy where adoption often happens gradually and for reasons we can&#8217;t fully observe.</p><p><em>Do we have code?</em></p><p>No, but the paper includes full algorithmic descriptions and detailed pseudocode for implementation. Algorithm 1 describes the Sequential SDiD procedure and Algorithm 2 outlines the oracle OLS estimator used for benchmarking. You can implement the structure they outline in your preferred programming language. </p><p>In summary, this paper extends Synthetic DiD to settings with staggered treatment, unobserved time-varying confounding and cohort-level aggregation. The proposed Sequential SDiD estimator builds counterfactuals one step at a time, drawing on the structure of synthetic control and the logic of DiD. The authors prove that it behaves like an oracle OLS estimator, offering consistency, valid inference, and formal efficiency guarantees. 
</p><h3>When Can We Use Two-Way Fixed-Effects (TWFE): A Comparison of TWFE and Novel Dynamic Difference-in-Differences Estimators</h3><p><em>(Here we are again, discussing heterogeneous treatment effects&#8230;)</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!swTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c0ee404-cc71-4edd-bc3e-879ff3f7316c_500x625.jpeg" width="500" height="625" alt=""><figcaption class="image-caption">By prof <a href="https://x.com/KhoaVuUmn/status/1939632186799362466/photo/1">Khoa Vu</a> here :)</figcaption></figure></div><h5>TL;DR: this paper is about the ongoing debate around using TWFE in staggered treatment settings, where units receive treatment at different times. The authors explain how TWFE can become biased when treatment effects vary across time or groups. They compare TWFE to five newer DiD estimators using Monte Carlo simulations, testing each one under a range of scenarios, including violations of common assumptions.</h5><p><em>What is this paper about? </em></p><p>This paper walks us through the growing debate over using TWFE estimators in staggered treatment settings (cases where some units are treated earlier than others). The authors explain how TWFE can be biased if treatment effects vary over time or across groups. They then compare TWFE to five alternative DiD estimators using Monte Carlo simulations and show how each performs under different scenarios, including violations of key assumptions. </p><p><em>What do the authors do? </em></p><p>They say they have three main goals in this research. First, they want to make the recent staggered DiD literature more accessible to social scientists. They start by explaining the traditional TWFE estimator and the problems that recent research has identified with it<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Then they walk through the new alternative estimators in a way that&#8217;s easier to understand<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. 
Second, they test how these different estimators perform using real panel data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. They run Monte Carlo simulations with a large sample size and a staggered design where the treatment is spread out over the entire time period, which differs from macro studies where treatment only happens in a few specific periods. In their simulations, they gradually add &#8220;realistic&#8221; complications that deviate from the perfect conditions these estimators were designed for. This lets them see how well each estimator holds up when things aren&#8217;t ideal, which gives practical insights for researchers. Third, at the end they provide recommendations for best practices when analysing this type of data.</p><p><em>Why is this important? </em></p><p>TWFE is common practice in applied work. In simple 2&#215;2 settings, it works as designed, but once treatment is staggered across units, it becomes unclear what TWFE is estimating. Instead of a single contrast, it averages over many group comparisons, and some of those comparisons involve already-treated units acting as controls, which introduces bias. This matters because many real-world treatment effects are not static. In individual-level panel data, for example, effects often build gradually and then fade out. That inverted-U shape is common in studies of life events, shocks and policy reforms. If you only include a single treatment dummy, you are misrepresenting the data-generating process, and TWFE becomes biased as a result.</p><p>This is one of the reasons why we have so many new DiD methods, but the authors point out that the backlash against TWFE has gone a bit too far. There is nothing wrong with using it as long as you model treatment effect heterogeneity in the right way. Event-time indicators are one simple fix, and while the new estimators are designed to handle dynamic effects, they still rely on assumptions, PT chief among them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. When that assumption fails, all estimators perform poorly (and sometimes worse than TWFE).</p><p><em>Who should care? </em></p><p>Anyone doing DiD with more than two time periods. If your treatment is staggered, your effects are dynamic, or your units might anticipate treatment, this paper is super relevant. The audience includes applied economists, sociologists and political scientists working with panel data, particularly those deciding between TWFE and one of the newer DiD estimators.</p><p><em>Do we have code? </em></p><p>The authors say at the end that &#8220;a replication package with the simulation and analysis code is available on the author&#8217;s Github repository [the link will be added]&#8221;. I will update this post when it&#8217;s released. </p><p>In summary, this paper helps bring clarity to a debate that has confused a lot of applied researchers. TWFE is not inherently flawed, but it needs to be specified correctly. If treatment effects vary over time, a single treatment dummy won&#8217;t cut it. Switch to an event-time specification or use one of the newer estimators. But remember: all of them rely on assumptions like PT and no anticipation. 
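</p><p>To make the &#8220;switch to an event-time specification&#8221; advice concrete, here is a toy sketch of the two regressions side by side (my own illustration with made-up variable names and a simulated inverted-U effect, using plain OLS with dummies rather than any particular package):</p><pre><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy staggered panel: 40 units, 10 years; units with unit % 4 == 0 stay
# never treated, the rest adopt in 2014, 2015 or 2016.
rng = np.random.default_rng(0)
df = pd.DataFrame([(i, t) for i in range(40) for t in range(2010, 2020)],
                  columns=["unit", "year"])
df["first_treat"] = df["unit"].map(lambda i: 2014 + i % 3 if i % 4 else np.nan)
df["treated"] = (df["year"] >= df["first_treat"]).astype(int)
df["rel_time"] = (df["year"] - df["first_treat"]).fillna(-999).astype(int)

# Simulated outcome: unit and year effects plus an effect that builds and fades.
effect = {0: 1.0, 1: 1.5, 2: 1.0, 3: 0.5}
df["y"] = (0.05 * df["unit"] + 0.1 * (df["year"] - 2010)
           + df["rel_time"].map(effect).fillna(0.0)
           + rng.normal(0, 0.2, len(df)))

# (1) The single-dummy TWFE that the literature warns about.
static = smf.ols("y ~ treated + C(unit) + C(year)", data=df).fit()

# (2) Event-time specification: one dummy per relative period,
# with the period just before treatment (rel_time = -1) omitted.
for k in sorted(df.loc[df["rel_time"] != -999, "rel_time"].unique()):
    if k != -1:
        df["ev_{}".format(k).replace("-", "m")] = (df["rel_time"] == k).astype(int)
ev_terms = " + ".join(c for c in df.columns if c.startswith("ev_"))
dynamic = smf.ols("y ~ " + ev_terms + " + C(unit) + C(year)", data=df).fit()

print(static.params["treated"])            # one number, averaging over everything
print(dynamic.params.filter(like="ev_"))   # the full dynamic profile
</code></pre><p>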
Each has strengths and weaknesses, and the right choice depends on what you&#8217;re most worried about in your data.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>&#8220;We provide the first semiparametric efficiency bounds for DiD and ES estimators in settings with multiple periods and varying forms of the parallel trends assumption, including covariate-conditional and staggered adoption designs&#8221;. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Professor Abadie is <a href="https://www.nber.org/system/files/working_papers/w22791/w22791.pdf">generally</a> credited as the primary architect of the SC approach. The method creates a &#8220;synthetic&#8221; version of the treated unit by taking a weighted combination of control units that best matches the pre-treatment characteristics of the treated unit, which makes it possible to estimate causal effects in comparative case studies where traditional experimental methods aren&#8217;t feasible. If I&#8217;m not mistaken and didn&#8217;t miss anything, I think the order is: Abadie and Gardeazabal&#8217;s 2003 &#8220;The Economic Costs of Conflict: A Case Study of the Basque Country&#8221;; Abadie, Diamond and Hainmueller&#8217;s 2010 &#8220;Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California&#8217;s Tobacco Control Program&#8221; and 2015 &#8220;Comparative Politics and the Synthetic Control Method&#8221;; and finally Abadie&#8217;s 2021 &#8220;Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>A key difference from the original SDiD estimator is that the weights for the synthetic controls are not constrained to be non-negative.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><figure><img src="https://substackcdn.com/image/fetch/$s_!8sWk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e992092-eae8-4f6c-886b-b54938287df7_1806x521.png" width="1456" height="420" alt=""></figure></div></div>
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I highly recommend <a href="https://academic.oup.com/ectj/article-abstract/27/3/C1/7701402">this</a> paper.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The authors say that time-varying covariates can present theoretical challenges by acting as additional treatment variables and are thus beyond the scope of the analysis of their paper. They propose three distinct strategies for applied researchers. The first, full stratification, involves running the analysis separately for each covariate stratum, though the authors advise that this is often impractical as it may result in subsamples that are too small. Their recommended strategy is a practical hybrid model where additive fixed effects can depend on the covariate but the multiplicative factors are assumed to be common across all units. This allows us to aggregate the data in a way that makes it directly compatible with the main Sequential SDiD algorithm. The final strategy is designed for the specific case where treatment adoption is a deterministic function of the covariates (e.g., a policy assigned at the state level). In this setting, the authors suggest it is more natural to average the data within the groups defined by the covariate rather than by adoption cohort.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This means that the theoretical results are shown to hold even in a &#8220;weak factor&#8221; setting, where the identifying variation from the interactive fixed effects can be small and vanish asymptotically, only as long as it does so slower than the statistical noise. The authors point to the fact that this is especially appealing for empiricists.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The paper frames its results as an answer to a &#8220;long-standing critique of SC methods&#8221;. The critique questions why a researcher should rely on SC weights instead of directly estimating the underlying factor model that motivates the method. The paper&#8217;s theoretical equivalence result (Theorem 3.1) fixes this tension by formally showing that in large samples, their SC-based method is the same as an oracle that does use the unobserved factors directly. 
It clarifies that &#8220;the choice is not between balancing and direct estimation, but rather how to feasibly approximate the same ideal oracle benchmark&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>When treatment is staggered over time (some units get treated earlier than others), TWFE doesn&#8217;t estimate a clean average treatment effect; it&#8217;s &#8220;just&#8221; a weighted mix of many 2&#215;2 DiD comparisons. Some of these comparisons are valid (like early-treated vs never-treated), but others are &#8220;forbidden&#8221; (late-treated units compared to already-treated ones), and these can bias results even if treatment effects are constant. The authors argue that the issue is not that TWFE is inherently broken, but that using a single dummy (0 = untreated, 1 = treated) assumes effects are the same everywhere and constant over time. We all know that&#8217;s rarely the case because real-world treatment effects fade in, fade out, or vary across groups. When we ignore this, TWFE gets it wrong. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948">Callaway and Sant&#8217;Anna (2021)</a> breaks treatment effects into clean group-by-time contrasts and avoids problematic comparisons; <a href="https://www.sciencedirect.com/science/article/pii/S030440762030378X">Sun and Abraham (2021)</a> is a regression-based version of that same logic; <a href="https://academic.oup.com/restud/article-abstract/91/6/3253/7601390">Borusyak et al. (2024)</a> uses untreated and not-yet-treated units to impute what the treated group would have looked like without treatment; <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2021.1891924">Matrix Completion (Athey et al. 2021)</a> is an ML-based imputation method that fills in the untreated matrix using patterns over units and time; and <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3906345">Wooldridge&#8217;s Extended TWFE (2021)</a> is a more flexible version of TWFE that interacts treatment with time and group indicators. On a funny note, I started my PhD in 2020 and all of this was published in the subsequent years. I still haven&#8217;t finished my PhD, so who knows what will happen a year from now.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>They simulate six realistic scenarios that vary along four dimensions: how treatment effects evolve (step, trend-break, inverted-U), whether effects differ across groups, whether units anticipate treatment, and whether treated and untreated units follow PT. For each setup, they look at two things: bias in the overall ATT estimate and bias in time-specific effects. TWFE works fine when effects are static and homogeneous, but once you introduce heterogeneity it breaks (unless you use event-time indicators). Even then, late-period bias can come in. 
The main finding is quite an intuitive one: no estimator is perfect. Some are better with anticipation (Borusyak, Matrix Completion, ETWFE), others handle non-PT better (Callaway and Sant&#8217;Anna, Sun and Abraham). The right choice? As always, &#8220;it depends&#8221; (more specifically, it depends on what you&#8217;re most worried about).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>&#8220;Violations of the parallel trends assumption appear to be more consequential than issues of treatment effect heterogeneity&#8221;. </p></div></div>]]></content:encoded></item><item><title><![CDATA[All About Heterogenous Treatment Effects]]></title><description><![CDATA[The power of disaggregated causal effects.]]></description><link>https://www.diddigest.xyz/p/all-about-heterogenous-treatment</link><guid isPermaLink="false">https://www.diddigest.xyz/p/all-about-heterogenous-treatment</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 20 Jun 2025 13:31:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!m3yS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!m3yS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png" width="963" height="1387" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1387,&quot;width&quot;:963,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2541070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/166140258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m3yS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png 424w, https://substackcdn.com/image/fetch/$s_!m3yS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png 848w, https://substackcdn.com/image/fetch/$s_!m3yS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png 1272w, https://substackcdn.com/image/fetch/$s_!m3yS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec191b7-871d-490a-8d1b-91b4e2d0ecf0_963x1387.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Hello! Today&#8217;s post (somewhat unintentionally) ended up being about four papers that all revolve around the same goal: to push DiD methods further so we can say more than just &#8220;it worked&#8221; or &#8220;it didn&#8217;t&#8221;. You will read about how they help us answer &#8220;for whom did it work? How did it work? 
And are we even measuring the right thing?&#8221;</p><p>Here they are:</p><ul><li><p><a href="https://arxiv.org/pdf/2502.19620">Triple Difference Designs with Heterogeneous Treatment Effects</a>, by Laura Caron</p></li><li><p><a href="https://arxiv.org/pdf/2505.09706">Forests for Differences: Robust Causal Inference Beyond Parametric DiD</a>, by Hugo Gobato Souto and Francisco Louzada Neto</p></li><li><p><a href="https://arxiv.org/pdf/2506.12207">Estimating Treatment Effects With a Unified Semi-Parametric Difference-in-Differences Approach</a>, by Julia C. Thome, Andrew J. Spieker, Peter F. Rebeiro, Chun Li, Tong Li, and Bryan E. Shepherd</p></li><li><p><a href="https://www.rfberlin.com/wp-content/uploads/2025/06/25019.pdf">Child Penalty Estimation and Mothers&#8217; Age at First Birth</a>, by Valentina Melentyeva and Lukas Riedel</p></li></ul><h3>Triple Difference Designs with Heterogeneous Treatment Effects</h3><p><em>(<a href="https://laurakcaron.github.io/">Laura</a> is a PhD candidate at Columbia)</em></p><h5>TL;DR: triple-difference (3DiD) designs are everywhere, but the way we usually interpret the estimates requires careful consideration. In this paper, Laura shows that the usual approach (comparing subgroups assuming one is unaffected) can mislead us and the reader, especially when people in different subgroups respond differently to the treatment (heterogeneous treatment effects). She proposes a new way to think about it: instead of comparing average effects (DATT), she focuses on causal differences (causal difference in average treatment effects on the treated, or CDATT), which isolates how much of the difference is actually due to subgroup status, rather than to the groups being made up of people who were already likely to respond differently. Laura lays out what needs to be assumed for this to be valid, proposes estimators that still work when models are misspecified, and shows through simulations from real data that this actually changes what we take away from some widely cited studies.</h5><p><em>What is this paper about? </em></p><p>This paper is about how we interpret triple-difference (3DiD) estimates and how easily we can get them wrong if we&#8217;re not careful about subgroup comparisons. </p><p>The ATT (Average Treatment Effect on the Treated) compares treated and untreated groups over time. The key assumption here is that the untreated group shows us what would&#8217;ve happened to the treated group in the absence of treatment. That&#8217;s what makes it &#8220;causal&#8221;: you&#8217;re treating the untreated group as a valid counterfactual. The DATT (Difference in ATT) takes that one step further and compares the magnitude of treatment effects across two subgroups. And most importantly, it doesn&#8217;t assume either subgroup is unaffected. It just compares how strongly each group responded to the treatment. </p><p>3DiD designs use these kinds of comparisons across three dimensions, usually time (before vs. after the introduction of a policy), treatment (treated vs. untreated units), and subgroup (which can be a demographic or structural characteristic, e.g. men vs. women, South vs. North). If we were to take a policy mandating paid maternity leave, for example, time would be pre vs. post-policy, treatment could be states that implemented it vs. those that didn&#8217;t, and subgroups would be women of childbearing age vs. older women. 
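</p><p>Before getting into the problem, it may help to see that the raw 3DiD number is just a difference of two 2&#215;2 DiDs, one per subgroup; that observed difference is the DATT. A tiny sketch with made-up cell means (my own illustration, not code from the paper):</p><pre><code>import pandas as pd

# Toy cell means: average outcome by period, state group, and subgroup.
# All numbers and names are invented purely to show the arithmetic.
cells = pd.DataFrame({
    "post":    [0, 0, 0, 0, 1, 1, 1, 1],
    "treated": [0, 0, 1, 1, 0, 0, 1, 1],
    "young":   [0, 1, 0, 1, 0, 1, 0, 1],
    "mean_y":  [10.0, 9.0, 10.5, 9.2, 10.4, 9.5, 10.8, 10.6],
})

def did(d, young):
    """2x2 DiD for one subgroup: (post - pre) change in treated states
    minus (post - pre) change in untreated states."""
    g = d[d["young"] == young].set_index(["treated", "post"])["mean_y"]
    return (g.loc[(1, 1)] - g.loc[(1, 0)]) - (g.loc[(0, 1)] - g.loc[(0, 0)])

datt = did(cells, young=1) - did(cells, young=0)   # the triple difference (DATT)
print(did(cells, young=1), did(cells, young=0), datt)
</code></pre>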
<p>What we usually do is assume one subgroup (e.g., older women) is unaffected by the policy. That group then serves as a kind of within-treatment counterfactual. If their outcomes don&#8217;t change post-policy, any observed differences in the other group (e.g., younger women) get attributed to the policy. This gives you an ATT-style interpretation: &#8220;this group was affected by the policy, relative to a group that wasn&#8217;t&#8221;. But that&#8217;s a strong assumption. And more often than not, it&#8217;s incomplete. What if the &#8220;unaffected&#8221; subgroup was actually affected in some way, like indirectly, or just to a lesser degree? In that case, the 3DiD estimate no longer tells you what you think it does. Laura shows that once you allow for heterogeneity in treatment effects, i.e., once you admit that different subgroups might be differently sensitive to the same policy, the DATT no longer captures the causal effect of being in one group versus the other. It just tells you that outcomes moved differently. But that difference might be driven by other things like occupation, baseline risk, or other characteristics correlated with subgroup status.</p><p>To get around this, she defines a new parameter: the causal DATT (CDATT). Instead of just comparing observed differences, it asks: what would the difference in treatment effects have been if both groups had the same underlying sensitivity to treatment, and the only thing that varied was subgroup status itself? That&#8217;s a much &#8220;cleaner&#8221; question, and also a much harder one to answer. The rest of the paper is about how to answer it properly.</p><p><em>What does the author do? </em></p><p>Laura starts with a common practice: using a 3DiD design to compare how different subgroups respond to a policy, assuming one subgroup can stand in as a kind of internal control, which is the standard 3DiD logic used in studies like <a href="https://www.jstor.org/stable/2118071">Gruber (1994)</a>, <a href="https://www.sciencedirect.com/science/article/pii/S092753710300037X">Baum (2003)</a>, and more recently <a href="https://academic.oup.com/qje/article-abstract/136/1/169/5905427">Derenoncourt and Montialoux (2021)</a>. But she shows that this logic breaks down fast when subgroups differ in how sensitive they are to the treatment.</p><p>Even if identification assumptions are satisfied and your DATT is correctly estimated, the result doesn&#8217;t have a clean causal interpretation. Why? Because if people in one subgroup would have responded more strongly to the policy even if they had been in the other subgroup, the difference in outcomes reflects more than just subgroup status: it reflects unobserved differences in treatment sensitivity. That&#8217;s not a subtle footnote: it completely changes what the parameter means.</p><p>To fix this, she defines a new estimand: the causal DATT (CDATT). It asks: what is the difference in treatment effects due to belonging to one subgroup versus another, ceteris paribus, including how reactive people are to the policy? That&#8217;s a much more policy-relevant question, but it comes at a cost: stronger assumptions, more careful identification, and better estimation tools.</p><p>The DATT is still identifiable under the usual assumptions (no anticipation and parallel trends across subgroups). So if you&#8217;re just interested in whether two groups responded differently to a policy, the DATT will give you that. 
But if what you really want to know is why they responded differently (like whether being in subgroup A versus subgroup B caused that difference), then you need to go further. That&#8217;s what the CDATT is for. To identify it, you need one more assumption: that treatment effect heterogeneity isn&#8217;t itself correlated with subgroup membership. Without that, you&#8217;re just comparing aggregates.</p><p>Laura&#8217;s paper includes both simulations and an empirical re-analysis of Gruber&#8217;s maternity leave data. In the simulations, she shows how the usual DATT and her proposed CDATT can lead to very different conclusions depending on the data-generating process. In the empirical application, she revisits a classic 3DiD design and shows how accounting for treatment effect heterogeneity shifts the interpretation, sometimes quite substantially.</p><p>The technical framework is grounded in the <a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=Bf1tBwAAQBAJ&amp;oi=fnd&amp;pg=PR17&amp;dq=Causal+inference+in+statistics,+social,+and+biomedical+sciences&amp;ots=jf_D9c-TCB&amp;sig=1Z0seY73BE3Rk6D3huatX2raqLk">Imbens and Rubin (2015)</a> potential outcomes model, so if that&#8217;s familiar territory, the paper is easier to follow. But even if it&#8217;s not, the intuition behind her argument is very clear: comparing subgroup outcomes doesn&#8217;t tell you anything causal unless you&#8217;re explicit about what you&#8217;re holding constant.</p><p><em>Why is this important? </em></p><p>3DiD designs are <a href="https://academic.oup.com/ectj/article-abstract/25/3/531/6545797">everywhere</a> in applied work. When the usual DiD assumptions do not quite hold (say, when you cannot confidently claim parallel trends between treated and untreated groups), 3DiD offers a way out. It adds a third dimension (typically a subgroup split) and tries to recover the treatment effect by leveraging variation across time, treatment, and some structural or demographic distinction. But that third layer of complexity opens the door to a whole new set of problems.</p><p>While the DiD literature has already wrestled with treatment effect heterogeneity (between treated and control groups, or over time), less attention has been paid to what happens when subgroups themselves differ in how they respond to treatment. And yet, that is exactly what most 3DiD designs rely on: that one subgroup can stand in as a control for another.</p><p>Laura&#8217;s paper fills that gap. It shows that if we are not careful, we can end up interpreting subgroup differences as causal when they are really just driven by selection, sorting, or other underlying differences in treatment sensitivity. And if the whole point of 3DiD is to recover cleaner effects in messy empirical settings, then failing to address this undermines the design at its core.</p><p>Laura offers a formal framework for thinking about causal subgroup comparisons in 3DiD, shows what needs to be assumed, and provides robust estimators that work in the presence of heterogeneity. It is a nice correction to how subgroup analyses are often handled in 3DiD settings, and it is really useful now that staggered treatments and increasingly rich data structures are becoming more common.</p><p><em>Who should care? </em></p><p>Anyone using 3DiD to compare subgroups. If your paper assumes one group is not affected by the policy, or if you say &#8220;this group was more affected than that one&#8221;, then you should check Laura&#8217;s paper out. 
It matters most if treatment happens at different times (staggered), there&#8217;s the possibility of spillovers, and you&#8217;re comparing groups like men vs. women, or states in the North vs. South. Also if your story is about why the policy worked more for one group than another, this paper shows you what you need to assume to make that claim.</p><p><em>Do we have code? </em></p><p>No public code, but the paper walks through both a simulation and an empirical example using Gruber (1994). The simulation shows how DATT and CDATT can tell different stories depending on how treatment effects vary. The empirical example shows how ignoring that variation can lead you to the wrong conclusion.</p><p>In summary, 3DiD designs are everywhere, but the way we interpret subgroup comparisons often leans too hard on assumptions we do not check. This paper shows how easily treatment effect heterogeneity can throw things off and offers a better way to frame and estimate what we really care about. If you&#8217;re using subgroup comparisons to tell a causal story, this is the paper that tells you what that story actually needs.</p><h3>Forests for Differences: Robust Causal Inference Beyond Parametric DiD</h3><h5>TL;DR: this paper introduces DiD-BCF, a new Bayesian ML method that makes it easier to estimate treatment effects in DiD settings, particularly when we &#8220;have&#8221; heterogeneous treatment effects and staggered adoption. It builds on Bayesian Causal Forests, but reworks the model to fit panel data and common policy setups. The key innovation is a way to make the estimation task simpler by using the parallel trends assumption more effectively. The result is more accurate and flexible estimates of average, group-specific, and conditional treatment effects.</h5><p><em>What is this paper about?</em></p><p>The paper proposes a new non-parametric method for causal inference in DiD settings. It is called DiD-BCF and is designed to deal with two major challenges in modern DiD applications: staggered treatment adoption and treatment effect heterogeneity. To do this, the authors extend Bayesian Causal Forests to panel data settings. Their method estimates average treatment effects (ATT), group-level effects (GATT), and conditional effects (CATT), all within a single flexible framework. A key feature of the method is a new way of using the parallel trends assumption to simplify the estimation task. This helps improve the accuracy and stability of the results, especially when standard parametric DiD methods struggle.</p><p><em>What do the authors do?</em></p><p>They develop a new method (DiD-BCF) that combines the credibility of DiD with the flexibility of Bayesian Causal Forests. Instead of relying on fixed-effects regressions or linear models, they propose a fully non-parametric approach that can handle staggered treatment adoption and treatment effect heterogeneity in a single unified framework.</p><p>They start by generalising the standard DiD model, allowing for complex, nonlinear relationships between outcomes, time, and covariates. Then they extend Bayesian Causal Forests to panel data, so the model can recover dynamic and covariate-specific treatment effects over time. A key part of their strategy is to reparameterise the model using the parallel trends assumption, so instead of forcing the model to learn that treatment effects should be zero before treatment begins, they build that into the structure from the start.
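</p><p>The paper builds on the existing Bayesian Causal Forest machinery, so to give a flavour of that building block, here is a rough, hypothetical sketch that applies the <code>bcf</code> package to a simple two-period panel by modelling the change in the outcome (so the DiD collapses into a cross-sectional problem). This is an illustration of the underlying BCF tool, not the authors&#8217; reparameterised DiD-BCF model, and all variable names are made up:</p><pre><code># Rough sketch: BCF on first-differenced outcomes in a 2x2 DiD setting.
library(bcf)

dy = df$y_post - df$y_pre                    # change in outcome per unit
z  = df$treated                              # 1 = treated, 0 = control
x  = as.matrix(df[, c("x1", "x2", "x3")])    # covariates

# BCF asks for an estimated propensity score as an input
pihat = fitted(glm(z ~ x, family = binomial))

fit = bcf(y = dy, z = z, x_control = x, x_moderate = x,
          pihat = pihat, nburn = 1000, nsim = 1000)

# Posterior mean of the unit-level effects on the change in the outcome
catt_hat = colMeans(fit$tau)
mean(catt_hat)   # a crude aggregate ATT-style summary
</code></pre><p>The paper&#8217;s contribution is exactly what this sketch leaves out: handling many periods, staggered adoption, and building the pre-treatment parallel trends restriction directly into the model.</p><p>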
Building that pre-treatment restriction directly into the model makes estimation more stable and less prone to error.</p><p>To make the model more practical for applied researchers, they also include a procedure that speeds up convergence using a more efficient tree-fitting algorithm before switching to full Bayesian estimation. Finally, they test their method across a range of simulated settings that mirror real-world challenges, such as nonlinear trends, selection into treatment, and heterogeneity in effects, and compare it to leading alternatives like TWFE, DiD-DR, DiD2s, SDID, and DoubleML. Across the board, DiD-BCF performs well, especially in the kinds of cases where standard methods tend to break down. </p><p><em>Why is this important?</em></p><p>In observational settings, we rarely get (to do) randomised experiments. Treated and control units usually differ in ways that matter, whether in background characteristics, exposure, or timing, which makes simple comparisons misleading. That is why quasi-experimental methods like DiD have become central to applied work. DiD gives us a way to estimate causal effects from policy changes or discrete events, as long as the identifying assumptions can be justified.</p><p>Over the past few years, there has been a wave of new methods that improve DiD by dealing with common problems like staggered treatment timing or variation in treatment intensity. But most of these still focus on estimating average effects, either the overall ATT or group-time averages, which is useful, but often not enough.</p><p>In many real-world applications, the question goes beyond &#8220;by how much did the policy work on average&#8221;; it also matters &#8220;for whom did it work&#8221;. We need to understand the heterogeneity in treatment effects if we want to say something about mechanisms, or about which groups benefit more or less from an intervention. This is where CATTs come in: they tell us how effects vary across observable characteristics.</p><p>To estimate CATTs, researchers have started turning to ML tools like Causal Forests. These models are flexible enough to recover treatment effect variation without needing to specify it upfront, and recent work has adapted them for use in DiD settings. If done carefully, this kind of approach lets us combine the identification logic of DiD with the flexibility of non-parametric estimation, enabling us to detect whether something worked, for whom it worked, and when.</p><p>This paper fits right into that agenda. It pushes the literature forward by offering a way to estimate CATTs in panel data under staggered adoption, all while maintaining interpretability and robustness. That makes it especially useful for policy evaluations where average effects might hide meaningful variation across time, space, or population subgroups.</p><p><em>Who should care?</em></p><p>This paper will be especially useful to applied researchers working with panel data, where treatment doesn&#8217;t happen all at once and where you suspect that not everyone is affected in the same way. If you work in labour, education, health, or policy evaluation more generally, and you&#8217;re kinda frustrated with the limitations of TWFE or concerned about heterogeneous effects being averaged away, this model is worth looking into.</p><p>It&#8217;s also useful for people who want unit-level or group-specific effects in addition to a single, aggregate ATT.
And for researchers who want to use ML tools but still work within a familiar causal inference framework like DiD &#8594; this is a bridge between the two.</p><p><em>Do we have code?</em></p><p>The authors rely on existing packages like <code>did2s</code>, <code>did</code>, and <code>DoubleML</code>, and while their proposed model is based on well-established tools like BCF and XBART, some of the benchmark methods (like CFFE and MLDID) couldn&#8217;t be included in the simulations due to GitHub installation issues. Still, the core components for reproducing DiD-BCF and the comparisons are all available or well-documented. If you&#8217;re familiar with R and Bayesian modelling, you should be able to adapt their setup pretty easily.</p><p>In summary, DiD-BCF is an interesting new tool that brings flexibility to DiD. It replaces rigid parametric assumptions with a fully Bayesian, nonparametric model that can handle staggered treatment timing, heterogeneity in effects, and selection on observables, all in one go. By leaning on Bayesian Causal Forests and smartly reparameterising the treatment effect, the method improves estimation accuracy without giving up the core logic of DiD. The result is a model that performs well even when traditional approaches break down, and that opens the door to much richer treatment effect analysis in applied work.</p><h3>Estimating Treatment Effects With a Unified Semi-Parametric Difference-in-Differences Approach</h3><h5>TL;DR: most DiD methods focus on average treatment effects (ATT) and assume parallel trends in means. But when the outcome is skewed, ordinal, or censored, estimating means can be misleading or hard to interpret. In this paper the authors introduce a new semi-parametric DiD estimator that allows researchers to estimate four treatment effects: average, quantile, probability, and Mann-Whitney, using a single model and a unified assumption. The method performs well in simulations and is applied to evaluate the impact of Medicaid expansion on CD4 counts among people living with HIV. </h5><p><em>What is this paper about?</em></p><p>In this paper the authors present a new DiD estimator that can recover a quite comprehensive range of causal effects, not just averages but also quantiles, probabilities, and rank-based measures. And they do so using a single model and a single identification assumption. Their key idea is to replace traditional mean-based comparisons with a <a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2024.2315667">semi-parametric cumulative probability model (CPM)</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Instead of assuming a specific transformation (like logs) to deal with skewed outcomes, the CPM treats the outcome as a monotonic transformation of a latent variable. </p><p>The authors focus on four causal estimands, all defined in terms of the marginal distributions of potential outcomes for the treated group: the average effect on the treated (ATT), the effect on a specific quantile of the treated outcome distribution (QTT), the change in the probability that the outcome is below a given threshold (PTT), and a rank-based measure that captures the probability a treated person would outperform a comparable untreated person (MTT). Because these estimands depend only on marginal, not joint, potential outcome distributions, the method requires fewer assumptions than alternative approaches.
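</p><p>As a point of reference, the CPM building block is already available in R via <code>rms::orm</code>. The sketch below only shows that building block with a DiD-style specification on made-up variable names; recovering the ATT, QTT, PTT, and MTT from the fitted model requires the counterfactual-distribution steps developed in the paper:</p><pre><code># Fit a cumulative probability model (CPM) with a DiD-style specification.
# orm() leaves the outcome transformation unspecified, so a skewed outcome
# like a CD4 count does not need to be log-transformed first.
library(rms)

fit_cpm = orm(cd4 ~ treated * post + age + sex, data = df)
print(fit_cpm)
</code></pre><p>The authors then push the fitted distributions through the latent-scale parallel trends assumption to back out the four estimands.</p><p>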
All four effects are then identified using a single conditional parallel trends assumption, made on the latent scale (rather than on the original outcome scale or after transformation). This assumption states that, conditional on covariates, the untreated change over time for the treated group would have followed the same trend as the control group, not in observed outcomes, but in the underlying latent variable used by the model.</p><p><em>What do the authors do?</em></p><p>The authors&#8217; proposed DiD estimator can recover the four aforementioned types of causal effects (ATT, QTT, PTT, and MTT) using a single model and a single identification strategy, which is remarkable because it contrasts with most existing approaches, where estimating each of these effects would require a separate set of assumptions and often a different estimation method. </p><p>To operationalise their approach, the authors define a semi-parametric DiD estimator based on a CPM, allowing them to handle skewed, ordinal, or otherwise non-linear outcome distributions without requiring a pre-specified transformation. Then they move on to formally define the four estimands using only marginal potential outcome distributions for the treated group. This keeps things simpler by not trying to model how each treated person&#8217;s outcome would have compared to their own untreated potential outcome, which usually requires stronger, less realistic assumptions. </p><p>As mentioned, they rely on a single identification assumption, which is a conditional parallel trends assumption on the latent outcome scale. Unlike the usual parallel trends in means (or transformed means), their assumption is made on the underlying latent variable. They then evaluate performance through simulation, varying sample sizes (n = 200 to 2,000) and generating a large pseudo-population to benchmark the true values of each estimand. The simulations show good performance even in small samples and under data skewness.</p><p>In the last part of the paper, they apply the method to real data by estimating the impact of Medicaid expansion in the U.S. on CD4 cell count at HIV care entry (a classic example of a skewed, clinically relevant outcome). They find that the policy had a broad positive impact, affecting all four estimands.</p><p><em>Who should care?</em></p><p>Have you log-transformed your variables before? Does your data look non-normal? If so, you should read this paper. Also, pretty much anyone else wanting to model DiD with &#8220;non-standard outcomes&#8221; (skewed, ordinal, censored, etc.) will find this paper useful. I&#8217;d say it&#8217;s particularly useful for applied researchers in health and labour economics, where we often find rank-based or distributional effects to matter more than mean shifts.</p><p><em>Do we have code?</em></p><p>Not yet, but hopefully in the future.</p><p>In summary, this paper introduces a flexible, semi-parametric DiD method that lets you estimate a wide range of treatment effects using a single model and a single identification assumption. It&#8217;s a compelling alternative to the usual mean-based approaches, especially when dealing with skewed or messy outcomes.
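</p><p>If the Mann-Whitney idea is new to you, here is a tiny descriptive illustration of the probabilistic index the MTT builds on, using base R and made-up numbers (this is the plain two-sample version, not the causal MTT from the paper):</p><pre><code># Probability that a randomly drawn treated outcome beats a randomly drawn
# untreated outcome. 0.5 means the two groups are indistinguishable in rank.
y_treated   = c(410, 520, 365, 600, 480)
y_untreated = c(350, 400, 455, 300, 390)

w = wilcox.test(y_treated, y_untreated)$statistic
as.numeric(w) / (length(y_treated) * length(y_untreated))
</code></pre><p>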
And by including estimands like the Mann-Whitney<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> treatment effect, it opens the door to more interpretable, rank-based causal questions that standard DiD methods often lack.</p><h3>Child Penalty Estimation and Mothers&#8217; Age at First Birth</h3><h5>TL;DR:  ever heard economists saying &#8220;the gender pay gap is mostly due to motherhood penalty&#8221;? This paper takes this question and shows we might be underestimating the motherhood penalty (in wage terms) by about 30% because we&#8217;re averaging across mothers who are nothing alike (or, in other words, standard methods bias the estimated penalty by ignoring staggered timing and treatment heterogeneity). The authors then propose a cleaner DiD approach that estimates the penalty separately by age at first birth, and as a result they find big differences both in the penalty&#8217;s size and in what it really means for younger vs older mothers.</h5><p><em>What is this paper about?</em></p><p>&#8220;Motherhood is still costly for the careers of women&#8221;. In this paper, the authors take something we all say (that the gender pay gap is largely driven by the career costs of motherhood) and ask a simple, yet a bit uncomfortable, question: what if we&#8217;ve been measuring those costs incorrectly? The standard approach so far has been a big event study centred on the first birth, tracking earnings before and after, and showing that women&#8217;s earnings fall sharply while men&#8217;s barely move a centimeter. The problem, the authors argue, is that this kind of model implicitly assumes that all mothers are the same. But of course they&#8217;re not. The age at which someone has their first child is correlated with education, career stage, earnings, occupation, and parental background (and lots of other unobserved decisions). So when we pool all mothers together, we&#8217;re collapsing both very different types of women and very different types of penalties into one average, and that average not only misleads but also obscures meaningful variation policymakers should care about.</p><p>They find lots of interesting bits. First, the penalty is actually larger than previously estimated. In their preferred specification, there&#8217;s a cumulative earnings loss of nearly &#8364;30,000 by year four (about &#8364;10,000 more than what a standard event study would suggest), which isn&#8217;t a small correction. It reflects the fact that conventional event studies systematically understate what women&#8217;s earnings would have looked like in the absence of children, largely because their control groups include women who have already had children themselves. That violates the parallel trends assumption and biases the counterfactual downwards, making the penalty look smaller than it actually is.</p><p>Second, the penalty grows with age in absolute terms, but shrinks in relative terms. Older mothers lose more in euros because they were earning more to begin with. But younger mothers lose a larger share of their pre-birth income because they&#8217;re cut off just as their wage growth would have accelerated.</p><p>Third (and this is what I found most interesting), the nature of the penalty differs. For older mothers, the penalty is mostly about reducing hours, exiting temporarily, or giving up seniority. For younger mothers, it&#8217;s about missing the steepest part of the wage trajectory entirely. 
It&#8217;s not the same shock. One is a level shift. The other is a slope change. This is super important from a policy perspective: helping someone who missed a promotion track requires a different tool than helping someone who stepped back from senior management, for example.</p><p><em>What do the authors do?</em></p><p>The paper starts from a now-familiar problem in applied work: when treatment timing is staggered and treatment effects vary across units, conventional event study models can break down. In this case, those models do two things we definitely want to avoid. First, they make forbidden comparisons by including already-treated women in the control group. Second, they suffer from contamination, where estimates for one time period &#8220;bleed&#8221; into another, especially when pre- and post-birth windows aren&#8217;t cleanly separated.</p><p>The root of the problem is that standard event study models pool together younger and older first-time mothers, implicitly assuming they&#8217;re comparable and that the effects of childbirth are uniform across them. But that assumption, as the authors put it, is &#8220;unlikely to hold.&#8221; Mothers at different ages differ in earnings, career stage, education, and trajectories. Pooling across them &#8220;masks&#8221; both the size and the shape of the penalty.</p><p>To fix this, the authors propose a stacked DiD design (closely following <a href="https://www.nber.org/papers/w32054">Wing, Freedman, and Hollingsworth 2024</a>, with attention to overlapping cohorts), combined with a rolling window of control groups by age at first birth. Instead of estimating one big average effect, they estimate separate effects for each age-at-birth group, and they&#8217;re very careful about who gets used as a counterfactual. For each group of treated mothers, the control group is made up of not-yet-treated women who will give birth at slightly older ages, observed before they become mothers. So if you&#8217;re looking at women who give birth at 25, the control group might be 26- or 27-year-olds who are about to give birth but haven&#8217;t yet. That way, everyone in the comparison is close in age, close in life stage, and still on the pre-birth trajectory.</p><p>As they describe it, this &#8220;combination of a stacked DiD with a rolling window of control groups enables [them] to eliminate the issues present in conventional event studies and estimate the age-at-birth-specific effects of childbirth on post-birth labor market outcomes.&#8221; No already-treated mothers in the control group. No artificial smoothing across life stages. Just clean, age-specific estimates of what motherhood does to earnings.</p><p>Once the age-specific penalties are estimated, they&#8217;re aggregated using sample shares as weights. The resulting average is substantially larger than the pooled estimate, and, most importantly, it makes clear that instead of being one number, the penalty is a set of distinct experiences depending on when motherhood begins.</p><p><em>Why is this important?</em></p><p>The motherhood penalty is one of the largest contributors to gender inequality in the labour market, but if we measure it incorrectly, any policy built on that estimate risks being too late, too generic, or simply misdirected. What this paper shows is that how we estimate the penalty matters. Conventional event studies often use control groups that include women who have already had children, violating parallel trends and biasing the counterfactual downwards. 
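</p><p>For readers who have not built a stacked design before, here is a minimal sketch of the general recipe. It follows the generic stacked DiD logic rather than the authors&#8217; exact implementation, and every variable name (<code>panel</code>, <code>age_first_birth</code>, <code>not_yet_mother</code>, <code>event_time</code>) is hypothetical:</p><pre><code># Build one "stack" per age at first birth: treated mothers plus a clean
# control group of slightly older, not-yet-treated women, then pool the
# stacks and estimate with stack-specific fixed effects.
library(fixest)

make_stack = function(df, birth_age, window = 2) {
  # treated: women who have their first child at exactly this age
  treated  = subset(df, age_first_birth == birth_age)
  # clean controls: slightly older not-yet-mothers, observed pre-birth
  controls = subset(df, not_yet_mother == 1)
  controls = subset(controls, age_first_birth %in% (birth_age + 1):(birth_age + window))
  out = rbind(treated, controls)
  out$stack   = birth_age
  out$treated = as.integer(out$age_first_birth == birth_age)
  out
}

stacked = do.call(rbind, lapply(22:35, function(a) make_stack(panel, a)))

# event_time is measured relative to the stack's first-birth year for everyone
est = feols(earnings ~ i(event_time, treated, ref = -1) | id^stack + year^stack,
            data = stacked, cluster = ~ id)
</code></pre><p>The authors&#8217; design adds the crucial ingredient on top of this skeleton: the rolling window of control groups that keeps every comparison within a narrow band of ages and life stages.</p><p>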
Using control groups that include already-treated women is enough on its own to underestimate the true cost of motherhood, especially because what is being missed isn&#8217;t just earnings levels but the growth that would have happened absent childbirth. The timing of motherhood matters because the nature of the penalty changes with age: reducing hours or exiting work hits harder when your earnings are high, but missing wage growth early on can permanently derail a trajectory. This is a methodological paper, but it goes well beyond that. By analysing the effects of motherhood by age at first birth, it opens the door to more targeted and better-informed policy, recognising that women who have children at different stages in their life and career also respond differently to support and constraints.</p><p><em>Who should care?</em></p><p>If you work on gender gaps and use event studies, this paper is a must-read. Same goes if you&#8217;ve ever used the phrase &#8220;motherhood penalty&#8221; without checking whether your estimate holds across women with different life paths. But it also matters more broadly for anyone interested in how we measure inequality, because it reframes the timing of a treatment as a source of information, and not merely a nuisance to be adjusted for. If we keep pooling heterogeneous groups, we risk erasing the very dynamics we claim to study. So even if you do not study gender, if you work with staggered treatments, event-time estimators, or stacked DiD designs, this is a paper worth reading and assigning to your students.</p><p><em>Do we have code?</em></p><p>It&#8217;s an application paper, so no.</p><p>In summary, this paper is a methodological critique with real-world stakes. It shows that when we average across all mothers to estimate the cost of motherhood, we get a number that is not just wrong but misleading. The timing of motherhood shapes the size, shape, and meaning of the penalty, and by pooling across it, we risk flattening nuance into noise. The stacked DiD approach they propose fixes the bias and brings to the surface an economically and policy-relevant heterogeneity that would otherwise be lost. If we care about inequality, then we also need to care about how we measure it.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>A cumulative probability model (CPM) models the probability that the outcome is less than or equal to a given value (P(Y &#8804; y)), based on covariates. Instead of modelling Y directly, it assumes there&#8217;s a latent variable Y* that follows a linear model, and the observed outcome Y is a monotonic transformation of Y*. The function linking Y* and Y (denoted H) is unspecified and estimated from the data, making the model semi-parametric. This allows the CPM to flexibly handle skewed, ordinal, or censored outcomes without needing to pre-specify a transformation like log(Y).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This paper is a joint work between authors from Biostats, Public Health and Economics departments, so the Mann-Whitney parameter might not be familiar to us.
The Mann-Whitney parameter comes from the Mann-Whitney U test (also called Wilcoxon rank-sum test), and it&#8217;s often interpreted as the probability that a randomly selected observation from group A is larger than one from group B. So if group A is treated and group B is untreated, and the probabilistic index is 0.65, it means there&#8217;s a 65% chance a treated person scores higher than an untreated one. In the paper they define a Mann-Whitney treatment effect among the treated (MTT), which is a probabilistic measure of treatment impact. The Mann-Whitney parameter is a descriptive, non-causal rank-based comparison between two groups, while the MTT turns that idea into a causal estimand in DiD by comparing treated individuals&#8217; actual outcomes to their estimated counterfactuals under no treatment, using the rank-based structure of the Mann-Whitney statistic. Instead of asking how much the outcome increased (like the ATT does), it asks a different question: what is the probability that a treated person has a better outcome than a comparable untreated person? As an example, if the MTT is 0.70, it means there&#8217;s a 70% chance that a randomly selected treated individual will have a higher outcome than a randomly selected untreated individual. If the MTT is 0.50, it means treatment had no effect on rank &#8594; treated and untreated units perform equally, on average. The authors show how to estimate the MTT in a DiD setup, and according to them, this is the first time anyone has done this.</p></div></div>]]></content:encoded></item><item><title><![CDATA[In defense of Machine Learning in Economics]]></title><description><![CDATA[Why the common criticisms are outdated, and how new methods are making causal inference stronger.]]></description><link>https://www.diddigest.xyz/p/in-defense-of-machine-learning-in</link><guid isPermaLink="false">https://www.diddigest.xyz/p/in-defense-of-machine-learning-in</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Tue, 10 Jun 2025 11:19:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pc8A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b78580a-83aa-4338-be7b-8fb557cde4b7_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Pc8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b78580a-83aa-4338-be7b-8fb557cde4b7_1024x1024.jpeg" width="1024" height="1024" alt=""></figure></div>
y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there. This post is a &#8220;response&#8221; to a <a href="https://arxiv.org/abs/2409.14202">pre-print</a> I tweeted about last Friday. I read it and thought it was great, and I&#8217;ll write about it in the next newsletter post. But that paper per se is not the topic today, and I apologize if the subject is not of your interest. Please feel free to ignore it if you don&#8217;t need convincing :)</p><p>This post is about me making the case FOR the use of ML in Econ. It feels kind of dated to have to still argue for this, and if &#8220;<a href="https://www.annualreviews.org/content/journals/10.1146/annurev-economics-080217-053433">our goal as a field is to use data to solve problems</a>&#8221; (which I know some will say it isn&#8217;t), then the criticisms warrant even less consideration. I believe the reluctance to adopt ML techniques in Econ stems from two sources<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>: orthodoxy (the tendency to stick to familiar models and econometric traditions) and purity (the idea that if a method isn&#8217;t explicitly tied to a structural model or doesn&#8217;t yield a &#8220;clean&#8221; causal estimate, it&#8217;s not real economics).</p><p>Part of this reluctance also comes from a misperception: that ML is <em>only</em> useful for prediction and therefore has little to contribute to causal inference or theory-driven analysis. But that&#8217;s increasingly out of step with how ML is actually used in empirical research. Consider causal forests which use tree-based methods to identify heterogeneous treatment effects while maintaining the rigor economists demand. Or double machine learning, which combines ML&#8217;s predictive power with traditional causal identification strategies to reduce bias in treatment effect estimation. Regularization techniques (e.g., TMLE) help us focus on the parameters we care about most while still leveraging ML&#8217;s ability to handle high-dimensional data. Rather than talking about replacing economic reasoning, we can reason from the point of view of enhancing it in a way that helps us extract more nuanced insights from complex data while maintaining causal rigor. </p><p>I like to think of these tools as complements, not substitutes, for economic theory. Just like we once borrowed techniques from Stats and Maths, we&#8217;re &#8220;now&#8221; borrowing from CS and engineering. That&#8217;s how fields evolve and strengthen. Professor Athey <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">writes</a>: &#8220;I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics&#8221;. </p><p>Another point that makes it harder to bridge these two fields, though, is that the culture around publication and validation in CS is so different: pre-prints are fast, feedback is open, and iteration happens in public. In Econ, we move at glacial speed, and nothing feels &#8220;legit&#8221; until it survives multiple rounds of refereeing. Another point to consider is that criticism often comes from academia. It barely ever comes from Economists at Amazon or Alphabet. 
Many of the advances in causal ML (like uplift modeling, heterogeneous treatment effects, online experimentation) were incubated in tech companies because they had the scale, the data, and the incentives to care. </p><p>Now, I&#8217;m obviously not suggesting we should abandon academic rigor for industry speed, and I believe both approaches have their strengths. Industry settings often provide the scale and urgency that drive innovation, while academic settings provide the time and incentives for thorough validation. But we can learn from both. </p><p>Just last year, 2/3 of the Nobel in Chemistry was awarded to researchers at Google DeepMind for the creation of AlphaFold<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. This kind of breakthrough didn&#8217;t come out of a &#8220;traditional&#8221; academic lab. It came out of industry research, using ML tools to push the frontier of a completely different field. If Chemistry is giving out Nobels for this work, maybe it&#8217;s time for us to stop treating ML like it&#8217;s not serious science<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>And beyond the science, there&#8217;s a very practical angle too: the academic job market is brutal. There are more brilliant PhDs than tenure-track positions, and even fewer in policy-facing or methodological roles that reward creativity. If we don&#8217;t expose students to modern tools (*especially* ones that are widely used in industry) we&#8217;re not preparing them (and I include myself here) for the full range of careers they may need or want to pursue. The industry isn&#8217;t waiting for us to catch up. The least we can do is make sure our students aren&#8217;t left behind.</p><p>I&#8217;ve had to defend these positions a couple of times already, and I think writing about the experience (in order to prepare others for what they&#8217;ll likely face if they try to do the same) is worth doing. To make this case more concrete, let me center you in the current state of affairs. Understanding the context helps clarify why the current resistance feels both familiar and ultimately misguided.</p><p>But before we move to the critical points, it&#8217;s necessary to acknowledge the giants pushing the fields to intersect: Professors Athey at Stanford and Chernozhukov at MIT. They&#8217;re leading figures at two of the most prestigious Econ departments in the world, representing a geographic and intellectual complementarity that&#8217;s been central to the field&#8217;s development<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Professor Athey, who has one foot in Silicon Valley and the other in academia (she was Chief Economist at Microsoft Research), has done a lot to make ML legible to economists (especially by focusing on causal questions and by making the theory accessible to <a href="https://d2cml-ai.github.io/mgtecon634_r/intro.html">undergrads and postgrads</a>). Her work on causal trees and causal forests is all about taking the structure economists care about and building in the flexibility that real-world data demands. Professor Chernozhukov, meanwhile, has taken the inference side (which makes sense since he&#8217;s an econometrician). 
His work on high-dimensional econometrics (aka the double/debiased ML framework with Belloni and others) showed how you can combine ML&#8217;s predictive power with solid causal identification. It&#8217;s basically the answer to the &#8220;but what about inference?&#8221; pushback you still hear in seminars. Together with Professor Athey&#8217;s frequent collaborator Professor Imbens and early adopters like Professor Mullainathan, they&#8217;ve helped shift the conversation from &#8220;prediction versus causation&#8221; to &#8220;prediction in service of causation&#8221;.</p><p>To put their contributions in perspective, it&#8217;s worth stepping back and tracing how ML and economics first started to intersect. It&#8217;s no easy task to figure out where these two &#8220;fields&#8221; first intersected, or to pinpoint the single <em>first</em> economics paper that used what we now broadly categorize as ML techniques, partly because the definition of &#8220;machine learning&#8221; has evolved, and partly because some foundational methods were adopted gradually. If we look at when specific techniques came into being, for example, we can start with Ridge Regression (a form of L2 regularization), introduced by Hoerl and Kennard in 1970, and LASSO (Least Absolute Shrinkage and Selection Operator, a form of L1 regularization), introduced by Robert Tibshirani in 1996. </p><p>These regularization methods weren&#8217;t created with economists in mind, but they quickly became useful for us as datasets got bigger and models more complex. Still, for a long time, they mostly lived on the margins of applied work. You might see LASSO show up in a robustness check or in a variable selection appendix, but it wasn&#8217;t central to the analysis. That started to change when the tools stopped being just about prediction and started helping us answer causal questions more transparently.</p><p>The real shift came when economists began to recognize that ML could be used to reduce overfitting and improve out-of-sample performance, and also to solve problems we already cared about, like estimating treatment effects in high-dimensional settings or uncovering heterogeneity we suspected was there but couldn&#8217;t capture with linear models. Suddenly, ML wasn&#8217;t a detour away from economic reasoning; it was more like a way to get closer to the truth in messy, real-world data.</p><p>This is where the crossover with causal inference really took off. Methods like causal forests, double ML, and targeted regularization didn&#8217;t just borrow from CS; they adapted those tools to meet our standards for identification and interpretation. That&#8217;s why they stuck. And that&#8217;s why people like Professors Athey, Chernozhukov, and others were so effective at building the bridge: they spoke the language of economists <em>and</em> they understood the technical machinery involved. So, when people say &#8220;ML is just prediction&#8221;, you can reply that this is an outdated perspective that overlooks the substantive evolution happening in the empirical literature. The goal was never to abandon causality. The goal was (and is) to do better causal inference <em>with better tools</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.
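</p><p>Since the double/debiased ML idea comes up again below, here is a deliberately bare-bones sketch of the cross-fitting logic (two folds, partialling-out) on simulated data. The <code>DoubleML</code> package does all of this properly; this toy version only shows that the recipe is short: predict the outcome and the treatment from the covariates on one fold, residualise on the other, and regress residuals on residuals:</p><pre><code># Bare-bones double/debiased ML (partialling-out with 2-fold cross-fitting).
library(randomForest)
set.seed(1)

n = 1000
x = matrix(rnorm(n * 10), n, 10)                 # covariates
d = rbinom(n, 1, plogis(x[, 1]))                 # treatment, depends on x1
y = 0.5 * d + x[, 1] + rnorm(n)                  # outcome, true effect = 0.5

folds = sample(rep(1:2, length.out = n))
res_y = numeric(n); res_d = numeric(n)

for (k in 1:2) {
  train = folds != k; test = folds == k
  fy = randomForest(x[train, ], y[train])        # learn E[y | x] on fold A
  fd = randomForest(x[train, ], d[train])        # learn E[d | x] on fold A
  res_y[test] = y[test] - predict(fy, x[test, ]) # residualise on fold B
  res_d[test] = d[test] - predict(fd, x[test, ])
}

# Final stage: effect of d on y after partialling out x on held-out folds
coef(lm(res_y ~ res_d))["res_d"]
</code></pre><p>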
That framing (better causal inference with better tools) is, I think, worth emphasizing.</p><p>Now that you more or less know where we are at the moment and why we need to talk about this, I have a list of the most common criticisms I&#8217;ve heard, and the way I think we can address them is by pointing at the literature while trying to make the most out of it without losing sight of reality. </p><p><strong>1. &#8220;It&#8217;s just curve-fitting/overfitting&#8221;</strong></p><p>You&#8217;ve probably heard some version of this: ML models are too flexible and will fit noise rather than signal, they don&#8217;t generalize well out-of-sample, and economic relationships should be &#8220;parsimonious&#8221;, not complex. This criticism fundamentally misunderstands how modern ML works. The claim that ML models &#8220;overfit&#8221; ignores that the entire ML workflow is designed around out-of-sample validation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, which is something traditional econometrics often skips entirely. <a href="https://pubs.aeaweb.org/doi/pdf/10.1257/jep.31.2.87">Mullainathan and Spiess (2017)</a> make this point directly: ML&#8217;s obsession with predictive performance makes models more reliable, not less, because it forces researchers to test whether their results generalize beyond the training data. When Belloni et al. (<a href="https://onlinelibrary.wiley.com/doi/pdf/10.3982/ecta9626">2012</a>, <a href="https://pubs.aeaweb.org/doi/pdf/10.1257%2Fjep.28.2.29">2014</a>) applied LASSO regularization to causal inference problems, they improved treatment effect estimation by selecting relevant controls while discarding noise, and thus avoiding overfitting. The &#8220;parsimony&#8221; argument is equally misguided. As <a href="https://pubs.aeaweb.org/doi/pdf/10.1257%2Fjep.28.2.3">Varian (2014)</a> points out, when economic relationships are genuinely complex, forcing them into overly simple models creates more bias than allowing appropriate complexity. The solution is to use principled methods like regularization that balance complexity with generalizability<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, not to pretend the world is simple. We see this in practice: <a href="https://academic.oup.com/qje/article-pdf/133/1/237/30636517/qjx032.pdf">Kleinberg et al. (2018)</a> showed that ML models predicting judicial decisions outperformed simple heuristics precisely because they could handle complexity without overfitting. The irony is that economists criticize ML for &#8220;curve-fitting&#8221; while often using specification searches and robustness checks that are far more prone to overfitting than properly cross-validated ML approaches. </p><p><strong>2. &#8220;Black box problem&#8221;</strong></p><p>This one always comes up: you can&#8217;t interpret the results, there&#8217;s no economic intuition behind the relationships, and policymakers need to understand <em>why</em> something works, not just <em>that</em> it works. But this criticism conflates older ML methods with modern approaches that are explicitly designed for interpretability. <a href="https://muse.jhu.edu/pub/56/article/793356">Athey and Wager&#8217;s (2019)</a> causal forests give you treatment effects and tell you exactly which covariates drive heterogeneity and how.
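</p><p>And because the causal forest point is easy to demo, here is a tiny sketch with the <code>grf</code> package on simulated data (illustrative only), showing that you get an average effect, unit-level effects, and a ranking of which covariates drive the heterogeneity:</p><pre><code># Causal forest: treatment effects plus a look at what drives heterogeneity.
library(grf)
set.seed(1)

n = 2000
X = matrix(rnorm(n * 6), n, 6)
W = rbinom(n, 1, 0.5)                              # randomised treatment
Y = pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)        # effect varies with X1

cf = causal_forest(X, Y, W)

average_treatment_effect(cf)      # ATE with a standard error
tau_hat = predict(cf)$predictions # unit-level effect estimates
variable_importance(cf)           # which covariates drive the heterogeneity
</code></pre><p>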
You can&#8217;t call it a black box; that&#8217;s more interpretable than most traditional econometric models that assume homogeneous effects. Tools like SHAP values (<a href="https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf">Lundberg and Lee, 2017</a>) and LIME (<a href="https://arxiv.org/pdf/1606.05386">Ribeiro et al., 2016</a>) can decompose any model&#8217;s predictions into interpretable components, showing exactly how each variable contributes to the outcome. When the aforementioned Kleinberg et al. (2018) built ML models to predict judge decisions, they extracted insights about which case characteristics actually matter for judicial outcomes, information that was invisible in traditional analyses. We can&#8217;t simply consider every single variable alone and for its relevance. The deeper issue is that &#8220;interpretability&#8221; often means &#8220;fits my priors&#8221; rather than &#8220;reveals true mechanisms&#8221;. A linear regression with 20 control variables isn&#8217;t inherently more interpretable than a well-designed tree-based model that shows you which interactions matter the most. As <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">Athey (2018)</a> argues, ML tools can enhance economic intuition by revealing patterns and relationships that traditional methods miss. The question is whether we&#8217;re willing to update our understanding when the data suggests more complex relationships than our simple models assumed.</p><p><strong>3. &#8220;Prediction &#8800; causation&#8221;</strong></p><p>This is probably the most fundamental objection: ML is only good for forecasting, not understanding causal relationships; economics is about identifying causal effects, not just correlations; you can&#8217;t do policy analysis without causal identification. After all, we did have an entire revolution concerning this worry. But this criticism is based on a false dichotomy that ignores how modern causal ML works. <a href="https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf">Chernozhukov et al. (2018)</a> showed exactly how to combine ML&#8217;s predictive power with rigorous causal identification in their double/debiased ML framework<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>: you use ML to estimate nuisance parameters (like propensity scores or outcome models) while maintaining all the causal identification you care about. <a href="https://muse.jhu.edu/pub/56/article/793356">Athey and Wager (2019)</a> took this further with causal forests by demonstrating how tree-based methods can identify heterogeneous treatment effects with full causal rigor. These are causal inference methods that happen to use ML tools rather than &#8220;prediction&#8221; methods. A deeper insight comes from <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4869349/pdf/nihms776714.pdf">Kleinberg et al. (2015)</a>, who pointed out that many policy problems ARE prediction problems: predicting who will benefit most from a treatment, predicting which interventions will work in which contexts, predicting the consequences of policy changes. When you frame it this way, the prediction vs. causation distinction starts to look fake. <a href="https://academic.oup.com/restud/article/81/2/608/1523757">Belloni et al. 
(2014)</a> solved the lingering inference concerns by showing how to do valid statistical inference after using ML for variable selection. The result? &#8220;Better causal inference through better prediction&#8221; rather than &#8220;prediction instead of causation&#8221;. When <a href="https://www.jstor.org/stable/pdf/44250458.pdf">Davis and Heller (2017)</a> used causal forests to understand treatment heterogeneity in job training programs, they were making causal inference more flexible and informative than traditional approaches ever could, not abandoning it.</p><p><strong>4. &#8220;No economic theory&#8221;</strong></p><p>Ok, we are halfway through these. Here&#8217;s another one you hear all the time: ML methods are atheoretical: they let the data speak without economic reasoning, economics should be driven by theory, not algorithmic pattern recognition (whatever that means), and results lack the structural interpretation needed for counterfactuals. But this gets the relationship between theory and ML backwards. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4869349/pdf/nihms776714.pdf">Kleinberg et al. (2015)</a> showed that prediction problems are often at the heart of economic theory: optimal taxation requires predicting behavioral responses, welfare analysis requires predicting who benefits from policies, market design requires predicting how agents will behave under different rules. ML makes this reasoning more rigorous by forcing us to be explicit about our assumptions and test whether our theories predict well. <a href="https://pubs.aeaweb.org/doi/pdf/10.1257/jep.31.2.87">Mullainathan and Spiess (2017)</a> put it perfectly: ML improves traditional econometric practice by providing better tools for the things we were already trying to do. When <a href="https://www.nber.org/system/files/working_papers/w23276/w23276.pdf">Gentzkow et al. (2019)</a> used text analysis to study media bias, they were testing theories about how competition affects content in ways that were impossible before ML tools existed, rather than abandoning economic theory. The &#8220;structural interpretation&#8221; concern is equally misguided. <a href="http://proceedings.mlr.press/v70/hartford17a/hartford17a.pdf">Hartford et al. (2017)</a> showed how to combine deep learning with IVs for structural estimation (akin to what the paper that started this whole conversation was trying to do). <a href="https://www.jstor.org/stable/pdf/43821932.pdf">Bajari et al. (2015)</a> demonstrated how ML can improve demand estimation while maintaining full economic interpretation. ML doesn&#8217;t lack theory (although practice tends to precede it), and it forces us to confront whether our theories really work when tested against complex, real-world data. Sometimes they do, sometimes they don&#8217;t, and sometimes the data reveals relationships our theories missed entirely. That&#8217;s not atheoretical, that&#8217;s how science progresses.</p><p><strong>5. &#8220;External validity concerns&#8221;</strong></p><p>This criticism also has it backwards. The argument is that models trained on one dataset don&#8217;t work elsewhere, that economic relationships vary across time and place, and that we need models that capture stable fundamentals.
The concern is valid: Professor Athey <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">notes</a> that an algorithm might find unstable relationships, like the fact that for a time &#8220;the presence of a piano in a video may thus predict cats&#8221;. But the idea that this makes ML <em>worse</em> than traditional methods is wrong. In fact, traditional approaches that assume homogeneous treatment effects and stable parameters across contexts have a much bigger external validity problem; they just assume it away. ML tools do the opposite: they explicitly model heterogeneity, telling you exactly <em>why</em> and <em>where</em> effects vary. This is the core of external validity. When Athey and Wager developed causal forests, they were providing a systematic way to &#8220;discover forms of heterogeneity&#8221;. <a href="https://academic.oup.com/ectj/article-pdf/21/1/C1/27684918/ectj00c1.pdf">Chernozhukov et al. (2018)</a> took this further by developing formal inference procedures that help us understand when results will generalize. And we see this in practice: <a href="https://academic.oup.com/qje/article-pdf/133/1/237/30636517/qjx032.pdf">Kleinberg et al. (2018)</a> built judge prediction models that worked across different courts and time periods precisely because they could identify which factors were stable and which were context-specific. The deeper point is that external validity is about understanding *systematic* variation. When <a href="https://www.jstor.org/stable/pdf/44250458.pdf">Davis and Heller (2017)</a> used causal forests to study job training programs, they identified which participant characteristics predicted where the program would work and where it wouldn&#8217;t. We should not see this as a threat to external validity, but as external validity done right. The irony is that we sometimes worry about ML models not generalizing while using methods that assume away the very heterogeneity that determines whether results will generalize in the first place. Similarly, <a href="https://academic.oup.com/rfs/article-pdf/33/5/2223/33209812/hhaa009.pdf">Gu et al. (2020)</a> demonstrated that ML methods in asset pricing consistently outperform traditional models across different time periods and market regimes. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC5540263/pdf/nihms847251.pdf">Mullainathan and Obermeyer (2017)</a> found the same pattern in healthcare, where ML models proved more robust across different patient populations than traditional risk-adjustment methods that rely on stable, homogeneous relationships.</p><p><strong>6. &#8220;Statistical inference problems&#8221;</strong></p><p>Ok, I will give them that - to some extent. This criticism gets at something real but misunderstands the solution. The concerns are legitimate: traditional standard errors &#8220;break down&#8221; when you use ML for variable selection, multiple testing becomes a serious issue when algorithms explore thousands of potential relationships, and post-selection inference creates well-known statistical problems. As Professor Athey <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">puts it</a>, a central theme of the new literature is that ML algorithms &#8220;have to be modified to provide valid confidence intervals for estimated effects when the data is used to select the model&#8221;. What I would say here is that we shouldn&#8217;t avoid ML.
We should use the statistical innovations that have specifically solved these problems. The modern literature has done exactly that, often using techniques like &#8220;sample splitting&#8221; and &#8220;orthogonalization&#8221; to ensure valid inference. For example, <a href="https://www.sciencedirect.com/science/article/pii/S0304407614000918">Hansen and Kozbur (2014)</a> demonstrated how to do valid inference in high-dimensional panel models. <a href="https://arxiv.org/pdf/1501.03430">Chernozhukov, Hansen, and Spindler (2015)</a> provide a comprehensive framework for post-selection inference, while <a href="https://www.cambridge.org/core/services/aop-cambridge-core/content/view/A20409C7B5A9413C16716B389FED3402/S0266466618000245a.pdf/the-factor-lasso-and-k-step-bootstrap-approach-for-inference-in-high-dimensional-economic-applications.pdf">Hansen and Liao (2019)</a> developed bootstrap-based approaches for these complex settings. The double/debiased ML framework from Chernozhukov et al. solves the problem by separating the prediction task (where ML excels) from the inference task (where econometric rigor is maintained), giving you the best of both worlds. The multiple testing critique is particularly ironic because traditional econometrics is often far more guilty of this. We frequently check &#8220;dozens or even hundreds of alternative specifications behind the scenes&#8221; without correcting for it, a practice that invalidates reported p-values (the incentives are there<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>). In contrast, ML-based methods are systematic and transparent. The causal forests developed by Athey and Wager (2019) come with an asymptotic theory that provides valid inference even after the algorithm explores complex interactions. The deeper point is that these are not uniquely &#8220;ML problems&#8221;; they are statistical problems that exist whenever a researcher engages in model selection. The difference is that the ML and modern econometrics literature, including work like selective inference from <a href="https://www.tandfonline.com/doi/pdf/10.1080/01621459.2015.1108848">Tibshirani et al. (2016)</a>, has developed principled and transparent solutions where traditional practice often ignored the problem entirely.</p><p><strong>7. &#8220;It&#8217;s just a fad&#8221;</strong></p><p>I will finish with this one (also because this e-mail is nearly at its length limit), and I hope I was able to convince you somewhat of the benefits of using ML techniques in Econ.</p><p>My problem with people saying &#8220;it&#8217;s just a fad&#8221; is that this criticism reveals a fundamental misunderstanding of how scientific progress works. The argument goes that Econ has seen many methodological fashions come and go, that core economic insights don&#8217;t require fancy new tools, and that traditional econometric methods have worked fine for decades. But this perspective confuses *stability* with *stagnation*. Every major methodological advance in Econ was once dismissed as a fad. Regression analysis itself was once controversial. IVs are still viewed with suspicion. Even RCTs faced resistance before becoming the &#8220;gold standard&#8221; for causal inference (you can include me on that). The pattern is always the same: initial skepticism, gradual adoption by leading researchers, then widespread acceptance as the new normal.
ML is following exactly this trajectory, but with one key difference: the scale and speed of adoption. </p><p>Let&#8217;s consider the institutional evidence: top economics departments are hiring ML-trained economists, leading journals are starting to publish ML-based research, and the most prestigious conferences feature ML sessions. The Nobel Committee awarded prizes to researchers using computational methods that were unimaginable decades ago. We can&#8217;t call this a fad; it&#8217;s a permanent expansion of the economist&#8217;s toolkit. The &#8220;core insights&#8221; argument is particularly misguided. </p><p><a href="https://pubs.aeaweb.org/doi/pdf/10.1257%2Fjep.28.2.3">Varian (2014)</a> argued that traditional econometric tools, while effective in many contexts, face serious limitations when confronted with the scale and complexity of modern datasets. How do you study personalized pricing algorithms? How do you analyse text from millions of social media posts or model outcomes using thousands of predictors? Do not label these as &#8220;fringe cases&#8221; when they are, in reality, a focal point for understanding today&#8217;s economic activity. ML offers practical solutions where conventional approaches often fall short. The idea that &#8220;old methods worked fine for decades&#8221; ignores the empirical reality that those methods frequently break down in the face of high-dimensional or unstructured data. <a href="https://www.aeaweb.org/articles?id=10.1257%2Fjep.31.2.87&amp;ref=ds-econ">Mullainathan and Spiess (2017)</a> demonstrated that many standard econometric practices perform poorly in out-of-sample tests, and that what &#8220;worked fine&#8221; may have been an illusion created by never testing predictive performance. The deeper issue is that dismissing ML as a fad reflects a conservative bias that mistakes familiarity for validity. As Professor Athey <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">argues</a>, the question isn&#8217;t whether traditional methods are good enough, but whether we can do better. The evidence increasingly suggests we can.<br></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>One can even argue about a third driver: institutional incentives (e.g., tenure metrics tied to causal identification papers). Some of the mechanisms at play are: top journals are more likely to accept papers with clean causal designs, tenure committees count those &#8220;causal hits&#8221; and discount purely predictive studies, PhD advisors channel students toward DiD rather than neural networks, grant reviewers ask &#8220;where&#8217;s the treatment?&#8221; and mark down ML-only proposals, core graduate courses leave ML out so the entry cost stays high, and authors often graft an instrument onto their models just to satisfy referees. 
My friend Prashant and Professor Fetzer have an excellent <a href="https://arxiv.org/abs/2501.06873">paper</a> on the &#8220;rise in the share of causal claims (papers) - from roughly 4% in 1990 to nearly 28% in 2020 - reflecting the growing influence of the &#8220;credibility revolution&#8221;.&#8221; They find that &#8220;causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>AlphaFold utilizes ML (particularly deep learning) to predict the 3D structure of proteins from their amino acid sequences. It leverages large datasets of known protein structures and sequences to train a neural network that can identify patterns and relationships in these structures.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>&#8220;A prediction I have is that there will be an active and important literature combining ML and causal inference to create new methods, methods that harness the strengths of ML algorithms to solve causal inference problems. (&#8230;) This new literature takes many of the strengths and innovations of ML methods, but applies them to causal inference. Doing this requires changing the objective function, since the ground truth of the causal parameter is not observed in any test set&#8221;. <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">Athey, 2018</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I&#8217;d like to thank <a href="https://sites.google.com/view/maricavalente/about">Professor Marica Valente</a> for this insight. I attended her course in Machine Learning last summer at ISEG and had a great time, she&#8217;s awesome.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>&#8220;Researchers often check dozens or even hundreds of alternative specifications behind the scenes, but rarely report this practice because it would invalidate the confidence intervals reported (due to concerns about multiple testing and searching for specifications with the desired results). There are many disadvantages to the traditional approach, including but not limited to the fact that researchers would find it difficult to be systematic or  comprehensive in checking alternative specifications, and further because researchers were not honest about the practice, given that they did not have a way to correct for the specification search process. I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics&#8221;. 
<a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">Athey, 2018</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>&#8220;The ML literature uses a variety of techniques to balance expressiveness against over-fitting. The most common approach is cross-validation whereby the analyst repeatedly estimates a model on part of the data (a &#8220;training fold&#8221;) and then evaluates it on the complement (the &#8220;test fold&#8221;). The complexity of the model is selected to minimize the average of the mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome) on the test folds. Other approaches used to control over-fitting include averaging many different models, sometimes estimating each model on a subsample of the data (one can interpret the random forest in this way)&#8221;. <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">Athey, 2018</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>&#8220;There are discussions of what interpretability means, and whether simpler models have advantages. Of course, economists have long understood that simple models can also be misleading. In social sciences data, it is typical that many attributes of individuals or locations are positively correlated&#8211;parents&#8217; education, parents&#8217; income, child&#8217;s education, and so on. (&#8230;) So, simpler models can sometimes be misleading; they may seem easy to understand, but the understanding gained from them may be incomplete or wrong&#8221;. <a href="https://www.gsb.stanford.edu/sites/default/files/publication-pdf/atheyimpactmlecon.pdf">Athey, 2018</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Chernozhukov et al. have a new <a href="https://arxiv.org/pdf/2504.08324?">pre-print</a> where they provide a practical introduction to Double/Debiased Machine Learning (DML).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>&#8220;We examine how the evaluation of research studies in economics depends on whether a study yielded a null result. Studies with null results are perceived to be less publishable, of lower quality, less important and less precisely estimated than studies with large and statistically significant results, even when holding constant all other study features, including the sample size and the precision of the estimates. The null result penalty is of similar magnitude among PhD students and journal editors. The penalty is larger when experts predict a large effect and when statistical uncertainty is communicated with <em>p</em>-values rather than standard errors. Our findings highlight the value of a pre-result review&#8221;. 
<a href="https://academic.oup.com/ej/article/134/657/193/7238466">Chopra et al., 2024</a>.</p><p></p><div><hr></div><p></p><p>Some recommended readings, in no specific order:</p><p><a href="https://docs.iza.org/dp17014.pdf">A Hands-on Machine Learning Primer for Social Scientists: Math, Algorithms and Code</a>, by Nikos Askitas, 2024 <em>(if you are a social scientists that got to this point and want get your hands dirty, I would recommend reading this guide as a starter).</em></p><p><a href="https://wol.iza.org/articles/machine-learning-for-causal-inference-in-economics">Machine Learning For Causal Inference In Economics</a>, by Anthony Strittmatter, 2025.</p><p><a href="https://www.annualreviews.org/content/journals/10.1146/annurev-economics-080217-053433">Machine Learning Methods That Economists Should Know About</a>, by Susan Athey and Guido W. Imbens, 2019</p><p><a href="https://www.nber.org/papers/w31502">Financial Machine Learning</a>, by Bryan T. Kelly and Dacheng Xiu, 2023</p><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3702412">Lawrence R. Klein&#8217;s Principles in Modeling and Contributions in Nowcasting, Real-Time Forecasting, and Machine Learning</a>, by Roberto S. Mariano and Suleyman Ozmucur, 2020</p><p>Statistical Modeling: The Two Cultures (<a href="https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.full">paper </a>/ <a href="https://ledaliang.github.io/journalclub/_static/breiman2001/Presentation.pdf">slides</a>), by Leo Breiman, 2001 (thanks <a href="https://x.com/CampedelliGian/status/1932871863065026950">Gian</a> for this tip!)</p></div></div>]]></content:encoded></item><item><title><![CDATA[Machine Learning for Continuous Treatments, Bayesian Methods for Staggered Timing, and Using Experiments to Fix Observational Data ]]></title><description><![CDATA[How new methods handle treatment intensity, small samples with staggered adoption, and selection bias in observational studies]]></description><link>https://www.diddigest.xyz/p/machine-learning-for-continuous-treatments</link><guid isPermaLink="false">https://www.diddigest.xyz/p/machine-learning-for-continuous-treatments</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 30 May 2025 16:20:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gU7e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gU7e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gU7e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 424w, https://substackcdn.com/image/fetch/$s_!gU7e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 848w, 
https://substackcdn.com/image/fetch/$s_!gU7e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 1272w, https://substackcdn.com/image/fetch/$s_!gU7e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gU7e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png" width="1198" height="621" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:621,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1038923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/164226519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gU7e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 424w, https://substackcdn.com/image/fetch/$s_!gU7e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 848w, https://substackcdn.com/image/fetch/$s_!gU7e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 1272w, https://substackcdn.com/image/fetch/$s_!gU7e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab53304-8eed-4c4a-860a-3560ccf6061a_1198x621.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! Today we have two new DiD papers, one DiD guide, and one paper by 3 heavy-weights in the field on how to leverage experiments to correct for observational bias:</p><ol><li><p><a href="https://arxiv.org/pdf/2408.10509">Continuous Difference-In-Differences With Double/Debiased Machine Learning</a>, by Lucas Zhang </p></li><li><p><a href="https://www.arxiv.org/abs/2505.18391">Model-based Estimation of Difference-in-Differences with Staggered Treatments</a>, by Siddhartha Chib and Kenichi Shimizu</p></li><li><p><a href="https://www.nber.org/papers/w33817">The Experimental Selection Correction Estimator: Using Experiments to Remove Biases in Observational Estimates</a>, by Susan Athey, Raj Chetty and Guido Imbens (this paper is *not* about DiD but it&#8217;s about an estimator :) also it&#8217;s a great reading so I decided to add it to the post)</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5187998">A Concise Guide to Difference-in-Differences Methods for Economists with Applications to Taxation, Regulation, and Environment</a>, by Bruno Bosco and Paolo Maranzano (I added this one here because it&#8217;s so good! If you&#8217;re just starting to get familiar with DiD, this guide is a great place to start)</p><p></p></li></ol><h3>Continuous Difference-In-Differences With Double/Debiased Machine Learning</h3><p><em>(<a href="https://lucaszz-econ.github.io/">Lucas</a>&#8217; <a href="https://lucaszz-econ.github.io/notes/JMP.pdf">JMP</a> is also on DiD. He&#8217;s a Ph.D. candidate at UCLA)</em> </p><h5>TL;DR: in this paper, Lucas extends DiD in settings with continuous treatments (where treatment intensity varies rather than being binary). To identify and estimate the average treatment effect on the treated (ATT) at each intensity level, he proposes a new estimator built on double/debiased machine learning (DML). The method he proposes uses orthogonal scores, kernel smoothing, and cross-fitting to reduce bias (particularly in high-dimensional settings). The result is a flexible and theoretically grounded approach to estimating how treatment effects vary with dose, complete with valid confidence bands and a real-world policy application.</h5><p><em>What is this paper about?</em></p><p>There is growing interest in extending DiD to continuous treatments, where one can further investigate how outcomes vary across different treatment intensities within the treated group. Think of hospitals exposed to different levels of policy reform, or individuals experiencing varying degrees of exposure to an intervention. This paper tackles this growing interest. The core object of interest is the average treatment effect on the treated (ATT) at any given intensity level, d, defined as </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{ATT}(d) := \\mathbb{E}[Y_t(d) - Y_t(0) \\mid D = d]\n&quot;,&quot;id&quot;:&quot;BTUMTRHJKI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where ATT at level d is the expected difference between the treated and untreated potential outcomes at time t, for units that actually received treatment level d. 
To identify this effect, the paper adopts a conditional parallel trends assumption (analogous to the binary case in the &#8220;sister&#8221; papers <a href="https://academic.oup.com/restud/article-abstract/72/1/1/1581053">Abadie, 2005</a> and <a href="https://www.sciencedirect.com/science/article/pii/S0304407620301901?casa_token=DJ6v85VxpFcAAAAA:r2ZQFSS_SEVEoaoVYjmaAVUyr3xIFjShilAtrbpG-3U1qbuhq7mlyuZrN8xiOYABnlKSL455">Sant&#8217;Anna and Zhao, 2020</a>) where untreated potential outcomes evolve similarly across treatment levels, <em>conditional on covariates</em>. This allows the ATT to be expressed in terms of observable quantities even when treatment is continuous. But as Lucas notes, we have some issues: &#8220;estimating the ATT in this framework requires first estimating infinite-dimensional nuisance parameters,&#8221; making the task more complex than in binary DiD (think of nuisance parameters as parts of the model that aren&#8217;t the main focus of your estimation - like the ATT - but that must be estimated to recover it, e.g. conditional means or treatment probabilities). He shows how to handle this challenge in a rigorous way.</p><p><em>What does the author do?</em></p><p>To estimate ATT(d) in practice, Lucas adopts the double/debiased machine learning (DML) framework developed by <a href="https://academic.oup.com/ectj/article-abstract/21/1/C1/5056401">Chernozhukov et al. (2018)</a>, which uses orthogonalisation (where we rewrite the estimator so it&#8217;s robust to errors in nuisance parameter estimates) and cross-fitting (where we split the sample so that nuisance parameters are estimated on one part and used on another) to reduce bias in causal inference. This allows him to flexibly estimate complex nuisance functions (e.g., conditional densities and probabilities) without introducing first-order bias, even in high-dimensional settings. The estimator is then constructed in two steps:</p><ol><li><p>Orthogonal (or "locally insensitive") scores: ATT(d) is re-expressed using a score function that remains valid even if the nuisance parameters are estimated with error (a property known as <a href="https://maxkasy.github.io/home/files/teaching/ML_Oxford_2022/debiased_ml_slides.pdf">Neyman orthogonality</a> - like taking a derivative that&#8217;s zero at the true value, so small estimation mistakes have no first-order impact).</p></li><li><p>Cross-fitting: the sample is split so that nuisance functions are estimated on one part and used on another, which helps mitigate overfitting and improves finite-sample performance (think of this like using one set of data to &#8220;learn&#8221; the adjustment, and a different set to &#8220;apply&#8221; it).</p></li></ol><p>A central challenge here is that ATT(d) depends on the conditional density of the treatment, which is difficult to estimate directly. To address this, Lucas approximates ATT(d) with a smoothed version using kernel weights (think of it like a weighted average centred around d, where the weights smoothly drop off as you move away &#8594; this helps us estimate quantities like ATT(d) even when we don&#8217;t have many observations with exactly that treatment level), and shows that this approximation converges to the true ATT as the bandwidth shrinks. 
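</p><p>To see the shape of the procedure, here is a deliberately simplified sketch in Python: one pre and one post period, a random forest for the untreated units&#8217; outcome evolution, and Gaussian kernel weights around the dose of interest. It drops the orthogonal scores, cross-fitting and bootstrap that do the real work in the paper, so treat it as a cartoon of the regression-adjustment building block rather than Lucas&#8217;s estimator; all numbers are simulated and all names are placeholders.</p><pre><code># Cartoon of ATT(d) under conditional parallel trends with kernel smoothing.
# Not the paper's orthogonal-score estimator: the debiasing machinery is omitted.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 5))
treated = rng.random(n) > 0.5                       # untreated units have dose 0
D = np.where(treated, np.abs(1 + X[:, 0] + rng.normal(size=n)), 0.0)
trend = X[:, 1] + 0.2 * X[:, 0]                     # common trend, given X
dY = trend + np.where(treated, 0.5 * D, 0.0) + rng.normal(size=n)  # Y_post minus Y_pre

# Step 1: learn the untreated outcome evolution E[dY | X, D = 0]
m0 = RandomForestRegressor(n_estimators=300, random_state=0)
m0.fit(X[~treated], dY[~treated])

# Step 2: for a dose level d, kernel-weight treated units with D near d and
# average dY minus the predicted untreated change (the ATT(d) building block)
def att_at(d, h=0.3):
    w = np.exp(-0.5 * ((D[treated] - d) / h) ** 2)   # Gaussian kernel weights around d
    gap = dY[treated] - m0.predict(X[treated])
    return np.sum(w * gap) / np.sum(w)

for d in (1.0, 2.0, 3.0):
    print(f"ATT({d:.0f}) is roughly {att_at(d):.2f} (true value {0.5 * d:.2f})")
</code></pre><p>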
He then derives the estimator&#8217;s asymptotic properties, showing it is asymptotically normal, and provides valid confidence bands using a multiplier bootstrap.</p><p><em>Why is this important?</em></p><p>Most real-world policies do not treat units in a binary way, and interventions often vary in intensity or exposure, whether it's funding amounts, pollution levels, or programme uptake. Yet more traditional DiD methods are built around binary treatment comparisons, limiting what we can learn. This paper offers a clean way to generalise DiD to continuous treatments, while keeping identification credible and inference valid even in high-dimensional settings where machine learning tools are used. It bridges the gap between theory and application by adapting recent advances (like DML) to a DiD context, which is still relatively underdeveloped for continuous treatments. This isn&#8217;t just a theoretical exercise: the paper applies the method to the 1983 Medicare PPS reform, showing how the treatment effect of the reform varied with hospitals' share of Medicare patients, something standard DiD would miss entirely.</p><p><em>Who should care?</em></p><p>Anyone working with policy variation in intensity, not just presence or absence, and that includes:</p><ul><li><p>Applied economists studying heterogeneous policy effects (e.g. health, education, environment)</p></li><li><p>Labour and public economists evaluating programme rollouts where take-up or exposure varies</p></li><li><p>Causal inference researchers interested in extending DiD tools to more realistic settings</p></li><li><p>Data scientists and ML users who want valid inference when using flexible, nonparametric models</p></li></ul><p>If you&#8217;ve ever asked &#8220;does the effect depend on how much treatment someone got?&#8221;, then Lucas&#8217; paper gives you a way to answer it.</p><p><em>Do we have code?</em></p><p>No replication code yet, but the methods are compatible with existing double machine learning toolkits in both R and Python. For example, the nuisance functions in the paper are estimated using random forests in scikit-learn, and the estimator uses standard components like kernel regression, cross-fitting, orthogonal scores, and multiplier bootstrap for inference. So if you&#8217;re already using DML packages like <code>econml</code>, <code>DoubleML</code>, or <code>grf</code>, this setup should feel familiar, just adapted to a continuous treatment setting.</p><p>In summary, this paper pushes DiD into more realistic territory where treatment isn&#8217;t all-or-nothing, but continuous. By combining conditional parallel trends with double machine learning, it gives us a principled way to estimate how effects vary with treatment intensity. If you're studying policies with uneven exposure or variable implementation, this method lets you recover meaningful causal effects without relying on oversimplified binary comparisons.</p><h3>Model-based Estimation of Difference-in-Differences with Staggered Treatments</h3><h5>TL;DR: in this paper the authors introduce the first fully model-based approach to estimating DiD effects in staggered treatment settings using Bayesian methods. Instead of relying on TWFE or modern nonparametric estimators, professors Chib and Shimizu specify a hierarchical state-space model where treatment effects can evolve over time and vary across treated units. The model captures latent trends, unit-specific dynamics, and treatment heterogeneity all at once. 
Bayesian estimation via MCMC then yields full posterior distributions for group-time average treatment effects, along with credible intervals and dynamic treatment profiles (even in small samples, where frequentist methods may be unreliable). The method is then applied to a job training program dataset and to crime outcomes following a state-level policy change. The result is a coherent and flexible Bayesian alternative to recent DiD estimators, which is especially valuable when treatment effects are expected to be dynamic or heterogeneous, or when we want to do probabilistic modelling. </h5><p><em>What is this paper about?</em></p><p>When treatments are rolled out at different times across units (what we call a staggered adoption design) standard DiD methods like TWFE can produce misleading results. They assume constant treatment effects and parallel trends, and break down when effects vary across time or groups. Alternatives like <a href="https://www.sciencedirect.com/science/article/abs/pii/S0304407620303948">Callaway and Sant&#8217;Anna, 2021</a> and <a href="https://www.sciencedirect.com/science/article/abs/pii/S030440762030378X">Sun and Abraham, 2021</a> correct for some of these issues, but they still rely on large-sample, nonparametric estimation and don&#8217;t always perform well when sample sizes are small or data are noisy. This paper takes a different route. Profs Chib and Shimizu propose a Bayesian model-based approach that directly models how treatment effects evolve over time and vary across units. They set up a state-space model where the average treatment effect for each group and time period ATT(g, t) is treated as a latent variable, to be inferred from the data alongside other unknowns. This allows the model to smoothly track treatment effects over time, account for latent trends and unit-specific dynamics, while also handling staggered treatment in a natural, joint estimation setup. Rather than adjusting after the fact, the model builds in treatment timing and variation from the start. Bayesian inference (via MCMC) then delivers posterior distributions for the effects, offering not just point estimates but full uncertainty quantification. And because it borrows information across units and time, it can be especially helpful in small-sample settings, where many modern DiD estimators struggle.</p><p><em>What do the authors do?</em></p><p>Profs Chib and Shimizu build a hierarchical Bayesian model where treatment effects ATT(g,&#8239;t) are treated as latent variables within a state-space structure<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. This means they explicitly model how effects evolve over time and differ across treated groups, instead of estimating them indirectly like most DiD methods do. The key idea here is that observed outcomes are assumed to follow a data-generating process with latent (unobserved) components (unit fixed effects, time effects latent ATT(g,&#8239;t) values). These latent effects are assumed to evolve smoothly over time via a stochastic process (e.g. random walk or autoregressive process). The model is estimated using Bayesian MCMC methods, which generate posterior distributions for each ATT(g,&#8239;t), allowing for dynamic patterns, full uncertainty quantification, and inference even in small samples. 
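</p><p>To make the state-space idea a bit more tangible, here is a stylized Python simulation of the kind of data-generating process being described: unit effects, common time effects, and a latent cohort-level ATT(g,&#8239;t) path that drifts like a random walk after adoption. This is only an illustration of the structure, with made-up numbers and hypothetical names; it is not the authors&#8217; model or their sampler.</p><pre><code># Stylized simulation of a state-space DiD data-generating process:
# unit effects + time effects + a latent ATT(g, t) path that follows a
# random walk after adoption. Illustrative only; not the authors' model.
import numpy as np

rng = np.random.default_rng(2)
n_units, n_periods = 60, 10
adopt_year = rng.choice([4, 6, 8, np.inf], size=n_units)   # staggered adoption (inf = never treated)

alpha = rng.normal(0, 1, n_units)                  # unit fixed effects
lam = np.cumsum(rng.normal(0, 0.3, n_periods))     # common time effects

# one latent ATT path per adoption cohort, drifting from the adoption date onwards
cohorts = sorted(g for g in set(adopt_year) if np.isfinite(g))
att = {g: np.cumsum(rng.normal(0.3, 0.1, n_periods)) for g in cohorts}

Y = np.zeros((n_units, n_periods))
for i in range(n_units):
    for t in range(n_periods):
        g = adopt_year[i]
        effect = 0.0
        if np.isfinite(g) and t >= g:
            effect = att[g][int(t - g)]            # dynamic effect, (t - g) periods after adoption
        Y[i, t] = alpha[i] + lam[t] + effect + rng.normal(0, 0.5)

# A Bayesian model would now treat alpha, lam and the att paths as latent states
# and recover their posteriors by MCMC; here we only print the true ATT(g, t) paths.
for g in cohorts:
    print(f"cohort g={int(g)}:", np.round(att[g][: n_periods - int(g)], 2))
</code></pre><p>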
They then apply the method to a subset of the Callaway and Sant&#8217;Anna (2021) <a href="https://bcallaway11.github.io/did/">minimum wage dataset</a>, looking at log teen employment across 500 counties between 2003&#8211;2007 (the treatment is defined by the year a state first increased its minimum wage), and to a simulation study designed to assess how well the model recovers ATT(g,&#8239;t) under various data-generating processes. In both examples they demonstrate how their method produces smoother, more stable, and better-identified effect paths than TWFE or recent nonparametric DiD alternatives.</p><p><em>Why is this important?</em></p><p>Most methods for staggered DiD focus on correcting biases in TWFE, but they rely on large samples and assume you can nonparametrically estimate everything you need. That&#8217;s often not true, especially in small datasets or when effects change over time. This paper shows that you don&#8217;t have to give up on credible DiD just because your data are sparse or noisy. By directly modelling how treatment effects evolve, the method can pick up patterns that other estimators miss and provide honest uncertainty around those estimates. It also avoids the common headaches with event-study plots, like wild fluctuations or imprecise comparisons across groups and periods. Instead, it delivers smooth, interpretable trajectories of treatment effects along with full posterior intervals. The big takeaway: if you're in a setting where standard DiD feels &#8220;fragile&#8221;, a Bayesian approach can give you clearer answers with fewer assumptions about what the data can do.</p><p><em>Who should care?</em></p><p>This paper is for you if:</p><ul><li><p>You work with panel data where treatment happens at different times, and effects might change over time</p></li><li><p>Your sample size is small or your data are noisy, and standard DiD methods feel unstable</p></li><li><p>You&#8217;re interested in Bayesian methods but want something grounded in causal questions</p></li><li><p>You&#8217;ve used tools like Callaway and Sant&#8217;Anna or Sun and Abraham, but want smoother, more precise estimates (especially for group-time effects)</p></li></ul><p>It&#8217;s especially useful if your event-study plots are bumpy, hard to interpret, or if you&#8217;re trying to understand treatment dynamics but the data are too thin for conventional methods.</p><p><em>Do we have code?</em></p><p>Prof Shimizu <a href="https://www.linkedin.com/posts/kenichi-shimizu_model-based-estimation-of-difference-in-differences-activity-7333565041767497729-kWok?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAACTHmhoBpQRxz0hUy8fhU1GuRQuRyKVjdd0">said</a> that a user-friendly package 'bdid' will be available soon in R, Matlab, and Stata.</p><p>In summary, most DiD methods try to fix issues with staggered timing by adjusting the estimator. This paper takes a different route: model the treatment effects directly using Bayesian methods. Profs Chib and Shimizu show how you can estimate group-time treatment effects as latent variables, letting them evolve smoothly over time while accounting for uncertainty. The result is especially useful when you&#8217;re working with small samples, messy data, or want more than just point estimates. 
If your usual tools are giving you noisy ATT plots or unstable dynamics, this is a good alternative, especially if you&#8217;re open to bringing more structure (and full posteriors) into your analysis.</p><h3>The Experimental Selection Correction Estimator: Using Experiments to Remove Biases in Observational Estimates</h3><p><em>(I don&#8217;t think the authors need an introduction)</em></p><h5>TL;DR: as more researchers rely on large observational datasets, a recurring issue emerges: these datasets capture rich outcomes and wide populations but lack randomisation (it turns out that even with larger Ns you still <em>do</em> need randomisation). At the same time, smaller experiments offer clean causal identification but only for a narrow set of outcomes or people. In this paper the authors introduce the Experimental Selection Correction (ESC) estimator, a way to use the experimental data to estimate and correct the bias in observational estimates. Think of it as training a correction function on the experiment, then applying it to the observational data. It&#8217;s a modern rethinking of Heckman-style selection correction, but with fewer assumptions and better performance when combined with ML tools.</h5><p><em>What is this paper about?</em></p><p>There is a common, real-world problem many of us applied researchers face: you have an experiment, but it&#8217;s small or limited in scope; there&#8217;s also a big observational dataset, but it&#8217;s potentially biased. How do we combine them to get accurate causal estimates? Profs Athey, Chetty, and Imbens propose the Experimental Selection Correction (ESC) estimator, which uses the experimental data to learn the selection bias in the observational estimates, and then corrects for that bias across the broader sample. Their set-up is: in the observational data, one observes treatment and a wide set of outcomes, but treatment is not random, whereas in the experimental data, treatment is randomly assigned, but outcomes are more limited or less representative. Their key idea is: if one can measure the difference between the experimental and observational estimates for units observed in both, one can estimate a bias function. Then one can use that function to correct the observational estimates in the larger dataset. Mathematically, it&#8217;s similar in spirit to <a href="https://rlhick.people.wm.edu/stories/econ_407_notes_heckman.html">Heckman&#8217;s selection model</a>, but the correction term is learned from experimental data rather than assumed from a structural model. </p><p><em>What do the authors do?</em></p><p>Profs Athey, Chetty, and Imbens introduce the Experimental Selection Correction (ESC) estimator, which adjusts observational treatment effect estimates using experimental data. 
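</p><p>A toy sketch may help fix ideas before the diagram: simulate a large confounded observational sample and a small experiment with narrower covariate support, learn how far the observational contrast sits from the experimental one as a function of covariates, and subtract that predicted bias everywhere. This is a stylised illustration in Python of the logic described above, with simulated data and invented names; it is not the authors&#8217; estimator.</p><pre><code># Toy sketch of the bias-correction logic: learn the observational-minus-
# experimental gap where the experiment has support, then extrapolate that
# correction across the full observational covariate range. Illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

def simulate(n, randomized, x_max):
    x = rng.uniform(0, x_max, n)                          # observed covariate
    u = rng.normal(size=n)                                # unobserved confounder
    if randomized:
        d = rng.integers(0, 2, n).astype(float)
    else:
        d = (u + rng.normal(size=n) > 0).astype(float)    # selection on u
    y = 2.0 * d + x + (1.0 + x) * u + rng.normal(size=n)  # true effect 2.0, bias grows with x
    return x, d, y

def fit_cate(x, d, y):
    # linear model with a d*x interaction; returns the estimated treated-vs-untreated
    # contrast as a function of the covariate value
    feats = np.column_stack([d, x, d * x])
    m = LinearRegression().fit(feats, y)
    return lambda xs: m.coef_[0] + m.coef_[2] * xs

x_o, d_o, y_o = simulate(200_000, randomized=False, x_max=2.0)  # big observational sample
x_e, d_e, y_e = simulate(5_000, randomized=True, x_max=1.0)     # small experiment, narrow support

tau_obs = fit_cate(x_o, d_o, y_o)   # biased contrast, available everywhere
tau_exp = fit_cate(x_e, d_e, y_e)   # clean contrast, only where the experiment ran

# learn the bias function on the experimental support, then apply it to all of x
bias_model = LinearRegression().fit(x_e.reshape(-1, 1), tau_obs(x_e) - tau_exp(x_e))

grid = np.array([0.5, 1.0, 1.5, 2.0])
corrected = tau_obs(grid) - bias_model.predict(grid.reshape(-1, 1))
print("naive observational:", np.round(tau_obs(grid), 2))
print("bias-corrected     :", np.round(corrected, 2), "(true effect is 2.0 everywhere)")
</code></pre><p>With richer covariates the two linear regressions would be replaced by ML learners, but the division of labour is the same: the experiment pins down the bias, the observational data supplies the scale.</p><p>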
We even have a DAG to exemplify the idea:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3PLd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3PLd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 424w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 848w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3PLd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png" width="1093" height="1042" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1042,&quot;width&quot;:1093,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/164226519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3PLd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 424w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 848w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!3PLd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f779246-93c1-4a25-b4f9-a5298f0611af_1093x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Panels A and B represent traditional surrogate models using experimental and observational data. Panels C and D show how ESC brings in the direct effect of treatment and how unobserved confounding affects the observational estimates. ESC uses the clean experimental setting (Panel C) to estimate the bias present in Panel D and remove it.</figcaption></figure></div><p>As one can see from the illustration, they build on the surrogate model literature, where researchers estimate the effect of a treatment on a surrogate outcome (like test scores), and then use the observed relationship between the surrogate and a long-term outcome (like graduation rates) to back out the treatment effect. This approach has been influential (<a href="https://www.nber.org/system/files/working_papers/w26463/w26463.pdf">Athey et al., 2019</a>), but it assumes no unobserved confounding, which is an assumption often violated in observational settings. The ESC estimator extends the surrogate approach by explicitly modelling and correcting the bias in the observational estimates of long-term outcomes. It works in a few steps: you use the experimental sample (where treatment is randomly assigned) to estimate the bias in observational treatment effect estimates (i.e. the difference between experimental and observational effects, conditional on covariates); then you estimate a bias function (possibly using ML), that maps observed covariates to the size of the bias; and finaly you apply this bias correction to the observational sample to get a debiased treatment effect estimate. Then we can use a the large, rich observational dataset (but with bias removed) effectively combining the precision and scale of big data with the credibility of randomised experiments.</p><p><em>Why is this important?</em><br>There's this <a href="https://x.com/rabois/status/1242642669840429056">tweet</a> that went viral once which reflects a common misconception that -especially in the era of Big Data - if your sample is large enough, you don&#8217;t need randomisation. But this paper flips that idea on its head. The authors show that selection bias doesn&#8217;t vanish with more data, it just becomes more precisely wrong. 
You can collect all the observational data in the world, but if treatment assignment is biased, the resulting estimate will be too. That&#8217;s why this paper matters. It offers a way to quantify and remove that bias using a well-designed experiment, even if the experiment is small or doesn&#8217;t cover every outcome. The idea is to enhance observational data for causal inference by using experimental insights to correct its flaws, thereby avoiding the need for complete replacement. It&#8217;s a flexible, scalable solution that keeps the credibility of experiments while unlocking the scope of big data.</p><p><em>Who should care?</em></p><p>This is for researchers who:</p><ul><li><p>Work with large observational datasets (e.g. admin data, education records, firm-level panels)</p></li><li><p>Have access to some experimental variation, but not at the scale or granularity they need</p></li><li><p>Worry about selection bias but don&#8217;t want to rely on strong structural assumptions</p></li><li><p>Use ML for causal inference, but want a method that won&#8217;t be too difficult to justify when someone asks in a seminar</p></li></ul><p><em>Do we have code?</em><br>We have an entire <a href="https://github.com/OpportunityInsights/Experimental-Selection-Correction-Replication-Code">repo</a> (in R) dedicated to it.</p><p>In summary, this paper delivers a powerful message: randomisation is still useful, even in the age of Big Data. Profs Athey, Chetty, and Imbens show how you can use experimental evidence to identify effects, as well as to repair the observational ones. Instead of throwing out observational data or blindly trusting it, their estimator lets you calibrate it. You borrow just enough credibility from the experiment to subtract out the bias, and then scale up your estimates with confidence. It&#8217;s a modern take on selection correction, grounded in theory but ready for real-world data.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Latent variables within a state-space structure means that the treatment effects aren&#8217;t directly observed or estimated from a single regression equation. Instead, they&#8217;re treated as unobserved time series that evolve over time according to a probabilistic process. 
The model uses the observed data to infer these hidden values step by step, borrowing information across time and units.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[DDD Estimators, Distributional Effects, Causal Diagrams, and Cross-Field Counterfactuals]]></title><description><![CDATA[New methods and perspectives in causal inference research]]></description><link>https://www.diddigest.xyz/p/ddd-estimators-distributional-effects</link><guid isPermaLink="false">https://www.diddigest.xyz/p/ddd-estimators-distributional-effects</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 23 May 2025 12:37:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7GB5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7GB5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7GB5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7GB5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1244938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/164225524?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!7GB5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7GB5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff85ae8ec-2b5d-40d8-917e-299875c8edb9_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! Hope you&#8217;re all doing well :) </p><p>Today we have all sorts of studies, and one &#8220;treat&#8221; at the end :)</p><ol><li><p><a href="https://arxiv.org/pdf/2505.09942">Better Understanding Triple Differences Estimators</a>, by Marcelo Ortiz-Villavicencio and Pedro H. C. Sant&#8217;Anna</p></li><li><p><a href="https://arxiv.org/pdf/2408.01208">Distributional Difference-in-Differences with Multiple Time Periods</a>, by Andrea Ciaccio</p></li><li><p><a href="https://arxiv.org/pdf/2505.03526">Using Causal Diagrams To Assess Parallel Trends In Difference-In-Differences Studies</a>, by Audrey Renson, Oliver Dukes and Zach Shahn</p></li><li><p><a href="https://arxiv.org/pdf/2505.13324">From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI</a>, by Galit Shmueli, David Martens, Jaewon Yoo, and Travis Greene</p></li></ol><p><em>(I apologize if there are any mistakes here - let me know in the comments! I confess I haven&#8217;t properly slept in the past 5 days, and I&#8217;ve been writing between airports lol terrible idea)</em></p><h3>Better Understanding Triple Differences Estimators</h3><p><em>(I am so happy to cover this paper! 
<a href="https://github.com/marcelortizv">Marcelo</a> is my friend and prof Pedro&#8217;s - who needs no introduction - PhD student. <a href="https://x.com/pedrohcgs/status/1923447096444694545">Here</a> you can check a thread about it in prof Pedro&#8217;s words :))</em></p><h5>TL;DR: DDD estimators are commonly used to recover treatment effects when standard DiD PT assumptions are too strict. DDD allows for certain violations (e.g., group-specific or eligibility-specific trends) while still producing &#8220;valid&#8221; estimates if the identifying assumptions are correctly specified. But here&#8217;s the problem: in many realistic settings, those assumptions only hold after conditioning on covariates. Failing to account for this - by erroneously proceeding as if DDD was just the difference between 2 DiD estimators - leads to bias. Prof Pedro and Marcelo propose new regression-adjusted, inverse probability weighted, and doubly robust estimators that remain valid in these more realistic scenarios. They also show that in staggered adoption settings, pooling not-yet-treated units (as is practice in DiD) invalidates DDD. The result is a practical, modern framework for credible DDD estimation, particularly in applications with covariates, staggered timing, or event studies.</h5><p><em>What is this paper about?</em></p><p>DDD (Difference-in-Difference-in-Differences) is a common strategy when treatment depends on two conditions, like living in a treated state and being eligible for a policy. It&#8217;s attractive because it lets you relax the standard DiD assumption that treated and control groups follow the same trends in the absence of treatment. Instead, DDD allows for different trends across groups, as long as the difference in trends between treatment-eligible and ineligible units is the same in treated and control areas (and vice versa). But the problem is that, in practice, most applications assume *too much*. When covariates matter, or when treatment is staggered, the typical DiD logic fails. This paper shows why and what to do instead.</p><p><em>What do the authors do?</em></p><p>They show that the usual approaches (subtracting two DiDs or running a three-way fixed effects regression) don&#8217;t work when covariates are important. Worse, if treatment is staggered, pooling all not-yet-treated units (as in modern DiD) can introduce bias due to changing group composition. To fix this, they introduce three estimators that stay valid under the right assumptions: regression-adjusted, inverse probability weighted (IPW), and doubly robust (DR). The DR version is especially flexible because it can be used with machine learning and allows combining multiple comparison groups using GMM to boost precision.</p><p><em>Why is this important?</em></p><p>DDD is meant to be more &#8220;flexible&#8221; than a 2x2 DiD, but if you implement it like a 2x2 DiD (by ignoring covariates or picking up controls like crazy), it won&#8217;t give you the correct estimates. Marcelo and Prof Pedro show why these choices lead to bias, and offer both a diagnosis and a solution: tools that let you actually relax the PT assumption without breaking identification. They also make a key point that&#8217;s often overlooked: DDD designs with staggered adoption cannot be treated as simple extensions of DiD methods. 
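</p><p>To see what the &#8220;just subtract two DiDs&#8221; shortcut looks like in the simplest unconditional 2x2x2 case (state by eligibility by time), here is a toy calculation with made-up numbers; this is exactly the logic that stops being innocuous once identification requires covariates or adoption is staggered.</p><pre><code># Toy 2x2x2 triple-differences calculation. All numbers are made up.
# mean outcomes: mean[state][eligibility][period]
mean = {
    "treated": {"eligible":   {"pre": 10.0, "post": 14.0},
                "ineligible": {"pre": 9.0,  "post": 11.5}},
    "control": {"eligible":   {"pre": 8.0,  "post": 10.5},
                "ineligible": {"pre": 7.5,  "post": 10.0}},
}

def did(state, group):
    g = mean[state][group]
    return g["post"] - g["pre"]

# DiD among the eligible (who can actually be treated) minus DiD among the
# ineligible (who only pick up state-specific shocks common to both groups)
ddd = (did("treated", "eligible") - did("control", "eligible")) \
    - (did("treated", "ineligible") - did("control", "ineligible"))
print("DDD estimate:", ddd)   # (4.0 - 2.5) - (2.5 - 2.5) = 1.5
</code></pre><p>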
They emphasize that the interaction between timing, eligibility, and changing group composition introduces complexities that demand more careful attention to identification and control selection.</p><p><em>Who should care?</em></p><p>Anyone using DDD in applied work, especially if:</p><ul><li><p>Your treatment depends on both where and who (e.g., geography and eligibility),</p></li><li><p>Covariates are essential for identification,</p></li><li><p>Your design involves staggered treatment,</p></li><li><p>You&#8217;re running event studies or subgroup analyses.</p></li></ul><p>Empirical economists, policy evaluators, health and education researchers, and methodologists in general will find this useful. And honestly, so will we grad students running triple differences.</p><p><em>Do we have code?</em></p><p>Marcelo and prof Pedro say that they are &#8220;finishing an R package that will automate all these DDD estimators, hopefully making it easier to adopt&#8221;. So we wait :)</p><p>In summary,  DDD isn&#8217;t just &#8220;two DiDs&#8221;, and treating it that way can seriously bias your results. This paper shows exactly why that happens, and more importantly, how to fix it. Marcelo and Prof Pedro give us toolkit for doing DDD right: estimators that are flexible, robust, and grounded in solid identification logic. If your setting involves staggered timing, covariates, or layered eligibility rules, this is the framework you should check out. And when the R package drops, I&#8217;ll let you know but keep an eye out for it.</p><h3>Distributional Difference-in-Differences with Multiple Time Periods</h3><p><em>(<a href="https://sites.google.com/view/andreaciaccio">Andrea</a> is a PhD candidate at Ca' Foscari University of Venice)</em></p><h5>TL;DR: most DiD estimators focus on average treatment effects. But what if you care about how a policy affects different parts of the outcome distribution, like the bottom 25% vs the top 25%? This paper extends the Quantile Treatment Effect on the Treated (QTT) framework to multiple time periods and staggered adoption settings. Andrea generalises and builds on <a href="https://onlinelibrary.wiley.com/doi/full/10.3982/QE935">Callaway and Li (2019)</a> to recover the full counterfactual distribution (not just means or quantiles!) without needing rank invariance. The result? A flexible, nonparametric method to estimate distributional effects using either panel data or repeated cross-sections.</h5><p><em>What is this paper about?</em></p><p>This paper is about moving beyond averages. In many applied settings, we care <strong>both</strong> about whether a policy had an effect <strong>and</strong> where in the outcome distribution those effects occurred. Imagine a minimum wage hike. It might raise *average wages*, but does it help low-income workers more, or mostly benefit those in the middle of the distribution? Average treatment effects won't tell you. Andrea proposes a method to estimate distributional treatment effects in settings with multiple periods, staggered adoption, and non-experimental data. The goal is to recover the full counterfactual distribution of untreated outcomes for the treated group. To do this, he builds on the QTT framework and generalises it to work in more realistic settings. The nicest part is that his approach does not rely on rank invariance, which is often violated in practice. 
Instead, he also introduces tools like stochastic dominance tests to compare distributions more credibly.</p><p><em>What does the author do?</em></p><p>He starts by generalising the QTT estimator from Callaway and Li (2019) to a more realistic setting with multiple time periods, staggered treatment, and either panel or repeated cross-section data. Andrea uses a combination of distributional parallel trends and a copula invariance assumption to identify the full counterfactual distribution of untreated outcomes for the treated group. He then proposes new estimands that are robust to violations of rank invariance, such as comparisons using stochastic dominance rankings and inequality measures like the Gini index. He also provides an actual estimator. The only other paper addressing QTT identification in staggered DiD settings - <a href="https://www.sciencedirect.com/science/article/pii/S0165176524002763">Li and Lin (2024)</a>, which I mentioned in another post - stops at theory and does not propose an estimator, which limits its applicability. Andrea&#8217;s contribution fills that gap with a method researchers can use.</p><p><em>Why is this important?</em></p><p>This paper is important because heterogeneous effects matter. Policies rarely impact everyone the same way, and when we focus only on averages, we risk missing who actually benefits (or loses). Andrea&#8217;s method lets us go beyond the ATT and estimate how the entire distribution shifts, which is especially relevant for policies that aim to reduce inequality or target disadvantaged groups. It also matters methodologically: while earlier papers offered theoretical identification of distributional effects under staggered DiD, Andrea is the first to provide a fully worked-out estimator. The fact that his approach works with repeated cross-sections and avoids rank invariance makes it much more usable in real-world applications.</p><p><em>Who should care?</em></p><ul><li><p>Researchers working with staggered DiD who want to go beyond average treatment effects</p></li><li><p>Applied microeconomists studying income, education, labour markets, or health, where heterogeneous impacts are expected</p></li><li><p>Anyone evaluating policies with distributional goals such as minimum wages, subsidies, school tracking, etc</p></li><li><p>People using repeated cross-sections instead of panel data</p></li><li><p>And of course: grad students who keep reading &#8220;QTT on staggered DiD&#8221; papers and wonder, okay, but how do I actually estimate that?</p></li></ul><p><em>Do we have code?</em></p><p>Andrea says that &#8220;all the MC simulations were run in STATA. The ado files of the command used for implementing the methodology presented in the paper, qtt, are available upon request at the time of writing&#8221;.</p><p>In summary, Andrea&#8217;s paper gives applied researchers a way to estimate how the entire outcome distribution would have evolved in the absence of treatment, even in complex staggered DiD settings. It works with repeated cross-sections, allows for conditioning on covariates, and avoids the strong rank invariance assumption that most distributional methods rely on. Once you have the full counterfactual distribution, you&#8217;re not limited to quantiles: you can compute Gini coefficients, Lorenz curves, or test for stochastic dominance. This makes the method flexible and genuinely useful for policy evaluation, especially when heterogeneity is central.</p>
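<p>To make that last point concrete, here is a minimal R sketch (my own illustration, not code from the paper): once you have draws from the treated group&#8217;s observed outcome distribution and from the estimated counterfactual distribution, quantile treatment effects and inequality summaries are one-liners. The counterfactual draws <code>y0_cf</code> are assumed to come out of Andrea&#8217;s estimator; here they are just simulated so the snippet runs.</p><pre><code># Toy illustration: distributional summaries once a counterfactual
# distribution is in hand (y0_cf stands in for the estimator's output).
set.seed(1)
y1    &lt;- rlnorm(5000, meanlog = 3.05, sdlog = 0.50)  # observed treated outcomes
y0_cf &lt;- rlnorm(5000, meanlog = 3.00, sdlog = 0.55)  # estimated counterfactual

# Quantile treatment effects on the treated at selected quantiles
taus &lt;- c(0.10, 0.25, 0.50, 0.75, 0.90)
qtt  &lt;- quantile(y1, taus) - quantile(y0_cf, taus)

# Inequality comparison: Gini coefficient with vs. without treatment
gini &lt;- function(y) {
  y &lt;- sort(y); n &lt;- length(y)
  2 * sum(y * seq_len(n)) / (n * sum(y)) - (n + 1) / n
}
round(c(gini_treated = gini(y1), gini_counterfactual = gini(y0_cf)), 3)</code></pre>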
<h3>Using Causal Diagrams to Assess Parallel Trends in DiD</h3><p><em>(This paper has brought flashbacks from Econometrics 2 hehe hi prof Ben if you&#8217;re reading! Also, it&#8217;s quite &#8220;visual&#8221;, which is nice)</em></p><h5>TL;DR: DiD relies on the PT assumption, but we usually just hope it holds, or eyeball pre-trends. This paper proposes a more &#8220;principled&#8221; approach: the use of causal diagrams to assess whether parallel trends is plausible before running the analysis. The authors derive three graphical conditions under which parallel trends likely fails, and show how to apply them using partially directed SWIGs (a tool used in causal inference to represent potential outcomes within a graphical model) and a linear faithfulness assumption.</h5><p><em>What is this paper about?</em></p><p>Most of us assess parallel trends by plotting outcomes before treatment and hoping the lines run in parallel. But this only works if we have multiple pre-treatment periods, and even then it might not be sufficient. This paper proposes an alternative: use causal diagrams to assess whether PT can plausibly hold, based on what we know about the data-generating model (DGM). The authors focus on the standard 2&#215;2 DiD setup (two groups, two periods) and ask a simple but deep question: under what structural conditions can we actually expect PT to hold? To answer it, they use nonparametric structural equation models and tools from causal graph theory. One key idea is linear faithfulness, which is a principle that says if two variables are connected in the DAG (i.e., d-connected), then they should also be statistically correlated in the data. This lets the authors derive graphical conditions under which the PT assumption is likely violated, just by inspecting the structure of the DAG.</p><p><em>What do the authors do?</em></p><p>The authors show how to use causal diagrams (more specifically something called a Single World Intervention Graph, or SWIG) to evaluate whether the PT assumption is even plausible, based on what we know about the DGM. A SWIG is a type of causal diagram that represents potential outcomes under specific interventions within a single possible world, encoding counterfactual relationships in graphical form while accounting for how unmeasured confounding might affect outcomes. It&#8217;s a way to visualise whether treated and control units would have followed the same trend, had neither been treated. They identify three structural features in a SWIG that signal PT is likely violated:</p><ol><li><p>If pre-treatment outcomes influence who gets treated</p></li><li><p>If different unmeasured confounders affect pre- and post-treatment outcomes</p></li><li><p>If pre-treatment outcomes directly influence post-treatment outcomes</p></li></ol><p>If any of these are in your graph, then standard DiD assumptions likely fail even if your pre-trends look fine. The authors also describe the largest possible causal structure that still permits PT, and provide R code (using the <code>dagitty</code> <a href="https://cran.r-project.org/web/packages/dagitty/index.html">package</a>) so you can test these conditions using your own diagrams. (While I was doing research for this paper, I found this <a href="https://dagitty.net/dags.html">website</a>, which is super fun to play with.)</p>
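<p>If you want to see what that kind of check looks like in code, here is a minimal <code>dagitty</code> sketch. The node names (A for treatment, Y0 and Y1 for pre- and post-treatment outcomes, U for an unmeasured confounder) are mine, and this only illustrates how to query a graph for the red-flag structures; the authors&#8217; appendix code works with partially directed SWIGs and is the one to use in practice.</p><pre><code># Encode a candidate DAG for a 2x2 DiD and look for the structures the paper flags.
library(dagitty)

g &lt;- dagitty("dag {
  U -> Y0
  U -> Y1
  Y0 -> A
  Y0 -> Y1
  A -> Y1
}")

# Condition 1: does the pre-treatment outcome influence who gets treated?
"Y0" %in% parents(g, "A")
# Condition 3: does the pre-treatment outcome directly affect the post-treatment outcome?
"Y0" %in% parents(g, "Y1")
# Condition 2 (different unmeasured confounders for Y0 and Y1) would show up as
# distinct latent parents, e.g. U0 -> Y0 and U1 -> Y1, instead of a shared U.</code></pre>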
<p><em>Why is this important?</em></p><p>What this paper shows is that we can do better than just eyeball pre-trends. Instead of relying on visual checks or intuition, we can use causal diagrams to *reason formally* about whether parallel trends is *even* plausible to begin with, based on our assumptions about the DGM. This is especially useful in applied settings where we already use DAGs to justify identification strategies like unconfoundedness or IV. Now, we can apply the same logic to DiD, and potentially avoid relying on a violated assumption that undermines our entire design. It&#8217;s also a step toward a more unified causal inference framework, one where DiD assumptions are made transparent and testable before estimation begins.</p><p><em>Who should care?</em></p><ul><li><p>Researchers using DiD with limited pre-treatment periods, especially in observational data</p></li><li><p>Applied economists who already use causal diagrams for IVs or unconfoundedness and want to bring the same rigour to DiD</p></li><li><p>Methodologists interested in making identification assumptions explicit</p></li><li><p>Reviewers and instructors who want a clearer way to teach or check the PT assumption</p></li><li><p>Anyone who has ever said &#8220;the pre-trends look parallel, so we&#8217;re probably fine&#8221;</p></li></ul><p><em>Do we have code?</em></p><p>I mean, you can kind of draw it on paper, but it does help to be able to do it on the computer. The authors provide R code in the Appendix (they use the dagitty package) to help you check whether your DAG satisfies their conditions. It&#8217;s a helpful way to make the theory actionable.</p><p>In summary, if you&#8217;re using DiD and assuming PT, this paper gives you a smarter way to ask: is that actually plausible? Instead of relying on visual pre-trend checks or hoping for the best, you can use a causal diagram to make your assumptions explicit and formally test whether PT is likely to hold. It&#8217;s clean, insightful, and genuinely useful. One of those papers that quietly upgrades how we think about something we thought we already understood. I enjoyed reading it :)</p><h3><strong>From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI</strong></h3><h5>TL;DR: &#8220;Counterfactual&#8221; means different things depending on who you ask. In causal inference (CI), it&#8217;s about estimating what would have happened under a different treatment. In explainable AI (XAI), it&#8217;s about tweaking inputs to change a model&#8217;s prediction. This paper lays out a unified framework to understand both, compares how counterfactuals are used and evaluated across fields, and points to ways CI and XAI can learn from each other.</h5><p><em>What is this paper about?</em></p><p>Both CI and XAI are built on &#8220;what if&#8221; logic, but they use counterfactuals in very different ways and often talk past each other. This paper provides a much-needed comparison between the two, showing how counterfactuals are defined, used, evaluated, and interpreted in each field. CI focuses on the THEN part - what happens under an alternative treatment. XAI focuses on the IF part - what input values would lead to a different predicted outcome. 
This shift in emphasis leads to big differences: in the quantity of interest, the assumptions made, the level of aggregation, and what is being modified (the model vs. the data). The authors propose a general definition of counterfactuals that works for both domains and lay out a roadmap for how ideas from each can inform the other.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eozZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eozZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 424w, https://substackcdn.com/image/fetch/$s_!eozZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 848w, https://substackcdn.com/image/fetch/$s_!eozZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 1272w, https://substackcdn.com/image/fetch/$s_!eozZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eozZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png" width="1439" height="442" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:1439,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/164225524?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eozZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 424w, https://substackcdn.com/image/fetch/$s_!eozZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 848w, https://substackcdn.com/image/fetch/$s_!eozZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 1272w, 
https://substackcdn.com/image/fetch/$s_!eozZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7fb6c33-50bf-4ce5-ab10-7b4de863ce40_1439x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>What do the authors do?</em></p><p>They break down the counterfactual into its core parts (inputs and outcomes) and then walk through how each field uses it:</p><ul><li><p>Purpose: CI estimates causal effects; XAI explains a prediction</p></li><li><p>Assumptions: CI assumptions are about the data-generating process; XAI typically treats the model as given for the purpose of explanation</p></li><li><p>Quantity of interest: CI compares outcomes; XAI changes inputs</p></li><li><p>Aggregation: CI typically works at the group level (e.g. ATT, ATE); XAI focuses on individual-level predictions</p></li><li><p>Modification target: CI modifies the model or treatment variable; XAI modifies the input data</p></li><li><p>Evaluation: CI cares about confidence intervals and robustness; XAI cares about feasibility, proximity, and sparsity</p></li></ul><p><em>Why is this important?</em></p><p>The word &#8220;counterfactual&#8221; is everywhere, but depending on your background, you may be using the term without realising how much baggage it carries from other fields. This paper clears that up. More importantly, it opens the door to &#8220;cross-fertilisation&#8221;: CI researchers can borrow ideas from XAI about individual-level interpretability; XAI researchers can adopt tools from CI to anchor their explanations in causal logic; and both sides can contribute to a broader understanding of actionable, responsible decision-making in high-stakes settings like healthcare, education, or finance. 
</p><p><em>Who should care?</em></p><ul><li><p>Empirical researchers using counterfactuals in any form (policy, medicine, business, etc.)</p></li><li><p>Anyone working at the intersection of prediction and explanation (ML folks!)</p></li><li><p>XAI researchers who want to build more meaningful and robust explanations</p></li><li><p>CI researchers curious about individual-level guidance and model-based insights</p></li><li><p>Anyone writing about algorithmic fairness, recourse, or model transparency</p></li></ul><p><em>Do we have code?</em></p><p>No, it is a conceptual paper, but it *does* give you a framework you can apply to your own work. The examples (hiring decisions, loan rejections, ad targeting) are accessible and very usable for teaching or presentations.</p><p>In summary, this is a paper about translation. If you've ever used the word counterfactual in a paper, a model, or a policy brief, this helps clarify what you're really doing. CI and XAI both rely on &#8220;what ifs,&#8221; but this paper shows that they mean different things, serve different goals, and run on different assumptions. I think that understanding this is the first step toward doing better science, in both fields.</p>]]></content:encoded></item><item><title><![CDATA[More DiD Drops: Continuous Treatments, Sample Selection, and Inference Tests]]></title><description><![CDATA[What to do when your treatment isn&#8217;t binary, your sample leaks, and your SEs collapse]]></description><link>https://www.diddigest.xyz/p/more-did-drops-continuous-treatments</link><guid isPermaLink="false">https://www.diddigest.xyz/p/more-did-drops-continuous-treatments</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Thu, 15 May 2025 14:12:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gxjZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gxjZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gxjZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gxjZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2635694,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/162680543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gxjZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gxjZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08312196-d813-4408-9249-2b0feeab52fe_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there!<br>We got a few updates a few days after I published the other post. 
<br>Here they are:</p><ul><li><p><a href="https://arxiv.org/abs/2201.06898">Difference-in-Differences for Continuous Treatments and Instruments with Stayers</a>, by de Chaisemartin, D&#8217;Haultf&#339;uille, Pasquier, Sow, and Vazquez-Bare (This paper was originally circulated 3 years ago, at the time without Sow as a co-author. de Chaisemartin, D&#8217;Haultf&#339;uille and Vazquez-Bare have one published in the AEA P&amp;P named <a href="https://www.aeaweb.org/articles?id=10.1257/pandp.20241049">Difference-in-Difference Estimators with Continuous Treatments and No Stayers</a>).</p></li><li><p><a href="https://arxiv.org/abs/2502.08614">Difference-in-Differences and Changes-in-Changes with Sample Selection</a>, by Javier Viviens</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5221387">Inference with Modern Difference-in-Differences Methods</a>, by Yuji Mizushima and David Powell</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLXs0Z86ZOIlHc8ApDD1v_wFi1GNChN284">Prof. Cl&#233;ment de Chaisemartin&#8217;s in-depth training on DiD</a>, a series of videos organized by Prof. Andrea Albanese and hosted by LISER &amp; DSEFM, University of Luxembourg.</p></li></ul><p>Let&#8217;s go :)</p><h3>Difference-in-Differences for Continuous Treatments and Instruments with Stayers</h3><p><em>(The 2024 AEA P&amp;P paper laid the groundwork for DiD estimators in settings with continuous treatments and no stayers (i.e., cases where all units change treatment intensity across time). In contrast, this 2025 paper generalises and complements that work by allowing for the presence of stayers (i.e., units whose treatment remains unchanged across periods), while also providing better identification and robust estimation. These stayers serve as the comparison group in a more &#8220;classical&#8221; DiD ~spirit~.)</em></p><h5><strong>TL;DR: this paper proposes new DiD estimators for settings where treatments are continuous (like prices or taxes) and some units don&#8217;t change their treatment over time (i.e. &#8220;stayers&#8221;). It defines two interpretable slope-based estimands (one for local effects (AS) and one for realised-policy evaluations (WAS)) and shows how to identify them using switchers vs. stayers with the same baseline treatment. The estimators are doubly robust, nonparametric, and come with pre-trends tests and an IV extension. If the 2024 AEA P&amp;P paper gave us DiD for continuous treatments with no stayers, this paper gives us the rest of the picture.</strong></h5><p><em>What is this paper about?</em></p><p>In many situations where we are interested in applying DiD methods, we assume binary treatments and rely on having a control group that never receives treatment. But many real-world policies (e.g. tax rates, prices, or even pollution levels) are continuously distributed. And sometimes, everyone is &#8220;treated,&#8221; just to different degrees. This paper tackles exactly that problem. It proposes DiD estimators for continuous treatments in settings where some units change their treatment (&#8220;switchers&#8221;) and others stay at the same level (&#8220;stayers&#8221;). Think of a state that raises its gasoline tax while others keep theirs unchanged. The stayers provide the comparison group, just like in a &#8220;classic&#8221; DiD. Instead of estimating treatment effects in levels, the paper focuses on slopes: how outcomes change as treatment intensity changes.</p>
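<p>To fix ideas before the formal definitions below, here is a tiny numerical sketch of that slope logic (entirely my own toy example; the paper&#8217;s estimands additionally compare each switcher to stayers with the <em>same baseline</em> treatment and use doubly robust estimation): take each switcher&#8217;s outcome change, net out the stayers&#8217; average change, divide by the treatment change, and average.</p><pre><code># Toy sketch of the switchers-vs-stayers slope idea (illustrative data and names).
df &lt;- data.frame(
  unit = 1:6,
  d1   = c(1.0, 1.0, 2.0, 2.0, 1.0, 2.0),   # treatment intensity in period 1
  d2   = c(1.5, 1.0, 2.8, 2.0, 1.4, 2.0),   # treatment intensity in period 2
  y1   = c(10, 11, 14, 15, 9, 16),
  y2   = c(12, 11.5, 18, 15.4, 11, 16.3)
)
df$dy &lt;- df$y2 - df$y1
df$dd &lt;- df$d2 - df$d1
switchers &lt;- df$dd != 0

trend  &lt;- mean(df$dy[!switchers])                      # stayers' average change
slopes &lt;- (df$dy[switchers] - trend) / df$dd[switchers]

AS  &lt;- mean(slopes)                                    # equal weight per switcher
WAS &lt;- weighted.mean(slopes, abs(df$dd[switchers]))    # weight by |treatment change|
c(AS = AS, WAS = WAS)</code></pre>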
<p>The authors define two estimands:</p><ul><li><p>AS (Average of Slopes): the average effect of treatment changes among switchers; it gives equal weight to all switchers regardless of treatment change magnitude.</p></li><li><p>WAS (Weighted Average of Slopes): a slope summary that gives more weight to larger treatment changes, which is helpful for cost-benefit analysis. WAS weights by the absolute value of treatment changes.</p></li></ul><p>They also extend the method to IV settings (e.g. using tax changes to estimate price elasticity), allow for covariates, dynamic treatment effects, and multi-period panels, and provide doubly-robust, nonparametric estimators with valid inference and pre-trends tests.</p><p><em>What do the authors do?</em></p><p>They propose two estimators: the Average of Slopes (AS), which captures how outcomes change with treatment for switchers on average, and the Weighted Average of Slopes (WAS), which weights those changes by how much treatment actually changed. Both estimators compare switchers to stayers with the same baseline treatment and rely on a new parallel trends assumption that allows for flexible heterogeneity. They also show how to estimate these slopes using doubly-robust, nonparametric estimators with valid inference, even in small samples. The paper extends the framework to handle instrumental variables, multiple time periods, covariates, and even dynamic treatment effects. They wrap it all up with an application on gasoline taxes and a Stata package for implementation.</p><p><em>Why is this important?</em></p><p>This is really important because most real-world treatments are not binary. Prices, taxes, subsidies, exposure levels - they often vary in intensity, not just presence. Traditional DiD estimators break down in these cases, especially when there&#8217;s no clear control group. This paper offers a practical fix. By comparing switchers and stayers at the same starting point, it builds credible counterfactuals without needing everyone to be untreated. It also delivers interpretable slope-based effects that can answer both &#8220;what-if&#8221; and &#8220;was-it-worth-it&#8221; questions. And unlike many continuous-treatment approaches, these estimators are nonparametric, doubly-robust, and testable using pre-trends. That makes them much more usable in applied work.</p><p><em>Who should care?</em></p><p>Basically anyone studying policies where treatment varies by degree and not just presence. That includes researchers working on:</p><ul><li><p>Taxes and subsidies</p></li><li><p>Prices and price regulation</p></li><li><p>Pollution exposure or health risk levels</p></li><li><p>Education or welfare policies with varying intensity</p></li></ul><p>It&#8217;s especially useful for applied micro folks frustrated by DiD methods that fall apart when there&#8217;s no clean control group. If you&#8217;ve ever had to drop observations just because &#8220;everyone got some treatment,&#8221; this paper is for you.</p><p><em>Do we have code?</em></p><p>Yes, partially. The authors have released a Stata package called <code>did_multiplegt_stat</code> (currently available via SSC) that implements some of their estimators. It&#8217;s still under development, but enough is there to apply the method to standard two-period and multi-period settings. No R version yet.</p><p>In summary, this paper fills a big gap in the DiD toolbox: how to estimate treatment effects when everyone&#8217;s treated, but to different degrees. 
It introduces slope-based estimands that are interpretable, testable, and robust, and opens the door to more credible applied work with continuous policies. If the earlier AEA P&amp;P paper gave us DiD without stayers, this one gives us the stayers and the estimators we need. Just keep in mind some things that the authors themselves noted: while this approach offers clear advances over prior DiD work, it&#8217;s not without constraints. The AS estimator can&#8217;t handle "quasi-stayers" (units with tiny treatment changes), which prevents consistent estimation. In applications with many time periods, the method assumes no dynamic effects, meaning outcomes are only influenced by current treatment, not past treatment. Though the authors suggest fixes for this, these workarounds shrink the sample size. And as with all DiD methods, the underlying parallel trends assumption remains fundamentally untestable for the actual treatment period, even if placebo tests look promising.</p><h3>Difference-in-Differences and Changes-in-Changes with Sample Selection</h3><p><em>(<a href="https://www.javierviviens.com/">Javier</a> is a 3rd-year PhD student at the European University Institute in Italy. This is his WP. I had fun reading it, Javier&#8217;s writing is really pleasant and easy to follow)</em></p><h5>TL;DR: when treatment affects who stays in your sample, DiD can break down. This paper shows that DiD estimates can become non-causal (or even undefined) under sample selection, especially when outcomes aren't observed or even well-defined for some units. It adapts Lee bounds to both DiD and Changes-in-Changes (CiC) settings and proposes a method to estimate causal effects for the &#8220;Always-Observed&#8221; stratum (e.g. the people for whom outcomes exist under both treatment and control). The paper also relaxes monotonicity assumptions and offers a partial identification strategy that holds up even when treatment is confounded.</h5><p><em>What is this paper about?</em></p><p>This working paper tackles a classic problem DiD designs often can&#8217;t account for: what if treatment affects who&#8217;s in your sample? Think dropout, death, non-response, or employment status. If treatment changes who shows up in the data, standard DiD estimates can become biased or even meaningless. The paper focuses on units for whom the outcome is always defined (what Javier calls the Always-Observed stratum). For this group, he develops a way to estimate both average and quantile treatment effects, even when the treatment is confounded and sample selection is not ignorable. He also adapts Lee bounds to both DiD and Changes-in-Changes (CiC) frameworks, providing partial identification results under weaker assumptions than existing methods.</p><p><em>What does the author do?</em></p><p>Javier reworks DiD for cases where sample selection isn&#8217;t ignorable, aka endogenous (e.g. when people drop out of a study because of the treatment, or when the outcome just doesn&#8217;t exist, like wages for the unemployed). He focuses on the subset of people for whom the outcome is well-defined regardless of treatment status and builds identification and inference strategies around that group. 
He generalises Lee bounds to the DiD and CiC settings, allowing for partial identification of:</p><ul><li><p>The average treatment effect for Always-Observed units (ATTAO), and</p></li><li><p>The quantile treatment effect for Always-Observed units (QTTAO), to explore distributional impacts.</p></li></ul><p>In the paper he also extends the method to relax monotonicity assumptions by using multiple sources of sample selection (like attrition and dropout), which lets him point-identify stratum proportions in more realistic settings. Finally, he puts everything to work in a job training program evaluation in Colombia, showing how na&#239;ve DiD would have overstated the treatment effect. It&#8217;s the kind of paper that quietly fills a gap many applied researchers have likely tiptoed around without realising how deep it goes.</p><p><em>Why is this important?</em></p><p>In real-world data, who is observed often depends on the treatment. Job training affects employment; education policies affect dropout. And when outcomes don&#8217;t exist for the unobserved (like wages for the unemployed), even the causal estimand can fall apart. Most DiD applications quietly assume this isn&#8217;t a problem. This paper says otherwise, and gives you tools to deal with it. By focusing on Always-Observed units and adapting Lee bounds, it lets you estimate effects that are well-defined, interpretable, and policy-relevant, even when selection is messy and treatment is confounded. If you&#8217;ve ever run DiD on a shrinking sample and just hoped for the best, this might be your way out of it.</p><p><em>Who should care?</em></p><p>Anyone doing DiD with panel or survey data where attrition, dropout, or censoring might be related to treatment. That includes:</p><ul><li><p>Labour economists studying training, unemployment, or job loss</p></li><li><p>Education researchers tracking dropout or test participation</p></li><li><p>Health economists dealing with mortality or survey non-response</p></li></ul><p>It&#8217;s also relevant for RCTs with imperfect compliance or missing outcomes, especially when researchers try to squeeze DiD or CiC frameworks onto real data that leaks. If your outcomes only exist for a non-random slice of your sample, this is for you.</p><p><em>Do we have code?</em></p><p>An R package is in development (we should give Javier an RA), though not yet public. The paper is very clear on the estimation steps, so if you&#8217;re comfortable with trimming and quantile bounds, you could replicate the method manually.</p><p>In summary, when treatment affects who&#8217;s in your sample, your DiD estimates might not mean what you think they do (or anything at all). This WP offers a sharp correction: estimate effects only for units with well-defined outcomes under both treatment arms. It adapts Lee bounds to DiD and CiC, covers both mean and quantile effects, and relaxes classic assumptions like monotonicity. A clean, careful contribution that quietly plugs a major gap in applied DiD work. Javier also identifies many promising extensions to build on this work, like: extending the method to settings with multiple pre- and post-treatment periods, which would help test identification assumptions more thoroughly; adapting the approach for staggered adoption designs through pairwise comparisons or integration with doubly robust estimators; incorporating covariates to tighten bounds and increase credibility; and even adopting a Bayesian perspective when bounds are too wide to be informative. Something to look forward to.</p>
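<p>If you want a feel for the trimming logic in the meantime, here is a minimal sketch of classic Lee (2009) bounds in a single cross-section (my own toy illustration; Javier&#8217;s contribution is precisely to adapt and weaken this logic for DiD and CiC, so treat this as the starting point, not his estimator):</p><pre><code># Toy Lee-style trimming bounds for the always-observed stratum.
# d = treatment, s = "outcome observed" indicator (e.g. employed),
# y = outcome, only meaningful when s == 1. Assumes treatment weakly
# increases the probability of being observed. Names are illustrative.
lee_bounds &lt;- function(y, d, s) {
  p1 &lt;- mean(s[d == 1]); p0 &lt;- mean(s[d == 0])
  p  &lt;- p0 / p1               # share of treated-and-observed who are always-observed
  y1 &lt;- y[d == 1 &amp; s == 1]
  y0 &lt;- y[d == 0 &amp; s == 1]
  # Keep only the bottom (lower bound) or top (upper bound) p share of treated outcomes
  lower &lt;- mean(y1[y1 &lt;= quantile(y1, p)]) - mean(y0)
  upper &lt;- mean(y1[y1 >= quantile(y1, 1 - p)]) - mean(y0)
  c(lower = lower, upper = upper)
}</code></pre>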
<h3>Inference with Modern Difference-in-Differences Methods</h3><p><em>(Yuji is a PhD student at RAND Graduate School and an assistant policy analyst at RAND, while David is a senior economist at RAND and a member of the RAND School of Public Policy faculty)</em></p><h5>TL;DR: most new DiD estimators fix bias from staggered adoption and heterogeneity, but what about inference? This paper runs large-scale simulations to test how well these estimators behave when the number of treated units is small. The authors find that most over-reject the null, especially with fewer than 10 treated units. The one approach that consistently stays close to the nominal 5% rejection rate? An imputation-based estimator paired with a wild bootstrap. If you&#8217;ve been assuming the standard errors are fine, you might want to read this.</h5><p><em>What is this paper about?</em></p><p>In this paper the authors evaluate how well modern DiD estimators perform when it comes to inference, especially in small-sample settings. Over the last few years, a wave of new DiD methods has corrected for issues like staggered adoption and treatment heterogeneity, but most of the focus has been on point estimates. Yuji and Powell ask: do these methods produce reliable standard errors and valid p-values when the number of treated units is small? To answer that, they simulate thousands of placebo policies (in CPS data and synthetic panels) and compare how often each estimator falsely rejects the null when it shouldn&#8217;t. The paper covers seven popular estimators (including Callaway and Sant&#8217;Anna, Sun and Abraham, Borusyak et al., Gardner&#8217;s 2SDID, and DCDH) and evaluates different inference procedures: cluster-robust SEs, wild bootstrap, randomization inference, and loads more.</p><p><em>What do the authors do?</em></p><p>They run a battery of simulations (both in real CPS data and fully synthetic panels) to see how different DiD estimators perform under the null. The key question is: do these methods reject the null too often when the number of treated units is small? Most do. They test seven leading DiD estimators, including Callaway and Sant&#8217;Anna (2021), Sun and Abraham (2021), Borusyak et al. (2024), Gardner&#8217;s 2SDID, Wooldridge (2024), de Chaisemartin &amp; d&#8217;Haultfoeuille (DCDH), and stacked DiD. Each is evaluated with its default inference method (usually asymptotic or cluster-robust), and sometimes also with wild bootstrap or randomization inference. They vary sample size, number of treated units, and treatment heterogeneity. Their best-performing combo? Imputation-based estimators paired with wild bootstrap, which is stable even with just 5 treated units. They also call attention to two competing biases at work: small samples lead to over-rejection, while treatment heterogeneity inflates standard errors and reduces rejection rates. By comparing simulations with and without heterogeneity, they show these biases don't really cancel each other out. Surprisingly, they also find that for some estimators, increasing the number of untreated units actually worsens rejection rates rather than improving them, which goes against the usual intuition that more untreated units = better inference.</p>
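<p>The basic size experiment is easy to reproduce on a small scale. Here is a minimal sketch (my own toy version using a plain TWFE regression via the <code>fixest</code> R package, not the authors&#8217; Stata setup, their CPS design, or their full battery of estimators): simulate panels with no true effect, treat a handful of units with a placebo policy, and count how often a nominal 5% test rejects.</p><pre><code># Toy placebo-size check: no true effect, few treated units, cluster-robust SEs.
library(fixest)
set.seed(42)

one_rep &lt;- function(n_units = 40, n_treated = 5, n_periods = 10) {
  df &lt;- expand.grid(id = 1:n_units, t = 1:n_periods)
  df$y &lt;- rnorm(n_units)[df$id] + rnorm(n_periods)[df$t] + rnorm(nrow(df))
  treated &lt;- sample(n_units, n_treated)
  df$d &lt;- as.integer(df$id %in% treated &amp; df$t > n_periods / 2)  # placebo treatment
  m &lt;- feols(y ~ d | id + t, data = df, cluster = ~id)
  pvalue(m)["d"]
}

pvals &lt;- replicate(500, one_rep())
mean(pvals &lt; 0.05)   # empirical size; with few treated clusters this tends to drift above 0.05</code></pre>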
<p><em>Why is this paper important?</em></p><p>It&#8217;s important for a couple of reasons: most of us don&#8217;t have the time (or the human resources) to run 1,000 simulations every time we worry about small-sample inference. This paper does that work for us. It shows that many modern DiD estimators (despite fixing bias from staggered timing and treatment heterogeneity) still rely on inference procedures that fall apart when the number of treated units is small. Cluster-robust SEs and asymptotic formulas often lead to massive over-rejection. That means our p-values might be &#8220;lying&#8221; to us, and that published findings could be reporting spurious significance just because inference was off. This paper also points to safer alternatives, like pairing imputation methods with wild bootstrap, that behave much better in these settings.</p><p><em>Who should care?</em></p><p>Anyone doing DiD with a small number of treated units, which includes cases where researchers are:</p><ul><li><p>Evaluating state- or district-level policies with a few adopters</p></li><li><p>Analysing corporate or regional rollouts</p></li><li><p>Working with staggered adoption in small samples</p></li></ul><p>It&#8217;s especially relevant if you&#8217;ve moved beyond TWFE but still rely on default SEs, or assume that new methods automatically fix inference too. This paper is a reminder that point estimates and p-values live separate lives, and we need to validate both.</p><p><em>Do we have code?</em></p><p>This is less a &#8220;new code&#8221; paper and more a &#8220;here&#8217;s how to use the tools you already have <em>properly</em>&#8221; paper. The authors use existing Stata packages for all the estimators (like <code>did2s</code>, <code>csdid</code>, <code>did_imputation</code>, etc.) and pair them with inference strategies like wild bootstrap and randomization tests. No new package is released, but if you&#8217;re already using these estimators, you can replicate their setup pretty easily.</p><p>In summary, Yuji and Powell take the DiD estimators many of us already use and ask a simple question: do they behave when treated units are few? In most cases, the answer is no. Inference breaks down, p-values overstate significance, and standard errors underperform. But not all hope is lost: pairing imputation-based estimators with a wild bootstrap comes closest to nominal rejection rates, even with just 5 treated units. 
If you&#8217;re working with small samples or rare treatment, this paper is a practical guide to doing inference that actually holds up.<br></p>]]></content:encoded></item><item><title><![CDATA[Four new DiD studies + a textbook update]]></title><description><![CDATA[In celebration of 500 subs :) happy to have you all here !]]></description><link>https://www.diddigest.xyz/p/four-new-did-studies-a-textbook-update</link><guid isPermaLink="false">https://www.diddigest.xyz/p/four-new-did-studies-a-textbook-update</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 28 Apr 2025 15:32:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9OJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9OJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9OJm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9OJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134227,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/162030655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9OJm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9OJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6121055a-4633-43f4-985d-3c4c7f16f204_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hi there! It&#8217;s been a while :) It turns out some interesting new developments hit the web in the meantime, and they&#8217;re the ones we will be talking about today.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.diddigest.xyz/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DiD Digest is a reader-supported publication. 
<p>I highlight 5, in no specific order:</p><ol><li><p><a href="https://arxiv.org/pdf/2502.16126">Conditional Triple Difference-in-Differences</a>, by Dor Leventer</p></li><li><p><a href="https://arxiv.org/pdf/2501.14405">Triple Instrumented Difference-in-Differences</a>, by Sho Miyaji</p></li><li><p><a href="https://arxiv.org/pdf/2504.13057">Covariate Balancing Estimation and Model Selection For Difference-in-Differences Approach</a>, by Takamichi Baba and Yoshiyuki Ninomiya</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4736857">Spatial Synthetic Difference-in-Differences</a>, by Renan Serenini and Frantisek Masek (it was posted last year, but some of you might not be aware of it)</p></li><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4487202">Credible Answers to Hard Questions: Differences-in-Differences for Natural Experiments</a>, by Cl&#233;ment de Chaisemartin and Xavier D'Haultf&#339;uille (this is a textbook originally published two years ago, but it has just been updated.)</p></li></ol><h3>Conditional Triple Difference-in-Differences</h3><p><em>(<a href="https://sites.google.com/mail.tau.ac.il/dor-leventer/home">Dor</a> is a PhD candidate - advised by prof Saporta-Eksten - and apparently a very prolific one!)</em></p><h5><strong>TL;DR:</strong> your trusty triple-difference (TDID) regression stuffed with dozens of controls can still feed you biased estimates whenever the treatment and comparison groups carry different mixes of those controls. This pre-print pins down the bias mathematically, then fixes it with a <strong>re-weighted, doubly-robust TDID estimator</strong>: one set of weights forces covariate balance, another piece mops up any remaining outcome misspecification. Dor gives us an R package, <code>tdid</code>, so you can swap the new estimator into your code and get triple differences that really reflect the causal effect you care about.</h5><p><em>What is this paper about?</em></p><p>Researchers often extend DiD to a triple difference (TDID) when an extra contrast is needed (e.g., time &#215; treated &#215; group). In practice, many studies add covariate controls and then difference the two DiD estimates, assuming the residual bias cancels. Dor shows this fails whenever the groups have different covariate distributions (because the group-specific bias in conditional parallel trends generally does not wash out). Merging the unconditional TDID framework of <a href="https://academic.oup.com/ectj/article-pdf/25/3/531/45842047/utac010.pdf">Olden &amp; M&#248;en (2022)</a> with the conditional-DiD framework of <a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948">Callaway-Sant&#8217;Anna (2021)</a>, he formalises a conditional TDID setting, proves the conventional estimator is biased, and derives an alternative estimand that is recoverable: ATT for group A minus a covariate-reweighted conditional ATT for group B (using group A&#8217;s X-distribution). 
A double-robust/weighted-double-robust (DR/WDR) estimator with influence-function theory delivers consistent inference, and the accompanying R package <code>tdid</code> makes implementation and Monte-Carlo replication easy.</p><p><em>What does the author do?</em></p><p>Dor develops a conditional TDID framework that unifies two literatures: the unconditional triple&#8208;difference set-up of Olden &amp; M&#248;en (2022) and the covariate-adjusted DiD framework of Callaway-Sant&#8217;Anna (2021). Within this framework he:</p><ol><li><p>Diagnoses the problem:<br>Theorem 1 shows that the familiar recipe (e.g., run DiD + controls separately in each group and difference the two estimates) does not deliver the desired causal contrast whenever the groups&#8217; covariate distributions differ. The bias stems from group-specific deviations in conditional parallel trends that fail to cancel.</p></li><li><p>Defines an estimand that is identifiable:<br>He shifts attention to</p><p>ATT<sub>A</sub> &#8722; E[CATT<sub>B</sub>(X) | G = A],</p><p>i.e., the treatment effect for group A minus a covariate-reweighted conditional ATT for group B, where the weights replicate group A&#8217;s covariate mix. Theorem 2 proves this quantity is identified under the same conditional-parallel-trends assumption.</p></li><li><p>Builds the estimator:<br>He pairs the usual DR estimator for group A with a weighted DR (WDR) estimator for group B, then differences the two. Because either the outcome model or the generalized propensity score can be misspecified (but not both), the estimator remains consistent and asymptotically normal; an influence-function derivation yields valid standard errors.</p></li><li><p>Checks it numerically:<br>A toy analytical example and a Monte-Carlo exercise show that the conventional estimator is biased (direction and magnitude match the theory), while the DR/WDR estimator is centered on the true parameter.</p></li></ol><p><em>Why is this important?</em></p><p>Triple&#8208;difference designs are everywhere in applied work. Dor reviews 66 highly cited TDID papers in top journals and finds that 73% add controls and 70% rely on a fully interacted three-way specification. Yet the paper proves that this widespread practice is generically biased whenever the covariate mix differs across groups, which is exactly the scenario that motivates adding controls in the first place. The contribution therefore closes a serious identification gap: it shows when and why the usual estimator fails and replaces it with a re-weighted, double-robust alternative whose consistency can survive misspecification in either the outcome model or the generalized propensity score. That safeguard is valuable for empirical researchers who increasingly pair DiD methods with high-dimensional or machine-learning covariates. Beyond theory, the paper is practice-ready. The accompanying R package <code>tdid</code> implements the weighted DR estimator and reproduces the simulation evidence, lowering the cost of adoption for graduate students, replication teams, and journal reviewers alike. 
</p><p><em>Who should care?</em></p><ul><li><p>Applied micro-economists in labour, public, development, health, education, or environmental economics who already lean on triple differences to sharpen identification.</p></li><li><p>Empirical researchers and journal referees who worry that adding controls may quietly re-introduce bias they thought the third &#8220;D&#8221; removed.</p></li><li><p>Methodologists extending the DiD toolkit to high-dimensional controls, ML adjustments, or heterogeneous treatment effects.</p></li><li><p>Graduate students and replication teams tasked with validating published TDID results that rely on the conventional &#8220;DiD-with-controls, then difference&#8221; workflow.</p></li></ul><p><em>Do we have code?</em></p><p>We have a package, <code>tdid</code>, <a href="https://github.com/dorleventer/tdid">on GitHub</a>. With it we can implement the double-robust (DR) and weighted DR estimators laid out in the paper. It provides a vignette that walks through the toy example and Monte-Carlo simulation, and exposes helper functions for influence-function standard errors and covariate-reweighting diagnostics.</p><p>Installation is one line:</p><pre><code>devtools::install_github("dorleventer/tdid")</code></pre><p>After that, estimating a bias-corrected TDID is essentially:</p><pre><code>fit &lt;- tdid(y, time, treat, group, x_covariates, data = df)
summary(fit)</code></pre><p>The README and vignette reproduce every figure in the paper, so users can trace each step from identification to inference.</p><p>When your empirical design leans on a triple difference and covariate adjustment, those estimates are only as good as the identification behind them. This paper (and the accompanying <code>tdid</code> package) supplies a theoretically sound and implementation-ready fix, ensuring the third &#8220;D&#8221; does what you intended: isolate causality, not compound bias.</p><h3>Triple Instrumented Difference-in-Differences</h3><p><em>(<a href="https://sites.google.com/view/sho-miyaji/home?authuser=0">Sho</a> will be starting their PhD at Yale this fall, which is really impressive! Keep an eye out for them :) Being familiar with econometrics notation will help you understand their work better)</em></p><h5>TL;DR: standard DID-IV compares two groups over two periods; Sho shows how to add a third dimension (e.g., season &#215; state &#215; time) and still recover a local ATE. The key is a triple Wald-DID ratio (DDD in outcomes divided by DDD in treatment) that is valid under monotonicity plus &#8220;common acceleration&#8221; (a triple-difference analogue to parallel trends). The framework extends to staggered roll-outs and ordered (non-binary) treatments, and comes with clear guidance on estimation and asymptotic inference.</h5><p><em>What is this paper about?</em></p><p>Triple-difference designs sometimes face endogeneity in the treatment itself; researchers therefore instrument and difference, but the identification conditions had not been formally spelled out. In this paper Sho defines a triple DID-IV set-up, pins down the target estimand (LATET for the subgroup affected by the instrument in the third dimension), and states the assumptions (monotonicity and common acceleration in both treatment and outcome) that make the triple Wald-DID ratio valid. The analysis covers both two-period and staggered-adoption panels.</p>
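<p>The estimand itself is simple to compute once the cell means are in hand. Here is a minimal sketch of the triple Wald-DID ratio with made-up column names (<code>post</code>, <code>exposed</code> and <code>z</code> are 0/1 indicators for period, exposure group and instrument group; <code>d</code> is treatment take-up and <code>y</code> the outcome); the paper&#8217;s estimator and standard errors instead come from the fully interacted IV regression described below.</p><pre><code># Toy triple Wald-DID: DDD of the outcome divided by DDD of the treatment.
ddd &lt;- function(v, data) {
  m &lt;- tapply(data[[v]], list(data$post, data$exposed, data$z), mean)
  did_z1 &lt;- (m["1", "1", "1"] - m["0", "1", "1"]) - (m["1", "0", "1"] - m["0", "0", "1"])
  did_z0 &lt;- (m["1", "1", "0"] - m["0", "1", "0"]) - (m["1", "0", "0"] - m["0", "0", "0"])
  did_z1 - did_z0
}

wald_ddd &lt;- function(data) ddd("y", data) / ddd("d", data)</code></pre><p>Under the paper&#8217;s monotonicity and common-acceleration assumptions, this ratio recovers the LATET for the units whose treatment is shifted by the instrument.</p>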
<p><em>What does the author do?</em></p><ol><li><p>Sho introduces notation for a two-period, three-group setting (exposed/unexposed &#215; instrumented/non-instrumented &#215; demographic group) and defines the triple Wald-DID estimand.</p></li><li><p>They prove that the ratio of DDDs equals the LATET under monotonicity and common-acceleration assumptions (Theorem 1), and generalise to ordered treatments (Theorem 2).</p></li><li><p>Then they partition units into cohorts by instrument-start date, define cohort-specific LATETs (CLATTs), and show how to estimate them with either a never-exposed or last-exposed control cohort (Theorems 3&#8211;5).</p></li><li><p>Finally, they provide influence-function formulas and show that an IV regression with triple-interaction dummies in both first stage and reduced form delivers the estimator and its standard error.</p></li></ol><p><em>Why is this important?</em></p><p>Instrumented DiD is already popular when no clean control group exists; empirical work is increasingly layering a third difference (season, demographic subgroup, geography) on top of an instrument without a clear blueprint for identification. Sho supplies that blueprint, shows the precise conditions under which the familiar &#8220;DDD-IV&#8221; regression isolates causal effects, and offers a robust alternative to two-way-fixed-effects IV estimators that can be badly biased under staggered timing.</p><p><em>Who should care?</em></p><ul><li><p>Applied researchers using instruments that switch on only for a subset within a treated group (e.g., environmental regulations applied only in summer months in participating states).</p></li><li><p>Econometricians extending DiD/IV methods to heterogeneous treatments, staggered timing, or multi-dimensional policy variation.</p></li><li><p>Reviewers assessing DDD-IV studies who need a clear checklist of assumptions and estimators.</p></li></ul><p><em>Do we have code?</em></p><p>No public package yet. Estimation boils down to a standard IV regression with fully interacted group &#215; time &#215; instrument dummies; the paper&#8217;s appendix gives ready-to-copy equations for both two-period and staggered panels. (If a replication package appears, it will likely be linked from Sho&#8217;s homepage or the arXiv record.)</p><h3>Covariate Balancing Estimation and Model Selection for Difference-in-Differences</h3><h5>TL;DR: when you run a DiD you usually weight observations by a propensity score, and if that score is even a little wrong your estimate can drift. Baba &amp; Ninomiya show how to choose weights that force the weighted treated and control samples to match on the chosen covariate moments (even quadratic ones), so bias disappears even when the propensity model is misspecified <strong>as long as the before-minus-after outcome change is linear in those covariates</strong>. They also derive a smarter information criterion that tells you which covariates are really worth including.</h5><p><em>What is this paper about?</em></p><p>Abadie&#8217;s semiparametric difference-in-differences (<a href="https://www.jstor.org/stable/3700681">SDID</a>) estimator is unbiased only when the propensity-score model is correctly specified; misspecification introduces bias. Baba and Ninomiya propose an alternative SDID procedure, covariate balancing for DID (CBD), that chooses propensity-score weights by solving moment conditions that force the weighted treated and control samples to match on selected covariate moments, including second-order terms such as xx&#7488;.
The resulting average-treatment-effect-on-the-treated (ATT) estimator is doubly robust: it stays consistent if either (i) the propensity-score model is correct or (ii) the before-minus-after change in outcomes is linear in the covariates. Baba and Ninomiya derive its large-sample distribution and show analytically and via simulations that CBD removes the bias seen with conventional maximum-likelihood weights when the propensity model is misspecified. Because the weights themselves are estimated, standard information criteria (AIC, GIC) are inappropriate for covariate selection. The authors develop an asymptotically unbiased risk-based information criterion whose penalty depends on the estimated weighting matrix and the heteroskedasticity of the weighted outcomes; in practice this penalty is often much larger than the familiar 2 &#215; (number of parameters). Simulations confirm that the new criterion selects sparser, lower-risk models than an AIC-style adaptation (QICw). A re-analysis of the LaLonde job-training data illustrates the practical gains from both CBD weighting and the new model-selection rule. </p><p><em>What do the authors do?</em></p><p>Instead of guessing a propensity-score formula and hoping for the best, they choose weights that make the weighted treated and control samples exactly match on selected covariate moments&#8212;including quadratic (covariate &#215; covariate) terms. Think of it as forcing the scales to balance before comparing outcomes. With those balanced weights in place, they run the usual before/after comparison, but with a twist: the ATT estimate stays consistent if either the weight recipe is correct or the before-minus-after change in outcomes is linear in those covariates. One correct piece is enough. Adding covariates can help or just add noise. Classic AIC-style rules under-penalise that noise once weights are estimated, so the authors derive a larger, data-dependent penalty that tells you when an extra variable earns its keep. In simulations, their method wipes out the bias that appears when the usual propensity model is misspecified, and the new information criterion selects lean, low-risk models. Re-analysing the famous LaLonde job-training data, the CBD criterion selects a noticeably different (much sparser) set of covariates than an AIC-style rule, illustrating the practical impact of both the balancing weights and the new model-selection penalty.</p><p><em>Why is this important?</em></p><p>Applied DiD studies often spend pages justifying that treatment and control &#8220;look similar.&#8221; CBD makes them similar by construction, so that diagnostic &#8220;headache&#8221; disappears. If your usual propensity-score equation is even slightly misspecified, traditional weights can push your estimate off target. CBD cushions that risk; one correct ingredient (either the weights or a simple outcome trend) is enough to keep the estimate on track. Adding controls can just as easily inflate variance as reduce bias. The new information criterion tells you quantitatively when a covariate earns its keep, instead of relying on ad-hoc *p-value fishing* or AIC rules that are too forgiving once weights are estimated. In simulations and the LaLonde re-analysis, CBD removed bias, kept models lean, and produced a noticeably larger and more precise treatment effect than the standard approach. In other words: fewer assumptions, tighter answers. 
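</p><p>There is no replication code (more on that below), but the balancing idea itself is easy to sketch generically. The snippet below is a plain entropy-tilting example of &#8220;make the weighted control moments, including squared terms, match the treated ones&#8221; in base R; it is my own construction in the spirit of CBD, <em>not</em> the authors&#8217; moment conditions, weighting matrix, or information criterion:</p><pre><code># Generic covariate-balancing sketch in the spirit of CBD (not the authors' code):
# choose control-unit weights so that the weighted control moments, including
# squared terms, exactly match the treated sample.
set.seed(1)
x_treat &lt;- rnorm(200, mean = 1, sd = 1)
x_ctrl  &lt;- rnorm(800, mean = 0, sd = 2)

moments &lt;- function(x) cbind(m1 = x, m2 = x^2)   # moments to balance
target  &lt;- colMeans(moments(x_treat))            # treated moments to hit
C       &lt;- moments(x_ctrl)                       # control moment matrix

# entropy-tilting dual: weights proportional to exp(C %*% lambda);
# minimising the dual forces the weighted control moments onto the target
dual &lt;- function(lambda) log(sum(exp(C %*% lambda))) - sum(lambda * target)
opt  &lt;- optim(c(0, 0), dual, method = "BFGS")
w    &lt;- as.numeric(exp(C %*% opt$par))
w    &lt;- w / sum(w)

# check: weighted control moments now line up with the treated moments
rbind(treated = target, weighted_control = colSums(w * C))</code></pre><p>Roughly speaking, CBD then carries weights like these into the two-period before/after contrast, and the paper&#8217;s new information criterion decides which moments are worth balancing in the first place.</p>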
<p><em>Who should care?</em></p><p>(One scope note first: the method is most directly useful in two-period DiD designs, i.e., pre-/post- only; multi-period settings will need future extensions, which the authors mention in their discussion.)</p><ul><li><p>Researchers running DiD in economics, public health, education, or policy who rely on propensity-score weights and worry about whether those weights truly balance their samples.</p></li><li><p>Any analyst who already uses &#8220;covariate-balancing&#8221; tools (CBPS, kernel balancing, etc.) is also a natural user, because CBD brings that logic into the DiD world.</p></li><li><p>Replication teams and journal reviewers who want a transparent check that treated and control groups are comparable rather than trusting a black-box propensity model.</p></li><li><p>Methodologists developing robust causal estimators. CBD adds a practical, doubly-robust tool to the DiD arsenal.</p></li></ul><p><em>Do we have code?</em></p><p>The article gives formulas, theorems, and simulation/empirical results, but it does not include any replication code, software appendix, or GitHub link. The only software-related remark is that the LaLonde data come from the R package <em>Matching</em>; everything else is presented algebraically and in tables. (The authors do mention that the weighting matrix can be computed with GMM or GEL, but they do not supply code for doing so.)</p><h3>Spatial Synthetic Difference-in-Differences</h3><p><em>(Thanks prof Renan for sending me this!)</em></p><h5>TL;DR: this paper extends Synthetic Difference-in-Differences (SyDiD) to settings where policies spill across space, violating the no-interference (SUTVA) assumption. By embedding a spatial-weights matrix in SyDiD&#8217;s weighted two-way-fixed-effects regression, the authors create <strong>Spatial SyDiD (SpSyDiD)</strong>, which simultaneously estimates <strong>direct effects on treated units</strong> and <strong>indirect (spillover) effects</strong> on their neighbours. Monte-Carlo experiments show SpSyDiD outperforms both &#8220;vanilla&#8221; SyDiD (which ignores spillovers) and Spatial DiD (which uses uniform weights) in bias and precision. A re-analysis of Arizona&#8217;s 2007 employer-sanctions law illustrates the method.</h5><p><em>What is this paper about?</em></p><p>This paper gives the classic DiD toolkit a badly needed &#8220;GPS upgrade&#8221; for the real world, where policies rarely stay inside state lines. Think minimum-wage hikes that nudge workers across borders, congestion charges that reroute traffic into the suburbs, a smoking ban in one city that pushes smokers next door, or a jobs program in one region that can poach workers from its neighbours. In standard DiD we kind of assume these cross-border ripples don&#8217;t exist, which can bend our estimates.
To tackle this problem, Serenini and Masek fuse two well-known ideas:</p><ul><li><p>Who&#8217;s my neighbour?&#8194;A spatial-weights matrix lists, for every region, which other regions lie close enough to feel its policy shock and how strongly they are connected.</p></li><li><p>Who&#8217;s my best match?&#8194;Synthetic DiD already re-weights control regions and pre-treatment periods so the treated region&#8217;s trend is mimicked as closely as possible.</p></li></ul><p>Merge the two and you get Spatial Synthetic DiD (SpSyDiD), a simple weighted regression that delivers two headline numbers instead of one:</p><ol><li><p>Direct effect (&#964;): what happens inside the region that actually adopts the policy.</p></li><li><p>Spillover effect (&#964;&#8347;): how much of that impact seeps into its neighbours.</p></li></ol><p>The authors show, with simulations covering &#8220;many treated counties&#8221; and the classic &#8220;one treated state&#8221; setup, that omitting &#964;&#8347; can induce noticeable bias and extra variance; in their county experiments the relative error reaches a few percentage points and grows with stronger spillovers. SpSyDiD, by contrast, keeps both direct and indirect estimates on target without giving up the nice robustness features of Synthetic DiD. A case study makes it concrete: Arizona&#8217;s 2007 employer-sanctions law cut the share of working-age non-citizen Hispanics in the state by &#8211;2.7 percentage points, while neighbouring Nevada, Colorado and New Mexico saw a +0.9 pp uptick, which is evidence that the law displaced people rather than making them disappear from the data.</p><p><em>What do the authors do?</em></p><p>Serenini and Masek take Synthetic DiD (the re-weighting framework that already softens the parallel-trends assumption) and graft onto it the core machinery of spatial econometrics. They let the treatment indicator bleed through a row-standardised spatial-weights matrix W, so the regression now contains two coefficients: &#964;, the change experienced by the directly treated region, and &#964;&#8347;, the change that reaches its neighbours. They preserve SyDiD&#8217;s unit weights (&#969;) and time weights (&#955;), which means the new estimator, Spatial SyDiD, can be implemented with the same weighted least-squares routine once those weights and W are in hand. The paper spells this out as a six-step recipe that needs nothing fancier than ordinary matrix algebra. &#8203;After building the estimator, they show why it matters. A short algebraic detour reveals that if a researcher ignores spillovers, the average treatment effect is biased by the factor (1 + &#961; W&#772;), where &#961; measures how strongly shocks diffuse and W&#772; is the average share of treated neighbours. In other words, the more porous the border, the further your na&#239;ve DiD slides from the truth. &#8203;The authors then stress-test their method in two simulation worlds. In a county-level setting with many treated units, Spatial SyDiD recovers both the Average Treatment on the Treated and the Average Indirect Treatment Effect with errors under three percent, while plain SyDiD (which has no spillover channel) can only speak to the direct effect and spatial DiD (which uses uniform weights) is less precise. Repeating the exercise at the state level with a single treated unit produces the same ranking: Spatial SyDiD is markedly closer to the data-generating truth and roughly halves the root-mean-squared error relative to its competitors. 
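</p><p>Mechanically, the estimator is not exotic. Here is a stripped-down sketch of the two-coefficient regression in base R (a ring-shaped neighbour matrix, uniform weights standing in for the SyDiD unit and time weights, and simulated data; the object names and numbers are mine, not the authors&#8217; code):</p><pre><code># Stripped-down sketch of the "direct + spillover" regression on simulated data
# (uniform weights stand in for the SyDiD unit/time weights; not the authors' code).
set.seed(42)
n_units &lt;- 20; n_periods &lt;- 10

# ring adjacency, then row-standardise to get the spatial-weights matrix W
A &lt;- matrix(0, n_units, n_units)
for (i in 1:n_units) {
  A[i, ifelse(i == 1, n_units, i - 1)] &lt;- 1
  A[i, ifelse(i == n_units, 1, i + 1)] &lt;- 1
}
W &lt;- A / rowSums(A)

# treatment: units 1-4 adopt the policy from period 6 onwards
panel &lt;- expand.grid(unit = 1:n_units, time = 1:n_periods)
panel$d &lt;- as.numeric(panel$unit %in% 1:4 &amp; panel$time &gt;= 6)

# spatial lag of treatment: each unit's share of treated neighbours, period by period
panel$wd &lt;- NA_real_
for (t in 1:n_periods) {
  idx &lt;- panel$time == t
  panel$wd[idx] &lt;- as.numeric(W %*% panel$d[idx])
}

# simulate an outcome with a direct effect (tau = 3) and a spillover (tau_s = 1)
panel$y &lt;- 0.5 * panel$unit + 0.2 * panel$time +
  3 * panel$d + 1 * panel$wd + rnorm(nrow(panel), sd = 0.3)

# two-way fixed effects with both treatment terms; the coefficients on d and wd
# recover the direct and indirect effects in this simple data-generating process
fit &lt;- lm(y ~ d + wd + factor(unit) + factor(time), data = panel)
coef(fit)[c("d", "wd")]</code></pre><p>The actual SpSyDiD estimator swaps the uniform weights for SyDiD&#8217;s &#969; and &#955; weights and uses the placebo-based inference discussed below, but the &#964;/&#964;&#8347; distinction enters through the same pair of regressors.</p><p>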
&#8203; To show the estimator at work in real data, they revisit Arizona&#8217;s 2007 Legal Arizona Workers Act. Using Current Population Survey microdata, Spatial SyDiD attributes a 2.7-percentage-point fall in working-age non-citizen Hispanics to the law inside Arizona, alongside a 0.9-point rise in neighbouring Nevada, Colorado and New Mexico&#8212;evidence that the legislation displaced people rather than erasing them from the labour force. &#8203; Finally, the paper adapts Synthetic DiD&#8217;s placebo-based inference to the spatial setting. By repeatedly re-assigning &#8220;treated&#8221; and &#8220;neighbour&#8221; labels among control units, the authors build a finite-sample variance for both &#964; and &#964;&#8347; without leaning on additional distributional assumptions, giving practitioners an off-the-shelf way to attach standard errors to each component of the effect. &#8203;</p><p><em>Why is this important?</em></p><p>Most policy studies assume each region is an island; when that fails, estimates can be way off. Spatial SyDiD gives researchers a plug-and-play fix: it keeps SyDiD&#8217;s relaxed parallel-trend strengths while explicitly measuring how much impact leaks across borders. That means: (i) cleaner causal claims when spillovers exist; (ii) a transparent split between &#8220;what happened here&#8221; and &#8220;what we pushed onto the neighbours,&#8221; which is exactly what policymakers worry about; and (iii) no need for exotic software, you just add a spatial weight matrix to workflows people already use. In short, it turns an unchecked threat to validity into something you can quantify, interpret, and report. &#8203;</p><p><em>Who should care?</em></p><ul><li><p>Applied micro-economists evaluating policies that plausibly shift jobs, prices, or people across county or state lines.</p></li><li><p>Urban and regional planners measuring knock-on effects of zoning changes, congestion charges, or transit investments.</p></li><li><p>Public-health and environmental scientists tracking how smoking bans, pollution controls, or epidemics spill into neighbouring areas.</p></li><li><p>Political scientists studying policy diffusion and electoral spillovers.</p></li><li><p>Impact-evaluation teams at governments and NGOs that rely on DiD/SCM workflows but face obvious &#8220;border leakage.&#8221;</p></li></ul><p><em>Do we have code?</em></p><p>Yes! We have a <a href="https://github.com/serenini/spatial_SDID">repo</a> on GitHub that you can clone to reproduce the findings of the paper (and maybe adapt the code to your own research). </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.diddigest.xyz/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">DiD Digest is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[CME + RDD + IV because DiD has been quiet ]]></title><description><![CDATA[Meanwhile, in the "rest" of causal inference...]]></description><link>https://www.diddigest.xyz/p/cme-rdd-iv-because-did-has-been-quiet</link><guid isPermaLink="false">https://www.diddigest.xyz/p/cme-rdd-iv-because-did-has-been-quiet</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Thu, 10 Apr 2025 13:05:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!g2jt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g2jt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g2jt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 424w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 848w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g2jt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg" width="500" height="625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:625,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74648,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://diddigest.substack.com/i/161003654?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g2jt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 424w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 848w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!g2jt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6bd4328-4160-4207-ba72-5468fbb60f30_500x625.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>                                        (Image by prof <a href="https://x.com/chelseaparlett/status/1458461737431146500">Chelsea Parlett-Pelleriti</a>)</p><p>Hi there!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.diddigest.xyz/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" 
data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DiD Digest! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I found 2 papers and 1 guide in the past two weeks that I thought some of you would appreciate. The IV one is related to Health Economics, the RDD is actually a job market paper (full of colorful plots, we love it), and the guide was actually born out of Political Economy (co-written by 2 PhD students!), so I guess there is something for everyone here :)</p><p>Here they are:</p><ol><li><p><a href="https://acrobat.adobe.com/id/urn:aaid:sc:EU:6ff45cfa-8e3e-484e-ad9c-5a002835f67c">IV in Randomized Trials</a>, by Joshua D. Angrist, Carol Gao, Peter Hull, and Robert W. Yeh</p></li><li><p><a href="https://arxiv.org/pdf/2504.03992">RDD with Distribution-Valued Outcomes</a>, by David Van Dijcke (<a href="https://x.com/packlesshepherd/status/1909793301760205288">thread</a>)</p></li><li><p><a href="https://arxiv.org/pdf/2504.01355">A Practical Guide to Estimating Conditional Marginal Effects: Modern Approaches</a>, by Jiehan Liu, Ziyi Liu, and Yiqing Xu (<a href="https://x.com/xuyiqing/status/1907664132175966605">thread</a>)</p></li></ol><p>We will go from shortest to longest today.</p><p></p><h3>IV in Randomized Trials</h3><p><em>(You do not need to be an econometrician to get what is going on here, though it helps to have one nearby - it can be on X/BSky).</em></p><h5>TL;DR: clinical trials are sometimes messy. Patients do not always do what they are assigned to do: some in the treatment group never take the treatment, while some in the control group sneak off and take it anyway. This messes with both intention-to-treat (ITT) and per-protocol analyses (the intended treatment is randomly assigned but treatment received is not): the first underestimates the true effect, and the second suffers from selection bias. IV methods offer a middle way, and this paper shows how to apply them in real-world trials. The authors explain IV theory using the ISCHEMIA trial as a motivating example and walk through how IV estimates recover local average treatment effects (LATE) for compliers (the participants whose behaviour was actually influenced by their random assignment). The paper argues that IV methods should be standard in pragmatic, strategy, and nudge trials where adherence is imperfect.</h5><p></p><p><em>What is this paper about?</em></p><p>This paper revisits a well-known challenge in clinical trials: what do we do when patients do not stick to their assigned treatments? Traditional approaches like intention-to-treat (ITT) analyses estimate the effect of being assigned to treatment, regardless of whether patients actually receive it. This preserves the benefits of randomization but often underestimates the treatment&#8217;s true effect when adherence is imperfect. On the other hand, per-protocol or as-treated analyses attempt to estimate the effect of receiving treatment, but at the cost of introducing selection bias, since treatment uptake is no longer randomized. 
To navigate this trade-off, the authors advocate for IVs as a middle-ground solution. They treat random assignment as an instrument for actual treatment receipt and use it to estimate the Local Average Treatment Effect (LATE, the causal effect of treatment for the subgroup of participants who comply with their assigned treatment). The paper explains the theory clearly, demonstrates its application using the ISCHEMIA trial data, and argues that IV should be a standard tool in modern clinical trials, especially in settings where nonadherence or crossover are common. <em>(Best of all? Health researchers get to use IV without having to write three paragraphs defending the exclusion restriction. Must be nice, I would not know).</em></p><p><em>What do the authors do?</em></p><p>They lay out a simple yet compelling case for using IVs to estimate causal effects in clinical trials with imperfect adherence. They begin by explaining the core intuition behind IV: when random assignment affects treatment uptake, it can be used to isolate the causal effect of treatment, specifically for the subset of participants who comply because of their assignment. They walk us through the basic mechanics: divide the ITT effect (difference in outcomes by assignment) by the compliance rate (difference in treatment uptake by assignment). The result is a per-protocol effect for compliers (those whose treatment behaviour was influenced by randomization). To make this concrete, they apply IV to the ISCHEMIA trial, a large cardiovascular study where about 20% of patients assigned to invasive treatment did not actually get it, and about 12% of patients in the conservative group did. Using IV, they show that the estimated treatment effect on health-related quality of life (SAQ score) is substantially larger than the ITT estimate (and clinically meaningful). They also provide adjusted estimates using regression-based IV methods, reinforcing the idea that this is more than a back-of-the-envelope trick. </p><p><em>Why is this important?</em></p><p>This is important because most clinical trials do not go exactly as planned and pretending otherwise does not help. Nonadherence, crossover, and treatment contamination are common, especially in pragmatic or strategy trials where patient choice and real-world logistics come into play. The standard ITT approach is clean but conservative, often underestimating the true effect of treatment. Per-protocol analyses try to adjust for this, but they sacrifice the core benefit of randomization &#8594; unbiased comparison. IVs offer a &#8220;principled&#8221; way out of this dilemma. By using assignment as an instrument for treatment receipt, researchers can recover valid causal estimates for the subset of participants whose treatment was influenced by randomization &#8594; the compliers.  And while economists have long treated IV as part of the standard causal toolkit, this paper brings that logic to a clinical audience, with an accessible explanation, a real trial application, and a strong case for making IV a routine part of analysis when nonadherence is more than a footnote.</p><p><em>Who should care?</em></p><p>This paper is for clinical researchers and trialists who know their sample size calculations by heart but start sweating when compliance drops below 80%. It is for anyone designing or analyzing trials where real-world behaviour gets in the way of ideal randomization (which, let us face it, is most of them). 
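</p><p>The arithmetic itself is refreshingly small. A minimal sketch with made-up numbers (this is <em>not</em> the ISCHEMIA data) shows the back-of-the-envelope Wald ratio and its 2SLS equivalent via <code>AER::ivreg</code>, which also gives you standard errors:</p><pre><code># Back-of-the-envelope IV in a trial with imperfect adherence
# (simulated, made-up numbers; not the ISCHEMIA data).
set.seed(7)
n &lt;- 4000
assigned &lt;- rbinom(n, 1, 0.5)                            # randomized assignment
# ~80% uptake if assigned to treatment, ~12% crossover if assigned to control
received &lt;- rbinom(n, 1, ifelse(assigned == 1, 0.80, 0.12))
# outcome improves by 5 points for those who actually receive treatment
outcome  &lt;- 50 + 5 * received + rnorm(n, sd = 10)

itt        &lt;- mean(outcome[assigned == 1]) - mean(outcome[assigned == 0])
compliance &lt;- mean(received[assigned == 1]) - mean(received[assigned == 0])
itt / compliance        # Wald estimator of the LATE for compliers, ~5

# same estimate via two-stage least squares, with standard errors
library(AER)            # install.packages("AER") if needed
summary(ivreg(outcome ~ received | assigned))</code></pre><p>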
More specifically, it is relevant to:</p><ul><li><p>Clinical trialists working on pragmatic, strategy, or nudge trials, where adherence is not enforced and treatment effects depend on behaviour.</p></li><li><p>Health economists and outcomes researchers who want causal estimates that are both internally valid and clinically meaningful.</p></li><li><p>Biostatisticians looking for alternatives to biased per-protocol analyses that do not involve violating the randomization.</p></li><li><p>Regulatory scientists and policy analysts trying to understand not just whether a treatment works, but how much it works for patients who actually receive it.</p></li></ul><p>If your trial has a CONSORT flowchart with more arrows than a subway map, IV might be exactly what you need.</p><p>In summary, if you are working with nonadherence, crossover, or anything resembling the real world, you do not need to choose between an ITT that is too soft and a per-protocol that is too biased. You can run an IV. You probably already know how. Now is the time to do it.</p><p></p><h3>RDD with Distribution-Valued Outcomes</h3><p>(<a href="https://www.davidvandijcke.com/">David</a> is on this year&#8217;s market! This is his JMP)</p><h5>TL;DR: what if your outcome is a whole distribution rather than a single number? This paper extends RDDs to handle distribution-valued outcomes (think income brackets, price spreads, or test score distributions) where you want to know how an intervention shifts the shape of the outcome, going beyond just the average. David introduces a new causal estimand (LAQTE), proposes two estimators (one local polynomial, one in Wasserstein space), and shows how to construct valid confidence bands for inference. The method is applied to U.S. gubernatorial elections, where Democratic wins reduce income inequality by compressing the top end of the distribution. The theory is sharp, the plots are gorgeous, and the contribution is genuinely useful if you care about distributional effects.</h5><p><em>What is this paper about?</em></p><p>This paper introduces a very cool (and very visual) extension of the classic RDD for situations where the outcome of interest is a distribution. Think test score distributions in schools, wage distributions within firms, or price distributions across products, cases where you care about the whole shape. Traditional RDDs are not usually equipped to handle these settings, because they assume scalar outcomes and ignore the structure (and randomness) inside each unit (there is a two-level randomness: one from how units are assigned around the cutoff, and one from sampling within each unit&#8217;s distribution). This paper provides the tools to handle both. David proposes a new framework, which he calls R3D (Regression Discontinuity with Distribution-Valued Outcomes), that treats these within-unit distributions as functional data. He defines a new causal estimand, the Local Average Quantile Treatment Effect (LAQTE), which captures how the average quantile function shifts around the cutoff. The goal is to estimate how an intervention affects the shape of the distribution, beyond just a point on it. Along the way, he introduces two estimators (one based on local polynomial regression, the other on local Fr&#233;chet regression in Wasserstein space &#8594; Maths people will have a ball with this one), proves asymptotic normality, derives uniform confidence bands, and shows us why standard quantile RDD estimators fail in these cases. 
The paper closes with an application to gubernatorial elections and income inequality in the U.S., showing that Democratic wins reduce top-end incomes (a classic equality&#8211;efficiency tradeoff). </p><p><em>What does the author do?</em></p><p>David develops a new framework called R3D , designed for settings where each unit (e.g., a state, school, hospital) is associated with a full empirical distribution. In these cases, traditional RDDs fall short, they cannot account for the internal variation or the structure of the distribution itself. He then defines a new causal estimand (LAQTE), which captures how the average quantile function changes at the cutoff; then he shows that standard quantile RDD estimators (e.g., estimating the 10th, 50th, 90th percentiles one by one) do not work in this setting, because they fail to aggregate information across units with different internal structures. The author proposes two estimators: a na&#239;ve estimator that estimates quantiles across units and applies standard local polynomial RDD logic, and a Fr&#233;chet/Wasserstein-based estimator, which treats the quantile function as a curve and applies local <a href="https://arxiv.org/pdf/1608.03012">Fr&#233;chet regression</a> in <a href="https://library.oapen.org/bitstream/id/b27ca94b-41a7-486c-863f-8de6b3a8f914/2020_Book_AnInvitationToStatisticsInWass.pdf">Wasserstein space</a> (yes, the space of probability distributions). It sounds fancy (maybe because Ren&#233; Maurice Fr&#233;chet was French?), but it works beautifully and gives valid inference. He also derives uniform confidence bands for the full quantile function, letting us ask questions like: where in the distribution does the treatment hit? Finally, he applies the method to a gubernatorial RDD in the U.S., showing that when Democrats win, the top end of the income distribution shrinks, even though the average income stays flat &#8594; evidence of distributional shifts without mean effects: exactly what this method is built for.</p><p><em>Why does this matter?</em></p><p>Not all treatment effects show up in the mean. In many real-world settings (from education to health to inequality research) the most meaningful changes occur in the spread, shape, or tails of a distribution. An intervention might compress inequality without raising average income, lift the bottom of the score distribution without changing the top, or reduce risk exposure at the extreme end of a health outcome. Traditional RDDs miss these effects entirely. What David shows is that we do not need to flatten rich outcome structures into single numbers. His proposed method lets us treat distributions as what they are: distributions, with estimators that are both theoretically grounded and visually intuitive. It gives us a way to ask: where is the effect happening? and How does the shape change across the cutoff? This is especially valuable for fields that care about inequality, heterogeneity, or tail risk (not just the average). 
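</p><p>To make the &#8220;distribution-valued outcome&#8221; idea concrete, here is a bare-bones sketch of the na&#239;ve route on simulated data (base R only, uniform kernel, fixed bandwidth; this is <em>not</em> the <code>r3d</code> package or David&#8217;s estimator): build each unit&#8217;s quantile function on a grid, then run a separate local linear RDD at every quantile level and see where the jump shows up:</p><pre><code># Sketch of the "naive" route for distribution-valued outcomes (not the r3d package):
# per-unit quantile functions on a grid, then a local linear RDD at each quantile level.
set.seed(123)
n_units &lt;- 300
running &lt;- runif(n_units, -1, 1)            # running variable, cutoff at 0
treated &lt;- as.numeric(running &gt;= 0)

# each unit contributes a within-unit sample; treatment compresses the spread
unit_samples &lt;- lapply(1:n_units, function(i) {
  spread &lt;- ifelse(treated[i] == 1, 0.7, 1.0)
  rnorm(200, mean = 2 + running[i], sd = spread)
})

# step 1: unit-level quantile functions on a common grid
tau_grid &lt;- seq(0.05, 0.95, by = 0.05)
Q &lt;- t(sapply(unit_samples, quantile, probs = tau_grid))   # units x quantile levels

# step 2: local linear RDD at each quantile level (uniform kernel, bandwidth 0.3)
h &lt;- 0.3
in_bw &lt;- abs(running) &lt;= h
effect &lt;- sapply(seq_along(tau_grid), function(k) {
  fit &lt;- lm(Q[in_bw, k] ~ treated[in_bw] * running[in_bw])
  coef(fit)[2]                              # coefficient on treated = jump at the cutoff
})

# where in the distribution does the effect hit?
round(cbind(tau = tau_grid, effect = effect), 2)</code></pre><p>In this simulated example the jump is positive at low quantiles and negative at high ones even though the mean barely moves at the cutoff, which is the kind of pattern the paper&#8217;s gubernatorial application picks up at the top of the income distribution.</p>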
<p>And the fact that the method &#8220;nests&#8221; easily within familiar RDD logic makes it much more usable than it first sounds.</p><p><em>Who should care?</em></p><p>Basically anyone working with outcomes that are more than just averages, but more specifically:</p><ul><li><p>Education researchers analyzing test score distributions across schools or students (that would be me and a lot of you).</p></li><li><p>Health economists looking at risk distributions, not just mean outcomes.</p></li><li><p>Public finance folks studying income inequality or top-end wealth effects.</p></li><li><p>Applied microeconomists using RDDs who are somewhat frustrated by their &#8220;scalar-only&#8221; worldview.</p></li></ul><p><em>Do we have code?</em></p><p>We have a package! I do not know how he did this without RAs because it seems like a lot of work, so kudos to him! <em>(By the way, do you have a moment to talk about our Lord and Saviour of <a href="https://r-pkgs.org/">R packages</a>, <a href="https://hadley.nz/">Hadley Wickham</a>?)</em></p><p>The R package <a href="https://www.davidvandijcke.com/R3D/"><code>r3d</code></a> implements both the na&#239;ve and Fr&#233;chet/Wasserstein estimators, along with tools for plotting and inference. With it you can estimate LAQTEs using either local polynomial or Fr&#233;chet regression, generate quantile function plots that show where the treatment effect lies across the distribution, and construct uniform confidence bands for inference.</p><p>In sum, this paper is a reminder that not all outcomes fit in a spreadsheet cell. When your data come as distributions (not just point estimates) you need tools that treat them that way. David&#8217;s proposed R3D framework gives us exactly that: a method that respects the structure of complex outcomes, builds on familiar RDD logic, and opens the door to asking richer questions about where effects happen, not just if they do. If you are working with quantiles, histograms, or any outcome with internal shape, you should check this paper out. And it comes with code, plots, and clean theory. <em>What is there not to like?</em></p><p></p><h3>A Practical Guide to Estimating Conditional Marginal Effects: Modern Approaches</h3><p><em>(I particularly enjoyed reading this one. It is a &#8220;self-explanatory&#8221; guide and the authors do their best in elucidating concepts step by step, but if you are not familiar with stats notation I suggest you try to grasp the idea of what they are trying to do first. Familiarity with R is essential, but the code chunks are very easy to follow. I also learned that &#8220;kernel&#8221; means &#8220;core&#8221;, &#8220;center&#8221;, &#8220;basis&#8221;. The more you know!)</em></p><h5><strong>TL;DR:</strong> This guide introduces robust methods for estimating how treatment effects vary with a moderating variable (&#8220;conditional marginal effects,&#8221; or CME). Traditional approaches like linear interaction models often rely on unrealistic assumptions, suffer from lack of overlap, and impose overly rigid functional forms. The authors then present a progression of methods, from parametric to more flexible ML approaches, that address these limitations while preserving statistical validity.
They clearly define the CME estimand, enhance semi-parametric kernel estimators (a method that uses smooth curves to capture patterns without assuming a strict relationship between variables), and introduce modern, robust techniques like <strong>AIPW-LASSO</strong> (which combines weighting and model selection for added robustness) and <strong>Double Machine Learning</strong> (which uses ML to estimate treatment and outcome models separately, then combines them to reduce bias). The paper also offers practical recommendations based on simulations and real-world examples, and all methods are implemented in the <code>interflex</code> R package.</h5><p><em>(Before we move forward, I just wanted to say that there is one formal definition of CME in the causal inference literature - as defined in this guide - but the term "marginal effect" or "conditional marginal effect" is sometimes used differently in statistical modeling contexts, especially in GLMs or GLMMs. So the label might be the same, but the meaning depends on the framework. Here, CME is defined as:</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#952;(x)=E[Y_i(1)&#8722;Y_i(0)&#8739;X_i=x]&quot;,&quot;id&quot;:&quot;CYCCMIQCIY&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>for binary treatment, or</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#952;(x)=\\mathbb{E}\\left[\\frac{\\partial Y_i(d)}{\\partial d} \\middle| X_i = x\\right]&quot;,&quot;id&quot;:&quot;CTTBKLPZQP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><em>for continuous treatment. It answers "what is the average treatment effect at a specific value of the moderator X?" which is a causal estimand grounded in the potential outcomes framework.)</em></p><p></p><p><em>What is this paper about?</em></p><p>In this paper the authors tackle a fundamental and increasingly relevant challenge in applied research: how to estimate treatment effects that vary across subgroups or contexts, formally known as CME. In many fields, from political science to economics and public policy, we want to understand not just whether a policy or intervention works, but for whom and under what conditions it works best. Unfortunately, the tools available to answer these questions are often limited, outdated, or misused. </p><p>Nowadays most applied researchers turn to linear interaction models (which assume that the effect of a treatment changes in a linear way with a moderating variable &#8594; this variable tells us &#8220;for whom&#8221; or &#8220;under what conditions&#8221; the treatment has a stronger or weaker effect, it is one that we condition on to examine how treatment effects vary). These models are nice because they are simple to estimate and interpret. But in practice, they come with serious limitations: they often rely on strong, mostly unrealistic assumptions about the shape of relationships, while also being highly sensitive to model misspecification (e.g., omitted interactions and nonlinearity) &#8594; leading to misleading conclusions. They also sometimes fail to assess whether there is enough overlap in the data (i.e., whether treated and control groups are comparable across the range of the moderator &#8594; lack of common support), and they typically lack a clear connection to the causal estimands we care about. Who has not struggled to answer seminar questions about nonlinearity and high-dimensional covariates? 
Just add a polynomial to your already kitchen-sink regression and it should work (not).</p><p>These limitations are especially problematic in observational studies, where researchers do not have the &#8220;luxury&#8221; of randomized assignment and must rely on modeling assumptions to recover causal effects.</p><p>To address these challenges, the authors build on prior work by Hainmueller, Mummolo, and Xu (the author of this one himself) <a href="https://www.cambridge.org/core/services/aop-cambridge-core/content/view/D8CAACB473F9B1EE256F43B38E458706/S1047198718000463a.pdf/how_much_should_we_trust_estimates_from_multiplicative_interaction_models_simple_tools_to_improve_empirical_practice.pdf">(2019)</a>, who introduced semiparametric kernel estimators (SKE from here onwards &#8594; they are statistical methods that relax functional form restrictions by using local weighted averaging to estimate relationships between variables, they are a middle ground between fully parametric models - with strict assumptions - and nonparametric approaches - with no functional form assumptions) to allow for more &#8220;flexible&#8221; relationships between treatment (D), moderators (X), and outcomes (Y). But even those methods left key gaps in terms of clarity, robustness, and ease of use. In this guide they clearly define the CME estimand, aligning what researchers want to estimate with how they estimate it.</p><p><em>What do the authors do?</em></p><p>The authors develop a comprehensive and unified framework for estimating CMEs, that is, how the effect of a treatment varies with a moderating variable X, holding other covariates constant. A key contribution of the paper is to clarify what researchers are trying to estimate when they talk about treatment effect heterogeneity. As I wrote in the introduction, the authors define the CME as a special case of the conditional average treatment effect (CATE) for both binary and continuous treatments. For both cases they rely on the assumptions of Unconfoundedness and Strict Overlap (given Random Sampling and SUTVA assumptions).</p><p>Next they propose a progression of estimation strategies, from simpler ones to more flexible ones, aka a tiered approach, which allows us to choose an estimator that matches our data structure, research question, and level of complexity.</p><p>In Chapter 2 they go through classical approaches to estimating and visualizing the CME (with examples from applied papers), while &#8220;emphasizing the importance of clearly defining the estimand and explicitly stating identifying and modeling assumptions&#8221;. They address the issue of multiple comparisons by introducing uniform CIs constructed via bootstrapping. They propose diagnostic tools to detect lack of common support and model misspecification and discuss strategies to minimize them. Finally, they introduce an SKE to relax the functional form assumptions (I love to say &#8220;functional form&#8221;) of linear interaction models. They also summarize the challenges and how to solve them.</p><p>In Chapter 3 the authors aim to solve Chapter 2 limitations (the fact that SKEs still rely solely on outcome modeling and may struggle with nonlinearities or complex interactions in covariates) by introducing Augmented Inverse Propensity Weighting (AIPW) and its extensions (especially the AIPW-LASSO approach) to improve the robustness, efficiency, and flexibility of CME estimation.
AIPW, introduced by <a href="https://www.jstor.org/stable/pdf/2290910">Robins, Rotnitzky, and Zhao (1994)</a>, improves on IPW by combining the outcome model and the treatment assignment model, making it doubly robust, that is, it yields consistent estimates as long as either model is correctly specified. (We have been through this before in the previous post, so you know how valuable this is, especially in observational settings where modeling assumptions are hard to verify.) When both models are correctly specified, AIPW achieves the lowest asymptotic variance within the class of IPW estimators. However, as <a href="https://academic.oup.com/aje/article-pdf/188/1/250/27238730/kwy201.pdf">Li, Thomas, and Li (2019)</a> show, in finite samples, its performance depends heavily on the degree of overlap. If overlap is poor and propensity scores are extreme, AIPW may actually underperform compared to simpler outcome-based estimators. The authors then introduce the idea of &#8220;signals&#8221; - transformed versions of the data that isolate the part relevant for estimating the CME. They propose a three-step estimation strategy based on these signals that is intuitive and straightforward to implement. Then they close the chapter with Section 3.4, which focuses on inference (a step that is often overlooked when we get excited about flexible estimators). Because the AIPW approach involves multi-step estimation (outcome modeling, propensity scores, signal construction, and smoothing), getting valid confidence intervals is non-trivial. The authors propose bootstrap-based procedures to construct both pointwise and uniform CIs for the CME. We know that uniform intervals are especially helpful when visualizing treatment effect heterogeneity across the range of the moderator, since they adjust for multiple comparisons and avoid misleading &#8220;significance at a glance&#8221; conclusions. This emphasis on valid inference ensures that the flexibility gained by AIPW-LASSO or kernel smoothing does not come at the cost of statistical rigor, something applied researchers will appreciate when facing skeptical referees (or seminar rooms).</p><p>In Chapter 4 (my favourite one!) they extend the AIPW framework into a more flexible and scalable setting using Double Machine Learning (DML). The key idea is to use machine learning algorithms (like random forests, neural nets, or gradient boosting) to estimate nuisance functions (the outcome and treatment models), while applying orthogonalization techniques to isolate the CME and preserve valid inference, which then allows us to model complex, high-dimensional data without overfitting or violating identification assumptions. The authors explain the partialling-out procedure, discuss cross-fitting to avoid overfitting, and go into more detail on how DML can be implemented using off-the-shelf ML tools. They also show how to plug DML into the same projection-and-smoothing framework introduced earlier, making it compatible with their general approach to CME estimation.</p><p>In Chapter 5 we get to work! After introducing the &#8220;full menu&#8221; of estimation strategies (linear models, kernel estimators, AIPW-LASSO, and DML) the authors turn to the obvious question: which method should I use? That is exactly what Chapter 5 addresses through a series of Monte Carlo simulation studies designed to compare the methods under different conditions.
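</p><p>Before the horse race, it helps to see how small the Chapter 3 &#8220;signal, then smooth&#8221; idea really is. The sketch below is a bare-bones version for a binary treatment (simulated data, parametric nuisance models, no cross-fitting, no LASSO, and certainly <em>not</em> the <code>interflex</code> implementation): build AIPW pseudo-outcomes from fitted outcome and propensity models, then smooth them over the moderator to trace out &#952;(x):</p><pre><code># Bare-bones "signal, then smooth" sketch for a binary treatment (simulated data;
# parametric nuisance models and no cross-fitting -- not the interflex implementation).
set.seed(2024)
n &lt;- 2000
x &lt;- runif(n, -2, 2)                       # moderator
z &lt;- rnorm(n)                              # another covariate
d &lt;- rbinom(n, 1, plogis(0.5 * x - 0.3 * z))
theta_true &lt;- 1 + x^2 / 2                  # true CME as a function of the moderator
y &lt;- 2 * z + theta_true * d + rnorm(n)

# nuisance models: outcome regressions by treatment arm plus a propensity model
dat &lt;- data.frame(y, d, x, z)
m1 &lt;- lm(y ~ x + z, data = dat[dat$d == 1, ])
m0 &lt;- lm(y ~ x + z, data = dat[dat$d == 0, ])
ps &lt;- glm(d ~ x + z, family = binomial, data = dat)

mu1  &lt;- predict(m1, newdata = dat)
mu0  &lt;- predict(m0, newdata = dat)
ehat &lt;- fitted(ps)

# AIPW pseudo-outcome ("signal"): its conditional mean given x is the CME
psi &lt;- mu1 - mu0 + d * (y - mu1) / ehat - (1 - d) * (y - mu0) / (1 - ehat)

# smooth the signal over the moderator to get theta-hat(x), then compare to the truth
cme_fit &lt;- loess(psi ~ x)
grid &lt;- seq(-2, 2, length.out = 50)
plot(grid, predict(cme_fit, newdata = data.frame(x = grid)), type = "l",
     xlab = "moderator x", ylab = "estimated CME")
lines(grid, 1 + grid^2 / 2, lty = 2)       # true theta(x), dashed</code></pre><p>The double robustness shows up here too: the outcome model for the treated arm is deliberately misspecified (linear in x while the truth is quadratic), but because the propensity model is right, the smoothed signal still tracks the dashed curve.</p><p>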
The authors simulate data from three different Data Generating Processes (DGPs) that vary in complexity: from linear to nonlinear covariate effects and finally to a high-dimensional, highly nonlinear scenario. Each simulation evaluates how well the different estimators recover the true CME, and how their performance changes with sample size, model complexity, and hyperparameter tuning. The first DGP is fairly simple: the CME is nonlinear, but covariates enter the outcome model linearly. The goal here is to assess how well each estimator captures a quadratic treatment effect when most of the rest of the model is well-behaved. They find that Kernel and AIPW-LASSO both perform well when the sample size is moderate (n &#8805; 1,000), that DML methods (e.g., DML-neural networks, DML-histogram gradient boosting ) need larger samples (n &#8805; 5,000) to reach similar accuracy, linear models are clearly misspecified here and exhibit high bias, and in terms of confidence intervals, kernel and AIPW-LASSO produce narrower bands with better coverage at smaller sample sizes compared to DML. In the second DGP, things get more complex. Now, other covariates (not just the moderator) enter the outcome model nonlinearly, through functions like exp(2Z + 2). This setup challenges estimators that assume or rely on linearity in covariates. They find that Kernel estimators break down in this setting, because they assume linearity in non-moderator covariates; AIPW-LASSO and DML methods handle this complexity much better, flexibly approximating nonlinearities through basis expansions (AIPW) or ML models (DML); even at n = 1,000, AIPW-LASSO produces stable and accurate estimates; DML improves substantially with n &#8805; 3,000, but struggles in small samples due to the variance introduced by flexible learning. In the final study they explore a more realistic and challenging DGP with four additional covariates, nonlinear interactions, and a highly nonlinear propensity score model. It also compares the effect of using default vs. tuned hyperparameters in DML models (neural nets, random forests, histogram gradient boosting). They find that tuning matters, especially for neural networks (NN), where cross-validation significantly improves CME estimates; histogram Gradient Boosting (HGB) performs well at larger sample sizes, but tuning does not help much beyond defaults; random forest (RF) does not improve much with tuning and generally underperforms compared to NN and HGB; and that computational cost increases substantially with tuning, raising the classic trade-off: better fit vs. longer runtime. 
The final take-aways are:</p><ol><li><p>When the CME is nonlinear but covariates are linear, kernel and AIPW-LASSO work well even in small to moderate samples.</p></li><li><p>When covariates are also nonlinear, kernel breaks down, and AIPW-LASSO or DML are needed.</p></li><li><p>AIPW-LASSO is the best all-rounder (yey!), especially when sample sizes are limited and you want flexibility without the computational burden of full DML.</p></li><li><p>DML is powerful, but only if you have lots of data (n &#8805; 5,000, usually not the case for most observational studies, but macro people will be happy) and can afford to tune models carefully.</p></li><li><p>Tuning DML learners (especially neural nets) improves accuracy, but not always dramatically, and it takes time.</p></li></ol><p><em>Why is this guide important?</em></p><p>Understanding how treatment effects vary across individuals or contexts is at the ~kernel~ of applied social science research. Whether you are studying the impact of a job training program, a public health intervention, or a policy reform, knowing for whom and under what conditions the intervention works is often more important than knowing whether it works on average.</p><p>This guide is a goldmine! Their proposed framework gives us the tools to avoid common analytical pitfalls, such as extrapolating treatment effects to regions without common support or imposing rigid linear assumptions when the true relationship may be more complex. By presenting a progression from linear models to semiparametric and fully machine-learning-based approaches, their guide gives us a toolbox that can be tailored to our data's structure and complexity.  It is not just a case of estimating effects, it is about trusting them (by properly constructing valid pointwise and uniform confidence intervals, handling multiple comparisons, and diagnosing violations of overlap or model misspecification). The authors show how recent advances in DML can be applied in causal settings without sacrificing interpretability or statistical rigor, something that most economists are worried about when they hear &#8220;machine learning&#8221;. This helps make cutting-edge methods usable by applied researchers, not just methodologists.</p><p><em>Who should care?</em></p><p>Pretty much everybody (even the macroeconomists)! </p><ul><li><p>Applied researchers studying heterogeneous treatment effects across social science fields like political science, economics, sociology, education, and public policy.</p></li><li><p>Methodologists working at the intersection of causal inference and ML, especially the ones interested in effect heterogeneity and flexible estimation strategies.</p></li><li><p>Policymakers and practitioners who need to understand whether, how, and for whom an intervention works, beyond just headline average effects (they might not need to know the details tho).</p></li><li><p>Data scientists working with complex or observational data who want to go beyond basic treatment-control comparisons and explore nuanced causal relationships.</p></li></ul><p><em>Do they have code? </em></p><p>Yes! 
They even have a package (we love packages): the <code>interflex</code><a href="https://yiqingxu.org/packages/interflex/RGuide.html"> for R</a>.</p><p>With <code>interflex</code>, you can:</p><ul><li><p>Implement various estimators (linear, kernel, AIPW-LASSO, DML) with straightforward function calls.</p></li><li><p>Generate diagnostic plots to check overlap, functional form assumptions, and model misspecification.</p></li><li><p>Visualize CMEs with appropriate CIs.</p></li><li><p>Conduct sensitivity analyses to test the robustness of findings.</p></li><li><p>And even export plots and summaries for inclusion in papers or presentations.</p></li></ul><p>In sum, I had a great time going through this guide. If you have ever sat through a seminar, got grilled on nonlinearities, or wondered whether your interaction term was doing anything meaningful, this one is for you. Whether you are team kernel, curious about LASSO, or finally ready to get into DML, this guide gives you a roadmap grounded in strong theory and practical recommendations. And best of all, it comes with code. So yes, CMEs may sound intimidating at first, but by the end of this guide, they are just a way of answering the question we care about most: who benefits, and how?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.diddigest.xyz/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading DiD Digest! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A few papers on RDD and IV]]></title><description><![CDATA[This is not an April Fools' post :)]]></description><link>https://www.diddigest.xyz/p/a-few-papers-on-rdd-and-iv</link><guid isPermaLink="false">https://www.diddigest.xyz/p/a-few-papers-on-rdd-and-iv</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Tue, 01 Apr 2025 11:48:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VwAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309aa974-cde6-42c4-9063-36f2b72045a4_1580x1180.bin" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VwAg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309aa974-cde6-42c4-9063-36f2b72045a4_1580x1180.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VwAg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F309aa974-cde6-42c4-9063-36f2b72045a4_1580x1180.bin 424w, 
<p>(EDIT: Professor Richard Hahn was kind enough to let me know about his work on RDD, which I included in this post now :) apologies for any mistakes - I had to do it on the iPad)</p><p>Hi there!</p><p>This post is not related to the DiD literature, but I saw these new developments in Regression Discontinuity Design and Instrumental Variables and thought I would let you know.</p><p>Here&#8217;s the list; then I will go through them quickly, focusing on how and when you can benefit from each:</p><ol><li><p><a href="https://www.nber.org/system/files/working_papers/w33594/w33594.pdf">Optimal Formula Instruments</a>, by Kirill Borusyak and Peter Hull</p></li><li><p><a href="http://www.christophrothe.net/papers/donut_rdd_aug2023.pdf">Donut RDDs</a>, by Claudia Noack and Christoph Rothe</p></li><li><p><a href="https://arxiv.org/pdf/2107.07942">Flexible Covariate Adjustments in RDDs</a>, by Claudia Noack, Tomasz Olma and Christoph Rothe</p></li><li><p><a href="https://arxiv.org/pdf/2503.04904">A Partial Linear Estimator for Small Study RDDs</a>, by Daryl Swartzentruber and Eloise Kaizar</p></li><li><p><a href="https://arxiv.org/pdf/2503.13696">Treatment Effect Heterogeneity in RDDs</a>, by Sebastian Calonico, Matias D. Cattaneo, Max H. Farrell, Filippo Palomba and Rocio Titiunik</p></li><li><p><a href="https://arxiv.org/pdf/2503.00326">Learning Conditional Average Treatment Effects in Regression Discontinuity Designs using Bayesian Additive Regression Trees</a>, by Rafael Alcantara, P. Richard Hahn, Carlos Carvalho, and Hedibert Lopes</p></li></ol><h3>Optimal Formula Instruments</h3><h5>TL;DR: this paper presents a method to build better IVs for treatments based on complex formulas (like benefit eligibility). It makes smarter use of individual characteristics, corrects for endogeneity, and leads to much more precise estimates - especially useful in regional/policy studies. </h5><p>In this paper, profs Kirill and Peter propose a new method to construct more powerful IVs for treatments defined by complex formulas (such as eligibility for public programs, which has to account for income, family status, and what state the person lives in). 
It builds on and generalizes the widely used simulated instrument (&#8220;let&#8217;s imagine what would happen to a <em>typical person</em> if this policy were different in their state&#8221;, which is somewhat weak because it doesn&#8217;t take into account how individuals are affected differently by such policy) approach (e.g., <a href="https://www.nber.org/system/files/working_papers/w5052/w5052.pdf">Currie &amp; Gruber, 1996</a>). They introduce what they call "optimal formula instruments" that adjust for heterogeneous shock exposure while ensuring instrument validity through a technique called <em>recentering</em>. These instruments are constructed using observed data rather than relying solely on economic theory or exogenous instruments. It is a &#8220;smarter&#8221; IV because instead of just simulating how a typical person is affected by policy changes, they:</p><ul><li><p>Predict how each individual is affected by the policy, using their actual characteristics.</p></li><li><p>Then they adjust (or &#8220;recenter&#8221;) this prediction so it still qualifies as a valid instrument - i.e., it&#8217;s not related to unobserved factors that could mess up the estimate.</p></li></ul><p>Their proposed algorithm that approximates the optimal IV follows a few steps. They mention: &#8220;We then propose an algorithm to approximate optimal IVs in practice, focusing on the first two steps: obtaining the best treatment predictor and recentering it. While implementing both steps nonparametrically may be feasible in some settings, in general they represent a high-dimensional problem that may be impractical or infeasible - especially in non-iid data. Instead, we propose using knowledge the researcher has on the treatment formula as well as the &#8220;design&#8221; (i.e., data-generating process) of the exogenous shocks. First, the researcher predicts the treatment from the shocks and other observables which enter the treatment formula, setting any unobserved or endogenous components of the formula to a base value (such as zero). When there are no unobserved or endogenous components, this prediction is the treatment itself. Second, the researcher recenters this prediction by drawing counterfactual sets of exogenous shocks, following <a href="https://onlinelibrary.wiley.com/doi/pdf/10.3982/ECTA19367">Borusyak and Hull (2023)</a>. Optionally residualizing the recentered prediction on covariates yields an approximation to the optimal instrument, up to the heteroskedasticity adjustment that is not popular in practice.&#8221;</p><p>In summary, these are the steps involved:</p><ol><li><p>Formulating a best guess of treatment from shocks and observables.</p></li><li><p>A recentering step to ensure exogeneity.</p></li><li><p>Optional residualization and weighting for efficiency.</p></li></ol><p>In their empirical application, they use this new IV to study how expanding Medicaid eligibility in 2014 (under the ACA) affected private insurance coverage. Their method  gives more precise estimates (smaller standard errors) and also reveals that most of the crowd-out effect came from people switching away from direct-purchase insurance (e.g., ACA marketplaces) rather than from employer-sponsored plans - which has important implications for how we think about labour market effects.</p><p>This approach doesn&#8217;t require any fancy ML or black-box models - just a smart use of the information you already have in your dataset and knowledge of how the treatment formula works. 
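</p><p>For intuition, here is a rough base-R sketch of the predict-and-recenter logic. Everything in it (the toy eligibility formula, the permutation draws, the variable names) is illustrative, not the authors&#8217; code:</p><pre><code>set.seed(1)
n  &lt;- 2000
df &lt;- data.frame(
  state   = sample(1:20, n, replace = TRUE),
  income  = rlnorm(n, meanlog = 10, sdlog = 0.5),
  famsize = sample(1:5, n, replace = TRUE)
)
thresholds &lt;- runif(20, 20000, 60000)   # observed state-level policy generosity (the "shocks")

# Step 1: predict treatment from the known formula and observed characteristics
eligible &lt;- function(income, famsize, cutoffs, state) {
  as.numeric(income &lt; cutoffs[state] * (1 + 0.1 * famsize))   # toy eligibility rule
}
pred &lt;- eligible(df$income, df$famsize, thresholds, df$state)

# Step 2: recenter by averaging the prediction over counterfactual draws of the shocks
# (here, permutations of the thresholds across states stand in for the true assignment process)
draws &lt;- replicate(500, eligible(df$income, df$famsize, sample(thresholds), df$state))
z_recentered &lt;- pred - rowMeans(draws)

# Step 3: use z_recentered as the instrument for the actual treatment in 2SLS,
# e.g. AER::ivreg(y ~ d | z_recentered, data = df), optionally residualizing on covariates.
</code></pre><p>The efficiency gain comes from Step 1 using each person&#8217;s own characteristics rather than a &#8220;typical person&#8221;, while Step 2 is what keeps the instrument exogenous.</p><p>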
It&#8217;s especially helpful when treatments depend on multiple factors (e.g., policy formulas) and when the data aren&#8217;t independent and identically distributed (non-iid) - as is common in regional or policy data.</p><p><em>What they do and how it&#8217;s different:</em></p><p>Traditional simulated IVs are like saying, &#8220;On average, this policy is more generous in State A than in State B, so let&#8217;s use that average difference to identify effects.&#8221; But Optimal Formula Instruments say: &#8220;Let&#8217;s be more precise and ask how this exact person, with their specific income and family structure, would be affected if they lived in a different state with different policies.&#8221;</p><p>This shift matters because many treatments in economics aren&#8217;t one-size-fits-all: the same policy can affect people in vastly different ways depending on their characteristics. So building a more personalized instrument (and then recentring it to correct for bias) makes use of this extra information while still satisfying the IV assumptions.</p><p><em>Why is this important?</em></p><ul><li><p>Instrument strength is a major limitation in many IV applications. This approach leverages additional variation (heterogeneous exposure) without compromising identification.</p></li><li><p>It provides a formal justification and practical path to improve on standard shift-share or simulated instruments used widely in applied microeconomics.</p></li><li><p>The method remains valid and powerful even when data are non-iid or high-dimensional, expanding its applicability.</p></li></ul><p><em>Who should care?</em></p><ul><li><p>Applied microeconomists using IVs in settings with complex treatment definitions (e.g., eligibility, benefits formulas).</p></li><li><p>Researchers using simulated instruments, shift-share instruments, or working with non-iid data (e.g., regional economics, policy evaluation).</p></li><li><p>Methodologists interested in optimal IV construction or semi-parametric efficiency.</p></li></ul><p>Do they have the code/package? I haven&#8217;t seen it. While the authors outline a clear algorithm and provide simulation results in the paper, there is no official GitHub repository currently available for implementing the "optimal formula instruments" method. However, Kirill has shared code for related shift-share IV work on GitHub (<a href="https://github.com/borusyak">https://github.com/borusyak</a>), which may be helpful. Researchers can follow the detailed steps in the paper to build their own implementation.</p><p><em>How can we implement it?</em></p><p>In practice, this is how it goes, in three main steps:</p><ol><li><p>Construct the predicted treatment<br>Use the known formula (e.g., Medicaid eligibility rules) to calculate what the treatment <em>would be</em> under different policy shocks for each individual, using their actual characteristics.</p></li><li><p>Recenter the treatment prediction<br>Take the average prediction across many counterfactual versions of the policy shocks (e.g., using permutations or random draws), and subtract this average from the original prediction. 
This step ensures exogeneity.</p></li><li><p>Use the recentered prediction as your IV<br>You can then plug this into your IV regression (optionally adjusting for additional controls or heteroskedasticity for more efficiency).</p></li></ol><p>In sum, this paper offers a practical and theoretically grounded upgrade to traditional simulated IVs, allowing researchers to harness more variation while preserving identification. Their method is especially relevant for settings where treatment depends on complex formulas and where exposure to shocks varies across individuals. Even without an off-the-shelf package, the step-by-step algorithm makes implementation accessible to applied researchers with a bit of coding. I&#8217;d keep an eye out for replication code from the authors, and in the meantime, this framework is definitely something to consider incorporating into your IV toolkit.</p><p></p><h3>Donut Regression Discontinuity Designs</h3><h5>TL;DR: this paper provides a theoretical foundation for donut RD designs, a common robustness check in regression discontinuity (RD) studies where researchers exclude observations close to the cutoff. The authors show that while this approach can guard against manipulation concerns, it comes with significant costs in terms of bias and variance. They provide new tools to evaluate and compare donut and conventional RD estimates rigorously.</h5><p><em>What is the paper about?</em></p><p>In RD designs, we estimate treatment effects by comparing observations just above and below a threshold (e.g., birth weight of 1500g for extra medical care). But what if there&#8217;s manipulation or measurement error exactly at the cutoff? A common fix is to run a "donut RD", where we remove a small window of observations near the cutoff.</p><p>In this paper, the authors ask: what does dropping those observations really do to our estimates and inference? 
And: can we still trust what we find?</p><p><em>What do they do?</em></p><p>The authors do three big things:</p><ol><li><p>Theoretically analyze the costs of donut RD</p><ul><li><p>Removing observations near the cutoff increases bias and variance (sometimes by a lot).</p></li><li><p>For example, excluding units within 10% of the bandwidth raises the bias by 41&#8211;63% and variance by 53&#8211;61%, depending on the kernel used.</p></li></ul></li><li><p>Show that "bias-aware" confidence intervals still work</p><ul><li><p>Recent methods from Armstrong and Koles&#225;r (<a href="https://onlinelibrary.wiley.com/doi/pdf/10.3982/ecta14434">Armstrong and Koles&#225;r, 2018</a>, <a href="https://onlinelibrary.wiley.com/doi/pdf/10.3982/QE1199">2020</a>; <a href="https://www.jstor.org/stable/pdf/26528031.pdf">Koles&#225;r and Rothe, 2018</a>) allow for valid inference even with donut RD, as long as you account for the increased uncertainty.</p></li><li><p>These confidence intervals are longer, but still valid.</p></li></ul></li><li><p>Propose new statistical tests to compare donut and conventional RD estimates</p><ul><li><p>They develop tests that account for the dependence between estimates (since most of the data is shared).</p></li><li><p>One of their tests compares donut estimates to those using only the inner &#8220;donut hole&#8221; data, and shows better power.</p></li></ul></li></ol><p><em>Why is this important?</em></p><ul><li><p>Donut RD is widely used but often done informally.</p></li><li><p>This paper offers a formal econometric framework for when and how to use donut RD responsibly.</p></li><li><p>It shows that donut RD doesn&#8217;t always make estimates &#8220;more robust&#8221; (it can make them less precise, so the trade-offs need to be understood).</p></li></ul><p><em>Who should care?</em></p><ul><li><p>Applied economists doing RD who want to run robustness checks on their identification.</p></li><li><p>Researchers worried about manipulation or bunching at the threshold.</p></li><li><p>Methodologists interested in nonparametric inference, confidence intervals, or testing RD assumptions.</p></li></ul><p><em>Do they have code?</em></p><p>Some simulations were run in R using the <code>RDHonest</code><a href="https://github.com/kolesarm/RDHonest"> package</a> (this is a great name for a package), but no replication package is linked in this draft. The methodology is compatible with standard RD toolkits, and researchers could implement the bias-aware inference and tests using tools like: <code>RDHonest</code> (R), and <code>rdrobust</code> (R and Stata).</p><p><em>How can we implement it?</em></p><ol><li><p>Run a conventional RD using local linear regression.</p></li><li><p>Drop observations close to the threshold (e.g., within 3g of a cutoff).</p></li><li><p>Estimate the donut RD with the same bandwidth.</p></li><li><p>Compute bias-aware confidence intervals using existing tools or formulas from the paper.</p></li><li><p>Compare the donut vs. regular RD estimates using the proposed statistical tests.</p></li></ol><p>In sum, this paper takes a widely used empirical practice (the donut RD) and gives it a formal statistical backbone. It shows that while excluding data near the cutoff may help address concerns about manipulation or sorting, it doesn&#8217;t come for free: donut RD increases both bias and variance, and standard inference methods may no longer apply. 
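</p><p>In practice, steps 1&#8211;3 look something like this - a minimal sketch using the <code>rdrobust</code> package, assuming a data frame <code>df</code> with outcome <code>y</code> and a running variable <code>x</code> centered at the cutoff, and an illustrative 3-unit donut radius (the bias-aware intervals and comparison tests from the paper come on top of this):</p><pre><code>library(rdrobust)

donut_r &lt;- 3                                   # illustrative donut radius (e.g., grams around 1500g)

# Conventional RD on the full sample
rd_full  &lt;- rdrobust(y = df$y, x = df$x, c = 0)

# Donut RD: drop observations within donut_r of the cutoff, keep everything else
keep     &lt;- abs(df$x) &gt;= donut_r
rd_donut &lt;- rdrobust(y = df$y[keep], x = df$x[keep], c = 0)

summary(rd_full)
summary(rd_donut)   # typically wider confidence intervals once the hole is removed
</code></pre><p>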
The authors equip researchers with new theory, valid confidence intervals, and practical tests to help decide when donut RD is appropriate, and when it might do more harm than good. For anyone using RD designs in applied work, this paper is an important guide to thinking more carefully about robustness and inference.</p><p></p><h3>Flexible Covariate Adjustments in Regression Discontinuity Designs</h3><h5>TL;DR: this paper introduces a new, more flexible way to use covariates in RDDs. Instead of adding covariates linearly (which can be inefficient - at best - especially with many covariates), the authors propose subtracting an estimated function of the covariates from the outcome variable (possibly using ML techniques such as LASSO - my favourite - RF, DNN, or ensemble combinations) before running a standard RD. This approach improves precision and robustness, even in high-dimensional settings, while staying easy to implement.</h5><p><em>What is the paper about?</em></p><p>In RD designs, covariates aren&#8217;t necessary for identification, but they&#8217;re often included to reduce variance. The common way to do this is to include them linearly and globally (i.e., not localized by distance to the cutoff), which can be inefficient - especially with many covariates or nonlinear relationships.</p><p>This paper proposes a more general approach: instead of including covariates in the regression, subtract a flexible function of covariates from the outcome, then run a standard RD on this adjusted outcome. This function can be estimated using machine learning (lasso, forests, boosting, neural nets) or more traditional nonparametric methods. The key is to choose the function that best captures the part of the outcome that&#8217;s predictable from covariates but not related to treatment.</p><p><em>What do the authors do?</em></p><ol><li><p>Theoretical contribution: they characterize the optimal covariate adjustment &#8594; a function that minimizes asymptotic variance. They show that the adjusted RD estimator remains consistent and asymptotically normal, even if this function is misspecified or slowly estimated. Their method enjoys a stronger version of Neyman orthogonality, meaning it&#8217;s robust to errors in the first stage.</p></li><li><p>Practical implementation: estimate a function &#951;(Z) predicting the outcome from covariates, then subtract &#951;(Z) from Y to get a new &#8220;de-noised&#8221; outcome, and finally run the usual local linear RD using this adjusted outcome.</p></li><li><p>Cross-fitting: to avoid overfitting, they use cross-fitting (sample splitting), a best practice in double ML, and they also offer two variants: localized and global, depending on whether the ML algorithm focuses near the cutoff or uses the full sample.</p></li><li><p>Software and methods: their ensemble uses linear models, post-lasso, boosted trees, and random forests, with weights chosen via super learner. 
It&#8217;s implemented in R (another point for my RStats gang), but easy ( :) ) to adapt elsewhere.</p></li></ol><p><em>Why is this important?</em></p><ul><li><p>Improves precision over conventional linear covariate adjustments, especially in high-dimensional settings.</p></li><li><p>Makes RD designs more robust without complicating the estimation procedure.</p></li><li><p>Compatible with existing RD software and bandwidth selection methods &#8594; just replace Y with the adjusted outcome.</p></li></ul><p>Empirical performance</p><p>They reanalyze 56 RD specifications from 16 published economics papers. Key findings: in about half of the cases, adding covariates linearly didn&#8217;t reduce confidence intervals much; their flexible method achieved up to 30% shorter confidence intervals, equivalent to doubling the sample size; and even modest improvements (e.g., 10&#8211;20%) are common and valuable in practice. They also show in simulations that the method works well under different sample sizes and covariate counts.</p><p><em>Who should care?</em></p><ul><li><p>Applied researchers using RD who want to improve statistical power or precision.</p></li><li><p>People working with many covariates or nonlinear relationships.</p></li><li><p>Methodologists and anyone using ML in causal inference.</p></li></ul><p><em>Do they have code?</em></p><p>Yes, they implemented the method in R, using <code>xgboost</code>, <code>ranger</code>, <code>hdm</code>, and <code>SuperLearner</code>. However, no public GitHub repo is linked (as of now), so direct replication would require reconstructing based on the paper and supplement.</p><p>In sum, this paper offers a powerful yet intuitive improvement to covariate adjustment in RD designs. By using modern prediction tools to &#8220;de-noise&#8221; the outcome before estimation, researchers can achieve greater precision, especially in high-dimensional or nonlinear settings. The method is simple to implement, robust to estimation error, and fully compatible with standard RD tools, which makes it a valuable addition to the applied econometrician&#8217;s toolbox. For those looking to get more out of their data without sacrificing identification, this approach delivers both flexibility and efficiency.</p><p></p><h3>A Partial Linear Estimator for Small Study Regression Discontinuity Designs</h3><h5>TL;DR: this paper revisits and revives an older method - Partial Linear Estimation (PLE) - for RDD, showing that it can outperform standard RD methods in small-sample or sparse designs, which are common in education and other policy evaluations. The authors modify and implement this estimator with new bandwidth and variance selection tools, and show in simulations that it's highly competitive (often better) when data near the cutoff is limited.</h5><p><em>What is the paper about?</em></p><p>Most RD studies use what&#8217;s called local polynomial estimation (LPE). Think of it as fitting two separate regression lines: one just below the cutoff, one just above. Then, you compare their values right at the threshold to estimate the treatment effect.</p><p>But when you have a small number of observations near the cutoff, this method can become unstable - those two separate lines rely heavily on a few data points and can give noisy or biased results. What the authors propose is to use a method called the Partial Linear Estimator (PLE). Instead of fitting two separate lines, PLE fits a single smooth curve across the whole running variable (both sides of the cutoff). 
This curve is flexible - it adjusts for the general relationship between the running variable and the outcome - but it also includes a separate term that captures the treatment effect at the cutoff. It&#8217;s like saying: &#8220;let&#8217;s model the overall trend smoothly across the data, and then estimate the jump at the cutoff as a separate component.&#8221; The benefit is that this uses information from the entire sample and avoids the issue of "boundary bias" that arises from fitting two separate regressions at the edge of the data. That makes it more stable and precise, especially when you&#8217;re working with limited data close to the threshold, which is common in many education or policy settings.</p><p><em>What do they do?</em></p><ol><li><p>They extend and modernize <a href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;type=pdf&amp;doi=69609978686667b05150e32ae8e71006fe5cf800">Porter&#8217;s estimator</a> by using local polynomial regression weights instead of local constant ones, pairing it with a new bandwidth selection algorithm (SM) based on an asymptotic MSE criterion, and adding jackknife-based standard errors for inference (which are shown to perform well in small samples).</p></li><li><p>They then simulate performance in small-sample scenarios: they compare their estimator to standard methods like <a href="https://www.nber.org/system/files/working_papers/w14726/w14726.pdf">CV/IK</a>, <a href="https://arxiv.org/pdf/2503.04904">FLCI/AK</a>, and local randomization (LR); across 4 different data generating processes (DGPs) and multiple sample sizes, their PLE method consistently performs well, especially PLE with IK bandwidth.</p></li><li><p>Finally, they apply the method to real school accountability data: they analyze scores from Indiana schools just above/below the failing threshold. Despite a total sample of 1,933 schools, only 88 are near the cutoff - typical of RD sparsity. PLE estimates a small, non-significant negative treatment effect (opposite of what policymakers would hope), highlighting the method&#8217;s usability even with thin data.</p></li></ol><p><em>Why is this important?</em></p><p>Many applied RD settings (especially in education, public policy, or subgroup analysis) suffer from low effective sample sizes near the cutoff. Standard RD methods assume large samples or dense data around the threshold, which may not hold in practice. PLE offers greater stability in small samples, avoids boundary bias from fitting separate models, and outperforms popular alternatives like FLCI and LR in realistic small-sample settings.</p><p><em>Who should care?</em></p><ul><li><p>Applied researchers working with small RD samples (e.g., schools, villages, programs with eligibility thresholds).</p></li><li><p>Economists and statisticians interested in practical improvements to RD methods.</p></li><li><p>Policy evaluators using RD designs where most units are far from the threshold.</p></li></ul><p><em>Do they have code?</em></p><p>Yes, the authors have implemented the method in an R package called <code>rdple</code>, which is available on <a href="https://github.com/DSwartzy/rdple">GitHub</a>, not CRAN. The package includes functions for estimating the Partial Linear Estimator (PLE), selecting bandwidths, and computing standard errors. 
To install it, you&#8217;ll need to use the <code>devtools</code> package:</p><p><code>install.packages("devtools")<br>devtools::install_github("DSwartzy/rdple")</code></p><p>Once installed, you can use it to apply the PLE method to your own RD data, especially in small-sample contexts.</p><p><em>How can we implement it?</em></p><ol><li><p>Install and load the <code>rdple</code> package from GitHub</p></li><li><p>Choose a bandwidth: use either the authors&#8217; proposed SM (Smoothness-based) bandwidth selector, designed specifically for PLE, or a standard one like IK (Imbens-Kalyanaraman). Both bandwidths are compatible with the method; PLE/IK performs particularly well in simulations.</p></li><li><p>Estimate the treatment effect: use local polynomial regression weights (typically local linear, i.e., degree = 1). The estimator fits a single smooth function across the running variable and estimates the treatment effect separately at the cutoff.</p></li><li><p>Compute standard errors: the recommended variance estimator is jackknife-based, specifically the one built on Wu (1986), which removes one residual at a time and has shown good performance in small samples. Other options are discussed (e.g., Hinkley&#8217;s method, direct plug-in), but jackknife on residuals is preferred due to stability and robustness.</p></li><li><p>Construct confidence intervals: use the jackknife variance estimate to build 95% intervals.</p></li></ol><p>In sum, this paper revisits an underused method (PLE) and shows that it can be a highly effective alternative to standard RD estimators, particularly in small-sample or data-sparse settings. By fitting a smooth function across the entire running variable and estimating the treatment effect at the cutoff, the method avoids common issues like boundary bias and instability near the threshold. The authors modernize the estimator with better bandwidth selection and robust variance estimation, and they provide an easy-to-use R package for implementation. For researchers facing limited data near RD cutoffs, this approach offers a practical and reliable tool that often outperforms conventional techniques.</p><h3>Treatment Effect Heterogeneity in Regression Discontinuity Designs</h3><h5>TL;DR: this paper develops a rigorous econometric framework for analyzing heterogeneous treatment effects in RDDs. It formalizes the most common empirical practice (interacting treatment with covariates in local linear regressions) and shows when and how this approach recovers causally interpretable conditional effects. It also provides tools for estimation, robust bias-corrected inference, and optimal bandwidth selection, all implemented in a companion R package <code>rdhte</code> &#8594; everything we love.</h5><p><em>What is the paper about?</em></p><p>In practice, researchers often want to know: &#8220;does the treatment effect vary across different groups (e.g., by income, gender, or region)?&#8221; In RDDs, these subgroup analyses are typically done by adding interaction terms between the treatment and covariates (like income or education level) in a local linear regression. However, until now, there has been no formal framework validating this strategy, despite its widespread use. This paper provides that foundation. It shows when and how local linear regressions with interactions can be used to identify causal heterogeneous effects, and it clarifies what can and cannot be interpreted causally, especially when working with continuous covariates. 
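</p><p>To see the object being formalized, here is a minimal base-R sketch of that common practice - a local linear regression with treatment-covariate interactions inside a bandwidth around the cutoff. The bandwidth, kernel, and variable names are illustrative; <code>rdhte</code> handles bandwidth selection and robust bias-corrected inference for you:</p><pre><code># df: outcome y, running variable x (cutoff at 0), moderator w (e.g., a subgroup dummy)
h &lt;- 0.5                              # illustrative bandwidth
d &lt;- subset(df, abs(x) &lt; h)
d$treat &lt;- as.numeric(d$x &gt;= 0)
d$kern  &lt;- 1 - abs(d$x) / h           # triangular kernel weights

# Local linear regression, fully interacting treatment with the slope and the moderator
fit &lt;- lm(y ~ treat * x + treat * w, data = d, weights = kern)
summary(fit)
# coefficient on treat   : effect at the cutoff when w = 0
# coefficient on treat:w : how the cutoff effect varies with w (the heterogeneity of interest)
</code></pre><p>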
The authors also develop tools for robust bias-corrected inference, optimal bandwidth selection, and group comparison tests, offering a unified approach to RD heterogeneity analysis.</p><p><em>What do the authors do?</em></p><p>They begin by showing that when the covariate is discrete (e.g., income quartiles), causal subgroup treatment effects are identified without needing additional assumptions. When the covariate is continuous, however, causal identification requires a semiparametric structure, specifically, that the treatment effect varies linearly with the covariate at the cutoff. This assumption ensures that the estimated heterogeneity reflects meaningful variation rather than noise or misspecification. Under this structure, the authors define conditions under which a standard local linear RD with interactions recovers the Conditional Average Treatment Effect (CATE) at the threshold. They also derive optimal bandwidth formulas tailored to heterogeneity targets, whether estimating group-specific effects or differences between groups. While having separate bandwidths for each group is theoretically optimal, they show that using a common bandwidth is often justified and simplifies implementation. For inference, the authors provide bias-corrected confidence intervals using the robust methods developed in <a href="https://www.jstor.org/stable/pdf/43616914">Calonico et al. (2014)</a>, and extend these tools to allow for clustered standard errors. In their empirical illustration, they reanalyze a well-known RD study (<a href="https://www.jstor.org/stable/pdf/27105154.pdf">Akhtari et al., 2022</a>) on political turnover in Brazilian mayoral elections. They investigate treatment effect heterogeneity in headmaster replacement, using municipal income as the moderator, both discretized (via median, quartiles, deciles) and continuous. They find that heterogeneous effects are statistically significant among lower-income municipalities, and that modeling income continuously yields similar patterns with greater efficiency.</p><p><em>Why is this important?</em></p><p>Heterogeneity is central to policy &#8594; knowing who benefits most (or least) informs targeting and fairness. Many RD studies analyze subgroup effects informally; this paper standardizes and validates the most common empirical practice. It provides rigorous conditions for causal interpretation and the tools to do valid inference.</p><p><em>Who should care?</em></p><ul><li><p>Applied economists and political scientists using RD designs with subgroup analysis.</p></li><li><p>Researchers working with discrete or continuous covariates in RD settings.</p></li><li><p>Anyone doing covariate-interacted local linear RD, especially for policy heterogeneity.</p></li></ul><p><em>Do they have code?</em></p><p>Yes. The companion R package is called <code>rdhte</code>, available <a href="https://rdpackages.github.io/rdhte/">here</a>. It implements all the estimation and inference tools described in the paper, including linear interaction models, optimal bandwidth selectors, robust bias-corrected confidence intervals, and heterogeneity tests.</p><p>In sum, this paper offers a comprehensive framework for analyzing treatment effect heterogeneity in RD designs, bridging the gap between common empirical practice and formal identification theory. 
It clarifies when covariate interactions in local linear RD regressions yield causally interpretable effects, provides practical tools for estimation, inference, and bandwidth selection, and delivers everything in a ready-to-use R package. For researchers aiming to uncover who benefits most from treatment, this paper turns an informal add-on into a rigorous and reliable strategy.</p><h3>Learning Conditional Average Treatment Effects in Regression Discontinuity Designs using Bayesian Additive Regression Trees</h3><h5>TL;DR: this paper introduces BARDDT, a purpose-built BART model for RDDs, that estimates Conditional Average Treatment Effects (CATE) at the cutoff, conditional on covariates. It outperforms standard BART, local polynomial RD, and CART-based alternatives, especially when treatment effects vary across units! It has everything we love: plots, trees, matrices</h5><p><em>What is this paper about?</em></p><p>Most RD studies estimate the average treatment effect at the cutoff&#8212;but we often care about how that effect varies across people. For example, do students with lower high school GPAs respond differently to academic probation than students with stronger academic records?</p><p>This paper introduces a new method called BARDDT (Bayesian Additive Regression Trees for Discontinuity Treatment Effects) that helps answer exactly that. It&#8217;s a flexible, data-driven tool that can estimate how the treatment effect at the cutoff varies depending on someone&#8217;s characteristics&#8212;without needing to pre-specify the subgroups in advance.</p><p>In simple terms:</p><ul><li><p>BARDDT looks for patterns in who responds more or less to the treatment, based on the covariates you have (e.g., age, gender, baseline performance).</p></li><li><p>It works like a very smart version of splitting your sample into subgroups, except it does this automatically, based on where the data shows meaningful differences.</p></li><li><p>Unlike standard regression trees or off-the-shelf machine learning tools, it&#8217;s specifically built for the RD setting: it respects the discontinuity structure and estimates heterogeneity right at the cutoff.</p></li></ul><p><em>What do the authors do?</em></p><p>They develop BARDDT, a version of Bayesian Additive Regression Trees (BART) adapted for RDDs.</p><ul><li><p>The model:</p><ul><li><p>Fits smooth curves instead of flat segments within each tree.</p></li><li><p>Splits on both the running variable and individual covariates (e.g., gender, GPA).</p></li><li><p>Directly estimates individual-level treatment effects at the cutoff.</p></li></ul></li><li><p>Run extensive simulations:</p><ul><li><p>Compare BARDDT to standard BART, local polynomial RD, and tree-based CATE methods.</p></li><li><p>BARDDT consistently delivers lower bias and better CATE recovery, especially when relationships are nonlinear.</p></li></ul></li><li><p>Apply the method to academic probation data:</p><ul><li><p>Estimate how probation affects GPA for different types of students.</p></li><li><p>Find larger effects among students with low prior GPA or lighter course loads.</p></li></ul></li></ul><p><em>Why it matters</em></p><p>Heterogeneous treatment effects are often what policy cares about. Existing RD methods usually assume constant effects at the cutoff or do manual subgroup analysis. BARDDT offers a principled, flexible way to uncover causal heterogeneity without needing to specify subgroups ahead of time. 
It brings the benefits of ML to RD while still respecting identification assumptions.</p><p><em>Who should care?</em></p><ul><li><p>Applied researchers using RDDs who want to understand which subgroups respond most to treatment.</p></li><li><p>Economists and data scientists working on personalized policy effects (e.g., education, health, labor).</p></li><li><p>Anyone interested in combining causal inference and ML, especially for structured designs like RD.</p></li></ul><p><em>Do they have code?</em></p><p><a href="https://github.com/rafaelcalcantara/BART-RDD">Yes</a>! The authors provide an open-source package called <code>stochtree</code> (R and Python). It includes: BARDDT implementation, simulation code, and academic probation replication.</p><p><em>How to implement it</em></p><ol><li><p>Prepare your RD dataset with a running variable and relevant covariates.</p></li><li><p>Standardize the running variable (the model expects this).</p></li><li><p>Fit BARDDT using the <code>stochtree</code> package.</p></li><li><p>Estimate the CATE at the cutoff, conditional on covariates.</p></li><li><p>Visualize or summarize heterogeneity (e.g., using tree summaries of CATEs or marginal effects).</p></li></ol><p>In sum, this paper shows how to bring modern ML techniques into the world of RDD, without breaking the causal assumptions that make RD attractive. By customizing BART to the RD context, the authors give researchers a new way to estimate and explore treatment heterogeneity at the margin. If you&#8217;ve been doing RD-by-subgroup or linear interactions, this is a powerful, flexible alternative, especially when you don&#8217;t know in advance where the differences lie. Do we ever? </p><p></p>]]></content:encoded></item><item><title><![CDATA[What&#8217;s New in DiD? 
On Flexibility, Heterogeneity, and Robustness]]></title><description><![CDATA[Two recent papers and one guide that stretch, reframe, and strengthen what we thought we knew about DiD]]></description><link>https://www.diddigest.xyz/p/whats-new-in-did-on-flexibility-heterogeneity</link><guid isPermaLink="false">https://www.diddigest.xyz/p/whats-new-in-did-on-flexibility-heterogeneity</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Mon, 24 Mar 2025 11:55:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mNzQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dad2d5-ef60-4944-aa55-f419e61da48d_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mNzQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dad2d5-ef60-4944-aa55-f419e61da48d_1200x630.png" alt="Gaining confidence in synthetic control causal inference with sensitivity analysis - Spotify Research"></figure></div>
<p>I have three new updates. I&#8217;ll list them first and then we&#8217;ll move on to talk about them. I&#8217;ll try to keep it short.<br></p><ol><li><p><strong><a href="https://arxiv.org/abs/2503.13323">Difference-in-Differences Designs: A Practitioner's Guide</a></strong>, by Andrew Baker, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. 
Sant'Anna.</p></li><li><p><strong><a href="https://direct.mit.edu/rest/article/doi/10.1162/rest_a_01553/127742/Finitely-Heterogeneous-Treatment-Effect-in-Event">Finitely Heterogeneous Treatment Effect in Event-study</a>, </strong>by Myungkou Shin.</p></li><li><p><strong><a href="https://arxiv.org/abs/2503.11375">Difference-in-Differences Meets Synthetic Control: Doubly Robust Identification and Estimation</a>,</strong> by Yixiao Sun, Haitian Xie, Yuhang Zhang.</p></li></ol><p></p><h3><em><strong><a href="https://arxiv.org/abs/2503.13323">Difference-in-Differences Designs: A Practitioner's Guide</a></strong></em></h3><p>It seems like the literature keeps moving mainly thanks to the efforts of these guys in particular (+ prof Wooldridge The Great). Their collective productivity is not only boosting the US TFR, but also expanding our understanding of how to better set up DiD studies and what to actually consider when doing so.</p><p><br>You can&#8217;t use the excuse that you don&#8217;t know how to do stuff if there&#8217;s so much material online and offline nowadays. This guide is particularly useful because it brings together all the work they have put in over the years. It is straightforward and very rich in detail. They are putting the &#8220;science&#8221; in &#8220;social science&#8221; by proposing an organizing framework that conceptualizes complex DiD designs as aggregations of multiple 2x2 comparisons. This guide does for DiD what a microscope does for biology: it zooms in on the essential building blocks, helping you understand and construct better causal inference models. They emphasize that even &#8220;intricate&#8221; DiD studies can be broken down into simpler components, each relying on the fundamental parallel trends assumption.</p><p>The guide further explores practical considerations such as the use of covariates, weighting schemes, and strategies for handling multiple periods and staggered treatments. They want to provide us with the tools to implement DiD methods more effectively and interpret our results with greater clarity. It&#8217;s packed with clear examples, organizing frameworks, and even code references to help us implement modern DiD designs properly (we love tools that we can immediately apply).</p><p>Save it, print it, enjoy it!</p><p>(Also thank them!)</p><h3><em><strong><a href="https://direct.mit.edu/rest/article/doi/10.1162/rest_a_01553/127742/Finitely-Heterogeneous-Treatment-Effect-in-Event">Finitely Heterogeneous Treatment Effect in Event-study</a></strong></em></h3><p>I know most of us sometimes stumble on the pivotal <strong>parallel trends assumption</strong>, the idea that untreated units provide a valid counterfactual for treated units. 
It can be hard to justify (someone <em>will</em> ask you about it at a seminar), especially when conditions aren&#8217;t ideal, like short time horizons or pre-treatment trends that don&#8217;t quite line up.</p><p>Others might say: <em>&#8220;This should&#8217;ve been an event study.&#8221; </em>This paper brings both worlds together: it bridges the traditional DiD and event-study approaches by proposing a more flexible way to handle heterogeneity in time trends across units.</p><p>Professor Myungkou Shin shows how event-study designs implemented via DiD can be made more robust when natural groupings exist in the data, even if we can&#8217;t observe them directly. It&#8217;s especially useful when there isn&#8217;t a clean natural experiment, but you <em>suspect</em> there are distinct patterns of heterogeneity in your data.</p><p>The key idea is the &#8220;type-specific parallel trends&#8221; assumption. It allows for heterogeneous time trends by introducing a latent type variable - an unobserved characteristic that assigns each unit (person, firm, region) to a group or &#8220;type&#8221; based on underlying traits we can&#8217;t observe directly, but that affect their outcomes over time.</p><p>In traditional DiD, we assume all units (treated and untreated) would have followed similar trends absent treatment. But what if some units are natural fast improvers, while others stagnate or decline - even without treatment? And what if that variation reflects some unobserved group structure? In that case, the standard DiD estimate can be biased - because the untreated trend isn&#8217;t a valid counterfactual for the treated group.</p><p>This paper&#8217;s innovation is to allow for finitely many latent types - and assume that units of the same (even unobserved) type would have followed the same trajectory absent treatment. This relaxes the classic parallel trends assumption and gives us a more robust framework to deal with hidden heterogeneity in DiD designs.</p><p>Definitely worth a read if you&#8217;re working with DiD or event-study designs in applied settings - especially when you&#8217;re concerned about hidden heterogeneity or shaky pre-trends. It&#8217;s a thoughtful contribution that expands what we can credibly identify in non-experimental data.</p><h3><em><strong><a href="https://arxiv.org/abs/2503.11375">Difference-in-Differences Meets Synthetic Control: Doubly Robust Identification and Estimation</a></strong></em></h3><p>Ok, so we discussed event-study and DiD settings - now let&#8217;s talk SC and DiD (so much content today. What&#8217;s next, IV and DiD? Don&#8217;t be crazy).</p><p>DiD is great, but it relies on the parallel trends assumption - that treated and control groups would have evolved similarly in the absence of treatment. SC doesn&#8217;t assume parallel trends; instead, it builds a weighted combination of control units to replicate the treated unit&#8217;s pre-treatment path. Both methods can fail if their core assumptions don&#8217;t hold - and in practice, we often don&#8217;t know which one is more believable.</p><p>This paper&#8217;s innovation is an integrated approach that offers a doubly robust identification strategy for causal effects in panel data with group structures. It identifies the average treatment effect on the treated (ATT) if either the parallel trends assumption or a group-level synthetic control assumption holds. 
Sun, Xie and Zhang focus on typical micro panel data settings where individuals are observed repeatedly over time and grouped into larger units (e.g. households in states), with treatment assigned at the group level.</p><p>They develop a doubly robust estimator that blends the strengths of DiD and SC. You get a consistent estimate of the treatment effect as long as one of the two assumptions holds; you don&#8217;t need both. The method works with panel data, repeated cross-sections, and staggered adoption, making it highly versatile for applied researchers working with micro-level datasets like surveys, administrative records, or digital trace data.</p><p>In more technical terms, Sun, Xie and Zhang introduce a moment function &#966; - a carefully constructed combination of outcome changes in treated and control groups. It&#8217;s adjusted for covariates and treatment probabilities, and weighted using synthetic control logic. This structure ensures that &#966; delivers a valid ATT estimate as long as either the DiD or SC assumptions are satisfied.</p><p>In more &#8220;math&#8221; terms, a moment function is an expression that, when averaged over the sample, gives an estimate of a parameter (here the ATT). The authors work in a semiparametric estimation framework, where the parameter of interest (ATT) is finite-dimensional, while the nuisance components (e.g., the propensity score, conditional outcomes, and synthetic weights) are modeled nonparametrically. This allows for flexible estimation without overcommitting to functional form assumptions, and inference is handled via a multiplier bootstrap.</p><p>Speaking of bootstrapping, they provide a recipe for implementation (hopefully a package soon?). To compute the estimator in practice, they describe using cross-fitting (a form of sample-splitting common in double machine learning) to estimate nuisance functions like m<sub>&#916;</sub>(X) (the expected change in outcome given covariates), p<sub>g</sub>(X) (the group-level propensity scores), and weights w<sub>g</sub>(X) for SC. These can be estimated using nonparametric or machine learning methods, like local polynomial regression, RF, boosted trees, neural nets, and any other &#8220;flexible&#8221; method. Sun, Xie and Zhang specifically mention that their estimator is compatible with DML-type algorithms, which makes it relatively plug-and-play if you&#8217;re using packages like <code>DoubleML</code> (in R or Python) or custom estimators in <code>scikit-learn</code>, <code>grf</code>, etc.</p><p>It is a great paper for anyone working with messy pre-trends, fuzzy treatment timing, or &#8220;almost good&#8221; quasi-experiments. Their doubly robust approach can help us to identify and estimate the ATT as long as either the parallel trends assumption or the synthetic control condition holds, which is a significant step forward in settings with imperfect designs.</p><h3>The End (?)</h3><p>That&#8217;s a lot of causal inference in one sitting, but honestly it&#8217;s exciting to see how far we&#8217;ve moved from &#8220;just run a DiD&#8221; to <em>actually thinking hard</em> about assumptions, heterogeneity, and design. If you&#8217;re doing applied work, these papers are well worth saving, reading, and probably thanking the authors for later.</p><p>Next time: maybe IV and DiD? 
</p><p>Thanks for reading DiD Digest! Subscribe for free to receive new posts and support my work.</p>]]></content:encoded></item><item><title><![CDATA[New DID Guide Just Dropped: Better Causal Inference in Panel Data (in R).]]></title><description><![CDATA[Brought to you by Professor Yiqing Xu and Ziyi Liu.]]></description><link>https://www.diddigest.xyz/p/new-did-guide-just-dropped-better</link><guid isPermaLink="false">https://www.diddigest.xyz/p/new-did-guide-just-dropped-better</guid><dc:creator><![CDATA[Beatriz Gietner]]></dc:creator><pubDate>Fri, 21 Feb 2025 10:00:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qf1q!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cb58cbd-6c9d-47bc-9e14-ed1a2a92c226_538x538.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Professor <a href="https://yiqingxu.org/">Yiqing Xu</a> and <a href="https://x.com/liuziyi233">Ziyi Liu</a> just <a href="https://yiqingxu.org/packages/fect/05-panel.html">put out</a> a "New DID Methods" chapter from the <strong><a href="https://yiqingxu.org/packages/fect/stata/fect_md.html">fect</a></strong> (Fixed Effects Counterfactual Estimators) package user manual, which offers a comprehensive guide to implementing advanced DID estimators in R. It addresses the limitations of traditional TWFE models in causal panel analysis (which have well-documented issues when treatment effects are heterogeneous or dynamic) and complements their 2025 study, providing practical instructions and R code for researchers. Many researchers rely on TWFE DID models, but recent research has shown that these models can produce biased estimates when: a) treatment effects vary over time (e.g., some units respond faster or slower to the policy), b) different units receive treatment at different times (staggered adoption), and c) treatment effects are heterogeneous across individuals or groups. With more policy evaluations relying on DiD methods, ensuring accurate causal inference is more important than ever. This guide reflects the latest advances in the field, providing practical tools that researchers can implement straight away.</p><p>This chapter serves as a &#8220;hands-on&#8221; guide to implementing modern, HTE-robust DID methods in R, helping users produce more reliable causal inferences.</p>
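<p>To give a flavour of the workflow, here is a minimal sketch in R. The data frame <code>df</code> and its columns are hypothetical stand-ins, and the argument names follow my reading of the <strong>fect</strong> manual, so check the chapter for the exact interface and defaults:</p><pre><code>## Minimal sketch of a fect-style analysis on a hypothetical long-format panel
## with columns: id, year, Y (outcome), D (0/1 treatment), X1 (a covariate).
library(fect)

out &lt;- fect(Y ~ D + X1, data = df,
            index  = c("id", "year"),  # unit and time identifiers
            method = "fe",             # fixed-effects counterfactual (imputation) estimator
            force  = "two-way",        # unit and time fixed effects
            se     = TRUE,             # bootstrapped uncertainty estimates
            nboots = 200)

plot(out)   # event-study style plot of the estimated dynamic treatment effects
</code></pre><p>As I understand it, switching to the interactive fixed effects or matrix completion estimators covered in the rest of the fect manual is a one-argument change to <code>method</code>, which is much of the appeal of working through this guide.</p>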
<p>Some of the features of this chapter include:</p><ol><li><p><strong>Comprehensive Coverage of HTE-Robust Estimators</strong>: The chapter introduces various heterogeneous treatment effect (HTE) robust estimators developed as alternatives to TWFE models. These methods, proposed by researchers such as Cengiz et al. (2019), Sun and Abraham (2021), and Callaway and Sant&#8217;Anna (2021), are closely connected to the classic DID estimator.</p></li><li><p><strong>Practical Implementation Guidance</strong>: Readers are guided through the implementation of these HTE-robust estimators in R, including instructions on creating event study plots to display estimated dynamic treatment effects. The chapter presents a recommended pipeline for analyzing panel data, covering data exploration, estimation, result visualization, and diagnostic tests.</p></li><li><p><strong>Empirical Examples</strong>: The chapter illustrates these methods using two empirical examples: Hainmueller and Hangartner (2019), which examines the effects of indirect versus direct democracy on naturalization rates in Switzerland without treatment reversals, and Grumbach and Sahn (2020), which includes treatment reversals.</p></li><li><p><strong>Sensitivity Analysis</strong>: It demonstrates how to implement the sensitivity analysis proposed by Rambachan and Roth (2023) using the imputation estimator and data from the first example, enhancing the robustness of causal inferences.</p></li></ol><p>This guide is ideal for:</p><p>a) Economists, political scientists, and social scientists conducting policy evaluations<br>b) Researchers analyzing staggered treatment adoption in panel datasets<br>c) Anyone using DID estimators in R and looking for more robust methods<br>d) Those interested in event-study analysis for dynamic causal effects</p><p>If you're working with panel data and policy evaluations, using traditional TWFE methods could be leading you to incorrect conclusions. Their guide offers a state-of-the-art approach to estimating treatment effects with more accuracy and reliability.</p><p>Thanks for reading DiD Digest! Subscribe for free to receive new posts and support my work.</p>]]></content:encoded></item></channel></rss>