Where Standard Assumptions Go To Die
Universal treatments without clean controls, continuous policies without untreated groups, reversible reforms over time, and statistical inference with a handful of treated units
Hi there! I didn’t forget about you folks, but things have been pretty quiet, so I was piling up papers until there were enough for a meaningful post.
Let’s then check the latest ones:
Factorial Difference-in-Differences, by Yiqing Xu, Anqi Zhao and Peng Ding (Professor Yiqing told me this is not new - it was originally uploaded last year - but we will be hearing about it soon, so I thought about including it here. He was also very kind and shared his slides so we could understand the paper better. Thank you, prof! There’s also a talk available here)
Difference-in-Differences Estimators When No Unit Remains Untreated, by Clément de Chaisemartin, Diego Ciccia, Xavier D’Haultfœuille and Felix Knau (this was also first uploaded last year)
Treatment-Effect Estimation in Complex Designs under a Parallel-trends Assumption, by Clément de Chaisemartin and Xavier D’Haultfœuille (if you’re in Europe and want to learn more DiD from prof D’Haultfœuille, he will be at the ISEG Winter School in January. You can sign up for it here)
Inference With Few Treated Units, by Luis Alvarez, Bruno Ferman, and Kaspar Wüthrich (first draft: April 26, 2025; this draft: June 26, 2025)
Factorial Difference-in-Differences
TL;DR: what should we do when there’s no untreated group? Events like famines, wars, or nationwide reforms hit everyone at once, leaving us without the clean control group that canonical DiD relies on. Xu, Zhao and Ding formalise what applied researchers have long been doing in these cases: using baseline differences across units to recover interpretable estimates. They call this Factorial DiD (FDiD). The framework clarifies what the DiD estimator identifies, what extra assumptions are needed to make causal claims, and how canonical DiD emerges as a special case.
What is this paper about?
This paper is about a situation many of us applied people face: what to do when a big event affects everyone? In canonical DiD we need a treated group and a clean control group. But in practice there are cases where no such control exists (e.g., a famine, a war, or a nationwide policy reform) and no one escapes exposure. What we often do is still run a DiD regression, but instead of comparing treated and untreated units, we interact the event with some baseline factor: a characteristic that units already have before the event happens and that doesn’t change because of it1. The authors call this approach Factorial DiD2, and the name reflects the fact that the research design involves two factors: the event itself (before versus after) and the baseline factor (e.g., high versus low social capital).
The same DiD estimator is used, but the research design is different. The estimator no longer identifies the standard treatment effect, but instead it captures how the event’s impact differs across groups with different baseline factors, a quantity they call effect modification. The authors go further and show that moving from this descriptive contrast to a causal claim about the baseline factor (what they call causal moderation) requires stronger assumptions. They also show that canonical DiD is a special case of FDiD, but only if you impose an exclusion restriction assuming some group is exposed yet unaffected by the event. This paper gives a language, a framework and clear identification results for a class of designs that are already widely used but rarely well defined in the methodological literature.
What do the authors do?
They begin with the simplest possible case: 2 groups, 2 periods + one event that hits everyone. They show that with universal exposure + no anticipation + the usual PT, the DiD estimator doesn’t give us the “standard” treatment effect but instead identifies effect modification, which is the descriptive contrast in how much the event changes outcomes for high-baseline versus low-baseline units (similar to the idea of heterogeneous treatment effects). They then introduce three related quantities3:
1) Causal moderation: the stronger claim that the baseline factor itself changes how the event matters (e.g., that social capital causes the famine to be less deadly);
2) An ATT analogue: the average effect of the event on the high-baseline group (which looks like the “treated group effect” in canonical DiD);
3) And the baseline factor’s effect given exposure: how the baseline characteristic matters once the event has happened.
The key contribution is to line up which assumptions are needed for each one. With only canonical DiD assumptions, you can get effect modification. To move to causal moderation, you need their new factorial PTA. Canonical DiD then emerges as a special case if you add an exclusion restriction that one group is exposed but unaffected. A different exclusion restriction, assuming the baseline factor has no impact in the absence of the event, links the design to the baseline-given-exposure quantity.
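To make the 2x2 case concrete, here’s a tiny back-of-the-envelope example (mine, with made-up numbers, not from the paper): the FDiD estimate is just the difference between the high- and low-baseline groups’ pre/post changes, and under canonical PT that number is effect modification, nothing more.

```r
# Toy 2x2 FDiD with made-up numbers: everyone is exposed to the event,
# and units are split into high vs low baseline factor (e.g., social capital).
y_high_pre <- 10; y_high_post <- 7   # high-baseline group, before/after the event
y_low_pre  <- 10; y_low_post  <- 4   # low-baseline group, before/after the event

delta_high <- y_high_post - y_high_pre   # -3
delta_low  <- y_low_post  - y_low_pre    # -6

fdid <- delta_high - delta_low           # = 3
# Under universal exposure + no anticipation + canonical PT, `fdid` is effect
# modification: the event hit the low-baseline group 3 units harder. It is NOT
# the effect of the event on the high-baseline group (the ATT analogue); that
# reading needs the exclusion restriction described above.
fdid
```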
After working through these four cases, they move on to show how the framework can be stretched further. The authors show how to add baseline covariates, work with repeated cross-sections rather than panels and extend to continuous baseline factors. They also clarify what this means for practice: when TWFE regressions with interactions are “coherent” and when they are not. The paper closes by revisiting the famine and social capital example, showing how the framework applies to a real case that has already been studied with this kind of design.
Why is this important?
I think giving names to things matters. We’ve been doing FDiD; we just don’t call it that. We run a TWFE regression with an interaction term, call it DiD and then interpret the coefficient as if it were the standard ATT. The problem is that the target estimand is *not* the ATT.
This paper makes the implicit explicit. It 1) names the design and distinguishes it from canonical DiD, 2) spells out what the estimator is actually identifying under different assumptions, and 3) clarifies what extra assumptions we need if we want to go beyond descriptive contrasts. This obviously matters in practice. It gives us a diagnostic tool by stating plainly which estimand we are after and which assumptions justify it. I think it also sort of prevents over-claiming. With only canonical PT, we’re talking about effect modification. To get causal moderation, we need factorial PT. To recover an ATT-like interpretation, we need an exclusion restriction.
This paper also ties regression practice back to design. TWFE with interactions is ok if the assumptions are “right”; if not, we are not estimating what we think we are. And because the authors extend the framework to covariates, repeated cross-sections and continuous baseline factors, their approach covers the kinds of data we actually use.
At a broader level, the contribution is about transparency. FDiD gives us language that makes clear to authors, referees (and seminar audiences) what is being identified and under what conditions. It “builds a bridge” between factorial designs in stats and DiD in econ, and in doing so makes a very common empirical strategy much easier to defend (and critique).
Who should care?
Anyone in the social sciences doing causal work. Applied people4 working on labour, development, education, political economy, trade or even finance will recognise the setup (wars, famines, nationwide reforms, regulatory changes or financial crises all fit the bill). PolSci folks dealing with regime shifts, revolutions and/or national elections face the same issue. I think it’s also useful for referees, editors and authors, and for us graduate students learning causal inference.
Do we have code?
No, for many reasons. FDiD is not a new estimator but a design: the DiD estimator is standard, and what changes are the estimand and the assumptions. Implementation is also context-specific. Which quantity you estimate (effect modification, causal moderation, the ATT analogue or the baseline effect given exposure) depends on which assumptions you’re willing to make, and a “one-click” function would hide that choice. The same regression (TWFE with an interaction, or ∆Y on G) can map to different targets under different designs, so packaging it would risk misinterpretation. And the extensions vary (covariates, repeated cross-sections, continuous baseline factor), which means too many branches for a single simple API. If you want a practical recipe, it’s just: 1) difference outcomes, 2) run ∆Y on the baseline-group indicator(s) with covariates as needed, and 3) interpret the coefficient as effect modification under standard PT (add stronger assumptions if you want causal moderation or the other targets). A minimal sketch of that recipe follows below.
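Here it is in R, with simulated data (my own illustration, not the authors’ code; every name and number below is made up):

```r
# Minimal FDiD recipe; data simulated purely for illustration.
set.seed(1)
n  <- 200
g  <- rbinom(n, 1, 0.5)            # baseline factor (e.g., high social capital = 1)
x  <- rnorm(n)                     # pre-determined covariate
y1 <- 1 + 0.5 * x + rnorm(n)       # outcome before the event
y2 <- y1 - 2 + 1 * g + rnorm(n)    # everyone is exposed; the impact differs by g

# 1) difference outcomes, 2) regress the change on the baseline-group indicator
fit <- lm(I(y2 - y1) ~ g + x)
summary(fit)$coefficients["g", ]
# 3) under canonical PT, the coefficient on g (about 1 here) is effect
#    modification. Causal moderation or an ATT-like reading requires the
#    stronger assumptions (factorial PT, exclusion restrictions) above.
```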
In summary, FDiD gives a name and a framework to something we were already doing when there was no untreated group. The estimator is the same as in canonical DiD but the design is different (and so is the interpretation). What you get by default is effect modification, which is a descriptive contrast in how the event matters across baseline groups. To turn that into causal moderation, or to recover an ATT-like interpretation, you need extra assumptions. The paper’s value is in making these distinctions clear, mapping assumptions to estimands and also showing how regression practice fits within the framework. For anyone who has ever interacted an event with a baseline factor and called it DiD, please take a moment and think about rewriting it a bit, hehe.
Difference-in-Differences Estimators When No Unit Remains Untreated
TL;DR: what if a policy affects everyone, but by different amounts? Minimum wage hikes, China’s WTO entry, Medicare Part D: no group is left fully untreated. The authors call this a heterogeneous adoption design with no untreated group. Standard DiD fails here, and TWFE with treatment intensity can mislead. The paper shows how to recover effects if quasi-untreated units exist, proves an impossibility result without them, and offers minimal assumptions for partial identification. It also gives tests for when TWFE is “defensible”, with code in Stata and R.
What is this paper about?
What happens when a policy affects everyone, but not in the same way? A nationwide minimum wage hike raises pay everywhere, yet the bite is bigger in some regions than others. China’s entry into the WTO reduced tariff uncertainty for every U.S. industry, but some faced larger potential tariff spikes. Medicare Part D applied to all drugs, but exposure varied with each drug’s Medicare market share.
In each of these examples, there’s no group left *entirely* untouched by the policy, every unit is affected to some degree. The authors call this setup a heterogeneous adoption design (HAD) with no untreated group. It turns out this situation is not rare at all. Many policy changes are universal by design and even when a small untreated group exists we often exclude it (drop if) because it looks too different from the rest of the sample. The problem is that standard DiD relies on having a control group that stays at zero treatment. If everyone is affected, that assumption collapses. And the default fix in applied work (a TWFE regression with treatment intensity as the regressor) doesn’t solve the problem. As the authors show, TWFE can give misleading results in this setting when treatment effects vary across units.
This paper then “attacks” the issue directly. It shows when effects are identified and when they are not. With quasi-untreated units (exposure near zero), they identify a weighted average of slopes (WAS) using a local-linear, optimal-bandwidth estimator adapted from RDD, with valid bias-corrected inference. When no quasi-untreated units exist, they prove an impossibility result for WAS without extra assumptions; they then give minimal assumptions to learn the sign of the effect or an alternative parameter relative to the lowest observed dose. Because TWFE is widely used, they also supply tests: a tuning-free test for the existence of quasi-untreated units and a nonparametric Stute test of the TWFE linearity restriction, alongside pre-trend checks.
What do the authors do?
First the authors set up the framework. They define what they call a heterogeneous adoption design (HAD) with no untreated group: every unit is untreated in period one and in period two everyone gets some positive treatment dose, but the size of that dose differs across units. They show why TWFE regression, which regresses outcome changes on treatment intensity, can be misleading in this setup when effects differ across units.
They then ask: what can we still learn? Their first target is a parameter they call the weighted average of slopes (WAS), which weights each unit’s slope (change in potential outcomes between zero and actual treatment) by the unit’s treatment dose relative to the average dose. If there is a set of quasi-untreated units (those with exposure close to zero), WAS can be identified by comparing outcome changes in the full sample to outcome changes in this group. To estimate it, they import tools from the RD literature (local linear regression at the boundary + optimal bandwidth choice + bias-corrected confidence intervals).
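Only to fix ideas, here’s a back-of-the-envelope version of that identification logic on simulated data (this is not the authors’ local-linear, optimal-bandwidth estimator; the did_had package discussed below does it properly, with bias-corrected inference):

```r
# Crude sketch of the WAS identification idea on simulated data.
set.seed(2)
n  <- 1000
d  <- rexp(n)                       # period-2 dose: every unit gets some d > 0
dy <- 0.2 + 1.5 * d + rnorm(n)      # outcome change = common trend + dose effect

cutoff    <- quantile(d, 0.05)      # crude stand-in for the optimal bandwidth
trend_hat <- mean(dy[d <= cutoff])  # trend proxied by the quasi-untreated units
was_hat   <- (mean(dy) - trend_hat) / mean(d)
was_hat                             # close to the true slope of 1.5
```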
But what if even quasi-untreated units don’t exist? Here the paper proves a complete impossibility result: without extra assumptions, WAS has an identification set that is the entire real line - meaning the data can rationalize any value of WAS. To make progress, the authors then propose 2 minimal routes forward. One option is to assume enough structure to at least pin down the sign of WAS, by comparing the least-treated units to the rest. Another is to define a slightly different parameter, WAS_d, which compares outcomes to the lowest observed dose rather than to zero. Under an additional assumption, that parameter can be identified and estimated.
Because the whole strategy depends on whether a quasi-untreated group exists, they also build a tuning-free statistical test for its presence based on simple order statistics of the treatment variable.
After laying out these new estimators, they circle back to TWFE. Since we often use it anyway, they ask: under what conditions could TWFE actually be valid here? They show that if treatment effects are homogeneous and linear, then TWFE delivers the average slope. This leads to a key insight: TWFE implies that the conditional mean of outcome changes is linear in treatment dose, and this is testable. They adapt a nonparametric specification test (the Stute test) to check this linearity. Combined with standard pre-trend tests and their quasi-untreated test, this gives a three-part testing procedure: if all tests pass, TWFE may be valid. For very large samples, they also provide a faster alternative (the yatchew_test).
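As a crude stand-in for that last idea (this is not the Stute test, which their stute_test package implements properly), you can see what the testable restriction means by comparing a linear fit of outcome changes on the dose with a more flexible one:

```r
# Illustration of the testable TWFE restriction: E[dY | dose] linear in the dose.
# This is NOT the Stute test, just a rough nested-models comparison.
set.seed(3)
n  <- 800
d  <- runif(n, 0.1, 2)
dy <- 0.3 + d + 0.8 * d^2 + rnorm(n)   # deliberately non-linear in the dose

lin  <- lm(dy ~ d)
flex <- lm(dy ~ poly(d, 3))
anova(lin, flex)   # a rejection here is a warning that the dose-linearity
                   # restriction behind TWFE (and hence TWFE itself) is suspect
```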
Finally, they show the methods in action. In the case of bonus depreciation in the U.S. (2002), they find positive employment effects (often larger than the original TWFE estimates) with pre-trends largely consistent. In the case of China’s PNTR entry, their nonparametric estimates are noisy and mostly insignificant. However, when they control for industry-specific linear trends, the pre-trend and linearity tests are no longer rejected, making TWFE with trends defensible (which points to negative employment effects).
Why is this important?
Many real-world policies look exactly like this. Minimum wage hikes, trade reforms, new health programmes… these are designed to be universal, and they end up affecting everyone to some extent. Untreated groups either don’t exist or they’re so small and different that we usually drop them.
That means the standard DiD design doesn’t apply. Without a zero-treatment group, the whole logic of comparing treated and untreated units makes no sense. Yet in practice, we often still run a TWFE regression with treatment intensity as the regressor, hoping it will stand in for DiD. The problem is that with heterogeneous effects, TWFE can produce misleading results.
This paper matters because it tells us exactly what is and isn’t identifiable in these designs. If quasi-untreated units exist, treatment effects can still be recovered in a transparent way. If they don’t, the paper shows that the data itself can’t tell you the answer (an impossibility result that sets a clear boundary for empirical work).
The authors also bring new tools to the table. Their nonparametric estimators adapt ideas from RD to this setting, and their tuning-free tests give us a way to check assumptions rather than relying on hope (:P). At the same time, instead of throwing TWFE in the bin, they provide guidance on when it might still be valid, and a concrete test for its key assumption.
Also, the parameters they focus on (WAS and WAS_d) have direct policy meaning. They tell us whether a universal reform was beneficial compared to no reform, or compared to the minimal observed treatment, which ties the methods back to cost–benefit analysis.
Who should care?
First and foremost, applied microeconomists who study policies that are universal but uneven in their bite (labour market reforms, trade liberalisation, health programmes, education policies). These are exactly the settings where untreated groups don’t exist, yet exposure varies. Policy evaluators and government analysts should also pay attention. When reforms roll out nationwide, the usual evaluation strategies no longer apply. Having a framework that makes clear what can and can’t be identified is very important if you’re tasked with producing credible estimates for policy decisions. Researchers who rely on TWFE in continuous-treatment settings also have a lot at stake. The paper doesn’t say “don’t use TWFE”. The authors rather show when it is misleading and when, with the right tests, it might still be defensible. That’s a message many of us need to hear. And also methodologists and students who are expanding the DiD toolkit will find this work useful.
Do we have code?
Yes, lots. The did_had package (available in both Stata and R) estimates the WAS and can also generate pre-trend and event-study estimators. It implements the nonparametric local-linear estimator with optimal bandwidth selection and bias-corrected confidence intervals, borrowing routines from the nprobust package. For checking the assumptions behind TWFE, they provide the stute_test package (also in Stata and R). This runs the nonparametric specification test of whether outcome changes are linear in treatment dose using a wild-bootstrap implementation. It works fast for moderate datasets, though memory limits kick in as the sample size grows large. For very large datasets, the authors also offer yatchew_test, a faster alternative that scales up to tens of millions of observations.
In summary, this paper shows what to do when every unit is treated. With quasi-untreated units, treatment effects can still be identified using a new nonparametric estimator. Without them, effects are not identified unless you add assumptions (an impossibility result that sets clear limits). The authors also develop tests to decide when TWFE is valid and provide ready-to-use code. For universal reforms, this is now the reference point for what we can and can’t infer and learn.
Treatment-Effect Estimation in Complex Designs under a Parallel-trends Assumption
TL;DR: what to do when treatments are non-binary, reversible or vary at baseline? This paper introduces AVSQ event-study effects, which compare actual outcomes to a status quo where each unit keeps its initial treatment. The framework makes it possible to evaluate dynamic and multi-dose policies while offering a cost–benefit measure (the ACE), and shows why distributed-lag TWFE regressions can mislead. The authors also provide software (did_multiplegt_dyn) and illustrate the method with US newspapers and turnout.
What is this paper about?
In this paper the authors look at how to estimate treatment effects with DiD when the treatment is more complex than a one-time, binary switch. Many policies can vary in intensity, be reversed, or differ in their baseline levels across units. The usual staggered adoption design doesn’t capture these cases, even though they are common in applied work. The authors propose a way forward by comparing outcomes under the actual treatment path to a counterfactual “status-quo path” where each unit simply keeps its initial treatment. The resulting actual-versus-status-quo (AVSQ) effects generalise the ATT from staggered designs and provide a framework for analysing dynamic treatments beyond the standard setup.
What do the authors do?
The authors begin by setting up a framework for complex DiD designs, where treatments are not just binary switches but can vary in intensity, be reversed, or differ across units at baseline. To deal with this, they propose comparing each unit’s actual treatment path to a counterfactual “status-quo path” where the unit simply keeps its initial treatment level. The difference between these two paths defines what they call AVSQ event-study effects, which extend the familiar ATT from staggered adoption designs.
They then show how these AVSQ effects can be identified under two assumptions: no anticipation and PT conditional on baseline treatment. This conditional PTA is very important because it requires that units with the same initial treatment level have similar outcome trends in the absence of treatment changes, but allows units with different baseline treatments to have different trends. Because raw AVSQ effects can be difficult to interpret, they also introduce a normalised version that rescales by the change in treatment dose, turning them into weighted averages of marginal effects of current and lagged treatments. The AVSQ framework also includes a “no-crossing” condition to ensure interpretable results (this means a unit’s treatment path doesn’t zigzag above and below its initial level, which would make it unclear whether treatment increases or decreases are driving the effects). On top of this, they develop a statistic called the Average Cumulative Effect (ACE), which aggregates AVSQ effects into a cost–benefit style measure of whether the realised treatment path improved outcomes per unit of treatment compared to the status quo.
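To see the comparison behind AVSQ, here’s a toy rendering on simulated data (mine, not the authors’ estimator, which did_multiplegt_dyn implements with all the proper machinery): within each baseline-treatment group, switchers are compared to stayers, i.e., to the status quo.

```r
# Toy actual-versus-status-quo comparison; data simulated for illustration only.
set.seed(4)
n      <- 600
d1     <- sample(0:2, n, replace = TRUE)      # baseline treatment dose
switch <- rbinom(n, 1, 0.4)                   # some units raise their dose later
d2     <- d1 + switch                         # doses only move up: no "crossing"
y1     <- 1 + 0.5 * d1 + rnorm(n)
y2     <- y1 + 0.2 * d1 + 2 * (d2 - d1) + rnorm(n)   # trends depend on baseline dose

dy <- y2 - y1
# PT *conditional on the baseline treatment*: within each baseline-dose group,
# compare switchers (actual path) to stayers (status-quo path).
avsq <- sapply(0:2, function(d0) {
  mean(dy[d1 == d0 & switch == 1]) - mean(dy[d1 == d0 & switch == 0])
})
avsq   # each entry should be close to 2, the effect of a one-unit dose increase
```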
Beyond defining these new estimands, the authors take a closer look at what happens when researchers run distributed-lag TWFE regressions in these settings. They show that TWFE coefficients do not reliably estimate causal effects when treatment effects are heterogeneous: they end up mixing together different lags and in some cases can even reverse sign. To provide a cleaner alternative, they introduce a random-coefficients distributed-lag model which assumes that treatment effects are constant over time but can vary across units. Under this model they show how to identify and estimate average effects of current and lagged treatments. Estimation is straightforward when treatments take on discrete values, but becomes more complex with continuous treatments which require truncation procedures to handle cases where some units have extreme influence on the estimates.
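For reference, the kind of regression they’re warning about looks like the sketch below (a generic distributed-lag TWFE on simulated data; the formula and names are mine, not theirs). The paper’s point is that, with heterogeneous effects, these lag coefficients can mix lags together or even flip sign.

```r
# Generic distributed-lag TWFE; data and names simulated for illustration.
library(fixest)
library(dplyr)
set.seed(5)
panel <- expand.grid(id = 1:100, t = 1:8) |>
  arrange(id, t) |>
  mutate(d = runif(n()),                 # continuous, reversible treatment
         y = 0.5 * d + rnorm(n())) |>
  group_by(id) |>
  mutate(d_l1 = lag(d, 1), d_l2 = lag(d, 2)) |>
  ungroup()

feols(y ~ d + d_l1 + d_l2 | id + t, data = panel)
# With heterogeneous effects these coefficients need not equal any sensible
# average of current and lagged effects - the failure the AVSQ framework avoids.
```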
At the end, they bring these ideas to an empirical application. Revisiting Gentzkow, Shapiro, and Sinkinson’s study of newspapers and voter turnout in the United States, they extend the original analysis by allowing for dynamic effects of current and past newspaper exposure. Their AVSQ estimates show that newspapers increased turnout and that the effect of current exposure was larger than the effect of lagged exposure. A conventional TWFE regression suggests the opposite, which illustrates the practical importance of their framework.
Why is this important?
Many policies that we care about don’t look like simple one-off treatments. In Kenya, whether a district shares the president’s ethnicity changes with each election, so exposure to favouritism in public spending can switch on and off. In the US, states deregulated financial markets during the 1990s at different times and with different intensities, and some even made multiple changes. In Germany, municipalities have local business tax rates that vary across space from the start and move up or down over time. Even the number of newspapers in circulation across US counties (which is a continuous measure) has long been used to study voter turnout.
None of these cases fit perfectly into a staggered binary design. If we force them into that framework or fall back on TWFE regressions, we risk estimates that don’t mean what we think they mean. This paper is important because it provides a framework that lets us evaluate these policies as they actually happened. The AVSQ approach measures the effect of departing from the status quo baseline, the normalisation step makes the results interpretable in terms of current and lagged effects, and the ACE parameter links the findings to a cost–benefit view of policy.
It also matters that the paper shows how and why distributed-lag TWFE can give the wrong answer in these settings. It’s more than just a theoretical point: in their application to newspapers and turnout, the AVSQ framework points to strong effects from current exposure, while TWFE suggests the opposite. For applied researchers, the message is clear: dynamic and non-binary treatments need methods designed for them, not workarounds.
Who should care?
Anyone working with DiD in situations where treatment is not a clean one-off adoption, which includes researchers studying policies that can be introduced and repealed, vary in strength or differ across units from the very beginning. As the authors point out, examples range from political economy questions like ethnic alignment with leaders, to local fiscal policy where tax rates shift up and down, to historical work on newspapers and turnout. It also matters for people still relying on distributed-lag TWFE regressions to study dynamic effects. The paper shows those regressions can give sign-reversed estimates when effects differ across units. If your design involves intensity, reversals, or lagged exposure, this framework offers more credible alternatives.
In short, applied researchers in labour, public finance, political economy and economic history all have reasons to pay attention (basically anywhere policies don’t come as simple binary treatments).
Do we have code?
We have commands for both Stata and R, called did_multiplegt_dyn, which compute the AVSQ event-study effects and related estimators. These commands also implement the placebo tests and the diagnostics for dynamic effects. The paper itself is theory-heavy, but the software is designed to be practical for applied work.
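Purely for orientation, a call in R looks roughly like the sketch below. Fair warning: the argument names are from memory and may not match your installed version, so treat this as an assumption and check the package’s help file (the Stata syntax is analogous).

```r
# Indicative only: argument names below are from memory and may differ across
# versions of the package; see the DIDmultiplegtDYN documentation.
library(DIDmultiplegtDYN)

# df is a long panel with one row per group x period; "Y", "G", "T", "D" are
# hypothetical column names for outcome, group, time period and treatment.
did_multiplegt_dyn(
  df        = df,
  outcome   = "Y",
  group     = "G",
  time      = "T",
  treatment = "D",
  effects   = 5,   # number of post-switch (AVSQ) event-study effects
  placebo   = 3    # number of pre-switch placebo estimates
)
```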
In summary, this paper shows how to use DiD when treatments don’t look like neat one-time switches. The AVSQ approach compares what actually happened to a status-quo where nothing changed and the normalised version tells us how current and past exposure matter. The ACE parameter turns this into a cost-benefit measure. The authors also explain why standard distributed-lag TWFE regressions can mislead and give a simple alternative. In their application to newspapers and turnout, the new method finds current exposure matters more, while TWFE says the opposite.
Inference With Few Treated Units
TL;DR: what if only one or a few units get treated, like a single country’s reform or a couple of states changing policy? Clustered or robust errors break down because they assume many treated clusters and badly understate uncertainty. This paper surveys the fixes: borrowing from controls, pre-trends, or randomisation inference. Each comes with trade-offs. The key point is that the count of treated units drives inference, and much of what we see in practice is invalid.
What is this paper about?
This paper is about a situation many applied researchers find themselves in: what if your treatment only applies to one unit, or just a handful of them? Think of German reunification (only one country “treated”) or a state-level reform adopted by 2 or 3 states. In these cases, the usual inference machinery is much trickier to justify. We know from practice that standard cluster-robust or heteroskedasticity-robust errors rely on having many treated clusters. With only a few, the treated units have outsized leverage, standard errors are biased down and rejection rates can be way off. What the authors do is step back and survey the growing literature on this problem. They classify methods into two big “families”:
Model-based inference, which treats uncertainty as coming from potential outcomes (sampling from a super-population); and
Design-based inference, which treats uncertainty as coming from random assignment of treatment.
They then go through the different approaches available when treated units are scarce. Some methods lean on cross-sectional variation (borrowing information from controls), others on time-series variation (borrowing from pre-treatment periods). Each approach makes trade-offs: cross-sectional methods require many controls but allow unrestricted serial correlation; time-series methods require many untreated periods but allow cross-sectional dependence. They also examine extreme cases, like when there’s *literally* only one treated unit and one treated period. Here, inference is only possible under strong assumptions, like assuming treatment effects are homogeneous or restricting how errors behave. As more treated units, periods, or within-cluster observations become available, assumptions can be relaxed somewhat, but the problem never fully goes away.
The main message is that the inferential challenge is about the number of treated units. Even if you have a thousand controls, one treated unit means asymptotics won’t save you. The survey pulls together existing fixes (wild bootstrap variants, randomization inference, conformal methods, etc.), shows how some heuristics can be justified theoretically and proposes small modifications to improve finite-sample performance.
This is a good paper that provides us with a map of what can and can’t (shouldn’t?) be done when “few treated” is the reality, and links econometrics back to the practical settings (DiD, SC, matching) where these problems show up most often.
What do the authors do?
They begin with a simple example: estimating a treatment effect when only one unit is treated. With just one treated unit, heteroskedasticity-robust standard errors collapse (what variance would be left to estimate anyway?) and cluster-robust errors in a DiD setting massively understate uncertainty. They use this to show why conventional inference fails when the number of treated units is small (regardless of how many controls you have).
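Here’s a small simulation of exactly that failure (mine, not from the paper; the numbers and names are illustrative): one treated state, fifty controls, a true effect of zero, and cluster-robust inference that rejects far more often than 5%.

```r
# Illustration: with ONE treated cluster, cluster-robust DiD inference
# badly over-rejects a true null. Simulation details are made up.
library(fixest)
set.seed(6)

one_sim <- function(n_states = 51, n_periods = 10) {
  df <- expand.grid(state = 1:n_states, t = 1:n_periods)
  df$post    <- as.integer(df$t > n_periods / 2)
  df$treated <- as.integer(df$state == 1)     # a single treated state
  df$d       <- df$treated * df$post          # true effect is zero
  # state-specific post-period shock: within-state correlation the FEs miss
  df$y <- rnorm(n_states)[df$state] * df$post + 0.2 * rnorm(nrow(df))
  fit <- feols(y ~ d | state + t, data = df, cluster = ~state)
  pvalue(fit)["d"] < 0.05
}

mean(replicate(500, one_sim()))   # rejection rate of a true null; far above 5%
```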
From there, they organise the literature into categories:
Cross-sectional approaches: methods like Conley and Taber (2011), which assume the error distribution for treated and control units is comparable and typically require homogeneous treatment effects or strong distributional assumptions. With many controls you can use the control residuals to approximate the treated distribution. Other refinements handle heteroskedasticity (e.g. state size differences) or variation in treatment timing.
Time-series approaches: methods that rely on long pre-treatment periods and require stationarity and weak dependence assumptions on the errors. Here the idea is to compare the treated unit’s residuals after treatment to its residuals before treatment, testing whether the last observation looks unusually large. These require assumptions like stationarity and weak dependence but allow for cross-sectional correlation.
Allowing for stochastic treatment effect heterogeneity: they discuss when inference can be framed around testing sharp nulls (no effect whatsoever) or around realised treatment effects conditional on shocks. This shifts the interpretation of what we’re learning but opens up additional strategies.
Design-based approaches: when treatment is randomly assigned but only to a handful of units, the focus shifts. They show how randomisation inference can still be valid, but imbalances between treated and controls become much more likely with small treated groups (see the toy placebo sketch below).
Sharp null testing: many methods test whether there is no effect whatsoever (sharp nulls) rather than testing average treatment effects, which changes how results should be interpreted.
Along the way, they show connections between methods (e.g., how a wild bootstrap with the null imposed can be asymptotically equivalent to a randomisation test) and they propose tweaks that improve finite-sample performance.
The unifying theme is that each method trades one type of assumption for another. Cross-sectional methods need many controls and comparability. Time-series methods need long pre-trends and stationarity. Design-based methods hinge on how treatment was assigned. The survey doesn’t offer a “best” choice, but it does clarify what’s available, what each method buys you and where each one fails.
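As flagged in the design-based bullet above, here’s a toy version of the placebo/permutation logic with a single treated unit (my sketch, not any specific paper’s procedure): reassign the “treated” label to each control in turn and ask how extreme the real estimate is in that placebo distribution.

```r
# Toy placebo/permutation inference with one treated unit (illustrative only).
set.seed(7)
n_units <- 30
pre  <- matrix(rnorm(n_units * 5), n_units, 5)   # 5 pre-treatment periods
post <- matrix(rnorm(n_units * 5), n_units, 5)   # 5 post-treatment periods
dy   <- rowMeans(post) - rowMeans(pre)           # each unit's pre/post change

treated_id <- 1
did_est <- function(id) dy[id] - mean(dy[-id])   # simple DiD against everyone else

actual   <- did_est(treated_id)
placebos <- sapply(setdiff(1:n_units, treated_id), did_est)

# two-sided "p-value": how extreme is the real estimate among the placebos?
mean(abs(placebos) >= abs(actual))
```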
Why is this important?
I’m preaching to the choir here, but situations like these are everywhere. Many influential DiD and SC papers study events like a single country’s reform, one state-level policy or a handful of treated clusters. Standard inference procedures rely on asymptotic theory that fails when the number of treated units is small, regardless of the total sample size, so they underestimate uncertainty and can lead to massive over-rejection.
This paper gives us both structure and clarity on a messy corner of applied practice. Instead of a patchwork of one-off fixes, it provides a taxonomy: which methods work when you only have one treated unit, when you have a few, when you have many controls or when you have long pre-treatment histories. It also explains the assumptions you’re implicitly making when you pick one approach over another.
There’s also a bigger point: forget about the share of treated units and focus on the count. Even if 1% of units are treated out of a thousand, that’s still just ten treated units, and asymptotics won’t help much. Knowing this helps us avoid the false comfort of “large N” when what actually matters is the number of treated clusters.
The survey also connects different strands of work to demonstrate that many of the inferential problems are variations of the same underlying issue. For practitioners, that means better tools and clearer diagnostics. For theorists, it ties together a set of results that were previously all over the place. It also reveals that methods designed for single treated units may be more powerful than methods requiring multiple treated units, even when more treated units are available, which is an important practical consideration for researchers.
Who should care?
Applied researchers working with one or a few treated units (country case studies, state-level reforms or costly RCTs). Referees and editors too, since many papers still report invalid clustered errors in these settings.
Do we have code?
Not really: this is a survey of inference approaches rather than a new estimator, so there isn’t a single package to point to.
In summary, this paper reviews what to do when only a handful of units are treated. Standard inference fails in these cases, and the number of treated units - not their share - drives the problem. The authors map out model-based, time-series, cross-sectional, and design-based approaches, show how existing heuristics can be justified and suggest tweaks to improve small-sample performance.
It took me a good half an hour to understand this because the justification for what classifies as a baseline factor is context-dependent. How do we know if something is a baseline factor? Apparently the simplest rules are: 1) it must already be in place before the event, and 2) it must not *plausibly* (oh no my identification assumption!) be changed by the event. We should be able to look at it and say: this is a characteristic of the unit that existed beforehand and won’t flip suddenly just because the event occurred. From what I understand, there are three practical checks and they relate to timing, persistence and isolation (to a certain extent). Timing refers to: was the factor measured before the event? If not, it can’t be baseline. Persistence relates to the factor being something like slow-moving or fixed (e.g., geography, long-standing institutions or cultural traits). And finally, isolation: can the event directly or indirectly alter it in the short run? If yes, it’s not safe to treat it as baseline. Going back to what the authors use as an example, social capital measured in the 1950s works in a famine study because kinship networks don’t suddenly disappear during a few “bad years”. But soil quality would not work if the “event” were a soil restoration reform, since the policy directly targets the factor itself.
“Factorial” comes from stats. Fisher (also here) and Yates developed factorial experiments to study the main effects of two factors and their interaction. FDiD borrows from this logic: treat the event (pre/post) and a baseline factor (e.g., high/low social capital) as the two factors, then interpret what DiD identifies under stated assumptions. For background, the authors recommend these: VanderWeele (2009); Dasgupta et al. (2015); Bansak (2020); Zhao and Ding (2021).
Effect modification: identified with the usual DiD assumptions (universal exposure, no anticipation and canonical parallel trends). This is descriptive, not causal. Causal moderation: to interpret the contrast as causal, you need a stronger assumption, which the authors call factorial PT. ATT analogue: if you add an exclusion restriction that one group is exposed but unaffected, then the design “collapses” to the familiar ATT interpretation from canonical DiD. Baseline effect given exposure: if instead you assume the baseline factor has no effect in the absence of the event, then the estimator captures how much the baseline factor matters once the event has happened.
In the Concluding Remarks, the authors make 3 recommendations for applied researchers:
Be explicit about the design. When using FDiD, say so. Don’t call it a standard DiD, because the estimand is different.
State clearly what assumptions justify your claim. With only canonical PT, you can talk about effect modification; to claim causal moderation, you need factorial PT; for ATT-like interpretations, you need an exclusion restriction.
Check robustness with alternative assumptions and specifications. For example, examine whether your baseline factor is truly unaffected by the event, and test sensitivity to different ways of grouping or measuring it.