Hello! We have four new papers to cover :)
Here they are:
Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random, by Lorenzo Testa, Edward H. Kennedy and Matthew Reimherr
Triply Robust Panel Estimators, by Susan Athey, Guido Imbens, Zhaonan Qu and Davide Viviano
Cohort-Anchored Robust Inference for Event-Study with Staggered Adoption, by Ziyi Liu
Aggregating Average Treatment Effects on the Treated in Difference-in-Differences Models, by Partha Deb, Edward C. Norton, Jeffrey M. Wooldridge and Jeffrey E. Zabel
Before we jump to those papers, I will recommend some DiD applied papers I found (actually went looking for them) online.
→ In the next post I want to focus on DiD papers by JOB MARKET CANDIDATES, both theory and applied. If you know of someone’s paper (or you wrote one yourself), please let me know (send me an email at b.gietner@gmail.com) :) ←
Let’s begin with these ones, but there are many more on NBER:
Maternity Leave Extensions and Gender Gaps: Evidence from an Online Job Platform, by Hanming Fang, Jiayin Hu and Miao Yu (DDD, China, unintended consequences of maternity leave extension on gender gaps in the labour market)
Price and Volume Divergence in China’s Real Estate Markets: The Role of Local Governments, by Jeffery (Jinfan) Chang, Yuheng Wang, and Wei Xiong (dynamic DiD, China, divergence between price and volume in residential land and new housing transactions across Chinese cities during the Covid-19 pandemic)
Moving for Good: Educational Gains from Leaving Violence Behind, by María Padilla-Romo and Cecilia Peluffo (stacked DiD, Mexico, effects of moving away from violent environments into safer areas on migrants’ academic achievement)
Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random
(There’s a considerable amount of stats lingo in this paper, which I tried to explain in the footnotes, but I’d recommend having this Glossary saved somewhere. I’d appreciate it if someone could put together a “Stats lingo for economists” *wink*)
TL;DR: when outcomes are missing in panel data, simply discarding those cases can result in a biased sample. This paper sets out a framework for DiD under missing-at-random assumptions and develops estimators that achieve efficiency while remaining robust to some model misspecification.
What is this paper about?
This paper speaks to all of us who have ever had to drop observations in longitudinal/panel data because of incomplete information on covariates and/or attrition1. This practice, in one way or another, introduces selection bias2. How, then, should we conduct DiD analysis when pre-treatment outcome data are missing for a subset of the sample? This paper provides a formal framework for identifying and efficiently estimating the ATT when pre-treatment outcomes are missing at random (MAR)3. In the Appendix, they also extend it to two scenarios: one where all pre-treatment outcomes are observed but post-treatment outcomes can be MAR, and another in which outcomes can be missing both before and after treatment.
What do the authors do?
They start by setting up the problem in a 2x2 DiD4, then state the assumptions needed for identification (including two alternative assumptions on the missingness mechanism for the pre-treatment outcome5: one where baseline outcomes are missing at random given covariates and treatment, and another where missingness can also depend on the post-treatment outcome) and derive semiparametric efficiency bounds6, which set the benchmark for the lowest possible estimation variance.
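For orientation, here is the textbook 2x2 identification result their setup builds on, in standard notation (Y_0 is the pre-treatment outcome, Y_1 the post-treatment outcome, A the treatment indicator, matching the footnotes below). This is the complete-data version, before any missingness enters the picture:

```latex
% Canonical 2x2 DiD: under parallel trends, the ATT is the difference of differences
\mathrm{ATT}
  = \big( \mathbb{E}[Y_1 \mid A=1] - \mathbb{E}[Y_0 \mid A=1] \big)
  - \big( \mathbb{E}[Y_1 \mid A=0] - \mathbb{E}[Y_0 \mid A=0] \big)
```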
Building on this, they then construct estimators based on influence functions7 and cross-fitting8. These estimators achieve the efficiency bound and are multiply robust, which means consistency is guaranteed if *at least one* of several combinations of nuisance functions9 is correctly specified. They also develop a way to estimate the “nested regression”10 needed under the more complex MAR assumption using regression or conditional density methods and show that augmented approaches can recover efficiency even with misspecification.
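To make the cross-fitting idea concrete, here is a minimal Python sketch of a cross-fitted, doubly robust DiD estimator of the ATT on complete cases. Everything in it is generic (scikit-learn learners, a made-up input layout, a standard doubly robust DiD moment); the paper’s estimators add the missingness-probability and nested-regression nuisances on top of this, so read it as the flavour of the construction, not the authors’ multiply robust estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import KFold

def cross_fit_dr_att(X, A, y_pre, y_post, n_folds=5, seed=0):
    """Cross-fitted doubly robust DiD ATT on complete cases (illustrative sketch).

    X: (n, p) covariates; A: (n,) treatment indicator; y_pre/y_post: outcomes.
    Nuisances (propensity score and control outcome-change regression) are fit
    on held-out folds, then plugged into a doubly robust DiD moment.
    """
    n = len(A)
    dy = y_post - y_pre                      # outcome change
    pi_hat = np.zeros(n)                     # propensity score P(A=1 | X)
    mu0_hat = np.zeros(n)                    # E[dy | X, A=0]

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        ps = LogisticRegression(max_iter=1000).fit(X[train], A[train])
        pi_hat[test] = ps.predict_proba(X[test])[:, 1]
        ctrl = train[A[train] == 0]          # control units in the training folds
        reg = LinearRegression().fit(X[ctrl], dy[ctrl])
        mu0_hat[test] = reg.predict(X[test])

    # DR-DiD moment: reweight controls toward the treated, subtract the outcome model
    w1 = A / A.mean()
    w0 = (1 - A) * pi_hat / (1 - pi_hat)
    w0 = w0 / w0.mean()
    return np.mean((w1 - w0) * (dy - mu0_hat))
```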
Finally, they validate their estimators through a very extensive simulation study. By systematically varying whether nuisance models are correctly or incorrectly specified, they demonstrate that the estimators perform as theory predicts: bias and RMSE11 are negligible when the robustness conditions are met, and performance deteriorates only when the assumptions are violated.
Why is this important?
Dropping observations/units with missing baseline outcomes is common practice, but it introduces selection bias and can distort treatment effect estimates. This paper shows how to formally identify and estimate the ATT even when pre-treatment or post-treatment outcomes are missing at random. By deriving efficiency bounds, the authors also tell us how well any estimator could possibly do. Their proposed estimators are not only efficient but also multiply robust, meaning they remain consistent as long as certain subsets of nuisance models are specified correctly. This is a strong safeguard for applied work, where model misspecification is almost inevitable.
Who should care?
Anyone using DiD with survey or administrative data where attrition, item nonresponse or incomplete records are common, which then includes applied researchers in labour economics, education, and health, as well as methodologists developing new DiD estimators.
Do we have code?
The paper itself does not release replication code, but the estimators are influence-function based and can be implemented with standard tools for cross-fitting and doubly robust estimation in R or Python. The design closely parallels DR-Learner and targeted ML routines, so adapting existing code should be straightforward. The Monte Carlo simulations in the paper provide guidance on implementation.
In summary, the authors extend the DiD framework to situations where pre or post-treatment outcomes are not fully observed. They derive efficiency bounds that show the best precision researchers can hope for and propose estimators that reach these bounds while retaining robustness if certain models are misspecified. This gives applied researchers a principled way to handle incomplete outcome data, a problem that arises frequently in labour, education and health applications.
Triply Robust Panel Estimators
(This one is also about panel data, but you should remember some concepts from linear algebra and matrix theory12)
TL;DR: when evaluating policies with panel data, choosing the wrong estimation method can completely change your answer, but there’s no way to test which method is right. This paper introduces TROP, an estimator that protects you from choosing wrong by combining unit matching, time weighting and outcome modeling, so if any one approach is valid, you get the right answer. It consistently outperforms traditional methods across diverse real-world settings.
What is this paper about?
The core problem in causal panel data estimation is estimating what would have happened (the counterfactual) to a treated unit had it not been treated, given that unobserved factors likely drive both the outcome and the treatment decision. The challenge with panel data and a binary intervention is not the lack of causal inference methods but the abundance of them. There’s a myriad to choose from: TWFE DiD, SC, MC, or a hybrid like SDiD, and each comes with its own identifying story (PT, factor models, or pre-treatment fit). These identifying assumptions cannot be tested against one another, so choosing between them often feels arbitrary.
Even when you settle on a method, their weighting “rules” raise problems. DiD spreads weight evenly across all periods and units. SC fixes a single set of unit weights and applies them to every post-treatment period. Both approaches ignore that some pre-treatment periods are more predictive than others. Also, SC itself is not built for modern settings where multiple units receive treatment at different times or where treatment can switch on and off. Its balancing logic does not extend naturally to these cases. Messy landscape much?
The paper steps in with TROP (Triply RObust Panel), an estimator designed to work across these situations rather than being tied to one fragile set of assumptions. Think of it as a general, unifying framework, from which all the others are “special cases”.
What do the authors do?
The key idea of their estimator is to unify existing approaches rather than picking one. TROP builds counterfactuals for treated units by combining three things:
A flexible outcome model – a low-rank factor structure13 layered on top of unit and time fixed effects → this captures broad common trends and latent dynamics that DiD alone would miss.
Unit weights – like in SC, treated units are matched with controls that had similar pre-treatment trajectories, but the weights are “learned” from the data and can vary.
Time weights – unlike DiD or SC, TROP doesn’t treat all pre-periods as equally useful. It can down-weight very old data and put more weight on periods closer to treatment.
All three components are tuned jointly using cross-validation, so the estimator learns from the data which combination best predicts untreated outcomes. Because of this structure, TROP nests existing methods: if you shut down the factors, or set all weights equal, you recover DiD, SC, SDID, or MC as special cases.
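Here is a stylized Python sketch of how the three ingredients can be combined to predict a treated unit’s untreated outcomes. It is a cartoon under my own simplifying choices (one treated unit, SVD time factors, exponential time decay, a ridge-then-normalise stand-in for simplex unit weights); the paper’s Algorithms 1–3 tune these pieces jointly by cross-validation and handle staggered, multi-unit adoption, so this is not the authors’ implementation.

```python
import numpy as np

def trop_style_counterfactual(Y, treated, T0, rank=2, decay=0.9, ridge=1e-3):
    """Stylized TROP-like counterfactual for a single treated unit.

    Combines (1) a low-rank factor outcome model, (2) SC-style unit weights and
    (3) time weights over pre-treatment periods. Illustrative only.

    Y: (N, T) outcome matrix; unit `treated` is treated from period T0 onward.
    Returns predicted untreated outcomes for periods T0..T-1.
    """
    N, T = Y.shape
    ctrl = np.array([i for i in range(N) if i != treated])
    Yc = Y[ctrl]                                      # (N-1, T) control outcomes

    # (1) outcome model: latent time factors estimated from the control block
    U, s, Vt = np.linalg.svd(Yc - Yc.mean(axis=0), full_matrices=False)
    F = Vt[:rank]                                     # (rank, T) time factors
    load_c = (Yc - Yc.mean(axis=0)) @ np.linalg.pinv(F)
    m_c = Yc.mean(axis=0) + load_c @ F                # model fit for controls
    load_t = np.linalg.lstsq(F[:, :T0].T,
                             Y[treated, :T0] - Yc.mean(axis=0)[:T0], rcond=None)[0]
    m_t = Yc.mean(axis=0) + load_t @ F                # model prediction for treated

    # (2) time weights: pre-periods closer to adoption count more
    lam = decay ** np.arange(T0 - 1, -1, -1)
    lam /= lam.sum()

    # (3) unit weights: time-weighted ridge fit of the treated unit's
    #     pre-treatment residuals using control residuals
    Rc, rt = Yc - m_c, Y[treated] - m_t
    G = (Rc[:, :T0] * lam) @ Rc[:, :T0].T + ridge * np.eye(len(ctrl))
    w = np.linalg.solve(G, (Rc[:, :T0] * lam) @ rt[:T0])
    w = np.clip(w, 0, None); w /= max(w.sum(), 1e-12)

    # combine: model prediction + reweighted control residuals + a time-weighted
    # level correction for any remaining pre-period mismatch
    post = slice(T0, T)
    intercept = lam @ (rt[:T0] - w @ Rc[:, :T0])
    return m_t[post] + w @ Rc[:, post] + intercept
```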
More importantly, this combination gives TROP its superior theoretical guarantee: the property of triple robustness. TROP is asymptotically unbiased14 if any one of three conditions holds: a) perfect balance over unit loadings, b) perfect balance over time factors, or c) correct specification of the regression adjustment. This structure means the estimator’s final bias is tightly bounded by the product of the errors in these three components, making it more robust than any existing doubly-robust estimator.
The authors then put TROP through a series of semi-synthetic simulations calibrated to classic datasets (minimum wage, Basque GDP, German reunification, smoking, boatlift). Across 21 designs, TROP outperforms existing methods in 20 of them. In the one exception (Boatlift treated unit case), standard SC performs better by 24%. They also show why: the ability to combine unit weights, time weights and factor modeling makes it more robust when PT fail, when pre-treatment fit is imperfect or when the assignment is more complex.
Why is this important?
Those of us doing applied work are often stuck with panel data that looks nothing like the textbook case, e.g. interventions don’t arrive all at once, different units get treated at different times, and PT are hard to justify. In those settings, the choice of estimator can leave us confused, and this conundrum is laid out in the paper. The authors show that DiD might work well in one dataset and fail miserably in another, while SC can look great in some cases and terrible in others. There’s no single safe bet.
This is exactly the situation policy evaluations run into. A government wants to know whether to keep, expand, or cut back a programme. The answer depends on estimates that are only as good as the assumptions behind them. If you can’t be sure whether PT hold, or whether a factor model really captures the dynamics, you’re taking a gamble.
Here’s why TROP is so useful: it doesn’t force you to pick one story. It combines unit weights, time weights and an outcome model in a way that means if one assumption fails but another holds, the estimator still works. For applied economists, that triple robustness acts as a safeguard when the data don’t really fit into any single framework15.
Who should care?
Applied econometricians and empirical researchers should care most, as TROP directly addresses the daily challenges of data complexity. More specifically, anyone working with panel data (for estimating treatment effects, e.g., policy impact, firm-level interventions) with staggered treatment timing or multiple treated units, where traditional DiD doesn’t work well. Researchers who have to deal with interactive confounding should care too, since TROP provides a defense against the interactive fixed effects (unobserved trends that affect units heterogeneously) that the authors find are both common in real data and fatal to the simpler DiD model. People involved in high-stakes policy analysis should care (but I don’t believe this paper will reach them) since TROP “protects” against uncertain assumptions. Also data scientists, since they often work with large panel or time-series cross-section data in tech, finance or marketing.
Do we have code?
The paper provides detailed algorithms (Algorithm 1, 2, 3) with step-by-step procedures. While there may not be a published software package, the algorithms are complete enough that implementation is straightforward.
In summary, TROP is best thought of as a unifying method for causal inference with panel data. Instead of choosing between DiD, SC, or MC, it combines their core ideas: balance units, weight time and model outcomes flexibly. That design gives it a “triple robustness”, meaning that if any one of these channels is valid, the estimator delivers. The simulations make clear how unstable results can be if you commit to a single method. For applied economists, the value of TROP is that it consistently performs well across very different settings. It’s a practical safeguard when you face uncertainty about which assumptions your data can plausibly support.
Cohort-Anchored Robust Inference for Event-Study with Staggered Adoption
(Ziyi is a third-year PhD student at UC Berkeley, Haas School of Business. Keep an eye out for him! This paper is quite long but it’s well-written and Ziyi explains the questions really well)
TL;DR: Ziyi introduces a cohort-anchored framework for robust inference in staggered DiD event-studies, using block biases in place of unreliable aggregated pre-trend checks and yielding confidence sets that stay valid, and are often more informative, when cohorts differ.
What is this paper about?
Event studies with staggered adoption lean on the PTA. Since PT cannot be tested directly, we usually eyeball the pre-trend plot and/or test whether pre-treatment coefficients are jointly zero. But these approaches are quite limited: the tests have low power, and conditioning on them can distort inference. Rambachan and Roth (2023) proposed a more robust approach: rather than treating PT as all-or-nothing, use the observed pre-trends to bound how large the post-treatment violation could plausibly be. This delivers confidence sets that remain valid even when PT doesn’t hold “exactly”16.
That works fine when treatment happens at one point in time. With staggered adoption, things get messier because inference usually relies on event-study coefficients aggregated across cohorts. This gives birth to three issues: 1) relative-period coefficients from TWFE can be contaminated by weighted averages of heterogeneous effects, sometimes with negative weights17; 2) cohort composition shifts over relative time, so pre and post-trends are not directly comparable18; and 3) for estimators that use not-yet-treated controls, the control group itself changes as cohorts adopt19.
Let me illustrate the issue. Let’s imagine two states raising the minimum wage, but at different times. State A does it in 2005, State B waits until 2007. A standard event-study lines them up by “time since adoption” and averages the results. That looks ok on a plot, but it hides a problem: the early pre-period is based only on State B, while the long post-period is based only on State A. If A and B had different pre-trends, then the “average pre-trend”20 you’re using as a benchmark has nothing to do with the post-treatment cohorts you care about.
This is where Ziyi’s paper comes in. Instead of averaging across everyone, it anchors inference at the cohort level21. Each group of adopters is always compared to the same set of controls (here defined as the units that were untreated when that cohort first adopted)22. Ziyi calls the resulting comparison a block23, and the difference in trends between the cohort and its fixed control group is the block bias. Because the block bias has the same definition before and after treatment, the pre-period provides a credible benchmark24 for what could happen post-treatment (meaning it has a consistent interpretation across all periods)25. The paper then shows how to use this setup to build robust inference26 in event-studies even when treatment timing is staggered and cohorts look different from each other.
What does the author do?
Ziyi does a lot. As I said, he develops a cohort-anchored robust inference framework that operates at the cohort–period level. His framework is designed to address the three aforementioned methodological complications that arise under staggered adoption: negative weighting from TWFE event-studies, shifting cohort composition across relative periods and changing control groups in estimators that use not-yet-treated units.
The central building block is the block bias (Δ), which is the difference in trends between a treated cohort and its fixed initial control group. Because the block bias is defined consistently before and after treatment, the observed pre-treatment block biases provide a valid benchmark for bounding the unobserved post-treatment ones. The author then establishes a bias decomposition: the overall estimation bias (δ) contaminating any cohort-period treatment effect estimate can be written as an invertible linear transformation of block biases, δ = WΔ. Here, each cohort’s overall bias equals its own block bias plus a weighted sum of block biases from cohorts that adopt later, with the weights determined by relative cohort sizes and adoption timing. This lets us impose transparent restrictions directly on the interpretable block biases and then translate those restrictions into bounds on the overall bias. The result is a valid confidence set for the treatment effect.
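Here is a toy numerical illustration of how restrictions on block biases translate into bounds on the overall bias. The W below is hypothetical (unit diagonal, weight only on later-adopting cohorts, matching the verbal description above), and the box restriction on Δ is made up purely to show the mechanics; in the paper W comes from cohort sizes and adoption timing, and the restrictions come from pre-treatment variation.

```python
import numpy as np
from itertools import product

# Three cohorts ordered by adoption time. Hypothetical weights: each cohort's
# overall bias = its own block bias + weighted block biases of later adopters.
W = np.array([[1.0, 0.3, 0.2],
              [0.0, 1.0, 0.4],
              [0.0, 0.0, 1.0]])

# Suppose each block bias is restricted to lie in [-1, 1].
corners = np.array(list(product([-1.0, 1.0], repeat=3)))   # vertices of the box

# delta = W @ Delta is linear, so the extreme overall biases sit at the vertices.
delta = corners @ W.T
for g in range(3):
    print(f"cohort {g}: overall bias in [{delta[:, g].min():+.1f}, {delta[:, g].max():+.1f}]")
```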
He then implements two specific types of restrictions (adapted from Rambachan and Roth, 2023): relative magnitudes (RM), which bounds how much a cohort’s block bias can change from one period to the next, with the benchmark coming from pre-treatment variation; and second differences (SD), which bounds the change in the slope of the block bias path (well-suited to settings where pre-trends look approximately linear). These restrictions can be applied globally (using the largest observed pre-trend variation across cohorts) or cohort-specifically (using each cohort’s own pre-trends).
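Schematically, and borrowing Rambachan and Roth’s notation for a single bias path δ (in Ziyi’s framework the analogous restrictions are imposed on the block biases Δ, globally or cohort by cohort), the two families look roughly like this:

```latex
% Relative magnitudes (RM): post-treatment changes in the bias path are at most
% \bar{M} times the largest change observed pre-treatment
|\delta_{t+1} - \delta_{t}| \;\le\; \bar{M} \,\max_{s < 0}\, |\delta_{s+1} - \delta_{s}|
\qquad \text{for post-treatment } t

% Second differences (SD): the slope of the bias path changes by at most M per period
|(\delta_{t+1} - \delta_{t}) - (\delta_{t} - \delta_{t-1})| \;\le\; M
```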
His framework is illustrated in two simulation exercises with heterogeneous pre-trends. Compared to the aggregated approach, the cohort-anchored method delivers confidence sets that are better centered on the truth and, in some cases, narrower. He then finishes it by revisiting Callaway and Sant’Anna’s (2021) study of minimum wages and teen employment. Under the cohort-anchored framework, the confidence set remains centered well below zero even after accounting for cohort-specific linear pre-trends (which means the negative employment effect is robust). The aggregated approach, by contrast, centers around zero and kinda obscures this conclusion.
Why is this important?
Ziyi’s paper addresses the issues left unresolved by the major innovations in the DiD literature between 2018 and 2021. Traditional TWFE estimators are invalid under heterogeneous treatment effects. New HTE-robust estimators were developed to fix this, but they introduced complications for the robust inference framework of Rambachan and Roth (2023), which aggregates results across cohorts. The key problems are dynamic treated composition (the average pre-trend is irrelevant for the cohorts driving the post-trends) and dynamic control groups (the definition of parallel trends shifts as adoption unfolds).
By introducing the concept of block bias, Ziyi provides a coherent way to conduct robust inference in this setting. Block bias solves the dynamic control group problem by anchoring comparisons to a fixed baseline, and the cohort-anchored framework demonstrates that the aggregated approach can distort results when cohorts have heterogeneous pre-trends. The cohort-anchored method instead produces confidence sets that are better centered on the true effect.
Who should care?
Anyone running event-studies with staggered adoption, which includes researchers studying policies that roll out at different times across states, firms or schools, as well as applied economists working with treatment programs that expand gradually. If you rely on modern HTE-robust estimators and want to report confidence sets that remain valid without assuming exact PT, check this paper out. It is especially relevant when pre-trends differ across cohorts since the standard aggregated approach can distort inference in that case.
Do we have code?
Ziyi told me he’s working on a package to accompany the paper, and he hopes to make it public soon.
In summary, this paper pushes robust inference in DiD one step further. Rambachan and Roth (2023) showed how to move beyond binary PT by using pre-trends to bound post-treatment violations, but their framework was limited when treatment timing was staggered. Ziyi builds on that insight by introducing the concept of block bias and anchoring inference at the cohort–period level. The result is a framework that works with modern HTE-robust estimators, avoids the distortions of aggregated event-studies and produces confidence sets that remain valid even when cohorts look very different from each other. The simulations and the minimum wage reanalysis show clearly that this matters: once you anchor inference properly, robust negative employment effects emerge that the standard approach washes out.
Aggregating Average Treatment Effects on the Treated in Difference-in-Differences Models
(Continuing on the “weights” subject…)
TL;DR: this paper shows that when collapsing cohort–time treatment effects into a single ATET, you have to think about the weights. Theory says they should come from treated observations in treated periods, but Stata’s csdid and hdidregress also add pre-period counts, which can push the overall ATET up or down in unbalanced samples. In simulations that choice changed results by 16%. Estimation is not the issue here, since different DiD estimators give the same cohort–time effects. You should think about aggregation: if you report a single ATET, make sure you know what weights produced it.
What is this paper about?
DiD with staggered adoption produces a “grid” of treatment effects: by cohort and by time. The issue is that we usually don’t present the whole grid. Instead, we collapse it into a single number: the overall Average Treatment Effect on the Treated (ATET).
This collapse requires weights. The natural way is to weight each cohort–time estimate by how many treated observations it represents, i.e. bigger treated groups get more weight and smaller ones get less. But here’s the thing: the most widely used software (Stata’s csdid and hdidregress) doesn’t “just” use treated-period observations, but also adds the number of pre-treatment observations to the weights, and it does this again for every post-treatment year.
Think of it like averaging exam scores across years. Suppose a school has: 100 students in year 0 (pre), 50 students in year 1 (post) and 30 students in year 2 (post) (imagine the school is in Japan). If we average post-treatment outcomes by the number of treated students, year 1 gets weight 50 and year 2 gets weight 30. But if we also keep adding the 100 pre-treatment students, year 1 gets weight 150 and year 2 gets weight 130. That shifts the average toward later years, even though no one is actually treated in year 0.
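In code, with hypothetical year-1 and year-2 effects of 2 and 4 just to have some numbers, the two weighting schemes from this toy example mechanically disagree:

```python
import numpy as np

effects = np.array([2.0, 4.0])      # hypothetical effects in post years 1 and 2
n_post = np.array([50, 30])         # treated students observed in each post year
n_pre = 100                         # pre-treatment students (year 0)

intuitive = np.average(effects, weights=n_post)           # weights 50, 30
stata_like = np.average(effects, weights=n_pre + n_post)  # weights 150, 130

print(intuitive, stata_like)        # 2.75 vs ~2.93: the extra pre-period counts
                                    # pull the average toward the later, larger effect
```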
This is what the authors call attention to: the aggregation formula in Stata does not match the one described in the Callaway–Sant’Anna paper, and it can meaningfully change the reported ATET27.
What do the authors do?
They take a close look at how different DiD estimators aggregate treatment effects into an overall ATET. They start by showing that at the cohort–time level many methods are equivalent: FLEX (their own estimator), Callaway–Sant’Anna, Borusyak–Jaravel–Spiess imputation, and others all return identical treatment effect estimates when specified comparably.
The differences emerge only at the aggregation stage. Using simulated repeated cross-section data with staggered treatment adoption, they construct a setting with three cohorts entering treatment at different times, with treatment effects that grow over time. They show that FLEX, BJS imputation, and de Chaisemartin–D’Haultfœuille all aggregate cohort–time effects using the intuitive weights (treated observations only); they contrast this with Stata’s implementation of Callaway–Sant’Anna, which adds pre-treatment observations into the weights for every post-treatment period; and they quantify the effect: in their simulation, the “true” weighted ATET was about 2.8, while the Stata aggregation gave 3.25, a 16% difference *entirely* due to weighting.
They then repeat the exercise with multiple simulations showing that this is not a one-off artefact: whenever sample sizes vary across periods and treatment effects evolve over time, the difference persists. In some runs, the Callaway–Sant’Anna aggregated ATET was even further from the truth than the biased TWFE estimate.
Why is this important?
My interpretation of it is that the theory is fine, but the software implementation (and what we make of it) needs attention. Callaway and Sant’Anna (2021) explicitly describe several aggregation formulas. Their main one (eq. 3.10 in the paper) uses weights proportional to the number of treated observations in treated periods. They don’t propose adding pre-period observations. Stata’s csdid and hdidregress (postestimation commands) implement a weighting scheme that does add pre-period sample sizes to every treated-period weight. This is not what the paper describes.
The authors show that this implementation detail (documented neither in the C&S paper nor clearly in the Stata manuals) produces different aggregated ATETs in unbalanced data. So the theory is fine (C&S gave interpretable options), but the software implementation diverged from those theoretical weights, creating results many of us don’t realise we’re reporting.
In balanced panel data, all periods have the same number of observations, so the weighting formulas coincide and nothing changes. In repeated cross-sections, which is the norm in survey and administrative data, the number of observations usually varies by year. If sample sizes shrink in treated periods, adding pre-period counts into the weights inflates the importance of later treatment effects. If treatment effects grow over time, this will push the overall ATET upwards (as in their simulation). If effects fade, the opposite could happen.
The broader point is: DiD is no longer just about identification and estimation of cohort–time effects. Aggregation is itself a modelling choice. Depending on the software, we might end up reporting an average with no clear interpretation.
Who should care?
Anyone using modern DiD estimators with staggered treatment timing should. The problem affects the way applied results are reported. Applied researchers relying on survey or administrative data: if your sample sizes vary across years, the “overall ATET” you get from Stata may not reflect the intuitive average you think you are reporting. Policy analysts: your conclusions often hinge on whether an intervention effect is large or modest, and a 10–20% variation in the headline estimate due to aggregation weights isn’t trivial, so pay attention. Reviewers and journal editors should also keep an eye out, because many published papers likely report ATETs aggregated with Stata’s default formula28.
Do we have code?
The authors implement their examples in Stata. They use flexdid (their own FLEX estimator, available on SSC); csdid and hdidregress (to replicate Callaway–Sant’Anna); did_imputation (Borusyak–Jaravel–Spiess); and did_multiplegt (de Chaisemartin–D’Haultfœuille). In their simulations, FLEX, Callaway–Sant’Anna, and Borusyak–Jaravel–Spiess all give the same cohort–time effects. The differences between them show up only at the aggregation step. The de Chaisemartin–D’Haultfœuille method produces different underlying estimates due to its different choice of control group.
In summary, this paper is a good reminder that DiD work doesn’t stop at identification. Once you have cohort–time effects, you still have to decide how to average them. In balanced panel data, aggregation choices collapse to the same number. In repeated cross-sections, they don’t. The key finding is that Stata’s default implementation of Callaway–Sant’Anna silently adds pre-treatment observations into the weights, which can shift the overall ATET by 10–20% in realistic settings. The theory in C&S (2021) is clear, but the code isn’t. For applied researchers, the takeaway is simple: be explicit about what weights you are using to report an “overall” effect. Aggregation is part of your model.
For an interesting discussion on this, see Bellégo, Benatia and Dortet-Bernadet.
As the authors point out, “if the mechanism that causes the data to be missing is related to treatment or other characteristics that influence the outcome, the remaining sample is no longer representative of the population of interest and the resulting estimates will be biased”.
Ok, some house cleaning first. This classification was first proposed by Rubin in 1976. He formalized the three categories (MCAR, MAR, and MNAR) and provided the statistical framework for understanding how different missing data mechanisms affect inference. This work built on earlier ideas but Rubin really crystallized the taxonomy and its implications.
Missing Completely At Random (MCAR): the missingness here has no relationship to any variables in the dataset (observed or unobserved, e.g., a survey page that randomly fails to load for some users due to some technical glitch). It is the least problematic type because the missing data is essentially a random sample of all data.
Missing At Random (MAR): here the missingness is related to observed variables, but not to the missing values themselves (e.g., younger people might be less likely to report their income, but among people of the same age, whether income is missing is random). It is “random” CONDITIONAL on other observed variables. Not so problematic seeing that most statistical methods can handle MAR if you account for the related variables.
Missing Not At Random (MNAR): the missingness is related to the unobserved (missing) values themselves (e.g., people with very high incomes are less likely to report their income specifically because their income is high). MNAR is the most problematic type because the missing data mechanism is related to what you’re trying to measure, and thus requires special modeling approaches or sensitivity analyses.
“We want to emphasize that our framework differs from both the balanced panel data and the repeated cross-section data analyzed by Sant’Anna and Zhao (2020). The former postulates that each sample is observed both before and after treatment: the latter assumes that each sample is observed either before or after treatment. Instead, our setup mirrors the so-called unbalanced, or partially missing, panel data framework, where the outcomes of some samples are observed both before and after the treatment, while some other samples miss pre-treatment outcomes.”
“These assumptions are novel to our setting and are equivalent to a MAR missingness design. Notice that we let the missingness pattern depend on *both* covariates X and the treatment A, and eventually on the post-treatment outcome. In other words, we admit the possibility that pre-treatment outcomes can be missing due to covariates, the treatment that will be administered, and the post-treatment outcome values.”
The (theoretical) lowest possible variance that an estimator can achieve in a given model. It’s a benchmark to judge whether an estimator is statistically optimal. If your estimator hits this bound, you cannot do better in large samples.
A mathematical way of describing how each individual observation affects an estimator. Think of it as a linear “approximation” that shows the sensitivity of your estimate to each data point. We don’t need the formula, just the idea that it underpins efficiency and robustness proofs.
A technique where the sample is split into folds (subsets of the data, e.g. 1,000 observations split into 5 folds gives 5 groups of 200). Nuisance functions (like propensity scores) are estimated on some folds and then evaluated on a different fold. Each observation is used for training and evaluation, but never at the same time. This rotation avoids overfitting and preserves efficiency.
These are not the main parameter of interest (the ATT), but rather functions that must be estimated from the data to construct the efficient estimator and guarantee its desirable properties. The primary nuisance functions in the paper are:
Regression Functions (μ∗): Models for the expected values of the outcomes conditional on other variables (e.g., the expected post-treatment outcome for the control group, E[Y1∣X,A=0]).
Propensity Score (π∗): The probability of a unit being in the treatment group conditional on their covariates (P[A=1∣X]).
Missingness Probability (γ∗): The probability of an outcome being observed (not missing) conditional on other observed variables.
Nested Regression (η∗): A specialized conditional expectation needed under the most complex missing-at-random assumption.
Correctly modeling just a subset of these functions is what gives the estimators their multiple robustness property.
Nested Regression (η0∗(X,0)): it’s a specific and complex type of conditional expectation. It’s called “nested” because it involves taking an expected value of an expected value. It’s needed for the efficient estimator when missingness depends on the post-treatment outcome (Assumption 2.4).
Root Mean Squared Error, the standard measure combining bias and variance into a single metric. Lower RMSE means better estimator performance.
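For reference:

```latex
\mathrm{RMSE}(\hat\theta)
  = \sqrt{\mathbb{E}\big[(\hat\theta - \theta)^2\big]}
  = \sqrt{\mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)}
```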
The key person bridging advanced matrix theory (specifically the low-rank assumption and PCA) into modern econometrics and setting the stage for its application in both macro (dynamic factor models) and micro (interactive fixed effects) panel settings is Jushan Bai (who’s now at Columbia). While the foundation for factor models was laid in macro, the move to rigorous estimation of coefficients in panel data - which is the context of this paper - is his central contribution.
The low-rank factor structure is a statistical modeling assumption which says that the observed outcomes of many units over time can be approximated by a small number of unobserved, common factors or trends. It’s used to simplify and understand complex, high-dimensional data, such as the panel data (many units observed over many time periods) studied in the paper.
An estimator is asymptotically unbiased if, as the sample size grows infinitely large, the expected value of the estimator converges to the true value of the parameter it is trying to estimate. Unbiased means the estimator’s average guess is exactly the true value, regardless of sample size, and asymptotically means it might be biased in small samples, but that bias completely vanishes as you get more and more data (i.e., as the number of units N and/or the number of time periods T goes to infinity).
However… researchers should still verify performance in their specific context as the one exception in the simulations shows that no single method dominates universally.
Or, in the authors words, this allows the average treatment effect to be set-identified, and one can construct a confidence set for it.
Also known as heterogeneous treatment effect contamination. This problem can be addressed by HTE-robust estimators (e.g., Sun and Abraham, 2021; Borusyak et al., 2024), which estimate cohort-period level effects. Borusyak and Hull (2024) is a good read if you’re worried.
Even with HTE-robust estimators, the set of treated cohorts contributing to the aggregated coefficients changes across relative periods, and as a result, aggregated pre-treatment and post-treatment coefficients are not directly comparable, since they are based on different treated cohort compositions.
Aka, the dynamic control group problem. This issue arises for estimators using not-yet-treated cohorts as controls (like what Ziyi refers to as the CS-NYT estimator). A shifting control group means the underlying definition of the PTA changes, making it difficult to impose credible restrictions.
Aggregation can mask cohort-level heterogeneity and produce a distorted benchmark; differences then reflect shifts in cohort composition.
The cohort-anchored framework bases inference, as the name suggests, on cohort-period level coefficients.
The block bias compares a treated cohort to its fixed initial control group (units untreated when the cohort first adopts).
The comparison operates within a block-adoption structure, hence “block bias”.
Block biases are “anchored” to a fixed control group, enabling observable pre-treatment block biases to serve as valid benchmarks for their post-treatment counterparts.
The block bias concept enables robust inference through a bias decomposition. This decomposition shows that the overall estimation bias (δ) can be expressed as an invertible linear transformation of the block biases (δ=WΔ). By imposing restrictions on the consistent block bias (Δ), the framework can translate those bounds to the overall estimation bias (δ) and successfully construct a valid confidence set.
The entire paper proposes a cohort-anchored framework for robust inference, solving issues like dynamic control group and cohort heterogeneity.
This weighting issue is specific to the Stata implementation of the Callaway and Sant’Anna method; the authors note that other modern DiD estimators use the more intuitive weighting scheme.
The authors themselves express this concern by noting the “enormous influence and popularity” of the method and its Stata implementation, leading them to be “concerned that many published results have been calculated with a formula that is not what the researchers intended”.