Hello! Today we have 3 theory papers, 1 “software”/guide paper, and 3 applied papers I found (the Japanese one is behind a paywall, so email me if you would like a copy). All of them are super cool.
Buckle up because we have many concepts from areas we might not be entirely familiar with… check the footnotes!
Sensitivity Analysis for Treatment Effects in Difference-in-Differences Models using Riesz Representation, by Philipp Bach, Sven Klaassen, Jannis Kueck, Mara Mattes and Martin Spindler
Compositional Difference-in-Differences for Categorical Outcomes, by Onil Boussim (on the JM!)
Efficient nonparametric estimation with difference-in-differences in the presence of network dependence and interference, by Michael Jetsupphasuk*, Didong Li and Michael G. Hudgens (* on the JM!)
Software/guide:
Using did_multiplegt_dyn to Estimate Event-Study Effects in Complex Designs: Overview, and Four Examples Based on Real Datasets, by Clément de Chaisemartin, Diego Ciccia, Felix Knau, Mélitine Malézieux, Doulo Sow, David Arboleda, Romain Angotti, Xavier D’Haultfœuille, Bingxue Li, Henri Fabre and Anzony Quispe (De Chaisemartin et al. extend their 2024 event-study estimators into a new command for Stata and R. The command handles on/off, continuous and multivalued treatments, and it replaces did_multiplegt for this use case. The new command is also faster because variances are computed with analytic formulas rather than the bootstrap, which the paper shows how to do. It’s a neat and practical guide: they go through the estimators, validate them in simulations and show four real-data applications).
Applied:
The Labor Market Effects of Generative AI: A Difference-in-Differences Analysis of AI Exposure, by Andrew C. Johnston and Christos A. Makridis (USA, DiD with continuous treatment intensity in an event study framework) → I was seriously considering starting a Substack focused on Econ and AI, but I don’t have time. If you feel like you could do this, you should do it. It’s an area in development, at its early stages, and we would all benefit from learning more about it. This paper is the first comprehensive study providing economy-wide evidence of real labour market impacts from AI exposure. What I particularly enjoyed was the analysis at both extensive (e.g., whether workers remain employed or lose jobs, whether firms enter or exit sectors, whether establishments expand or contract their workforce) and intensive (e.g., how AI changes the composition of tasks workers perform within their jobs, changes in productivity/output per worker, changes in hours worked or work intensity) margins. They also decompose AI exposure into augmenting (where AI complements human work) versus displacing (where AI substitutes for it) components, finding that augmenting exposure drives employment growth, while displacing exposure causes contraction. Because the analysis was done using administrative data that covered the universe of U.S. employers, they were able to show that high-skilled workers see larger gains, which represents a departure from previous automation waves.
The role of human interaction in innovation: evidence from the 1918 influenza pandemic in Japan, by Hiroyasu Inoue, Kentaro Nakajima, Tetsuji Okazaki, and Yukiko U. Saito (Japan, tech-class x year DiD with common shock and differential exposure; event-study; PPML for counts) → I’m a sucker for natural experiments, even though there were lots of deaths involved. I’m also a sucker for historical data collection, which I would love to do if I spoke Japanese. This paper mixes both things. The 1918 flu in Japan had three waves: Oct 1918–Mar 1919, Dec 1919–Mar 1920, Dec 1920–Mar 1921, about 23.8 million infections and 390k deaths out of a 55 million population, which is crazy. Transport and factories were disrupted, mortality peaked among working-age adults and face-to-face contact became costly for inventors. This setting allowed the authors to test the claim that tacit knowledge and spillovers travel through direct interaction. We know that collaboration drives the flow of ideas, which has a direct effect on the quality of output, so the test was straightforward: if tacit knowledge moves through direct contact, then tech classes that collaborated more before 1918 should stumble more when contact gets costly. The authors proxy collaboration needs with pre-1918 co-inventing shares and track patents by class over 1911–1930, so you can watch the pandemic shock play out. The drop is sharper in collaboration-intensive classes in 1919–1921, and it comes from fewer new inventors rather than incumbents retreating. The key message then becomes: raise interaction costs and innovation slows most where collaboration is the input. This paper made me think of Covid as the modern analogue: a common shock that raised interaction costs, with differential exposure by how much collaboration must be in-person. Unlike in 1918, we had Zoom and cloud tools which softened the blow, so we should expect the effects to be smaller where remote work was feasible and larger in wet-lab, field or hardware R&D (if anyone wants to write about it).
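To picture the specification behind that design, here is a rough sketch (not the authors’ code; the data, column names and reference year are all made up) of an exposure-by-year event study estimated by PPML in Python:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data: patent counts by technology class and year, plus each class's
# (time-invariant) pre-1918 co-inventing share as the exposure variable
rng = np.random.default_rng(0)
classes, years = range(30), range(1911, 1931)
df = pd.DataFrame([(c, t) for c in classes for t in years],
                  columns=["tech_class", "year"])
collab = {c: rng.uniform(0, 0.6) for c in classes}
df["collab_share"] = df["tech_class"].map(collab)
df["patents"] = rng.poisson(20, size=len(df))

# Interact the exposure with year dummies, omitting a pre-pandemic reference year
base_year = 1917
event_terms = []
for t in years:
    if t != base_year:
        df[f"exp_{t}"] = df["collab_share"] * (df["year"] == t)
        event_terms.append(f"exp_{t}")

# PPML with class and year fixed effects, clustered by technology class
formula = "patents ~ C(tech_class) + C(year) + " + " + ".join(event_terms)
res = smf.glm(formula, data=df, family=sm.families.Poisson()).fit(
    cov_type="cluster", cov_kwds={"groups": df["tech_class"]}
)
print(res.params.filter(like="exp_"))  # event-study coefficients on the exposure
```

The coefficients on the exposure interactions trace out, year by year relative to the baseline, how collaboration-intensive classes fare compared with the rest.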
Forgoing Nuclear: Nuclear Power Plant Closures and Carbon Emissions in the United States, by Luke Petach (USA, TWFE DiD with staggered treatment timing; robustness checks with Callaway-Sant’Anna and Borusyak et al. estimators) → The other day on X/Twitter we were discussing the difficulty of using DiD to quantify the effect of new data centres on energy prices, due to several threats to identification: data centre planners select locations based on energy characteristics that independently affect price trends (selection bias), their arrival coincides with other infrastructure investments and policy changes (confounders), energy markets are interconnected across regions (spillovers) and prices may react to announcements before construction begins (anticipation effects). Fundamentally, it’s hard to find a credible counterfactual for what would have happened to energy prices without the data centre. Luke has addressed lots of those concerns in his paper, but more importantly, a plant closing is more exogenous than a data centre (or plant) opening. The “treatment” happens to the state rather than being chosen by strategic actors based on local conditions. Luke finds that nuclear closures increased state-level CO₂ emissions by 6–8%, with the gap filled almost entirely by coal rather than cleaner natural gas. Event studies showed no pre-trends and the results hold when analyzed at the regional electricity market (RTO/ISO) level to account for cross-state spillovers, meaning that the effect is real and it’s carbon-intensive. This is an important example that showcases how identification strategy matters as much as the question itself. When you can’t randomize treatment, the next best thing is finding settings where treatment is plausibly exogenous, even if that means studying the problem in reverse.
Sensitivity Analysis for Treatment Effects in Difference-in-Differences Models using Riesz Representation
TL;DR: this paper shows how to measure how much hidden bias could affect DiD estimates when the parallel trends assumption (PTA) might not hold. Using the Riesz representation, the authors express the treatment effect as a weighted average and show how bias from unobserved factors links to familiar quantities. The result is a practical way to report DiD effects with clear bounds instead of relying on untestable assumptions.
What is this paper about?
When analyzing a treatment in order to quantify a causal effect, we have to consider a lot of things. But we never really “see” the full picture. People, firms or regions decide to take up a policy for reasons we can’t fully observe, and those same reasons often affect how their outcomes change over time. In DiD we “deal” (to the best of our ability) with that by conditioning on covariates and assuming that, after controlling for them, treated and control units would have followed similar trends in the absence of treatment (the PTA). The problem is that this assumption of parallel trends given covariates can still fail if we’re missing something important that we can’t quantify or account for.
This paper is about measuring how much of such missing information “could” matter. Instead of treating “unobserved confounders” as a black box, the authors show how to express their influence in terms of quantities we can describe and argue about. They build on a mathematical idea called the Riesz representation[1], which expresses the treatment effect as an average of predicted outcome changes, each multiplied by a specific weight. Once written that way, we can study how much the estimate would move if some unobserved factors were left out of either the prediction or the weighting step.
The motivation is clear: the goal with a DiD design is to estimate the ATT, but this estimate is only as good as the assumption that the PTA holds once we condition on covariates. Since that can never be fully verified, the question becomes how to describe and measure the possible bias rather than ignore it.
What do the authors do?
They start off by rewriting the DiD effect as an average of predicted outcome changes taken with a particular set of weights. That step (which comes from the Riesz representation) makes it easier to see what happens when some parts of the data are missing. Once written that way, the estimate depends on two things: the model used to predict outcome changes and the weights that compare treated and control units. If either of these leaves out important factors, the bias can be expressed in terms of how much those missing variables would help explain the outcome or the likelihood of treatment.
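To fix ideas, in the canonical 2×2 case this weighted-average form boils down (up to notation; the paper’s exact expressions, especially for staggered designs, differ in the details) to the familiar doubly robust DiD score, with ΔY the pre-to-post outcome change, m(X) the treatment propensity and g(X) = E[ΔY | X, D = 0]:

$$
\theta_{\mathrm{ATT}} = \mathbb{E}\Big[\alpha(D,X)\,\big(\Delta Y - g(X)\big)\Big],
\qquad
\alpha(D,X) = \frac{1}{\Pr(D=1)}\left(D - \frac{m(X)(1-D)}{1-m(X)}\right).
$$

Here g is the prediction piece and α is the Riesz representer: exactly the set of weights that compares treated and control units. Since E[α(D,X) g(X)] = 0, the effect really is a weighted average of outcome changes; subtracting g(X) is what delivers the robustness to estimation error in either piece.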
This is where the paper becomes useful for applied work. It shows that the size of the bias depends on how much the unobserved variables would improve the chosen model’s fit for outcomes *and* for treatment assignment, and on how correlated those two pieces are. Each of these quantities can be linked back to measures we are familiar with, such as the partial R^2[2]. With that the authors turn “how sensitive is my DiD estimate to hidden bias?” into something measurable.
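Concretely, if g_s and α_s denote the “short” outcome model and Riesz weights that omit the unobservables, the omitted-variable bias has the generic form below (this is the Cauchy–Schwarz-style bound this literature builds on; the paper re-expresses each factor through partial-R^2-type quantities tailored to the DiD setting):

$$
\theta - \theta_s = \mathbb{E}\big[(g - g_s)(\alpha - \alpha_s)\big]
= \rho \,\sqrt{\mathbb{E}\big[(g - g_s)^2\big]}\,\sqrt{\mathbb{E}\big[(\alpha - \alpha_s)^2\big]},
$$

where ρ is the correlation between the two gaps. The first root term answers “how much better would I predict outcome changes if I had the unobservables”, the second “how much better would I predict treatment/the weights”, which is what gets mapped onto partial R^2 scales.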
They go on to extend the same idea to staggered adoption designs, showing how to bound both point estimates and confidence intervals when some confounders are missing. Instead of producing a single number that might rest on unrealistic assumptions, the paper shows how to report an effect that comes with well-defined bounds, or how large the unobserved bias would need to be to make the result disappear or turn insignificant. The estimation itself follows a DML approach, where modern ML methods are used to learn both the prediction and weighting steps without having to assume a specific functional form. This keeps the method practical for high-dimensional data while preserving standard large-sample inference.
Why is this important?
A couple of reasons, both related to real-world problems and to the econometrics itself. The first one is straightforward: the PTA does half of the work in DiD, and it comes directly from the design. Because we never observe the treated group’s counterfactual path after treatment, we need to assume it would have followed the same trend as the control group once we condition on covariates. The issue is that this path is unobserved by construction, so we can’t ever prove that the assumption holds. That’s what makes DiD possible, but also what makes it fragile. In practice we check pre-trends, add controls, and then try to justify the identification assumptions, but there is always a gap between what we observe and what drives selection into treatment and changes in outcomes. This paper gives a way to describe that gap and put numbers on it, so we are not asked to take the key assumption on faith.
It also changes how we present results. Instead of a single estimate that depends on untestable judgment calls, you can say: here is my best estimate and here is how far it would move if unobserved factors were strong enough to improve prediction by this much, which then makes claims more honest without turning them into hand-waving. It also helps avoid overclaiming and it gives a common language for authors, referees, and seminar audiences to discuss what “plausible” unobserved bias means.
For applied work, the paper ties sensitivity to the tools we already use: pre-trend placebos, leave-one-covariate-out checks and familiar fit measures. If you have multiple pre-periods, you can calibrate scenarios from the size of the placebos. If you don’t, you can benchmark against observed covariates and ask whether any truly unobserved factor would be as powerful as the ones you already “see”. Either way, you get clear bounds for the point estimate and the confidence interval and the same logic carries over to staggered adoption settings.
There’s a policy angle too. When a result is sensitive, you can show how and by how much rather than pretending the issue isn’t there. When it’s robust, you can document that claim with numbers. In both cases the takeaway becomes easier to trust, which is the point of methods in the first place.
Who should care?
Applied economists using DiD in labour, education, health, development or policy evaluation, anyone who has ever been told by a referee to “test robustness to hidden confounders”, researchers who rely on DiD but worry about unobserved bias they can’t rule out, and methodologists interested in linking ML-based estimation with identification theory.
Do we have code?
There’s an open-source implementation in DoubleML for Python, with a user guide that covers this DiD sensitivity setup (2×2 and staggered), examples, and the Riesz-based formulas. You can run the DML estimator, set sensitivity scenarios (from pre-trends or benchmarking) and get point and CI bounds straight from the package.
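A minimal sketch of how this looks in practice, assuming the DoubleML Python API roughly as in its user guide (class names, the expected panel format and method signatures may differ across versions, and the toy data below is made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import doubleml as dml

# Toy data: dy is the pre-to-post outcome change, d a treatment dummy, x1..x5 covariates
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
d = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
dy = 0.5 * d + 0.2 * X.sum(axis=1) + rng.normal(size=n)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(1, 6)])
df["d"], df["dy"] = d, dy

# DiD-ATT via DML, with ML learners for the outcome change and the propensity
dml_data = dml.DoubleMLData(df, y_col="dy", d_cols="d")
dml_did = dml.DoubleMLDID(dml_data, ml_g=RandomForestRegressor(), ml_m=RandomForestClassifier())
dml_did.fit()
print(dml_did.summary)

# Sensitivity scenario: cf_y / cf_d are the hypothesised partial-R^2 gains an unobserved
# confounder would bring to the outcome and treatment models, rho their correlation
dml_did.sensitivity_analysis(cf_y=0.03, cf_d=0.03, rho=1.0)
print(dml_did.sensitivity_summary)

# Benchmark the scenario against an observed covariate you consider "strong"
dml_did.sensitivity_benchmark(benchmarking_set=["x1"])
```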
In summary, this paper focuses on the most fragile part of DiD: the fact that the PTA can’t be proved. By rewriting the effect using the Riesz representation, the authors show how to express bias from unobserved factors in terms of familiar quantities like partial R^2. The estimation uses DML, which keeps it workable when there are many covariates or flexible models. The idea is simple: instead of assuming the PTA holds, measure how far it would have to fail for the result to change. It’s a way to be clearer about what we can learn from the data and what still depends on judgment.
Compositional Difference-in-Differences for Categorical Outcomes
(Onil is a PhD candidate at PSU, he’s on the JM this year and this is his JMP! Good luck, Onil :) I also liked this paper because the tables are well-structured, the plots and diagrams are pretty and it’s well formatted. See, a better world is possible)
TL;DR: this paper develops a DiD framework for categorical outcomes where shares must stay positive and sum to one. By replacing additive changes with proportional growth and redefining the PTA, it yields valid counterfactuals and interpretable effects for outcomes that are distributions rather than single values.
What is this paper about?
One of the first things you learn in data analysis is types of data. This is gonna be important here, as we deal with categorical vars[3]. These are vars that take on labels instead of numbers (e.g., vote choice, occupation, education level), and what we often analyse are the shares of each category. The issue is that those shares form a composition: they’re all positive and must add up to one. If one category’s share increases, at least one other must decrease. Standard DiD doesn’t respect this, since it relies on additive changes that can move freely in either direction. In a compositional setting, those additive differences can produce impossible results (negative shares, totals above one, and most importantly changes that have no behavioural meaning). Onil then introduces Compositional DiD (CoDiD), a framework that fixes this by working with proportional (rather than additive) changes.
The idea is to replace the PTA with a parallel growths assumption: in the absence of treatment, each category’s share would have grown (or shrunk) at the same *proportional rate* in treated and control groups. This keeps the counterfactual composition valid (shares stay positive and sum to one) and lines up with how we usually model choices across categories in econ. Onil then uses this framework to define new treatment effects that capture how the overall distribution changes and how probability mass shifts between categories[4]. He extends it further to cases with staggered adoption and to a version that builds synthetic counterfactuals for compositional data (did anyone say synthetic DiD?) that works when the outcome is a distribution instead of a single number.
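To see the difference this makes, here is a toy numerical contrast between the additive counterfactual and a proportional-growth one (purely illustrative, a back-of-the-envelope version of the intuition rather than Onil’s estimator):

```python
import numpy as np

# Shares over three categories; category A collapses in the control group
control_pre  = np.array([0.10, 0.50, 0.40])
control_post = np.array([0.02, 0.54, 0.44])
treated_pre  = np.array([0.05, 0.45, 0.50])

# Additive (standard DiD) counterfactual: add the control group's level changes
additive_cf = treated_pre + (control_post - control_pre)
print(additive_cf)                    # [-0.03  0.49  0.54] -> a negative "share"

# Parallel growths: apply the control group's proportional growth, then renormalise
proportional_cf = treated_pre * (control_post / control_pre)
proportional_cf /= proportional_cf.sum()
print(proportional_cf, proportional_cf.sum())  # positive shares that sum to one
```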
What does the author do?
Onil does *a lot*. He starts with the simplest possible setup (the canonical 2×2 case) to lay out the idea of parallel growths and show both its economic and geometric meaning. From there, he generalises the framework to multiple periods, which is what most of us actually deal with in applied work. He then connects this new approach to familiar methods like standard DiD and Synthetic DiD, showing where they overlap and where the compositional logic changes the interpretation.
To make it concrete, the paper includes two empirical examples: one on how early voting reforms affected turnout and party vote shares and another on how the Regional Greenhouse Gas Initiative (RGGI) shifted the composition of electricity generation. Both illustrate how treating the outcome as a distribution rather than a single variable changes the estimated effect and keeps the counterfactuals within the realm of what’s possible.
Why is this important?
The core issue is how to think about the PTA when the outcome is categorical. A naïve way would be to run a standard DiD separately on each category’s raw share, predict what those shares would have been without treatment and then normalize everything so the shares add up to one. That might sound reasonable, yet it’s not. Doing this ignores how categories relate to each other (e.g., an increase in one share “mechanically” means a decrease in another) and it messes with the link to any behavioural or structural model of how people make choices. In other words, you’d be imposing linear trends on something that may not evolve linearly.
From an econometrics point of view, this causes several issues: it violates the probability constraint (shares might turn negative or exceed one), it treats each category as independent even though they’re jointly determined and it produces counterfactuals that have no theoretical justification in how choices across categories actually adjust. The bigger point is that additive DiD logic doesn’t fit categorical data. What Onil does is rebuild the entire framework so the counterfactual distribution is coherent (i.e., the total probability mass stays fixed, categories remain linked and the treatment effect can be interpreted in economic terms rather than as an artifact of normalization).
Who should care?
Beyond the ones Onil talks about, pretty much anyone studying outcomes where the composition is what you’re interested in. Migration (domestic, international, return), industry shifts (manufacturing, services, tech), language use (home language shares), transportation modes (car, public, walking) or household spending patterns (food, housing, leisure) all fit this setup. Development work often tracks how populations move across employment types, informal versus formal work, or agricultural crops. If you can think of a pie chart when modeling the var at stake, you should consider CoDiD.
Do we have code?
No, and I tried to think of a way to code it but couldn’t (#skillissue). If code is released later, I’ll link it here.
In summary, CoDiD provides a framework for analysing categorical outcomes within a coherent probabilistic structure. It replaces additive comparisons with proportional growth, ensuring that counterfactuals remain consistent with the underlying composition and that estimated effects reflect economically meaningful reallocation across categories. Onil’s contribution is both conceptual and practical: it formalises how to conduct DiD when the object of interest is a distribution, preserving internal consistency and interpretability across empirical settings.
Efficient nonparametric estimation with difference-in-differences in the presence of network dependence and interference
(Michael is a Ph.D. Candidate in UNC’s Department of Biostats, he’s on the JM this year and this is his JMP! Good luck!)
TL;DR: this paper builds a DiD framework for settings where one unit’s treatment affects others through networks. It replaces the usual no-interference assumption with a setup that measures both direct and spillover effects, using an estimator that stays reliable even when some parts of the model are wrong.
What is this paper about?
In lots of DiD setups, units are assumed to be “isolated” (no contamination, SUTVA anyone?): one unit’s treatment doesn’t affect another’s outcome. That assumption is broken the moment there’s a network (e.g., people, firms and/or regions connected in ways that let effects spill over). When a factory installs a “pollution scrubber”, nearby counties benefit too. When a vaccination campaign starts in one area, infection risk falls for neighbours as well. These are classic cases of interference: what happens to you depends not only on your own treatment but also on others’.
This paper deals with that. The authors build a Difference-in-Differences framework that can handle both network dependence (correlated outcomes across connected units) and interference (treatment spillovers). Instead of assuming independent, neatly separated units, they let each unit’s exposure depend on its *position* in the network and on how treatment spreads through its connections.
They then show how to estimate average treatment effects in this environment efficiently and consistently, even when nuisance functions like treatment probabilities or outcome models are learned flexibly with ML. This gives us a doubly robust, semiparametric estimator that remains valid under complex dependence structures (a kind of “network-aware” DiD that accounts for who is connected to whom and how that matters for identification and inference).
What do the authors do?
They start by defining what “treatment” means when spillovers exist. Each unit isn’t just treated or untreated: it also has exposure through its neighbours. The authors formalise this by introducing exposure mappings, which describe how each unit’s treatment and its neighbours’ treatments combine to determine potential outcomes.
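To make “exposure mapping” concrete, here is a toy example (purely illustrative, not the paper’s definition), where each unit’s exposure combines its own treatment with the share of its treated neighbours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = (rng.random((n, n)) < 0.3).astype(int)       # adjacency matrix of the network
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                            # undirected ties

D = rng.integers(0, 2, size=n)                    # own treatment

deg = A.sum(axis=1)
neighbour_share = (A @ D) / np.maximum(deg, 1)    # share of treated neighbours (0 if isolated)

# One possible exposure mapping: (own treatment, share of treated neighbours)
exposure = np.column_stack([D, neighbour_share])
print(exposure)
```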
From there, they extend the DiD framework to this network setting. They define a conditional parallel trends assumption: in the absence of treatment, both directly treated and indirectly exposed units would have followed the same expected trend, conditional on their covariates and network position. That assumption plays the same role as PTA in standard DiD but now accounts for dependence across connected units.
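Very schematically, and in made-up notation rather than the paper’s, the condition has the flavour of

$$
\mathbb{E}\big[Y_{it}(0,\mathbf{0}) - Y_{i,t-1}(0,\mathbf{0}) \mid A_i = a,\ X_i\big]
=
\mathbb{E}\big[Y_{it}(0,\mathbf{0}) - Y_{i,t-1}(0,\mathbf{0}) \mid A_i = a',\ X_i\big]
\quad \text{for all exposure levels } a, a',
$$

where A_i stacks unit i’s own treatment and its neighbours’ treatments, X_i includes covariates and network features, and Y_it(0, 0) is the potential outcome when nobody is treated.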
They then build a doubly robust + semiparametric estimator for the average treatment effect under interference. It combines an outcome model (predicting changes) with a treatment and exposure model (predicting who gets treated and how that treatment propagates through the network). Either model can be misspecified, but as long as one is correct, the estimator remains consistent. The authors also derive its asymptotic efficiency bound and show that the estimator achieves it, even when using flexible ML methods to estimate the nuisance functions (we talked all about these terms before).
Finally, they run simulations to check performance under realistic (IRL) network structures and apply the method to US county-level data, studying how the adoption of scrubbers in coal power plants affected cardiovascular mortality, both in treated counties and in neighbouring ones that received indirect benefits.
Why does this matter?
We don’t exist in isolation, we are all part of a network. Factories share “air sheds”, counties share labour markets and hospitals, schools share catchment areas, firms share suppliers. In these settings treatment in one place changes exposure in nearby places. If we ignore that structure, the DiD contrast can mix up direct effects with spillovers and the policy story becomes blurred (and incredibly difficult to justify).
The paper gives a way to define and estimate effects that respect our reality. By writing outcomes in terms of own treatment and exposure through neighbours, and by stating a parallel-trends condition that conditions on network position, we get an estimand that matches the question policymakers ask: what changed for treated units, and what changed for those connected to them?
There is also a precision and credibility gain. The estimator is doubly robust (either the outcome model or the treatment/exposure model can be off and consistency is kept) and it reaches the efficiency bound while letting nuisance pieces be learned with modern ML. That combination is super important when treatment is rare, networks are sparse or irregular and/or the signal is relatively small.
Finally, it changes how we report results. Instead of one average with vague caveats about “contamination,” you can report direct and spillover effects with valid inference under dependence. For environmental regulation, vaccination campaigns, transport investments or school reforms, that is the difference between a credible evaluation and one that misses how benefits spread across the network.
Who should care?
Environmental and health economists studying pollution or disease spread across regions, labour and urban economists dealing with commuting zones and shared markets, education researchers looking at peer or school-network effects, and development economists evaluating geographically clustered programmes. It also matters for applied researchers who suspect “contamination” but don’t want to drop affected observations or pretend it away. If treatments diffuse through networks (physical, social or economic) this framework gives a way to formalise that dependence and still estimate interpretable causal effects. And for methodologists, it extends the efficiency theory of DiD to a frontier problem: identification and inference under interference.
Do we have code?
No, the paper does not appear to provide a public code repository or package for the estimator itself. The authors mention using other R packages in their application (such as disperseR to calculate the interference matrices and various ML packages like BART and HAL for the nuisance functions) but they do not link to their own code that implements the proposed network-based doubly robust DiD estimator. I will update the post if anything changes.
In summary, this paper extends our DiD to the networked world. It formalises what “treatment” means when outcomes depend on neighbours’ exposure, defines a conditional PTA that accounts for those links and derives an efficient, DR estimator for both direct and spillover effects. This results in a DiD framework that stays valid when SUTVA breaks down. It treats interference as part of the design rather than a violation, which lets us measure how policies spread through connected units instead of assuming an isolation that rarely exists.
[1] Some of the papers in this newsletter can be summarized as “econometricians writing papers using ML methods based on obscure Maths from the early 1900s”. The Riesz representation is a concept from Maths that says you can always rewrite a linear estimate (e.g., a treatment effect) as an average of predictions with the “right” weights.
[2] Partial R^2 measures how much additional variation a variable explains once other covariates are already included in the model.
[3] Not all outcomes behave like “heights” or “incomes”. Some are categories. If we’re looking at vote choice, we don’t measure a number for each person; we record a label (Democrat, Republican, Independent). At the group level we talk about shares across those labels. Those shares are probabilities, they’re all non-negative and they must add up to one. Think of a fixed pie cut into slices: make one slice bigger and at least one other slice must get smaller. That constraint is always there. There’s also a difference between categories and ordered categories. Letter grades are ordered (A above B above C), but the jump from B to A is not the same thing as the jump from C to B. If we map A=3, B=2, C=1, we’re pretending the steps are evenly spaced when they aren’t. That can push a linear DiD to say things that don’t make sense for the underlying scale. Standard DiD is built for additive, nearly continuous outcomes where plus/minus changes behave well. With categorical or compositional outcomes, additive changes can mess with the basic rules (negative “probabilities”, totals above one) or hide the fact that gains in one category must come from losses elsewhere. The fix is to work with the whole distribution and with changes that respect the “pie-chart constraint”. That’s the problem this paper takes on: how to do DiD when the outcome is a set of category shares rather than a single continuous number.
[4] As Onil says, his framework is particularly suited to settings with discrete, unordered outcomes (e.g., employment status, voting choice, other health categories) where the policy question goes beyond an average effect to how the entire composition of categories is reshaped. We can think of labour-market reforms that shift people between employment, unemployment and out-of-labour-force states; or health interventions that move patients across diagnostic categories rather than changing a single health index. Even education and migration policies often work this way: they change who ends up where. In all these cases, what matters is the redistribution of probability mass across categories, i.e., how treatment changes the mix of outcomes beyond their mean.


