(Image by prof Chelsea Parlett-Pelleriti)
Hi there!
I found 2 papers and 1 guide in the past two weeks that I thought some of you would appreciate. The IV one is related to Health Economics, the RDD one is a job market paper (full of colorful plots, we love it), and the guide was born out of Political Economy (co-written by 2 PhD students!), so I guess there is something for everyone here :)
Here they are:
IV in Randomized Trials, by Joshua D. Angrist, Carol Gao, Peter Hull, and Robert W. Yeh
RDD with Distribution-Valued Outcomes, by David Van Dijcke (thread)
A Practical Guide to Estimating Conditional Marginal Effects: Modern Approaches, by Jiehan Liu, Ziyi Liu, and Yiqing Xu (thread)
We will go from shortest to longest today.
IV in Randomized Trials
(You do not need to be an econometrician to get what is going on here, though it helps to have one nearby - they can be on X/BSky).
TL;DR: clinical trials are sometimes messy. Patients do not always do what they are assigned to do: some in the treatment group never take the treatment, while some in the control group sneak off and take it anyway. This messes with both intention-to-treat (ITT) and per-protocol analyses (the intended treatment is randomly assigned but treatment received is not): the first underestimates the true effect, and the second suffers from selection bias. IV methods offer a middle way, and this paper shows how to apply them in real-world trials. The authors explain IV theory using the ISCHEMIA trial as a motivating example and walk through how IV estimates recover local average treatment effects (LATE) for compliers (the participants whose behaviour was actually influenced by their random assignment). The paper argues that IV methods should be standard in pragmatic, strategy, and nudge trials where adherence is imperfect.
What is this paper about?
This paper revisits a well-known challenge in clinical trials: what do we do when patients do not stick to their assigned treatments? Traditional approaches like intention-to-treat (ITT) analyses estimate the effect of being assigned to treatment, regardless of whether patients actually receive it. This preserves the benefits of randomization but often underestimates the treatment's true effect when adherence is imperfect. On the other hand, per-protocol or as-treated analyses attempt to estimate the effect of receiving treatment, but at the cost of introducing selection bias, since treatment uptake is no longer randomized. To navigate this trade-off, the authors advocate for IVs as a middle-ground solution. They treat random assignment as an instrument for actual treatment receipt and use it to estimate the Local Average Treatment Effect (LATE, the causal effect of treatment for the subgroup of participants who comply with their assigned treatment). The paper explains the theory clearly, demonstrates its application using the ISCHEMIA trial data, and argues that IV should be a standard tool in modern clinical trials, especially in settings where nonadherence or crossover is common. (Best of all? Health researchers get to use IV without having to write three paragraphs defending the exclusion restriction. Must be nice, I would not know).
What do the authors do?
They lay out a simple yet compelling case for using IVs to estimate causal effects in clinical trials with imperfect adherence. They begin by explaining the core intuition behind IV: when random assignment affects treatment uptake, it can be used to isolate the causal effect of treatment, specifically for the subset of participants who comply because of their assignment. They walk us through the basic mechanics: divide the ITT effect (difference in outcomes by assignment) by the compliance rate (difference in treatment uptake by assignment). The result is an estimate of the effect of treatment received among compliers (those whose treatment behaviour was influenced by randomization), i.e., the LATE. To make this concrete, they apply IV to the ISCHEMIA trial, a large cardiovascular study where about 20% of patients assigned to invasive treatment did not actually get it, and about 12% of patients in the conservative group got it anyway. Using IV, they show that the estimated treatment effect on health-related quality of life (SAQ score) is substantially larger than the ITT estimate (and clinically meaningful). They also provide adjusted estimates using regression-based IV methods, reinforcing the idea that this is more than a back-of-the-envelope trick.
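For intuition, here is what that ratio looks like in code. This is a toy simulation with made-up numbers loosely echoing the adherence rates above, not the actual ISCHEMIA data, and the 2SLS line assumes you have the AER package installed:

```r
# Toy simulation (made-up numbers, not the actual ISCHEMIA data)
set.seed(1)
n <- 5000
z <- rbinom(n, 1, 0.5)                        # random assignment = the instrument
d <- rbinom(n, 1, ifelse(z == 1, 0.80, 0.12)) # treatment actually received
y <- 5 * d + rnorm(n)                         # outcome; true effect of treatment = 5

itt  <- mean(y[z == 1]) - mean(y[z == 0])     # effect of assignment (diluted)
comp <- mean(d[z == 1]) - mean(d[z == 0])     # first stage: difference in uptake
itt / comp                                    # Wald/IV estimate of the LATE, ~5

# the same estimate via 2SLS, if you have the AER package:
# library(AER)
# summary(ivreg(y ~ d | z))
```

The ITT estimate alone would be diluted by the non-compliance; dividing by the first stage rescales it to the compliers.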
Why is this important?
This is important because most clinical trials do not go exactly as planned and pretending otherwise does not help. Nonadherence, crossover, and treatment contamination are common, especially in pragmatic or strategy trials where patient choice and real-world logistics come into play. The standard ITT approach is clean but conservative, often underestimating the true effect of treatment. Per-protocol analyses try to adjust for this, but they sacrifice the core benefit of randomization → unbiased comparison. IVs offer a “principled” way out of this dilemma. By using assignment as an instrument for treatment receipt, researchers can recover valid causal estimates for the subset of participants whose treatment was influenced by randomization → the compliers. And while economists have long treated IV as part of the standard causal toolkit, this paper brings that logic to a clinical audience, with an accessible explanation, a real trial application, and a strong case for making IV a routine part of analysis when nonadherence is more than a footnote.
Who should care?
This paper is for clinical researchers and trialists who know their sample size calculations by heart but start sweating when compliance drops below 80%. It is for anyone designing or analyzing trials where real-world behaviour gets in the way of ideal randomization (which, let us face it, is most of them). More specifically, it is relevant to:
Clinical trialists working on pragmatic, strategy, or nudge trials, where adherence is not enforced and treatment effects depend on behaviour.
Health economists and outcomes researchers who want causal estimates that are both internally valid and clinically meaningful.
Biostatisticians looking for alternatives to biased per-protocol analyses that do not involve violating the randomization.
Regulatory scientists and policy analysts trying to understand not just whether a treatment works, but how much it works for patients who actually receive it.
If your trial has a CONSORT flowchart with more arrows than a subway map, IV might be exactly what you need.
In summary, if you are working with nonadherence, crossover, or anything resembling the real world, you do not need to choose between an ITT that is too soft and a per-protocol that is too biased. You can run an IV. You probably already know how. Now is the time to do it.
RDD with Distribution-Valued Outcomes
(David is on this year’s market! This is his JMP)
TL;DR: what if your outcome is a whole distribution rather than a single number? This paper extends RDDs to handle distribution-valued outcomes (think income brackets, price spreads, or test score distributions) where you want to know how an intervention shifts the shape of the outcome, going beyond just the average. David introduces a new causal estimand (LAQTE), proposes two estimators (one local polynomial, one in Wasserstein space), and shows how to construct valid confidence bands for inference. The method is applied to U.S. gubernatorial elections, where Democratic wins reduce income inequality by compressing the top end of the distribution. The theory is sharp, the plots are gorgeous, and the contribution is genuinely useful if you care about distributional effects.
What is this paper about?
This paper introduces a very cool (and very visual) extension of the classic RDD for situations where the outcome of interest is a distribution. Think test score distributions in schools, wage distributions within firms, or price distributions across products, cases where you care about the whole shape. Traditional RDDs are not usually equipped to handle these settings, because they assume scalar outcomes and ignore the structure (and randomness) inside each unit (there is a two-level randomness: one from how units are assigned around the cutoff, and one from sampling within each unit’s distribution). This paper provides the tools to handle both. David proposes a new framework, which he calls R3D (Regression Discontinuity with Distribution-Valued Outcomes), that treats these within-unit distributions as functional data. He defines a new causal estimand, the Local Average Quantile Treatment Effect (LAQTE), which captures how the average quantile function shifts around the cutoff. The goal is to estimate how an intervention affects the shape of the distribution, beyond just a point on it. Along the way, he introduces two estimators (one based on local polynomial regression, the other on local Fréchet regression in Wasserstein space → Maths people will have a ball with this one), proves asymptotic normality, derives uniform confidence bands, and shows us why standard quantile RDD estimators fail in these cases. The paper closes with an application to gubernatorial elections and income inequality in the U.S., showing that Democratic wins reduce top-end incomes (a classic equality–efficiency tradeoff).
What does the author do?
David develops a new framework called R3D, designed for settings where each unit (e.g., a state, school, hospital) is associated with a full empirical distribution. In these cases, traditional RDDs fall short: they cannot account for the internal variation or the structure of the distribution itself. He defines a new causal estimand (LAQTE), which captures how the average quantile function changes at the cutoff, and then shows that standard quantile RDD estimators (e.g., estimating the 10th, 50th, 90th percentiles one by one) do not work in this setting, because they fail to aggregate information across units with different internal structures. The author proposes two estimators: a naïve estimator that estimates quantiles across units and applies standard local polynomial RDD logic, and a Fréchet/Wasserstein-based estimator, which treats the quantile function as a curve and applies local Fréchet regression in Wasserstein space (yes, the space of probability distributions). It sounds fancy (maybe because René Maurice Fréchet was French?), but it works beautifully and gives valid inference. He also derives uniform confidence bands for the full quantile function, letting us ask questions like: where in the distribution does the treatment hit? Finally, he applies the method to a gubernatorial RDD in the U.S., showing that when Democrats win, the top end of the income distribution shrinks, even though the average income stays flat → evidence of distributional shifts without mean effects: exactly what this method is built for.
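If you want a feel for the quantile-by-quantile idea before the Fréchet machinery, here is a rough sketch of my own (a toy simulation, not the r3d package's implementation; the commented lines assume you have the rdrobust package installed):

```r
# Toy sketch of the quantile-by-quantile idea (not the r3d implementation):
# each unit has a running variable and a within-unit sample of outcomes.
set.seed(1)
n_units <- 400
x    <- runif(n_units, -1, 1)                  # running variable, cutoff at 0
taus <- seq(0.1, 0.9, by = 0.1)                # quantile levels of interest

# simulate within-unit samples whose upper tail is compressed when x >= 0
unit_q <- t(sapply(x, function(xi) {
  draws <- rlnorm(200, meanlog = 1, sdlog = if (xi >= 0) 0.4 else 0.6)
  quantile(draws, probs = taus)                # unit-level quantile function
}))

# one standard local-polynomial RDD per quantile level (needs rdrobust):
# library(rdrobust)
# effects <- sapply(seq_along(taus),
#                   function(j) rdrobust(y = unit_q[, j], x = x, c = 0)$coef[1])
# plot(taus, effects, type = "b")   # where in the distribution does the treatment hit?
```

This is only meant to convey what a distribution-valued RDD is after: an effect curve over quantile levels rather than a single number at the cutoff; the paper's estimators handle the two layers of randomness properly and come with valid uniform inference.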
Why does this matter?
Not all treatment effects show up in the mean. In many real-world settings (from education to health to inequality research) the most meaningful changes occur in the spread, shape, or tails of a distribution. An intervention might compress inequality without raising average income, lift the bottom of the score distribution without changing the top, or reduce risk exposure at the extreme end of a health outcome. Traditional RDDs miss these effects entirely. What David shows is that we do not need to flatten rich outcome structures into single numbers. His proposed method lets us treat distributions as what they are: distributions, with estimators that are both theoretically grounded and visually intuitive. It gives us a way to ask: Where is the effect happening? And how does the shape change across the cutoff? This is especially valuable for fields that care about inequality, heterogeneity, or tail risk (not just the average). And the fact that the method "nests" easily within familiar RDD logic makes it much more usable than it first sounds.
Who should care?
Basically anyone working with outcomes that are more than just averages, but more specifically:
Education researchers analyzing test score distributions across schools or students (that would be me and a lot of you).
Health economists looking at risk distributions, not just mean outcomes.
Public finance folks studying income inequality or top-end wealth effects.
Applied microeconomists using RDDs who are somewhat frustrated by their “scalar-only” worldview.
Do we have code?
We have a package! I do not know how he did this without RAs because it seems like a lot of work, so kudos to him! (By the way, do you have a moment to talk about our Lord and Saviour of R packages, Hadley Wickham?)
The R package r3d implements both the naïve and Fréchet/Wasserstein estimators, along with tools for plotting and inference. With it you can estimate LAQTEs using either local polynomial or Fréchet regression, generate quantile function plots that show where the treatment effect lies across the distribution, and construct uniform confidence bands for inference.
In sum, this paper is a reminder that not all outcomes fit in a spreadsheet cell. When your data come as distributions (not just point estimates) you need tools that treat them that way. David's proposed R3D framework gives us exactly that: a method that respects the structure of complex outcomes, builds on familiar RDD logic, and opens the door to asking richer questions about where effects happen, not just if they do. If you are working with quantiles, histograms, or any outcome with internal shape, you should check this paper out. And it comes with code, plots, and clean theory. What's not to like?
A Practical Guide to Estimating Conditional Marginal Effects: Modern Approaches
(I particularly enjoyed reading this one. It is a "self-explanatory" guide and the authors do their best to elucidate concepts step by step, but if you are not familiar with stats notation I suggest you try to grasp the idea of what they are trying to do first. Familiarity with R is essential, but the code chunks are very easy to follow. I also learned that "kernel" means "core", "center", "basis". The more you know!)
TL;DR: This guide introduces robust methods for estimating how treatment effects vary with a moderating variable ("conditional marginal effects," or CME). Traditional approaches like linear interaction models often rely on unrealistic assumptions, suffer from lack of overlap, and impose overly rigid functional forms. The authors then present a progression of methods, from parametric to more flexible ML approaches, that address these limitations while preserving statistical validity. They clearly define the CME estimand, enhance semi-parametric kernel estimators (a method that uses smooth curves to capture patterns without assuming a strict relationship between variables), and introduce modern, robust techniques like AIPW-LASSO (which combines weighting and model selection for added robustness) and Double Machine Learning (which uses ML to estimate treatment and outcome models separately, then combines them to reduce bias). The paper also offers practical recommendations based on simulations and real-world examples, and all methods are implemented in the interflex R package.
(Before we move forward, I just wanted to say that there is one formal definition of CME in the causal inference literature - as defined in this guide - but the term "marginal effect" or "conditional marginal effect" is sometimes used differently in statistical modeling contexts, especially in GLMs or GLMMs. So the label might be the same, but the meaning depends on the framework. Here, CME is defined as CME(x) = E[Y(1) − Y(0) | X = x] for binary treatment, or CME(x) = E[∂Y(d)/∂d | X = x] for continuous treatment. It answers "what is the average treatment effect at a specific value of the moderator X?", which is a causal estimand grounded in the potential outcomes framework.)
What is this paper about?
In this paper the authors tackle a fundamental and increasingly relevant challenge in applied research: how to estimate treatment effects that vary across subgroups or contexts, formally known as CME. In many fields, from political science to economics and public policy, we want to understand not just whether a policy or intervention works, but for whom and under what conditions it works best. Unfortunately, the tools available to answer these questions are often limited, outdated, or misused.
Nowadays most applied researchers turn to linear interaction models (which assume that the effect of a treatment changes in a linear way with a moderating variable → this variable tells us "for whom" or "under what conditions" the treatment has a stronger or weaker effect; it is the one we condition on to examine how treatment effects vary). These models are nice because they are simple to estimate and interpret. But in practice, they come with serious limitations: they often rely on strong, mostly unrealistic assumptions about the shape of relationships, while also being highly sensitive to model misspecification (e.g., omitted interactions and nonlinearity) → leading to misleading conclusions. They also sometimes fail to assess whether there is enough overlap in the data (i.e., whether treated and control groups are comparable across the range of the moderator → lack of common support), and they typically lack a clear connection to the causal estimands we care about. Who has not struggled to answer seminar questions about nonlinearity and high-dimensional covariates? Just add a polynomial to your already-kitchen-sink regression and it should work (not).
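To see what the guide is pushing against, here is the linear interaction workhorse in a few lines (a minimal sketch with simulated data; the names Y, D, X, Z are mine, not the guide's):

```r
# Simulated data; Y outcome, D binary treatment, X moderator, Z covariate
set.seed(1)
n <- 1000
X <- runif(n, -2, 2); Z <- rnorm(n); D <- rbinom(n, 1, 0.5)
Y <- 1 + 2 * D + 0.5 * X + 1.5 * D * X + Z + rnorm(n)   # true CME(x) = 2 + 1.5x

fit <- lm(Y ~ D * X + Z)
b   <- coef(fit)
cme_at <- function(x) unname(b["D"] + b["D:X"] * x)     # implied CME at moderator value x
cme_at(c(-1, 0, 1))                                     # effect of D at three values of X
```

By construction, the implied CME can only ever be a straight line in the moderator, no matter what the data say, which is exactly the rigidity the more flexible estimators relax.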
These limitations are especially problematic in observational studies, where researchers do not have the “luxury” of randomized assignment and must rely on modeling assumptions to recover causal effects.
To address these challenges, the authors build on prior work by Hainmueller, Mummolo, and Xu (2019) (Xu is also an author of this guide), who introduced semiparametric kernel estimators (SKE from here onwards → statistical methods that relax functional form restrictions by using local weighted averaging to estimate relationships between variables; a middle ground between fully parametric models - with strict assumptions - and nonparametric approaches - with no functional form assumptions) to allow for more "flexible" relationships between treatment (D), moderators (X), and outcomes (Y). But even those methods left key gaps in terms of clarity, robustness, and ease of use. In this guide they clearly define the CME estimand, aligning what researchers want to estimate with how they estimate it.
What do the authors do?
The authors develop a comprehensive and unified framework for estimating CMEs, that is, how the effect of a treatment varies with a moderating variable X, holding other covariates constant. A key contribution of the paper is to clarify what researchers are trying to estimate when they talk about treatment effect heterogeneity. As I wrote in the introduction, the authors define the CME as a special case of the conditional average treatment effect (CATE) for both binary and continuous treatments. For both cases they rely on the assumptions of Unconfoundedness and Strict Overlap (given Random Sampling and SUTVA assumptions).
Next they propose a progression of estimation strategies, from simpler ones to more flexible ones, aka a tiered approach, which allows us to choose an estimator that matches our data structure, research question, and level of complexity.
In Chapter 2 they go through classical approaches to estimating and visualizing the CME (with examples from applied papers), while "emphasizing the importance of clearly defining the estimand and explicitly stating identifying and modeling assumptions". They address the issue of multiple comparisons by introducing uniform CIs constructed via bootstrapping. They propose diagnostic tools to detect lack of common support and model misspecification and discuss strategies to minimize them. Finally, they introduce an SKE to relax the functional form assumptions (I love to say "functional form") of linear interaction models. They also summarize the challenges and how to solve them.
In Chapter 3 the authors aim to solve the limitations of Chapter 2 (the fact that SKEs still rely solely on outcome modeling and may struggle with nonlinearities or complex interactions in covariates) by introducing Augmented Inverse Propensity Weighting (AIPW) and its extensions (especially the AIPW-LASSO approach) to improve the robustness, efficiency, and flexibility of CME estimation. AIPW, introduced by Robins, Rotnitzky, and Zhao (1994), improves on IPW by combining the outcome model and the treatment assignment model, making it doubly robust, that is, it yields consistent estimates as long as either model is correctly specified. (We have been through this before in the previous post, so you know how valuable this is, especially in observational settings where modeling assumptions are hard to verify.) When both models are correctly specified, AIPW achieves the lowest asymptotic variance within the class of IPW estimators. However, as Li, Thomas, and Li (2019) show, in finite samples, its performance depends heavily on the degree of overlap. If overlap is poor and propensity scores are extreme, AIPW may actually underperform compared to simpler outcome-based estimators. The authors then introduce the idea of "signals" - transformed versions of the data that isolate the part relevant for estimating the CME. They propose a three-step estimation strategy based on these signals that is intuitive and straightforward to implement. They close the chapter with Section 3.4, which focuses on inference (a step that is often overlooked when we get excited about flexible estimators). Because the AIPW approach involves multi-step estimation (outcome modeling, propensity scores, signal construction, and smoothing), getting valid confidence intervals is non-trivial. The authors propose bootstrap-based procedures to construct both pointwise and uniform CIs for the CME. We know that uniform intervals are especially helpful when visualizing treatment effect heterogeneity across the range of the moderator, since they adjust for multiple comparisons and avoid misleading "significance at a glance" conclusions. This emphasis on valid inference ensures that the flexibility gained by AIPW-LASSO or kernel smoothing does not come at the cost of statistical rigor, something applied researchers will appreciate when facing skeptical referees (or seminar rooms).
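To make the "signal" idea less abstract, here is a back-of-the-envelope sketch of the generic AIPW pseudo-outcome for a binary treatment, smoothed over the moderator. This is the textbook doubly robust construction with simulated data of my own, not the guide's exact AIPW-LASSO implementation (which interflex handles for you):

```r
# Conceptual AIPW pseudo-outcome ("signal") for a binary treatment, then
# smoothed over the moderator X. Not the guide's AIPW-LASSO implementation.
set.seed(1)
n <- 2000
X <- runif(n, -2, 2); Z <- rnorm(n)
D <- rbinom(n, 1, plogis(0.5 * X + 0.5 * Z))            # confounded treatment
Y <- 1 + 2 * D + 0.5 * X + 1.5 * D * X + Z + rnorm(n)   # true CME(x) = 2 + 1.5x
df <- data.frame(Y, D, X, Z)

m1 <- lm(Y ~ X + Z, data = df[df$D == 1, ])             # outcome model, treated
m0 <- lm(Y ~ X + Z, data = df[df$D == 0, ])             # outcome model, control
ps <- glm(D ~ X + Z, data = df, family = binomial)      # propensity score model

mu1 <- predict(m1, newdata = df); mu0 <- predict(m0, newdata = df)
e   <- predict(ps, newdata = df, type = "response")

# doubly robust: consistent if either the outcome or the propensity model is right
df$signal <- (mu1 - mu0) +
  df$D * (df$Y - mu1) / e -
  (1 - df$D) * (df$Y - mu0) / (1 - e)

# smoothing the signal over the moderator traces out the CME
sm <- loess(signal ~ X, data = df)
plot(df$X, predict(sm)); abline(2, 1.5, col = "red")    # compare with the true CME
```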
In Chapter 4 (my favourite one!) they extend the AIPW framework into a more flexible and scalable setting using Double Machine Learning (DML). The key idea is to use machine learning algorithms (like random forests, neural nets, or gradient boosting) to estimate nuisance functions (the outcome and treatment models), while applying orthogonalization techniques to isolate the CME and preserve valid inference, which then allows us to model complex, high-dimensional data without overfitting or violating identification assumptions. The authors explain the partialling-out procedure, discuss cross-fitting to avoid overfitting, and go into more detail on how DML can be implemented using off-the-shelf ML tools. They also show how to plug DML into the same projection-and-smoothing framework introduced earlier, making it compatible with their general approach to CME estimation.
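And here is the partialling-out plus cross-fitting idea in miniature, using plain lm() as a stand-in for the ML learners, so this toy version only recovers a constant effect; the guide's DML approach then projects onto the moderator to get the full CME, which again interflex takes care of:

```r
# Partialling-out + 2-fold cross-fitting in miniature, with lm() standing in
# for the ML learners (so we only recover a constant effect here).
set.seed(1)
n <- 2000
X <- runif(n, -2, 2); Z <- rnorm(n)
D <- rbinom(n, 1, plogis(0.5 * X + 0.5 * Z))
Y <- 1 + 2 * D + 0.5 * X + Z + rnorm(n)        # constant treatment effect = 2
df <- data.frame(Y, D, X, Z)

fold <- sample(rep(1:2, length.out = n))       # split the sample into two folds
resY <- resD <- numeric(n)
for (k in 1:2) {
  train <- df[fold != k, ]; test <- df[fold == k, ]
  # nuisance functions fit on the other fold (swap lm() for a forest/boosting/NN)
  resY[fold == k] <- test$Y - predict(lm(Y ~ X + Z, data = train), newdata = test)
  resD[fold == k] <- test$D - predict(lm(D ~ X + Z, data = train), newdata = test)
}
coef(lm(resY ~ resD - 1))                      # orthogonalized effect estimate, ~2
```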
In Chapter 5 we get to work! After introducing the "full menu" of estimation strategies (linear models, kernel estimators, AIPW-LASSO, and DML) the authors turn to the obvious question: which method should I use? That is exactly what Chapter 5 addresses through a series of Monte Carlo simulation studies designed to compare the methods under different conditions. The authors simulate data from three different Data Generating Processes (DGPs) that vary in complexity: from linear to nonlinear covariate effects and finally to a high-dimensional, highly nonlinear scenario. Each simulation evaluates how well the different estimators recover the true CME, and how their performance changes with sample size, model complexity, and hyperparameter tuning.
The first DGP is fairly simple: the CME is nonlinear, but covariates enter the outcome model linearly. The goal here is to assess how well each estimator captures a quadratic treatment effect when most of the rest of the model is well-behaved. They find that kernel and AIPW-LASSO both perform well when the sample size is moderate (n ≥ 1,000), that DML methods (e.g., DML-neural networks, DML-histogram gradient boosting) need larger samples (n ≥ 5,000) to reach similar accuracy, and that linear models are clearly misspecified here and exhibit high bias; in terms of confidence intervals, kernel and AIPW-LASSO produce narrower bands with better coverage at smaller sample sizes compared to DML.
In the second DGP, things get more complex. Now, other covariates (not just the moderator) enter the outcome model nonlinearly, through functions like exp(2Z + 2). This setup challenges estimators that assume or rely on linearity in covariates. They find that kernel estimators break down in this setting, because they assume linearity in non-moderator covariates; AIPW-LASSO and DML methods handle this complexity much better, flexibly approximating nonlinearities through basis expansions (AIPW) or ML models (DML); even at n = 1,000, AIPW-LASSO produces stable and accurate estimates; DML improves substantially with n ≥ 3,000, but struggles in small samples due to the variance introduced by flexible learning.
In the final study they explore a more realistic and challenging DGP with four additional covariates, nonlinear interactions, and a highly nonlinear propensity score model. It also compares the effect of using default vs. tuned hyperparameters in DML models (neural nets, random forests, histogram gradient boosting). They find that tuning matters, especially for neural networks (NN), where cross-validation significantly improves CME estimates; histogram gradient boosting (HGB) performs well at larger sample sizes, but tuning does not help much beyond defaults; random forest (RF) does not improve much with tuning and generally underperforms compared to NN and HGB; and computational cost increases substantially with tuning, raising the classic trade-off: better fit vs. longer runtime.
The final take-aways are:
When the CME is nonlinear but covariates are linear, kernel and AIPW-LASSO work well even in small to moderate samples.
When covariates are also nonlinear, kernel breaks down, and AIPW-LASSO or DML are needed.
AIPW-LASSO is the best all-rounder (yey!), especially when sample sizes are limited and you want flexibility without the computational burden of full DML.
DML is powerful, but only if you have lots of data (n ≥ 5,000, usually not the case for most observational studies, but macro people will be happy) and can afford to tune models carefully.
Tuning DML learners (especially neural nets) improves accuracy, but not always dramatically, and it takes time.
Why is this guide important?
Understanding how treatment effects vary across individuals or contexts is at the ~kernel~ of applied social science research. Whether you are studying the impact of a job training program, a public health intervention, or a policy reform, knowing for whom and under what conditions the intervention works is often more important than knowing whether it works on average.
This guide is a goldmine! Their proposed framework gives us the tools to avoid common analytical pitfalls, such as extrapolating treatment effects to regions without common support or imposing rigid linear assumptions when the true relationship may be more complex. By presenting a progression from linear models to semiparametric and fully machine-learning-based approaches, their guide gives us a toolbox that can be tailored to our data's structure and complexity. It is not just about estimating effects; it is about trusting them (by properly constructing valid pointwise and uniform confidence intervals, handling multiple comparisons, and diagnosing violations of overlap or model misspecification). The authors show how recent advances in DML can be applied in causal settings without sacrificing interpretability or statistical rigor, something that most economists are worried about when they hear "machine learning". This helps make cutting-edge methods usable by applied researchers, not just methodologists.
Who should care?
Pretty much everybody (even the macroeconomists)!
Applied researchers studying heterogeneous treatment effects across social science fields like political science, economics, sociology, education, and public policy.
Methodologists working at the intersection of causal inference and ML, especially the ones interested in effect heterogeneity and flexible estimation strategies.
Policymakers and practitioners who need to understand whether, how, and for whom an intervention works, beyond just headline average effects (they might not need to know the details, though).
Data scientists working with complex or observational data who want to go beyond basic treatment-control comparisons and explore nuanced causal relationships.
Do they have code?
Yes! They even have a package (we love packages): interflex for R. With interflex, you can (see the minimal sketch after this list):
Implement various estimators (linear, kernel, AIPW-LASSO, DML) with straightforward function calls.
Generate diagnostic plots to check overlap, functional form assumptions, and model misspecification.
Visualize CMEs with appropriate CIs.
Conduct sensitivity analyses to test the robustness of findings.
And even export plots and summaries for inclusion in papers or presentations.
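Here is roughly what a call looks like. This is a minimal sketch written from memory, so double-check the argument names and estimator options against the interflex documentation; df, Y, D, X, Z1, Z2 are placeholders for your own data:

```r
# Minimal sketch (from memory; double-check names against the interflex docs).
# df is your data frame; "Y", "D", "X", "Z1", "Z2" are column names in it.
# library(interflex)
# out <- interflex(estimator = "kernel",   # the AIPW/DML estimators are selected
#                  data = df,              # the same way; see the docs for the
#                  Y = "Y", D = "D",       # exact option names
#                  X = "X", Z = c("Z1", "Z2"))
# plot(out)                                # CME across the moderator, with CIs
```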
In sum, I had a great time going through this guide. If you have ever sat through a seminar, been grilled on nonlinearities, or wondered whether your interaction term was doing anything meaningful, this one is for you. Whether you are team kernel, curious about LASSO, or finally ready to get into DML, this guide gives you a roadmap grounded in strong theory and practical recommendations. And best of all, it comes with code. So yes, CMEs may sound intimidating at first, but by the end of this guide, they are just a way of answering the question we care about most: who benefits, and how?