On Finding Causes vs. Finding Levers
What are you actually trying to say?
TL;DR: the credibility revolution improved our ability to find causes, and incentives push us toward novel causal stories. But a) that doesn’t tell us which causes explain the variation in outcomes (Cochrane); b) our methods identify local effects under specific conditions (SUTVA), not the structural drivers of a phenomenon (Galiani); c) we conflate “precise” with “important” (McCloskey–Ziliak) and commit Type C errors (Krämer); and d) when measured properly, many “significant” variables explain essentially zero deviation (Sterck). Conclusion: We must stop treating “cause” as synonymous with “lever.” A lever must be causally credible and have “oomph” (explain meaningful variation or be cost-effective).
Hi there!
This post is a bit different from the others, but I felt compelled to write it as someone who does research on education and runs this newsletter on a specific causal inference technique. Nothing I will say in this post is new in the sense that it hasn’t been discussed before, but I do hope to provide more perspective to those who are not “in the trenches” of these sorts of nuanced topics.
It all started with a post on Twitter/X about the effects of school spending on educational outcomes (discussion of heterogeneous treatment effects: almost absent, except maybe for this). On one side, some users pointed to specific RCTs in developing countries as evidence that increasing school funding increases test scores. On the other, some pointed to meta-analyses finding mixed results in developed countries. The biggest point of discussion was effect size and statistical significance.
I’m not going to weigh in on that specific example since it’s not the point of this post (also, Twitter/X is a better place for that type of discussion). Instead, I want to talk about the “causal inference revolution”, how its legacy might be amplified - positively or negatively - by the problems we will (already?) have (for example, when it comes to outsourcing the reviewing process to LLMs1), and circle back to the discussion on effect size and statistical significance. Admittedly, the referee example is a niche concern directly affecting a small circle of researchers. But since many of us go on to teach the present and next generations (and maybe some are out there reading this!), a nuanced view here could do wonders to spark their interest in what actually drives economic variation.
The debate also struck a chord because it touches on something I love to explore (thanks, Prof Frank, for asking me a very relevant question at my first-ever PhD seminar): the mechanisms and channels behind a causal claim. They are the pathways through which the cause creates an effect. The “cause” (e.g., spending) doesn’t “occur” through the mechanism; it “operates” through the mechanism. The mechanism can be of all sorts, like increasing teacher training or providing meals at no direct cost to the students. It’s one thing to say “spending is a cause”, but we should not stop there. We need to think about the mechanisms and channels through which it ~effectively~ works.
Let’s call this the difference between finding a “cause” and finding a “lever”. But first, a bit of context.
How “credibility” became the new gold standard
I’m not sure if it was an “orchestrated revolution” with a defined plan2, but something undeniably changed in empirical economics in the early 2000s. The “credibility revolution” of the ’90s and 2000s completely shifted our methodological goalposts. I would argue it wasn’t so much a planned success as a fundamental shift in standards.
Before, running a kitchen-sink OLS3 and hand-waving about endogeneity was common (the first thing you teach undergrads *not* to do). After, the new gold standard became clean identification4. The “Big Four” methods - RCTs, DiD, IV, and RDD - became the “gatekeepers” of truth5. In this sense, the movement won. It won the argument over what “credible” empirical work looks like, it gave us a powerful, trusted toolkit to find a “cause”, and in doing so it changed the actual output of the profession6.
This new toolkit also created a very powerful new incentive structure. We now know that “causal novelty” (finding a new relationship) and “causal narrative complexity” (using these sophisticated methods) are strong predictors of getting published in a Top 5 journal. But, anecdotally, it goes further down the line. Every week I get a few friends and colleagues asking me “how can I DiD this?”, “what should be my treatment group?”, “is it ok if I have around 20 treated units, shifting in and out of treatment, and 400 control units in some period?”. I find these questions fascinating because they make us all think more deeply about the limits of research questions and proper study design. But notice what is missing from these conversations.
We spend hours debating the feasibility of the identification strategy (“can we get the standard errors right?”, “is this even worth publishing?”) but we rarely stop to ask: if this ~does~ work, will the effect size be large enough to matter?
And that’s where the problem starts. The revolution succeeded in teaching us how to find a cause. But in our quest to publish novel, statistically significant causal chains, we often forget to ask if we’ve found a lever7.
Why a cause is not a lever
So we have the toolkit. We can “confidently” say that X causes Y. But this brings us back to the Twitter debate and that 0.056 SD effect (a number which I made up, by the way, and got smart pushback on). The problem happens when we assume finding a cause is the same as finding a lever. It isn’t.
Prof Cochrane - just a few days ago - cleverly articulated this in his essay “Causation Does Not Imply Variation”. From my understanding, his point was that you can rigorously prove a policy causes a change in an outcome, but that cause might account for almost none of the variation between success and failure in the real world. You can prove that school funding causes test scores to move (p < 0.05), but if the effect is small, it explains almost none of the difference between a struggling school and a thriving one.
Prof Galiani wrote a post after Prof Cochrane’s in which he takes this further, explaining why our methods fail us here. He reiterates that the “causal revolution” tools (DiD, RCTs, etc.) are built to answer a specific question - “what happens to Y when we induce an exogenous change in X?” - and that they are ~not~ built to answer THE broader question: “what are the main causes of variation in Y?”.
Prof Galiani points out a couple of fatal flaws in treating these “causes” as “levers”. In the first one, which is a “battle” between “local vs structural”, he says that our methods identify a total effect in a specific context rather than a universal structural parameter. We learn what happened there, not necessarily how the machine works everywhere. In the second one, characterized as “the SUTVA trap”, he says that these methods rely on SUTVA, which essentially assumes no spillovers. But true “levers” - like macro policies or institutional changes - are defined by their equilibrium effects and spillovers. By assuming them away to get a clean estimate, we often strip the “cause” of the very mechanism that would make it a “lever”8.
Statisticians and econometricians have been warning about this for decades, often using a vocabulary that is painfully absent from our current seminars. Profs McCloskey and Ziliak coined the term “Oomph” to describe “the difference a treatment makes”, arguing that our profession suffers from a disproportionate focus on “precision” (statistical significance) at the expense of “Oomph” (magnitude).
Prof Krämer formalized this critique. He argued that our obsession with precision has birthed a new category of scientific failure: the Type C Error (his nomenclature).
We are familiar with Type A errors (finding an effect where there is none) and Type B errors (missing a large effect because of noise). But a Type C Error occurs when there is a “small effect (no “oomph”), but due to precision it is highly “significant” and therefore taken seriously”. This is the (abstract) 0.056 SD problem. It is a Type C error masquerading as a scientific victory. We used our sophisticated “credibility” toolkit to maximize precision, driving our standard errors down to zero, which allowed us to put three stars next to a number that, for all practical purposes, is zero.
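To make the Type C error concrete, here is a minimal simulation (all numbers are my own toy assumptions, not taken from any of the papers above): a true effect of 0.02 SD, estimated on half a million observations, earns its three stars while explaining a vanishing share of the variance.

```python
# A minimal simulation of a Type C error (toy numbers): a real but tiny effect
# becomes "highly significant" once the sample is large enough.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500_000                          # a huge sample, as in large administrative data
x = rng.normal(size=n)               # some "novel" regressor
y = 0.02 * x + rng.normal(size=n)    # true effect: 0.02 SD, essentially negligible

model = sm.OLS(y, sm.add_constant(x)).fit()
print(f"coefficient: {model.params[1]:.3f}")    # ~0.02
print(f"p-value:     {model.pvalues[1]:.1e}")   # astronomically small -> three stars
print(f"R-squared:   {model.rsquared:.4f}")     # ~0.0004: ~0.04% of the variation
```

The stars are real; the oomph is not.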
We found a cause. We did not find a lever. A “lever” is a confirmed cause that possesses “Oomph”, meaning it explains a substantial portion of the actual variation in the outcome. But we need to be precise. A “lever” isn’t just a large coefficient in isolation. Drawing on Prof Sterck (2019), I’d argue a true lever is a variable that drives the variation we see in the real world. It is a factor that explains a meaningful percentage of the deviations in our outcome. So a 0.056 SD effect might be a “cause”, but if it explains only 0.1% of the variation in test scores, it is certainly not a lever.
This distinction seems clear “enough” in theory, but in practice - say, when we are staring at a regression table - it gets muddy. Where’s the “lever” column? Instead, we fall back on the standard metrics we were trained to trust, often without realizing they aren’t answering the question we think they are.
What do we even mean by “important”?
In the back of our heads there’s always a voice that goes “check N, R^2, effect size, and statistical significance”. I don’t think that’s bad, but it is superficial.
When we find a “cause”, how do we decide if it matters? We play around with terms like “economically significant”, but as Prof Sterck (2019) argued, the literature is surprisingly vague on what that really means9. We usually rely on standard tools - like standardized coefficients (the 0.056 SD!) or R^2 decompositions - but Prof Sterck shows these are often “flawed and misused”10. If our tools for measuring “importance” are not ideal, it’s no wonder we are left to rely on the one tool that gives us a clear binary answer and that all scientists (I am being charitable) understand: the p-value.
But this is where the “stars” can “blind” us. To see why, look at the fertility case study in Prof Sterck’s paper. He analyzes the determinants of fertility rates (specifically the number of births per woman) using a massive dataset (N > 490,000). Because the sample is so huge, almost everything is statistically significant. If you look at the regression output, you see stars next to ln(GDP/capita), land quality, distance to water, child mortality, age, and education. If you are hunting for “causes” (stars), you get lost in the sauce. The regression output shows that all of the aforementioned variables are “significant” causes. This makes it incredibly easy to engage in HARKing (Hypothesizing After the Results are Known). A researcher looking at those stars could pick any variable (e.g., distance to water) and justify a paper on it. They are all “causes”. It’s not even “data mining” or “p-hacking” in the traditional sense because no manipulation was necessary. The significance was guaranteed by the sample size; the error is in pretending that significance equals importance.
But… if you shift your thinking to look for “levers”, i.e., by asking both “is it non-zero?” and “how much of the deviation does this actually explain?”, the picture changes like night and day. To go back to his example: ln(GDP/capita) explains only 0.19% of the deviation (a cause, but not a lever), distance to water explains 0.73%, age explains 45.6%, and the woman’s education explains 10.0%. The “stars” told us all these variables were valid. But only by asking the right question - “does this variable actually drive the variation?” - do we see that GDP is a distraction in this context, while education is a massive policy lever.
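If it helps intuition, here is a rough sketch of that “lever check” on simulated data. To be clear, this is not Sterck’s (2019) own measure: the variable names only echo his example, the effect sizes are invented, and the naive variance-share decomposition below is only clean because the simulated regressors are independent of each other.

```python
# Every coefficient below earns its stars, but only two variables actually move the outcome.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 490_000
X = pd.DataFrame({
    "log_gdp_pc": rng.normal(size=n),   # hypothetical stand-ins, simulated independently
    "dist_water": rng.normal(size=n),
    "age":        rng.normal(size=n),
    "education":  rng.normal(size=n),
})
# Assumed effects chosen so that age and education dominate the variation in y.
y = (0.05 * X["log_gdp_pc"] + 0.10 * X["dist_water"]
     + 1.00 * X["age"] + 0.50 * X["education"] + rng.normal(size=n))

fit = sm.OLS(y, sm.add_constant(X)).fit()
for var in X.columns:
    share = fit.params[var] ** 2 * X[var].var() / y.var()   # crude variance share
    stars = "***" if fit.pvalues[var] < 0.01 else ""
    print(f"{var:12s} beta={fit.params[var]:+.3f}{stars}  variance share={share:6.2%}")
```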
This is the trap we fall into. We use the precision of our causal tools to find the 0.19% drivers, and because they have stars next to them, we treat them as if they matter just as much as the 10% drivers.
What we publish vs what we cite
This brings us back to the incentives. If we know that “levers” (like education in the example above) are what actually drive outcomes, why do we spend so much time hunting for and publishing tiny “causes”?
Because that is what the market *rewards*.
Garg and Fetzer (2025) find a stark divergence between what gets published (causes) and what creates impact (levers). Top 5 journals reward “novelty”: finding a new causal link, even a tiny, obscure one, is the path to acceptance. The long-term impact (citations), though, comes from engaging with “central, widely recognized concepts”.
We are incentivized to hunt for the 0.056 SD effect because it’s novel. But we cite the papers that discuss the big, central questions. I think both are somewhat fair ways to advance the science, and they are not mutually exclusive.
And as I briefly mentioned, this problem is likely to get worse. As AI-assisted tools are increasingly used in peer review, we risk amplifying this exact incentive structure. An LLM can be trained to find “novel causal claims” and check for the “right” methods, but it cannot ~easily~ be trained to understand “economic importance” (or can it???? I won’t bet against it), a concept we humans sometimes struggle to agree on since it’s highly context-dependent. An AI reviewer seems to be the ultimate “cause” hunter, as of today. I think it will likely reward the 0.056 SD effect as long as it’s novel and significant. It has no intuition for finding the “lever”. I want to be proven wrong.
As Prof. Krämer wrote, “Cheap t-tests... have in equilibrium a marginal scientific product equal to their cost”. AI makes those tests almost free. So, my point isn’t necessarily that we need a new statistic to replace the p-value or that we should downgrade it in terms of relative importance. My point is that we should be more careful and focus on the questions we ask during the design phase. Instead of asking “can I identify this?” or “is this novel?”, we need to be asking further: “if this *does* work, how much of the problem does it actually solve?”… if there’s even a problem to start with.
If we don’t start valuing the answer to that question, we are going to drown in statistically significant noise.
Already happening in other sciences. Good point here.
Anyone wanting to discuss this case against Kuhn’s definition, e-mail me!
“OLS” (Ordinary Least Squares) is the standard method for drawing a line of best fit through data. A “kitchen-sink” regression is when a researcher throws every available variable into the model (control variables) in hopes of isolating an effect. Endogeneity is a fatal flaw here: it happens when there is something else driving your result that you didn’t (or couldn’t) put in the “sink”. For example, if you find that “tutoring causes higher test scores” but you didn’t control for “parental motivation”, your result is endogenous, meaning it’s biased because motivated parents are likely the ones hiring the tutors.
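For readers who prefer to see it, here is a toy simulation of that exact story (made-up data and magnitudes): “parental motivation” drives both tutoring and scores, so leaving it out inflates the tutoring coefficient.

```python
# Omitted-variable bias in miniature: the true tutoring effect is 2.0 points.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000
motivation = rng.normal(size=n)                                   # unobserved confounder
tutoring = (motivation + rng.normal(size=n) > 0).astype(float)    # motivated parents hire tutors
scores = 2.0 * tutoring + 5.0 * motivation + rng.normal(size=n)   # true effect of tutoring: 2.0

naive = sm.OLS(scores, sm.add_constant(pd.DataFrame({"tutoring": tutoring}))).fit()
full = sm.OLS(scores, sm.add_constant(pd.DataFrame({"tutoring": tutoring,
                                                    "motivation": motivation}))).fit()
print(f"without the confounder: {naive.params['tutoring']:.2f}")  # biased far above 2.0
print(f"controlling for it:     {full.params['tutoring']:.2f}")   # ~2.0
```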
In econometrics, “identification” refers to the strategy used to try to strip away bias and identify the true causal link. A “clean” identification strategy mimics a laboratory experiment. For example, an RCT flips a coin to assign treatment; a Regression Discontinuity (RDD) compares people just above and just below an arbitrary cutoff (like a test score threshold for a scholarship); and DiD compares the change in a treated group to the change in a control group over time.
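And a minimal difference-in-differences sketch on simulated data (invented numbers, a toy two-group/two-period setup, not anyone’s actual design): the coefficient on the interaction term is the DiD estimate.

```python
# DiD in its simplest form: change in the treated group minus change in the control group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2_000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),   # 1 = treated group, 0 = control group
    "post":    rng.integers(0, 2, size=n),   # 1 = after the policy, 0 = before
})
# Permanent group gap (5), common time trend (2), true treatment effect (3).
df["y"] = (5 * df["treated"] + 2 * df["post"]
           + 3 * df["treated"] * df["post"] + rng.normal(size=n))

did = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(f"DiD estimate: {did.params['treated:post']:.2f}")   # ~3.0
```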
Garg and Fetzer (2025) note that “leading journals now prioritize studies employing these methods over traditional correlational approaches”, and they point out that these four specific methods are the ones that have seen “substantial growth” as the discipline shifted toward rigorous identification.
Garg and Fetzer (2025) show that the share of claims supported by these rigorous causal methods skyrocketed from about 4% in 1990 to nearly 28% in 2020.
A “cause” answers the question “does X affect Y?” (precision), a “lever” answers the question “how much of the difference between success and failure does X actually explain?” (importance).
He argues that we need both: the local identification of causal effects and models (comparative statics) to understand the “causal architecture” of a phenomenon. He also notes in an addendum that low partial R^2 isn’t bad per se, provided the intervention is cost-effective - which, by extension, aligns with our definition of a “lever”: it has to be worth pulling.
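A back-of-the-envelope version of “worth pulling”, with entirely invented numbers: a tiny effect can still win on cost-effectiveness grounds if it is cheap enough per student.

```python
# Hypothetical interventions A and B: effects in SD of test scores, cost per student in dollars.
effect_a, cost_a = 0.02, 1.0      # tiny effect, almost free
effect_b, cost_b = 0.25, 500.0    # big effect, expensive

print(f"A: ${cost_a / effect_a:,.0f} per SD gained")   # $50 per SD
print(f"B: ${cost_b / effect_b:,.0f} per SD gained")   # $2,000 per SD
```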
Standardized coefficients are tricky (not as straightforward as logs) because they depend heavily on the variance of your sample, making comparisons across studies a literal nightmare (good luck with your meta-analyses). R^2 measures (like Shapley values) are often computationally heavy and not intuitive, and they can attribute importance to irrelevant variables just because they are correlated with relevant ones (that is also one of the criticisms regarding the use of ML in Econ, specifically for large samples).
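To illustrate the first point, a small simulation (assumed numbers): the same raw slope of 2.0 shows up as roughly a 0.2 SD effect in one sample and a 0.7 SD effect in another, purely because x varies more in the second.

```python
# Same data-generating process, different spread of x, very different "SD effects".
import numpy as np

rng = np.random.default_rng(3)

def standardized_coef(x_sd, n=100_000, beta=2.0, noise_sd=10.0):
    x = rng.normal(scale=x_sd, size=n)
    y = beta * x + rng.normal(scale=noise_sd, size=n)
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # OLS slope, ~2.0 in both samples
    return slope * x.std(ddof=1) / y.std(ddof=1)       # the "standardized" coefficient

print(f"low-variance sample:  {standardized_coef(x_sd=1.0):.2f}")   # ~0.20
print(f"high-variance sample: {standardized_coef(x_sd=5.0):.2f}")   # ~0.71
```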
And “with large N you don’t need randomization”… :P


