On Finding Causes vs. Finding Levers
What are you actually trying to say?
Hi there!
This post is a bit different from the others, but I felt compelled to write it as someone who does research on education and runs this newsletter on a specific causal inference technique. Nothing I will say in this post is new in the sense that it hasn’t been discussed before, but I do hope to provide more perspective to those who are not “in the trenches” of these sorts of nuanced topics.
It all started with a post on Twitter/X about the effects of school spending on educational outcomes (discussion of heterogeneous treatment effects: almost absent, except maybe for this). On one side, some users pointed to specific RCTs in developing countries as evidence that increasing school funding increases test scores. On the other, some pointed to meta-analyses finding mixed results in developed countries. The biggest point of contention was effect size and statistical significance.
I’m not going to weigh in on that specific example since it’s not the point of this post (also, Twitter/X is a better place for that type of discussion). Instead, I want to talk about the “causal inference revolution”, how its legacy might be amplified - positively or negatively - by the problems we will have (or already have? For example, when it comes to outsourcing the reviewing process to LLMs1), and then circle back to the discussion on effect size and statistical significance. Admittedly, the referee example is a niche concern directly affecting a small circle of researchers. But since many of us go on to teach the present and next generations (and maybe some are out there reading this!), a nuanced view here could do wonders to spark their interest in what actually drives variation.
The debate also struck a chord because it touches on something I love to explore (thanks, Prof Frank, for asking me a very relevant question at my first ever PhD seminar): the mechanisms and channels behind a causal claim. They are the pathways through which the cause creates an effect. The “cause” (e.g., spending) doesn’t “occur” through the mechanism; it “operates” through the mechanism. The mechanism can be of all sorts, like increasing teacher training or providing meals at no direct cost to the students. It’s one thing to say “spending is a cause” (identification). It is entirely another to ask whether spending is the primary driver of student success (decomposition). We need to distinguish between identifying a causal link and finding a policy “lever”: a factor that actually accounts for the variation we see in the world.
How “credibility” became the new gold-standard
I’m not sure if it was a “coordinated revolution” with a defined plan2, but something undeniably changed in empirical economics in the early 2000s. The “credibility revolution” of the 90’s and 2000’s completely shifted our methodological goalposts. I would argue it wasn’t so much a planned success as it was a fundamental shift in standards.
Before, running a kitchen-sink OLS3 and hand-waving about selection bias and omitted variables was common (the first thing you teach undergrads *not* to do). After, the new gold standard became clean identification4. The “Big Four” methods - RCTs, DiD, IV, and RDD - became the “gatekeepers” of truth5. In this sense, the movement won. It won the argument over what “credible” empirical work looks like, it gave us a powerful, trusted toolkit for finding a “cause”, and it has therefore changed the actual output of the profession6.
This new toolkit also created a very powerful new incentive structure. We now know that “causal novelty” (finding a new relationship) and “causal narrative complexity” (using these sophisticated methods) are strong predictors of getting published in a Top 5 journal. But, anecdotally, it goes further down the line. Every week I get a few friends and colleagues asking me “how can I DiD this?”, “what should be my treatment group?”, “is it ok if I have around 20 treated units, shifting in and out of treatment, and 400 control units in some period?”. I find these questions fascinating because they make us all think deeper about the limits of research questions and proper study design. But notice what is missing from these conversations.
We spend hours debating the feasibility of the identification strategy (“can we get the standard errors right?”, “is this even worth publishing?”) but we rarely stop to ask: if this ~does~ work, will the effect size be large enough to matter?
And that’s where the problem starts. The revolution succeeded in teaching us how to find a cause. But in our quest to publish novel, statistically significant causal chains, we often forget to ask if we’ve found a lever7.
Why a cause is not a lever
So we have the toolkit. We can “confidently” say that X causes Y. But this brings us back to the Twitter debate and that 0.056 SD effect (a number which I made up, by the way, and got smart pushback on). The problem happens when we assume finding a cause is the same as finding a lever. It isn’t.
Prof Cochrane - just a few days ago - cleverly articulated this in his essay “Causation Does Not Imply Variation”. From my understanding, his point was that you can rigorously prove a policy causes a change in an outcome, but that cause might account for almost none of the variation between success and failure in the real world. You can prove that school funding causes test scores to move (p < 0.05), but if the effect is small, it explains almost none of the difference between a struggling school and a thriving one.
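To make that concrete, here is a minimal simulation sketch. All the numbers are made up by me (they are not taken from any of the studies in the Twitter thread): a “spending” variable with a true 0.05 SD effect on scores, estimated on a very large sample. The effect is real and the p-value is essentially zero, yet it accounts for a fraction of a percent of the variation in the outcome.

```python
# A minimal sketch of "causation without variation" (hypothetical numbers,
# not from any real study): a true but tiny causal effect of spending on
# scores, estimated very precisely on a big sample.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000                                    # administrative-data-sized sample
spending = rng.normal(size=n)                  # standardized "school spending"
scores = 0.05 * spending + rng.normal(size=n)  # true effect: 0.05 SD

fit = sm.OLS(scores, sm.add_constant(spending)).fit()
print(fit.params[1], fit.pvalues[1])   # ~0.05, p-value effectively zero
print(fit.rsquared)                    # ~0.0025: about 0.25% of the variation
```

The stars are fully earned; the share of variation explained is still tiny.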
Prof Galiani wrote a post after Prof Cochrane’s in which he takes this further, explaining why our methods fail us here. He reiterates that the “causal revolution” tools (DiD, RCTs, etc.) are built to answer a specific question - “what happens to Y when we induce an exogenous change in X?” - and that they are ~not~ built to answer THE broader question: “what are the main causes of variation in Y?”.
Prof Galiani points out a couple of limitations in treating these “causes” as “levers”. The first is a “battle” of “local vs structural”: our methods identify a total effect in a specific context rather than a universal structural parameter. This focus on the “snapshot” is likely to obscure dynamic levers: a small effect size might look dismissible in a static regression, but if it captures a structural parameter that compounds over time (e.g., a daily learning gain), it can be massive. We learn what happened there, not necessarily how the machine works over time and everywhere. The second is “the SUTVA trap”: these methods rely on SUTVA, which essentially assumes no spillovers. But true “levers” - like macro policies or institutional changes - are defined by their equilibrium effects and spillovers. By assuming them away to get a clean estimate, we often strip the “cause” of the very mechanism that would make it a “lever”8.
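On the “snapshot vs. structural parameter” point, here is a deliberately stylized back-of-the-envelope (the numbers and the multiplicative compounding are my own assumptions, not Galiani’s): a per-period gain that would be dismissed as tiny in a one-shot regression adds up to something large once it compounds.

```python
# Stylized compounding sketch (hypothetical numbers): a small per-week
# learning gain, assumed to compound multiplicatively over three school years.
gain_per_week = 0.005               # 0.5% improvement per school week (made up)
weeks = 36 * 3                      # ~36 school weeks per year, three years
cumulative = (1 + gain_per_week) ** weeks - 1
print(f"{cumulative:.0%}")          # ~71% cumulative gain
```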
Statisticians and econometricians have been warning about this for decades, often using a vocabulary that is painfully absent from our current seminars. Profs McCloskey and Ziliak coined the term “Oomph” to describe “the difference a treatment makes”, arguing that our profession suffers from a disproportionate focus on “precision” (statistical significance) at the expense of “Oomph” (magnitude).
This outcome shouldn’t surprise us. Nearly 40 years ago, sociologist Peter Rossi formulated the “Stainless Steel Law of Evaluation”9: “The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero”. The credibility revolution gave us the stainless steel tools Rossi predicted; we shouldn’t be shocked that they are doing exactly what he said they would: revealing that many well-meaning interventions don’t work.
Prof Krämer formalized this critique. He argued that our obsession with precision has birthed a new category of scientific failure: the Type C Error (his nomenclature).
We are familiar with Type A errors (finding an effect where there is none) and Type B errors (missing a large effect because of noise). But a Type C Error occurs when there is a “small effect (no “oomph”), but due to precision it is highly “significant” and therefore taken seriously”. This is the (abstract) 0.056 SD problem. It is a Type C error masquerading as a scientific victory. We used our sophisticated “credibility” toolkit to maximize precision, driving our standard errors down to zero, which allowed us to put three stars next to a number that, for all practical purposes, is zero.
We found a cause. We did not find a lever. A “lever” is a confirmed cause that possesses “Oomph”, meaning it explains a substantial portion of the actual variation in the outcome. But we need to be precise. A “lever” isn’t a large coefficient in isolation. Building on Prof Sterck (2019), I would argue a true lever is a variable that drives the variation we see in the real world. It is a factor that explains a meaningful percentage of the deviations in our outcome. So a 0.056 SD effect might be a “cause”, but if it explains only 0.1% of the variation in test scores, it is certainly not a lever.
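As a rough back-of-the-envelope (with assumptions I am adding myself: a binary treatment with a 50/50 split and an outcome measured in standard deviations), you can see why an effect of that size lands in the 0.1%-of-variation ballpark:

```python
# Back-of-the-envelope: share of outcome variance explained by a 0.056 SD
# effect, assuming a 50/50 binary treatment and a standardized outcome.
beta = 0.056                       # treatment effect in SD units
var_treatment = 0.5 * 0.5          # variance of a 50/50 binary indicator
share = beta**2 * var_treatment    # ~ beta^2 * Var(X) / Var(Y), with Var(Y) = 1
print(f"{share:.3%}")              # ~0.078% of the variation in the outcome
```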
This distinction seems clear “enough” in theory, but in practice - say, when we are staring at a regression table - it gets muddy. Where’s the “lever” column? Instead, we fall back on the standard metrics we were trained to trust, often without realizing they aren’t answering the question we think they are.
What do we even mean by “important”?
In the back of our heads there’s always a voice that goes “check N, R^2, effect size, and statistical significance”. I don’t think that’s bad, but it is superficial.
When we find a “cause”, how do we decide if it matters? We play around with terms like “economically significant”, but as Prof. Sterck (2019) argued, the literature is surprisingly vague on what that really means10. We usually rely on standard tools - like standardized coefficients (the 0.056 SD!) or R^2 decompositions - but Prof Sterck shows these are often “flawed and misused”11 (standardized coefficients, for instance, are sensitive to sample variance, making them unreliable for comparing effects across different populations). If our tools for measuring “importance” are not ideal, it’s no wonder we are left to rely on the one tool that gives us a clear binary answer and that all scientists (I am being charitable) understand: the p-value.
But this is where the “stars” can “blind” us. To see why, look at the fertility case study in Prof Sterck’s paper. He analyzes the determinants of fertility rates (specifically the number of births per woman) using a massive dataset (N > 490,000). Because the sample is so huge, almost everything is statistically significant. If you look at the regression output, you see stars next to ln(GDP/capita), land quality, distance to water, child mortality, age, and education. If you are hunting for “causes” (stars), you get lost in the sauce.

The regression output shows that all of the aforementioned variables are “significant” causes. This makes it incredibly easy to engage in HARKing (Hypothesizing After the Results are Known). A researcher looking at those stars could pick any variable (e.g., distance to water) and justify a paper on it. They are all “causes”. It’s not even “data mining” or “p-hacking” in the traditional sense, because no manipulation was necessary. The significance was guaranteed by the sample size - which is a feature, not a bug, as it gives us precise estimates (though, as I’ve argued before, modern ML tools are better suited to handle this “star-gazing” problem in high-dimensional data than standard OLS); the error is in pretending that significance equals importance.

Of course, context dictates the threshold for “Oomph”. If the outcome is “survival rates”, a 0.056 SD effect *is* a massive victory. But for continuous variables like test scores or wages, where we are trying to explain inequality or performance gaps, a 0.056 SD effect without a cost-benefit argument should be taken with a grain of salt.
But… if you shift your thinking to look for “levers”, i.e., by asking both “is it non-zero?” and “how much of the deviation does this actually explain?”, the picture changes completely. To go back to his example: ln(GDP/capita) explains only 0.19% of the deviation (a cause, but not a lever), distance to water explains 0.73%, age explains 45.6%, and women’s education explains 10.0%. The “stars” told us all these variables were valid. But only by asking the right question - “does this variable actually drive the variation?” - do we see that GDP is a distraction in this context, while education is a massive policy lever.
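Here is a sketch of what a “lever column” next to the usual regression table could look like. The data below are simulated to loosely mimic the pattern above (one big driver, one moderate one, one trivial one), and the leave-one-out drop in R^2 is just one of several possible decompositions - it is not Sterck’s dataset or his exact metric.

```python
# Simulated "lever column": for each regressor, report the drop in R^2 when
# it is left out, next to the usual (all-significant) coefficients.
# Purely illustrative data; not Sterck's dataset or his exact decomposition.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
age, education, gdp = rng.normal(size=(3, n))
# Hypothetical data-generating process: age is the big driver, GDP barely matters.
outcome = 0.9 * age + 0.4 * education + 0.03 * gdp + rng.normal(size=n)

X = np.column_stack([age, education, gdp])
names = ["age", "education", "ln(GDP/capita)"]
full_r2 = sm.OLS(outcome, sm.add_constant(X)).fit().rsquared

for i, name in enumerate(names):
    reduced = sm.add_constant(np.delete(X, i, axis=1))
    drop = full_r2 - sm.OLS(outcome, reduced).fit().rsquared
    # At this N every coefficient is "significant"; the shares are wildly unequal.
    print(f"{name:>15}: ~{drop:.2%} of the variation")
```

With uncorrelated regressors this drop is essentially each variable’s variance share; with correlated ones you would want something like the Shapley-based R^2 decompositions mentioned in the footnotes, with their own caveats.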
This is the trap we fall into. We use the precision of our causal tools to find the 0.19% drivers, and because they have stars next to them, we treat them as if they matter just as much as the 10% drivers.
What we publish vs what we cite
This brings us back to the incentives. If we know that “levers” (like education in the example above) are what actually drive outcomes, why do we spend so much time hunting for and publishing tiny “causes”?
Because that is what the market *rewards*.
Garg and Fetzer (2025) find a stark divergence between what gets published (causes) and what creates impact (levers). Top 5 journals reward “novelty”: finding a new causal link, even a tiny, obscure one, is the path to acceptance. The long-term impact (citations), though, comes from engaging with “central, widely recognized concepts”.
We are incentivized to hunt for the 0.056 SD effect because it’s novel. But we cite the papers that discuss the big, central questions. I think both are somewhat fair ways to advance the science, and not mutually exclusive. We need architects to design the building (structural/levers) and plumbers to fix the pipes (causal/identification). The problem arises when we only reward the plumbers but expect them to explain the architecture. Prof. Megan Stevenson (2023)12 argues that this expectation stems from a mistaken “Engineer’s View” of the world - the belief that society is a machine where we can isolate and pull specific levers for predictable results. But the social world is actually full of “stabilizers”: forces that push people back onto their original trajectories after a small, limited-in-scope intervention. A small “cause” (like a job training program) rarely triggers a “cascade” of success because structural forces dampen the effect. Finding a statistically significant effect in a vacuum ignores the stabilizing forces that make that effect irrelevant in the aggregate.
And as I briefly mentioned, this problem is likely to get worse. As AI-assisted tools are increasingly used in peer review, we risk amplifying this exact incentive structure. An LLM can be trained to find “novel causal claims” and check for the “right” methods, but it cannot ~easily~ be trained to understand “economic importance” (or can it? I won’t bet against it), a concept we humans sometimes struggle to agree on since it’s highly context-dependent. An AI reviewer seems to be the ultimate “cause” hunter, as of today. I think it will likely reward the 0.056 SD effect as long as it’s novel and significant. It has no intuition for finding the “lever”. LLMs are trained on the corpus of published literature - a literature that systematically selects for significance over importance. The AI will likely entrench, rather than correct, this bias. But I want to be proven wrong.
As Prof. Krämer wrote, “Cheap t-tests... have in equilibrium a marginal scientific product equal to their cost”. AI makes those tests almost free. My point isn’t that we should abandon the p-value, but that we must be far more rigorous about the questions we ask during the design phase. Instead of stopping at “can I identify this?” or “is this novel?”, we need to ask: “does this variable explain the variation?”. If it works, how much of the problem does it actually solve? Practically, this means reporting variance decomposition alongside our regression tables, or justifying small effects with explicit cost-benefit or compounding arguments. If we don’t start valuing the answers to these questions, we are going to drown in statistically significant noise.
Further readings:
Angrist, J. D., & Pischke, J. S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3-30.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424-455.
Imbens, G. W., & Angrist, J. D. (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica, 62(2), 467–475.
Already happening in other sciences. Good point here.
Anyone wanting to discuss this case against Kuhn’s definition, e-mail me!
“OLS” (Ordinary Least Squares) is the standard method for drawing a line of best fit through data. A “kitchen-sink” regression is when a researcher throws every available variable into the model (control variables) in hopes of isolating an effect. Endogeneity is a fatal flaw here: it happens when there is something else driving your result that you didn’t (or couldn’t) put in the “sink”. For example, if you find that “tutoring causes higher test scores” but you didn’t control for “parental motivation”, your result is endogenous, meaning it’s biased because motivated parents are likely the ones hiring the tutors.
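If it helps, here is a toy simulation of exactly that tutoring story (all numbers invented): “parental motivation” raises both the chance of hiring a tutor and test scores, so the regression that omits it badly overstates the tutoring effect.

```python
# Omitted variable bias, footnote-sized: motivation drives both tutoring and
# scores, so the naive regression overstates tutoring's effect. Made-up numbers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50_000
motivation = rng.normal(size=n)                          # unobserved confounder
tutoring = (motivation + rng.normal(size=n) > 0.5).astype(float)
scores = 0.10 * tutoring + 0.50 * motivation + rng.normal(size=n)

naive = sm.OLS(scores, sm.add_constant(tutoring)).fit()
controlled = sm.OLS(scores, sm.add_constant(np.column_stack([tutoring, motivation]))).fit()
print(naive.params[1])        # well above 0.10: inflated by the omitted variable
print(controlled.params[1])   # ~0.10: close to the true tutoring effect
```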
In econometrics, “identification” refers to the strategy used to try to strip away bias and identify the true causal link. A “clean” identification strategy mimics a laboratory experiment. For example, an RCT flips a coin to assign treatment; a Regression Discontinuity (RDD) compares people just above and just below an arbitrary cutoff (like a test score threshold for a scholarship); and DiD compares the change in a treated group to the change in a control group over time.
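And a bare-bones numerical example of the DiD logic from this footnote, with made-up group means:

```python
# 2x2 difference-in-differences with invented average scores: the estimate is
# the treated group's before/after change minus the control group's change.
treated_before, treated_after = 50.0, 58.0    # hypothetical means
control_before, control_after = 48.0, 53.0

did = (treated_after - treated_before) - (control_after - control_before)
print(did)   # (58 - 50) - (53 - 48) = 3.0 points attributed to the treatment
```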
Garg and Fetzer (2025) note that “leading journals now prioritize studies employing these methods over traditional correlational approaches”, and they point out that these four specific methods are the ones that have seen “substantial growth” as the discipline shifted toward rigorous identification.
Garg and Fetzer (2025) show that the share of claims supported by these rigorous causal methods skyrocketed from about 4% in 1990 to nearly 28% in 2020.
A “cause” answers the question “does X affect Y?” (precision), a “lever” answers the question “how much of the difference between success and failure does X actually explain?” (importance).
He argues that we need both: the local identification of causal effects and models (comparative statics) to understand the “causal architecture” of a phenomenon. He also notes in an addendum that low partial R^2 isn’t bad per se, provided the intervention is cost-effective - which, by extension, aligns with our definition of a “lever”: it has to be worth pulling.
Standardized coefficients are tricky (not as straightforward as logs) because they depend heavily on the variance of your sample, making comparisons across studies a literal nightmare (good luck with your meta-analyses). R^2 measures (like Shapley values) are often computationally heavy and not intuitive, and they can attribute importance to irrelevant variables just because they are correlated with relevant ones (that is also one of the criticisms regarding the use of ML in Econ, specifically for large samples).
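A quick simulated illustration of the sample-variance problem (purely illustrative, not tied to any real study): the true slope is identical in two samples, but because the spread of X differs, the standardized coefficient changes a lot.

```python
# Same true slope, different sample variance of X => different standardized
# coefficients. Simulated data, purely illustrative.
import numpy as np

rng = np.random.default_rng(1)

def standardized_beta(sd_x, true_slope=2.0, n=50_000):
    x = rng.normal(scale=sd_x, size=n)          # only the spread of X changes
    y = true_slope * x + rng.normal(scale=5, size=n)
    slope = np.polyfit(x, y, 1)[0]              # fitted OLS slope (~2 in both samples)
    return slope * x.std() / y.std()            # standardized coefficient

print(standardized_beta(sd_x=1.0))   # ~0.37
print(standardized_beta(sd_x=3.0))   # ~0.77
```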
And “with large N you don’t need randomization”… :P



Good post!
I feel there is another important distinction that has less to do with statistics but also needs clarification. That is the difference between "small easy wins" and "high-cost transformative interventions".
For example, suppose we figured out we could improve learning a little bit simply by having the teacher start the semester saying "Kids, always remember you CAN learn!" Maybe it makes a small difference, but the cost of implementation is also really, really low. It won't transform education forever; maybe the effect only shows up in the most powerful studies.
On the other hand, imagine something like full-time private tutors for every kid, or something that tries to get close to that. I bet it would work very well! But the cost is huge. Maybe some version of it passes the statistical test *and* still comes out as a positive return.
So, how do we compare those? Cost-benefit analysis sure sounds good, and some interventions even _save_ money on net by preventing other costs. You probably won't transform education *just* by doing those. But it would also be stupid not to!
And what should researchers focus on? It's a tough question!
Great article! I wish I could write like this.
I know that Tyler Cowen has been calling for more studies with “oomph” about questions we really care about (versus studies that are well identified but don’t have that impact). Your examples and background make that easier to understand.
It would be interesting to go into the approaches that economists/statisticians use to find the levers. Obviously R^2 is a good start but definitely not sufficient