Causal Scope Laundering

When valid evidence closes the wrong question.

Elias Kunnas

Standard objections addressed in this essay

“This is just external validity.” — §IV (external validity asks if evidence transports; CSL names the public-discourse operator that strips the boundary)
“This is anti-evidence / anti-science.” — §IV (the evidential surface is democratic infrastructure; the failure is unscoped closure, not citation)
“This is just Goodhart’s Law.” — §IV (Goodhart requires gaming the measure; the cited estimate here can be honestly conducted and on-target)
“Mechanism speculation is easier to abuse than citation.” — §V (Mechanism-Claim Gate; symmetric burden in direction / magnitude / falsifier / evidence range)
“The gate becomes obstructionism.” — §V (gate defeats evidentiary/deliberative closure, not procedural action; named residue-owner for action under uncertainty)
“Policy can’t wait for full mechanism analysis.” — §V (three closures distinguished; procedural closure can legitimately occur under uncertainty)
“You’ve named an operator but not documented a case.” — §III closing caveat (the logical gap is visible from inside the discourse; documented chains would sharpen the sociological-step sub-claims, not validate the model itself)

A treatment-effect study identifies what happens to a defined population under a defined intervention against a defined counterfactual over a defined window. Policy discourse routinely quotes such a study as if it had answered a different question: what system of incentives, expectations, and equilibrium effects should govern the underlying domain. The citation is formally accurate. The scope jump is invisible. The failure is not false citation. It is the deployment of the citation as warrant for a question the study did not identify, whether or not the study was well-identified to begin with. Citation as a public evidential surface is democratic infrastructure; the failure is its use as a stopping rule for a question whose decisive causal components the cited evidence did not vary.

I. Specimen — the body-worn camera RCTs

A police department deploys body cameras. Officers are randomly assigned to wear them or not. After a year, the data come in. The Washington, DC randomized trial — among the largest in the world, run by the Lab @ DC and published in PNAS — found no statistically significant effect on documented use of force, civilian complaints, policing activity, or judicial outcomes. The Rialto, California pilot reported large reductions in use-of-force incidents and citizen complaints.

Ariel and colleagues’ multisite work in Journal of Experimental Criminology finds use-of-force effects depend on officer activation discretion — when officers control when cameras record, use of force can increase rather than decrease. Lum and colleagues’ systematic review finds no clear or consistent effects on most measured police/citizen behaviors and notes that whether body-worn cameras strengthen accountability systems remains under-addressed.

Rialto’s reported reductions were widely invoked by departments, media, and governments as rationale for rollout; later null or mixed estimates were likewise invoked as evidence against expansive expectations. The same bounded findings carried different closure claims across the discourse.

The public-discourse summary collapses this to three phrases: “the evidence shows body cameras work,” “the evidence shows body cameras don’t work,” or “body cameras are evidence-based police accountability.” The same body of work is cited in support of all three. Each citation is formally correct: the studies do report what the speaker says they report.

But the studies estimate the effect of assigning cameras to officers in a given department under that department’s existing activation, review, disclosure, and disciplinary regime over the study period. That is a real causal object. It is not the causal object the policy discourse claims to be settling.

The policy question is about an accountability architecture: when officers are required to record, with what carve-outs and exemptions; who reviews the footage and under what cadence; whether civilians and prosecutors can compel disclosure; whether misconduct revealed by footage is reliably punished; whether the union contract permits review at all; whether officers learn to manage the camera as a shield or a witness over time; whether the department’s promotion incentives reward camera-friendly behavior. A body camera is not a treatment in that sense. It is a sensor inside an accountability system that the trial holds constant.

The treatment-effect literature can be rigorously conducted, internally credible, and accurately quoted, and still be silent on whether the system should be redesigned around cameras. The studies cannot have answered that question. By construction, they vary one input within an existing institutional architecture; they do not vary the architecture.

The political mirror — minimum-wage employment effects

The same shape recurs across the political axis. Card and Krueger’s 1994 New Jersey–Pennsylvania fast-food study, and Jardim et al.’s Seattle minimum-wage phase-in work, estimate short-run employment-margin changes for a defined low-wage workforce in defined geographies during defined windows. The discourse summarizes these as “the evidence shows minimum-wage increases don’t cost jobs” or “the evidence shows they do.”

Whichever direction the employment-margin estimate runs, it does not settle the policy question. That question routes through monopsony structure in local labor markets, long-run automation incentives, price pass-through, regional substitution, enforcement equilibrium, family-income incidence, and the political dynamics that determine future floor increases. A short-run employment estimate is a real causal object, identified inside an existing labor-market structure. What policy should be is a question about which labor-market structure to build, and the studies can be rigorously conducted while remaining silent on it. Body cameras and the minimum wage differ in political valence but share a methodological shape; the scope-laundering move appears in both. Whether the laundering benefits the reader’s preferred policy varies; the operator does not.

II. Levels of causal aggregation

A study can identify a bounded component estimand. A policy claim depends on a composite of components plus composition assumptions. The discourse routinely treats one as if it were the other.

A component estimand has the shape: for population P, under intervention I, what is the change in measured outcome Y against counterfactual C over window W? Well-identified study designs answer this. Identification holds the surrounding system constant by construction. Without that constancy, the estimate has no causal interpretation.
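
In potential-outcomes notation, the component estimand described above might be written as follows. This is a sketch rather than the essay's own formalism: the symbols $\tau$, $Y_W(\cdot)$, and $S$ (the surrounding system, fixed at its study-period state $s_0$) are assumptions introduced here for illustration.

```latex
\tau(P, I, C, W \mid S = s_0)
  \;=\;
  \mathbb{E}\!\left[\, Y_W(I) - Y_W(C) \;\middle|\; P,\; S = s_0 \,\right]
```

Writing the conditioning on $S = s_0$ explicitly makes the scope boundary visible: the estimate is defined at the system state the design held constant, and says nothing by itself about any $s \neq s_0$.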

A composite policy claim is the actor-action-population-timeframe-threshold combination the decision is really about: should this jurisdiction install this accountability regime, given these alternatives, under these constraints, with these tradeoffs? The composite depends on a bundle of component estimands plus assumptions about interaction, sequencing, scale, adaptation, and institutional capacity. The composite cannot be answered by holding the surrounding system constant. It requires varying the system.

These are different levels of causal aggregation, not different types of question. The gap between them is well charted in the methodological literature: Cartwright and Hardie (Evidence-Based Policy, 2012) on the inference from “it worked there” to “it will work here,” with explicit support-factor identification; Deaton and Cartwright on what trial-sample ATEs do and do not identify; Heckman and Vytlacil on the distance between treatment-effect estimands and structural policy evaluation; Pawson and Tilley’s realist evaluation tradition on context-mechanism-outcome configurations; Pritchett and Sandefur on the development-economics literature where rigorous local evidence routinely fails to transport across contexts.

Public citation drops the qualifiers the methodological literature preserves.

The diagnostic below treats the best case: the cited study is well-identified, the qualifiers exist on paper, and the only failure is that they get dropped between paper and citation. Large parts of the behavioral evidence base do not meet that standard — in the Open Science Collaboration 2015 reproducibility project, 36% of replication studies produced statistically significant results; Camerer and colleagues 2018 replicated 13 of 21 Nature/Science social-science experiments, with replication effect sizes roughly half of originals. Specification searches, researcher degrees of freedom, convenience samples, and qualifiers that exist nominally but cannot survive hostile review are routine. Scope laundering operates at the citation-warrant layer regardless of whether the underlying study is rigorous; the floor case is strictly worse, because the citation now does work for a mechanism-design claim while the underlying estimate cannot bear even its narrow weight.

III. The laundering move

Causal scope laundering runs through five operational steps. Each is invisible alone; the composite converts a valid narrow estimate into authority over a question no one has answered.

1. Evidence anchor. The speaker cites a real study and reports what it found. (Whether the study itself is well-identified or weakly-identified does not change the rest of the sequence.)
2. Material scope compression. The qualifiers attached to the original estimate (population, intervention, counterfactual, window, identification strategy) drop out of the sentence. The estimate becomes free-floating. Compression is universal in citation; it becomes material when the omitted qualifier is necessary to prevent the warrant from expanding beyond what the cited evidence identified.
3. Warrant transfer. The free-floating estimate is offered as an answer to the larger composite policy claim the audience is actually arguing about.
4. Burden reversal. Anyone who points out that the cited study did not identify the broader composite is framed as anti-evidence, ideological, or unserious. Objecting is now reputationally costly.
5. Public-institutional closure. The composite policy claim drops out of the live decision process. Closure does not require literal cessation of thought; it requires that the institutional or public surface stop treating the residue as decision-relevant.

Each step is locally deniable. Step 1 is good practice. Step 2 is normal compression. Step 3 is what citation is for. Step 4 enforces evidential discipline. Step 5 is closure under uncertainty. Steps 2 and 3 often fuse in a single public sentence (“the evidence shows X” simultaneously drops scope and answers the larger question). Steps 4 and 5 are conceptually distinct (enforcement vs outcome) but operationally tied. The composite move is what should be diagnosed, not any single step.

The pattern recurs across domains. Charter-school lottery studies estimate effects for applicants admitted to oversubscribed urban schools and are cited as if they settled the architecture of school choice at scale. Project STAR’s Tennessee K–3 RCT estimated within-state class-size effects in a specific era and is cited as if it settled national class-size mandates. Grade-retention studies estimate outcomes for retained students under existing systems and are cited as if they settled whether minimum-competence progression rules should exist. Each is a real, well-identified estimate. Each becomes authority over a question its identification strategy could not have answered.

The logical gap (component cited as warrant for composite) is visible from inside the discourse. The sociological steps — burden reversal, pathologized objection — are claims about how speakers and audiences behave, and would be sharpened by documented institutional chains. That documentation is a separate empirical task; what is shown here is the recurring pattern, not a worked single closure event.

IV. The evidential surface — dual-use infrastructure

The evidential surface is democratic infrastructure before it is a failure mode. Citations are public artifacts. The study exists or it does not. The design can be summarized. The sample can be named. The estimate can be contested by other studies. Evidence review has institutional practices — systematic reviews, replication, meta-analysis, risk-of-bias tools, pre-registration, confidence grading. These are imperfect but they create a shared review surface across factions. A libertarian, a socialist, a police chief, a union lawyer, and a public-choice economist can disagree about body cameras while pointing at the same DC RCT.

The mechanism layer does not have that property by default. Mechanism stories are private goods: analysts, modelers, agency insiders, and policy professionals can produce elaborate causal narratives that ordinary citizens cannot easily audit. Without the evidential surface, debate becomes a competition between insider mechanism stories, and the best rhetorician or the highest-status expert wins. “Show me the study” is a democratic improvement over “accept my equilibrium model.”

So the failure is not the evidential surface. The failure is unscoped citation used as a stopping rule for a question whose decisive causal components the cited evidence did not vary. The same surface that makes debate publicly reviewable becomes corrupt when it claims unearned scope.

The corruption survives because it is the cheapest defensible move in a public-discourse environment. Computing the composite policy claim is expensive — it requires specifying the proposed architecture, modeling incentive responses, naming the hidden ledgers, estimating equilibrium effects, and accepting uncertainty about all of them. Doing so visibly assigns the analyst the costs of the proposed system. The computation creates an obligation.

Citing a study, by contrast, costs nothing. The citation routes the burden of objection to the other side. The other side now has to produce their study, or explain why the cited one shouldn’t count, or — most damaging — propose alternative policy without an equivalently authoritative-sounding evidential anchor. None of those responses scales as well as the citation did.

When a citation is used this way, it does not matter whether the cited study is wrong, well-identified, or rigorous; it becomes the evidential surface over an uncomputed composite. In the limit, a citation to a non-replicated, single-site, specification-searched estimate carries the same closure warrant in discourse as a citation to a pre-registered multisite RCT: both surfaces are equally cheap to deploy and equally costly to contest. Carol Weiss called this the symbolic use of research: research deployed to retroactively defend an already-formed decision rather than to inform a new one. Pawson and Tilley’s realist-evaluation tradition was built to discipline the equivalent failure in evaluation practice. The evidential-surface failure is the public-discourse version of the same problem.

This is not Goodhart’s Law. Goodhart says that when a measure becomes a target, agents game the measure until it no longer tracks the underlying thing. In causal scope laundering, no one gamed the study. The cited estimate can be honestly conducted, well-identified, accurately quoted, and entirely on-target for its own design. The failure is at the citation-warrant layer: the estimate’s bounded identification is silently used as an unbounded stopping rule. Donald Campbell’s law of social indicators is structurally adjacent but distinct — Campbell’s mechanism requires gaming, and the laundering mechanism does not.

V. The Causal-Scope Gate

The gate fires only when a citation is being used to close a policy dispute. Routine citations as one input among several need no gate. The burden is the claimant’s, not the reader’s. When closure is being claimed, the person doing the closing is responsible for stating four things; if they cannot, the citation has not closed the dispute, regardless of whether any specific objector could have filled the fields in their place.

The identified question — the population, intervention, counterfactual, outcome, and window the cited study estimates.
The scope boundary — what the identification strategy varied and what it held constant by design.
The policy claim — actor, action, target population, timeframe, outcome criterion, decision threshold. (Not the slogan; the operational claim.)
The action-relevant residues — the incentive responses, hidden ledgers, and equilibrium effects the policy claim depends on, that the cited design did not vary. The list is sufficient when it covers the major causal dependencies the claim relies on; it does not require enumerating every imaginable unknown.
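
The four fields above can be sketched as a small record with a completeness check. This is a minimal illustration, not an implementation the essay specifies: the class name `ScopeStatement`, its attribute names, and the rule that an empty residue list counts as "unstated" are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeStatement:
    """Hypothetical record of the four things a closure claimant must state."""
    identified_question: str = ""  # population, intervention, counterfactual, outcome, window
    scope_boundary: str = ""       # what the design varied vs. held constant
    policy_claim: str = ""         # actor, action, target population, timeframe, criterion, threshold
    residues: list[str] = field(default_factory=list)  # dependencies the design did not vary

    def missing_fields(self) -> list[str]:
        """Return the names of the fields the claimant has left unstated."""
        missing = []
        if not self.identified_question.strip():
            missing.append("identified_question")
        if not self.scope_boundary.strip():
            missing.append("scope_boundary")
        if not self.policy_claim.strip():
            missing.append("policy_claim")
        if not self.residues:  # assumption: an empty list means "not stated"
            missing.append("residues")
        return missing

    def burden_met(self) -> bool:
        """True only when all four fields are stated.

        Meeting the burden is necessary for closure, not proof that the
        closure claim is correct.
        """
        return not self.missing_fields()


partial = ScopeStatement(identified_question="effect of camera assignment on use of force")
print(partial.missing_fields())
print(partial.burden_met())
```

The check sits with the claimant's statement rather than with any objector, mirroring the allocation of burden described above.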

When citation is not laundering

The gate is narrower than it might sound. A citation is not laundering when any of the following hold:

The citation is offered as one input into deliberation, not as a stopping rule.
The policy claim is genuinely coterminous with the intervention the cited study identified (e.g., a procurement decision to buy cameras under the existing operating regime, not a redesign of the accountability architecture).
The omitted qualifiers do not change the action threshold for the live decision.
The speaker openly acknowledges uncertainty and treats the citation as defeasible.

Ordinary compressed citation in public discourse — “the evidence shows X” — is laundering only when it functions as closure against an action-relevant composite the cited study did not vary. Compression is universal; closure-by-compression is what the gate names.
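
Read as a decision rule, the four exemptions can be sketched as a single predicate. The function name and its boolean parameters are hypothetical; in practice each input is a judgment call, not something a program can observe.

```python
def gate_fires(used_as_stopping_rule: bool,
               claim_coterminous_with_study: bool,
               omitted_qualifiers_change_threshold: bool,
               treated_as_defeasible: bool) -> bool:
    """Return True only when none of the four exemptions applies."""
    if not used_as_stopping_rule:
        return False  # offered as one input into deliberation
    if claim_coterminous_with_study:
        return False  # the policy claim matches what the study identified
    if not omitted_qualifiers_change_threshold:
        return False  # dropped qualifiers do not move the action threshold
    if treated_as_defeasible:
        return False  # uncertainty is openly acknowledged
    return True


# A compressed citation used as closure against a broader composite:
print(gate_fires(True, False, True, False))
# The same citation offered as one input among several:
print(gate_fires(False, False, True, False))
```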

The gate binds the closure level — and the gate-invoker

A symmetric obstruction risk runs in the opposite direction. The closure speaker can be tempted to use a narrow study as warrant for a broader composite (the laundering move). The objector can be tempted to upscale the live decision into a broader composite that no available evidence could close, then invalidate the citation for failing to answer the enlarged question. Both are the same move at different layers.

The gate therefore binds both sides at the level of the actual decision. Before invoking the gate, the objector must state whether the cited evidence is being used to close a premise, an implementation decision, a procurement decision, or a full institutional architecture — and whether the closure claim being challenged actually operates at the enlarged level the objector is naming. If the cited study fits the narrower decision being made, the gate does not fire even if a broader architecture remains unspecified.

Gate in use — body cameras

Applied to a citation of the DC RCT that is used to close the claim that body cameras are not an effective police-accountability architecture:

Identified question (study-recoverable): effect of camera assignment on documented use of force, civilian complaints, policing activity, and judicial outcomes for the 2,224 MPD officers randomized in the 2015–2017 trial.
Scope boundary (design-inferred): identification varied camera assignment under MPD’s then-current activation, supervisor-review, disclosure, union-contract, and disciplinary regime; those institutional variables were held fixed by the experimental design and are not estimable from the RCT.
Policy claim being made (supplied by the closure speaker, not by the study): e.g., that municipality X should not fund department-wide cameras in its next budget cycle on accountability grounds.
Action-relevant residues (mechanism-analysis input, not study output): what changes under mandatory disclosure; under supervisor review on a fixed cadence; under discipline for non-activation; under footage admissibility in misconduct proceedings; under multi-year officer adaptation; under civilian complaint behavior when disclosure rules are known. These are derived from the policing-domain literature outside the cited RCT.

The gate does not say the DC RCT is wrong. It says the cited estimate cannot have answered the closure claim, because the variables the closure claim depends on are precisely the ones the identification strategy held constant. Labeling each bullet’s input source — study-recoverable, design-inferred, policy-context, mechanism-analysis — keeps the gate transparent about what the citation itself supplies versus what the gate-application requires from outside the cited study.
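
The core of that argument is a set comparison: the residues are whatever the closure claim depends on that the design did not vary. A minimal sketch follows. The variable names are hypothetical labels paraphrasing the worked example, and the dependency list is an analyst-supplied assumption, not something reported by the RCT.

```python
# Hypothetical labels paraphrasing the worked example above.
VARIED_BY_DESIGN = {"camera_assignment"}
HELD_CONSTANT = {
    "activation_policy", "supervisor_review", "disclosure_rules",
    "union_contract", "disciplinary_regime",
}
# Assumed dependencies of the closure claim, supplied by the analyst.
CLAIM_DEPENDS_ON = {
    "camera_assignment", "supervisor_review", "disclosure_rules",
    "disciplinary_regime", "officer_adaptation",
}


def action_relevant_residues(depends_on: set, varied: set) -> list:
    """Dependencies of the claim that the design did not vary."""
    return sorted(depends_on - varied)


def held_fixed_by_design(depends_on: set, held_constant: set) -> list:
    """Dependencies of the claim that the design explicitly held constant."""
    return sorted(depends_on & held_constant)


residues = action_relevant_residues(CLAIM_DEPENDS_ON, VARIED_BY_DESIGN)
fixed = held_fixed_by_design(CLAIM_DEPENDS_ON, HELD_CONSTANT)
print(residues)  # ['disciplinary_regime', 'disclosure_rules', 'officer_adaptation', 'supervisor_review']
print(fixed)     # ['disciplinary_regime', 'disclosure_rules', 'supervisor_review']
```

In this toy listing, `officer_adaptation` is a residue that the design neither varied nor named as fixed, which is the kind of item that comes from mechanism analysis outside the cited study.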

The Mechanism-Claim Gate — symmetric in burden, different in fields

A common counter-claim is: “cameras will be useless because officers will learn to game activation and disclosure timing.” That is a mechanism-design hypothesis, not a citation. The same anti-closure principle applies, but the field structure differs: a citation gate maps cited evidence to a closure claim; a mechanism-claim gate bounds an asserted hypothesis. Both are denied unbounded warrant; the symmetry is normative, not structural.

The mechanism-claim gate asks:

Predicted direction. Officer gaming should attenuate camera effects on use of force and substitute toward unrecorded encounters.
Magnitude range. The claimant must state a range large enough to change the policy decision, or admit the claim is directional only. If no defensible range can be given, the mechanism hypothesis can inform design but cannot close the decision.
Falsification condition. A multi-year panel tracking non-activation rates, encounter-type composition, and per-officer adaptation would confirm or refute the prediction.
Evidence range. Ariel et al.’s activation-discretion finding is suggestive evidence in the predicted direction; no direct estimate exists for a comprehensive disclosure-and-discipline regime under hostile officer adaptation.

The counter-claim survives as a stated hypothesis with bounded warrant. It can inform the policy debate; it does not close it. A mechanism-design objection used as a stopping rule against a closure claim must pass this gate; otherwise it is itself laundering at a different layer.
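
The four mechanism-claim fields can be sketched the same way. Everything named here is hypothetical: the `MechanismClaim` record, the three status labels, and in particular the rule that a claim passes only when the low end of its stated magnitude range reaches the decision threshold. That rule is one possible reading of "a range large enough to change the policy decision," not a definition the essay gives.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MechanismClaim:
    """Hypothetical record of a mechanism hypothesis offered against closure."""
    direction: str
    falsifier: str
    evidence_range: str
    # (low, high) effect size in the claimant's own units; None = directional only.
    magnitude_range: Optional[Tuple[float, float]] = None


def warrant(claim: MechanismClaim, decision_threshold: float) -> str:
    """Classify how much weight the claim can carry in the dispute."""
    stated = (claim.direction, claim.falsifier, claim.evidence_range)
    if not all(s.strip() for s in stated):
        return "incomplete"    # a required field is missing
    if claim.magnitude_range is None:
        return "informs_only"  # directional claim: can inform design, cannot close
    low, _high = claim.magnitude_range
    # Assumed rule: the whole stated range must be decision-relevant.
    return "passes_gate" if low >= decision_threshold else "informs_only"


gaming = MechanismClaim(
    direction="officer gaming attenuates camera effects on use of force",
    falsifier="multi-year panel of non-activation rates and encounter mix",
    evidence_range="suggestive activation-discretion finding; no direct estimate",
)
print(warrant(gaming, decision_threshold=0.5))
```

Passing the gate in this sketch means only that the claim is bounded enough to be argued as closure, not that it is true.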

The gate is not a veto

Three closures need to be distinguished. Evidentiary closure — the claim that this dispute over what the evidence shows is settled. Deliberative closure — the claim that the reasoned case for an action is complete. Procedural closure — a vote, a ruling, a budget deadline, a statutory clock. The gate disciplines the first two. Procedural closure can legitimately occur under uncertainty even when scope statements are not produced; legitimacy in such cases flows from procedure, not from evidence.

A causal-scope objection defeats evidentiary or deliberative closure, not action. The gate says a citation has not settled the question; it does not say the policy must wait until every mechanism residue is computed. Anyone using the gate to delay must state the default action they are choosing, the costs of delay, the evidentiary threshold that would change their mind, and whether their mechanism objection is itself identified, bounded, or speculative. When action proceeds under uncertainty, someone must own the residue: a named owner who tracks whether the unresolved mechanism residues materialize and reopens the decision if they do. See Corrective Closure Ownership for the architecture of that role. Without a residue-owner, the gate-invocation collapses into the obstructionist move the diagnostic names elsewhere.

The 3-line working version

Under time pressure, the gate compresses to three questions anyone can ask the claimant:

What exactly did the cited study estimate?
What exact policy action is the citation being used to close?
Which action-relevant mechanisms or contexts were not varied by the study?

Before a citation closes a policy dispute, its claimant should be able to answer all three. Before a mechanism claim closes the same dispute, its claimant should be able to answer the parallel three (direction / falsifier / evidence range).

VI. Close

A study cannot close a question it did not identify. The evidential surface is democratic infrastructure; the failure is its use as a stopping rule for a question whose decisive causal components the cited evidence did not vary. The Causal-Scope Gate disciplines evidentiary and deliberative closure, not procedural action — a residue stopping rule, a stated decision-under-uncertainty default, and a named owner for unresolved residues keep the discipline from collapsing into the obstructionist move it diagnoses.

Related:

The Refusal to Compute — the institutional incentive that scope laundering exploits; laundering is one of the cheapest tools for refusing to compute.
The Mechanism Analysis — the structured artifact this essay’s repair points toward; the Causal-Scope Gate is a pre-MA citation-discipline rule.
The Response Vector — why the laundering move survives; the cheapest defensible response wins.
Cargo Cult Epistemology — sibling failure mode where facts function as tribal ammunition; CSL covers the case where the cited fact is real, correctly summarized, and still functions as warrant theft.
Nullius in Verba — context for science-as-authority decay; CSL names a specific public-discourse mode of it.
Calculemus — demand for computation; CSL names the move that lets institutions skip the demand.
Complexity Laundering — sibling pattern at the policy-cost layer.
The Causal Talisman — sibling diagnostic. CSL closes the warrant-overflow gap for evidence citation; CT closes the parallel gap for morally protected cause-names.

Sources and notes

Central interlocutor.

Nancy Cartwright and Jeremy Hardie, Evidence-Based Policy: A Practical Guide to Doing It Better (Oxford, 2012) — a canonical treatment of the “it worked there → it will work here” inference problem, with the support-factor discipline this essay’s gate translates into citation-practice form. CSL takes Cartwright’s methodological diagnosis and names the institutional-rhetorical move that perpetuates it.

Causal scope, external validity, transportability.

Angus Deaton and Nancy Cartwright, “Understanding and Misunderstanding Randomized Controlled Trials,” Social Science & Medicine 210 (2018) — trial-sample ATEs, mechanism dependence, scaling, interference, SUTVA. The technical articulation of the gap CSL names in discourse.
William Shadish, Thomas Cook, and Donald Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference (Houghton Mifflin, 2002) — the standard treatment of internal vs external validity that the citation practice routinely conflates.
Lee Cronbach, Designing Evaluations of Educational and Social Programs (Jossey-Bass, 1982) — UTOS framework (units, treatments, observations, settings) on what evaluation evidence can and cannot generalize over.
Judea Pearl and Elias Bareinboim, “External Validity: From Do-Calculus to Transportability across Populations,” Statistical Science 29:4 (2014) — the formal-causal-inference treatment of when treatment-effect estimates can be transported.
Lant Pritchett and Justin Sandefur, “Learning from Experiments When Context Matters,” AER P&P 105:5 (2015) — development-economics literature where rigorous local evidence routinely fails to transport.

Mechanism vs treatment-effect.

James Heckman and Edward Vytlacil, “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” Econometrica 73:3 (2005) — what policy questions common treatment-effect estimators do and do not answer; the technical version of the §II distinction.
Ray Pawson and Nick Tilley, Realistic Evaluation (Sage, 1997) — “what works, for whom, in what circumstances, and how”; the closest evaluation-tradition prior to CSL’s framing. CSL claims the operator is now general enough to name in public-discourse terms.
Charles Manski, Patient Care under Uncertainty (Princeton, 2019); “Identification Problems and Decisions under Ambiguity” — decision discipline when evidence under-identifies the choice. Manski supplies the partial-identification frame; CSL supplies the citation-practice form.

Replication crisis (the floor case).

Open Science Collaboration, “Estimating the Reproducibility of Psychological Science,” Science 349:6251 (2015) — 36% direct-replication rate across 100 psychology studies.
Colin F. Camerer, Anna Dreber, Felix Holzmeister et al., “Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015,” Nature Human Behaviour 2:9 (2018) — 13 of 21 (62%) replications significant, with replication effect sizes approximately half of originals.

Adjacent failure modes (distinguished from).

Donald Campbell, “Assessing the Impact of Planned Social Change” (1976/1979) — Campbell’s Law on corruption pressure for quantitative social indicators used in decision-making. Adjacent to CSL but not identical: Campbell’s mechanism requires gaming; CSL does not.
Carol Weiss, “The Many Meanings of Research Utilization,” Public Administration Review 39:5 (1979); “Knowledge Creep and Decision Accretion” (1980) — instrumental, conceptual, and symbolic uses of research. The symbolic-use category is where CSL’s evidential-surface dynamic lives.

Empirical anchors for §I.

David Yokum, Anita Ravishankar, and Alexander Coppock, “A randomized control trial evaluating the effects of police body-worn cameras,” PNAS 116:21 (2019; Lab @ DC) — the largest body-cam RCT to date, null on the headline measures.
Tony Farrar, “Self-awareness to being watched and socially desirable behavior: A field experiment on the effect of body-worn cameras on police use-of-force” (Police Foundation / Rialto pilot, 2013) — the early positive estimate widely cited as evidence cameras work.
Barak Ariel, Alex Sutherland, Darren Henstock et al., “Report: Increases in Police Use of Force in the Presence of Body-Worn Cameras Are Driven by Officer Discretion,” Journal of Experimental Criminology 12:3 (2016) — activation-discretion finding: use-of-force effects depend on who controls when cameras record.
Cynthia Lum, Megan Stoltz, Christopher Koper, and J. Amber Scherer, “Research on body-worn cameras: What the evidence tells us,” Campbell Systematic Reviews 16:3 (2020) — systematic review showing the downstream architecture (supervisor review, disclosure, discipline) remains under-evaluated as a treatment-effect question.
David Card and Alan Krueger, “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania,” AER 84:4 (1994) — the canonical narrow-scope minimum-wage estimate frequently extended into broader claims.
Ekaterina Jardim, Mark Long, Robert Plotnick, Emma van Inwegen, Jacob Vigdor, and Hilary Wething, “Minimum Wage Increases, Wages, and Low-Wage Employment: Evidence from Seattle,” NBER w23532 (2017 / revised) — modern Seattle $15 study, contested specifically on identification and external-validity grounds.

Stack placement. In Stack terms, this is a Layer 1 / 4 / 6 / 7 failure: mechanism, measurement, compilation, and computation collapse into a citation surface.

On the status caveat. The status note appearing immediately after the thesis is load-bearing for this essay’s epistemic posture, not a polite hedge. The diagnostic and the gate are derived from methodological literature plus type-level discourse patterns; they have not been tested against a documented institutional chain in any single policy domain. The proper next step for anyone using the diagnostic is to find a case, run the gate against it, and see what the gate-application reveals about the candidate operator’s empirical density.