The Measurement Anchor

What holds a proxy to its territory under consequence.

Elias Kunnas

A metric is a proxy placed under consequence. Institutions use proxies because they cannot act on unbounded reality. Once reward, punishment, funding, ranking, legitimacy, or closure attaches to a proxy, actors learn which reported number pays — and unless the proxy is anchored, its relation to the underlying territory begins to drift. A measurement regime survives this pressure only if four institutional conditions preserve the proxy-territory link: a causal audit path, a validity envelope with recalibration, bounded proxy status, and independent or adversarial verification. Goodhart-family pathologies are what happens when one or more of those anchors fail. The measurement anchor is the institutional discipline that keeps a consequential proxy from becoming the thing it was meant to observe.


I. The credit that was not a tonne

A carbon credit is treated as a quantified claim about the territory: one credit equals one tonne of CO₂e reduced or avoided relative to a counterfactual baseline world. The unit is tradable. Once tradable, the proxy is under direct consequence — every actor in the system has reason to optimize the proxy.

Probst and colleagues, in a 2024 Nature Communications synthesis covering roughly a fifth of credit volume issued to date and almost a billion tons CO₂e, estimated that less than 16% of credits issued to the investigated projects constituted real emission reductions. The variation by project type was large: 11% for cookstoves, 16% for SF₆ destruction, 25% for avoided deforestation, 68% for HFC-23 abatement, and no statistically significant reductions for Chinese wind power or US improved forest management in the investigated set. The mechanisms behind the gap include additionality failures, inflated baselines, leakage, weak usage measurement, methodology problems, and verifier-incentive misalignment.

This is not reducible to fraud. Some failures are deliberate; most are predictable structural responses to attaching tradable consequence to a counterfactual quantified claim without anchoring the proxy-territory relation institutionally. The case asks the four questions this essay names: Causal audit path — how exactly does the project cause reductions relative to the counterfactual? Validity envelope — where do baseline and leakage assumptions hold? Bounded proxy status — is the credit treated as a conditional quantified claim, or as atmospheric subtraction made solid? Independent verification — do auditors test gaming and favorable-assumption strategies, and what is their incentive structure?

A credit is not a measured physical tonne the way a scale measures mass. It is a counterfactual proposition about a baseline world. When the institutional discipline that maintains the link from the proposition to the territory fails, the credit drifts away from the tonne. The drift is not exotic. It is the structural pathology of consequential proxies.


II. What a measurement anchor is

A metric is a pointer from a measurable surface (the proxy) back to a causal surface (the territory) the institution wants to act on. The pointer is useful only as long as the proxy-territory relation is stable. Once the metric becomes consequential — once an actor’s reward, funding, status, legitimacy, or operational closure depends on the score — actors apply selection pressure to the proxy. The territory may also change under reactivity: rankings reshape institutions, targets reshape services. The structural point is that the proxy’s information content changes under pressure — the same number no longer means what it meant before consequence attached.

The anchor is not a demand to weaken measurement. It is a demand to preserve the measurement relation when measurement starts to govern.

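A measurement anchor is the set of institutional conditions that keeps a consequential proxy connected to the territory it stands in for. It has four: a causal audit path, a validity envelope with recalibration, bounded proxy status, and independent or adversarial verification.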

The conditions are necessary, not sufficient. Even well-anchored measurement can be defeated by determined gaming. They specify the minimum institutional spec any consequential proxy should pass before deployment.
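One way to make the pre-deployment spec concrete is to record the four conditions per metric and refuse consequence while any is missing. The following is a minimal sketch, not a definitive implementation: the class, field, and function names are hypothetical, and reducing each anchor to a yes/no flag is a simplifying assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class AnchorCheck:
    """Pre-deployment record for one consequential proxy (illustrative names)."""
    causal_audit_path: bool         # named causal chain from proxy to territory
    validity_envelope: bool         # declared validated domain plus recalibration trigger
    bounded_proxy_status: bool      # recorded as a proxy, consequences limited to validated use
    independent_verification: bool  # checked outside the reporting chain, with a gaming hypothesis


def missing_anchors(check: AnchorCheck) -> list[str]:
    """Return the names of the anchors that are not in place."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def may_carry_consequence(check: AnchorCheck) -> bool:
    """Necessary, not sufficient: passing only clears the minimum bar."""
    return not missing_anchors(check)


# Hypothetical metric with only a causal audit path in place.
example = AnchorCheck(
    causal_audit_path=True,
    validity_envelope=False,
    bounded_proxy_status=False,
    independent_verification=False,
)
print(missing_anchors(example))
print(may_carry_consequence(example))
```

A passing check says only that the floor is met; it does not certify the metric against determined gaming.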

Alfred Korzybski’s 1933 formulation — “the map is not the territory” — is honored prior art at the level of representation. The Measurement Anchor is one layer downstream: not whether the map is the territory (it isn’t), but what institutional discipline holds the map’s relation to the territory once the map starts governing. Korzybski warned against confusing maps with territory; this essay identifies the structural mechanism by which institutions are forced to do exactly that whenever a proxy becomes consequential.


III. Consequence as selection pressure

Selection pressure on a proxy is not a moral failing of the actors. It is the predictable structural response of any system where actors are evaluated against a measurable surface.
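The structural point can be seen even in the mildest case, where nobody games anything and the institution simply rewards the highest scores. The sketch below is a toy simulation under stated assumptions that are not in the essay: the proxy equals the territory value plus independent Gaussian noise, and all names and parameters are hypothetical.

```python
import random


def simulate_selection(n_actors=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Reward the top_k actors by proxy score; compare score with territory.

    Assumption: proxy = territory + independent Gaussian noise.
    Returns (mean proxy, mean territory) for the rewarded group.
    """
    rng = random.Random(seed)
    actors = []
    for _ in range(n_actors):
        territory = rng.gauss(0.0, 1.0)                 # what the institution cares about
        proxy = territory + rng.gauss(0.0, noise_sd)    # what it can observe and reward
        actors.append((proxy, territory))

    # Consequence attaches to the proxy: only the highest scores are rewarded.
    actors.sort(key=lambda a: a[0], reverse=True)
    rewarded = actors[:top_k]

    mean_proxy = sum(p for p, _ in rewarded) / top_k
    mean_territory = sum(t for _, t in rewarded) / top_k
    return mean_proxy, mean_territory


mean_proxy, mean_territory = simulate_selection()
print(f"rewarded group: mean proxy {mean_proxy:.2f}, mean territory {mean_territory:.2f}")
```

Under these assumptions the rewarded group's average score overstates its average territory value, because selecting on the score also selects on the noise. No actor has misbehaved; the gap comes from attaching reward to the measurable surface.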

Charles Goodhart’s 1975 monetary-policy paper named the dynamic: any observed statistical regularity tends to collapse once pressure is placed on it for control purposes. Donald Campbell’s 1979 social-indicators paper made the parallel claim: quantitative indicators used for social decision-making become subject to corruption pressures and distort the processes they monitor. Marilyn Strathern’s 1997 paper “Improving Ratings: Audit in the British University System” supplied the popular paraphrase — when a measure becomes a target, it ceases to be a good measure — and a stronger claim alongside it: audit is not merely neutral checking; it reorganizes institutions around auditable surfaces while jeopardizing what it claims to monitor.

Theodore Porter’s Trust in Numbers (1995) supplies the steelman against treating quantification as a pure mistake: institutions turn to numbers because personal trust, professional discretion, and institutional legitimacy are politically unavailable. Numbers are a governance technology adopted under conditions of distrust. The Measurement Anchor does not argue against measurement; it argues that consequence-attached measurement requires institutional discipline to remain a working pointer rather than degenerating into a target.

Wendy Espeland and Michael Sauder’s “Rankings and Reactivity” (2007), studying law-school rankings, showed how public consequential measures recreate the social worlds they purport to describe: institutions reorganize around the metric, expectations shift, and the metric’s relation to the underlying object changes through the act of attaching consequence to it. Reactivity is the structural mechanism against which the anchor has to preserve the proxy-territory relation.

The literature already names the phenomenon. What it does not always supply, in one place and in deployable form, is the institutional discipline that keeps the pointer working.


IV. The Goodhart family

The canonical taxonomy is Manheim and Garrabrant’s “Categorizing Variants of Goodhart’s Law” (2018), which distinguishes four mechanisms: regressional, extremal, causal, and adversarial.

The variant taxonomy classifies failure mechanisms. The Measurement Anchor classifies maintenance failures. They overlap, but they do not map one-to-one.

Goodhart variants name how the proxy fails. Anchors name what the institution failed to maintain. The diagnostic move is to ask which anchor failed, not which variant the failure was — because the variant is the diagnosis, and the anchor is the repair obligation.

Goodhart variant | Typical anchor failure
Regressional | Validity envelope and bounded proxy status
Extremal | Validity envelope and recalibration
Causal | Causal audit path
Adversarial | Independent verification, bounded proxy status, often also audit path
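Because the mapping is many-to-many, it can help to read the table in both directions. The sketch below encodes the table as a lookup and inverts it. The dictionary and function names are hypothetical, and the "often also" qualifier on the adversarial row is flattened into plain membership, which is a simplification.

```python
# Variant -> anchors typically implicated, transcribed from the table above.
# "Validity envelope and recalibration" is treated as a single anchor.
VARIANT_TO_ANCHORS = {
    "regressional": {"validity envelope", "bounded proxy status"},
    "extremal": {"validity envelope"},
    "causal": {"causal audit path"},
    # The table says "often also audit path"; the hedge is dropped here.
    "adversarial": {"independent verification", "bounded proxy status", "causal audit path"},
}


def variants_implicating(anchor: str) -> list[str]:
    """Return the Goodhart variants whose typical failure involves this anchor."""
    return sorted(v for v, anchors in VARIANT_TO_ANCHORS.items() if anchor in anchors)


for anchor in ("causal audit path", "validity envelope",
               "bounded proxy status", "independent verification"):
    print(anchor, "<-", variants_implicating(anchor))
```

Read this way, a single missing anchor can surface as more than one variant, which is why the repair question is put to the anchor rather than to the variant.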

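A short walk through specimens, each documented in the notes: financial risk models pushed into systemic-tail regimes (extremal Goodhart, validity-envelope failure); school quality ratings in which proxy and target share a common cause (causal Goodhart); and academic citation counts, a mixed case not assigned cleanly to a single variant.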

GDP is a different kind of case worth naming. Stiglitz, Sen, and Fitoussi’s 2009 Commission report argued that GDP, useful as a market-production indicator, is treated as a welfare, sustainability, and distributional indicator it was never validated to be. Bounded-proxy-status failure at scale: the metric is used outside its validated domain because no better-grounded number is institutionally available.


V. The four anchors in detail

Causal audit path. A deployed metric should specify the causal chain from the proxy surface to the underlying causal surface it stands in for. Without this, the institution cannot say what a change in the score proves about the thing it governs. What, for example, is the territory that an academic citation count tracks? Donabedian’s 1966 structure-process-outcome framework for healthcare quality is the canonical example of a discipline that requires this explicitly: structural and process measures are valid only when their causal chain to outcomes is specified. The audit path is not statistical fit; it is the named causal mechanism.

Validity envelope and recalibration. The institution declares the domain in which the proxy-territory relation has been validated, and re-anchors periodically when the relation drifts. Drift is not always temporal. A metric can fail because consequence pushed actors into a region where the proxy was never validated. Without an envelope, extremal and regressional drift accumulate silently. Generalizability theory in psychometrics is the technical ancestor; the envelope concept translates it into institutional scope-of-use: the institution declares where the metric may authorize action, and what happens when use exceeds that scope.
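A declared envelope can be sketched as an explicit scope check that runs before a score is allowed to authorize action. The example below is a deliberately reduced illustration: it assumes the envelope can be expressed as a range of one operating condition, and the class, function, and status labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidityEnvelope:
    """Declared range of one operating condition in which the proxy was validated."""
    low: float
    high: float

    def contains(self, condition: float) -> bool:
        return self.low <= condition <= self.high


def permitted_use(envelope: ValidityEnvelope, condition: float) -> str:
    """Say what the score may authorize at the observed operating condition."""
    if envelope.contains(condition):
        return "consequential"
    # Outside the validated region the score is information, not authorization.
    return "advisory-pending-recalibration"


# Hypothetical envelope: the proxy was validated for conditions between 0 and 100.
envelope = ValidityEnvelope(low=0.0, high=100.0)
print(permitted_use(envelope, 40.0))
print(permitted_use(envelope, 250.0))
```

Real envelopes are rarely one-dimensional; the point of the sketch is only that the scope is written down in advance and that leaving it changes what the score is allowed to do.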

Bounded proxy status. The institutional record names the metric as a proxy, states the causal surface it is meant to track, and limits the consequences that can be attached to it outside its validated scope. This is stronger than language hygiene. It carries institutional force: do not reify the score, do not use the proxy outside its validation envelope, do not let the proxy become the sole basis for funding, ranking, closure, or legitimacy. For a carbon credit, bounded proxy status means the buyer may record the credit as a conditional offset claim, not as a physical tonne removed, unless the project class has passed the stated baseline, leakage, and additionality tests. Messick’s 1989 unified validity theory treats score interpretation and use as part of validity, including consequential validity; the governance rule here is stricter: no consequence attaches outside the validated use.

Independent or adversarial verification. The metric is checked by a body outside the measured actor’s reporting chain, with an explicit gaming hypothesis. Self-reporting is the canonical failure mode. The adversarial frame matters: the verifier must assume the measured actor is optimizing the proxy, not the territory, and design checks accordingly. Reliability and replication, in the psychometric and audit traditions, ask whether scores are stable across raters and occasions. Adversarial verification asks something different — whether the actor under consequence can manufacture the score while severing it from the territory. Daniel Koretz’s Measuring Up (2008) is the canonical analysis of how high-stakes accountability under self-reporting produces score inflation and the illusion of progress.

The four conditions are validity-under-use written as an institutional checklist for carbon credits, ESG scores, AI evaluations, NHS targets, university rankings, KPIs, fiscal indicators, and other consequential proxies that do not always arrive packaged as psychometric tests. The psychometric tradition since Cronbach and Meehl’s 1955 “Construct Validity in Psychological Tests” already contains the technical vocabulary of validity, reliability, generalizability, calibration, and (after Messick) consequential validity. The work here is translation and packaging — putting that tradition into deployable form for institutions that have to attach consequence to proxies and want to know what holds the proxy-territory link.


VI. Wrong repairs

When boards, agencies, funders, or managers see a metric failing, they usually add metrics, granularity, transparency, or automation. The instinct is wrong when the missing condition is not metric volume:

Donella Meadows’s “Leverage Points” (1999) ranks numerical parameters as the lowest-leverage intervention in a system. Anchored measurement does not replace rule, goal, or feedback redesign. It only prevents the metric layer from severing the institution’s contact with the governed thing. Higher-leverage moves sit at other layers of the corpus and are not the subject here. The anchor is the floor under any institution that wants to use a consequential metric at all.

The diagnostic before any reform: ask which of the four anchors failed, not which metric to add or improve.


VII. Relation to the corpus

The Measurement Anchor sits at Stack Layer 4 (Measurement). Adjacent corpus essays at neighbouring layers:


VIII. Close

A consequential proxy remains usable only while it is institutionally bounded by a declared validity envelope, a causal audit path, and independent adversarial verification. Outside those conditions, the institution must either recalibrate, reduce the consequence attached to the metric, or explicitly treat the metric as advisory.

A metric is a maintained surface of contact. The anchor keeps that surface from becoming the thing it was meant to observe.

The diagnostic question before any consequential metric is deployed: What is the causal chain from this proxy to the territory? Where has the proxy-territory relation been validated, and what triggers recalibration? Is the metric named as a proxy and limited in the consequences attached to it? Who verifies it, and do they assume the measured actor is optimizing it?




Sources and Notes

The four-anchor primitive. Causal audit path, validity envelope and recalibration, bounded proxy status, independent or adversarial verification. The conditions are necessary, not sufficient: an anchored metric can still be defeated by determined gaming. The contribution is institutional packaging of existing validity, audit, and Goodhart-family insights into a deployable spec for consequential proxies across domains.

The Goodhart family canonical. Charles Goodhart, “Problems of Monetary Management: The U.K. Experience,” in Papers in Monetary Economics 1, Reserve Bank of Australia (1975), 1–20. Donald T. Campbell, “Assessing the Impact of Planned Social Change,” Evaluation and Program Planning 2(1) (1979), 67–90. Marilyn Strathern, “Improving Ratings: Audit in the British University System,” European Review 5(3) (1997), 305–321. Strathern’s stronger claim — beyond the popular paraphrase — is that audit reorganizes institutions around auditable surfaces while jeopardizing what it claims to monitor.

Goodhart taxonomy. David Manheim and Scott Garrabrant, “Categorizing Variants of Goodhart’s Law,” arXiv:1803.04585 (2018). The four-variant taxonomy (regressional, extremal, causal, adversarial) is the standard reference; the Measurement Anchor’s variant-to-condition table is many-to-many, not one-to-one.

Audit Society and NPM critique. Michael Power, The Audit Society: Rituals of Verification (Oxford University Press, 1997). Gwyn Bevan and Christopher Hood, “What’s Measured Is What Matters: Targets and Gaming in the English Public Health Care System,” Public Administration 84(3) (2006), 517–538.

Reactivity and rankings. Wendy Nelson Espeland and Michael Sauder, “Rankings and Reactivity: How Public Measures Recreate Social Worlds,” American Journal of Sociology 113(1) (2007), 1–40. Wendy Espeland and Mitchell Stevens, “A Sociology of Quantification,” European Journal of Sociology 49(3) (2008), 401–436 — secondary.

Numbers under distrust. Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton University Press, 1995). Institutions adopt quantification under conditions of low trust in professional discretion. The Measurement Anchor does not argue against measurement; it argues for institutional discipline once measurement carries consequence.

Map and territory (philosophical parent). Alfred Korzybski, Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics (Lancaster, PA: Science Press Printing Company, 1933).

Construct validity and consequential validity (psychometric ancestors). Lee J. Cronbach and Paul E. Meehl, “Construct Validity in Psychological Tests,” Psychological Bulletin 52(4) (1955), 281–302. Samuel Messick, “Validity,” in Robert L. Linn, ed., Educational Measurement, 3rd edition (American Council on Education / Macmillan, 1989), 13–103 — the unified validity theory including consequential validity. AERA / APA / NCME, Standards for Educational and Psychological Testing (multiple editions, most recently 2014). The four-anchor spec is not a rival measurement theory; it translates validity-under-use into a governance primitive for consequential proxies outside the psychometric test context.

Healthcare quality measurement. Avedis Donabedian, “Evaluating the Quality of Medical Care,” Milbank Memorial Fund Quarterly 44(3), Part 2 (1966), 166–206. The structure-process-outcome framework is canonical for causal-audit-path discipline in healthcare measurement.

Accountability and gaming. Daniel Koretz, Measuring Up: What Educational Testing Really Tells Us (Harvard University Press, 2008). Koretz analyzes test-based accountability under high stakes — score inflation, teaching to the test, illusions of progress — and is the canonical applied psychometric study of adversarial Goodhart under self-reporting.

Carbon offset specimen. Benedict S. Probst, Malte Toetzke, Andreas Kontoleon, Laura Díaz Anadón, Jan C. Minx, Barbara K. Haya, Lambert Schneider, Philipp A. Trotter, Thales A. P. West, Annelise Gill-Wiehl, and Volker H. Hoffmann, “Systematic assessment of the achieved emission reductions of carbon crediting projects,” Nature Communications 15:9562 (2024). Coverage: 14 studies of 2,346 projects, plus 51 studies of similar non-credit interventions, roughly one-fifth of credit volume issued, almost one billion tons CO₂e. Headline: less than 16% of credits issued to investigated projects represented real emission reductions; the study estimates 812 of 972 million credits across covered project types likely do not constitute real reductions. Variation by project type: cookstoves 11%, SF₆ destruction 16%, avoided deforestation 25%, HFC-23 abatement 68%; no statistically significant reductions for Chinese wind power and US improved forest management in the investigated set. The framing here is measurement-anchor failure entangled with market design, baseline construction, and verification-incentive design, not pure measurement pathology.

Financial VaR / tail-risk specimen. Jón Daníelsson, “Blame the Models,” Journal of Financial Stability 4(4) (2008), 321–328. Statistical risk models are useful for frequent small events; under consequence pushed into systemic-tail regimes, the proxy-target relation does not hold. Extremal Goodhart, validity-envelope failure.

School quality ratings specimen. Joshua Angrist, Peter Hull, Parag A. Pathak, and Christopher R. Walters, “Race and the Mismeasure of School Quality,” American Economic Review: Insights 6(1) (March 2024), 20–37. School ratings in Denver and New York correlate with white enrollment shares partly because student demographics and selection drive both ratings and perceived value-added. Causal Goodhart: proxy and target share a common cause.

Beyond-GDP institutional case. Joseph E. Stiglitz, Amartya Sen, and Jean-Paul Fitoussi, Report by the Commission on the Measurement of Economic Performance and Social Progress (2009). GDP is useful as a market-production indicator; it is treated as a welfare, sustainability, and distributional indicator outside its validated scope. The canonical institutional case for bounded-proxy-status failure at scale.

Academic citations specimen. Michael Fire and Carlos Guestrin, “Over-optimization of academic publishing metrics: observing Goodhart’s Law in action,” GigaScience 8(6) (2019), giz053. Mixed case: regressional component at extreme citation levels, plus reactivity (Strathern) and adversarial gaming (Campbell). Not assigned cleanly to a single variant.

Wells Fargo aside. The CFPB’s 2016 enforcement case documented sales-target-driven fake-account creation. The case is the private-sector cousin of Bevan-Hood NHS gaming but is dominated by deliberate fraud under incentive pressure rather than by measurement-anchor failure proper. Mentioned for completeness, not as a load-bearing specimen.

Leverage-point caveat. Donella H. Meadows, “Leverage Points: Places to Intervene in a System” (Sustainability Institute working paper, 1999). Meadows ranks numerical parameters as the lowest-leverage intervention. The Measurement Anchor is consistent: anchored measurement is the minimum condition for using metrics without losing contact with reality, not the highest-leverage reform.

Status of the partition. The four-anchor decomposition is a working compression of measurement-validity, audit-theory, and Goodhart-family insights into one institutional deployment spec. The conditions overlap in live practice: bounded proxy status is implicit in any validity envelope; adversarial verification often depends on a clean audit path. The conditions are useful as a diagnostic checklist before deployment, not as mutually exclusive species. A better compression that did the same diagnostic work would be welcome.