The Measurement Anchor

What holds a proxy to its territory under consequence.

Elias Kunnas

A metric is a proxy placed under consequence. Institutions use proxies because they cannot act on unbounded reality. Once reward, punishment, funding, ranking, legitimacy, or closure attaches to a proxy, actors learn which reported number pays — and unless the proxy is anchored, its relation to the underlying territory begins to drift. A measurement regime survives this pressure only if four institutional conditions preserve the proxy-territory link: a causal audit path, a validity envelope with recalibration, bounded proxy status, and independent or adversarial verification. Goodhart-family pathologies are what happens when one or more of those anchors fail. The measurement anchor is the institutional discipline that keeps a consequential proxy from becoming the thing it was meant to observe.

I. The credit that was not a tonne

A carbon credit is treated as a quantified claim about the territory: one credit equals one tonne of CO₂e reduced or avoided relative to a counterfactual baseline world. The unit is tradable. Once tradable, the proxy is under direct consequence — every actor in the system has reason to optimize the proxy.

Probst and colleagues, in a 2024 Nature Communications synthesis covering roughly a fifth of credit volume issued to date and almost a billion tons CO₂e, estimated that less than 16% of credits issued to the investigated projects constituted real emission reductions. The variation by project type was large: 11% for cookstoves, 16% for SF₆ destruction, 25% for avoided deforestation, 68% for HFC-23 abatement, and no statistically significant reductions for Chinese wind power and US improved forest management in the investigated set. The mechanisms behind the gap include additionality failures, inflated baselines, leakage, weak usage measurement, methodology problems, and verifier-incentive misalignment.

Fraud is the narrow case; the general failure is structural response to tradable consequence. The carbon-credit specimen tests the four anchors: Causal audit path — how does the project cause reductions relative to the counterfactual? Validity envelope — where do baseline and leakage assumptions hold? Bounded proxy status — is the credit a conditional quantified claim, or atmospheric subtraction made solid? Independent verification — do auditors test gaming and favourable-assumption strategies, and what is their incentive structure?

A carbon credit is a counterfactual proposition about a baseline world, not a measured physical tonne. When the institutional discipline that maintains the link from the proposition to the territory fails, the credit drifts away from the tonne — the structural pathology of consequential proxies.

II. What a measurement anchor is

A metric is a pointer from a measurable surface (the proxy) back to a causal surface (the territory) the institution wants to act on. The pointer is useful only as long as the proxy-territory relation is stable. Once the metric becomes consequential — once an actor’s reward, funding, status, legitimacy, or operational closure depends on the score — actors apply selection pressure to the proxy. The territory may also change under reactivity: rankings reshape institutions, targets reshape services. The structural point is that the proxy’s information content changes under pressure — the same number no longer means what it meant before consequence attached.

The anchor preserves the measurement relation when measurement starts to govern.

A measurement anchor is the set of institutional conditions that keeps a consequential proxy connected to the territory it stands in for. It has four:

Causal audit path — a specified causal chain from the proxy surface back to the underlying territory. The audit path is not “we measured X carefully”; it is “we can show the chain through which X is causally connected to Y, the thing we want.”
Validity envelope and recalibration — the domain in which the proxy-territory relation has been validated, with periodic re-anchoring when the relation drifts under pressure or context change.
Bounded proxy status — the institutional record names the metric as a proxy, states the causal surface it tracks, and limits the consequences that can be attached outside its validated scope. A school score cannot be used outside the population it was validated on. A hospital target should not become the sole basis for funding unless the institution has validated the proxy for that use and installed stronger verification. A carbon credit cannot be treated as atmospheric subtraction unless the claim class allows.
Independent or adversarial verification — a verifier outside the measured actor’s reporting chain, with an explicit gaming hypothesis: the check assumes the actor is optimizing the proxy, not the territory, and designs the audit accordingly.

These conditions are the minimum institutional spec a consequential proxy should pass before deployment; determined gaming can still defeat well-anchored measurement.
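
As a rough illustration, the four conditions can be read as fields of a pre-deployment record. The sketch below is an assumption made for this illustration, not a schema drawn from any of the cited sources: the class name AnchorSpec, its field names, and the example values are all hypothetical. An empty result means only that no anchor is missing on paper; determined gaming can still defeat a proxy that passes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnchorSpec:
    """Hypothetical pre-deployment record for one consequential proxy."""
    proxy: str
    territory: str
    causal_audit_path: Optional[str] = None       # named causal chain, proxy -> territory
    validity_envelope: Optional[str] = None       # domain where the relation was validated
    recalibration_trigger: Optional[str] = None   # what prompts re-anchoring
    permitted_consequences: List[str] = field(default_factory=list)  # bounded proxy status
    independent_verifier: Optional[str] = None    # body outside the reporting chain
    gaming_hypothesis: Optional[str] = None       # how the verifier expects the score to be gamed

    def missing_anchors(self) -> List[str]:
        """Return the anchors this record leaves unspecified."""
        missing = []
        if not self.causal_audit_path:
            missing.append("causal audit path")
        if not (self.validity_envelope and self.recalibration_trigger):
            missing.append("validity envelope and recalibration")
        if not self.permitted_consequences:
            missing.append("bounded proxy status")
        if not (self.independent_verifier and self.gaming_hypothesis):
            missing.append("independent or adversarial verification")
        return missing


# Illustrative values only; not taken from any real registry or methodology.
credit = AnchorSpec(
    proxy="carbon credit",
    territory="tonnes of CO2e reduced or avoided against a counterfactual baseline",
    causal_audit_path="project activity -> emissions below the baseline",
    permitted_consequences=["record as a conditional offset claim"],
)
print(credit.missing_anchors())
# ['validity envelope and recalibration', 'independent or adversarial verification']
```

The record is deliberately strict about pairs: an envelope without a recalibration trigger, or a verifier without a gaming hypothesis, counts as a missing anchor.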

Alfred Korzybski’s 1933 formulation — “the map is not the territory” — is honored prior art at the level of representation. The Measurement Anchor is one layer downstream: not whether the map is the territory (it isn’t), but what institutional discipline holds the map’s relation to the territory once the map starts governing. Korzybski warned against confusing maps with territory; consequence forces institutions to treat the governing map as the governed object unless the proxy relation is maintained.

III. Consequence as selection pressure

Selection pressure on a proxy is not a moral failing of the actors. It is the predictable structural response of any system where actors are evaluated against a measurable surface.

Charles Goodhart’s 1975 monetary-policy paper named the dynamic: any observed statistical regularity tends to collapse once pressure is placed on it for control purposes. Donald Campbell’s 1979 social-indicators paper made the parallel claim: quantitative indicators used for social decision-making become subject to corruption pressures and distort the processes they monitor. Marilyn Strathern’s 1997 paper “Improving Ratings: Audit in the British University System” supplied the popular paraphrase — when a measure becomes a target, it ceases to be a good measure — and a stronger claim alongside it: audit is not merely neutral checking; it reorganizes institutions around auditable surfaces while jeopardizing what it claims to monitor.

Theodore Porter’s Trust in Numbers (1995) is the steelman against treating quantification as a pure mistake: institutions turn to numbers because personal trust, professional discretion, and institutional legitimacy are politically unavailable. Numbers are a governance technology adopted under conditions of distrust. Consequence-attached measurement requires institutional discipline to remain a working pointer rather than degenerating into a target.

Wendy Espeland and Michael Sauder’s “Rankings and Reactivity” (2007), studying law-school rankings, showed how public consequential measures recreate the social worlds they purport to describe: institutions reorganize around the metric, expectations shift, and the metric’s relation to the underlying object changes through the act of attaching consequence to it. Reactivity is the structural mechanism against which the anchor has to preserve the proxy-territory relation.

The missing institutional object is the maintenance spec that keeps the pointer working under consequence pressure.

IV. The Goodhart family

Goodhart variants name how proxies fail; anchors name what institutions failed to maintain. The canonical taxonomy is Manheim and Garrabrant’s “Categorizing Variants of Goodhart’s Law” (2018), which distinguishes four mechanisms:

Regressional Goodhart — proxy-target correlation breaks at extreme values. The correlation held at moderate intensity; extreme optimization selects metric-favorable noise rather than the underlying territory.
Extremal Goodhart — the proxy is used in a regime where the proxy-target relation was never validated. The metric was validated in one regime; consequence pushes actors into a regime where its information content differs.
Causal Goodhart — proxy and target share a common cause. Intervention on the proxy doesn’t propagate to the territory because the proxy is downstream of, not upstream of, the underlying causal surface.
Adversarial Goodhart — actors actively game the proxy.
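
The regressional case can be made concrete with a small simulation. This is a toy model under stated assumptions, not Manheim and Garrabrant's formalism: each unit has a territory value drawn from a standard normal distribution, the proxy is that value plus independent normal noise, and the institution rewards the top slice ranked by proxy. The function name and its parameters are hypothetical.

```python
import random
import statistics
from typing import Tuple


def regressional_gap(
    n: int = 10_000,
    top_fraction: float = 0.01,
    noise_sd: float = 1.0,
    seed: int = 0,
) -> Tuple[float, float]:
    """Select the top units by proxy; return (mean proxy, mean territory) among them."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        territory = rng.gauss(0.0, 1.0)              # the thing the institution cares about
        proxy = territory + rng.gauss(0.0, noise_sd)  # the reported score: territory plus noise
        units.append((proxy, territory))

    # Reward the highest-scoring slice, as a ranking or funding rule would.
    units.sort(key=lambda u: u[0], reverse=True)
    k = max(1, int(n * top_fraction))
    top = units[:k]

    mean_proxy = statistics.fmean(p for p, _ in top)
    mean_territory = statistics.fmean(t for _, t in top)
    return mean_proxy, mean_territory


mean_proxy, mean_territory = regressional_gap()
print(f"selected units: mean proxy {mean_proxy:.2f}, mean territory {mean_territory:.2f}")
```

With equal territory and noise variance, the rewarded units' average territory value comes out at roughly half their average score, and the absolute gap between score and territory widens as the selection threshold rises.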

The variant taxonomy classifies failure mechanisms. The Measurement Anchor classifies maintenance failures. They overlap, but they do not map one-to-one.

The diagnostic move is to ask which anchor failed, not only which variant the failure was: the variant classifies the mechanism, and the anchor names the repair obligation.

Goodhart variant	Typical anchor failure
Regressional	Validity envelope and bounded proxy status
Extremal	Validity envelope and recalibration
Causal	Causal audit path
Adversarial	Independent verification, bounded proxy status, often also audit path
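
Read as a lookup, the table is many-to-many, which a short sketch makes visible. The dictionary below transcribes the table; the function name is hypothetical, and treating a mixed case as the ordered union of its variants' anchors is an assumption of this sketch rather than a rule stated in the taxonomy.

```python
from typing import Dict, Iterable, List

# Transcription of the table above. "Validity envelope" in the regressional row
# is mapped to the full anchor name "validity envelope and recalibration".
TYPICAL_ANCHOR_FAILURES: Dict[str, List[str]] = {
    "regressional": ["validity envelope and recalibration", "bounded proxy status"],
    "extremal": ["validity envelope and recalibration"],
    "causal": ["causal audit path"],
    "adversarial": [
        "independent or adversarial verification",
        "bounded proxy status",
        "causal audit path",  # the table says "often also audit path"
    ],
}


def anchors_to_inspect(variants: Iterable[str]) -> List[str]:
    """Return the anchors to check for one or more diagnosed variants, without duplicates."""
    seen: List[str] = []
    for variant in variants:
        for anchor in TYPICAL_ANCHOR_FAILURES[variant.lower()]:
            if anchor not in seen:
                seen.append(anchor)
    return seen


# A mixed case, read here (as an illustration) as regressional plus adversarial.
print(anchors_to_inspect(["regressional", "adversarial"]))
# ['validity envelope and recalibration', 'bounded proxy status',
#  'independent or adversarial verification', 'causal audit path']
```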

Four specimens show the variant-anchor split: VaR breaks at the envelope, school ratings at the audit path, NHS targets at verification, citation indices across several anchors.

Financial VaR models — a value-at-risk model validated for frequent, ordinary market-risk estimation operates accurately in its calibration regime. When pushed into systemic-tail regimes, the proxy-target relation does not hold. Jón Daníelsson’s “Blame the Models” criticized this exact failure: statistical risk models are useful for frequent small events but unreliable when applied to systemically important events. Extremal Goodhart; validity-envelope failure.
School quality ratings — ratings often correlate with student outcomes through common causes (student demographics, family income, selection effects) rather than through school value-added. Angrist, Hull, Pathak, and Walters (2024) found that school performance ratings’ correlation with white enrollment shares in Denver and New York reflects selection bias rather than causal value-add. Intervention on the rating doesn’t propagate to school quality because the rating is downstream of the common cause. Causal Goodhart; audit-path failure.
NHS target gaming — Bevan and Hood (2006) documented the canonical adversarial case under the English NHS “targets and terror” regime: ambulance response-time correction and reclassification below the eight-minute threshold, A&E gaming through extra staffing during measured periods and cancellation of operations to keep within waiting-time limits, trolley-wait gaming by treating trolleys as beds, and inappropriate waiting-list adjustments affecting nearly 6,000 records. The targets were validated metrics at the population level; under consequence, the institutional reporting chain optimized the score while the underlying service worsened or stayed unchanged. Adversarial Goodhart; independent-verification failure plus bounded-proxy-status failure.
Academic citation indices — Fire and Guestrin (2019) document publication inflation, longer titles and reference lists, self-citation, h-index gaming, and reactive academic publishing. This is a mixed case: regressional component at extreme citation levels (where the score selects for field size and citation-network position rather than research quality), but also strong reactivity in Strathern’s sense and gaming in Campbell’s. It does not fit a single variant cleanly.

GDP is a different kind of case worth naming. Stiglitz, Sen, and Fitoussi’s 2009 Commission report argued that GDP, useful as a market-production indicator, is treated as a welfare, sustainability, and distributional indicator it was never validated to be. Bounded-proxy-status failure at scale: the metric is used outside its validated domain because no better-grounded number is institutionally available.

V. The four anchors in detail

Causal audit path. A deployed metric should specify the causal chain from the proxy surface to the underlying causal surface it stands in for. Without this, the institution cannot say what a change in the score proves about the thing it governs — what is the territory the academic citation count tracks? Donabedian’s 1966 structure-process-outcome framework for healthcare quality is the canonical example of a discipline that requires this explicitly: structural and process measures are valid only when their causal chain to outcomes is specified. The audit path is not statistical fit; it is the named causal mechanism.
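
The gap between statistical fit and a named causal mechanism can be illustrated with a toy simulation in the spirit of the school-ratings specimen. Everything here is assumed for illustration: a single common cause drives both the rating and the outcome, and an "intervention" adds a fixed boost to the rating only. Nothing is estimated from Angrist and colleagues' data, and the function names are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def simulate(n: int = 5000, boost: float = 0.0, seed: int = 1) -> Tuple[List[float], List[float]]:
    """Rating and outcome both depend on a common cause; boost shifts the rating only."""
    rng = random.Random(seed)
    ratings, outcomes = [], []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)                          # e.g. intake characteristics
        ratings.append(common + rng.gauss(0.0, 0.5) + boost)  # proxy
        outcomes.append(common + rng.gauss(0.0, 0.5))         # territory
    return ratings, outcomes


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ratings, outcomes = simulate()
boosted_ratings, boosted_outcomes = simulate(boost=1.0)

print(f"rating-outcome correlation: {pearson(ratings, outcomes):.2f}")
print(f"rating shift after intervention: "
      f"{statistics.fmean(boosted_ratings) - statistics.fmean(ratings):.2f}")
print(f"outcomes unchanged: {boosted_outcomes == outcomes}")
```

The correlation is high, yet raising every rating leaves every outcome exactly where it was, because the model contains no arrow from rating to outcome. A fit statistic alone cannot distinguish this world from one in which the rating does drive the outcome; the audit path has to name the mechanism.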

Validity envelope and recalibration. The institution declares the domain in which the proxy-territory relation has been validated, and re-anchors periodically when the relation drifts. Drift is not always temporal. A metric can fail because consequence pushed actors into a region where the proxy was never validated. Without an envelope, extremal and regressional drift accumulate silently. Generalizability theory in psychometrics is the technical ancestor; the envelope concept translates it into institutional scope-of-use: the institution declares where the metric may authorize action, and what happens when use exceeds that scope.
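
A minimal sketch of scope-of-use as a check, assuming the envelope can be reduced to a validated value range plus a set of validated uses. Real envelopes are richer (populations, time periods, regimes); the class, the function, and the numbers below are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ValidityEnvelope:
    """Hypothetical declared scope for one proxy."""
    low: float                       # lowest proxy value covered by validation
    high: float                      # highest proxy value covered by validation
    validated_uses: FrozenSet[str]   # decisions the proxy was validated to inform


def scope_of_use(envelope: ValidityEnvelope, value: float, use: str) -> str:
    """Say whether a proxy reading may authorize action for the given use."""
    if use not in envelope.validated_uses:
        return "advisory only: use not validated"
    if not (envelope.low <= value <= envelope.high):
        return "advisory only: recalibrate before relying on this value"
    return "may authorize action"


# Illustrative numbers, not drawn from any real rating scheme.
rating_envelope = ValidityEnvelope(
    low=40.0, high=90.0, validated_uses=frozenset({"parent information"})
)

print(scope_of_use(rating_envelope, 72.0, "parent information"))  # may authorize action
print(scope_of_use(rating_envelope, 97.0, "parent information"))  # outside the validated range
print(scope_of_use(rating_envelope, 72.0, "school closure"))      # use not validated
```

The design choice is that an out-of-envelope reading is downgraded rather than discarded: the number still informs, but it stops authorizing.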

Bounded proxy status. The institutional record names the metric as a proxy, states the causal surface it is meant to track, and limits the consequences that can be attached to it outside its validated scope. This is stronger than language hygiene. It carries institutional force: do not reify the score, do not use the proxy outside its validation envelope, do not let the proxy become the sole basis for funding, ranking, closure, or legitimacy. For a carbon credit, bounded proxy status means the buyer may record the credit as a conditional offset claim, not as a physical tonne removed, unless the project class has passed the stated baseline, leakage, and additionality tests. Messick’s 1989 unified validity theory treats score interpretation and use as part of validity, including consequential validity; the governance rule here is stricter: no consequence attaches outside the validated use.

Independent or adversarial verification. The metric is checked by a body outside the measured actor’s reporting chain, with an explicit gaming hypothesis. Self-reporting is the canonical failure mode. The adversarial frame matters: the verifier must assume the measured actor is optimizing the proxy, not the territory, and design checks accordingly. Reliability and replication, in the psychometric and audit traditions, ask whether scores are stable across raters and occasions. Adversarial verification asks something different — whether the actor under consequence can manufacture the score while severing it from the territory. Daniel Koretz’s Measuring Up (2008) is the canonical analysis of how high-stakes accountability under self-reporting produces score inflation and the illusion of progress.

These four conditions turn validity-under-use into an institutional checklist for consequential proxies — carbon credits, ESG scores, AI evaluations, NHS targets, university rankings, KPIs, fiscal indicators — that do not always arrive packaged as psychometric tests. The psychometric tradition since Cronbach and Meehl’s 1955 “Construct Validity in Psychological Tests” already contains the technical vocabulary of validity, reliability, generalizability, calibration, and (after Messick) consequential validity; the institutional translation is what consequence-attached proxies need before deployment.

Cross-system anchor failure: the orphaned return trip

The four anchors assume one institution owns the proxy and bears the consequence of its drift. Consequential proxies routinely break that assumption: a metric is produced under one actor’s schema and relied on for another actor’s decision. A CVSS base score, a PRIIPs key-information document, a published p-value, an inspection grade, or a benchmark number can be valid inside its production rule while omitting the local exposure, recency, drift, counterparty condition, or replication state the downstream decision requires. The failure is not that the symbol is false. It is that the producer owns routine production, the relier owns the consequence, and no actor owns the routine return trip from the symbol back to the decision-relevant territory — the causal audit path runs forward to the score and stops, because the actor positioned to reconstruct the territory is not the actor obligated to.

The diagnostic adds an owner question to the four anchors: what decision is being made from the symbol, what local context the schema omits, who can reconstruct it, and who is obligated, funded, triggered, and answerable for doing so before reliance. Where the answer is no one, the repair is the anchor’s own — bound the proxy to advisory use, reduce the consequence, or install a verification owner (CVSS Environmental metrics, SSVC-style prioritization, risk-based sampling, a replication budget, a drift trigger, independent re-anchoring) — but assigned across the boundary rather than inside one institution, and the owner need not be the producer. The gate against overfiring is consequence: the orphaned return trip is a failure only where a material downstream decision falls inside a window in which reconstruction would change the action; a grade read for an ordinary lunch needs no owner. Where the symbol is credited as having executed the decision it only informs, the failure is Nominal Execution’s; where the territory has been reconstructed but no actor must act, it is Powerless Intelligence’s; where a reliance already made must be reopened after it proves wrong, it is Corrective Closure Ownership’s.
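
The owner question and its consequence gate can be sketched as a small decision function. This is an illustrative reading of the paragraph above, not a formal procedure: the Reliance record, its field names, and the two example cases are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reliance:
    """Hypothetical record of one decision made from another actor's symbol."""
    symbol: str
    decision: str
    material: bool                             # is the downstream decision material?
    reconstruction_would_change_action: bool   # would local context change what is done?
    return_trip_owner: Optional[str] = None    # who must reconstruct before reliance


def diagnose(r: Reliance) -> str:
    """Apply the consequence gate first, then ask whether the return trip has an owner."""
    if not (r.material and r.reconstruction_would_change_action):
        return "no owner needed"
    if r.return_trip_owner:
        return f"return trip owned by {r.return_trip_owner}"
    return ("orphaned return trip: bound the proxy to advisory use, "
            "reduce the consequence, or install a verification owner")


lunch = Reliance(
    symbol="inspection grade",
    decision="where to eat an ordinary lunch",
    material=False,
    reconstruction_would_change_action=False,
)
patching = Reliance(
    symbol="CVSS base score",
    decision="which systems to patch first",
    material=True,
    reconstruction_would_change_action=True,
)

print(diagnose(lunch))     # no owner needed
print(diagnose(patching))  # orphaned return trip: ...
```

Putting the gate before the owner check keeps the diagnostic from overfiring: low-stakes reliance never reaches the ownership question.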

VI. Wrong repairs

When boards, agencies, funders, or managers see a metric failing, they usually add metrics, granularity, transparency, or automation. The instinct is wrong when the missing condition is not metric volume:

More metrics without causal audit path → more degrees of freedom for gaming, not better measurement. A new ESG sub-score inherits whatever audit-path failure broke the headline ESG score.
More granular metrics without validity envelope → finer-grained drift outside the validated range. Breaking the school rating into ten subscales does not extend the population it was validated on; it just splits the same selection bias into ten pieces.
More transparent metrics without bounded proxy status → audience and actor both treat the metric as the thing itself. Publishing the carbon-credit methodology does not stop buyers from booking the credit as a physical tonne when the institution never bounds the claim.
More automated metrics without independent verification → the actor authoring the data is still the actor being measured. NHS trust dashboards updated nightly do not solve self-reporting; they speed it up.

Donella Meadows’s “Leverage Points” (1999) ranks numerical parameters as the lowest-leverage intervention in a system. Anchored measurement is the floor under consequential metrics; rule, goal, and feedback redesign operate above it. The anchor prevents the metric layer from severing the institution’s contact with the governed thing.

The diagnostic before any reform: ask which of the four anchors failed, not which metric to add or improve.

VII. Relation to the corpus

The Measurement Anchor sits at Stack Layer 4 (Measurement). Adjacent corpus essays at neighbouring layers:

Mechanism Space — the philosophical parent: the metric is the semantic-space pointer; the territory is mechanism-space. The Anchor is the institutional discipline that preserves the relation under consequence.
Response Vector — formal gaming is one of four response channels actors take when a metric becomes target. The Anchor is upstream: the measurement-side failure that activates the gaming channel.
Constructive Diagnosis — the §VII wrong-repair grid contains the closest existing compressed version of this thesis. The Anchor is the dedicated expansion for the measurement layer.
Bad Equilibria Are Not One Thing — “Payoff failure” is the diagnostic gate. The Anchor supplies the measurement-layer mechanism beneath it.
The Severed Map — the discipline-level pattern: when the map stops updating from the territory. The Anchor is the metric-object specialization of the same failure.
Cargo Cult Epistemology — ritualized verification decoupled from causal contact. Anchor failure on the verification side.
Implementation Ledger — the verification field requires an anchored metric. The Anchor is upstream of any conditional-closure design.
The Stack — Layer 4 anchor.

VIII. Close

A consequential proxy remains usable only while it is named and bounded as a proxy and held in place by a causal audit path, a declared validity envelope, and independent adversarial verification. Outside those conditions, the institution must recalibrate, reduce the consequence attached to the metric, or explicitly treat the metric as advisory.

A metric is a maintained surface of contact. The anchor keeps that surface from becoming the thing it was meant to observe.

The diagnostic question before any consequential metric is deployed: What is the causal chain from this proxy to the territory? Where has the proxy-territory relation been validated, and what triggers recalibration? Is the metric named as a proxy and limited in the consequences attached to it? Who verifies it, and do they assume the measured actor is optimizing it?

Related:

Mechanism Space — the metric is the semantic-space pointer; territory is mechanism-space.
Response Vector — what actors do when a metric becomes target.
Constructive Diagnosis — the wrong-repair discipline this essay specializes for the measurement layer.
Bad Equilibria Are Not One Thing — “Payoff failure” gate; the Anchor supplies its measurement-layer mechanism.
Implementation Ledger — conditional closure requires anchored verification.
The Refusal to Compute — the Layer 7 sibling: how institutions avoid producing or admitting the constraining relation in the first place, before a proxy could even drift.
The Severed Map — discipline-level map-detachment; the Anchor is the metric-object specialization.
Cargo Cult Epistemology — ritualized verification decoupled from causal contact.
The Stack — Layer 4 anchor.

Sources and Notes

The four-anchor primitive. Causal audit path, validity envelope and recalibration, bounded proxy status, independent or adversarial verification. The conditions are necessary, not sufficient: an anchored metric can still be defeated by determined gaming. The contribution is institutional packaging of existing validity, audit, and Goodhart-family insights into a deployable spec for consequential proxies across domains.

The Goodhart family canonical. Charles Goodhart, “Problems of Monetary Management: The U.K. Experience,” in Papers in Monetary Economics 1, Reserve Bank of Australia (1975), 1–20. Donald T. Campbell, “Assessing the Impact of Planned Social Change,” Evaluation and Program Planning 2(1) (1979), 67–90. Marilyn Strathern, “Improving Ratings: Audit in the British University System,” European Review 5(3) (1997), 305–321. Strathern’s stronger claim — beyond the popular paraphrase — is that audit reorganizes institutions around auditable surfaces while jeopardizing what it claims to monitor.

Goodhart taxonomy. David Manheim and Scott Garrabrant, “Categorizing Variants of Goodhart’s Law,” arXiv:1803.04585 (2018). The four-variant taxonomy (regressional, extremal, causal, adversarial) is the standard reference; the Measurement Anchor’s variant-to-condition table is many-to-many, not one-to-one.

Audit Society and NPM critique. Michael Power, The Audit Society: Rituals of Verification (Oxford University Press, 1997). Gwyn Bevan and Christopher Hood, “What’s Measured Is What Matters: Targets and Gaming in the English Public Health Care System,” Public Administration 84(3) (2006), 517–538.

Reactivity and rankings. Wendy Nelson Espeland and Michael Sauder, “Rankings and Reactivity: How Public Measures Recreate Social Worlds,” American Journal of Sociology 113(1) (2007), 1–40. Wendy Espeland and Mitchell Stevens, “A Sociology of Quantification,” European Journal of Sociology 49(3) (2008), 401–436 — secondary.

Numbers under distrust. Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton University Press, 1995). Institutions adopt quantification under conditions of low trust in professional discretion. The Measurement Anchor does not argue against measurement; it argues for institutional discipline once measurement carries consequence.

Map and territory (philosophical parent). Alfred Korzybski, Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics (Lancaster, PA: Science Press Printing Company, 1933).

Construct validity and consequential validity (psychometric ancestors). Lee J. Cronbach and Paul E. Meehl, “Construct Validity in Psychological Tests,” Psychological Bulletin 52(4) (1955), 281–302. Samuel Messick, “Validity,” in Robert L. Linn, ed., Educational Measurement, 3rd edition (American Council on Education / Macmillan, 1989), 13–103 — the unified validity theory including consequential validity. AERA / APA / NCME, Standards for Educational and Psychological Testing (multiple editions, most recently 2014). The four-anchor spec is not a rival measurement theory; it translates validity-under-use into a governance primitive for consequential proxies outside the psychometric test context.

Healthcare quality measurement. Avedis Donabedian, “Evaluating the Quality of Medical Care,” Milbank Memorial Fund Quarterly 44(3), Part 2 (1966), 166–206. The structure-process-outcome framework is canonical for causal-audit-path discipline in healthcare measurement.

Accountability and gaming. Daniel Koretz, Measuring Up: What Educational Testing Really Tells Us (Harvard University Press, 2008). Koretz analyzes test-based accountability under high stakes — score inflation, teaching to the test, illusions of progress — and is the canonical applied psychometric study of adversarial Goodhart under self-reporting.

Carbon offset specimen. Benedict S. Probst, Malte Toetzke, Andreas Kontoleon, Laura Díaz Anadón, Jan C. Minx, Barbara K. Haya, Lambert Schneider, Philipp A. Trotter, Thales A. P. West, Annelise Gill-Wiehl, and Volker H. Hoffmann, “Systematic assessment of the achieved emission reductions of carbon crediting projects,” Nature Communications 15:9562 (2024). Coverage: 14 studies of 2,346 projects, plus 51 studies of similar non-credit interventions, roughly one-fifth of credit volume issued, almost one billion tons CO₂e. Headline: less than 16% of credits issued to investigated projects represented real emission reductions; the study estimates 812 of 972 million credits across covered project types likely do not constitute real reductions. Variation by project type: cookstoves 11%, SF₆ destruction 16%, avoided deforestation 25%, HFC-23 abatement 68%; no statistically significant reductions for Chinese wind power and US improved forest management in the investigated set. The framing here is measurement-anchor failure entangled with market design, baseline construction, and verification-incentive design, not pure measurement pathology.

Financial VaR / tail-risk specimen. Jón Daníelsson, “Blame the Models,” Journal of Financial Stability 4(4) (2008), 321–328. Statistical risk models are useful for frequent small events; under consequence pushed into systemic-tail regimes, the proxy-target relation does not hold. Extremal Goodhart, validity-envelope failure.

School quality ratings specimen. Joshua Angrist, Peter Hull, Parag A. Pathak, and Christopher R. Walters, “Race and the Mismeasure of School Quality,” American Economic Review: Insights 6(1) (March 2024), 20–37. School ratings in Denver and New York correlate with white enrollment shares partly because student demographics and selection drive both ratings and perceived value-added. Causal Goodhart: proxy and target share a common cause.

Beyond-GDP institutional case. Joseph E. Stiglitz, Amartya Sen, and Jean-Paul Fitoussi, Report by the Commission on the Measurement of Economic Performance and Social Progress (2009). GDP is useful as a market-production indicator; it is treated as a welfare, sustainability, and distributional indicator outside its validated scope. The canonical institutional case for bounded-proxy-status failure at scale.

Academic citations specimen. Michael Fire and Carlos Guestrin, “Over-optimization of academic publishing metrics: observing Goodhart’s Law in action,” GigaScience 8(6) (2019), giz053. Mixed case: regressional component at extreme citation levels, plus reactivity (Strathern) and adversarial gaming (Campbell). Not assigned cleanly to a single variant.

Wells Fargo aside. The CFPB’s 2016 enforcement case documented sales-target-driven fake-account creation. The case is the private-sector cousin of Bevan-Hood NHS gaming but is dominated by deliberate fraud under incentive pressure rather than by measurement-anchor failure proper. Mentioned for completeness, not as a load-bearing specimen.

Leverage-point caveat. Donella H. Meadows, “Leverage Points: Places to Intervene in a System” (Sustainability Institute working paper, 1999). Meadows ranks numerical parameters as the lowest-leverage intervention. The Measurement Anchor is consistent: anchored measurement is the minimum condition for using metrics without losing contact with reality, not the highest-leverage reform.

Status of the partition. The four-anchor decomposition is a working compression of measurement-validity, audit-theory, and Goodhart-family insights into one institutional deployment spec. The conditions overlap in live practice: bounded proxy status is implicit in any validity envelope; adversarial verification often depends on a clean audit path. The conditions are useful as a diagnostic checklist before deployment, not as mutually exclusive species. A better compression that did the same diagnostic work would be welcome.