RED TEAM — Refuting the CaaCS-Delta Corroboration (randy-reducer)

Author: Randy Reducer, on a RED TEAM (dialectical antithesis). Branch: design/naming-identity-ssot-alignment @ 3.2.0 — read-only; no commit, no branch switch. Target: caacs-delta-robbie.md + naming-identity-ssot-strangler/caacs-* — the white team's claim that the v3.1.10→v3.2.0 metrics prove (1) authorities cooled / consumers heated, (2) "what is ratcheted shrinks, what is not grows" is the causal mechanism, (3) the goals are evidence-grounded continuations.

Governance applied.

  • Directives (from randy-reducer.agent.yaml): DIRECTIVE_001 (Architectural Integrity — compression must clarify boundaries, not relabel coupling), DIRECTIVE_024 (Locality of Change — a "law" must be tied to the evidence, not extrapolated), DIRECTIVE_030/034 (a reduction is only real when the behavioral envelope shrinks with it — moving lines into a husk is not reduction).
  • Tactic forensic-repository-audit. Its own failure_modes are the weapons here, in order: vanity-file / scope dominance, squash-merge distortion (the white team flagged this but did not neutralise it), no-complexity-capture-in-raw-git-data, no-rename-following, and the unstated-but-fatal denominator failure (raw counts across two snapshots of different total size).
  • Semantic-compression equivalence lens: a count that drops because the syntax changed (from x.sub import Yfrom x import Y) while the coupling fan-in grows is not a reduction; it is a relabel. I treat the "−76%" as a candidate relabel and test it.

Stance: the data does not show what they say. I re-ran every headline query on this branch. Findings below, each as claim | counter-analysis | re-run data | survives? | severity.


R-1 — The flagship "−76% status bypass" is a RELABEL artifact, not decoupling

White-team claim. "status deep-imports collapsed 182→43 (−76%) — the single clearest 'needle moved' number in the dataset… status is the success story of this range." Cited as the empirical proof the ratchet mechanism works.

Counter-analysis. The ratchet (tests/architectural/test_status_module_boundary.py) is a purely syntactic rule: it forbids from specify_cli.status.<sub> import X and requires from specify_cli.status import X (line 324: "All imports from status/ must go through from specify_cli.status import X"). So a "violation removed" is, by construction, an import statement rewritten through the package __init__ — not a consumer decoupled from status. The correct test of whether bypass shrank is total status imports and distinct files coupled to status, not the deep-vs-facade syntactic split the ratchet itself defines.

Re-run data (this branch).

deep imports (from specify_cli.status.<sub>):   v3.1.10 = 182   v3.2.0 = 34   (I get 34, not 43)
facade imports (from specify_cli.status import): v3.1.10 =   5   v3.2.0 = 259
ALL status imports:                              v3.1.10 = 187   v3.2.0 = 299   (+60%)
distinct FILES importing status:                 v3.1.10 =  44   v3.2.0 =  78   (+77%)
status/ package SLOC:                            v3.1.10 = 4752  v3.2.0 = 9119  (+92%)

The deep imports did not vanish — they migrated 5→259 facade imports. Total imports rose 60%, the coupled- file count rose 77%, and the package itself nearly doubled. The only thing that fell is the metric the ratchet is wired to move. This is Goodhart's law caught in the act: the ratchet defines its own success criterion, and the "−76%" measures import-statement hygiene, not coupling.

Also note the reproducibility crack: the white team reports 43; I reproduce 34 with their own documented command. A flagship "clearest number in the dataset" that two read-only runs on the same branch disagree on by 26% is not load-bearing evidence.

Survives? The refutation HOLDS. Decoupling did not occur; fan-in coupling to status grew. The "success story" is a syntactic conformance metric. Severity: HIGH (this is the single load-bearing number of the white-team case).


R-2 — "Authorities cooled" is partly a scope/shim artifact; total system complexity is UP, not down

White-team claim. Authorities COOL or are born-cold (context maxCC 17→3; transitions −227 SLOC); the predecessor core/execution_context.py was "hollowed by −368 lines into a re-export shim."

Counter-analysis (two prongs).

  1. Factual correction on the shim. At v3.2.0 core/execution_context.py is 0 lines — deleted, not a "−368-line re-export shim." The behavior did not get reduced in place; a net-new authority (src/mission_runtime/, 1,336 lines: context 275 + resolution 816 + artifacts 153 + __init__ 92) was born beside it and the old module removed. "Re-extracted simpler (avgCC 5.0→1.5)" describes a 288-line file being replaced by a 1,336-line package. Per-block maxCC fell; mass tripled.
  2. The denominator. maxCC on a young 10-month repo (first commit 2025-08-21) whose v3.2.0 "range" is a rebased-replay squash (see R-4) is exactly the noisy proxy the tactic warns about (no-complexity-capture + squash-merge distortion). And the system-level number the white team never reports:

Re-run data.

core/execution_context.py:    v3.1.10 = 288 lines   v3.2.0 = 0 (deleted, not shim)
new src/mission_runtime/:      1,336 lines net-new authority
TOTAL src/ Python SLOC:        v3.1.10 = 123,507     v3.2.0 = 265,826   (+115%)
  src/specify_cli:             111,628  ->  213,115  (+91%)
  src/runtime (new):                 0  ->   10,348
  src/doctrine:                  6,569  ->   11,114
total .py files in src/:           556  ->    1,000  (+80%)

The system more than doubled in size. "Authorities cooled" describes per-block CC in a handful of cherry- picked modules while the codebase as a whole grew 115% in SLOC and 80% in file count. Cooling a few blocks while the total mass doubles is not a reduction story; at best it is localised cooling inside an expanding system.

Survives? The refutation mostly HOLDS with a concession: the per-block cooling of transitions (−227 SLOC, maxCC 14→6) and context (maxCC →3) is real and reproduced. But "authorities cooled" as a system claim is refuted: the shim is actually a deletion-plus-rebirth that quadrupled the line count for that domain, and total system complexity rose. Severity: MEDIUM-HIGH (the per-module wins are real but the system framing is misleading; the "shim" description is factually wrong).


R-3 — The "ratchet law" is CONFOUNDED: a focused MISSION did the work; the ratchet is an after-installed tripwire

White-team claim. "The law is not a metaphor — it is the literal mechanism separating the shrinking surfaces from the growing ones." I.e. the ratchet causes the shrink.

Counter-analysis. Causation requires that the ratchet drove the reduction over time. The evidence shows the opposite ordering: the status import cleanup AND the ratchet test arrived together, in one squash-merged mission, and the ratchet post-dates the cleanup it supposedly caused.

Re-run data.

test_status_module_boundary.py was ADDED in commit 0b6e2d7d9:
  "feat(kitty/mission-execution-state-domain-remediation-01KT6HVH): squash merge of mission"
That same squash commit deleted 18 deep status imports in one shot.
The test's own comments: "Residual allow-list (post-WP10)", "ROUTE-deferred-to-WP10 allow-list (shrinking ledger)"
  -> the allow-list is the RESIDUE of a mission's manual routing, not a force that drove it.

The arrow is mission → cleanup → install tripwire to prevent regression, not ratchet → cleanup. A shrinking-allowlist prevents backsliding; it does not perform the reduction. The white team's own narrative ("WP10 shrinks the allow-list by adding each deferred symbol") describes humans doing WP-by-WP routing, with the test recording the residue. Attributing the 182→34 move to "the ratchet mechanism" is a just-so story: the ratchet is the fossil of the work, mislabelled as its cause.

The "law" also fails its own symmetry test once the denominator is restored (R-4): the un-ratcheted idioms grew, yes — but so did everything, because the codebase grew 80–115%. The law predicts un-ratcheted = grow and ratcheted = shrink as a mechanism; what actually happened is one mission cut one surface and fenced it, while general growth lifted every uncut idiom. Correlation with "had a ratchet" is confounded with "was the explicit target of a remediation mission."

Survives? The refutation HOLDS. The ratchet is causally downstream of the mission, not upstream of the reduction. Severity: HIGH (this directly refutes the white team's central causal claim — "the ratchet is the mechanism").


R-4 — The "un-ratcheted grew" half is inflated by an uncontrolled denominator (squash + 80% codebase growth)

White-team claim. mission_id[:8] +550%, [:8] +207%, parents[2] +150% — un-ratcheted idioms "GREW", proving the law's second half.

Counter-analysis. These are raw absolute counts across two snapshots where the codebase grew 1.80× by file count / 2.15× by SLOC. An idiom that merely kept pace with codebase growth would show a large raw "+%" while its density is flat or falling. The white team never normalised. The honest test is growth-multiple vs the 1.80× codebase multiple.

Re-run data + normalisation.

idiom            v3.1.10  v3.2.0  multiple   vs 1.80x codebase
mission_id[:8]      4       26     6.5x       real growth (>1.8x)   -> survives
[:8] (any)         15       46     3.07x      real growth           -> survives
parents[2]          4       10     2.5x       real growth, small-N  -> weak survives
kitty/mission      42       63     1.50x      BELOW 1.8x -> DENSITY FELL  -> refutes "grew"

Two of the "growth" idioms survive normalisation as genuine density growth — I concede mission_id[:8] and [:8] did grow in real terms. But kitty/mission (which the white team listed as "grew 42→63") grew slower than the codebase, so its density fell — it did not "grow" in the sense the law needs; it was diluted. And parents[2] is a 4→10 small-N count where ±2 swings the verdict. The white team's "+550%/+207%/+150%" framing on raw counts in an 80%-larger codebase is the textbook denominator failure the tactic's failure-modes warn about, applied selectively to make the un-ratcheted side look worse than density supports.

Survives? The refutation PARTIALLY HOLDS. mission_id[:8] and [:8] genuinely grew (concede the law's second half there); but kitty/mission density fell, and the absolute-% framing systematically overstates the effect. Severity: MEDIUM (the headline identity gap is real; the framing and one of the cited idioms are not).


R-5 — Extending the ratchet DISPLACES, not eliminates — and threading is MORE code, not less

White-team claim. The cheapest 3.2.x action is a mission_id[:8] literal-ban ratchet pointed at resolve_mid8, which "would flip identity from the worst gap to a closing one exactly as the status boundary ratchet did." Implicitly: ratchets reduce.

Counter-analysis (the reduction-skeptic core). The status ratchet did not eliminate the need — it relabelled 254 imports onto the facade (R-1). A mission_id[:8] literal-ban would do the same: it cannot delete the need to derive an 8-char handle; it can only force the 26 callsites to call resolve_mid8(...) instead of slicing inline. That is a displacement of the idiom from one syntax to another, plus the threading machinery to deliver mission_id to each callsite. Net LOC for the context/identity domain went UP, not down, when the SSOT was extracted:

Re-run data.

predecessor execution_context.py:  288 lines  -> deleted
new mission_runtime/ authority:   1,336 lines  (net-new)
resolve_action_context callsites:    3 -> 27   (+24 new param-passing / context-threading sites)
resolve_mid8 callsites:               0 -> 26   (the SSOT now sits at PARITY with the 26 raw mission_id[:8])

The white team itself reports resolve_mid8 (26) at numerical parity with mission_id[:8] (26): two implementations of the same concept now coexist — which by the semantic-compression definition is worse than one, not progress. Threading the canonical context cost a 288→1,336-line authority swap plus 24 new call-threading sites. Every signature that now takes a context param, every freeze/resolution step, is added code. The "reduction" is illusory at the LOC level for this domain.

Survives? The refutation HOLDS. Ratchets move the bulge under the rug (deep→facade; inline-slice→helper- call); they do not eliminate the work, and the SSOT extraction added net LOC while leaving the hand-roll at parity. Severity: HIGH (directly refutes "ratchets reduce" and "extract-then-route is mid-flight progress" — it is, on this evidence, extract-then-coexist).


R-6 — The strangler is in its FAILURE mode, not mid-flight success (the synthesis)

White-team claim. "extract authority → route consumers, even if adoption lags" is the signature of a healthy mid-flight strangler-fig, not drift.

Counter-analysis. The white team's own data shows every domain landed in the same place: execution-context adoption ~5%, identity at parity (26 SSOT vs 26 hand-roll), status "closed" only by the syntactic relabel (R-1) while its fan-in doubled. The defining failure mode of a strangler-fig is precisely two implementations coexisting indefinitely — the new authority built, the old path un-retired. There is no domain in the dataset that reached full adoption with the legacy path retired. "status closed at −76%" is refuted (R-1: it relabelled and coupling grew). "transitions −227 SLOC" is the only clean retirement, and it is one file. A strangler with N half-built SSOTs and zero fully-strangled domains, accumulating authority faster than it retires consumers (system +115% SLOC), is exhibiting the risk the pattern warns about, not its healthy signature. The honest verdict is "extract-then-coexist," which is the failure precondition — it becomes success only if the legacy paths are subsequently retired, for which there is no evidence in this range.

Survives? The refutation HOLDS as the antithesis, with the concession that "incomplete-but-directional" is a defensible alternative reading — the dispute is whether building-without-retiring is "mid-flight" or "stalled." On this 12-day squashed range, no retirement is visible, so the burden of proof for "progressing" is unmet. Severity: MEDIUM-HIGH (interpretive, but the white team has not discharged its burden).


Concessions (where the data genuinely holds)

  • transitions.py −227 SLOC, maxCC 14→6 is a real, clean, in-place reduction with the matrix extracted. The one unambiguous compression win. Conceded.
  • mission_id[:8] and [:8] grew in real density (6.5× and 3.07× vs 1.80× codebase). The identity-gap half of the law survives normalisation. Conceded.
  • The status ratchet is a genuine shrinking-allowlist mechanism (not a static check) and will prevent syntactic backsliding. Conceded — it just doesn't measure coupling, and didn't cause the cut.
  • My own 34-vs-43 discrepancy is itself only a directional finding; both numbers show a large drop in deep imports. The drop in the deep-import syntax is real; its meaning (R-1) is what I refute.

Reproduction (every command, on this branch, read-only)

git merge-base --is-ancestor v3.1.10 v3.2.0; echo $?     # 1 = divergent (rebased-replay)
git log -1 --format='%ci' v3.1.10 v3.2.0                  # 2026-06-04 .. 2026-06-16 (12 days, 2160 nonmerge)
# R-1 relabel:
git grep -h -E 'from specify_cli\.status\.[a-z]' v3.1.10 -- 'src/**/*.py' | wc -l   # 182
git grep -h -E 'from specify_cli\.status\.[a-z]' v3.2.0  -- 'src/**/*.py' | wc -l   # 34
git grep -h -E 'from specify_cli\.status import'  v3.2.0 -- 'src/**/*.py' | wc -l   # 259
git grep -l -E 'specify_cli\.status' v3.2.0 -- 'src/**/*.py' | sort -u | wc -l      # 78 files (was 44)
# R-2 system size:
for f in $(git ls-tree -r --name-only v3.2.0 -- src/ | grep '\.py$'); do git show v3.2.0:$f; done | wc -l  # 265826
git show v3.2.0:src/specify_cli/core/execution_context.py | wc -l                  # 0 (deleted, not shim)
# R-3 causation:
git log --oneline --all --diff-filter=A -- '**/test_status_module_boundary.py'      # 0b6e2d7d9 (01KT6HVH squash)
# R-4 normalisation: divide each idiom multiple by codebase 1.80x file multiple (556->1000)
# R-5 threading:
git grep -h 'resolve_action_context' v3.2.0 -- 'src/**/*.py' | wc -l               # 27 (was 3)
for f in src/mission_runtime/*.py; do git show v3.2.0:$f | wc -l; done             # 1336 total net-new

Tactic failure-modes that bit the white-team analysis: (a) squash-merge distortion — flagged but not neutralised; the 12-day/2160-commit rebased-replay range makes all churn directional and inflates per-file +/−. (b) no-complexity-capture — maxCC cooling reported without the system-level SLOC growth that contradicts it. (c) denominator failure (unlisted, fatal) — raw idiom counts compared across an 80%-larger codebase without normalisation, inflating the "un-ratcheted grew" side and mislabelling diluted idioms as growing.