Test-Suite Friction Audit — "Tests as scaffold, not friction"
Date: 2026-06-22 Epic: #2071 Method: four profile-loaded review lenses (debugger-debbie, reviewer-renata, randy-reducer) + a paula-patterns second-opinion on the randy findings most at risk of being symptomatic. Suite scale: ~1,800 test files, ~500K LOC, 36 conftests.
Verdict on the theory (honest)
The operator's theory — the test suite has degraded from scaffold into friction — is confirmed in its core but materially narrower than the alarming headline. All three investigation lenses independently tempered it:
- Not pervasive tautological-green. The egregious accidental-pass forms came back clean:
assert result == mock.return_value→ 0 occurrences; a test mocking the exact SUT it names → 0 confirmed. The high mock density (~4,500@patch) is overwhelmingly CLI-collaborator style, not false-green. - The "same dir asserted twice" memory is mostly a false alarm. The
read_dir == write_dirand triple-equality assertions the operator remembered are, on inspection, correct convergence/collapse invariants (test_execution_context_parity.py,test_aggregate_surface_resolution.py,test_cli_status_mediation.py) — desired contracts, not codified bugs. - The suite has gold-standard regions.
test_execution_context_parity.py(real subprocess + real git worktrees + on-disk event logs, with explicit anti-vacuity injection proofs),tests/git/test_protection_preserved.py(converged ATDD ratchet), auth single-flight, feature creation, orchestrator integration — these are models to emulate, not friction.
Where the theory IS true, it is dense and localized. The center of gravity is three clusters, not a pervasive rot:
file:line-keyed architectural ratchets (systemic, cross-confirmed by renata + randy) — the real "false-red on a correct change → revert the good change" engine.- Security xfail-masking (renata, localized but pure) — vulnerabilities codified as acceptable; fixing the bug turns the test red.
- Duplicate-knowledge meta scaffold (randy, structural) — N hand-rolled schema sources that drift on every meta.json change.
Plus a contained band of mock-wiring "assert HOW not WHAT" tests and one weak directory
(sync/tracker/).
Consolidated findings → ticket map
Deduped across lenses; severity and the structural fix (with paula's corrections applied to the randy items, so fixes drain debt rather than re-home it).
CT1 — Re-key file:line architectural ratchets to stable anchors + drain deferred-defect entries — HIGH
Sources: renata F3, randy T3, paula T3 correction (cross-confirmed, the systemic finding).
- Evidence:
tests/architectural/test_single_mission_surface_resolver.py_ALLOWLISTED_RAW_JOINSkeys entries by rawsurface_resolver.py:472,_read_path_resolver.py:885,mission_creation.py:328,cycle.py:185; in-file comments confess repeated hand-re-keying (:511 → :641 → :737 → :744;:518→:472;:869→:885).test_no_write_side_rederivation.py:84has a raw("status_transition.py", 336)tuple and a private verbatim copy of_ratchet_keys.code_tokens_by_line. The drift-proof primitive_ratchet_keys.composite_key(qualname + normalized token-line) already exists but is adopted by only one test (test_no_worktree_name_guess.py). ~113 baseline/allowlist refs; 8 files carry "re-key/drifted/ stays green" notes. - Failure mode: a correct, behavior-neutral edit above a pinned line shifts the number and turns the architectural gate RED, forcing a manual re-key in the same PR — friction by construction; bit this cycle repeatedly.
- Structural fix (two tracked obligations — do NOT conflate, per paula):
- Re-key all surviving entries onto
_ratchet_keys.composite_key; delete the private_code_tokens_by_linecopy; convergetest_no_write_side_rederivation.pyonto the shared primitive. (Churn fix.) - Drain with named owners: classify each entry as PERMANENT-BY-DESIGN (the DIAG /
topology-blind-by-design seam joins — annotate as permanent so they aren't mistaken for debt)
vs DEFERRED-DEFECT (
status_transition.py:336→ the #1716 write-surface-selection ladder; the:295/C-007 token-copy deferral). Each deferred entry carries its tracker link + a non-vacuous drain condition — when the fix lands, the entry is removed, not re-keyed.
- Re-key all surviving entries onto
- ⚠ Mission link:
status_transition.py:336is the deferred #1716 write-surface ladder — the exact split-brain mission01KVPR00(FR-007) closes. See "Mission-impact flags" below.
CT2 — De-theater the security path-validation tests — HIGH
Source: renata F1 (the purest instance of the theory).
- Evidence:
tests/adversarial/test_path_validation.py— every malicious-path test has an escape hatchif is_valid: pytest.xfail("Traversal not blocked in current implementation"). 8 traversal/case/symlink/home/absolute/null-byte tests are titled "must be rejected" but green-pass when the path is accepted (a live security gap); fixingvalidate_deliverables_pathturns them XPASS — the suite penalizes the fix. ~5 sibling tests have no assertions at all (pure no-ops). This file holds 7 of the suite's 12 totalpytest.xfailcalls. - Structural fix: decide the real
validate_deliverables_pathcontract, then either (a) make the tests strict failures pinning the desired rejection (RED until fixed, ATDD-style — model:tests/git/test_protection_preserved.py), or (b) if the validator legitimately delegates containment elsewhere, delete these and test the real boundary. Remove the no-op tests.
CT3 — Meta/mission test factory delegating to the production create_mission_core() seam — L
Sources: randy T1 + T2, paula T1 correction.
- Evidence: 329 test files write
meta.jsondirectly; 53_write_meta+ 38_seed_mission/_setup_featurecopies;tests/_factories/__init__.pyis a 0-byte stub with zero importers; thefeature_repofixture (conftest.py:696) writes no meta.json at all (nomission_id/mid8). 385 hardcoded ULID literals, mostly handcrafted placeholders (01AAA…×41,01HXYZ…×36,01TEST…×26) — violating the realistic-test-data rule. - Structural fix (paula correction — NOT a parallel hand-roll):
tests/_factories.make_mission()must be a thin wrapper that delegates tocreate_mission_core()(mission_creation.py:201 — the documented programmatic API; the CLIcreate()is a thin wrapper over it), applying test-specific**overrideson the production-shaped meta. One schema authority; a new required field flows to every test automatically. Ifcreate_mission_corelacks a clean side-effect-free / no-coordination-branch / no-fan-out test entrypoint, that gap is the real finding — add the lever, don't fork the schema. Then migrate_write_meta×53; the 329 raw writers are the long-tail drain. Bundle production-shaped ULID fixtures + ban the placeholder patterns.
CT4 — Re-point mock-wiring "assert HOW not WHAT" tests to observable contracts — M
Sources: renata F4 (~10), debbie T1 (sync/tracker/ ~12), debbie T3 (status/merge wiring band).
- Evidence:
test_dashboard/test_api_handler.py::TestDossierEndpointRouting::*(mockDossierAPIHandler, assert internal forwarding, never the HTTP response);agent/glossary/ test_event_emission.py::*(assert event class instantiated/forwarded, not persisted);cli/commands/test_charter_lint.py::*(assert CLI→LintEngine.runkwargs, not rendered findings);sync/tracker/test_service.py::*(patch the backend method they claim to verify);test_origin.py::test_saas_first_ordering_set_origin_ticket_not_called(assert_not_called on a patched name — an inline/renamed write still passes). Redundant status/merge wiring twins (status/test_agent_status_emit_aggregate_wiring.py::test_emit_does_not_call_transactional_emit_directly,::test_command_module_has_no_direct_transactional_reference— reads module source as text). - Structural fix: re-point each to the observable contract (HTTP status+body; persisted
event/JSONL; rendered lint output; config persisted to disk). For the status/merge band, keep the
one real-outcome test per seam (
test_done_events_committed_to_git, the differential-equivalence- JSON-contract tests) and demote/delete the wiring twins.
- ⚠ Mission link: the status/merge emit-wiring tests wrap the emit/safe_commit seams FR-007/ FR-009 rewrite — they may false-red on a behavior-preserving refactor. See below.
CT5 — Stale golden-count assertions + fakeable-DoD / dead-assertion tail — S
Sources: renata F2 + F5, debbie T2.
- Evidence:
status/test_models.py::test_lane_enum_has_nine_valuesassertslen(Lane) == 10(name/docstring drift afterLane.GENESIS); 182-wide tail ofassert len(...) == Ngolden counts where an adjacent set-equality already covers the contract. Vacuous/dead:sync/tracker/test_local_service.py:420("import" not in source or …is always-true);::test_map_add_and_list_roundtrip(dead patch, overclaiming name);cross_cutting/test_gitignore_manager_unit.py::test_result_object_structure(6hasattr, no values);doctrine/test_structure_templates.py::test_structure_templates_exist(.exists()only);release/test_release_payload_draft.py+doctrine/test_relationship_fields_rejected.py(hasattr/absence only). - Structural fix: rename the lane test; drop redundant
len()==Nwhere set-equality exists (keep counts only where cardinality is the contract); add a content/behavior assertion next to each existence/hasattr check; delete the 2 dead assertions.
CT6 — (Adjacent, deprecation-hygiene) Re-point 77 specify_cli.next shim importers — M
Source: randy T4. Flagged as ADJACENT to test-friction (it is deprecation-migration), include only if the epic owner wants it here; otherwise it belongs to the shared-package-boundary cutover.
- Evidence:
src/specify_cli/next/is a deprecation shim (__removal_release__ = "3.3.0"); 77 test files still import through it (incl. the two largest bridge test files, ~1,727 + ~1,483 LOC). - Fix: mechanical sweep re-pointing importers to
runtime.nextahead of the 3.3.0 cut.
CT7 — Test-hygiene directive/styleguide + guard (recurrence prevention) — M
The epic's "don't recur" obligation, beyond remediation.
- Codify the anti-patterns as doctrine: no xfail-masking a defect (use ATDD strict-RED); no
file:lineratchets (anchor on symbol/AST/fingerprint via_ratchet_keys); test fixtures delegate to production seams (single schema authority); assert observable contracts, not wiring; production-shaped test data. Pair with a guard/ratchet where mechanizable (e.g. banpytest.xfailwith a "not blocked/implemented" reason; ban new rawfile.py:NNNratchet keys). Prior art: the post-merge AST stale-assertion analyzer (mission 068,src/specify_cli/post_merge/).
Where the theory does NOT hold (counterweight — do NOT "fix" these)
- Convergence-invariant assertions (
read_dir == write_dirfor flattened topology, triple-equality collapse) — correct contracts. test_execution_context_parity.py— gold-standard ATDD ratchet with anti-vacuity injection proofs.tests/git/test_protection_preserved.py— the model ATDD ratchet (markers removed on convergence).assert_called*intests/sync/(non-tracker),tests/auth/— legitimate boundary verification (subprocess, httpx, SaaS sink) paired with observable outcomes.- ~85 arg-pinned delegation tests (debbie T5) — brittle-but-defensible style; the regression class is "wrong args forwarded." Leave unless a refactor trips them.
_baselines.yamlcount-keyed ratchets — semantically meaningful burn-down accounting; churn is inherent, not fixable friction.
Mission-impact decisions (for mission single-planning-surface-authority-01KVPR00)
A paula-patterns + architect-alphonso adjudication (2026-06-22), grounded in the guard source, settled how (if at all) each flag is pulled into the mission. Verdicts:
- CT1 re-key (obligation A) → PULL a thin test-only WP00 (front of the seam chain). The
file:lineallowlists intest_no_write_side_rederivation.pyandtest_single_mission_surface_resolver.pygate files this mission edits (mission_creation.py:328via IC-02,_read_path_resolver.py:885via IC-04, plus the write-sidestatus_transition.py:336). The drift-proof primitive_ratchet_keys.composite_key((qualname, token_line), content-addressed) already exists and the guards' discovery already yieldssource+lineno, so converting the allowlists onto it is mechanical — no new infra. Because composite keys survive line drift, the re-key is front-loadable (a plain line re-key is not — it would re-key to a line the seam edits then move again). WP00 converts both guards' allowlists ontocomposite_keyand deletes the duplicated private_code_tokens_by_linecopy, before the seam WPs edit those files — killing the line-drift false-red class for the whole IC chain (and future missions). Leavesurface_resolver.py:472/:477andcycle.py:185content unchanged (they're untouched seam joins; their classification stays #2072 obligation B). - CT1 drain of
status_transition.py:336→ live-evidence-gated subtask of the FR-007/FR-009 write-authority WP (NOT the WP00 re-key, NOT a certain drain). The earlier flag here was wrong::336is not "the ladder FR-007 closes." Per the guard's own docstring it is the_resolve_write_targetfallback arm reached only whenresolve_placement_onlycannot resolve the mission (pre-meta create window / ad-hoc fixture) —_resolve_write_targetalready routes the happy path throughresolve_placement_only. FR-007 closes a different surface (safe-commit._resolve_commit_target, #2063). alphonso argues FR-002/FR-003 (minttopologyintometa.jsonat create) eliminate the pre-meta window, making the_current_brancharm unreachable for real missions → drainable; paula warns against deleting a load-bearing fallback. Resolution (live-evidence rule): the write-authority WP instruments the fallback and proves, on a real create→first-write repro, whether the_current_brancharm is reachable. Proven dead → drain (delete the line + the allowlist entry + fliptest_allow_listed_line_is_the_deferred_head_selectorto assert absence). Still reachable → leave it + re-key only. Do NOT drain speculatively (regression risk) and do NOT re-pin a dead line (immortalized exemption). - CT4 (#2075) → DON'T-PULL. The source-as-text test
(
test_command_module_has_no_direct_transactional_reference) guardsagent/status.py, which the mission does not edit (it editsstatus/emit.py); the emit-parity tests are gold-standard true-red-only safety nets (and already threadtopology=, de-risking FR-001/FR-002); the merge tests sit behindmerge.py, a boundary the mission doesn't cross. One reactive watch-item folds into the FR-009 WP: preservesafe_commit's signature; re-point only the planning assertion to the observable contract if the seam adoption changes its call shape (NFR-002 generic-path tests untouched).
Net: pull a thin test-only WP00 (composite-key re-key of the two guards); fold the :336
live-gated drain into the FR-007/FR-009 WP; CT4 is reactive-only. This is the minimum front-load
that breaks the gate-vs-fix deadlock without regressing behavior or adding needless churn.