Mission Specification: CI Topology Shrink & Guard Un-Blinding
Mission: ci-topology-shrink-01KWQAVX (mission_id 01KWQAVXZGH1G36HB9QKWMYXGD) Created: 2026-07-04 Status: Draft (post-spec squad revision 1) Input: User description: "spec a remediation mission for Slice C — CI-topology shrink (#2378 shard-side + #1933 group-side, sequenced as siblings). it is hurting us, hard." Closes: #2378 (shard-side split) · #1933 (group-side shrink) · #2383 (arch un-blind, P1) — all under epic #1931. Out of scope: #2283 factors (b)/(c) · #2077 CT7 · #2071 audit · the marker→job model itself (#2034/#2368 substrate — consumed, not rebuilt).
Context & Problem (why "hurting us, hard")
.github/workflows/ci-quality.yml routes CI via three coupled authorities (dorny filter block, the core_misc catch-all group, and the FR-010 fail-closed unmatched→run_all alarm), all modeled by tests/architectural/_gate_coverage.py and pinned by the 8 invariants the #2368 CI suite-map mission bound. Live ground truth (pre-spec 3-lens squad + post-spec 3-lens squad, 2026-07-04):
- Wallclock:
fast-tests-core-miscis a single unsharded 16.9-min job;integration-tests-core-misc(holding thearchitectural/adversarialshard, ~12.2 min) declaresneeds: [changes, fast-tests-core-misc]— so the arch shard is serialized after the fast job → a ~29-min critical path = ~85% of the 34-min pipeline, 33% of all CI compute. - Mode A — ~37 UNMAPPED src dirs (68.5k LOC):
auth,migration,tracker,dossier,retrospective,compat,orchestrator_api,bulk_edit,tasks,doctor, etc. are in no named filter group → every touch tripsunmatched→run_all(the entire pipeline reruns). - Mode B — 13 MAPPED-but-ARCH-BLIND dirs (120k LOC, incl.
cli58k,sync17.7k,upgrade15.9k,merge6.3k):tests/architectural+tests/adversarialrun only via the{core_misc, execution_context, acceptance}groups, so an isolated change tocli/,sync/,merge/,upgrade/merges green with the dead-module / stale-symbol / terminology / status-boundary gates never firing. This is the root cause of the orphan-coverage bugs whack-a-moled all session (#2370m_3_2_4, acceptance/state, #2379 migration). - Failure isolation is near-zero; the suite is 0-orphan today only because the whole-tree catch-all sweeps mask everything.
Load-bearing design fact (all 3 post-spec lenses): un-blinding (pillar 2) and the wallclock cut (pillar 3) collide on the arch pole — un-blinding runs the 12.2-min arch/adversarial suite on 100% of PRs, and its needs: fast-tests-core-misc serialization means sharding only the fast job cannot reach the wallclock ceiling. Both are solved by the same move: de-serialize + shard/bound the arch suite (FR-013), realized as an always-on arch job that adds no filter group (so it cannot perturb the parsed-relation invariants).
User Scenarios & Testing (mandatory)
User Story 1 - Focused CI for a single-area change (Priority: P1)
A spec-kitty developer changes one source area (e.g. src/specify_cli/auth/) and CI runs only** that area's focused shard(s) plus the always-on gates — instead of re-running the entire pipeline because the dir was unmapped and tripped the fail-closed catch-all.
Why this priority: The acute everyday pain (Mode A) and the foundational half — the named groups it delivers are the substrate the split (US3) and un-blind (US2) build on.
Independent Test: For each dir in the construction-derived worklist (FR-001), a path-filter fixture test asserts a PR touching only that dir routes to its named group + focused shard and does not set unmatched=true/run_all.
Acceptance Scenarios:
1. Given auth/ is in a named filter group wired to a focused shard, When a PR touches only src/specify_cli/auth/, Then the changes job routes to that group, only the auth shard(s) + always-on gates run, and run_all is not triggered. 2. Given a worklist dir with tests split across two roots (e.g. migration → tests/migration + tests/specify_cli/migration), When its shard/composite is carved, Then both roots are consolidated into one home and the @slow perf test remains excluded (runs only in slow-tests), with no same-tier double-run (FR-012, NFR-003). 3. Given a src path still unmapped (a future dir), When it is touched, Then the fail-closed catch-all still forces coverage (fail-safe preserved, FR-009).
User Story 2 - Architectural guards never silently skip (Priority: P1)
A change confined to a previously arch-blind package (cli/, sync/, upgrade/, merge/, …) can no longer merge green with the architectural + adversarial suite skipped. The dead-module, stale-symbol, terminology, and status-boundary gates run on every source change.
Why this priority: The correctness root cause of the session's recurring orphan-coverage bugs (Mode B, 120k LOC blind) — the single highest-correctness-leverage change. A wallclock win that still lets real defects merge silently would be a hollow victory.
Independent Test: A differential-matrix architectural test asserts arch/adversarial selection over 100% of src/specify_cli/* dirs (0 blind). For each formerly-blind dir, a fixture asserts a touch selects the arch+adversarial suite.
Acceptance Scenarios:
1. Given cli/ was arch-blind, When a PR touches only src/specify_cli/cli/, Then the arch+adversarial suite is selected and the dead-module / stale-symbol / terminology / status-boundary gates execute. 2. Given a PR introducing a new dead module or a legacy-terminology regression in a formerly-blind dir, When CI runs, Then the relevant architectural gate fails the PR (not silently skipped).
User Story 3 - Core-misc is no longer the pipeline long pole (Priority: P2)
The 16.9-min unsharded fast-tests-core-misc monolith is subdivided into a matrix of focused shards and the serialized 12.2-min arch pole is de-coupled/sharded, so the core-misc critical path is no longer the pipeline's long pole and a red result names the specific slice that broke.
Why this priority: The wallclock + failure-isolation win. P2 because it rides on the named-group substrate from US1 and the arch-pole treatment shared with US2.
Independent Test: (a) a structural invariant asserts no single shard collects the full catch-all universe (parsed shard-command set); (b) an acceptance observation at plan/verify time measures the core-misc critical-path lane against a committed baseline (NFR-001). Assert a deliberately-failed shard names its slice.
Acceptance Scenarios:
1. Given the fast job is matrix-sharded and the arch suite is de-serialized (FR-013), When run_all executes, Then the core-misc critical-path lane is ≤ the numeric ceiling (NFR-001) and no single shard collects the whole catch-all universe. 2. Given a test fails in one carved slice, When CI reports, Then the failing job names that slice (not an opaque grab-bag).
Edge Cases
- Windows split-brain: a carved test file also in
ci-windows.yml's staticwindows_criticallist (:24-42) — update that list in the same commit or FR-003c reds. (This is the only real second-layer surface — see FR-008.) - Real-port / daemon serial:
tests/sync/test_orphan_sweep.pybinds ports 9400-9449 and runs-n0serially — any slice containing daemon/real-port tests must preserve the serial pass (FR-011, NFR pins it). dossierlatent gap:tests/dossieriscore_misc-globbed but in no integration matrix shard — carving it fixes a real hole.- Nested test roots:
tests/specify_cli/<D>slices (orchestrator_api,bulk_edit) need the integration matrixignore_argsupdated by hand (not covered by FR-012's whole-tree check). src/runtime: already grouped (nextgroup;integration-tests-next~13.5m) — WP01's routing table must confirm mapped/excluded, not re-promote.doctrineambiguity:src/doctrine/*(templates, already grouped) vssrc/specify_cli/doctrine(unmapped code dir); afast-tests-doctrinejob already exists — WP01 routing table disambiguates so a "promote doctrine" step doesn't collide.- Dir with no test cone: mapped to a group but no tests — must not orphan or produce a vacuous shard.
- Job-count explosion: promoting ~15 dirs to dedicated jobs balloons
quality-gate.needs— composite groups cap it (FR-010, measurable ceiling). - Coverage drop: a new
fast-tests-<D>forgotten from a coverage-consumer'sneedssilently drops its coverage XML — closed by construction (C-005).
Requirements (mandatory)
Functional Requirements
| ID | Title | User Story | Priority | Status |
|---|---|---|---|---|
| FR-001 | Promote every dir in the construction-derived worklist to a named src-backed filter group (dedicated OR composite per FR-010) so none falls to unmatched→run_all. The worklist is a committed, test-consumed census artifact (see NFR-006), NOT a hand list. | US1 | High | Open |
| FR-002 | Register each new group across ALL bound-model surfaces atomically (the 5-edit pattern): dorny filter glob, changes.outputs row, unmatched enumeration loop, ≥1 test-job if:, and the JOB_GROUPS heredoc table — no surface hand-added beside the model. | US1 | High | Open |
| FR-003 | Subdivide fast-tests-core-misc into a matrix of focused shards (mirroring integration-tests-core-misc); each shard owns coherent, non-overlapping test roots. | US3 | High | Open |
| FR-004 | Update the fast-tests-core-misc --ignore mirror list in lockstep with every shard carve (FR-012 invariant), and the integration-matrix ignore_args for nested tests/specify_cli/<D> slices. | US3 | High | Open |
| FR-005 | Make tests/architectural + tests/adversarial run on 100% of src changes (un-blind the 13 Mode-B dirs). Preferred mechanism (Option A): an always-on arch job that adds no filter group (enters only quality-gate.needs + coverage-consumer needs), leaving src_backed_groups/compute_unmatched untouched. | US2 | High | Open |
| FR-006 | Each new shard emits its own coverage XML under a glob-consumed naming convention (coverage-<D>.xml matched by the aggregator's wildcard download) so emit ⇒ consume by construction. | US3 | Medium | Open |
| FR-007 | Register every new test job in ALL consuming needs-lists: quality-gate, sonarcloud, diff-coverage, mutation-testing (and slow-tests only for jobs it actually depends on — it lists fast jobs only). | US3 | High | Open |
| FR-008 | Propagate new/relocated windows-marked test files to ci-windows.yml's static windows_critical list (:24-42) — the sole real second-layer surface. (quality_gate_decision.py holds no job→group data; drift-detector.yml/release.yml carry no shard/ignore names — do NOT edit them.) | US1 | High | Open |
| FR-009 | Preserve fail-safe selection: carve-outs only NARROW; an unmapped or newly-added src path still triggers coverage; nightly run_all still over-covers (no new blind spot); ci:full/ready-for-ci/workflow_dispatch escape hatches remain intact. | US1 | High | Open |
| FR-010 | Use composite groups (e.g. agent_surface = orchestrator_api+tracker+dossier; lifecycle = migration+doctor+bulk_edit) to cap job-count under the measurable ceiling in NFR-005. | US1 | Medium | Open |
| FR-011 | Preserve real-port serial -n0 passes, --dist loadfile (never bare load), and per-worker HOME isolation on every new shard. | US3 | High | Open |
| FR-012 | The migration home (dedicated shard or composite) consolidates tests/migration + tests/specify_cli/migration and preserves the and not slow exclusion (the @slow perf test runs only in slow-tests) — phrased shard-agnostically. | US1 | High | Open |
| FR-013 | De-serialize + bound the arch pole. The arch/adversarial suite must run in parallel with the fast lane (drop the integration-tests-core-misc needs: fast-tests-core-misc serialization for the arch shard) and be sharded/bounded so its tail lands ≤ the NFR-001 ceiling on every PR — the same object as FR-005's always-on un-blind. | US2, US3 | High | Open |
Non-Functional Requirements
| ID | Title | Requirement | Category | Priority | Status |
|---|---|---|---|---|---|
| NFR-001 | Core-misc critical path (absolute) | Record the ~29-min baseline (16.9m fast + 12.2m arch, serialized) in a committed timings artifact at plan time; the post-mission core-misc critical-path lane must be ≤ 55% of that baseline AND ≤ the next-longest independent lane. This is a plan/verify acceptance observation, not a standing unit gate. | Performance | High | Open |
| NFR-002 | Architectural coverage completeness | Architectural + adversarial guards execute on 100% of src/specify_cli/* dirs (13 blind → 0), asserted by a new differential-matrix relation in the bound model (filter_groups + job_gating_groups + gates → arch-selected per dir) — scope the _gate_coverage extension explicitly. | Correctness | High | Open |
| NFR-003 | Same-tier selection uniqueness | No test is selected by >1 fast shard, nor by >1 integration shard — a mechanized invariant distinct from the existing report-only cross-tier duplicate warning. Total run_all selected-test count unchanged; _gate_coverage orphan count stays 0 (baseline 28,573). | Reliability | High | Open |
| NFR-004 | Feedback isolation | A single-area PR runs only its focused shard(s) + always-on gates (incl. the bounded arch job) and no full-matrix run; a failing shard names its slice. | Usability | Medium | Open |
| NFR-005 | Job-count ceiling | Post-mission len(quality-gate.needs) ≤ a pinned ceiling (set at plan time from the composite design), asserted by a test — composites must keep the graph under it. | Maintainability | Medium | Open |
| NFR-006 | Worklist is construction-derived | The FR-001 worklist is computed, not hand-picked: a committed census (e.g. every src/specify_cli/* dir with ≥ N LOC lacking a named group) that the SC-001 test iterates — so the metric measures coverage, not the implementer's constant. | Correctness | High | Open |
| NFR-007 | Invariant integrity | All 8 #2368 WP04 architectural invariants remain green through every WP; the new invariants (NFR-002/003/005, C-005) are additive to the bound model. | Reliability | High | Open |
Constraints
| ID | Title | Constraint | Category | Priority | Status |
|---|---|---|---|---|---|
| C-001 | Consume, don't rebuild the substrate | Do not rebuild the marker→job model or the fail-closed catch-all (#2034/#2368) — extend the bound _gate_coverage model additively. | Technical | High | Open |
| C-002 | No split-brain derived surfaces | Every derived surface (catch-all OR-list, JOB_GROUPS heredoc, ci-windows static list) is asserted-against-parsed-source or the build rejects (Decision 8). | Technical | High | Open |
| C-003 | Shared-hot-file merge strategy | ci-quality.yml is edited by a single owning WP (workflow edits are NOT split per-slice), because per-slice WPs editing one file cannot be owned-file-disjoint and a lanes allocator rejects that. WPs partition by FILE (workflow / bound-model / tests / ci-windows), not by slice. If per-slice workflow edits become necessary, flatten to single_branch with linearized shared-surface edits (docs-consolidation pattern). Topology is a plan-time decision. | Technical | High | Open |
| C-004 | Scope fence | Keep #2283 factors (b)/(c), #2077 CT7, #2071 audit, new markers, gate rewrites, and WIP-title semantics OUT of scope. | Business | High | Open |
| C-005 | Coverage-consumer integrity by construction | New invariant: coverage-emitting jobs ⊆ sonarcloud.needs and critical-path emitters ⊆ diff-coverage.needs (and mutation-testing.needs where it consumes them). Do NOT intersect slow-tests.needs — it lists only fast jobs and would red on arrival. | Technical | High | Open |
| C-006 | Nightly-move deferred | #1933's literal nightly-scheduled-full-suite is OUT (shrink interpretation only); offer a thin nightly-schedule option in-mission ONLY if the committed plan-time timings show the post-shrink PR critical path still > ~15 min. Closeout must state the shrink satisfies #1933's intent (fast, targeted PR CI) with escape hatches + nightly run_all over-cover intact. | Technical | Medium | Open |
Key Entities
- Worklist census artifact: a committed, test-consumed enumeration of the src dirs the mission must map (construction-derived, NFR-006) — the authority SC-001 iterates.
- Filter group (dorny): named path→group mapping; src-backed groups join the
unmatchedenumeration. - Shard job (
fast-tests-<D>/ matrix entry / the always-on arch job): a CI job running a coherent slice under a fixed marker expression; emitscoverage-<D>.xml. - Bound model (
_gate_coverage.py+_gate_coverage_baseline.json): the parsed authority; the 8 existing + new invariants live here. - Needs-list: a job's
needs:set; coverage integrity (C-005) binds the coverage-consumer subset by construction.
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: 0 dirs in the construction-derived worklist (NFR-006) fall to
unmatched→run_all— a mechanized test iterates the committed census and asserts each maps to a named group AND a focused shard (notrun_all). - SC-002: Architectural + adversarial guards run on 100% of
src/specify_cli/*dirs (13 blind → 0), proven by the differential-matrix relation (NFR-002) that fails if any dir is arch-blind. - SC-003: No single shard collects the full catch-all universe (structural invariant over the parsed shard-command set) AND the core-misc critical-path lane meets NFR-001's absolute ceiling (acceptance observation vs committed baseline).
- SC-004:
_gate_coverageorphan count remains 0, totalrun_allselected-test count is unchanged, and the same-tier uniqueness invariant (NFR-003) is green — no drop, no double-run. - SC-005: All 8 #2368 WP04 invariants remain green AND the new coverage-consumer invariant (C-005), the job-count ceiling (NFR-005), and the serial-port preservation invariant (FR-011) are green.
- SC-006: For each carve-out dir, a path-filter fixture demonstrates a single-area PR triggers only its focused shard(s) + always-on gates (incl. the bounded arch job) — not a full-matrix run.