Test Suite Acceleration

Mission ID: 01KV3H590RHSQHF8XV843X5YHA
Mission slug: test-suite-acceleration-01KV3H59
Mission type: software-dev
Status: Draft

Overview

The Spec Kitty pytest suite (~1,457 test files) is the slowest gate on every change. Two structural problems cap its speed:

1. No parallelism where it would help most. The per-directory CI "fast" shards run single-process, and developers cannot safely run pytest across multiple processes locally because some tests read and truncate a real home-directory–backed queue database (~/.spec-kitty/queue.db). Running parallel workers today would corrupt that shared state.
2. Redundant and over-scaled work. A handful of tests explode into hundreds of collected items, rebuild expensive read-only state per test, repeat real git init, or run the same slow test in more than one CI job.

This mission executes a verified, adversarially-checked acceleration plan (architecture/test-suite-acceleration-plan.md) to make the suite run much faster in both CI and local development while guaranteeing no real assertion path or regression guard is weakened. The plan was produced by a 43-agent audit in which every recommendation passed an independent coverage-safety verification pass.

The keystone is a per-worker home/state isolation capability: once each parallel worker has its own home and config directories, both safe local multi-process runs and the CI fast-shard parallelization become unblocked.
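The isolation idea can be sketched as two small helpers: one derives a home directory unique to a worker, the other builds the environment overrides that point home-backed state at it. This is a minimal sketch, not the mission's implementation: the function names and the `home-<worker>` layout are hypothetical, and it assumes the code under test resolves its state from `HOME`, `USERPROFILE`, or `LOCALAPPDATA`. `PYTEST_XDIST_WORKER` is the environment variable pytest-xdist sets inside each worker process.

```python
import os
from pathlib import Path


def current_worker_id() -> str:
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0") in each worker;
    # it is absent in a single-process run, so fall back to "master".
    return os.environ.get("PYTEST_XDIST_WORKER", "master")


def worker_home(base: Path, worker_id: str) -> Path:
    """Return (and create) a home directory unique to one worker.

    Hypothetical layout: <base>/home-<worker_id>.
    """
    home = base / f"home-{worker_id}"
    home.mkdir(parents=True, exist_ok=True)
    return home


def isolated_env(home: Path) -> dict[str, str]:
    """Environment overrides that redirect home-backed state to `home`."""
    return {
        "HOME": str(home),
        "USERPROFILE": str(home),
        "LOCALAPPDATA": str(home / "AppData" / "Local"),
    }
```

In a `conftest.py`, an autouse fixture could obtain `base` from pytest's `tmp_path_factory` and apply each override with `monkeypatch.setenv`, so that ~/.spec-kitty resolves inside the worker's temporary home rather than the developer's real one.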

User Scenarios & Testing

Primary scenario — Developer runs the suite locally in parallel

1. A developer finishes a change and wants to validate it before pushing.
2. They run the single documented parallel command.
3. The suite distributes across all available CPU cores; OS-global resource tests (real ports/daemons) run in a dedicated serial pass.
4. The run completes in well under half the previous single-process wall-clock.
5. The developer's real ~/.spec-kitty state is never read, written, or truncated by the run.

Primary scenario — CI runs the fast shards in parallel

1. A pull request triggers CI.
2. The fast shards (charter, cli, sync, doctrine, agent, …) execute across multiple cores instead of single-process.
3. The same set of tests (identical collected node IDs) runs as before.
4. The slowest shard's wall-clock is at least halved, shortening the critical path that gates downstream jobs.

Exception / edge scenarios

  • Worker collision attempt: Two parallel workers must never share the real home-backed queue DB; each must resolve a distinct, isolated home.
  • CPU contention: Timing-floor assertions (elapsed < 0.1) that flake under load are replaced by generous timeout guards that still trip on a pathological regression.
  • Volume-sensitive guards: Reduced-scale default tests must retain a full-volume variant (nightly / env-gated) so corruption- and uniqueness-detection power is not silently lost.
  • Integrity tests: Idempotency, file-existence, and freshness tests must NOT be routed through any shared/cached fixture.
  • Distribution mode: Parallel distribution must be file-pinned; bare work-stealing distribution would break file-local autouse resets.
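The CPU-contention case can be illustrated with a small guard that replaces a tight timing floor with a generous budget while leaving the functional assertion in place. This is a sketch under stated assumptions: `completes_within` and the 5-second budget are hypothetical, and the lookup workload is a stand-in for real test code.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def completes_within(budget_s: float) -> Iterator[None]:
    """Fail only if the wrapped block exceeds a generous time budget.

    The check runs after the block finishes; it does not interrupt a hang.
    """
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    assert elapsed < budget_s, f"took {elapsed:.2f}s, budget {budget_s:.2f}s"


def test_lookup_is_correct_and_not_pathological() -> None:
    data = {i: i * i for i in range(10_000)}
    # Before: assert elapsed < 0.1 (flaky when workers compete for CPU).
    # After: a budget loose enough for a loaded machine, tight enough
    # to catch an accidental quadratic or blocking regression.
    with completes_within(5.0):
        result = [data[i] for i in range(10_000)]
    assert result[3] == 9  # the functional assertion is kept
```

Because this guard only measures after the block returns, a test that can hang outright would still need a hard timeout, for example the `timeout` marker provided by the pytest-timeout plugin if that plugin is available in the project.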

Rule playback (must always hold)

  • The parallel suite collects the identical node set as the serial suite: no test is silently dropped.
  • Coverage quality never decreases: no genuine assertion path or regression guard is deleted or weakened.
  • A parallelization flip ships only after the affected shard is green on a repeated-run stability ratchet.
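The first rule lends itself to a mechanical check. The sketch below compares two captured collection listings and reports any node ID present in one but not the other. It assumes the listings come from `pytest --collect-only -q`, which prints one node ID per line followed by a summary line; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionDiff:
    missing: frozenset[str]  # collected serially, absent from the parallel run
    extra: frozenset[str]    # collected in parallel, absent from the serial run

    @property
    def equivalent(self) -> bool:
        return not self.missing and not self.extra


def parse_node_ids(collect_output: str) -> frozenset[str]:
    """Keep only lines that look like node IDs (they contain '::')."""
    return frozenset(
        line.strip() for line in collect_output.splitlines() if "::" in line
    )


def compare_collections(serial: str, parallel: str) -> CollectionDiff:
    """Order-insensitive comparison of two collection listings."""
    s, p = parse_node_ids(serial), parse_node_ids(parallel)
    return CollectionDiff(missing=s - p, extra=p - s)
```

Comparing sets rather than counts matters here: two shards can collect the same number of items while one silently swaps a test for another.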

Functional Requirements

| ID | Requirement | Status |
|---|---|---|
| FR-001 | Developers can run the full test suite across multiple processes locally via one documented command, completing without corrupting any real home-directory state. | Draft |
| FR-002 | Test execution isolates per-worker home, config, and state directories so parallel workers never share or clobber the real ~/.spec-kitty queue database or other home-backed state. | Draft |
| FR-003 | The CI fast test shards execute in parallel across available cores instead of single-process, beginning with the critical-path (charter) shard and rolling out shard-by-shard. | Draft |
| FR-004 | The parallelized suite collects and executes the identical set of tests (same node IDs) as the serial suite for each affected shard; a collection-equivalence check enforces this. | Draft |
| FR-005 | Tests depending on OS-global resources (real ports, daemons) run in a dedicated serial pass rather than under parallel workers. | Draft |
| FR-006 | Wall-clock timing-floor assertions that flake under CPU contention are converted to timeout-based guards that still catch pathological performance regressions, without deleting the functional assertions they accompany. | Draft |
| FR-007 | Redundant test execution is eliminated so that any given slow/performance test runs in exactly one CI job, with no orphaning of negative-path or NFR guard tests. | Draft |
| FR-008 | High-volume iteration tests (ULID generation volume, FSM parity matrix, sync concurrency loops) are reduced to a representative scale for the default run, with the full-volume variant preserved behind an environment gate or nightly path. | Draft |
| FR-009 | Expensive read-only setup (migrated-project state, whole-tree AST parse, dependency-graph load) is computed once and shared across tests that only read it, explicitly excluding integrity, idempotency, and freshness tests. | Draft |
| FR-010 | A cached/templated baseline git-repository fixture replaces repeated real git init for tests needing only a standard repo, while bespoke setups (unborn, detached, bare, worktree) retain their own initialization. | Draft |
| FR-011 | The documented local default test command and contributor guidance (CLAUDE.md) are updated to the parallel-capable invocation, including the serial-pass caveat for daemon/port tests. | Draft |
| FR-012 | Each parallelization change rolls out one shard at a time, gated by a repeated-run (run-twice or run-thrice) stability ratchet that must pass before the next shard is flipped. | Draft |
| FR-013 | The safe-now coverage-neutral quick wins (volume reduction, timing→timeout conversion, slow-test de-duplication, deterministic sleep elimination, verbose-flag removal in CI) ship as an initial wave independent of any parallelization dependency. | Draft |
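FR-010 describes building a baseline repository once and handing each test a copy. One way this could look is sketched below; the helper names are hypothetical, and a plain directory copy is an assumed stand-in for whatever cloning strategy the mission adopts. Only the template build invokes `git init`.

```python
import shutil
import subprocess
from pathlib import Path


def build_template(template: Path) -> Path:
    """Pay the cost of a real `git init` once to create a reusable template."""
    template.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "init", "--quiet", str(template)], check=True)
    return template


def clone_from_template(template: Path, dest: Path) -> Path:
    """Give one test its own repository by copying the template directory.

    `dest` must not exist yet; each copy is independent of the template,
    so a test that mutates its repo cannot affect other tests.
    """
    shutil.copytree(template, dest)
    return dest
```

In pytest terms, `build_template` would typically sit behind a session-scoped fixture and `clone_from_template` behind a function-scoped one. Tests that need an unborn, detached, bare, or worktree setup would bypass both and keep their own initialization, as the requirement states.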

Non-Functional Requirements

| ID | Requirement | Threshold / Measure | Status |
|---|---|---|---|
| NFR-001 | Local full-suite wall-clock improves on a multi-core developer machine. | ≥ 2× faster than the single-process baseline on a ≥4-core machine. | Draft |
| NFR-002 | The slowest CI shard's wall-clock is reduced. | Critical-path (charter) shard drops from ~9 min to ≤ 5 min. | Draft |
| NFR-003 | Per-push CI CPU time is reduced by the safe-now wave alone. | ≥ 60 s removed per push (volume reduction + slow-test de-dup), before any parallelization. | Draft |
| NFR-004 | Coverage quality is preserved. | New-code coverage stays ≥ 90%; overall line/branch coverage does not decrease versus baseline. | Draft |
| NFR-005 | Parallel runs are deterministic. | Affected suite passes on 3 consecutive parallel runs with zero new flaky tests. | Draft |
| NFR-006 | Added test infrastructure meets project quality gates. | mypy --strict and ruff pass with zero new issues; complexity ≤ 15. | Draft |
| NFR-007 | Collected test count is conserved across each change. | Per-shard collected node count is identical (or changes only by an explicitly asserted, reviewed delta) before vs. after. | Draft |
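NFR-005's "3 consecutive parallel runs" threshold amounts to a simple gate: run the shard repeatedly and accept the flip only if every run exits cleanly. A minimal sketch, with a hypothetical function name and the run itself injected as a callable so the gate stays independent of how the shard is launched:

```python
from typing import Callable


def stability_ratchet(run_once: Callable[[], int], runs: int = 3) -> bool:
    """Return True only if `runs` consecutive invocations all exit with 0.

    Stops at the first failure, since one red run already blocks the flip.
    """
    for attempt in range(1, runs + 1):
        exit_code = run_once()
        if exit_code != 0:
            print(f"run {attempt}/{runs} failed with exit code {exit_code}")
            return False
    return True
```

In CI, `run_once` could wrap a `subprocess.run(...)` call for the shard's parallel pytest invocation and return its `returncode`.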

Constraints

| ID | Constraint | Status |
|---|---|---|
| C-001 | No real assertion path or regression guard may be deleted or weakened; every reduction must be coverage-neutral, verified by collection-count equivalence plus an equivalence or mutation check where behavior is restructured. | Draft |
| C-002 | All changes land via pull request to origin/main; no direct pushes to origin/main (repository policy). | Draft |
| C-003 | Parallel test distribution must be file-pinned (--dist loadfile), never bare work-stealing distribution, because file-local autouse registry/cache resets assume same-file co-location. | Draft |
| C-004 | Volume-sensitive stress/corruption/uniqueness guards must continue to run somewhere in CI (nightly or environment-gated); their high-volume power may not be silently removed. | Draft |
| C-005 | Per-worker isolation must function cross-platform (Linux, macOS, Windows), covering HOME, USERPROFILE, and LOCALAPPDATA. | Draft |
| C-006 | Production code signatures must not be altered merely to satisfy tests (e.g., deterministic sleep elimination is achieved by module-scoped patching, not by changing production behavior). | Draft |
| C-007 | Integrity, idempotency, file-existence, and freshness tests are excluded from any shared/cached fixture or de-duplication. | Draft |
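C-006 can be made concrete with a small example of removing a deterministic sleep from a test without touching the production signature. The sketch below is illustrative only: `wait_until` is a hypothetical stand-in for production polling code, and the patch target `time.sleep` would in practice be the sleep reference inside the specific production module under test.

```python
import time
from typing import Callable
from unittest import mock


def wait_until(ready: Callable[[], bool], interval_s: float = 0.5,
               attempts: int = 10) -> bool:
    """Stand-in for production polling code; its signature is left unchanged."""
    for _ in range(attempts):
        if ready():
            return True
        time.sleep(interval_s)
    return False


def test_wait_until_retries_without_real_sleep() -> None:
    answers = iter([False, False, True])
    # Patch the sleep call for the duration of the test only; the retry
    # logic still runs, but no wall-clock time is spent waiting.
    with mock.patch("time.sleep") as fake_sleep:
        assert wait_until(lambda: next(answers)) is True
    # Two "not ready" answers mean two sleeps were requested.
    assert fake_sleep.call_count == 2
```

The test still exercises the retry path and asserts how many waits were requested, so the behavior being guarded is unchanged; only the real delay is removed.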

Success Criteria

| ID | Criterion |
|---|---|
| SC-001 | A developer runs one documented command and the full suite finishes in under half the previous wall-clock on a 4-core machine, with their real ~/.spec-kitty state untouched. |
| SC-002 | The slowest CI shard's wall-clock is at least halved relative to the pre-mission baseline. |
| SC-003 | Coverage percentage does not drop and no test is silently dropped: per-shard collected node counts are equal (or differ only by a reviewed, asserted amount). |
| SC-004 | Three consecutive parallel CI runs of the affected shards are green with no new flaky tests. |
| SC-005 | The safe-now wave removes ≥ 60 s of CI CPU per push with zero change to coverage. |
| SC-006 | Running the suite in parallel with no worker isolation is demonstrably prevented from touching real home state (a regression test proves two workers resolve distinct homes). |

Key Entities

  • Per-worker isolation fixture: redirects each parallel worker's home and state directories to a worker-unique temporary location; the master enabler for both local and CI parallelism.
  • CI shard: a CI job running a subset of tests; "fast" shards (currently serial) and "integration" shards (already parallel) are the two families.
  • Templated baseline git-repo fixture: a once-built repo template cloned per test to replace repeated real git init.
  • Volume gate: an environment switch (e.g. a *_FULL flag) that restores full-volume iteration for nightly/opt-in runs while the default run uses a representative scale.
  • Stability ratchet: a repeated-run gate that must pass before a parallelization flip is accepted.
  • Coverage-equivalence check: a collection-count/mutation safeguard proving a change is coverage-neutral.
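The volume gate entity reduces to one decision: which iteration count to use for a given run. A minimal sketch follows, assuming the gate is an environment variable set to "1" for full-volume runs. The function name and the example flag name are hypothetical; the source specifies only "a *_FULL flag".

```python
import os
from typing import Mapping, Optional


def iteration_count(default: int, full: int, flag: str,
                    env: Optional[Mapping[str, str]] = None) -> int:
    """Pick the representative scale unless the volume gate is switched on.

    `env` defaults to the process environment; passing a mapping makes
    the gate easy to exercise in tests without mutating os.environ.
    """
    env = os.environ if env is None else env
    return full if env.get(flag, "") == "1" else default
```

A high-volume test could then loop `iteration_count(1_000, 100_000, "EXAMPLE_ULID_FULL")` times, where both counts and the flag name are placeholders. The nightly job sets the flag and the default run leaves it unset, so the full-volume guard keeps running somewhere in CI.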

Assumptions

  • pytest-xdist is already a project dependency and the `-n auto --dist loadfile` pattern is already proven in production on the integration shards; this mission extends that proven pattern, it does not introduce it.
  • architecture/test-suite-acceleration-plan.md (the 43-agent verified audit) is the authoritative source for the specific files, hazards, and safeguards; this spec captures the WHAT/WHY and that document captures the evidence.
  • The numeric timing figures in the audit (e.g. charter ~9.1 min, ULID ~36 s) are estimates to be re-measured during implementation; the NFR thresholds are the binding targets.
  • The repository's no-direct-push and 90%-new-code-coverage policies remain in force throughout.

Out of Scope

  • Rewriting or re-architecting production (non-test) code for performance.
  • Changing the set of behaviors the suite verifies (adding or removing product features).
  • Migrating to a different test runner or CI provider.
  • Converting the 315 subprocess-based tests to in-process wholesale; only the specifically identified, coverage-safe conversions are in scope.