Context and Problem Statement
Line and branch coverage measure whether tests execute code; they do not measure whether tests would detect real bugs. Several Spec Kitty modules have had high coverage but weak assertions — tests that run the code path and assert only that nothing raised. When we wanted an objective signal of test effectiveness (not just reach), we needed a different tool.
The question was not whether to adopt mutation testing — the value is well-established — but at what level of ceremony, and with what scope.
Decision
Adopt mutmut 3.5.0 as a local-only developer quality gate. Capture the methodology as
first-class doctrine. Defer CI integration until the workflow is proven and the contributor-facing
guidance is written.
Concretely:
- Tooling:
mutmutadded under[project.optional-dependencies.test]. Configuration lives in[tool.mutmut]inpyproject.toml. - Scope:
paths_to_mutate = ["src/"]— the entire first-party source tree is in scope when a developer invokesmutmut runlocally. - Doctrine: the curated artifact set landed in
src/doctrine/tactics/built-in/testing/mutation-testing-workflow.tactic.yaml,src/doctrine/styleguides/shipped/mutation-aware-test-design.styleguide.yaml, and language-specific toolguides for Python (mutmut) and TypeScript (stryker). The DRG graph anchors these toDIRECTIVE_034(Test-First Development). - No CI gate. No release-blocking check. No dashboard. Mutation score is something a developer checks before handing a mission over for review — a local-only signal, not a shared scoreboard.
Rationale
- Mutation runs are slow and flaky in constrained environments. A full
mutmut runagainst spec- kitty takes hours and needs a roomy sandbox. Putting this behind a CI gate would make CI roughly 20× slower and would fight with themutmut 3.5.0trampoline bug that breaks any test spawningpython -m specify_clias a subprocess. - Doctrine beats tooling when you haven't adopted the discipline. The kill-the-survivor workflow
requires judgement about which mutants are equivalent vs. which signal weak assertions. Without the
mutation-aware-test-designpatterns (Boundary Pair, Non-Identity Inputs, Bi-Directional Logic) in muscle memory, a CI gate just teaches contributors to game the score. Ship the doctrine, prove the workflow locally, then gate. - Signal-to-noise is better at the developer's desk. Contributors triaging their own surviving mutants learn the patterns; contributors triaging another team's failing CI run learn to mark mutants "equivalent" and move on.
Current state (As-Is)
- Each mutmut invocation forks a
mutants/sandbox directory and runs the project's test suite against each injected mutant. Baseline-test failures in the sandbox currently preventmutmut runfrom reaching the mutant-testing phase. - We keep the sandbox baseline green via an explicit
--ignore=list inpyproject.toml[tool.mutmut].pytest_add_cli_args. Each entry targets tests that are incompatible with the sandbox for one of four reasons:- Subprocess CLI invocation: tests that shell out to
python -m specify_clitrip the knownmutmut 3.5.0trampoline bug (AttributeError: 'NoneType' object has no attribute 'max_stack_depth'inrecord_trampoline_hit). - Whole-codebase AST walks: pytestarch, chokepoint coverage, ownership invariant checks — these blow the 30-second per-test timeout.
- Wheel builds / multiprocessing spawn: adversarial distribution tests and concurrency tests
that conflict with mutmut's
set_start_method('fork'). - Repo-state dependencies: tests that read files outside
also_copyd directories (livekitty-specs/,scripts/, cross-mission fixtures).
- Subprocess CLI invocation: tests that shell out to
- Additionally, marker-based deselection removes
slow,e2e,distribution,adversarial,windows_ci,platform_darwin,live_adapter, andarchitecturaltests. type_check_commandis disabled because mutmut's line-based JSON parser cannot handle mypy's multi-linehintfields. The cost is accepting that mutants producing type errors will be tested rather than filtered out; they typically get killed on import failure.
The --ignore= list is a brittle external registry — it goes stale when tests move or rename, and the
reason a test is excluded lives in a TOML comment rather than next to the test itself.
Target state (landed 2026-04-20)
The explicit per-file --ignore= list has been replaced with two pytest markers:
non_sandbox and flaky. The two categories capture different exclusion intents and must
not be conflated.
- Register both markers in
[tool.pytest.ini_options].markers:non_sandbox: test is structurally incompatible with mutmut's forked sandbox — subprocess CLI calls, whole-codebase AST walks, wheel builds, or repo-state fixtures that fall outside mutmut's also_copy tree. Exclusion is deterministic: the test will never pass in the sandbox until the underlying sandbox constraint changes.flaky: test passes reliably in the main test suite but produces non-deterministic results under mutmut (or other mutation/forking pipelines) — e.g., race conditions in target-branch detection, time-sensitive assertions, implicit reliance on process-wide state. Re-evaluate on every pipeline tooling upgrade; aim to fix root causes and remove the marker.
- At the top of each currently-ignored test file, add
pytestmark = pytest.mark.<marker>with a one-line comment stating why. For wholesale-ignored directories, use a directory-levelconftest.pythat applies the marker collectively. - Collapse the long
--ignore=block inpyproject.tomlinto two deselection clauses:"-m", "... and not non_sandbox and not flaky".
Why the distinction matters:
non_sandboxtests are accepted losses — we've decided the coverage trade-off is worth the stability. No follow-up is planned unless the upstream trampoline bug is fixed.flakytests are debt — they indicate real test-suite weaknesses (hidden global state, timing dependencies, missing fixture isolation). Eachflakymarker should have a TODO comment pointing to the root cause; they should shrink over time, not accumulate.
Benefits:
- Co-located intent. The "why" sits next to the test, not in a separate config file.
- Rename-safe. Moving or renaming a test no longer silently removes it from the ignore list.
- Individually re-enableable. When the upstream mutmut trampoline bug is fixed, we remove the
non_sandboxmarker from subprocess tests without touching the TOML. - Discoverable.
pytest -m non_sandbox --collect-onlyandpytest -m flaky --collect-onlylist exactly which tests are excluded and for which reason — replacinggreparchaeology onpyproject.toml. - Signals tech debt. The
flakybucket surfaces tests that may be lying to us in CI; thenon_sandboxbucket does not.
Required follow-up work
- Contributor guidance document. Landed 2026-04-20 at
docs/how-to/run-mutation-tests.md. Covers runningmutmut run, reading results, the kill-the-survivor workflow, equivalent-mutant suppression, and how to add thenon_sandbox/flakymarkers to new tests that won't run in the mutant sandbox. - User-facing rationale. A short section in the main
README.mdor a dedicated explanation doc describing why mutation score matters and how it differs from coverage — enough to prevent a contributor from treating the two as interchangeable. - Re-evaluate CI integration once (a) the
mutmut 3.5.0trampoline bug is fixed upstream or worked around at the project level, and (b) the doctrine is known to be internalised — likely measured by contributors citing themutation-aware-test-designpatterns in code review.
Consequences
Positive:
- Developers have a real tool to answer "if I introduced a bug here, would my tests fail?" rather than the proxy question "did my tests run this line?".
- Doctrine artifacts travel with the project: any contributor who picks up a mission inherits the mutation-aware test design patterns.
- No CI slowdown. No review-board arguments about mutation score thresholds.
Negative:
- The
--ignore=list will drift until thenon_sandboxmarker migration lands. - Contributors who never run
mutmut runlocally get no mutation signal at all. Without CI this is a soft policy, not an enforced gate. - Mutmut's subprocess-unfriendly trampoline forces ignores around spec-kitty's CLI integration tests. That's a real coverage blind spot for CLI behaviour, mitigated (not eliminated) by the unit tests that cover the same code paths in-process.
Related
- Tactic:
src/doctrine/tactics/built-in/testing/mutation-testing-workflow.tactic.yaml - Styleguide:
src/doctrine/styleguides/shipped/mutation-aware-test-design.styleguide.yaml - Toolguide:
src/doctrine/toolguides/shipped/PYTHON_MUTATION_TOOLS.md - Toolguide:
src/doctrine/toolguides/shipped/TYPESCRIPT_MUTATION_TOOLS.md - Import candidate:
src/doctrine/_reference/craft-guidelines-library/candidates/mutation-testing-skillpack.import.yaml - Directive:
DIRECTIVE_034(Test-First Development)