Test-flakiness handling policy
A flaky test is one whose pass/fail outcome changes between runs without any change to the code under test — it sometimes goes red on CI for reasons unrelated to the diff under review. Flakes waste review cycles and, left unmanaged, normalise a red CI that everyone learns to ignore.
This page is the suite-wide policy for handling them: how we detect a flake, what we are allowed to do about it (and explicitly not allowed to do), and the current disposition of every known flake surface in this repo.
The one rule that governs everything below: we never make a test pass by retrying it until it goes green. A green-after-retry is the "fixed because it looks fixed" trap — it hides genuine regressions and is incompatible with this repo's live-evidence / anti-laziness discipline. Retry plugins (
pytest-rerunfailures, the PyPIflakyplugin) are therefore not adopted, and are not present in the dependency set.
⚠️ Naming: flaky the marker is not "flaky" the concept
This repo already registers a flaky pytest marker, and it means something
narrow and unrelated to this policy. Per
ADR 2026-04-20-1,
@pytest.mark.flaky means "passes reliably in the main suite but is
non-deterministic under mutmut / forking pipelines" — it is a deselection
bucket for the mutation-testing sandbox, not a CI-quarantine mechanism.
Do not reach for the flaky marker to deal with a CI flake. If a genuine
quarantine mechanism is ever needed, it must use a different name
(quarantine, see below) so the two never get conflated.
The three tiers
Every flake in this suite falls into one of three tiers, and each tier has a single sanctioned response.
| Tier | What it is | Example | Sanctioned response |
|---|---|---|---|
| 1. Threshold / budget gate | A test that asserts a measurement stays under a wall-clock / size budget. CI runners are shared and noisy, so a generous gate occasionally trips with no real regression. | NFR-002 latency (tests/architectural/test_wp_prompt_build_latency.py), doctor restart-daemon timing, the -m timing gate. |
Tune the budget — never retry. Widen the budget to absorb runner variance. A real regression adds time consistently and still trips a generous gate; a retry would mask exactly that. Investigate before bumping; the budget is the gate, not the regression. |
| 2. Correctness test | A test of logical behaviour that should be 100% deterministic. If it flakes, the test (or the code) has a hidden nondeterminism — shared global state, ordering assumptions, fixture-teardown races, monkeypatch leakage, import-time side effects. | tests/specify_cli/shims/test_registry.py (parallel-collection nondeterminism). |
Fix the root cause — never retry. A correctness test that needs a retry is lying. Find the nondeterminism and remove it. |
| 3. Genuinely environmental | A test that depends on an OS-global resource — real TCP ports, singleton daemons, the real filesystem — that cannot be fully isolated per worker. | Real-port / daemon suites (tests/sync/test_orphan_sweep.py, ports 9400–9449). |
Surgical handling only. First serialise (-n0) and isolate (per-worker HOME) — see testing-parallel.md. Only if a residual, irreducible environmental flake remains do we quarantine (below) — never a blanket retry. |
Detection: confirm a flake before you treat it
One red run is a single data point — pytest alone cannot tell flaky from broken. Before declaring a test "flaky" and applying any of the responses above, reproduce the nondeterminism:
- Re-run the test in isolation, and under the parallel runner it failed on:
PWHEADLESS=1 pytest <path> -n auto --dist loadfile -p no:cacheprovider - For ordering / hash-seed flakes, re-run under different
PYTHONHASHSEEDvalues and diff the collected node IDs. - For a confidence signal, use the existing stability ratchet — N consecutive
green parallel runs, the same gate CI uses for shard flips. The authoritative
invocation lives in
testing-parallel.md → Running the stability ratchet locally;
point it at the suspect
<path>instead of fabricating a new command. (Full harness:tests/_support/coverage_safety/README.md.)
A test that cannot be made to fail again under these probes is not confirmed flaky — do not annotate it.
Tooling decision
No retry tooling — ever. pytest-rerunfailures / PyPI flaky are
deliberately not in the dependency set (see the rule at the top). Detection and
isolation reuse what the repo already has; the only thing built specifically for
this policy is the quarantine marker.
- No retry plugin. See the rule at the top.
- Detection uses the existing stability ratchet
(
tests._support.coverage_safety.ratchet), not a new confidence tool. - Isolation (the first line of defence for Tier 3) is the existing
per-worker HOME isolation + serial
-n0pass documented in testing-parallel.md.
quarantine — built, env-gated, non-blocking
The sanctioned mechanism for an irreducible Tier-3 flake is a dedicated
quarantine marker — not a retry, and not the existing flaky marker.
It is implemented as a single, un-bypassable chokepoint:
- Registered canonically in
pytest.ini'smarkersblock — the single source of truth for the marker registry (#2034) — sufficient for--strict-markers. - Held out of every normal run.
tests/conftest.py'spytest_collection_modifyitemsskips any@pytest.mark.quarantinetest unlessSPEC_KITTY_RUN_QUARANTINE=1. Because this is collection-time and global, no-mselector in any CI job (present or future) can accidentally run a quarantined test — the gate cannot be forgotten. The opt-in is strict (only the literal"1"); the pure decision lives intests/_support/quarantine.pyand is unit-tested. - Visible, never blocking. The
quarantine-visibilityjob inci-quality.ymlsetsSPEC_KITTY_RUN_QUARANTINE=1and runs-m quarantinefor real, so a quarantined flake is still seen failing — never silently retried to green. The job is deliberately excluded from thequality-gateaggregation (and must stay out of branch-protection required checks), so it can never turnmainred or block an unrelated PR. It tolerates an empty quarantine set (pytest exit code 5) so "nothing quarantined" is green.
To quarantine a test: mark it @pytest.mark.quarantine with a one-line
reason and a tracking-issue link — every quarantined test is tech debt with
an owner. The wiring above (test_quarantine_marker.py) is enforced.
As of this writing no test is quarantined (see the disposition table — every known surface is fixed or correctly handled). The mechanism exists so the first irreducible flake has a sanctioned home instead of a retry.
Audit + disposition of known flake surfaces
| Surface | Tier | Disposition |
|---|---|---|
tests/architectural/test_wp_prompt_build_latency.py (NFR-002 latency) |
1 | Resolved — keep tuning. Budget already widened 8.0 → 10.0s after a shared runner measured 8.50s with no code regression (PR #2036). The file carries inline rationale that is the Tier-1 policy. No further change. |
doctor restart-daemon NFR-002 timing, -m timing gate |
1 | Policy-covered. Treat as a budget gate: tune, never retry. Runs only in dedicated timing jobs (-m timing is excluded from the fast suite). |
tests/sync/test_orphan_sweep.py (real ports 9400–9449, daemons) |
3 | Resolved — already serialised. Runs in its own -n0 serial pass, excluded from the parallel pool (CLAUDE.md / testing-parallel.md). No quarantine needed. |
tests/specify_cli/shims/test_registry.py (parallel-collection nondeterminism) |
2 | Fixed at root cause. Parametrising over list(<frozenset>) produced a PYTHONHASHSEED-dependent case order, so xdist workers collected different orders ("Different tests were collected between gw0 and gwN"). Changed to sorted(<frozenset>), making collection order deterministic across workers. Verified: identical node-id order under different hash seeds. |
Adding a new test? Avoid the common root causes
When a correctness (Tier 2) test flakes, it is almost always one of these — fix the cause, do not retry:
- Unordered data used where order matters. Iterating a
set/frozenset/dictfor parametrize IDs or assertion sequences →PYTHONHASHSEED-dependent order. Wrap insorted(...). - Fixture-teardown races / leaked global state between tests sharing a module or process.
monkeypatch/ env-var leakage across tests (rely on the per-worker HOME isolation; don't mutate process-global state without restoring it).- Import-time side effects that bind a path or singleton before fixtures run.
- Time-sensitive assertions in a non-timing test — move the timing concern to a Tier-1 budget gate with a generous threshold, or remove the wall-clock dependency.
See also
- Running the test suite in parallel — per-worker HOME isolation, the serial daemon pass, and the stability ratchet.
- ADR 2026-04-20-1
— the
flaky/non_sandboxmarkers (mutmut deselection; distinct from this policy). - How-to: run mutation tests.