Doctrine Alignment Analysis
Status: Discovery
Date: 2026-03-12
Tactics reviewed:
test-boundaries-by-responsibility.tactic.mdtesting-select-appropriate-level.tactic.mdtest-to-system-reconstruction.tactic.md
Tactic Summaries
1. Test Boundaries by Responsibility
Draw test boundaries around functional responsibility, not code structure. Mock what's outside the responsibility boundary (DB, filesystem, network), exercise everything inside with real implementations. Boundaries should survive internal refactors without breaking tests.
2. Select Appropriate Level
The test pyramid — many unit, some integration, few system. Pick the minimal level that gives confidence given the risk of the change. Don't default to one level for everything. Explicitly document what's not tested and why.
3. Test-to-System Reconstruction
Tests should be readable enough that someone with zero implementation context can reconstruct the system's behavior purely from test code. If they can't, the tests are testing mechanics, not documenting behavior.
Current Suite vs Doctrine
❌ Inverted Pyramid
The numbers reveal the shape:
| Level | Files | LOC | Doctrine says |
|---|---|---|---|
| Unit | 60 | 19.5K | Should be most |
| Integration/CLI | 149 + 42 + 25 | 73.5K | Should be some |
| E2E | 3 + 6 | 1.8K | Should be few |
The specify_cli/ directory (52K LOC, 149 files) is the elephant. These tests import real modules but also spawn subprocesses, scaffold git repos, and do filesystem work. They are labeled as unit-adjacent but behave like integration tests. The pyramid is inverted — ~4× more integration-weight code than pure unit code.
❌ Boundaries Follow Structure, Not Responsibility
Tactic 1 says: "Don't mock by layer — mock by whether the component directly implements the feature logic."
Our tests often do the opposite:
specify_cli/tests exercise the CLI layer by calling real implementations all the way down to the filesystem — no responsibility boundary drawn, just "test the whole stack through the CLI entry point."integration/tests do the same but viasubprocess.run(), adding 1–5s per test for no extra behavioral coverage over thespecify_cli/tests.- Many tests scaffold complete
.kittify/project trees to validate a single function. That's insufficient isolation per the tactic — the filesystem is an outside boundary component that should be stubbed for unit-level tests.
⚠️ Shift-Left Gap
"Shifting left" means catching defects at the cheapest possible level. Current state:
- A developer changing
status/reducer.pycan't get fast feedback — even the "unit" tests inspecify_cli/status/scaffold filesystem state. - The fast tier (Tier 1 from the initiative README) covers only doctrine/template/docs — ~20K LOC. Core business logic (status model, merge, frontmatter, tasks) has no sub-second test path.
- We're paying integration-test costs for unit-test-level questions. That's shifting right.
⚠️ Test-to-System Reconstruction Readiness
Tactic 3 asks: "Can someone reconstruct the system from tests alone?"
- Good:
tests/doctrine/andtests/contract/are excellent — schema-compliance tests that clearly document "what the system promises." - Bad: Many
specify_cli/tests have opaque fixture setup (copy entire project trees, run CLI, parse output). A naive reader would understand that something works but not what behavior is guaranteed. - Missing: Almost no explicit documentation of what is NOT tested. The tactics require this.
✅ What We Do Well
- Adversarial tests exist (6 files) — aligns with the ATDD adversarial complement.
- Marker system is in place (
e2e,slow,jj,adversarial) — just underused. - No significant duplication — tests at different levels genuinely test different aspects.
sync/conftest.pyis a model of correct boundaries: everything outside the sync module is mocked, real implementations inside. This is exactly what Tactic 1 prescribes.
Verdict
Form vs Function
Form: The directory structure looks like a pyramid (unit/, integration/, e2e/), but the actual execution profile is an inverted trapezoid. The specify_cli/ tests blur the boundary.
Function: We're testing real behavior (good), but at the wrong cost level (bad). Most of the 20–30 min runtime is spent on filesystem scaffolding and subprocess overhead that could be eliminated by drawing responsibility boundaries and stubbing I/O.
Shift-Left Score: Weak
The fastest useful feedback loop is ~3 min, not ~30s. Core business logic lacks a pure-unit test path. The doctrine says unit tests should be the primary confidence mechanism — ours are a minority.
Recommended Actions (Doctrine-Aligned)
Immediate (shift-left)
- Identify the top-10 most-changed modules (via
git log) and ensure each has a pure-unit test file with filesystem/subprocess stubbed out. - Extract responsibility boundaries for
status/,merge/,frontmatter/— these are the core logic modules that should have sub-second test paths. - Add
@pytest.mark.subprocessmarker to all tests that callsubprocess.run()so developers can exclude them locally.
Short-term (pyramid correction)
- Audit
specify_cli/tests — for each file, decide: is this testing CLI wiring (integration) or business logic (unit)? Split accordingly. - Create
tests/unit/mirrors for core modules currently only tested via CLI integration inspecify_cli/. - Consolidate duplicate fixtures (
isolated_env,run_cli) to rootconftest.py.
Strategic (reconstruction readiness)
- Run the Test-to-System Reconstruction tactic on the
status/module (highest complexity, ~800 tests recommended scope). Use the dual-agent protocol from the tactic. Target: ≥85% overall accuracy. - Document untested areas — create a
TESTING_GAPS.mdor equivalent that explicitly states what is not covered and why. - Adopt test scope statements (from Tactic 1) for new tests: "This test validates [X] by exercising [Y] while stubbing [Z]."