Tasks: 066 Review Loop Stabilization

Mission: 066-review-loop-stabilization
Target Branch: main
Created: 2026-04-06

Subtask Index

ID	Description	WP	Parallel
T001	Create review module scaffold (__init__.py, artifacts.py with dataclasses)	WP01
T002	Implement ReviewCycleArtifact write() — YAML frontmatter + markdown body	WP01
T003	Implement from_file() and latest() — parse artifacts, find highest cycle	WP01
T004	Rewrite _persist_review_feedback() in tasks.py — create ReviewCycleArtifact	WP01
T005	Update _resolve_review_feedback_pointer() — dual resolution (legacy + new)	WP01
T006	Write tests for artifact CRUD, frontmatter round-trip, pointer resolution	WP01
T007	Create fix_prompt.py with generate_fix_prompt()	WP02
T008	Implement fix-prompt template rendering	WP02
T009	Add fix-mode detection in workflow.py implement path	WP02
T010	Implement mode switching — fix-prompt vs full-prompt	WP02
T011	Write tests for fix-prompt generation, sizing, end-to-end flow	WP02
T012	Create dirty_classifier.py with classify_dirty_paths()	WP03	[D]
T013	Implement classification rules — blocking vs benign path patterns	WP03	[D]
T014	Update _validate_ready_for_review() — use classifier, only block on blocking	WP03	[D]
T015	Update review prompt — surface writable in-repo feedback path	WP03	[D]
T016	Write tests for classification, validation, review prompt path	WP03	[D]
T017	Create baseline.py with BaselineTestResult and TestFailure dataclasses	WP04	[D]
T018	Implement capture_baseline() — pytest --junitxml + JUnit XML parsing	WP04	[D]
T019	Implement load_baseline() and diff_baseline() — cached lookup + diff	WP04	[D]
T020	Hook capture_baseline() into implement path (before agent starts coding)	WP04	[D]
T021	Hook diff_baseline() into review prompt — Baseline Context section	WP04	[D]
T022	Add review.test_command config support for non-pytest runners	WP04	[D]
T023	Write tests for capture, JSON round-trip, diff, config, review prompt	WP04	[D]
T024	Create lock.py with ReviewLock dataclass — acquire, release, is_stale	WP05	[D]
T025	Implement stale lock detection — cross-platform PID check	WP05	[D]
T026	Hook lock acquire/release into agent action review	WP05	[D]
T027	Add .spec-kitty/ to .gitignore	WP05	[D]
T028	Implement opt-in env-var isolation config from .kittify/config.yaml	WP05	[D]
T029	Write tests for lock lifecycle, stale detection, concurrent block, config	WP05	[D]
T030	Create arbiter.py with ArbiterCategory, ArbiterChecklist, ArbiterDecision	WP06	[D]
T031	Implement prompt_arbiter_checklist() — 5-question checklist + category	WP06	[D]
T032	Implement override detection in move-task — forward --force after rejection	WP06	[D]
T033	Persist ArbiterDecision in review-cycle artifact frontmatter	WP06	[D]
T034	Make arbiter decisions visible in agent tasks status	WP06	[D]
T035	Write tests for checklist, detection, persistence, visibility	WP06	[D]

Work Packages

WP01: Persisted Review Artifact Model

Goal: Define review-cycle artifact schema, move feedback from .git/ to committed kitty-specs/ artifacts, add backward-compatible pointer resolution.
Priority: P0 (foundation for WP02)
Dependencies: None
Issues: #432, storage side of #433
Estimated prompt size: ~400 lines

□ T001 Create review module scaffold (__init__.py, artifacts.py with dataclasses) (WP01)
□ T002 Implement ReviewCycleArtifact write() — YAML frontmatter + markdown body (WP01)
□ T003 Implement from_file() and latest() — parse artifacts, find highest cycle (WP01)
□ T004 Rewrite _persist_review_feedback() in tasks.py — create ReviewCycleArtifact (WP01)
□ T005 Update _resolve_review_feedback_pointer() — dual resolution (legacy + new) (WP01)
□ T006 Write tests for artifact CRUD, frontmatter round-trip, pointer resolution (WP01)

Implementation sketch: Create new src/specify_cli/review/ module. Define ReviewCycleArtifact and AffectedFile frozen dataclasses following existing StatusEvent patterns. Replace _persist_review_feedback() to write review-cycle artifacts to kitty-specs/<mission>/tasks/<WP-slug>/. Update pointer resolver for dual-format resolution.

Risks: Legacy feedback:// pointers must continue resolving. Test with pre-066 event log entries.
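Example sketch (illustrative, not the shipped implementation): the write() / from_file() / latest() round-trip described above could look roughly like this. The field names, the review-cycle-<N>.md file name, and the hand-rolled "key: value" frontmatter are assumptions made to keep the example stdlib-only; the real module presumably uses a YAML library and its own schema.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AffectedFile:
    path: str
    line_range: str | None = None  # assumed format, e.g. "10-24"


@dataclass(frozen=True)
class ReviewCycleArtifact:
    wp_id: str
    cycle: int
    verdict: str
    body: str
    affected_files: tuple[AffectedFile, ...] = ()

    def write(self, directory: Path) -> Path:
        """Write frontmatter plus markdown body to review-cycle-<cycle>.md."""
        lines = ["---", f"wp_id: {self.wp_id}", f"cycle: {self.cycle}",
                 f"verdict: {self.verdict}"]
        lines += [f"affected_file: {f.path}:{f.line_range or ''}"
                  for f in self.affected_files]
        lines += ["---", "", self.body]
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / f"review-cycle-{self.cycle}.md"
        target.write_text("\n".join(lines), encoding="utf-8")
        return target

    @classmethod
    def from_file(cls, path: Path) -> ReviewCycleArtifact:
        """Parse a file produced by write() back into an artifact."""
        _, front, body = path.read_text(encoding="utf-8").split("---\n", 2)
        fields: dict[str, str] = {}
        files: list[AffectedFile] = []
        for line in front.splitlines():
            key, _, value = line.partition(": ")
            if key == "affected_file":
                file_path, _, line_range = value.rpartition(":")
                files.append(AffectedFile(file_path, line_range or None))
            else:
                fields[key] = value
        return cls(fields["wp_id"], int(fields["cycle"]), fields["verdict"],
                   body.removeprefix("\n"), tuple(files))

    @classmethod
    def latest(cls, directory: Path) -> ReviewCycleArtifact | None:
        """Return the artifact with the highest cycle number, if any."""
        numbered = []
        for candidate in directory.glob("review-cycle-*.md"):
            match = re.fullmatch(r"review-cycle-(\d+)\.md", candidate.name)
            if match:
                numbered.append((int(match.group(1)), candidate))
        if not numbered:
            return None
        # Compare cycle numbers as integers so cycle 10 sorts after cycle 2.
        return cls.from_file(max(numbered, key=lambda item: item[0])[1])
```

Parsing the cycle number out of the file name, rather than sorting names lexically, keeps latest() correct once a WP passes nine review cycles.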

WP02: Focused Rejection Recovery

Goal: Generate fix-mode prompts from persisted review-cycle artifacts instead of replaying full WP prompts.
Priority: P0 (core value proposition)
Dependencies: WP01
Issues: #430, integration side of #433
Estimated prompt size: ~380 lines

✅ T007 Create fix_prompt.py with generate_fix_prompt() (WP02)
✅ T008 Implement fix-prompt template rendering (WP02)
✅ T009 Add fix-mode detection in workflow.py implement path (WP02)
✅ T010 Implement mode switching — fix-prompt vs full-prompt (WP02)
✅ T011 Write tests for fix-prompt generation, sizing, end-to-end flow (WP02)

Implementation sketch: Create generate_fix_prompt() that reads latest ReviewCycleArtifact, extracts affected file paths/line ranges, reads current code from disk, and produces a focused prompt. Modify agent action implement in workflow.py to detect prior rejection cycles and switch modes.

Risks: Fix-prompt must be self-contained — agent should not need to read the original WP prompt. Verify prompt sizing meets NFR-001 (<25% of original for single-file findings).
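Example sketch (illustrative): one way generate_fix_prompt() could turn review findings into a focused prompt and check it against the NFR-001 budget. The Finding dataclass, the prompt layout, and the choice to measure size in lines are assumptions for this example; the plan does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one reviewer finding."""
    path: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive
    comment: str


def generate_fix_prompt(wp_id: str, cycle: int, findings: list[Finding],
                        repo_root: Path) -> str:
    """Build a prompt containing only the findings and the code they touch."""
    sections = [f"# Fix request for {wp_id} (review cycle {cycle})", ""]
    for finding in findings:
        source = (repo_root / finding.path).read_text(encoding="utf-8").splitlines()
        excerpt = source[finding.start - 1:finding.end]
        sections.append(f"## {finding.path} lines {finding.start}-{finding.end}")
        sections.append(f"Reviewer finding: {finding.comment}")
        # Show current on-disk code with its real line numbers.
        sections.extend(f"{number:>4} | {line}"
                        for number, line in enumerate(excerpt, start=finding.start))
        sections.append("")
    return "\n".join(sections)


def is_within_budget(fix_prompt: str, original_prompt: str,
                     ratio: float = 0.25) -> bool:
    """Assumed sizing check: fix prompt must be under `ratio` of the original, by line count."""
    return len(fix_prompt.splitlines()) < ratio * len(original_prompt.splitlines())
```

Reading the excerpt from disk at prompt time, rather than from the artifact, means the agent sees the code as it currently stands after any intermediate edits.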

WP03: External Reviewer Handoff

Goal: Implement dirty-state classification and writable in-repo feedback path for external reviewers.
Priority: P1
Dependencies: None
Issues: #439
Estimated prompt size: ~370 lines

✅ T012 Create dirty_classifier.py with classify_dirty_paths() (WP03)
✅ T013 Implement classification rules — blocking vs benign path patterns (WP03)
✅ T014 Update _validate_ready_for_review() — use classifier, only block on blocking (WP03)
✅ T015 Update review prompt — surface writable in-repo feedback path (WP03)
✅ T016 Write tests for classification, validation, review prompt path (WP03)

Implementation sketch: Create classify_dirty_paths() that partitions git status --porcelain output. Blocking: WP-owned source files, WP's task file. Benign: status.events.jsonl, status.json, other WPs' task files, metadata. Update _validate_ready_for_review() to call classifier. Update review prompt to show in-repo writable path.

WP01 interaction note: WP03 changes where the review prompt tells the reviewer to write. The --review-feedback-file flag still accepts any path. If WP01 hasn't landed, move-task persists to .git/ (old behavior). After WP01 lands, move-task persists to the same in-repo location. Convergence is natural.

Risks: Classification rules must not accidentally block on WP-owned files that were legitimately committed. Test with real multi-WP dirty state.
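Example sketch (illustrative): a minimal classify_dirty_paths() that partitions `git status --porcelain` output. The task-file path pattern, the DirtyClassification container, and the conservative default (anything unrecognized blocks) are assumptions for this example; metadata files are left out because the plan does not name them.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Status bookkeeping files the plan lists as benign.
BENIGN_NAMES = {"status.events.jsonl", "status.json"}
# Assumed layout: kitty-specs/<mission>/tasks/<WP-slug>...
TASK_FILE = re.compile(r"kitty-specs/[^/]+/tasks/(WP\d+)")


@dataclass
class DirtyClassification:
    blocking: list[str] = field(default_factory=list)
    benign: list[str] = field(default_factory=list)


def classify_dirty_paths(porcelain: str, wp_id: str) -> DirtyClassification:
    """Split dirty paths into those that should block review and those that should not."""
    result = DirtyClassification()
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        # Porcelain v1 lines are "XY <path>"; renames are "XY <old> -> <new>".
        path = line[3:].split(" -> ")[-1].strip('"')
        task = TASK_FILE.match(path)
        if task:
            # This WP's own task file blocks; other WPs' task files do not.
            bucket = result.blocking if task.group(1) == wp_id else result.benign
        elif PurePosixPath(path).name in BENIGN_NAMES:
            bucket = result.benign
        else:
            # Assumption: unrecognized paths are treated as WP-owned source.
            bucket = result.blocking
        bucket.append(path)
    return result
```

With this shape, _validate_ready_for_review() would only need to refuse when the blocking list is non-empty.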

WP04: Baseline Test Capture

Goal: Capture baseline test results at implement time, surface delta in review prompts.
Priority: P1
Dependencies: None
Issues: #444
Estimated prompt size: ~450 lines

□ T017 Create baseline.py with BaselineTestResult and TestFailure dataclasses (WP04)
□ T018 Implement capture_baseline() — pytest --junitxml + JUnit XML parsing (WP04)
□ T019 Implement load_baseline() and diff_baseline() — cached lookup + diff (WP04)
□ T020 Hook capture_baseline() into implement path (before agent starts coding) (WP04)
□ T021 Hook diff_baseline() into review prompt — Baseline Context section (WP04)
□ T022 Add review.test_command config support for non-pytest runners (WP04)
□ T023 Write tests for capture, JSON round-trip, diff, config, review prompt (WP04)

Implementation sketch: Run pytest --junitxml=<tmpfile> on the base branch at implement time. Parse JUnit XML via xml.etree.ElementTree. Save structured results to baseline-tests.json. At review time, diff cached baseline against current test run. Add "Baseline Context" section to review prompt. Non-pytest projects configure review.test_command in .kittify/config.yaml.

Risks: The test suite may fail to run at implement time (missing deps, broken env). Handle this gracefully: create the baseline artifact with sentinel values and warn, but do not block implementation.
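Example sketch (illustrative): parsing JUnit XML with xml.etree.ElementTree and diffing a cached baseline against a current run. The TestFailure fields, the "classname::name" test id, and the three diff buckets are assumptions for this example; running pytest and writing baseline-tests.json are omitted.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class TestFailure:
    test_id: str
    message: str


def parse_junit_failures(xml_text: str) -> list[TestFailure]:
    """Collect failed or errored test cases from JUnit XML text."""
    failures = []
    for case in ET.fromstring(xml_text).iter("testcase"):
        # Compare against None explicitly: an Element with no children is falsy.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
            failures.append(TestFailure(test_id, problem.get("message", "")))
    return failures


def diff_baseline(baseline: list[TestFailure],
                  current: list[TestFailure]) -> dict[str, list[str]]:
    """Separate failures that predate the WP from ones it introduced or fixed."""
    known = {failure.test_id for failure in baseline}
    now = {failure.test_id for failure in current}
    return {
        "pre_existing": sorted(known & now),
        "new": sorted(now - known),
        "fixed": sorted(known - now),
    }
```

The "pre_existing" bucket is what the Baseline Context section would surface, so a reviewer does not reject a WP for failures that were already present on the base branch.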

WP05: Concurrent Review Isolation

Goal: Serialize concurrent reviews by default; opt-in env-var isolation for projects that configure it.
Priority: P2
Dependencies: None
Issues: #440
Estimated prompt size: ~380 lines

✅ T024 Create lock.py with ReviewLock dataclass — acquire, release, is_stale (WP05)
✅ T025 Implement stale lock detection — cross-platform PID check (WP05)
✅ T026 Hook lock acquire/release into agent action review (WP05)
✅ T027 Add .spec-kitty/ to .gitignore (WP05)
✅ T028 Implement opt-in env-var isolation config from .kittify/config.yaml (WP05)
✅ T029 Write tests for lock lifecycle, stale detection, concurrent block, config (WP05)

Implementation sketch: Primary (80% effort): ReviewLock serialization via .spec-kitty/review-lock.json. Acquire on review start, release on move-task. Stale detection via os.kill(pid, 0). Optional (20%): read review.concurrent_isolation from config for env-var scoping. Add .spec-kitty/ to .gitignore.

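Risks: PID-based stale detection behaves differently across platforms. On POSIX, os.kill(pid, 0) only probes for the process; on Windows, os.kill with any signal other than CTRL_C_EVENT or CTRL_BREAK_EVENT terminates the target process. Guard the call by platform, wrap it in try/except, and fall back to a file-age check.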
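Example sketch (illustrative): a ReviewLock with acquire, release, and stale detection that falls back to lock age when the PID cannot be probed safely. The JSON field names, the one-hour age limit, and raising RuntimeError on contention are assumptions for this example.

```python
from __future__ import annotations

import json
import os
import sys
import time
from dataclasses import asdict, dataclass
from pathlib import Path

MAX_LOCK_AGE_SECONDS = 3600.0  # assumed fallback threshold


def pid_is_alive(pid: int) -> bool | None:
    """Return True/False when the PID can be probed, None when it cannot."""
    if sys.platform == "win32":
        return None  # os.kill(pid, 0) would terminate the process on Windows
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    except OSError:
        return None
    return True


@dataclass(frozen=True)
class ReviewLock:
    wp_id: str
    pid: int
    started_at: float

    def is_stale(self, now: float | None = None) -> bool:
        alive = pid_is_alive(self.pid)
        if alive is None:
            # Fallback: judge by lock age when the PID check is unavailable.
            return ((now or time.time()) - self.started_at) > MAX_LOCK_AGE_SECONDS
        return not alive

    @classmethod
    def acquire(cls, path: Path, wp_id: str) -> ReviewLock:
        if path.exists():
            existing = cls(**json.loads(path.read_text(encoding="utf-8")))
            if not existing.is_stale():
                raise RuntimeError(f"review already running for {existing.wp_id}")
            path.unlink()  # reclaim a stale lock
        lock = cls(wp_id, os.getpid(), time.time())
        path.parent.mkdir(parents=True, exist_ok=True)
        # Mode "x" fails if another process created the file in the meantime.
        with path.open("x", encoding="utf-8") as handle:
            json.dump(asdict(lock), handle)
        return lock

    @staticmethod
    def release(path: Path) -> None:
        path.unlink(missing_ok=True)
```

In the flow described above, acquire() would run at review start against .spec-kitty/review-lock.json and release() would run from move-task.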

WP06: Arbiter Ergonomics

Goal: Add structured arbiter checklist and rationale model for false-positive review rejections.
Priority: P2
Dependencies: None
Issues: #441
Estimated prompt size: ~400 lines

✅ T030 Create arbiter.py with ArbiterCategory, ArbiterChecklist, ArbiterDecision (WP06)
✅ T031 Implement prompt_arbiter_checklist() — 5-question checklist + category (WP06)
✅ T032 Implement override detection in move-task — forward --force after rejection (WP06)
✅ T033 Persist ArbiterDecision in review-cycle artifact frontmatter (WP06)
✅ T034 Make arbiter decisions visible in agent tasks status (WP06)
✅ T035 Write tests for checklist, detection, persistence, visibility (WP06)

Implementation sketch: Create ArbiterCategory StrEnum (5 categories), ArbiterChecklist (5 boolean questions), ArbiterDecision. Detect an override when --force moves a WP forward from planned and the latest event for that WP was a for_review → planned transition carrying a review_ref. Run the checklist and persist the decision in the review-cycle artifact frontmatter. review_ref points to the same review-cycle:// artifact, so no new pointer scheme is needed.

Risks: Override detection must not trigger on normal claim/re-claim workflows. Only trigger when a rejection event exists in the log.
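Example sketch (illustrative): the override-detection rule expressed as a pure function over the event log. The Event dataclass is a simplified stand-in for the real status event type, and the function name and parameters are hypothetical; only the lane names for_review and planned and the review_ref field come from the plan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified status event."""
    wp_id: str
    from_lane: str
    to_lane: str
    review_ref: str | None = None


def is_arbiter_override(events: list[Event], wp_id: str,
                        current_lane: str, force: bool) -> bool:
    """True only when --force moves a just-rejected WP forward from planned."""
    if not force or current_lane != "planned":
        return False
    history = [event for event in events if event.wp_id == wp_id]
    if not history:
        return False  # never reviewed: a normal first claim
    last = history[-1]
    # A rejection is a for_review -> planned transition that carries a review_ref.
    return (last.from_lane == "for_review"
            and last.to_lane == "planned"
            and last.review_ref is not None)
```

Requiring the rejection to be the latest event for the WP, rather than any past event, is what keeps ordinary claim and re-claim moves from triggering the checklist.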

Execution Tracks

Track A: Review artifact pipeline (sequential)

WP01 (artifact model) ──> WP02 (fix-mode prompts + wiring)

Track B: Independent improvements (parallel)

WP03 (dirty-state classification)      ─┐
WP04 (baseline test capture)           ─┼── all independent
WP05 (concurrent review serialization) ─┤
WP06 (arbiter ergonomics)              ─┘

Maximum parallelization: up to 5 WPs can execute simultaneously. WP01 and WP03-WP06 can all start immediately; WP02 starts once WP01 completes.