Runtime Recovery And Audit Safety

Mission: 067-runtime-recovery-and-audit-safety
Priority: P1 stabilization
Type: software-dev
Target branch: main
Validated against: commit 1b01760e (2026-04-06)

Feature Overview

Spec Kitty's runtime currently cannot survive interruption gracefully, routes canonical workflows through a misapplied generic abstraction, lacks support for codebase-wide audit work, and misreports mission progress. This mission makes the runtime recoverable, removes the wrong shim abstraction, enables realistic audit and bulk-edit workflows, and ensures every operator-facing surface tells the truth about mission progress.

Background & Motivation

Problem Statement

Five categories of stabilization debt remain on main after the prior review-loop tranche:

1. Merge fragility: spec-kitty merge is non-idempotent. If interrupted mid-operation, it leaves incomplete cleanup with no supported recovery path. Operators must guess at manual Git state repair.

2. Implementation crash exposure: When the implementation phase crashes (process kill, network drop, OOM), existing branches and worktrees survive but Spec Kitty's state does not. There is no supported way to reconcile or resume — only manual Git escape hatches.

3. Wrong abstraction for canonical commands: The generic agent shim runtime resolves WP context before dispatching, which blocks non-WP commands (like accept) behind unnecessary context resolution. The accept action has shim support but is rejected by the canonical action resolver, creating an inconsistent dead path.

4. No audit-mode or bulk-edit safety: Audit and cutover work packages need to operate across the entire codebase, but the current WP ownership model forces fake narrow scope. Template and documentation directories are invisible to audit validation. Bulk rename/cutover edits have no guardrail for distinguishing string occurrence categories (identifiers vs. prose vs. comments), leading to silent breakage.

5. Dishonest progress reporting: The CLI dashboard and downstream sync surfaces compute progress as done / total. Work packages that are claimed, in_progress, for_review, or approved contribute 0% — operators see 0% progress even when most WPs are nearly complete.

Scope

In scope (8 issues):

#	Issue	Summary
1	#416	Merge is non-idempotent; leaves incomplete cleanup after interruption
2	#415	No crash recovery for the implementation phase
3	#414	`accept` action not registered in context resolver despite shim support
4	#412	Generic agent shim runtime should be replaced with direct canonical commands
5	#442	No codebase-wide audit WP mode; template/doc coverage missing from validation
6	#447	Completion percentage only counts WPs in `done`
7	#443	Duplicate/smaller framing of progress bug; consolidated into #447
8	#393	No guardrail for distinguishing string occurrence categories in bulk edits

Explicitly not in scope:

#401 — Revalidated as stale; current emitter already writes top-level from_lane/to_lane
Review-loop issues from the prior tranche: #430, #432, #433, #439, #440, #441, #444

Actors

Actor	Description
Operator	Human developer or CI system running `spec-kitty` CLI commands
Agent	AI coding agent (Claude, Codex, Gemini, etc.) executing WP workflows via slash commands
Reviewer	Human or agent performing review and acceptance of completed WPs
Auditor	Agent or human conducting codebase-wide audit/cutover work

User Scenarios & Testing

Scenario 1: Merge Recovery After Interruption

Actor: Operator
Trigger: spec-kitty merge is interrupted by process kill, network failure, or Ctrl-C during a multi-WP merge sequence.
Flow: Operator reruns spec-kitty merge (or spec-kitty merge --resume). The system detects the partial state, identifies which WPs completed and which did not, and resumes from the last incomplete WP without requiring manual Git cleanup.
Success: All WPs eventually merge successfully. No duplicate status events are emitted for already-completed WPs. State is consistent.
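The resume behavior in this scenario could be sketched as below. This is a minimal illustration under assumptions, not the actual MergeState implementation: the completed_wps field, the helper names, and the merge_one callback are all hypothetical. The sketch records a WP as completed only after its merge succeeds, so a rerun skips finished WPs and picks up at the first incomplete one.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def load_completed(state_path: Path) -> list[str]:
    """Return WP ids already recorded as merged ([] when no state file exists)."""
    if not state_path.exists():
        return []
    return list(json.loads(state_path.read_text())["completed_wps"])


def save_completed(state_path: Path, completed: list[str]) -> None:
    """Write to a temp file, then rename it over the target path.

    An interruption during the write leaves the previous state file untouched.
    """
    state_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump({"completed_wps": completed}, fh)
    os.replace(tmp_name, state_path)


def resume_merge(
    state_path: Path,
    wps: Iterable[str],
    merge_one: Callable[[str], None],
) -> list[str]:
    """Merge every WP not yet recorded as completed; return the WPs merged in this run."""
    completed = load_completed(state_path)
    merged_now: list[str] = []
    for wp in wps:
        if wp in completed:
            continue  # finished in an earlier, interrupted run
        merge_one(wp)  # hypothetical per-WP merge; may raise on interruption
        completed.append(wp)
        save_completed(state_path, completed)  # persist only after success
        merged_now.append(wp)
    return merged_now
```

The sketch assumes merge_one is itself safe to rerun for the WP that was in flight when the interruption happened; that part of the recovery is not shown here.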

Scenario 2: Implementation Crash and Reconciliation

Actor: Agent
Trigger: Agent process dies during spec-kitty implement WP03. The Git branch and worktree exist, but Spec Kitty's internal state (lane transitions, workspace tracking) is inconsistent.
Flow: Operator or agent runs a recovery/reconciliation command. The system detects existing branches and worktrees, reconciles them with the expected state, and allows implementation to continue without starting over.
Success: The WP resumes from its last consistent state. No work is lost. No manual git worktree commands needed.
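One way to picture the reconciliation step is to compare recorded state with what Git reports on disk. The sketch below parses the text printed by git worktree list --porcelain and classifies each WP. The recorded WP-to-branch mapping, the branch names, and the resumable/missing/untracked report shape are assumptions made for illustration, not Spec Kitty's actual state model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Worktree:
    path: str
    branch: str | None  # None when the worktree has a detached HEAD


def parse_worktrees(porcelain: str) -> list[Worktree]:
    """Parse the text printed by `git worktree list --porcelain`."""
    worktrees: list[Worktree] = []
    path: str | None = None
    branch: str | None = None
    # Entries are separated by blank lines; the appended "" flushes the last one.
    for line in porcelain.splitlines() + [""]:
        if line.startswith("worktree "):
            path, branch = line[len("worktree "):], None
        elif line.startswith("branch refs/heads/"):
            branch = line[len("branch refs/heads/"):]
        elif line == "" and path is not None:
            worktrees.append(Worktree(path, branch))
            path = None
    return worktrees


def reconcile(recorded: dict[str, str], worktrees: list[Worktree]) -> dict[str, list[str]]:
    """Compare a recorded WP -> branch mapping with branches checked out on disk."""
    live = {wt.branch for wt in worktrees if wt.branch is not None}
    return {
        # WPs whose branch still has a worktree: candidates for resuming
        "resumable": sorted(wp for wp, br in recorded.items() if br in live),
        # WPs whose recorded branch has no worktree on disk
        "missing": sorted(wp for wp, br in recorded.items() if br not in live),
        # checked-out branches that no recorded WP refers to
        "untracked": sorted(live - set(recorded.values())),
    }
```

In this sketch the main checkout also shows up under untracked, so a real command would need to filter it out or treat it separately.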

Scenario 3: Direct Canonical Command Execution

Actor: Agent
Trigger: Agent invokes a slash command (e.g., /spec-kitty.accept) or CLI command that was previously routed through the generic agent shim runtime.
Flow: The command executes directly against the canonical command surface without intermediate WP context resolution that would block non-WP-scoped commands.
Success: accept and all other canonical actions resolve and execute consistently. No "action not registered" errors for actions that have shim support.
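The consistency this scenario asks for can be illustrated with a small registry: every action that has a generated command file must also have a resolver entry, and dispatch never resolves WP context first. The registry, decorator, and handler bodies below are hypothetical and only show the invariant; they are not Spec Kitty's resolver.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[list[str]], str]

# Hypothetical registry mapping canonical action names to handlers.
ACTIONS: dict[str, Handler] = {}


def register(name: str) -> Callable[[Handler], Handler]:
    """Decorator that records a handler under its canonical action name."""
    def decorator(fn: Handler) -> Handler:
        ACTIONS[name] = fn
        return fn
    return decorator


@register("implement")
def implement(args: list[str]) -> str:
    return "implement " + " ".join(args)  # WP-scoped: expects a WP id


@register("accept")
def accept(args: list[str]) -> str:
    return "accept"  # not WP-scoped: runs without any WP context


def dispatch(action: str, args: list[str]) -> str:
    """Call the handler directly; no WP context is resolved beforehand."""
    handler = ACTIONS.get(action)
    if handler is None:
        raise KeyError(f"action not registered: {action}")
    return handler(args)


def unregistered(command_files: set[str]) -> set[str]:
    """Actions that have a generated command file but no resolver entry."""
    return command_files - set(ACTIONS)
```

A test in the spirit of FR-008 would assert that unregistered(...) is empty for the full set of generated command names.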

Scenario 4: Codebase-Wide Audit Work Package

Actor: Auditor
Trigger: A mission includes a WP whose job is repo-wide leak detection, terminology audit, or template coverage validation.
Flow: The auditor defines an audit-scoped WP that operates across the entire codebase. Validation explicitly checks command template directories and documentation files. Template/doc coverage gaps produce warnings or errors.
Success: The audit WP runs without being forced into fake narrow file ownership. All template and doc directories are surfaced as audit targets.
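A coverage check of this kind might look like the sketch below. It lists every file under a set of required directories that the audit did not scan. The function name, the way scanned files are tracked, and the directory names used in the usage note are assumptions for illustration only.

```python
from __future__ import annotations

from pathlib import Path


def coverage_gaps(repo_root: Path, required_dirs: list[str], scanned: set[str]) -> list[str]:
    """Return repo-relative paths under required_dirs that the audit did not scan.

    `scanned` holds repo-relative POSIX paths of files the audit visited.
    A required directory that does not exist is reported as a gap as well.
    """
    gaps: list[str] = []
    for rel_dir in required_dirs:
        base = repo_root / rel_dir
        if not base.is_dir():
            gaps.append(f"{rel_dir}/ (directory missing)")
            continue
        for path in sorted(base.rglob("*")):
            rel = path.relative_to(repo_root).as_posix()
            if path.is_file() and rel not in scanned:
                gaps.append(rel)
    return gaps
```

For example, calling coverage_gaps(root, ["templates", "docs"], scanned) with hypothetical directory names returns an empty list only when every file in both trees was visited; a non-empty result would map to the warnings or errors described above.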

Scenario 5: Bulk Rename with Occurrence Classification

Actor: Agent performing terminology cutover
Trigger: A cutover WP requires renaming a term across the codebase.
Flow: Before bulk edits proceed, the system requires classification of each occurrence category (identifier, prose, comment, path, configuration). After edits, a verification step confirms no unintended changes leaked across categories.
Success: Identifiers are renamed correctly. Prose mentions are updated appropriately. No silent breakage in paths, configs, or unrelated string matches.
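The classify-then-verify flow could be represented with a structured report like the one sketched here. The five categories come from the scenario; the Occurrence record, the approved-category set, and the verify_edits helper are hypothetical. Classification itself is assumed to be supplied by the agent or operator rather than computed automatically.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Category(str, Enum):
    IDENTIFIER = "identifier"
    PROSE = "prose"
    COMMENT = "comment"
    PATH = "path"
    CONFIGURATION = "configuration"


@dataclass(frozen=True)
class Occurrence:
    file: str
    line: int
    category: Category


def verify_edits(
    classified: list[Occurrence],
    approved: set[Category],
    changed: set[tuple[str, int]],
) -> dict[str, list]:
    """Post-edit check: flag changes outside the approved categories.

    `changed` holds (file, line) pairs touched by the bulk edit.
    """
    known = {(o.file, o.line) for o in classified}
    return {
        # occurrences that were edited although their category was not approved
        "leaked": [o for o in classified
                   if (o.file, o.line) in changed and o.category not in approved],
        # edited locations that were never classified beforehand
        "unclassified": sorted(changed - known),
    }
```

A verification report with both lists empty corresponds to the "category-correct changes" outcome; anything else is surfaced to the operator for review.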

Scenario 6: Truthful Progress Dashboard

Actor: Operator
Trigger: Operator runs spec-kitty status or views the dashboard during an active mission where 3 of 5 WPs are for_review and 1 is in_progress.
Flow: The progress display shows a percentage reflecting the weighted contribution of all in-flight states, not just done.
Success: Progress shows a meaningful non-zero percentage (e.g., ~70%) instead of the 0% that done / total reports while no WP has reached done.
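The difference between the two formulas can be made concrete with a short sketch. The weights below are purely illustrative: the spec leaves the real values to the planning phase, and this is not the existing weighted progress module. The planned lane name used for the fifth WP is also an assumption; any lane without a weight counts as 0.

```python
from __future__ import annotations

# Hypothetical weights, for illustration only.
LANE_WEIGHTS = {
    "claimed": 0.1,
    "in_progress": 0.5,
    "for_review": 0.9,
    "approved": 0.95,
    "done": 1.0,
}


def done_only_progress(lanes: list[str]) -> float:
    """Current behavior as described in the problem statement: done / total, in percent."""
    if not lanes:
        return 0.0
    return round(100.0 * sum(1 for lane in lanes if lane == "done") / len(lanes), 1)


def weighted_progress(lanes: list[str]) -> float:
    """Weighted alternative: each WP contributes its lane's weight, in percent."""
    if not lanes:
        return 0.0
    total = sum(LANE_WEIGHTS.get(lane, 0.0) for lane in lanes)
    return round(100.0 * total / len(lanes), 1)


# Scenario 6: three WPs in for_review, one in_progress, one not started.
lanes = ["for_review", "for_review", "for_review", "in_progress", "planned"]
print(done_only_progress(lanes), weighted_progress(lanes))
```

Under these illustrative weights the weighted figure is (3 × 0.9 + 0.5 + 0) / 5 = 64%, while done / total stays at 0%. Different weights shift the number, which is why a single shared formula matters more than the specific values.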

Functional Requirements

ID	Requirement	Status
FR-001	The merge operation shall detect and recover from partial completion state when rerun after interruption	Proposed
FR-002	The merge operation shall track per-WP completion state persistently so that completed WPs are not re-merged on retry	Proposed
FR-003	The merge operation shall guard against duplicate status events on retry by checking for existing event_ids before emitting transitions for already-completed WPs	Proposed
FR-004	The implementation workflow shall provide a reconciliation command that detects existing branches and worktrees and aligns internal state with filesystem reality	Proposed
FR-005	Implementation recovery shall allow continuation of a WP from its last consistent state without requiring the operator to manually repair Git state	Proposed
FR-006	Generated CLI-driven command files shall invoke canonical commands directly, without routing through a generic shim runtime	Proposed
FR-007	The `accept` action shall be registered in the action resolver and shall execute successfully when invoked	Proposed
FR-008	All canonical actions that have shim support shall have corresponding entries in the action resolver	Proposed
FR-009	Audit-scoped work packages shall be definable with codebase-wide ownership rather than narrow file-set ownership	Proposed
FR-010	Audit validation shall explicitly include command template directories and documentation files as coverage targets	Proposed
FR-011	Bulk rename/cutover workflows shall require occurrence classification before edits proceed	Proposed
FR-012	Bulk rename/cutover workflows shall include a post-edit verification step that confirms no unintended cross-category changes	Proposed
FR-013	A single canonical progress formula shall be used across CLI, dashboard, and downstream sync surfaces	Proposed
FR-014	The progress formula shall assign weighted contributions to `claimed`, `in_progress`, `for_review`, and `approved` states, not only `done`	Proposed
FR-015	Issue #443 shall be closed or cross-linked as consolidated into #447 when the progress fix ships	Proposed
FR-016	Audit validation shall produce warnings or errors when command templates or documentation have coverage gaps relative to the audit scope	Proposed
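The duplicate-event guard in FR-003 could work along the lines of the sketch below, which appends a transition to a JSONL log only if its event_id is not already present. It assumes a retry produces the same event_id for the same transition (for example, an id derived deterministically from the transition). The wp_id field and the function name are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def append_event_once(log_path: Path, event: dict) -> bool:
    """Append `event` as one JSONL line unless its event_id is already logged.

    Returns True when the event was written, False when it was skipped.
    """
    seen: set[str] = set()
    if log_path.exists():
        with log_path.open() as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines in the log
                    seen.add(json.loads(line)["event_id"])
    if event["event_id"] in seen:
        return False  # already emitted by an earlier, interrupted run
    with log_path.open("a") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return True
```

This version rereads the whole log on every call, which keeps the sketch simple; a real implementation would more likely load the seen ids once per merge run.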

Non-Functional Requirements

ID	Requirement	Threshold	Status
NFR-001	Merge recovery shall complete within 2x the wall-clock time of a clean merge for the same WP set	≤ 2x clean merge time	Proposed
NFR-002	Implementation reconciliation shall detect and report stale state within 5 seconds for a feature with up to 20 WPs	≤ 5 seconds	Proposed
NFR-003	Progress computation shall produce identical results across CLI, dashboard, and sync surfaces for the same event log input	100% consistency	Proposed
NFR-004	Occurrence classification output shall be human-reviewable (structured, not opaque)	Structured categories visible to operator	Proposed
NFR-005	New code shall maintain 90%+ test coverage	≥ 90% line coverage	Proposed
NFR-006	All new code shall pass strict type checking with no errors	0 type errors	Proposed

Constraints

ID	Constraint	Status
C-001	Merge recovery must work with the existing event-log-based status model (Phase 2); no regression to frontmatter-based state	Confirmed
C-002	Implementation recovery must operate through Spec Kitty's workflow commands, not require manual Git-only escape hatches	Confirmed
C-003	The shim removal must preserve compatibility with all 12 supported AI agents' command file formats	Confirmed
C-004	Audit-mode changes must not break the existing narrow-ownership WP model for non-audit work packages	Confirmed
C-005	Progress formula changes must not break existing SaaS sync or downstream API consumers	Confirmed
C-006	Issue #401 is excluded — the current emitter already handles `from_lane`/`to_lane` correctly	Confirmed
C-007	Review-loop issues from the prior tranche (#430, #432, #433, #439, #440, #441, #444) are excluded	Confirmed

Success Criteria

1. An operator whose merge was interrupted at WP03 of 5 can rerun the merge command and have it complete WP03–WP05 without manual Git intervention, achieving full merge in one additional invocation.
2. An agent whose implementation session crashed can invoke a recovery command and resume coding on the same WP within 30 seconds, with no lost commits.
3. Every slash command across all 12 agent surfaces executes its canonical action directly; no command routes through a generic shim runtime for dispatch.
4. An auditor can define and execute a WP that scans the entire repository, including all command template and documentation directories, without fabricating narrow file ownership.
5. A mission with 5 WPs where 3 are for_review and 1 is in_progress shows progress of approximately 60–75% (not 0%) across all operator-facing surfaces.
6. A bulk rename of a term produces a structured classification report before any files are modified, and a verification report afterward confirming category-correct changes.

Suggested Work Package Decomposition

WP	Title	Issues	Summary
WP01	Merge interruption and recovery	#416	Make merge idempotent or resumable; prevent half-written state
WP02	Implementation crash recovery	#415	Reconcile existing branches/worktrees after interruption
WP03	Canonical execution surface cleanup	#412, #414	Remove generic shim runtime; register `accept` in action resolver
WP04	Audit-mode and bulk-edit safety	#442, #393	Codebase-wide WP scope; occurrence classification for bulk edits. Planning note: contains two distinct concerns (audit scope relaxation vs. occurrence classification workflow) — consider splitting during `/spec-kitty.plan`
WP05	Canonical progress reporting	#447, #443	Single weighted progress formula across all surfaces

Suggested Execution Order

Based on risk, dependency, and operator impact analysis:

1. WP05 first: lowest risk, highest operator impact; the weighted progress module already exists and is tested, so this is a callsite-replacement task.
2. WP03 + WP01 in parallel: independent of each other; WP03 is a shim removal plus migration, WP01 is a merge state extension.
3. WP02 next: builds on understanding from WP01's merge state work; can reference recovery patterns established there.
4. WP04 last: hardest WP; benefits from all prior stabilization being in place.

Dependencies

Dependency	Impact
Existing merge state model (`MergeState`, mission-scoped at `.kittify/runtime/merge/<mission_id>/state.json`)	WP01 extends this; must understand current mission-scoped persistence format
Event-log status model (Phase 2)	WP01 and WP05 interact with event log; must not regress
Agent directory configuration (`get_agent_dirs_for_project`)	WP03 must generate direct command files for all configured agents
Lane/worktree resolution	WP02 must reconcile against existing lane-based worktree paths

Assumptions

1. The existing MergeState persistence model (mission-scoped at .kittify/runtime/merge/<mission_id>/state.json) is the right foundation for merge recovery (extend, not replace).
2. Implementation recovery targets the lane-based worktree model (2.x+); legacy per-WP worktrees are not a recovery target.
3. After shim removal, accept becomes a direct canonical command (like implement and review), invoked directly rather than through shim dispatch.
4. Weighted progress values for intermediate lanes are a design decision for planning phase (not specified here).
5. Occurrence classification is a workflow step (prompt/template guidance + structured output), not a fully automated NLP classifier.

Risks

Risk	Likelihood	Impact	Mitigation
Merge recovery introduces new race conditions in concurrent agent scenarios	Medium	High	Test with simulated interruption at each WP boundary; ensure atomic state transitions
Merge retry emits duplicate status events for already-completed WPs	High	Medium	FR-003 requires event_id dedup check before emitting; JSONL is line-atomic so partial writes are unlikely, but duplicates on retry are the real risk
Removing the shim runtime breaks agent command files that depend on shim-resolved context	Low	High	C-003 requires 12-agent compatibility; test at least claude, codex, opencode
Weighted progress formula disagreements across surfaces if formula is not truly shared	Medium	Medium	FR-013 mandates single formula; test all three surfaces against same input
Audit-mode WPs create overly broad blast radius for changes	Low	Medium	Audit scope is read-only validation + classification; actual edits still require per-file review

Issue Hygiene

Each in-scope issue shall be updated with current-main root cause findings during implementation.
#401 is explicitly documented as revalidated-stale and excluded.
#443 shall be closed or cross-linked as consolidated into #447 when the progress fix implementation fully subsumes it.
This mission shall not silently widen into long-horizon redesign work (JSON-canonical planning, doctrine architecture).

Verification Expectations

Targeted tests for interrupted merge/retry or merge recovery behavior
Tests for recovery/reconciliation of existing implementation branches/worktrees
Tests proving generated command surfaces use direct canonical commands after shim cleanup
Tests covering accept action resolution consistency
Tests for audit-mode validation behavior around template/doc coverage and codebase-wide scope
Tests for occurrence classification or equivalent guardrail enforcement in bulk-edit workflows
Tests for shared weighted-progress calculation across CLI/dashboard surfaces

Definition of Done

This mission is done when Spec Kitty recovers from interrupted merges and crashed implementation sessions through supported commands, no longer routes canonical workflows through the generic shim runtime, supports realistic audit/cutover work, and reports mission progress truthfully across its operator-facing surfaces.