Runtime Recovery And Audit Safety

Mission: 067-runtime-recovery-and-audit-safety Priority: P1 stabilization Type: software-dev Target branch: main Validated against: commit 1b01760e (2026-04-06)

Feature Overview

Spec Kitty's runtime currently cannot survive interruption gracefully, routes canonical workflows through a misapplied generic abstraction, lacks support for codebase-wide audit work, and misreports mission progress. This mission makes the runtime recoverable, removes the wrong shim abstraction, enables realistic audit and bulk-edit workflows, and ensures every operator-facing surface tells the truth about mission progress.

Background & Motivation

Problem Statement

Five categories of stabilization debt remain on main after the prior review-loop tranche:

1. Merge fragility: spec-kitty merge is non-idempotent. If interrupted mid-operation, it leaves incomplete cleanup with no supported recovery path. Operators must guess at manual Git state repair.

2. Implementation crash exposure: When the implementation phase crashes (process kill, network drop, OOM), existing branches and worktrees survive but Spec Kitty's state does not. There is no supported way to reconcile or resume — only manual Git escape hatches.

3. Wrong abstraction for canonical commands: The generic agent shim runtime resolves WP context before dispatching, which blocks non-WP commands (like accept) behind unnecessary context resolution. The accept action has shim support but is rejected by the canonical action resolver, creating an inconsistent dead path.

4. No audit-mode or bulk-edit safety: Audit and cutover work packages need to operate across the entire codebase, but the current WP ownership model forces fake narrow scope. Template and documentation directories are invisible to audit validation. Bulk rename/cutover edits have no guardrail for distinguishing string occurrence categories (identifiers vs. prose vs. comments), leading to silent breakage.

5. Dishonest progress reporting: The CLI dashboard and downstream sync surfaces compute progress as done / total. Work packages that are claimed, in_progress, for_review, or approved contribute 0% — operators see 0% progress even when most WPs are nearly complete.

Scope

In scope (8 issues):

#IssueSummary
1#416Merge is non-idempotent; leaves incomplete cleanup after interruption
2#415No crash recovery for the implementation phase
3#414accept action not registered in context resolver despite shim support
4#412Generic agent shim runtime should be replaced with direct canonical commands
5#442No codebase-wide audit WP mode; template/doc coverage missing from validation
6#447Completion percentage only counts WPs in done
7#443Duplicate/smaller framing of progress bug; consolidated into #447
8#393No guardrail for distinguishing string occurrence categories in bulk edits

Explicitly not in scope:

  • #401 — Revalidated as stale; current emitter already writes top-level from_lane/to_lane
  • Review-loop issues from the prior tranche: #430, #432, #433, #439, #440, #441, #444

Actors

ActorDescription
OperatorHuman developer or CI system running spec-kitty CLI commands
AgentAI coding agent (Claude, Codex, Gemini, etc.) executing WP workflows via slash commands
ReviewerHuman or agent performing review and acceptance of completed WPs
AuditorAgent or human conducting codebase-wide audit/cutover work

User Scenarios & Testing

Scenario 1: Merge Recovery After Interruption

Actor: Operator Trigger: spec-kitty merge is interrupted by process kill, network failure, or Ctrl-C during a multi-WP merge sequence. Flow: Operator reruns spec-kitty merge (or spec-kitty merge --resume). The system detects the partial state, identifies which WPs completed and which did not, and resumes from the last incomplete WP without requiring manual Git cleanup. Success: All WPs eventually merge successfully. No duplicate status events are emitted for already-completed WPs. State is consistent.

Scenario 2: Implementation Crash and Reconciliation

Actor: Agent Trigger: Agent process dies during spec-kitty implement WP03. The Git branch and worktree exist, but Spec Kitty's internal state (lane transitions, workspace tracking) is inconsistent. Flow: Operator or agent runs a recovery/reconciliation command. The system detects existing branches and worktrees, reconciles them with the expected state, and allows implementation to continue without starting over. Success: The WP resumes from its last consistent state. No work is lost. No manual git worktree commands needed.

Scenario 3: Direct Canonical Command Execution

Actor: Agent Trigger: Agent invokes a slash command (e.g., /spec-kitty.accept) or CLI command that was previously routed through the generic agent shim runtime. Flow: The command executes directly against the canonical command surface without intermediate WP context resolution that would block non-WP-scoped commands. Success: accept and all other canonical actions resolve and execute consistently. No "action not registered" errors for actions that have shim support.

Scenario 4: Codebase-Wide Audit Work Package

Actor: Auditor Trigger: A mission includes a WP whose job is repo-wide leak detection, terminology audit, or template coverage validation. Flow: The auditor defines an audit-scoped WP that operates across the entire codebase. Validation explicitly checks command template directories and documentation files. Template/doc coverage gaps produce warnings or errors. Success: The audit WP runs without being forced into fake narrow file ownership. All template and doc directories are surfaced as audit targets.

Scenario 5: Bulk Rename with Occurrence Classification

Actor: Agent performing terminology cutover Trigger: A cutover WP requires renaming a term across the codebase. Flow: Before bulk edits proceed, the system requires classification of each occurrence category (identifier, prose, comment, path, configuration). After edits, a verification step confirms no unintended changes leaked across categories. Success: Identifiers are renamed correctly. Prose mentions are updated appropriately. No silent breakage in paths, configs, or unrelated string matches.

Scenario 6: Truthful Progress Dashboard

Actor: Operator Trigger: Operator runs spec-kitty status or views the dashboard during an active mission where 3 of 5 WPs are for_review and 1 is in_progress. Flow: The progress display shows a percentage reflecting the weighted contribution of all in-flight states, not just done. Success: Progress shows a meaningful non-zero percentage (e.g., ~70%) rather than 0% because no WP has reached done yet.

Functional Requirements

IDRequirementStatus
FR-001The merge operation shall detect and recover from partial completion state when rerun after interruptionProposed
FR-002The merge operation shall track per-WP completion state persistently so that completed WPs are not re-merged on retryProposed
FR-003The merge operation shall guard against duplicate status events on retry by checking for existing event_ids before emitting transitions for already-completed WPsProposed
FR-004The implementation workflow shall provide a reconciliation command that detects existing branches and worktrees and aligns internal state with filesystem realityProposed
FR-005Implementation recovery shall allow continuation of a WP from its last consistent state without requiring the operator to manually repair Git stateProposed
FR-006Generated CLI-driven command files shall invoke canonical commands directly, without routing through a generic shim runtimeProposed
FR-007The accept action shall be registered in the action resolver and shall execute successfully when invokedProposed
FR-008All canonical actions that have shim support shall have corresponding entries in the action resolverProposed
FR-009Audit-scoped work packages shall be definable with codebase-wide ownership rather than narrow file-set ownershipProposed
FR-010Audit validation shall explicitly include command template directories and documentation files as coverage targetsProposed
FR-011Bulk rename/cutover workflows shall require occurrence classification before edits proceedProposed
FR-012Bulk rename/cutover workflows shall include a post-edit verification step that confirms no unintended cross-category changesProposed
FR-013A single canonical progress formula shall be used across CLI, dashboard, and downstream sync surfacesProposed
FR-014The progress formula shall assign weighted contributions to claimed, in_progress, for_review, and approved states, not only doneProposed
FR-015Issue #443 shall be closed or cross-linked as consolidated into #447 when the progress fix shipsProposed
FR-016Audit validation shall produce warnings or errors when command templates or documentation have coverage gaps relative to the audit scopeProposed

Non-Functional Requirements

IDRequirementThresholdStatus
NFR-001Merge recovery shall complete within 2x the wall-clock time of a clean merge for the same WP set≤ 2x clean merge timeProposed
NFR-002Implementation reconciliation shall detect and report stale state within 5 seconds for a feature with up to 20 WPs≤ 5 secondsProposed
NFR-003Progress computation shall produce identical results across CLI, dashboard, and sync surfaces for the same event log input100% consistencyProposed
NFR-004Occurrence classification output shall be human-reviewable (structured, not opaque)Structured categories visible to operatorProposed
NFR-005New code shall maintain 90%+ test coverage≥ 90% line coverageProposed
NFR-006All new code shall pass strict type checking with no errors0 type errorsProposed

Constraints

IDConstraintStatus
C-001Merge recovery must work with the existing event-log-based status model (Phase 2); no regression to frontmatter-based stateConfirmed
C-002Implementation recovery must operate through Spec Kitty's workflow commands, not require manual Git-only escape hatchesConfirmed
C-003The shim removal must preserve compatibility with all 12 supported AI agents' command file formatsConfirmed
C-004Audit-mode changes must not break the existing narrow-ownership WP model for non-audit work packagesConfirmed
C-005Progress formula changes must not break existing SaaS sync or downstream API consumersConfirmed
C-006Issue #401 is excluded — the current emitter already handles from_lane/to_lane correctlyConfirmed
C-007Review-loop issues from the prior tranche (#430, #432, #433, #439, #440, #441, #444) are excludedConfirmed

Success Criteria

1. An operator whose merge was interrupted at WP03 of 5 can rerun the merge command and have it complete WP03–WP05 without manual Git intervention, achieving full merge in one additional invocation. 2. An agent whose implementation session crashed can invoke a recovery command and resume coding on the same WP within 30 seconds, with no lost commits. 3. Every slash command across all 12 agent surfaces executes its canonical action directly — no command routes through a generic shim runtime for dispatch. 4. An auditor can define and execute a WP that scans the entire repository, including all command template and documentation directories, without fabricating narrow file ownership. 5. A mission with 5 WPs where 3 are for_review and 1 is in_progress shows progress of approximately 60–75% (not 0%) across all operator-facing surfaces. 6. A bulk rename of a term produces a structured classification report before any files are modified, and a verification report afterward confirming category-correct changes.

Suggested Work Package Decomposition

WPTitleIssuesSummary
WP01Merge interruption and recovery#416Make merge idempotent or resumable; prevent half-written state
WP02Implementation crash recovery#415Reconcile existing branches/worktrees after interruption
WP03Canonical execution surface cleanup#412, #414Remove generic shim runtime; register accept in action resolver
WP04Audit-mode and bulk-edit safety#442, #393Codebase-wide WP scope; occurrence classification for bulk edits. Planning note: contains two distinct concerns (audit scope relaxation vs. occurrence classification workflow) — consider splitting during /spec-kitty.plan
WP05Canonical progress reporting#447, #443Single weighted progress formula across all surfaces

Suggested Execution Order

Based on risk, dependency, and operator impact analysis:

1. WP05 first — lowest risk, highest operator impact; the weighted progress module already exists and is tested; this is a callsite-replacement task 2. WP03 + WP01 in parallel — independent of each other; WP03 is a shim removal + migration, WP01 is merge state extension 3. WP02 next — builds on understanding from WP01's merge state work; can reference recovery patterns established there 4. WP04 last — hardest WP, benefits from all prior stabilization being in place

Dependencies

DependencyImpact
Existing merge state model (MergeState, mission-scoped at .kittify/runtime/merge/<mission_id>/state.json)WP01 extends this; must understand current mission-scoped persistence format
Event-log status model (Phase 2)WP01 and WP05 interact with event log; must not regress
Agent directory configuration (get_agent_dirs_for_project)WP03 must generate direct command files for all configured agents
Lane/worktree resolutionWP02 must reconcile against existing lane-based worktree paths

Assumptions

1. The existing MergeState persistence model (mission-scoped at .kittify/runtime/merge/<mission_id>/state.json) is the right foundation for merge recovery (extend, not replace). 2. Implementation recovery targets the lane-based worktree model (2.x+); legacy per-WP worktrees are not a recovery target. 3. After shim removal, accept becomes a direct canonical command (like implement and review), invoked directly rather than through shim dispatch. 4. Weighted progress values for intermediate lanes are a design decision for planning phase (not specified here). 5. Occurrence classification is a workflow step (prompt/template guidance + structured output), not a fully automated NLP classifier.

Risks

RiskLikelihoodImpactMitigation
Merge recovery introduces new race conditions in concurrent agent scenariosMediumHighTest with simulated interruption at each WP boundary; ensure atomic state transitions
Merge retry emits duplicate status events for already-completed WPsHighMediumFR-003 requires event_id dedup check before emitting; JSONL is line-atomic so partial writes are unlikely, but duplicates on retry are the real risk
Removing the shim runtime breaks agent command files that depend on shim-resolved contextLowHighC-003 requires 12-agent compatibility; test at least claude, codex, opencode
Weighted progress formula disagreements across surfaces if formula is not truly sharedMediumMediumFR-013 mandates single formula; test all three surfaces against same input
Audit-mode WPs create overly broad blast radius for changesLowMediumAudit scope is read-only validation + classification; actual edits still require per-file review

Issue Hygiene

  • Each in-scope issue shall be updated with current-main root cause findings during implementation.
  • #401 is explicitly documented as revalidated-stale and excluded.
  • #443 shall be closed or cross-linked as consolidated into #447 when the progress fix implementation fully subsumes it.
  • This mission shall not silently widen into long-horizon redesign work (JSON-canonical planning, doctrine architecture).

Verification Expectations

  • Targeted tests for interrupted merge/retry or merge recovery behavior
  • Tests for recovery/reconciliation of existing implementation branches/worktrees
  • Tests proving generated command surfaces use direct canonical commands after shim cleanup
  • Tests covering accept action resolution consistency
  • Tests for audit-mode validation behavior around template/doc coverage and codebase-wide scope
  • Tests for occurrence classification or equivalent guardrail enforcement in bulk-edit workflows
  • Tests for shared weighted-progress calculation across CLI/dashboard surfaces

Definition of Done

This mission is done when Spec Kitty can survive interruption better, stop routing canonical workflows through the wrong shim abstraction, support realistic audit/cutover work, and report mission progress truthfully across its operator-facing surfaces.