Tasks: CLI Session Survival and Daemon Singleton

Mission: cli-session-survival-daemon-singleton-01KQ9M3M Branch: main (planning, base, and merge target — all main) Spec: spec.md Plan: plan.md Generated: 2026-04-28


Branch Strategy

  • Current branch at workflow start: main
  • Planning/base branch for this mission: main
  • Final merge target for completed changes: main
  • Branch matches target: ✅ true

Per spec-kitty agent context resolve: "Current branch at workflow start: main. Planning/base branch for this feature: main. Completed changes must merge into main."

Execution worktrees are allocated per computed lane from lanes.json after finalize-tasks runs. Agents working a WP MUST enter the workspace path printed by spec-kitty implement WP##, not reconstruct paths manually.


Subtask Index

IDDescriptionWPParallel
T001Create src/specify_cli/core/file_lock.py skeleton with LockRecord dataclass and module docstringWP01[D]
T002Implement MachineFileLock async context manager (acquire/release, hold-ceiling, atomic content write)WP01[D]
T003Implement read_lock_record() and force_release() helpers for diagnostic + repair useWP01[D]
T004Implement age-based stale-lock adoption (cooperate-after-timeout) + cross-platform primitive dispatchWP01[D]
T005Author tests/core/test_file_lock.py (7 cases per contracts/refresh-lock.md)WP01[D]
T006Create src/specify_cli/auth/refresh_transaction.py with RefreshOutcome enum + run_refresh_transaction() skeletonWP02[D]
T007Implement reload-before-refresh (FR-004 AdoptedNewer) + lock-timeout adopt/error branches (FR-016/FR-017)WP02[D]
T008Implement stale-grant reconciler (FR-005 CurrentRejectionCleared vs FR-006 StaleRejectionPreserved)WP02[D]
T009Wire user-readable re-login message + recovery command on confirmed-current rejection (FR-007)WP02[D]
T010Refactor TokenManager.refresh_if_needed to delegate to run_refresh_transaction; preserve in-process asyncio.Lock (FR-003); add INFO logs per outcome (FR-019)WP02[D]
T011Extend tests/auth/test_token_manager.py with new internal-flow coverage and golden FR-020 status output checkWP02[D]
T012Author tests/auth/concurrency/conftest.py (in-process fake refresh server fixture + tmp-rooted auth-store env override)WP03[D]
T013Author tests/auth/concurrency/test_machine_refresh_lock.py (concurrent same-process refresh produces one network call) and tests/auth/concurrency/test_stale_grant_preservation.py (rotate-then-stale-grant scenario + clear-message assertion)WP03[D]
T014Author tests/auth/concurrency/test_incident_regression.py (subprocess-based two-worker test reproducing the incident under file barriers; ≤30 s wall-clock)WP03[D]
T015Add DAEMON_TICK_SECONDS=30 constant and _start_self_check_tick(server, my_port) helper in sync/daemon.pyWP04[D]
T016Wire the tick thread into run_sync_daemon; cancel on serve_forever exitWP04[D]
T017Implement self-retirement decision (parse DAEMON_STATE_FILE; if recorded port ≠ self.port and record valid, call server.shutdown()); preserve no-rewrite invariant when state file missing/malformedWP04[D]
T018Author tests/sync/test_daemon_self_retirement.py (4 cases: retires-on-mismatch, continues-on-match, continues-on-missing, continues-on-malformed)WP04[D]
T019Create src/specify_cli/sync/orphan_sweep.py with OrphanDaemon and SweepReport frozen dataclasses + module docstringWP05[D]
T020Implement enumerate_orphans() — port scan 9400-9450, /api/health probe, classify by protocol_version+package_version, exclude state-file portWP05[D]
T021Implement sweep_orphans() — escalating shutdown (HTTP → terminate(1s) → kill(1s)) + state-file cleanup; return SweepReportWP05[D]
T022Author tests/sync/test_orphan_sweep.py (7 cases per contracts/daemon-singleton.md)WP05[D]
T023Create src/specify_cli/cli/commands/_auth_doctor.py with DoctorReport dataclass + assemble_report() (read-only data gathering only)WP06[D]
T024Implement Rich rendering for all 7 sections (Identity, Tokens, Storage, Refresh Lock, Daemon, Orphans, Findings) reusing _auth_status formattersWP06[D]
T025Implement findings-and-remediation logic (F-001..F-007) + exit-code policy (0 / 1 / 2)WP06[D]
T026Implement --json mode with schema_version: 1; assert no network calls fire on default invocation (C-007)WP06[D]
T027Wire @app.command() doctor in cli/commands/auth.py with --reset, --unstick-lock, --stuck-threshold, --json flags; --reset calls sweep_orphans; --unstick-lock calls force_release only when stuckWP06[D]
T028Author tests/auth/test_auth_doctor_report.py + tests/auth/test_auth_doctor_repair.py + tests/auth/test_auth_doctor_offline.py (combined per the contract test tables)WP06[D]

Total: 28 subtasks across 6 work packages.


Phase 1 — Setup

(No setup WPs required; the dev environment is configured and the auth and sync subsystems already exist.)


Phase 2 — Foundational

WP01 — Cross-platform machine-wide file lock helper

Goal: Build src/specify_cli/core/file_lock.py — a self-contained, cross-platform machine-wide lock helper used by the refresh transaction (WP02) and the doctor command (WP06). Wraps fcntl.flock (POSIX) and msvcrt.locking (Windows). Supports bounded acquisition, hold-ceiling enforcement, age-based staleness, and atomic JSON content for diagnostics. Priority: P1 (blocks WP02, WP06) Estimated prompt size: ~360 lines (5 subtasks × ~70 lines) Independent test: pytest tests/core/test_file_lock.py -v passes. Dependencies: none

Included subtasks:

  • ✅ T001 Module skeleton + LockRecord dataclass (WP01)
  • ✅ T002 MachineFileLock async context manager (WP01)
  • ✅ T003 read_lock_record + force_release helpers (WP01)
  • ✅ T004 Stale-lock adoption + cross-platform dispatch (WP01)
  • ✅ T005 Test suite — 7 cases per contract (WP01)

Implementation sketch: Stand up the dataclass and module docstring (T001), then build the MachineFileLock async context manager (T002) using a non-blocking fcntl.flock/msvcrt.locking acquire with a bounded-wait loop. Add read_lock_record/force_release (T003) for the doctor surface. Add age-based adoption (T004) so a process holding a stuck lock cannot block forever. Tests last (T005) cover all branches against a tmp-path lock root.

Parallel opportunities: T005 test scaffolding can be drafted alongside T002.

Risks: Cross-platform branch in T004 — Windows-only path is exercised on POSIX CI via pytest.mark.skipif. The lift of _is_daemon_lock_contention from sync/daemon.py is deferred: WP01 ships its own predicate locally; WP04 may import or unify later. This avoids overlapping owned_files between WP01 and WP04.

Prompt file: tasks/WP01-machine-file-lock-helper.md


Phase 3 — Story WPs

WP02 — Refresh transaction with stale-grant preservation

Goal: Build src/specify_cli/auth/refresh_transaction.py and rewire TokenManager.refresh_if_needed to delegate through it. The transaction is bounded by MachineFileLock (WP01). It reloads persisted material before deciding, adopts newer-and-valid material when present (FR-004), and on invalid_grant/session_invalid reloads again to distinguish stale-token rejection (preserve session, FR-006) from current-token rejection (clear session + tell the user how to re-login, FR-005/FR-007). Priority: P1 (the incident fix and the heart of the mission) Estimated prompt size: ~480 lines (6 subtasks × ~80 lines) Independent test: pytest tests/auth/test_token_manager.py -v passes; existing tests in this file remain green. Dependencies: WP01

Included subtasks:

  • ✅ T006 refresh_transaction.py skeleton + RefreshOutcome enum (WP02)
  • ✅ T007 Reload-before-refresh + lock-timeout adopt/error (WP02)
  • ✅ T008 Stale-grant reconciler — current-vs-stale rejection branches (WP02)
  • ✅ T009 User-readable re-login message + recovery command (WP02)
  • ✅ T010 TokenManager.refresh_if_needed delegates; preserve asyncio.Lock; INFO logs per outcome (WP02)
  • ✅ T011 Extend test_token_manager.py with new-flow coverage + golden FR-020 status output (WP02)

Implementation sketch: Build the new module top-down (skeleton → happy path → failure-mode reconciler → user-facing message). Then rewire TokenManager so the in-process asyncio.Lock runs first as a same-process fast path (FR-003) and run_refresh_transaction runs inside it. Tests assert each RefreshOutcome produces a unique observable side-effect (storage write or not, log line, exception class).

Parallel opportunities: T011 test additions can be drafted alongside T010 wiring once the RefreshOutcome enum is stable.

Risks: NFR-001's 50 ms p95 overhead is tight. The lock acquire path is the dominant cost; the bounded-acquire loop must use 100 ms sleep increments only when contended, not in the happy path. Backward-compat: RefreshTokenExpiredError and SessionInvalidError must still propagate so existing callers in auth/transport.py keep working.

Prompt file: tasks/WP02-refresh-transaction.md


WP03 — Concurrency and multiprocess regression tests

Goal: Verify WP01+WP02 behavior under concurrent and multiprocess load. Three tests: same-process concurrent refresh (one network call), rotate-then-stale-grant scenario (session preserved + clear message on real rejection), and the incident regression (two real subprocesses driving the rotate-then-stale-grant ordering against a fake server, bounded ≤30 s wall-clock per NFR-005). Priority: P1 (NFR-005 anchor) Estimated prompt size: ~280 lines (3 subtasks × ~90 lines) Independent test: pytest tests/auth/concurrency -v passes (all three tests). Dependencies: WP01, WP02

Included subtasks:

  • ✅ T012 tests/auth/concurrency/conftest.py — in-process fake refresh server + tmp-auth-store env override (WP03)
  • ✅ T013 test_machine_refresh_lock.py + test_stale_grant_preservation.py (WP03)
  • ✅ T014 test_incident_regression.py — subprocess-based two-worker test under file barriers (WP03)

Implementation sketch: Start with the conftest fixture (T012) so the rest of WP03 has a working harness. Add the deterministic single-process tests next (T013) to confirm the contract before scaling to subprocesses. Finish with T014 — the subprocess regression — using subprocess.Popen and file-system barriers (tmp_path / "rotated.flag") to sequence Worker A (rotates) then Worker B (stale-grant) without time.sleep-based ordering.

Parallel opportunities: Once T012 lands, T013 and T014 can be drafted in parallel.

Risks: Multiprocess test flake on slow CI (R5). Counter-design: file-barrier sequencing only, hard 30 s wall-clock cap, no shared event loops.

Prompt file: tasks/WP03-concurrency-and-incident-regression-tests.md


WP04 — Daemon self-retirement tick

Goal: Make every sync daemon poll DAEMON_STATE_FILE every DAEMON_TICK_SECONDS=30. If the recorded port differs from the daemon's own bound port, the daemon initiates clean shutdown. Closes the gap that allowed orphan daemons to accumulate after the existing daemon-spawn lock drift (FR-008/FR-010). Priority: P1 Estimated prompt size: ~280 lines (4 subtasks × ~70 lines) Independent test: pytest tests/sync/test_daemon_self_retirement.py -v passes. Dependencies: none

Included subtasks:

  • ✅ T015 DAEMON_TICK_SECONDS constant + _start_self_check_tick helper (WP04)
  • ✅ T016 Wire tick thread into run_sync_daemon; cancel on serve_forever exit (WP04)
  • ✅ T017 Self-retirement decision: parse state file, compare port, server.shutdown() (WP04)
  • ✅ T018 Test suite — retires/continues across 4 state-file scenarios (WP04)

Implementation sketch: Add the constant and helper as additive code in sync/daemon.py. Wire the helper as a daemon thread started in run_sync_daemon and cancel it in the finally block. The retirement decision must NEVER rewrite the state file (state ownership is _ensure_sync_daemon_running_locked only).

Parallel opportunities: Independent of WP01, WP02, WP03, WP05.

Risks: R6 — two concurrent daemons each thinking they won. Counter-design: the state file is the tiebreaker; whichever daemon's port is not in the state file shuts down. Tick interval (30 s) trades convergence latency against disk noise.

Prompt file: tasks/WP04-daemon-self-retirement-tick.md


WP05 — Orphan daemon sweep

Goal: Build src/specify_cli/sync/orphan_sweep.py with enumerate_orphans() (probe ports 9400-9450 via /api/health and classify by protocol_version+package_version) and sweep_orphans() (escalating shutdown: HTTP → SIGTERM → SIGKILL). Provides the user-triggered recovery path consumed by auth doctor --reset (WP06). Priority: P1 Estimated prompt size: ~340 lines (4 subtasks × ~85 lines) Independent test: pytest tests/sync/test_orphan_sweep.py -v passes (7 cases). Dependencies: none

Included subtasks:

  • ✅ T019 orphan_sweep.py skeleton — OrphanDaemon and SweepReport dataclasses (WP05)
  • ✅ T020 enumerate_orphans() — port scan + identity probe (WP05)
  • ✅ T021 sweep_orphans() — escalating shutdown + state-file cleanup (WP05)
  • ✅ T022 Test suite — 7 cases per contracts/daemon-singleton.md (WP05)

Implementation sketch: Module skeleton first (T019). The probe (T020) must use a tight per-port connect_ex timeout (50 ms) so the worst-case 50-port scan stays within NFR-006's 3 s ceiling. The sweep (T021) escalates: HTTP shutdown without token (best-effort, 403 expected), then psutil.Process.terminate() (1 s wait), then kill() (1 s wait). State-file cleanup is best-effort.

Parallel opportunities: Independent of WP02, WP03, WP04. Can run in parallel with WP04 in a different lane.

Risks: R4 — non-Spec-Kitty process listening on the same port. Counter-design: orphan classification requires both protocol_version and package_version JSON keys in the response; anything else is left alone.

Prompt file: tasks/WP05-orphan-daemon-sweep.md


WP06 — spec-kitty auth doctor command

Goal: Add the auth doctor typer command. Default invocation is read-only and reports 7 sections (Identity, Tokens, Storage, Refresh Lock, Daemon, Orphans, Findings/Remediation). --reset calls sweep_orphans() from WP05. --unstick-lock calls force_release() from WP01 only when the lock is older than --stuck-threshold (default 60 s). --json emits the schema in data-model.md. C-007 forbids network calls on the default path. Priority: P1 Estimated prompt size: ~520 lines (6 subtasks × ~85 lines — slightly over target because all paths share one rendering surface and need cohesive guidance) Independent test: pytest tests/auth/test_auth_doctor_report.py tests/auth/test_auth_doctor_repair.py tests/auth/test_auth_doctor_offline.py -v passes. Dependencies: WP01 (lock introspection), WP05 (orphan listing).

Included subtasks:

  • ✅ T023 _auth_doctor.py skeleton — DoctorReport + assemble_report() (read-only) (WP06)
  • ✅ T024 Rich rendering of 7 sections (reuse _auth_status formatters) (WP06)
  • ✅ T025 Findings + remediation (F-001..F-007) + exit-code policy (WP06)
  • ✅ T026 --json mode with schema_version: 1; assert no-network on default (WP06)
  • ✅ T027 Wire doctor typer subcommand with all flags in cli/commands/auth.py (WP06)
  • ✅ T028 Test suite — report + repair + offline (3 test files combined per contract) (WP06)

Implementation sketch: Build the data layer first (T023): pure functions reading session, lock, daemon, and orphan state into a DoctorReport. Add Rich rendering (T024) on top, reusing formatters from _auth_status.py. Layer findings/remediation logic (T025), then --json (T026). Wire typer last (T027). Tests gate every path including the C-007 offline guarantee (T028).

Parallel opportunities: T024 (rendering) and T026 (--json) can be drafted in parallel once T023 lands.

Risks: NFR-006's 3 s ceiling is tight if the port scan is naive. Counter-design: connect_ex with 50 ms per-port timeout filters closed ports fast; HTTP probe only fires for ports that accepted the connection.

Prompt file: tasks/WP06-auth-doctor-command.md


Phase 4 — Polish

(No polish WPs required. Documentation updates are folded into each WP's DoD as needed.)


Lane partition (proposed; finalize-tasks will compute)

  • Lane A (sequential foundation): WP01 → WP02 → WP03 → WP06
  • Lane B (independent): WP04
  • Lane C (independent): WP05

WP06 must follow WP01 + WP05; running WP06 in Lane A means it waits for both. If finalize-tasks chooses to put WP05 into a fourth lane, WP06 must rebase across both before merging.

Implementation order recommendation

WP01 first (foundation; blocks WP02 and WP06). Then WP02 (the incident fix). WP03 verifies WP01+WP02. WP04 and WP05 are independent and can run in parallel after WP01 lands. WP06 lands last because it consumes both WP01 and WP05.