Tasks: CLI Session Survival and Daemon Singleton
Mission: cli-session-survival-daemon-singleton-01KQ9M3M
Branch: main (planning, base, and merge target are all main)
Spec: spec.md
Plan: plan.md
Generated: 2026-04-28
Branch Strategy
- Current branch at workflow start: main
- Planning/base branch for this mission: main
- Final merge target for completed changes: main
- Branch matches target: ✅ true
Per spec-kitty agent context resolve: "Current branch at workflow start: main. Planning/base branch for this feature: main. Completed changes must merge into main."
Execution worktrees are allocated per computed lane from lanes.json after finalize-tasks runs. Agents working a WP MUST enter the workspace path printed by spec-kitty implement WP##, not reconstruct paths manually.
Subtask Index
| ID | Description | WP | Parallel |
|---|---|---|---|
| T001 | Create src/specify_cli/core/file_lock.py skeleton with LockRecord dataclass and module docstring | WP01 | [D] |
| T002 | Implement MachineFileLock async context manager (acquire/release, hold-ceiling, atomic content write) | WP01 | [D] |
| T003 | Implement read_lock_record() and force_release() helpers for diagnostic + repair use | WP01 | [D] |
| T004 | Implement age-based stale-lock adoption (cooperate-after-timeout) + cross-platform primitive dispatch | WP01 | [D] |
| T005 | Author tests/core/test_file_lock.py (7 cases per contracts/refresh-lock.md) | WP01 | [D] |
| T006 | Create src/specify_cli/auth/refresh_transaction.py with RefreshOutcome enum + run_refresh_transaction() skeleton | WP02 | [D] |
| T007 | Implement reload-before-refresh (FR-004 AdoptedNewer) + lock-timeout adopt/error branches (FR-016/FR-017) | WP02 | [D] |
| T008 | Implement stale-grant reconciler (FR-005 CurrentRejectionCleared vs FR-006 StaleRejectionPreserved) | WP02 | [D] |
| T009 | Wire user-readable re-login message + recovery command on confirmed-current rejection (FR-007) | WP02 | [D] |
| T010 | Refactor TokenManager.refresh_if_needed to delegate to run_refresh_transaction; preserve in-process asyncio.Lock (FR-003); add INFO logs per outcome (FR-019) | WP02 | [D] |
| T011 | Extend tests/auth/test_token_manager.py with new internal-flow coverage and golden FR-020 status output check | WP02 | [D] |
| T012 | Author tests/auth/concurrency/conftest.py (in-process fake refresh server fixture + tmp-rooted auth-store env override) | WP03 | [D] |
| T013 | Author tests/auth/concurrency/test_machine_refresh_lock.py (concurrent same-process refresh produces one network call) and tests/auth/concurrency/test_stale_grant_preservation.py (rotate-then-stale-grant scenario + clear-message assertion) | WP03 | [D] |
| T014 | Author tests/auth/concurrency/test_incident_regression.py (subprocess-based two-worker test reproducing the incident under file barriers; ≤30 s wall-clock) | WP03 | [D] |
| T015 | Add DAEMON_TICK_SECONDS=30 constant and _start_self_check_tick(server, my_port) helper in sync/daemon.py | WP04 | [D] |
| T016 | Wire the tick thread into run_sync_daemon; cancel on serve_forever exit | WP04 | [D] |
| T017 | Implement self-retirement decision (parse DAEMON_STATE_FILE; if recorded port ≠ self.port and record valid, call server.shutdown()); preserve no-rewrite invariant when state file missing/malformed | WP04 | [D] |
| T018 | Author tests/sync/test_daemon_self_retirement.py (4 cases: retires-on-mismatch, continues-on-match, continues-on-missing, continues-on-malformed) | WP04 | [D] |
| T019 | Create src/specify_cli/sync/orphan_sweep.py with OrphanDaemon and SweepReport frozen dataclasses + module docstring | WP05 | [D] |
| T020 | Implement enumerate_orphans() — port scan 9400-9450, /api/health probe, classify by protocol_version+package_version, exclude state-file port | WP05 | [D] |
| T021 | Implement sweep_orphans() — escalating shutdown (HTTP → terminate(1s) → kill(1s)) + state-file cleanup; return SweepReport | WP05 | [D] |
| T022 | Author tests/sync/test_orphan_sweep.py (7 cases per contracts/daemon-singleton.md) | WP05 | [D] |
| T023 | Create src/specify_cli/cli/commands/_auth_doctor.py with DoctorReport dataclass + assemble_report() (read-only data gathering only) | WP06 | [D] |
| T024 | Implement Rich rendering for all 7 sections (Identity, Tokens, Storage, Refresh Lock, Daemon, Orphans, Findings) reusing _auth_status formatters | WP06 | [D] |
| T025 | Implement findings-and-remediation logic (F-001..F-007) + exit-code policy (0 / 1 / 2) | WP06 | [D] |
| T026 | Implement --json mode with schema_version: 1; assert no network calls fire on default invocation (C-007) | WP06 | [D] |
| T027 | Wire @app.command() doctor in cli/commands/auth.py with --reset, --unstick-lock, --stuck-threshold, --json flags; --reset calls sweep_orphans; --unstick-lock calls force_release only when stuck | WP06 | [D] |
| T028 | Author tests/auth/test_auth_doctor_report.py + tests/auth/test_auth_doctor_repair.py + tests/auth/test_auth_doctor_offline.py (combined per the contract test tables) | WP06 | [D] |
Total: 28 subtasks across 6 work packages.
Phase 1 — Setup
(No setup WPs required; the dev environment is configured and the auth and sync subsystems already exist.)
Phase 2 — Foundational
WP01 — Cross-platform machine-wide file lock helper
Goal: Build src/specify_cli/core/file_lock.py, a self-contained, cross-platform machine-wide lock helper used by the refresh transaction (WP02) and the doctor command (WP06). Wraps fcntl.flock (POSIX) and msvcrt.locking (Windows). Supports bounded acquisition, hold-ceiling enforcement, age-based staleness, and atomic JSON content for diagnostics.
Priority: P1 (blocks WP02, WP06)
Estimated prompt size: ~360 lines (5 subtasks × ~70 lines)
Independent test: pytest tests/core/test_file_lock.py -v passes.
Dependencies: none
Included subtasks:
- ✅ T001 Module skeleton + LockRecord dataclass (WP01)
- ✅ T002 MachineFileLock async context manager (WP01)
- ✅ T003 read_lock_record + force_release helpers (WP01)
- ✅ T004 Stale-lock adoption + cross-platform dispatch (WP01)
- ✅ T005 Test suite — 7 cases per contract (WP01)
Implementation sketch: Stand up the dataclass and module docstring (T001), then build the MachineFileLock async context manager (T002) using a non-blocking fcntl.flock/msvcrt.locking acquire with a bounded-wait loop. Add read_lock_record/force_release (T003) for the doctor surface. Add age-based adoption (T004) so a process holding a stuck lock cannot block other processes forever. Tests come last (T005) and cover all branches against a tmp-path lock root.
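The bounded-wait acquire described above can be sketched as follows. This is a minimal POSIX-only illustration, not the MachineFileLock implementation: the names acquire_bounded, release, and LockTimeout are hypothetical, the default timings are assumptions, and the Windows branch (msvcrt.locking), hold-ceiling, and lock-record content are omitted.

```python
import asyncio
import fcntl  # POSIX only; a Windows branch would use msvcrt.locking instead
import os
import time


class LockTimeout(Exception):
    """Hypothetical error: the lock stayed contended for the whole wait window."""


async def acquire_bounded(path: str, max_wait: float = 2.0, poll: float = 0.1) -> int:
    """Return an open fd holding an exclusive flock on path, or raise LockTimeout."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + max_wait
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking the event loop.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd  # uncontended path: acquired without sleeping
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise LockTimeout(path)
            await asyncio.sleep(poll)  # sleep only while another holder has the lock


def release(fd: int) -> None:
    """Drop the flock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the sleep sits only in the contended branch, the uncontended path pays for one open and one flock call, which is the property the WP02 latency risk depends on.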
Parallel opportunities: T005 test scaffolding can be drafted alongside T002.
Risks: Cross-platform branch in T004: the Windows-only path is guarded by pytest.mark.skipif, so POSIX CI skips it rather than exercising it. The lift of _is_daemon_lock_contention from sync/daemon.py is deferred: WP01 ships its own predicate locally; WP04 may import or unify later. This avoids overlapping owned_files between WP01 and WP04.
Prompt file: tasks/WP01-machine-file-lock-helper.md
Phase 3 — Story WPs
WP02 — Refresh transaction with stale-grant preservation
Goal: Build src/specify_cli/auth/refresh_transaction.py and rewire TokenManager.refresh_if_needed to delegate through it. The transaction is bounded by MachineFileLock (WP01). It reloads persisted material before deciding, adopts newer-and-valid material when present (FR-004), and on invalid_grant/session_invalid reloads again to distinguish stale-token rejection (preserve session, FR-006) from current-token rejection (clear session + tell the user how to re-login, FR-005/FR-007).
Priority: P1 (the incident fix and the heart of the mission)
Estimated prompt size: ~480 lines (6 subtasks × ~80 lines)
Independent test: pytest tests/auth/test_token_manager.py -v passes; existing tests in this file remain green.
Dependencies: WP01
Included subtasks:
- ✅ T006 refresh_transaction.py skeleton + RefreshOutcome enum (WP02)
- ✅ T007 Reload-before-refresh + lock-timeout adopt/error (WP02)
- ✅ T008 Stale-grant reconciler: current-vs-stale rejection branches (WP02)
- ✅ T009 User-readable re-login message + recovery command (WP02)
- ✅ T010 TokenManager.refresh_if_needed delegates; preserve asyncio.Lock; INFO logs per outcome (WP02)
- ✅ T011 Extend test_token_manager.py with new-flow coverage + golden FR-020 status output (WP02)
Implementation sketch: Build the new module top-down (skeleton → happy path → failure-mode reconciler → user-facing message). Then rewire TokenManager so the in-process asyncio.Lock runs first as a same-process fast path (FR-003) and run_refresh_transaction runs inside it. Tests assert each RefreshOutcome produces a unique observable side-effect (storage write or not, log line, exception class).
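The failure-mode reconciler can be pictured as a pure decision over two values: the refresh token that was sent and the one found on disk after the second reload. The sketch below is an illustration under assumptions, not the module's code: the enum member spellings mirror the outcome names in the subtask index, reconcile_rejection is a hypothetical name, and treating a missing persisted token as a current rejection is an assumed policy.

```python
from enum import Enum
from typing import Optional


class RefreshOutcome(Enum):
    # Values mirror the outcome names in the subtask index; the real enum may define more.
    ADOPTED_NEWER = "AdoptedNewer"
    CURRENT_REJECTION_CLEARED = "CurrentRejectionCleared"
    STALE_REJECTION_PRESERVED = "StaleRejectionPreserved"


def reconcile_rejection(sent_token: str, persisted_token: Optional[str]) -> RefreshOutcome:
    """Classify an invalid_grant/session_invalid reply after reloading storage.

    sent_token is the refresh token this process presented; persisted_token is
    what storage holds now (None if nothing is stored, an assumed representation).
    """
    if persisted_token is not None and persisted_token != sent_token:
        # Another process rotated the token first, so the server rejected a stale
        # token. The session on disk is still good and must be preserved (FR-006).
        return RefreshOutcome.STALE_REJECTION_PRESERVED
    # The rejected token is the current one: clear the session and tell the
    # user how to log in again (FR-005/FR-007).
    return RefreshOutcome.CURRENT_REJECTION_CLEARED
```

Keeping the decision free of I/O makes each branch testable without a fake server; the caller performs the reload and the storage write.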
Parallel opportunities: T011 test additions can be drafted alongside T010 wiring once the RefreshOutcome enum is stable.
Risks: NFR-001's 50 ms p95 overhead is tight. The lock acquire path is the dominant cost; the bounded-acquire loop must use 100 ms sleep increments only when contended, not in the happy path. Backward-compat: RefreshTokenExpiredError and SessionInvalidError must still propagate so existing callers in auth/transport.py keep working.
Prompt file: tasks/WP02-refresh-transaction.md
WP03 — Concurrency and multiprocess regression tests
Goal: Verify WP01+WP02 behavior under concurrent and multiprocess load. Three tests: same-process concurrent refresh (one network call), rotate-then-stale-grant scenario (session preserved + clear message on real rejection), and the incident regression (two real subprocesses driving the rotate-then-stale-grant ordering against a fake server, bounded ≤30 s wall-clock per NFR-005).
Priority: P1 (NFR-005 anchor)
Estimated prompt size: ~280 lines (3 subtasks × ~90 lines)
Independent test: pytest tests/auth/concurrency -v passes (all three tests).
Dependencies: WP01, WP02
Included subtasks:
- ✅ T012 tests/auth/concurrency/conftest.py: in-process fake refresh server + tmp-auth-store env override (WP03)
- ✅ T013 test_machine_refresh_lock.py + test_stale_grant_preservation.py (WP03)
- ✅ T014 test_incident_regression.py: subprocess-based two-worker test under file barriers (WP03)
Implementation sketch: Start with the conftest fixture (T012) so the rest of WP03 has a working harness. Add the deterministic single-process tests next (T013) to confirm the contract before scaling to subprocesses. Finish with T014 — the subprocess regression — using subprocess.Popen and file-system barriers (tmp_path / "rotated.flag") to sequence Worker A (rotates) then Worker B (stale-grant) without time.sleep-based ordering.
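The file-barrier sequencing can be sketched with two throwaway workers. Everything here is illustrative: the worker bodies, order.log, and run_sequenced are hypothetical stand-ins for the real refresh workers, and only the rotated.flag barrier name comes from the sketch above. Worker B polls for the flag; its short sleep only avoids busy-waiting, and ordering is decided by the flag's existence, not by timing.

```python
import subprocess
import sys
import textwrap
from pathlib import Path

# Worker A: does its work, then raises the barrier.
WORKER_A = textwrap.dedent("""
    import pathlib, sys
    root = pathlib.Path(sys.argv[1])
    with (root / "order.log").open("a") as log:
        print("A rotated", file=log)
    (root / "rotated.flag").touch()
""")

# Worker B: waits for the barrier, then does its work.
WORKER_B = textwrap.dedent("""
    import pathlib, sys, time
    root = pathlib.Path(sys.argv[1])
    deadline = time.monotonic() + 10
    while not (root / "rotated.flag").exists():
        if time.monotonic() > deadline:
            sys.exit(2)
        time.sleep(0.01)
    with (root / "order.log").open("a") as log:
        print("B used stale grant", file=log)
""")


def run_sequenced(root: Path, timeout: float = 30.0) -> list:
    """Launch B before A; the flag file, not launch order, sequences them."""
    worker_b = subprocess.Popen([sys.executable, "-c", WORKER_B, str(root)])
    worker_a = subprocess.Popen([sys.executable, "-c", WORKER_A, str(root)])
    codes = (worker_a.wait(timeout=timeout), worker_b.wait(timeout=timeout))
    if codes != (0, 0):
        raise RuntimeError(f"worker exit codes: {codes}")
    return (root / "order.log").read_text().splitlines()
```

Starting B first is deliberate: if the test still observes A's entry before B's, the barrier is doing the sequencing.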
Parallel opportunities: Once T012 lands, T013 and T014 can be drafted in parallel.
Risks: Multiprocess test flake on slow CI (R5). Counter-design: file-barrier sequencing only, hard 30 s wall-clock cap, no shared event loops.
Prompt file: tasks/WP03-concurrency-and-incident-regression-tests.md
WP04 — Daemon self-retirement tick
Goal: Make every sync daemon poll DAEMON_STATE_FILE every DAEMON_TICK_SECONDS=30. If the recorded port differs from the daemon's own bound port, the daemon initiates clean shutdown. Closes the gap that allowed orphan daemons to accumulate after the existing daemon-spawn lock drift (FR-008/FR-010).
Priority: P1
Estimated prompt size: ~280 lines (4 subtasks × ~70 lines)
Independent test: pytest tests/sync/test_daemon_self_retirement.py -v passes.
Dependencies: none
Included subtasks:
- ✅ T015 DAEMON_TICK_SECONDS constant + _start_self_check_tick helper (WP04)
- ✅ T016 Wire tick thread into run_sync_daemon; cancel on serve_forever exit (WP04)
- ✅ T017 Self-retirement decision: parse state file, compare port, server.shutdown() (WP04)
- ✅ T018 Test suite — retires/continues across 4 state-file scenarios (WP04)
Implementation sketch: Add the constant and helper as additive code in sync/daemon.py. Wire the helper as a daemon thread started in run_sync_daemon and cancel it in the finally block. The retirement decision must NEVER rewrite the state file (state ownership is _ensure_sync_daemon_running_locked only).
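The retirement decision reduces to a read-only predicate over the state file's text. The sketch below assumes the state file is JSON with an integer "port" key; that format, and the names recorded_port and should_retire, are assumptions for illustration and are not taken from sync/daemon.py.

```python
import json
from typing import Optional


def recorded_port(state_text: Optional[str]) -> Optional[int]:
    """Return the port recorded in the state text, or None if missing or malformed."""
    if state_text is None:  # state file does not exist
        return None
    try:
        port = json.loads(state_text)["port"]  # assumed key name
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, not an object, or no port field
    if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
        return None
    return port


def should_retire(state_text: Optional[str], my_port: int) -> bool:
    """True only when a valid record names a different port. Never writes anything."""
    port = recorded_port(state_text)
    return port is not None and port != my_port
```

The four branches line up with the T018 cases: retire on mismatch, continue on match, continue on missing, continue on malformed.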
Parallel opportunities: Independent of WP01, WP02, WP03, WP05.
Risks: R6 — two concurrent daemons each thinking they won. Counter-design: the state file is the tiebreaker; whichever daemon's port is not in the state file shuts down. Tick interval (30 s) trades convergence latency against disk noise.
Prompt file: tasks/WP04-daemon-self-retirement-tick.md
WP05 — Orphan daemon sweep
Goal: Build src/specify_cli/sync/orphan_sweep.py with enumerate_orphans() (probe ports 9400-9450 via /api/health and classify by protocol_version+package_version) and sweep_orphans() (escalating shutdown: HTTP → SIGTERM → SIGKILL). Provides the user-triggered recovery path consumed by auth doctor --reset (WP06).
Priority: P1
Estimated prompt size: ~340 lines (4 subtasks × ~85 lines)
Independent test: pytest tests/sync/test_orphan_sweep.py -v passes (7 cases).
Dependencies: none
Included subtasks:
- ✅ T019 orphan_sweep.py skeleton: OrphanDaemon and SweepReport dataclasses (WP05)
- ✅ T020 enumerate_orphans(): port scan + identity probe (WP05)
- ✅ T021 sweep_orphans(): escalating shutdown + state-file cleanup (WP05)
- ✅ T022 Test suite: 7 cases per contracts/daemon-singleton.md (WP05)
Implementation sketch: Module skeleton first (T019). The probe (T020) must use a tight per-port connect_ex timeout (50 ms) so the worst-case 50-port scan stays within NFR-006's 3 s ceiling. The sweep (T021) escalates: HTTP shutdown without token (best-effort, 403 expected), then psutil.Process.terminate() (1 s wait), then kill() (1 s wait). State-file cleanup is best-effort.
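The tight per-port check can be sketched with the standard library's socket.connect_ex. The helper names port_accepts and open_ports are hypothetical, and the loopback host and the inclusive treatment of the 9400-9450 range are assumptions; only ports that accept the connection would go on to the /api/health probe.

```python
import socket


def port_accepts(host: str, port: int, timeout: float = 0.05) -> bool:
    """True if a TCP connect to (host, port) succeeds within timeout seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so refused or timed-out ports do not raise.
        return sock.connect_ex((host, port)) == 0


def open_ports(host: str = "127.0.0.1", first: int = 9400, last: int = 9450) -> list:
    """Ports in [first, last] that accepted a connection (range assumed inclusive)."""
    return [port for port in range(first, last + 1) if port_accepts(host, port)]
```

If the range is inclusive it covers 51 ports, so the worst case of every port timing out costs 51 × 50 ms = 2.55 s, still under the 3 s ceiling; on loopback, closed ports are normally refused immediately rather than timing out.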
Parallel opportunities: Independent of WP02, WP03, WP04. Can run in parallel with WP04 in a different lane.
Risks: R4 — non-Spec-Kitty process listening on the same port. Counter-design: orphan classification requires both protocol_version and package_version JSON keys in the response; anything else is left alone.
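The counter-design for R4 can be expressed as a strict parser over the health response body. This is a sketch under assumptions: parse_daemon_identity is a hypothetical name, and the body is assumed to be a JSON object; only the two required key names come from the text above.

```python
import json
from typing import Optional

REQUIRED_KEYS = ("protocol_version", "package_version")


def parse_daemon_identity(body: bytes) -> Optional[dict]:
    """Return the identity fields if body looks like a daemon health reply, else None."""
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in REQUIRED_KEYS):
        return None  # some other service on the port: leave it alone
    return {key: data[key] for key in REQUIRED_KEYS}
```

Returning None for anything unrecognized keeps the sweep conservative: a process is only ever a shutdown candidate after it has positively identified itself.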
Prompt file: tasks/WP05-orphan-daemon-sweep.md
WP06 — spec-kitty auth doctor command
Goal: Add the auth doctor typer command. Default invocation is read-only and reports 7 sections (Identity, Tokens, Storage, Refresh Lock, Daemon, Orphans, Findings/Remediation). --reset calls sweep_orphans() from WP05. --unstick-lock calls force_release() from WP01 only when the lock is older than --stuck-threshold (default 60 s). --json emits the schema in data-model.md. C-007 forbids network calls on the default path.
Priority: P1
Estimated prompt size: ~520 lines (6 subtasks × ~85 lines; slightly over target because all paths share one rendering surface and need cohesive guidance)
Independent test: pytest tests/auth/test_auth_doctor_report.py tests/auth/test_auth_doctor_repair.py tests/auth/test_auth_doctor_offline.py -v passes.
Dependencies: WP01 (lock introspection), WP05 (orphan listing).
Included subtasks:
- ✅ T023 _auth_doctor.py skeleton: DoctorReport + assemble_report() (read-only) (WP06)
- ✅ T024 Rich rendering of 7 sections (reuse _auth_status formatters) (WP06)
- ✅ T025 Findings + remediation (F-001..F-007) + exit-code policy (WP06)
- ✅ T026 --json mode with schema_version: 1; assert no-network on default (WP06)
- ✅ T027 Wire doctor typer subcommand with all flags in cli/commands/auth.py (WP06)
- ✅ T028 Test suite: report + repair + offline (3 test files combined per contract) (WP06)
Implementation sketch: Build the data layer first (T023): pure functions reading session, lock, daemon, and orphan state into a DoctorReport. Add Rich rendering (T024) on top, reusing formatters from _auth_status.py. Layer findings/remediation logic (T025), then --json (T026). Wire typer last (T027). Tests gate every path including the C-007 offline guarantee (T028).
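The --unstick-lock gate from the WP06 goal (release only when the lock is older than the stuck threshold) is small enough to sketch in isolation. The names lock_is_stuck and maybe_unstick are hypothetical, the lock's start time is assumed to be available as a timezone-aware datetime, and force_release is passed in as a plain callable standing in for the WP01 helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

DEFAULT_STUCK_THRESHOLD_S = 60.0  # the --stuck-threshold default stated in the goal


def lock_is_stuck(started_at: datetime, now: datetime,
                  threshold_s: float = DEFAULT_STUCK_THRESHOLD_S) -> bool:
    """True when the lock has been held for longer than threshold_s seconds."""
    return now - started_at > timedelta(seconds=threshold_s)


def maybe_unstick(started_at: Optional[datetime], now: datetime,
                  force_release: Callable[[], None],
                  threshold_s: float = DEFAULT_STUCK_THRESHOLD_S) -> bool:
    """Call force_release only for a held, stuck lock. Returns True if it was called."""
    if started_at is None:  # no lock record: nothing to release
        return False
    if not lock_is_stuck(started_at, now, threshold_s):
        return False  # a live refresh may be in flight; leave the lock alone
    force_release()
    return True
```

Passing now explicitly keeps the gate deterministic in tests; a caller would supply datetime.now(timezone.utc).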
Parallel opportunities: T024 (rendering) and T026 (--json) can be drafted in parallel once T023 lands.
Risks: NFR-006's 3 s ceiling is tight if the port scan is naive. Counter-design: connect_ex with 50 ms per-port timeout filters closed ports fast; HTTP probe only fires for ports that accepted the connection.
Prompt file: tasks/WP06-auth-doctor-command.md
Phase 4 — Polish
(No polish WPs required. Documentation updates are folded into each WP's DoD as needed.)
Lane partition (proposed; finalize-tasks will compute)
- Lane A (sequential foundation): WP01 → WP02 → WP03 → WP06
- Lane B (independent): WP04
- Lane C (independent): WP05
WP06 must follow WP01 + WP05; running WP06 in Lane A means it waits for both. If finalize-tasks chooses to put WP06 into a fourth lane, WP06 must rebase across both before merging.
Implementation order recommendation
WP01 first (foundation; blocks WP02 and WP06). Then WP02 (the incident fix). WP03 verifies WP01+WP02. WP04 and WP05 are independent and can run in parallel after WP01 lands. WP06 lands last because it consumes both WP01 and WP05.
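One way to sanity-check an ordering against the declared dependencies is a small topological check. The dependency map below is transcribed from the Dependencies lines of each WP; the helper respects_dependencies and the RECOMMENDED list are illustrative and are not part of finalize-tasks.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each WP maps to the WPs it depends on, per the Dependencies lines above.
DEPENDS_ON = {
    "WP01": set(),
    "WP02": {"WP01"},
    "WP03": {"WP01", "WP02"},
    "WP04": set(),
    "WP05": set(),
    "WP06": {"WP01", "WP05"},
}

# The order recommended in this section.
RECOMMENDED = ["WP01", "WP02", "WP03", "WP04", "WP05", "WP06"]


def respects_dependencies(order: list) -> bool:
    """True if every WP appears after all of its dependencies."""
    position = {wp: index for index, wp in enumerate(order)}
    return all(
        position[dep] < position[wp]
        for wp, deps in DEPENDS_ON.items()
        for dep in deps
    )


def any_valid_order() -> list:
    """One dependency-respecting order; raises CycleError if the map is cyclic."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

The same check accepts any lane interleaving that keeps WP06 after both WP01 and WP05, which is the only cross-lane constraint.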