Contracts
auth-doctor.md
Contract — spec-kitty auth doctor
> Implements FR-011, FR-012, FR-013, FR-014, FR-015; NFR-006; C-007, C-008.
> Owned by WP06 (src/specify_cli/cli/commands/_auth_doctor.py and the
> new @app.command() doctor in src/specify_cli/cli/commands/auth.py).
CLI surface
Usage: spec-kitty auth doctor [OPTIONS]

  Diagnose CLI auth and sync-daemon state. Default invocation is read-only.

Options:
  --json                Emit findings as a JSON document instead of Rich layout.
  --reset               Sweep orphan sync daemons in the reserved port range.
  --unstick-lock        Force-release the machine-wide refresh lock if
                        its age exceeds the stuck threshold.
  --stuck-threshold S   Age (seconds) above which the refresh lock is
                        considered stuck. Default: 60.
  -h, --help            Show this message and exit.
--reset and --unstick-lock are independent flags; passing both runs both repairs in that order. There is no --auto-fix (C-008).
Default invocation (no flags) — schema
Sections rendered in order. Each section has a Rich representation and a JSON representation.
1. Identity
Reuses helpers from _auth_status.py. Renders the same User / User ID / Teams / Auth method block as auth status for authenticated sessions. For unauthenticated state, prints the existing "Not authenticated" message and continues to render diagnostic sections.
2. Tokens
| Field | Source | Rendered as |
|---|---|---|
| Access remaining | session.access_token_expires_at - now() | format_duration(s) (existing helper) |
| Refresh remaining | session.refresh_token_expires_at - now() | format_duration(s), or "server-managed (legacy)" for None |
3. Storage
| Field | Source |
|---|---|
| Backend | format_storage_backend(session.storage_backend) |
| Persisted-vs-in-memory drift | session_id and refresh_token of _storage.read() compared against tm.get_current_session() |
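If drift is detected while a refresh transaction is in flight, surface it as info (not warn): drift is expected during a refresh transaction, just not after one. Drift with no refresh in flight is reported as F-006 (see Findings & Remediation).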
4. Refresh Lock
| Field | Source |
|---|---|
| Held? | read_lock_record(REFRESH_LOCK_PATH) is not None |
| Holder PID | record.pid |
| Acquired at | record.started_at |
| Age | record.age_s |
| Stuck? | record.age_s > stuck_threshold |
| Same host? | record.host == socket.gethostname() |
5. Daemon
| Field | Source |
|---|---|
| Active? | get_sync_daemon_status().healthy (existing function) |
| PID | from state file |
| Port | from state file |
| Package version | from /api/health response |
| Protocol version | from /api/health response |
6. Orphans
A table of any orphan daemons (enumerate_orphans()). Empty table if none.
| Column | Source |
|---|---|
| PID | OrphanDaemon.pid |
| Port | OrphanDaemon.port |
| Package version | OrphanDaemon.package_version |
7. Findings & Remediation
Compute the findings list:
| ID | Trigger | Severity | Remediation command |
|---|---|---|---|
| F-001 | No session loaded | critical | spec-kitty auth login |
| F-002 | Orphans present | warn | spec-kitty auth doctor --reset |
| F-003 | Refresh lock stuck (age > stuck_threshold) | critical | spec-kitty auth doctor --unstick-lock |
| F-004 | Daemon running but version mismatches package version | warn | spec-kitty sync restart (existing) |
| F-005 | Daemon expected (rollout enabled) but not running | info | next CLI command will start it |
| F-006 | Persisted/in-memory drift with no refresh in flight | warn | re-run spec-kitty auth doctor after another CLI command |
| F-007 | Lock holder is on a different host (NFS scenario) | warn | manual investigation; not auto-resolvable |
Each finding renders as: [severity] summary followed by an indented Run: <command> line. When findings is empty, the report ends with No problems detected.
Exit codes
| Exit | Meaning |
|---|---|
| 0 | Report rendered. No critical findings remain after any repairs the user requested. |
| 1 | Report rendered, but at least one critical finding remains. (E.g. auth doctor was run without --unstick-lock while the lock is stuck.) |
| 2 | Internal error (exception during diagnostic gathering). Stack trace printed; report is partial. |
auth doctor is a diagnostic, not a gate — exit 1 is informational, not a CI failure pattern. Scripts that want to fail on critical findings must check the JSON output.
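The exit-code table can be sketched as a small function over the findings list. This is a minimal illustration, not the WP06 implementation: the `Finding` dataclass and `exit_code_for` are hypothetical names, and exit 2 (internal error) is assumed to be handled by the caller's exception handler rather than modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one finding row (ID, severity, remediation)."""
    id: str
    severity: str  # "critical", "warn", or "info"
    remediation: str


def exit_code_for(findings: list[Finding]) -> int:
    """Return 1 if any critical finding remains, else 0."""
    return 1 if any(f.severity == "critical" for f in findings) else 0


orphan = Finding("F-002", "warn", "spec-kitty auth doctor --reset")
stuck = Finding("F-003", "critical", "spec-kitty auth doctor --unstick-lock")

warn_only = exit_code_for([orphan])             # warn is not critical
with_critical = exit_code_for([orphan, stuck])  # a critical finding remains
```

Because the function only looks at findings that remain after any requested repairs, the same call covers both the default invocation and the post-repair report.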
--json output (machine-readable)
See data-model.md §"DoctorReport JSON schema" for the full shape. The schema is versioned (schema_version: 1) so future tranches can extend it without breaking consumers.
--reset semantics
1. Run the default report once (read-only).
2. If findings contains F-002: call sweep_orphans(enumerate_orphans()).
3. Re-run the report (post-reset) and print the sweep summary (<n> orphans swept, <m> failed).
4. Exit code based on post-reset state.
--reset is a no-op when no orphans are detected.
--unstick-lock semantics
1. Run the default report once (read-only).
2. If findings contains F-003: call force_release(REFRESH_LOCK_PATH, only_if_age_s=stuck_threshold).
3. Re-run the report and print the unstick outcome.
4. Exit code based on post-unstick state.
--unstick-lock is a no-op when the lock is not stuck. The only_if_age_s parameter prevents the user from accidentally dropping a healthy in-flight lock.
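Both repair flags follow the same report, repair, re-report shape. The sketch below captures that shape with injected callables; `run_repair` and the toy state are hypothetical stand-ins, and the real command would call sweep_orphans or force_release where the `repair` callable is invoked.

```python
from typing import Callable


def run_repair(
    gather_findings: Callable[[], set[str]],
    finding_id: str,
    repair: Callable[[], None],
) -> tuple[set[str], bool]:
    """Report, repair only if the triggering finding is present, then re-report.

    Returns (post-repair findings, whether the repair ran).
    """
    before = gather_findings()   # step 1: read-only report
    ran = finding_id in before
    if ran:
        repair()                 # step 2: targeted repair
    after = gather_findings()    # step 3: re-run the report
    return after, ran


# Toy state standing in for the real lock: stuck until "released".
state = {"stuck": True}
findings = lambda: {"F-003"} if state["stuck"] else set()
release = lambda: state.update(stuck=False)

after_first, ran_first = run_repair(findings, "F-003", release)
after_second, ran_second = run_repair(findings, "F-003", release)  # now a no-op
```

The second call shows the no-op behavior: with the finding absent, the repair callable is never invoked.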
C-007 enforcement (no network calls in default invocation)
tests/auth/test_auth_doctor_offline.py patches httpx.AsyncClient, urllib.request.urlopen, and socket.create_connection with mocks that fail the test if invoked. The test then runs auth doctor with no flags and asserts no patch was triggered.
The local-only probes that enumerate_orphans() performs are 127.0.0.1 TCP connects, which are explicitly allowed (the contract calls them "local" and excludes them from C-007's scope; see C-007 text: "MUST NOT require network access").
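A minimal sketch of the patch-and-fail technique the offline test describes, using only the standard library. The real test also patches httpx.AsyncClient; that is omitted here to keep the example dependency-free. `run_offline` and `fake_doctor` are hypothetical names.

```python
import socket
import urllib.request
from unittest import mock


def run_offline(fn):
    """Run `fn` with outbound-capable entry points patched to fail loudly."""
    boom = AssertionError("network call attempted during default invocation")
    with mock.patch.object(urllib.request, "urlopen", side_effect=boom) as urlopen:
        with mock.patch.object(socket, "create_connection", side_effect=boom) as connect:
            result = fn()
    # If fn swallowed the error, the call counts still expose the attempt.
    assert urlopen.call_count == 0
    assert connect.call_count == 0
    return result


def fake_doctor() -> str:
    """Stand-in for the default `auth doctor` invocation (hypothetical)."""
    return "No problems detected."


def leaky_doctor() -> None:
    """A misbehaving variant that tries to reach the network."""
    urllib.request.urlopen("http://example.invalid")


clean_result = run_offline(fake_doctor)
```

Checking the call counts after the fact, in addition to raising from the mock, guards against code under test that catches broad exceptions.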
NFR-006 enforcement (≤3 s time-to-actionable)
auth doctor uses tight per-probe timeouts: 0.5 s for each daemon health probe. On its own that gives a worst case of 50 × 0.5 s = 25 s with no daemons, which would violate NFR-006. Mitigation: a connect_ex pre-check with a 50 ms connection-attempt timeout runs before the HTTP probe, so closed ports are filtered in <1 s total. This brings the typical run to <300 ms and the maximum to <3 s in adversarial cases (every port answering slowly).
tests/auth/test_auth_doctor_report.py::test_runs_under_three_seconds asserts a 3-second wall-clock ceiling under realistic fixture state.
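The connect_ex pre-check could look like the sketch below. `port_is_open` is a hypothetical helper name; the 50 ms default mirrors the figure above. For scale, even if every one of the 50 connect attempts ran to its full timeout, the pre-check would cost 50 × 0.05 s = 2.5 s; on loopback a closed port normally refuses immediately.

```python
import socket


def port_is_open(port: int, timeout_s: float = 0.05) -> bool:
    """Cheap TCP pre-check on 127.0.0.1; True only if something is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex(("127.0.0.1", port)) == 0


# Demonstrate against an ephemeral listener rather than the reserved range.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

open_while_listening = port_is_open(demo_port)
listener.close()
open_after_close = port_is_open(demo_port)
```

Only ports that pass this check would go on to the 0.5 s HTTP health probe.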
Test contract
tests/auth/test_auth_doctor_report.py
| Test | Predicate |
|---|---|
| test_renders_authenticated_no_findings | Healthy state ⇒ all sections render; findings empty; exit 0. |
| test_renders_unauthenticated | No session ⇒ F-001 critical; report still complete; exit 1. |
| test_renders_orphan_finding | One orphan present ⇒ F-002 warn; report completes; exit 0 (warn is not critical). |
| test_renders_stuck_lock_finding | Lock record 120 s old ⇒ F-003 critical; exit 1. |
| test_renders_legacy_session | refresh_token_expires_at is None ⇒ "server-managed (legacy)" string; no extra finding. |
| test_runs_under_three_seconds | 50-port scan + healthy state completes in <3 s. |
| test_json_output_schema | --json output validates against the schema in data-model.md §5. |
tests/auth/test_auth_doctor_repair.py
| Test | Predicate |
|---|---|
| test_reset_sweeps_orphans | Two daemons, one orphan ⇒ --reset invokes sweep_orphans; orphan terminated. |
| test_reset_noop_when_no_orphans | No orphans ⇒ --reset does not call sweep_orphans. |
| test_unstick_drops_old_lock | Lock 120 s old ⇒ --unstick-lock removes lock file. |
| test_unstick_preserves_fresh_lock | Lock 5 s old ⇒ --unstick-lock is a no-op; lock still held. |
| test_combined_flags_run_both | --reset --unstick-lock runs both repairs. |
tests/auth/test_auth_doctor_offline.py
| Test | Predicate |
|---|---|
| test_no_outbound_http | Default invocation makes zero httpx/urllib outbound calls (only 127.0.0.1 connects allowed). |
| test_no_state_mutation_default | After default invocation: no files removed, no processes terminated, no locks released. |
daemon-singleton.md
Contract — Daemon Convergence and Orphan Sweep
> Implements FR-008, FR-009, FR-010.
> Owned by WP04 (src/specify_cli/sync/daemon.py modifications) and
> WP05 (src/specify_cli/sync/orphan_sweep.py new module). Consumed by
> WP06 (auth doctor).
Reserved port range (existing, unchanged)
DAEMON_PORT_START = 9400 # src/specify_cli/sync/daemon.py
DAEMON_PORT_MAX_ATTEMPTS = 50 # i.e. 9400 .. 9449
A "Spec Kitty sync daemon port" is any TCP port in [9400, 9450).
State file (existing, unchanged)
~/.spec-kitty/sync-daemon (POSIX) or %LOCALAPPDATA%\spec-kitty\daemon\sync-daemon (Windows). Plain text, four lines:
http://127.0.0.1:9400 # url
9400 # port
<bearer-token-hex> # token (POST auth for trigger/publish/shutdown)
<pid> # owner pid
Atomically written by _write_daemon_file. No format change.
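For readers unfamiliar with the format, a tolerant parser for the four-line layout might look like this. It is an illustrative sketch, not the existing _parse_daemon_file: `DaemonState` and `parse_state_text` are hypothetical names, and the sketch assumes the file holds exactly four non-empty lines (the `# url` style annotations above are treated as documentation, not file content).

```python
from typing import NamedTuple, Optional


class DaemonState(NamedTuple):
    url: str
    port: int
    token: str
    pid: int


def parse_state_text(text: str) -> Optional[DaemonState]:
    """Parse the four-line state file; return None if it is malformed."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 4:
        return None
    url, port, token, pid = lines
    try:
        return DaemonState(url, int(port), token, int(pid))
    except ValueError:
        return None  # non-numeric port or pid


parsed = parse_state_text("http://127.0.0.1:9400\n9400\ndeadbeef\n4242\n")
```

Returning None for anything malformed lets callers treat a corrupt file the same way as a missing one.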
Singleton rule
The daemon whose port matches the recorded port in DAEMON_STATE_FILE is the user-level singleton. Only one daemon process can be the singleton at any moment.
Ownership transitions only via _ensure_sync_daemon_running_locked, which is gated by the existing DAEMON_LOCK_FILE (~/.spec-kitty/sync-daemon.lock, fcntl.flock / msvcrt.locking). No new lock is introduced for daemon ownership.
Self-retirement tick (NEW — WP04)
DAEMON_TICK_SECONDS: int = 30
Inside run_sync_daemon, a daemon-side scheduled task fires every DAEMON_TICK_SECONDS. On each tick:
1. Read DAEMON_STATE_FILE via the existing _parse_daemon_file.
2. If the parsed port equals self.port: continue running (this process is the singleton).
3. If the parsed port differs from self.port and the parsed record looks valid (port and PID present, PID alive): initiate server.shutdown() and exit cleanly.
4. If the state file is missing or malformed: continue running but do not rewrite the state file from this code path (state file is owned by _ensure_sync_daemon_running_locked only).
Implementation: a threading.Timer-style daemon thread inside run_sync_daemon is sufficient; the existing BaseHTTPRequestHandler loop continues to handle requests in parallel.
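The per-tick decision reduces to a pure function, which makes the four cases easy to test in isolation. `should_retire` is a hypothetical name; the liveness check is injected so the sketch does not depend on psutil. It assumes that a record pointing at a dead PID means "keep running", which the steps above imply but do not state outright.

```python
from typing import Callable, Optional


def should_retire(
    my_port: int,
    recorded_port: Optional[int],
    recorded_pid: Optional[int],
    pid_alive: Callable[[int], bool],
) -> bool:
    """Return True only when a different, live daemon owns the state file."""
    if recorded_port is None or recorded_pid is None:
        return False  # missing or malformed state file: keep running
    if recorded_port == my_port:
        return False  # this process is the singleton
    return pid_alive(recorded_pid)  # retire only if the recorded owner is alive


always_alive = lambda pid: True
never_alive = lambda pid: False

retire_on_mismatch = should_retire(9400, 9401, 4242, always_alive)
stay_on_match = should_retire(9400, 9400, 4242, always_alive)
stay_on_missing = should_retire(9400, None, None, always_alive)
stay_on_dead_owner = should_retire(9400, 9401, 4242, never_alive)
```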
Implementation note
run_sync_daemon becomes:
def run_sync_daemon(port: int, daemon_token: str | None) -> None:
    from specify_cli.sync.runtime import get_runtime

    get_runtime()
    handler_class = type(
        "SyncDaemonRouter",
        (SyncDaemonHandler,),
        {"daemon_token": daemon_token},
    )
    server = HTTPServer(("127.0.0.1", port), handler_class)

    # NEW (WP04): self-retirement tick.
    tick_thread = _start_self_check_tick(server, my_port=port)
    try:
        server.serve_forever()
    finally:
        tick_thread.cancel()
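The contract does not define _start_self_check_tick beyond its name and the cancel() call. One way to get a repeating, cancellable tick is sketched below; `RepeatingTick` is a hypothetical class, and threading.Timer itself fires only once, so a loop over an Event is used instead. In the daemon, the callback would run the state-file check and call server.shutdown(), which must come from a thread other than the one running serve_forever().

```python
import threading
import time
from typing import Callable


class RepeatingTick(threading.Thread):
    """Daemon thread that calls `callback` every `interval_s` until cancelled."""

    def __init__(self, interval_s: float, callback: Callable[[], None]) -> None:
        super().__init__(daemon=True)
        self.interval_s = interval_s
        self.callback = callback
        self.cancelled = threading.Event()

    def run(self) -> None:
        # Event.wait returns False on timeout (time to tick) and
        # True once cancel() has been called.
        while not self.cancelled.wait(self.interval_s):
            self.callback()

    def cancel(self) -> None:
        self.cancelled.set()


ticks: list[int] = []
tick = RepeatingTick(0.01, lambda: ticks.append(1))
tick.start()
time.sleep(0.2)
tick.cancel()
tick.join(timeout=1.0)
```

Waiting on the Event rather than sleeping means cancel() takes effect immediately instead of after the next interval.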
Orphan identification (NEW — WP05)
@dataclass(frozen=True)
class OrphanDaemon:
    pid: int | None
    port: int
    package_version: str | None
    protocol_version: int | None

def enumerate_orphans() -> list[OrphanDaemon]: ...
Algorithm:
1. Read DAEMON_STATE_FILE once, capture current_port (or None).
2. For each port in [9400, 9450):
   1. Open a TCP probe socket; if connect fails, skip.
   2. Issue GET http://127.0.0.1:{port}/api/health with timeout=0.5.
   3. Parse response JSON. If protocol_version and package_version keys are both present, this is a Spec Kitty daemon.
   4. If port == current_port: skip (singleton, not an orphan).
   5. Otherwise: append OrphanDaemon with PID looked up via psutil.net_connections() filter laddr.port == port and status == "LISTEN".
Anything that does not respond, or whose response lacks either of the two required keys, is not classified as a Spec Kitty daemon. Sweep never touches it.
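The classification rule can be expressed as a small predicate over the health response body. `is_spec_kitty_daemon` is a hypothetical helper name; only the two key names come from the contract.

```python
import json

REQUIRED_KEYS = ("protocol_version", "package_version")


def is_spec_kitty_daemon(body: bytes) -> bool:
    """True only if the body is a JSON object carrying both required keys."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and all(key in payload for key in REQUIRED_KEYS)


ours = is_spec_kitty_daemon(b'{"protocol_version": 1, "package_version": "3.2.0a5"}')
other_service = is_spec_kitty_daemon(b'{"status": "ok"}')
one_key_only = is_spec_kitty_daemon(b'{"protocol_version": 1}')
not_json = is_spec_kitty_daemon(b"<html>hello</html>")
```

Requiring both keys keeps unrelated local HTTP servers in the port range out of the orphan list, and therefore out of reach of the sweep.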
Orphan sweep (NEW — WP05)
@dataclass(frozen=True)
class SweepReport:
    swept: list[OrphanDaemon]
    failed: list[tuple[OrphanDaemon, str]]  # (orphan, reason)
    duration_s: float

def sweep_orphans(orphans: list[OrphanDaemon], *, timeout_s: float = 5.0) -> SweepReport: ...
Per orphan, escalate in order until the port stops listening:
1. Graceful HTTP shutdown: POST http://127.0.0.1:{port}/api/shutdown without a token. Today this returns 403. Pre-existing daemons stay alive; that's fine, we escalate.
2. SIGTERM via psutil: psutil.Process(orphan.pid).terminate(). Wait up to 1 s for port to free.
3. SIGKILL via psutil: psutil.Process(orphan.pid).kill(). Wait up to 1 s.
4. State-file cleanup: if a state file points at the orphan port, remove it.
Each step's failure (no PID, AccessDenied, port still listening) is recorded in SweepReport.failed.
Total bounded duration: timeout_s × len(orphans) worst case. Default timeout_s=5.0.
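The escalation ladder for a single orphan can be sketched with injected steps, which keeps the control flow testable without real processes. `escalate` and the toy steps are hypothetical; the real module would plug in the HTTP POST and the psutil calls, and would run state-file cleanup afterwards (omitted here).

```python
from typing import Callable, Sequence

Step = tuple[str, Callable[[], None]]


def escalate(
    steps: Sequence[Step],
    port_is_listening: Callable[[], bool],
) -> tuple[bool, list[str]]:
    """Run steps in order until the port stops listening.

    Returns (swept, failures), where failures holds "<step>: <reason>" strings.
    """
    failures: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # e.g. AccessDenied or a missing PID
            failures.append(f"{name}: {exc}")
            continue
        if not port_is_listening():
            return True, failures
        failures.append(f"{name}: port still listening")
    return False, failures


# Toy orphan: ignores the HTTP shutdown, exits on the SIGTERM stand-in.
port = {"listening": True}
steps = [
    ("http_shutdown", lambda: None),
    ("sigterm", lambda: port.update(listening=False)),
    ("sigkill", lambda: port.update(listening=False)),
]
swept, failures = escalate(steps, lambda: port["listening"])
```

Here the ladder stops at the second step, and the recorded failure shows which earlier step did not free the port.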
Test contract
tests/sync/test_daemon_self_retirement.py
| Test | Predicate |
|---|---|
| test_self_retires_when_port_mismatch | Start daemon on port A; write state file pointing at port B; daemon exits within 2 ticks. |
| test_continues_when_port_matches | Start daemon on port A; state file points at port A; daemon stays alive over 3 ticks. |
| test_continues_when_state_file_missing | Start daemon; remove state file; daemon stays alive (does not self-rewrite). |
| test_continues_when_state_file_malformed | Start daemon; corrupt state file; daemon stays alive. |
tests/sync/test_orphan_sweep.py
| Test | Predicate |
|---|---|
| test_enumerate_finds_singleton_only | One daemon on 9400; state file points at 9400; enumerate_orphans() returns []. |
| test_enumerate_finds_orphan | Two daemons on 9400 and 9401; state file points at 9400; enumerate_orphans() returns one entry on 9401. |
| test_enumerate_skips_non_spec_kitty | Plain HTTP server on 9402 returning 200 without our keys; not classified as orphan. |
| test_enumerate_skips_closed_ports | No process on 9402; not in the result. |
| test_sweep_terminates_orphan | Orphan running; after sweep_orphans(), port is closed and report swept lists it. |
| test_sweep_does_not_touch_singleton | Singleton + orphan; only the orphan is terminated. |
| test_sweep_records_failure_on_access_denied | Orphan PID exists but terminate() raises AccessDenied; recorded in failed. |
refresh-lock.md
Contract — Machine-wide Refresh Lock
> Implements FR-001, FR-002, FR-016, FR-017, FR-018; NFR-002, NFR-008.
> Owned by WP01 (src/specify_cli/core/file_lock.py) and consumed by
> WP02 (src/specify_cli/auth/refresh_transaction.py) and WP06
> (src/specify_cli/cli/commands/_auth_doctor.py).
Path
| Platform | Path |
|---|---|
| POSIX (macOS, Linux) | ~/.spec-kitty/auth/refresh.lock |
| Windows | %LOCALAPPDATA%\spec-kitty\auth\refresh.lock (resolved via specify_cli.paths.get_runtime_root()) |
The directory is created (with parents) on first acquisition. Permissions default to user-only (0o700 directory, 0o600 file) on POSIX.
OS-level primitive
| Platform | Call |
|---|---|
| POSIX | fcntl.flock(fd, LOCK_EX \| LOCK_NB) to acquire; fcntl.flock(fd, LOCK_UN) to release |
| Windows | msvcrt.locking(fd, LK_NBLCK, 1) to acquire; msvcrt.locking(fd, LK_UNLCK, 1) to release |
Both calls are non-blocking. Contention errors (BlockingIOError, EACCES, EAGAIN, EDEADLK) are detected via the _is_daemon_lock_contention predicate (lifted from sync/daemon.py into core/file_lock.py).
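A contention predicate along the lines described might look like the following. This is a hypothetical reimplementation for illustration, not the actual _is_daemon_lock_contention from sync/daemon.py; only the list of error conditions comes from the contract.

```python
import errno

_CONTENTION_ERRNOS = {errno.EACCES, errno.EAGAIN, errno.EDEADLK}


def is_lock_contention(exc: BaseException) -> bool:
    """True if `exc` means "someone else holds the lock", not a real failure."""
    if isinstance(exc, BlockingIOError):
        return True
    return isinstance(exc, OSError) and exc.errno in _CONTENTION_ERRNOS


contended = is_lock_contention(BlockingIOError())
denied = is_lock_contention(OSError(errno.EACCES, "lock held"))
missing = is_lock_contention(OSError(errno.ENOENT, "no such file"))
unrelated = is_lock_contention(ValueError("bad input"))
```

Separating contention from other OSError cases lets the acquire loop retry on the former and surface the latter immediately.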
Content
The lock file holds a JSON record describing the current holder. After the OS lock is acquired, the holder atomically writes:
{
"schema_version": 1,
"pid": 12345,
"started_at": "2026-04-28T10:30:00+00:00",
"host": "robert-mbp.local",
"version": "3.2.0a5"
}
| Field | Type | Source |
|---|---|---|
| schema_version | int | 1 for Tranche 1 |
| pid | int | os.getpid() |
| started_at | ISO-8601 UTC | datetime.now(UTC).isoformat() |
| host | str | socket.gethostname() |
| version | str | importlib.metadata.version("spec-kitty-cli") (with "unknown" fallback) |
Content is written via specify_cli.core.atomic.atomic_write so that readers (such as auth doctor) never observe a partial record.
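To illustrate why readers never see a partial record, here is the common write-to-temp-then-rename pattern. It is a sketch of the general technique, not the body of specify_cli.core.atomic.atomic_write, whose implementation is not shown in this contract; `build_record` and `atomic_write_json` are hypothetical names.

```python
import json
import os
import socket
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_record(version: str = "unknown") -> dict:
    """Assemble the holder record from the field table above."""
    return {
        "schema_version": 1,
        "pid": os.getpid(),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "version": version,
    }


def atomic_write_json(path: Path, record: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # the rename is the single visible step
    except BaseException:
        os.unlink(tmp)
        raise


with tempfile.TemporaryDirectory() as tmp_dir:
    lock_path = Path(tmp_dir) / "auth" / "refresh.lock"
    record = build_record()
    atomic_write_json(lock_path, record)
    round_tripped = json.loads(lock_path.read_text(encoding="utf-8"))
```

A reader that opens the path sees either the previous file or the complete new one, never a half-written JSON document.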
Public Python API
class LockRecord(BaseModel):  # frozen dataclass in implementation
    schema_version: int
    pid: int
    started_at: datetime  # tz-aware UTC
    host: str
    version: str

    @property
    def age_s(self) -> float: ...

    # A method rather than a property: a property cannot accept threshold_s.
    def is_stuck(self, threshold_s: float = 60.0) -> bool: ...

class MachineFileLock:
    """Async context manager. Acquires the OS lock, writes content, yields."""

    def __init__(
        self,
        path: Path,
        *,
        max_hold_s: float = 10.0,         # NFR-002 ceiling
        stale_after_s: float = 60.0,      # adopt-after-stale threshold (R2)
        acquire_timeout_s: float = 10.0,  # bounded wait
    ) -> None: ...

    async def __aenter__(self) -> LockRecord: ...
    async def __aexit__(self, *args) -> None: ...

def read_lock_record(path: Path) -> LockRecord | None:
    """Read the lock record without acquiring the OS lock. Used by auth doctor."""

def force_release(path: Path, *, only_if_age_s: float = 60.0) -> bool:
    """Drop the lock file iff the record is older than `only_if_age_s`. Used by `auth doctor --unstick-lock`."""
Semantics
Acquisition
1. Open lock file for write (create if missing).
2. Loop up to acquire_timeout_s:
   1. Try OS lock. On success, write content via atomic_write, return.
   2. On contention error: read existing record. If record.age_s > stale_after_s, this process may delete the file and retry one more iteration to claim it. Otherwise sleep 100 ms and retry.
3. If acquire_timeout_s elapses, raise LockAcquireTimeout.
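The control flow of the acquisition loop can be sketched independently of the OS primitive by injecting the three operations it depends on. This is a simplified, synchronous illustration with hypothetical parameter names; the real MachineFileLock is async and performs the flock/msvcrt call and the record write itself.

```python
import time
from typing import Callable, Optional


class LockAcquireTimeout(Exception):
    """Raised when the lock cannot be claimed within acquire_timeout_s."""


def acquire(
    try_os_lock: Callable[[], bool],
    holder_age_s: Callable[[], Optional[float]],
    drop_lock_file: Callable[[], None],
    *,
    acquire_timeout_s: float = 10.0,
    stale_after_s: float = 60.0,
    retry_sleep_s: float = 0.1,
) -> None:
    deadline = time.monotonic() + acquire_timeout_s
    while time.monotonic() < deadline:
        if try_os_lock():
            return  # the real helper would write the holder record here
        age = holder_age_s()
        if age is not None and age > stale_after_s:
            drop_lock_file()  # stale holder: clear the file and retry at once
            continue
        time.sleep(retry_sleep_s)
    raise LockAcquireTimeout(f"lock not acquired within {acquire_timeout_s} s")


# Toy holder: a 120 s old lock that becomes free once its file is dropped.
held = {"locked": True}
acquire(
    try_os_lock=lambda: not held["locked"],
    holder_age_s=lambda: 120.0,
    drop_lock_file=lambda: held.update(locked=False),
)
```

With a fresh holder (age below stale_after_s) the same loop sleeps and retries until the deadline, then raises LockAcquireTimeout.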
Release
1. Truncate or remove the content file (best-effort).
2. Release the OS lock.
3. Close the FD.
The OS lock release is unconditional — it always happens via try/finally, even on exception in the protected block.
Hold-ceiling enforcement
The protected block (the body of async with MachineFileLock(...)) must complete within max_hold_s. Callers enforce this with asyncio.wait_for(...) around their work. If the inner work raises asyncio.TimeoutError, the lock is released and the caller propagates the timeout.
This is how NFR-002 ("≤ 10 s lock hold") is enforced.
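A minimal sketch of the caller-side pattern, using asyncio.Lock as a stand-in for MachineFileLock so the example runs without the real module. `run_with_hold_ceiling` is a hypothetical helper name.

```python
import asyncio


async def run_with_hold_ceiling(lock, work, max_hold_s: float):
    """Run `work()` under `lock`, abandoning it if it exceeds max_hold_s."""
    async with lock:
        # wait_for cancels the inner coroutine and raises on timeout;
        # leaving the `async with` block then releases the lock.
        return await asyncio.wait_for(work(), timeout=max_hold_s)


async def demo() -> tuple[bool, bool]:
    lock = asyncio.Lock()  # stand-in for MachineFileLock(...)

    async def slow_refresh() -> None:
        await asyncio.sleep(1.0)

    try:
        await run_with_hold_ceiling(lock, slow_refresh, max_hold_s=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, lock.locked()


timed_out, still_locked = asyncio.run(demo())
```

The timeout propagates to the caller while the context manager's exit path guarantees the lock is no longer held.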
Staleness rule
A lock with record.age_s > stale_after_s (default 60 s) is considered abandoned. Any process attempting to acquire may delete the lock file before retrying. This protects against process-killed-mid-transaction scenarios (R2). The threshold is generous (6× the hold ceiling) so that a slow but legitimate transaction is never preempted.
auth doctor --unstick-lock exposes this to the user: it calls force_release(path, only_if_age_s=stuck_threshold), where stuck_threshold defaults to 60.0, and prints the outcome.
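The age guard in force_release can be sketched as follows. `force_release_sketch` is a hypothetical stand-in that reads only the started_at field; it assumes that an unreadable record is left alone, which the contract does not specify.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def force_release_sketch(path: Path, *, only_if_age_s: float = 60.0) -> bool:
    """Remove the lock file iff its record is older than only_if_age_s."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        started_at = datetime.fromisoformat(data["started_at"])
        age_s = (datetime.now(timezone.utc) - started_at).total_seconds()
    except (OSError, ValueError, KeyError, TypeError):
        return False  # assumption: unreadable records are not touched
    if age_s <= only_if_age_s:
        return False  # fresh lock: never drop a healthy in-flight holder
    path.unlink(missing_ok=True)
    return True


def _write_lock(path: Path, age_s: float) -> None:
    started = datetime.now(timezone.utc) - timedelta(seconds=age_s)
    path.write_text(json.dumps({"started_at": started.isoformat()}), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp_dir:
    fresh, old = Path(tmp_dir) / "fresh.lock", Path(tmp_dir) / "old.lock"
    _write_lock(fresh, 5)
    _write_lock(old, 120)
    fresh_released = force_release_sketch(fresh)
    old_released = force_release_sketch(old)
    fresh_exists, old_exists = fresh.exists(), old.exists()
```

The 5 s and 120 s fixtures mirror the fresh and stuck cases used in the test contracts.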
Failure modes
| Mode | Behavior |
|---|---|
| Lock dir does not exist | Helper creates with parents and 0o700. |
| Lock file is corrupt JSON | read_lock_record() returns None. Acquisition treats as unheld and rewrites on success. |
| Holder process is dead but lock file exists | Helper detects via record.age_s > stale_after_s; staleness rule applies. PID liveness check via psutil.pid_exists is consulted as a faster predicate but does not bypass age check (R7 — older CLI versions may not write the same content). |
| Holder is on a different host (NFS scenario) | record.host != socket.gethostname() triggers a warning surfaced through auth doctor. Lock is still respected on the local host; remote-host stuck locks are user-resolved. |
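The corrupt-record and missing-file rows suggest a reader that never raises. A sketch of that behavior, reduced to the age computation, is shown below; `read_age_s` is a hypothetical helper, and the real read_lock_record returns a full LockRecord rather than a float.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def read_age_s(path: Path, now: Optional[datetime] = None) -> Optional[float]:
    """Return the record's age in seconds, or None if missing or unparseable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        started_at = datetime.fromisoformat(data["started_at"])
    except (OSError, ValueError, KeyError, TypeError):
        return None
    if started_at.tzinfo is None:
        return None  # the contract requires a tz-aware UTC timestamp
    now = now or datetime.now(timezone.utc)
    return (now - started_at).total_seconds()


with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.lock"
    good.write_text(json.dumps({"started_at": "2026-04-28T10:30:00+00:00"}), encoding="utf-8")
    corrupt = Path(tmp_dir) / "corrupt.lock"
    corrupt.write_text("{not json", encoding="utf-8")

    fixed_now = datetime(2026, 4, 28, 10, 32, tzinfo=timezone.utc)
    good_age = read_age_s(good, now=fixed_now)
    corrupt_age = read_age_s(corrupt, now=fixed_now)
    missing_age = read_age_s(Path(tmp_dir) / "absent.lock", now=fixed_now)
```

Passing `now` explicitly keeps the age calculation deterministic in tests.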
Test contract
tests/core/test_file_lock.py:
| Test | Predicate |
|---|---|
| test_acquire_and_release | Single-process happy path; record on disk after acquire, gone after release. |
| test_concurrent_acquire_serialized | Two asyncio.create_task callers serialize through one lock. |
| test_acquire_timeout_raises | When held by a fixture process, second acquire raises LockAcquireTimeout after acquire_timeout_s. |
| test_stale_lock_adopted | Lock file with started_at 120 s ago is adopted on next acquire. |
| test_force_release_only_when_stuck | force_release(only_if_age_s=60) returns False on a fresh lock, True on a 120-s-old lock. |
| test_atomic_content_write | Reader never observes a half-written record (partial-failure injection). |
| test_platform_dispatch | On POSIX, fcntl.flock is invoked; on Windows, msvcrt.locking. (Marked with pytest.mark.skipif.) |