Contracts

auth-doctor.md

Contract — spec-kitty auth doctor

> Implements FR-011, FR-012, FR-013, FR-014, FR-015; NFR-006; C-007, C-008.
> Owned by WP06 (src/specify_cli/cli/commands/_auth_doctor.py and the
> new @app.command() doctor in src/specify_cli/cli/commands/auth.py).

CLI surface

Usage: spec-kitty auth doctor [OPTIONS]

  Diagnose CLI auth and sync-daemon state. Default invocation is read-only.

Options:
  --json                 Emit findings as a JSON document instead of Rich layout.
  --reset                Sweep orphan sync daemons in the reserved port range.
  --unstick-lock         Force-release the machine-wide refresh lock if
                         its age exceeds the stuck threshold.
  --stuck-threshold S    Age (seconds) above which the refresh lock is
                         considered stuck. Default: 60.
  -h, --help             Show this message and exit.

--reset and --unstick-lock are independent flags; passing both runs both repairs in that order. There is no --auto-fix (C-008).

Default invocation (no flags) — schema

Sections rendered in order. Each section has a Rich representation and a JSON representation.

1. Identity

Reuses helpers from _auth_status.py. Renders the same User / User ID / Teams / Auth method block as auth status for authenticated sessions. For unauthenticated state, prints the existing "Not authenticated" message and continues to render diagnostic sections.

2. Tokens

| Field | Source | Rendered as |
|---|---|---|
| Access remaining | session.access_token_expires_at - now() | format_duration(s) (existing helper) |
| Refresh remaining | session.refresh_token_expires_at - now() | format_duration(s), or "server-managed (legacy)" for None |
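
A minimal sketch of how the two "remaining" cells could be derived. `render_remaining` is a hypothetical name, and the `format_duration` below is only a stand-in for the existing helper, whose real output format is not given here. Clamping an already-expired token to zero is also an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def format_duration(seconds: float) -> str:
    """Stand-in for the existing helper (output format is assumed)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


def render_remaining(expires_at: Optional[datetime], now: datetime) -> str:
    """Render a token's remaining lifetime, or the legacy marker for None."""
    if expires_at is None:
        return "server-managed (legacy)"
    remaining = (expires_at - now).total_seconds()
    # Assumption: an expired token is shown as zero rather than negative.
    return format_duration(max(remaining, 0.0))


now = datetime(2026, 4, 28, 10, 30, tzinfo=timezone.utc)
print(render_remaining(now + timedelta(minutes=5), now))
print(render_remaining(None, now))
```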

3. Storage

| Field | Source |
|---|---|
| Backend | format_storage_backend(session.storage_backend) |
| Persisted-vs-in-memory drift | session_id and refresh_token of _storage.read() compared against tm.get_current_session() |

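If drift is detected while a refresh transaction is in flight, surface it as info (not warn): drift is expected during a refresh transaction, just not after one. Drift with no refresh in flight is reported as F-006 (warn).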

4. Refresh Lock

| Field | Source |
|---|---|
| Held? | read_lock_record(REFRESH_LOCK_PATH) is not None |
| Holder PID | record.pid |
| Acquired at | record.started_at |
| Age | record.age_s |
| Stuck? | record.age_s > stuck_threshold |
| Same host? | record.host == socket.gethostname() |

5. Daemon

| Field | Source |
|---|---|
| Active? | get_sync_daemon_status().healthy (existing function) |
| PID | from state file |
| Port | from state file |
| Package version | from /api/health response |
| Protocol version | from /api/health response |

6. Orphans

A table of any orphan daemons (enumerate_orphans()). Empty table if none.

| Column | Source |
|---|---|
| PID | OrphanDaemon.pid |
| Port | OrphanDaemon.port |
| Package version | OrphanDaemon.package_version |

7. Findings & Remediation

Compute the findings list:

| ID | Trigger | Severity | Remediation command |
|---|---|---|---|
| F-001 | No session loaded | critical | spec-kitty auth login |
| F-002 | Orphans present | warn | spec-kitty auth doctor --reset |
| F-003 | Refresh lock stuck (age > stuck_threshold) | critical | spec-kitty auth doctor --unstick-lock |
| F-004 | Daemon running but version mismatches package version | warn | spec-kitty sync restart (existing) |
| F-005 | Daemon expected (rollout enabled) but not running | info | next CLI command will start it |
| F-006 | Persisted/in-memory drift after no in-flight refresh | warn | spec-kitty auth doctor re-run after a CLI command |
| F-007 | Lock holder is on a different host (NFS scenario) | warn | manual investigation; not auto-resolvable |

Each finding renders as: [severity] summary followed by an indented Run: <command> line. When findings is empty, the report ends with No problems detected.
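
The findings computation and rendering can be sketched as follows. This is an illustration only: `Finding`, `compute_findings`, and `render` are hypothetical names, the inputs are simplified to plain values, and only F-001 through F-003 are covered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    id: str
    severity: str      # "critical", "warn", or "info"
    summary: str
    remediation: str


def compute_findings(
    *,
    session_loaded: bool,
    orphan_count: int,
    lock_age_s: Optional[float],
    stuck_threshold: float,
) -> List[Finding]:
    """Evaluate the F-001..F-003 triggers from the table above."""
    findings: List[Finding] = []
    if not session_loaded:
        findings.append(Finding("F-001", "critical", "No session loaded",
                                "spec-kitty auth login"))
    if orphan_count > 0:
        findings.append(Finding("F-002", "warn", "Orphans present",
                                "spec-kitty auth doctor --reset"))
    # Strictly greater than the threshold, matching "age > stuck_threshold".
    if lock_age_s is not None and lock_age_s > stuck_threshold:
        findings.append(Finding("F-003", "critical", "Refresh lock stuck",
                                "spec-kitty auth doctor --unstick-lock"))
    return findings


def render(findings: List[Finding]) -> str:
    """Render each finding as '[severity] summary' plus an indented Run: line."""
    if not findings:
        return "No problems detected."
    lines: List[str] = []
    for finding in findings:
        lines.append(f"[{finding.severity}] {finding.summary}")
        lines.append(f"    Run: {finding.remediation}")
    return "\n".join(lines)
```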

Exit codes

| Exit | Meaning |
|---|---|
| 0 | Report rendered. No critical findings remain after any repairs the user requested. |
| 1 | Report rendered, but at least one critical finding remains. (E.g. auth doctor was run without --unstick-lock while the lock is stuck.) |
| 2 | Internal error (exception during diagnostic gathering). Stack trace printed; report is partial. |

auth doctor is a diagnostic, not a gate — exit 1 is informational, not a CI failure pattern. Scripts that want to fail on critical findings must check the JSON output.
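
As a rough sketch of the exit-code mapping, assuming the post-repair findings are reduced to a list of severity strings (`doctor_exit_code` and its callable argument are hypothetical):

```python
import traceback
from typing import Callable, List


def doctor_exit_code(gather_severities: Callable[[], List[str]]) -> int:
    """Map the outcome of diagnostic gathering to exit code 0, 1, or 2."""
    try:
        severities = gather_severities()
    except Exception:
        # Exit 2: internal error; print the stack trace, report is partial.
        traceback.print_exc()
        return 2
    # Exit 1 only when a critical finding remains; warn/info still exit 0.
    return 1 if "critical" in severities else 0
```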

--json output (machine-readable)

See data-model.md §"DoctorReport JSON schema" for the full shape. The schema is versioned (schema_version: 1) so future tranches can extend it without breaking consumers.

--reset semantics

1. Run the default report once (read-only).
2. If findings contains F-002: call sweep_orphans(enumerate_orphans()).
3. Re-run the report (post-reset) and print the sweep summary (<n> orphans swept, <m> failed).
4. Exit code based on post-reset state.

--reset is a no-op when no orphans are detected.
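
The --reset flow could look roughly like this. The three callables are injected so the sketch stays self-contained; `run_reset` is a hypothetical name, and the sweep result is simplified to a `(swept, failed)` pair of counts rather than the full SweepReport.

```python
from typing import Callable, List, Tuple


def run_reset(
    collect_finding_ids: Callable[[], List[str]],
    enumerate_orphans: Callable[[], list],
    sweep_orphans: Callable[[list], Tuple[int, int]],
) -> Tuple[List[str], str]:
    """Return the post-reset finding IDs and the sweep summary line."""
    finding_ids = collect_finding_ids()            # step 1: read-only report
    swept, failed = 0, 0
    if "F-002" in finding_ids:                     # step 2: sweep only if orphans were found
        swept, failed = sweep_orphans(enumerate_orphans())
    finding_ids = collect_finding_ids()            # step 3: re-run the report
    summary = f"{swept} orphans swept, {failed} failed"
    return finding_ids, summary                    # step 4: caller derives the exit code
```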

--unstick-lock semantics

1. Run the default report once (read-only).
2. If findings contains F-003: call force_release(REFRESH_LOCK_PATH, only_if_age_s=stuck_threshold).
3. Re-run the report and print the unstick outcome.
4. Exit code based on post-unstick state.

--unstick-lock is a no-op when the lock is not stuck. The only_if_age_s parameter prevents the user from accidentally dropping a healthy in-flight lock.

C-007 enforcement (no network calls in default invocation)

tests/auth/test_auth_doctor_offline.py patches httpx.AsyncClient, urllib.request.urlopen, and socket.create_connection with mocks that fail the test if invoked. The test then runs auth doctor with no flags and asserts no patch was triggered.

The local-only probes that enumerate_orphans() performs are 127.0.0.1 TCP connects, which are explicitly allowed (the contract calls them "local" and excludes them from C-007's scope; see C-007 text: "MUST NOT require network access").
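
A sketch of the patching approach, limited to the two standard-library entry points so it runs without httpx installed; the real test also patches httpx.AsyncClient. `assert_no_network` is a hypothetical helper name.

```python
import socket
import urllib.request
from unittest import mock


def assert_no_network(run):
    """Call run() with outbound entry points patched to fail if invoked."""

    def _forbidden(*args, **kwargs):
        raise AssertionError("network call attempted")

    with mock.patch.object(urllib.request, "urlopen", side_effect=_forbidden), \
         mock.patch.object(socket, "create_connection", side_effect=_forbidden):
        return run()
```

Code that connects through `socket.socket(...).connect_ex(...)` is not intercepted by these patches, which is consistent with the loopback probes being allowed.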

NFR-006 enforcement (≤3 s time-to-actionable)

auth doctor uses tight per-probe timeouts. A 0.5 s timeout on each daemon health probe would, on its own, give a worst case of 50 × 0.5 s = 25 s with no daemons running, which violates NFR-006. Mitigation: a connect_ex attempt with a 50 ms timeout runs before the HTTP probe, so closed ports are filtered in <1 s total. This brings the typical run to <300 ms and the maximum to <3 s in adversarial cases (every port answering slowly).

tests/auth/test_auth_doctor_report.py::test_runs_under_three_seconds asserts a 3-second wall-clock ceiling under realistic fixture state.
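
The connect-before-probe mitigation can be sketched as below. `port_is_open` and `open_ports` are hypothetical names; only the 50 ms connect timeout and the 50-port range come from this contract. If every connect attempt timed out, the pre-filter alone would take 50 × 50 ms = 2.5 s.

```python
import socket
from typing import List


def port_is_open(port: int, timeout_s: float = 0.05) -> bool:
    """Cheap loopback pre-filter: True only if a TCP connect succeeds in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex(("127.0.0.1", port)) == 0


def open_ports(start: int = 9400, count: int = 50) -> List[int]:
    """Ports in [start, start + count) worth the slower HTTP health probe."""
    return [port for port in range(start, start + count) if port_is_open(port)]
```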

Test contract

tests/auth/test_auth_doctor_report.py

| Test | Predicate |
|---|---|
| test_renders_authenticated_no_findings | Healthy state ⇒ all sections render; findings empty; exit 0. |
| test_renders_unauthenticated | No session ⇒ F-001 critical; report still complete; exit 1. |
| test_renders_orphan_finding | One orphan present ⇒ F-002 warn; report completes; exit 0 (warn is not critical). |
| test_renders_stuck_lock_finding | Lock record 120 s old ⇒ F-003 critical; exit 1. |
| test_renders_legacy_session | refresh_token_expires_at is None ⇒ "server-managed (legacy)" string; no extra finding. |
| test_runs_under_three_seconds | 50-port scan + healthy state completes in <3 s. |
| test_json_output_schema | --json output validates against the schema in data-model.md §5. |

tests/auth/test_auth_doctor_repair.py

| Test | Predicate |
|---|---|
| test_reset_sweeps_orphans | Two daemons, one orphan ⇒ --reset invokes sweep_orphans; orphan terminated. |
| test_reset_noop_when_no_orphans | No orphans ⇒ --reset does not call sweep_orphans. |
| test_unstick_drops_old_lock | Lock 120 s old ⇒ --unstick-lock removes lock file. |
| test_unstick_preserves_fresh_lock | Lock 5 s old ⇒ --unstick-lock is a no-op; lock still held. |
| test_combined_flags_run_both | --reset --unstick-lock runs both repairs. |

tests/auth/test_auth_doctor_offline.py

| Test | Predicate |
|---|---|
| test_no_outbound_http | Default invocation makes zero httpx/urllib outbound calls (only 127.0.0.1 connects allowed). |
| test_no_state_mutation_default | After default invocation: no files removed, no processes terminated, no locks released. |

daemon-singleton.md

Contract — Daemon Convergence and Orphan Sweep

> Implements FR-008, FR-009, FR-010.
> Owned by WP04 (src/specify_cli/sync/daemon.py modifications) and
> WP05 (src/specify_cli/sync/orphan_sweep.py new module). Consumed by
> WP06 (auth doctor).

Reserved port range (existing, unchanged)

DAEMON_PORT_START = 9400        # src/specify_cli/sync/daemon.py
DAEMON_PORT_MAX_ATTEMPTS = 50   # i.e. 9400 .. 9449

A "Spec Kitty sync daemon port" is any TCP port in [9400, 9450).

State file (existing, unchanged)

~/.spec-kitty/sync-daemon (POSIX) or %LOCALAPPDATA%\spec-kitty\daemon\sync-daemon (Windows). Plain text, four lines:

http://127.0.0.1:9400      # url
9400                        # port
<bearer-token-hex>          # token (POST auth for trigger/publish/shutdown)
<pid>                       # owner pid

Atomically written by _write_daemon_file. No format change.
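
For illustration, a parser for the four-line format might look like this. It is not the existing `_parse_daemon_file`, whose signature is not shown here; `DaemonState` and `parse_state_text` are hypothetical, and the sketch assumes the `# ...` annotations above are documentation only and do not appear in the real file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DaemonState:
    url: str
    port: int
    token: str
    pid: int


def parse_state_text(text: str) -> Optional[DaemonState]:
    """Parse url / port / token / pid; return None for malformed content."""
    lines = [line.strip() for line in text.splitlines()]
    if len(lines) != 4:
        return None
    url, port, token, pid = lines
    try:
        return DaemonState(url=url, port=int(port), token=token, pid=int(pid))
    except ValueError:
        # Non-numeric port or pid: treat the whole record as malformed.
        return None
```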

Singleton rule

The daemon whose port matches the recorded port in DAEMON_STATE_FILE is the user-level singleton. Only one daemon process can be the singleton at any moment.

Ownership transitions happen only via _ensure_sync_daemon_running_locked, which is gated by the existing DAEMON_LOCK_FILE (~/.spec-kitty/sync-daemon.lock, fcntl.flock / msvcrt.locking). No new lock is introduced for daemon ownership.

Self-retirement tick (NEW — WP04)

DAEMON_TICK_SECONDS: int = 30

Inside run_sync_daemon, a daemon-side scheduled task fires every DAEMON_TICK_SECONDS. On each tick:

1. Read DAEMON_STATE_FILE via the existing _parse_daemon_file.
2. If the parsed port equals self.port: continue running (this process is the singleton).
3. If the parsed port differs from self.port and the parsed record looks valid (port and PID present, PID alive): initiate server.shutdown() and exit cleanly.
4. If the state file is missing or malformed: continue running but do not rewrite the state file from this code path (state file is owned by _ensure_sync_daemon_running_locked only).
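
The per-tick decision above can be expressed as a pure function. `should_retire` is a hypothetical name, and the parsed record is reduced to a `(port, pid)` pair. The steps do not say what happens when the recorded PID is dead; this sketch assumes the daemon keeps running in that case.

```python
from typing import Callable, Optional, Tuple


def should_retire(
    parsed: Optional[Tuple[Optional[int], Optional[int]]],
    my_port: int,
    pid_alive: Callable[[int], bool],
) -> bool:
    """Return True only when another, live daemon owns the state file."""
    if parsed is None:
        return False                  # step 4: state file missing or malformed
    port, pid = parsed
    if port is None or pid is None:
        return False                  # step 4: record incomplete
    if port == my_port:
        return False                  # step 2: this process is the singleton
    return pid_alive(pid)             # step 3: retire only if the owner is alive
```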

Implementation: a threading.Timer-style daemon thread inside run_sync_daemon is sufficient; the existing BaseHTTPRequestHandler loop continues to handle requests in parallel.

Implementation note

run_sync_daemon becomes:

def run_sync_daemon(port: int, daemon_token: str | None) -> None:
    from specify_cli.sync.runtime import get_runtime
    get_runtime()
    handler_class = type(
        "SyncDaemonRouter",
        (SyncDaemonHandler,),
        {"daemon_token": daemon_token},
    )
    server = HTTPServer(("127.0.0.1", port), handler_class)

    # NEW — WP04: self-retirement tick.
    tick_thread = _start_self_check_tick(server, my_port=port)
    try:
        server.serve_forever()
    finally:
        tick_thread.cancel()
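
`_start_self_check_tick` is not defined in this contract. One possible shape, assuming it returns an object exposing `cancel()` as the call site above requires, is a small daemon thread driven by an event. `SelfCheckTick` and its constructor arguments are hypothetical.

```python
import threading
from typing import Callable


class SelfCheckTick(threading.Thread):
    """Periodically evaluates a retirement check until cancelled."""

    def __init__(
        self,
        interval_s: float,
        check_retire: Callable[[], bool],
        on_retire: Callable[[], None],
    ) -> None:
        super().__init__(daemon=True)
        self._interval_s = interval_s
        self._check_retire = check_retire
        self._on_retire = on_retire
        self._cancelled = threading.Event()

    def cancel(self) -> None:
        """Stop ticking; safe to call from the serving thread's finally block."""
        self._cancelled.set()

    def run(self) -> None:
        # Event.wait returns False on timeout (time to tick), True once cancelled.
        while not self._cancelled.wait(self._interval_s):
            if self._check_retire():
                self._on_retire()   # e.g. server.shutdown, called off the serving thread
                return
```

Calling `server.shutdown()` from this separate thread matters: the standard-library `shutdown()` blocks until `serve_forever()` returns, so it must not run on the thread that is serving.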

Orphan identification (NEW — WP05)

@dataclass(frozen=True)
class OrphanDaemon:
    pid: int | None
    port: int
    package_version: str | None
    protocol_version: int | None


def enumerate_orphans() -> list[OrphanDaemon]: ...

Algorithm:

1. Read DAEMON_STATE_FILE once, capture current_port (or None).
2. For each port in [9400, 9450):
   1. Open a TCP probe socket; if connect fails, skip.
   2. Issue GET http://127.0.0.1:{port}/api/health with timeout=0.5.
   3. Parse response JSON. If protocol_version and package_version keys are both present, this is a Spec Kitty daemon.
   4. If port == current_port: skip (singleton, not an orphan).
   5. Otherwise: append OrphanDaemon with PID looked up via psutil.net_connections() filter laddr.port == port and status == "LISTEN".

Anything that does not respond, or whose response is missing either required key, is not classified as a Spec Kitty daemon. Sweep never touches it.
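
The classification logic can be sketched with the network probe injected as a callable, so the example needs neither a running daemon nor psutil. `is_spec_kitty_daemon`, `orphan_ports`, and `probe_health` are hypothetical names, and the PID lookup is omitted.

```python
from typing import Callable, Dict, Iterable, List, Optional

Health = Optional[Dict[str, object]]


def is_spec_kitty_daemon(health: Health) -> bool:
    """A response counts only if both identifying keys are present."""
    return (
        health is not None
        and "protocol_version" in health
        and "package_version" in health
    )


def orphan_ports(
    ports: Iterable[int],
    current_port: Optional[int],
    probe_health: Callable[[int], Health],
) -> List[int]:
    """Ports that host a Spec Kitty daemon other than the recorded singleton."""
    orphans: List[int] = []
    for port in ports:
        health = probe_health(port)   # None: closed port or non-JSON response
        if not is_spec_kitty_daemon(health):
            continue                  # never classified, never swept
        if port == current_port:
            continue                  # the singleton is not an orphan
        orphans.append(port)
    return orphans
```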

Orphan sweep (NEW — WP05)

@dataclass(frozen=True)
class SweepReport:
    swept: list[OrphanDaemon]
    failed: list[tuple[OrphanDaemon, str]]  # (orphan, reason)
    duration_s: float


def sweep_orphans(orphans: list[OrphanDaemon], *, timeout_s: float = 5.0) -> SweepReport: ...

Per orphan, escalate in order until the port stops listening:

1. Graceful HTTP shutdown: POST http://127.0.0.1:{port}/api/shutdown without a token. Today this returns 403. Pre-existing daemons stay alive; that's fine, we escalate.
2. SIGTERM via psutil: psutil.Process(orphan.pid).terminate(). Wait up to 1 s for port to free.
3. SIGKILL via psutil: psutil.Process(orphan.pid).kill(). Wait up to 1 s.
4. State-file cleanup: if a state file points at the orphan port, remove it.

Each step's failure (no PID, AccessDenied, port still listening) is recorded in SweepReport.failed.

Total bounded duration: timeout_s × len(orphans) worst case. Default timeout_s=5.0.
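
A sketch of the escalation ladder for a single orphan, with each step injected as a callable so that no real process is signalled. `escalate` is a hypothetical helper; the waits of up to 1 s and the state-file cleanup are left out.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def escalate(
    steps: Sequence[Step],
    port_listening: Callable[[], bool],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Run steps in order until the port stops listening.

    Returns the name of the step that freed the port (or None) and the
    list of (step, reason) failures recorded along the way.
    """
    failures: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:              # e.g. AccessDenied, missing PID
            failures.append((name, repr(exc)))
            continue
        if not port_listening():
            return name, failures
        failures.append((name, "port still listening"))
    return None, failures
```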

Test contract

tests/sync/test_daemon_self_retirement.py

| Test | Predicate |
|---|---|
| test_self_retires_when_port_mismatch | Start daemon on port A; write state file pointing at port B; daemon exits within 2 ticks. |
| test_continues_when_port_matches | Start daemon on port A; state file points at port A; daemon stays alive over 3 ticks. |
| test_continues_when_state_file_missing | Start daemon; remove state file; daemon stays alive (does not self-rewrite). |
| test_continues_when_state_file_malformed | Start daemon; corrupt state file; daemon stays alive. |

tests/sync/test_orphan_sweep.py

| Test | Predicate |
|---|---|
| test_enumerate_finds_singleton_only | One daemon on 9400; state file points at 9400; enumerate_orphans() returns []. |
| test_enumerate_finds_orphan | Two daemons on 9400 and 9401; state file points at 9400; enumerate_orphans() returns one entry on 9401. |
| test_enumerate_skips_non_spec_kitty | Plain HTTP server on 9402 returning 200 without our keys; not classified as orphan. |
| test_enumerate_skips_closed_ports | No process on 9402; not in the result. |
| test_sweep_terminates_orphan | Orphan running; after sweep_orphans(), port is closed and report swept lists it. |
| test_sweep_does_not_touch_singleton | Singleton + orphan; only the orphan is terminated. |
| test_sweep_records_failure_on_access_denied | Orphan PID exists but terminate() raises AccessDenied; recorded in failed. |

refresh-lock.md

Contract — Machine-wide Refresh Lock

> Implements FR-001, FR-002, FR-016, FR-017, FR-018; NFR-002, NFR-008.
> Owned by WP01 (src/specify_cli/core/file_lock.py) and consumed by
> WP02 (src/specify_cli/auth/refresh_transaction.py) and WP06
> (src/specify_cli/cli/commands/_auth_doctor.py).

Path

| Platform | Path |
|---|---|
| POSIX (macOS, Linux) | ~/.spec-kitty/auth/refresh.lock |
| Windows | %LOCALAPPDATA%\spec-kitty\auth\refresh.lock (resolved via specify_cli.paths.get_runtime_root()) |

The directory is created (with parents) on first acquisition. Permissions default to user-only (0o700 directory, 0o600 file) on POSIX.

OS-level primitive

| Platform | Call |
|---|---|
| POSIX | fcntl.flock(fd, LOCK_EX \| … |
| Windows | msvcrt.locking(fd, LK_NBLCK, 1) to acquire; msvcrt.locking(fd, LK_UNLCK, 1) to release |

Both calls are non-blocking. Contention errors (BlockingIOError, EACCES, EAGAIN, EDEADLK) are detected via the _is_daemon_lock_contention predicate (lifted from sync/daemon.py into core/file_lock.py).
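
A POSIX-only sketch of a non-blocking try-lock. It assumes the flock call combines `LOCK_EX` with `LOCK_NB`, as "non-blocking" implies, and that release uses `LOCK_UN`. `try_lock`, `unlock`, and `demo` are hypothetical names, and the errno check only approximates `_is_daemon_lock_contention`.

```python
import errno
import fcntl
import os
import tempfile
from typing import Tuple


def try_lock(fd: int) -> bool:
    """Attempt an exclusive, non-blocking flock; False means contention."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:  # BlockingIOError is a subclass of OSError
        if exc.errno in (errno.EACCES, errno.EAGAIN, errno.EDEADLK):
            return False
        raise
    return True


def unlock(fd: int) -> None:
    fcntl.flock(fd, fcntl.LOCK_UN)


def demo() -> Tuple[bool, bool, bool]:
    """Two descriptors on one file: the second is refused until the first unlocks."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "refresh.lock")
        fd_a = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        fd_b = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            first = try_lock(fd_a)
            second = try_lock(fd_b)
            unlock(fd_a)
            third = try_lock(fd_b)
            return first, second, third
        finally:
            os.close(fd_a)
            os.close(fd_b)
```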

Content

The lock file holds a JSON record describing the current holder. After the OS lock is acquired, the holder atomically writes:

{
  "schema_version": 1,
  "pid": 12345,
  "started_at": "2026-04-28T10:30:00+00:00",
  "host": "robert-mbp.local",
  "version": "3.2.0a5"
}

| Field | Type | Source |
|---|---|---|
| schema_version | int | 1 for Tranche 1 |
| pid | int | os.getpid() |
| started_at | ISO-8601 UTC | datetime.now(UTC).isoformat() |
| host | str | socket.gethostname() |
| version | str | importlib.metadata.version("spec-kitty-cli") (with "unknown" fallback) |

Content is written via specify_cli.core.atomic.atomic_write so that readers (such as auth doctor) never observe a partial record.
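
To illustrate why readers never see a partial record, here is a stand-in for the atomic write using the common temp-file-plus-rename pattern. Whether `specify_cli.core.atomic.atomic_write` works this way is an assumption; `build_record` and `write_record_atomically` are hypothetical names.

```python
import json
import os
import socket
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_record(version: str = "unknown") -> dict:
    """Assemble the holder record with the fields listed above."""
    return {
        "schema_version": 1,
        "pid": os.getpid(),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        "version": version,
    }


def write_record_atomically(path: Path, record: dict) -> None:
    """Write to a sibling temp file, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```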

Public Python API

class LockRecord(BaseModel):  # frozen dataclass in implementation
    schema_version: int
    pid: int
    started_at: datetime  # tz-aware UTC
    host: str
    version: str

    @property
    def age_s(self) -> float: ...

    # Plain method, not a property: a property getter cannot take threshold_s.
    def is_stuck(self, threshold_s: float = 60.0) -> bool: ...


class MachineFileLock:
    """Async context manager. Acquires the OS lock, writes content, yields."""
    def __init__(
        self,
        path: Path,
        *,
        max_hold_s: float = 10.0,       # NFR-002 ceiling
        stale_after_s: float = 60.0,    # adopt-after-stale threshold (R2)
        acquire_timeout_s: float = 10.0,  # bounded wait
    ) -> None: ...

    async def __aenter__(self) -> LockRecord: ...
    async def __aexit__(self, *args) -> None: ...


def read_lock_record(path: Path) -> LockRecord | None:
    """Read the lock record without acquiring the OS lock. Used by auth doctor."""

def force_release(path: Path, *, only_if_age_s: float = 60.0) -> bool:
    """Drop the lock file iff the record is older than `only_if_age_s`. Used by `auth doctor --unstick-lock`."""
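
A minimal sketch of how the read-side pieces of this API could behave, using a frozen dataclass and the standard library only. It treats a naive (timezone-less) `started_at` as malformed and makes `force_release` return False when the record is unreadable; both choices are assumptions, not stated behaviour.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class LockRecord:
    schema_version: int
    pid: int
    started_at: datetime  # tz-aware UTC
    host: str
    version: str

    @property
    def age_s(self) -> float:
        return (datetime.now(timezone.utc) - self.started_at).total_seconds()

    def is_stuck(self, threshold_s: float = 60.0) -> bool:
        return self.age_s > threshold_s


def read_lock_record(path: Path) -> Optional[LockRecord]:
    """Parse the record without taking the OS lock; None if missing or corrupt."""
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
        started_at = datetime.fromisoformat(raw["started_at"])
        if started_at.tzinfo is None:
            return None
        return LockRecord(
            schema_version=int(raw["schema_version"]),
            pid=int(raw["pid"]),
            started_at=started_at,
            host=str(raw["host"]),
            version=str(raw["version"]),
        )
    except (OSError, ValueError, KeyError, TypeError):
        return None


def force_release(path: Path, *, only_if_age_s: float = 60.0) -> bool:
    """Remove the lock file only when its record is older than the threshold."""
    record = read_lock_record(path)
    if record is None or record.age_s <= only_if_age_s:
        return False
    path.unlink(missing_ok=True)
    return True
```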

Semantics

Acquisition

1. Open lock file for write (create if missing).
2. Loop up to acquire_timeout_s:
   1. Try OS lock. On success, write content via atomic_write, return.
   2. On contention error: read existing record. If record.age_s > stale_after_s, this process may delete the file and retry one more iteration to claim it. Otherwise sleep 100 ms and retry.
3. If acquire_timeout_s elapses, raise LockAcquireTimeout.
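
The acquisition loop above, sketched synchronously with every side effect injected. The real API is an async context manager; `acquire` and its callable parameters are hypothetical, and the injectable `clock` and `sleep` exist only to make the sketch easy to exercise.

```python
import time
from typing import Callable, Optional


class LockAcquireTimeout(Exception):
    """Raised when the lock cannot be claimed within acquire_timeout_s."""


def acquire(
    try_os_lock: Callable[[], bool],
    read_age_s: Callable[[], Optional[float]],
    remove_lock_file: Callable[[], None],
    write_record: Callable[[], None],
    *,
    stale_after_s: float = 60.0,
    acquire_timeout_s: float = 10.0,
    poll_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    deadline = clock() + acquire_timeout_s
    while clock() < deadline:
        if try_os_lock():
            write_record()            # holder record is written after the OS lock
            return
        age = read_age_s()            # None when the record is missing or corrupt
        if age is not None and age > stale_after_s:
            remove_lock_file()        # abandoned lock: delete, then retry at once
            continue
        sleep(poll_s)                 # healthy holder: back off 100 ms
    raise LockAcquireTimeout(f"not acquired within {acquire_timeout_s} s")
```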

Release

1. Truncate or remove the content file (best-effort).
2. Release the OS lock.
3. Close the FD.

The OS lock release is unconditional — it always happens via try/finally, even on exception in the protected block.

Hold-ceiling enforcement

The protected block (the body of async with MachineFileLock(...)) must complete within max_hold_s. Callers enforce this with asyncio.wait_for(...) around their work. If the inner work raises asyncio.TimeoutError, the lock is released and the caller propagates the timeout.

This is how NFR-002 ("≤ 10 s lock hold") is enforced.
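
The caller-side pattern can be sketched with a stand-in lock. `DemoLock` and `guarded` are hypothetical; the point is only that `asyncio.wait_for` bounds the protected work and that `__aexit__` still runs, releasing the lock, when the timeout propagates.

```python
import asyncio
from typing import Awaitable, Callable


class DemoLock:
    """Stand-in for MachineFileLock that only records whether it was released."""

    def __init__(self) -> None:
        self.released = False

    async def __aenter__(self) -> "DemoLock":
        return self

    async def __aexit__(self, *exc_info) -> None:
        # Returning None lets any exception, including the timeout, propagate.
        self.released = True


async def guarded(
    lock: DemoLock,
    work: Callable[[], Awaitable[object]],
    max_hold_s: float,
) -> object:
    async with lock:
        # The hold ceiling is enforced here, around the protected work.
        return await asyncio.wait_for(work(), timeout=max_hold_s)
```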

Staleness rule

A lock with record.age_s > stale_after_s (default 60 s) is considered abandoned. Any process attempting to acquire may delete the lock file before retrying. This protects against process-killed-mid-transaction scenarios (R2). The threshold is generous (6× the hold ceiling) so that a slow but legitimate transaction is never preempted.

auth doctor --unstick-lock exposes this to the user: it calls force_release(path, only_if_age_s=60.0) and prints the outcome.

Failure modes

| Mode | Behavior |
|---|---|
| Lock dir does not exist | Helper creates with parents and 0o700. |
| Lock file is corrupt JSON | read_lock_record() returns None. Acquisition treats as unheld and rewrites on success. |
| Holder process is dead but lock file exists | Helper detects via record.age_s > stale_after_s; staleness rule applies. PID liveness check via psutil.pid_exists is consulted as a faster predicate but does not bypass age check (R7: older CLI versions may not write the same content). |
| Holder is on a different host (NFS scenario) | record.host != socket.gethostname() triggers a warning surfaced through auth doctor. Lock is still respected on the local host; remote-host stuck locks are user-resolved. |

Test contract

tests/core/test_file_lock.py:

| Test | Predicate |
|---|---|
| test_acquire_and_release | Single-process happy path; record on disk after acquire, gone after release. |
| test_concurrent_acquire_serialized | Two asyncio.create_task callers serialize through one lock. |
| test_acquire_timeout_raises | When held by a fixture process, second acquire raises LockAcquireTimeout after acquire_timeout_s. |
| test_stale_lock_adopted | Lock file with started_at 120 s ago is adopted on next acquire. |
| test_force_release_only_when_stuck | force_release(only_if_age_s=60) returns False on a fresh lock, True on a 120-s-old lock. |
| test_atomic_content_write | Reader never observes a half-written record (partial-failure injection). |
| test_platform_dispatch | On POSIX, fcntl.flock is invoked; on Windows, msvcrt.locking. (Marked with pytest.mark.skipif.) |