Spec: Sync Rejection Classification And Queue Retry Hygiene
Purpose (TL;DR)
Stop incrementing retry_count for batch-level auth/teamspace failures. After the private-Teamspace ingress hardening, the typical shared-team 403 path is fixed, but if a POST still returns 401/403 (auth_expired, unauthorized, or direct_ingress_missing_private_team) after events were drained from the queue, queue.process_batch_results() still flags those events as rejected and bumps retry_count. These are batch-level (not per-event) failures: the server never inspected individual rows, so per-event retry attribution is incorrect and eventually poisons the queue.
Context
GitHub issue: Priivacy-ai/spec-kitty#889.
Today, in src/specify_cli/sync/batch.py, when a batch POST fails with HTTP 401/403/5xx the CLI walks every drained event and records:
BatchEventResult(
    event_id=...,
    status="rejected",
    error=...,
    error_category=<auth_expired | unauthorized | direct_ingress_missing_private_team | server_error>,
)
It then calls queue.process_batch_results(result.event_results), which in queue.py runs:
UPDATE queue SET retry_count = retry_count + 1
WHERE event_id IN (...)
That is fine for events the server actually evaluated and refused (per-event rejections returned in a 200 response body), but it is wrong for:
| Category | Reality |
|---|---|
| auth_expired (401) | Token issue; events were never adjudicated |
| unauthenticated | Same as above |
| unauthorized (403, not teamspace) | Permission issue; server didn't reach event validation |
| direct_ingress_missing_private_team (403) | Private Teamspace not provisioned for this user |
| server_error (5xx) | Server fault; events were not evaluated |
| retryable_transport (timeout, connection) | Network fault; same as above |
For these categories, the queue should treat the events as non-mutating — do not bump retry_count. Events must remain durably queued for a later drain once the operator fixes auth, the teamspace is provisioned, or the server recovers.
failed_permanent (e.g. oversized event) and rejected (per-event 200-response rejection because the server actually validated and refused that row) keep their existing behavior.
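The split described above can be expressed as a small lookup. This is a minimal sketch rather than the code in batch.py or queue.py: BATCH_LEVEL_CATEGORIES, is_batch_level, and should_bump_retry_count are hypothetical names introduced only for illustration.

```python
from typing import Optional

# Hypothetical constant: categories where the server never adjudicated
# individual events, taken from the table in this spec.
BATCH_LEVEL_CATEGORIES = frozenset({
    "auth_expired",
    "unauthenticated",
    "unauthorized",
    "direct_ingress_missing_private_team",
    "server_error",
    "retryable_transport",
})


def is_batch_level(error_category: Optional[str]) -> bool:
    """Return True when the failure applies to the whole batch."""
    return error_category in BATCH_LEVEL_CATEGORIES


def should_bump_retry_count(status: str, error_category: Optional[str]) -> bool:
    """Only per-event rejections the server actually evaluated advance retry_count."""
    return status == "rejected" and not is_batch_level(error_category)
```

Under this sketch, a result recorded as rejected with category auth_expired would not advance retry_count, while a rejected result with no batch-level category would.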
In Scope
- Classify the following categories as batch-level / non-mutating: auth_expired, unauthenticated, unauthorized, direct_ingress_missing_private_team, server_error, retryable_transport.
- Extend BatchEventResult to carry a new status "failed_transient" (or equivalent) so process_batch_results can distinguish per-event content rejections (rejected, retry_count bumped) from batch-level transient failures (failed_transient, retry_count untouched).
- Update every call site in batch.py that records status="rejected" with one of the batch-level categories to use the new transient status instead.
- Update process_batch_results so:
  - success/duplicate/failed_permanent → DELETE from queue.
  - rejected → bump retry_count (true per-event content rejection).
  - failed_transient → no mutation (leave row untouched for next drain).
- Add focused regression tests under tests/sync/ proving:
  - 401 batch POST: retry_count does not advance for queued events.
  - 403 with direct_ingress_missing_private_team: retry_count unchanged.
  - 403 generic (unauthorized): retry_count unchanged.
  - 5xx: retry_count unchanged.
  - 200 response with per-event rejections: retry_count still increments (existing behavior preserved).
- Keep operator-facing stdout/log output clear and unchanged in tone; only the queue-mutation semantics change.
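The three-way mutation rule for process_batch_results could look roughly like the following. This is a sketch under stated assumptions: the BatchEventResult dataclass is a simplified stand-in with only the fields used here, the function takes a connection argument for self-containment (the real method lives on the queue object), and the table schema is assumed from the UPDATE statement quoted in Context.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BatchEventResult:
    # Simplified stand-in; the real class carries more fields.
    event_id: str
    status: str  # success | duplicate | failed_permanent | rejected | failed_transient
    error_category: Optional[str] = None


def process_batch_results(conn: sqlite3.Connection,
                          results: Iterable[BatchEventResult]) -> None:
    """Delete settled rows, bump per-event rejections, leave transient rows alone."""
    delete_ids, bump_ids = [], []
    for result in results:
        if result.status in ("success", "duplicate", "failed_permanent"):
            delete_ids.append((result.event_id,))
        elif result.status == "rejected":
            bump_ids.append((result.event_id,))
        # failed_transient: intentionally no mutation; the row waits for the next drain.
    with conn:  # commit both statements together
        conn.executemany("DELETE FROM queue WHERE event_id = ?", delete_ids)
        conn.executemany(
            "UPDATE queue SET retry_count = retry_count + 1 WHERE event_id = ?",
            bump_ids,
        )


# Usage against an in-memory queue with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue (event_id TEXT PRIMARY KEY, "
    "retry_count INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO queue (event_id) VALUES (?)", [("a",), ("b",), ("c",)])
process_batch_results(conn, [
    BatchEventResult("a", "success"),
    BatchEventResult("b", "rejected"),
    BatchEventResult("c", "failed_transient", "auth_expired"),
])
rows = dict(conn.execute("SELECT event_id, retry_count FROM queue"))
# rows == {"b": 1, "c": 0}: "a" was deleted, "b" was bumped, "c" is untouched.
```

The same in-memory setup is one possible shape for the regression tests listed above, since it lets each test assert the post-call retry_count directly.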
Out of Scope
- Changing categorization keyword lists or summary formatting.
- Daemon backoff/scheduling for transient failures (the queue stays as-is; the daemon will retry naturally on its next tick).
- Body queue (body_queue.py): different retry path, not in this mission.
Acceptance Criteria
1. New tests under tests/sync/ cover all categories above. Each asserts both BatchEventResult.error_category and the post-call retry_count for the affected queue rows.
2. SPEC_KITTY_ENABLE_SAAS_SYNC=1 uv run pytest tests/sync/ -q passes for the tests we add and does not regress any tests previously passing (the seven pre-existing, infrastructure-related failures remain out of scope).
3. process_batch_results no longer bumps retry_count for batch-level auth / teamspace / 5xx / transport failures.
4. Existing per-event content-rejection semantics (status rejected from the 200-response body) continue to bump retry_count exactly as before.
5. Operator-facing summary lines (e.g. format_sync_summary) still surface the categorized failure clearly.
Functional Requirements
- FR-1 BatchEventResult MUST distinguish per-event content rejection (rejected) from batch-level transient failure (failed_transient).
- FR-2 OfflineQueue.process_batch_results MUST NOT bump retry_count for failed_transient results.
- FR-3 Batch HTTP 401 responses MUST classify every drained event as failed_transient with category auth_expired.
- FR-4 Batch HTTP 403 responses where the body matches direct_ingress_missing_private_team MUST classify every drained event as failed_transient with that category.
- FR-5 Batch HTTP 403 responses that do NOT match teamspace messaging MUST classify events as failed_transient with category unauthorized.
- FR-6 Batch HTTP 5xx responses MUST classify events as failed_transient with category server_error.
- FR-7 Timeout / connection / network failures MUST classify events as failed_transient with category retryable_transport.
- FR-8 The "skipped: no Private Teamspace" pre-flight path (_team_slug returns None) MUST classify events as failed_transient with category direct_ingress_missing_private_team.
- FR-9 Per-event content rejections from a 200 response MUST continue to produce rejected results that bump retry_count.
- FR-10 failed_permanent semantics (oversized events) remain unchanged.
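FR-3 through FR-7 amount to a mapping from the failed POST to a (status, error_category) pair. The sketch below illustrates that mapping only. classify_batch_failure is a hypothetical helper, None stands in for a timeout or connection error, and the substring check on the body is an assumption; the existing categorization keyword lists are out of scope and are not reproduced here.

```python
from typing import Optional, Tuple

TEAMSPACE_CATEGORY = "direct_ingress_missing_private_team"


def classify_batch_failure(status_code: Optional[int], body: str = "") -> Tuple[str, str]:
    """Map a failed batch POST to (status, error_category).

    status_code=None represents a timeout or connection failure (FR-7).
    """
    if status_code is None:
        return "failed_transient", "retryable_transport"
    if status_code == 401:
        return "failed_transient", "auth_expired"  # FR-3
    if status_code == 403:
        # Assumption: the teamspace case is recognisable from the body text.
        if TEAMSPACE_CATEGORY in body:
            return "failed_transient", TEAMSPACE_CATEGORY  # FR-4
        return "failed_transient", "unauthorized"  # FR-5
    if 500 <= status_code <= 599:
        return "failed_transient", "server_error"  # FR-6
    # Anything else (including 200 with per-event results) is not batch-level.
    raise ValueError(f"not a batch-level failure: HTTP {status_code}")
```

Every branch returns failed_transient, which is what keeps retry_count untouched downstream; per-event rejected results from a 200 body never pass through this path.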
Non-Functional Requirements
- NFR-1 No new dependencies.
- NFR-2 Changes confined to src/specify_cli/sync/batch.py, src/specify_cli/sync/queue.py, and tests/sync/.
- NFR-3 Operator-facing stdout in format_sync_summary must continue to list categories and counts; transient and content-rejection counts may both appear under failed_results for summary purposes.
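NFR-3 allows transient and content-rejection results to be counted together by category. The following is a minimal sketch of that counting, not the actual format_sync_summary output: summarize_failures is a hypothetical helper, the output format is invented for illustration, and schema_invalid is a made-up per-event category.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_failures(failed_results: Iterable[Tuple[str, str]]) -> str:
    """Count (status, error_category) pairs by category, ignoring status."""
    counts = Counter(category for _status, category in failed_results)
    # Sort by category name so the line is stable across runs.
    return ", ".join(f"{category}: {n}" for category, n in sorted(counts.items()))


summary = summarize_failures([
    ("failed_transient", "auth_expired"),
    ("failed_transient", "auth_expired"),
    ("rejected", "schema_invalid"),  # hypothetical per-event category
])
# summary == "auth_expired: 2, schema_invalid: 1"
```

Because the count is keyed on category rather than status, introducing failed_transient does not by itself change what the operator sees.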
Risks
- A consumer that reads BatchEventResult.status and expects only the existing three values could break. Mitigation: the only consumer in-tree is process_batch_results plus formatting helpers; both are updated in this mission.
- If a future caller forgets to call process_batch_results, transient events will simply be drained from local memory but stay in SQLite, same as today.