Feature Specification: Orchestrator End-to-End Testing Suite

Feature Branch: 021-orchestrator-end-to-end-testing-suite
Created: 2026-01-19
Status: Draft
Input: Comprehensive end-to-end testing for the orchestrator (feature 020) covering all agents with tiered coverage

Clarifications

Session 2026-01-19

Q: Should tests deliberately trigger retry/fallback logic, or only test when real failures occur? → A: Remove retry/fallback testing from scope; only test with real failures that occur naturally.

Overview

This feature provides a comprehensive end-to-end testing infrastructure for the Autonomous Multi-Agent Orchestrator (feature 020). The testing suite validates that the orchestrator correctly executes work packages across multiple AI agents and maintains state through complex multi-turn workflows.

Key design decisions:

Tiered agent coverage: Core agents (Claude Code, Codex, Copilot, Gemini, OpenCode) get full integration tests; extended agents get smoke tests
Real agent execution: Tests call actual agent CLIs, not mocks, to validate true end-to-end behavior
Checkpoint-based fixtures: Pre-created snapshots at known states enable faster test execution while maintaining realism
Tiered failure handling: Core agent unavailability fails tests; extended agent unavailability skips gracefully

User Scenarios & Testing (mandatory)

User Story 1 - Happy Path Orchestration Test (Priority: P1)

A test developer wants to verify that the orchestrator can execute a simple feature end-to-end: implement a WP, review it, and mark it complete.

Why this priority: The happy path is the foundation: if basic orchestration doesn't work, nothing else matters.

Independent Test: Can be tested by running a single-WP feature through orchestration and verifying the WP reaches the "done" lane with commits.

Acceptance Scenarios:

1. Given a fixture feature with one independent WP and a core agent available, When the orchestration test runs, Then the WP is implemented, reviewed, and marked done.

2. Given a fixture feature with three independent WPs, When the orchestration test runs, Then all three WPs execute in parallel (up to the concurrency limit) and complete successfully.

3. Given orchestration completes, When the test validates results, Then each WP has commits in its worktree and the correct lane status.

User Story 2 - Agent Availability Detection (Priority: P1)

A test developer wants tests to behave correctly based on which agents are installed and authenticated on the test machine.

Why this priority: Tests must run reliably across different environments; proper skip/fail behavior prevents false positives and debugging headaches.

Independent Test: Can be tested by mocking agent detection (the detection result only, not the agents themselves) and verifying correct skip/fail behavior.

Acceptance Scenarios:

1. Given Claude Code is not installed, When a test requiring Claude Code runs, Then the test fails with a clear error message about the missing agent.

2. Given Cursor (extended tier) is not installed, When a smoke test for Cursor runs, Then the test is skipped with a warning, not failed.

3. Given all core agents are available, When the full integration suite runs, Then no tests are skipped due to agent availability.

4. Given agent detection runs, When an agent is installed but not authenticated, Then the detection reports the agent as unavailable with auth failure reason.

User Story 3 - Review Cycle Testing (Priority: P1)

A test developer wants to verify the orchestrator handles review rejection and re-implementation cycles correctly.

Why this priority: Review cycles are the core value of cross-agent review; bugs here would undermine the orchestrator's purpose.

Independent Test: Can be tested with a fixture that triggers review rejection and verifying the re-implementation flow.

Acceptance Scenarios:

1. Given a WP implementation that fails review, When orchestration processes the review result, Then the WP is sent back for re-implementation.

2. Given a WP goes through rejection -> re-implement -> re-review -> approve, When the test validates state, Then the state file shows correct transition history.

3. Given the maximum number of review cycles has been exceeded, When orchestration continues, Then the WP is marked as failed and the user is alerted.

User Story 4 - Fixture Snapshot Management (Priority: P2)

A test developer wants to create and use checkpoint snapshots to speed up test execution without sacrificing realism.

Why this priority: Full end-to-end tests are slow; snapshots enable fast iteration while testing specific scenarios.

Independent Test: Can be tested by creating a snapshot, restoring it, and verifying the restored state matches the original.

Acceptance Scenarios:

1. Given an orchestration run reaches the "WP implemented, awaiting review" state, When the snapshot tool runs, Then a checkpoint is created that can restore this exact state.

2. Given a checkpoint snapshot exists, When a test loads the snapshot, Then the test starts from that checkpoint state, not from scratch.

3. Given multiple checkpoints exist for different states, When a test selects a checkpoint, Then only the relevant state is loaded.

4. Given the orchestrator code changes, When snapshots become invalid, Then the fixture tooling detects and reports stale snapshots.

User Story 5 - Parallel Execution and Dependency Testing (Priority: P2)

A test developer wants to verify the orchestrator respects WP dependencies and executes independent WPs in parallel.

Why this priority: Parallel execution is a key performance feature; dependency bugs could cause race conditions or incorrect ordering.

Independent Test: Can be tested with a fixture having specific dependency patterns and verifying execution order.

Acceptance Scenarios:

1. Given WP01, WP02, WP03 are independent, When orchestration runs with concurrency=3, Then all three start simultaneously.

2. Given WP04 depends on WP01 and WP02, When WP01 completes but WP02 is running, Then WP04 does not start until WP02 completes.

3. Given a fixture with circular dependencies, When orchestration attempts to start, Then orchestration fails with a clear circular dependency error before any WP executes.

4. Given a diamond dependency pattern (WP04 depends on WP02 and WP03, which both depend on WP01), When orchestration runs, Then execution order is correct and WP04 starts only after both WP02 and WP03 complete.
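
The ordering expectations in these scenarios can be computed independently of the orchestrator and used as the expected value in a test. The following is a minimal sketch, assuming dependencies are available as a mapping from each WP id to the ids it depends on. The function name `execution_waves` and the "wave" grouping are illustrative assumptions, not part of the orchestrator; a real scheduler may start a WP as soon as its own dependencies finish rather than waiting for a whole wave.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group WPs into waves; each WP in a wave has all dependencies completed.

    `dependencies` maps a WP id to the set of WP ids it depends on
    (hypothetical input shape). Raises ValueError before producing any
    wave if the graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        sorter.prepare()  # raises CycleError on circular dependencies
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle
        raise ValueError(f"circular dependency detected: {exc.args[1]}") from exc

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unblock dependents
    return waves


# Diamond pattern from acceptance scenario 4.
diamond = {
    "WP01": set(),
    "WP02": {"WP01"},
    "WP03": {"WP01"},
    "WP04": {"WP02", "WP03"},
}
print(execution_waves(diamond))
```

A test could then assert that the start order observed in the orchestration state never places a WP ahead of a WP from an earlier wave.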

User Story 6 - Extended Agent Smoke Tests (Priority: P3)

A test developer wants basic validation that extended-tier agents (Cursor, Qwen, Augment, Kilocode, Roo, Windsurf, Amazon Q) can be invoked by the orchestrator.

Why this priority: Full integration tests for all agents would be prohibitively slow; smoke tests provide confidence without excessive runtime.

Independent Test: Can be tested by invoking each extended agent with a minimal task and verifying basic response.

Acceptance Scenarios:

1. Given an extended agent is installed, When the smoke test runs, Then the agent successfully receives and acknowledges a minimal task.

2. Given an extended agent is not installed, When the smoke test runs, Then the test is skipped with an informative message.

3. Given all extended agents are available, When the smoke test suite runs, Then each agent completes its minimal task.

Edge Cases

What happens when a fixture snapshot references agents that aren't installed? (Test skips with clear message about missing dependencies)
What happens when checkpoint restoration fails midway? (Clean up partial state and report which step failed)
What happens when two tests try to use the same fixture concurrently? (Fixture isolation via unique directories or locking)
What happens when agent output is malformed? (Parse error is captured, WP marked as needing retry)
What happens when git operations fail during fixture setup? (Clear error with git state, fixture marked as corrupt)
What happens when the test timeout expires during long agent execution? (Agent process killed, test fails with timeout indicator)

Requirements (mandatory)

Functional Requirements

Agent Availability Detection

FR-001: System MUST detect installation status of all 12 supported agents
FR-002: System MUST verify authentication status for detected agents
FR-003: System MUST categorize agents into core tier (Claude Code, Codex, Copilot, Gemini, OpenCode) and extended tier (remaining 7 agents)
FR-004: System MUST fail tests when core tier agents are unavailable
FR-005: System MUST skip tests with warning when extended tier agents are unavailable
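
The tiered behavior in FR-001 through FR-005 could be sketched as follows. The `AgentAvailability` fields come from the Key Entities section; everything else (`Tier`, `detect_agent`, `unavailable_action`, the `binary` and `auth_probe` parameters) is a hypothetical name chosen for illustration. How each agent's authentication is actually checked is agent-specific and is left to a caller-supplied callable here.

```python
import shutil
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Tier(Enum):
    CORE = "core"
    EXTENDED = "extended"


@dataclass(frozen=True)
class AgentAvailability:
    agent_id: str
    is_installed: bool
    is_authenticated: bool
    tier: Tier
    failure_reason: Optional[str] = None

    @property
    def is_available(self) -> bool:
        return self.is_installed and self.is_authenticated


def detect_agent(agent_id: str, binary: str, tier: Tier,
                 auth_probe: Callable[[], bool]) -> AgentAvailability:
    """Check installation via PATH lookup, then authentication via auth_probe.

    `binary` and `auth_probe` are assumptions: the real CLI name and the
    real authentication check depend on the agent.
    """
    if shutil.which(binary) is None:
        return AgentAvailability(agent_id, False, False, tier,
                                 f"'{binary}' not found on PATH")
    if not auth_probe():
        return AgentAvailability(agent_id, True, False, tier,
                                 "authentication check failed")
    return AgentAvailability(agent_id, True, True, tier)


def unavailable_action(result: AgentAvailability) -> str:
    """Map a detection result to 'run', 'fail' (core) or 'skip' (extended)."""
    if result.is_available:
        return "run"
    return "fail" if result.tier is Tier.CORE else "skip"
```

In a pytest fixture, a "fail" result would typically translate to `pytest.fail(result.failure_reason)` and a "skip" result to `pytest.skip(result.failure_reason)`, which keeps the two outcomes distinct in the test report.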

Fixture Management

FR-006: System MUST provide tooling to create checkpoint snapshots at defined orchestration states
FR-007: System MUST support snapshots for: "WP created", "WP implemented", "review pending", "review rejected", "review approved", "WP merged"
FR-008: System MUST restore snapshots to exact state including git worktrees, lane status, and state files
FR-009: System MUST detect and report stale snapshots when orchestrator code changes
FR-010: System MUST isolate fixtures to prevent test interference
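
One possible way to satisfy FR-009 is to record a fingerprint of the orchestrator source when a snapshot is created and compare it at load time. The sketch below assumes a content hash over the `*.py` files in a source directory; the function names and the choice of hashing are illustrative assumptions, and the fingerprint could equally be a version string stored in the checkpoint's `orchestrator_version` field.

```python
import hashlib
import tempfile
from pathlib import Path


def source_fingerprint(source_dir: Path) -> str:
    """Hash the relative path and contents of every *.py file under source_dir."""
    digest = hashlib.sha256()
    for path in sorted(source_dir.rglob("*.py")):  # sorted for a stable order
        digest.update(path.relative_to(source_dir).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def is_stale(recorded_fingerprint: str, source_dir: Path) -> bool:
    """A snapshot is stale if the source no longer matches its recorded fingerprint."""
    return recorded_fingerprint != source_fingerprint(source_dir)


# Demonstration on a throwaway directory standing in for the orchestrator source.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp)
    module = source / "orchestrator.py"
    module.write_text("VERSION = 1\n")
    recorded = source_fingerprint(source)      # stored at snapshot creation
    stale_before = is_stale(recorded, source)  # source unchanged
    module.write_text("VERSION = 2\n")         # simulate a code change
    stale_after = is_stale(recorded, source)

print(stale_before, stale_after)
```

The same temporary-directory pattern also gives each test its own working copy of a fixture, which is one way to meet the isolation requirement in FR-010.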

Core Integration Tests

FR-011: System MUST test happy path: implement -> review -> done
FR-012: System MUST test review cycles: implement -> review-reject -> re-implement -> review-approve -> done
FR-013: System MUST test parallel execution with configurable concurrency
FR-014: System MUST test dependency ordering with various graph patterns (linear, fan-out, diamond)

Extended Agent Smoke Tests

FR-015: System MUST test basic invocation for each extended agent
FR-016: System MUST verify agent receives task and produces some output
FR-017: System MUST complete smoke tests in under 60 seconds per agent
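
The 60-second bound in FR-017 and the "produces some output" check in FR-016 can be enforced with a subprocess timeout. This is a minimal sketch, assuming the agent is invoked as a plain command line; `run_with_timeout` and the returned dictionary shape are hypothetical, and the commands below use the Python interpreter as a stand-in because real agent CLI invocations are not specified here.

```python
import subprocess
import sys
import time


def run_with_timeout(command: list[str], timeout_seconds: float) -> dict:
    """Run a command, killing it if it exceeds timeout_seconds.

    Returns a dict with 'status' ('ok', 'failed' or 'timeout'),
    'duration' in seconds, and the captured 'output'.
    """
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout",
                "duration": time.monotonic() - started,
                "output": ""}
    # "ok" requires a zero exit code and non-empty output (FR-016).
    produced_output = bool(completed.stdout.strip())
    status = "ok" if completed.returncode == 0 and produced_output else "failed"
    return {"status": status,
            "duration": time.monotonic() - started,
            "output": completed.stdout}


# Stand-in commands: one that responds quickly, one that exceeds its budget.
ack = run_with_timeout([sys.executable, "-c", "print('ack')"], timeout_seconds=60)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"],
                        timeout_seconds=0.5)
print(ack["status"], slow["status"])
```

Note that `subprocess.run` only kills the direct child on timeout; an agent CLI that spawns its own subprocesses may need process-group handling on top of this.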

Test Organization

FR-018: System MUST organize tests by category: availability, fixtures, integration, smoke
FR-019: System MUST support running specific test categories via pytest markers
FR-020: System MUST provide clear test output distinguishing skips, failures, and passes
FR-021: System MUST support parallel test execution where fixtures allow
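
For FR-018 and FR-019, the four categories could be registered as pytest markers in a `conftest.py`. The sketch below uses the standard `pytest_configure` hook and `config.addinivalue_line`; the marker descriptions and the `TEST_CATEGORIES` name are assumptions made for illustration.

```python
# conftest.py (sketch)

# Category names follow FR-018; descriptions are illustrative.
TEST_CATEGORIES = {
    "availability": "agent installation and authentication detection",
    "fixtures": "checkpoint snapshot creation and restoration",
    "integration": "core agent end-to-end orchestration",
    "smoke": "extended agent minimal invocation",
}


def pytest_configure(config):
    """Register one marker per test category so `pytest -m <name>` can select it."""
    for name, description in TEST_CATEGORIES.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a test decorated with `@pytest.mark.smoke` runs under `pytest -m smoke`, and categories can be combined, for example `pytest -m "availability or fixtures"`.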

State Validation

FR-022: System MUST validate orchestration state file integrity after each test
FR-023: System MUST verify WP lane transitions are recorded correctly
FR-024: System MUST validate git state (commits, branches) matches expected post-orchestration state
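
A check for FR-023 might compare the recorded lane history of a WP against a table of allowed transitions. In the sketch below the lane names (`planned`, `doing`, `for_review`, `done`) and the shape of the history (an ordered list of lane names) are assumptions; only the "done" lane is named in this specification, so the table would need to be aligned with the orchestrator's actual state file.

```python
# Assumed lane names and allowed moves; adjust to the real state file format.
ALLOWED_TRANSITIONS = {
    "planned": {"doing"},
    "doing": {"for_review"},
    "for_review": {"doing", "done"},  # review rejection sends the WP back
    "done": set(),
}


def validate_history(history: list[str]) -> list[str]:
    """Return a list of problems found in a lane history; empty means valid."""
    if not history:
        return ["history is empty"]
    problems = []
    for index, (current, following) in enumerate(zip(history, history[1:])):
        if following not in ALLOWED_TRANSITIONS.get(current, set()):
            problems.append(
                f"step {index}: illegal transition {current} -> {following}"
            )
    return problems


def count_review_rejections(history: list[str]) -> int:
    """Count how many times a WP moved from review back to implementation."""
    return sum(
        1 for current, following in zip(history, history[1:])
        if current == "for_review" and following == "doing"
    )


# implement -> review-reject -> re-implement -> review-approve -> done (FR-012)
rejected_once = ["planned", "doing", "for_review", "doing", "for_review", "done"]
print(validate_history(rejected_once), count_review_rejections(rejected_once))
```

The rejection count gives a test a direct way to assert that a review-cycle fixture went through exactly the expected number of re-implementation rounds.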

Key Entities

AgentAvailability: Detection result for a single agent. Includes: agent_id, is_installed, is_authenticated, tier (core/extended), and failure_reason if unavailable.

FixtureCheckpoint: A snapshot of orchestration state at a known point. Includes: checkpoint_name, orchestration_state, git_state (branches, commits), created_at, and orchestrator_version.

TestCategory: Classification of tests. Values: availability, fixture_management, integration_happy_path, integration_review_cycles, integration_parallel, smoke_extended.

TestResult: Outcome of a single test. Includes: test_name, category, status (passed/failed/skipped), duration, skip_reason if skipped, failure_details if failed.

Success Criteria (mandatory)

Measurable Outcomes

SC-001: Full integration test suite for core agents completes in under 30 minutes
SC-002: Smoke test suite for extended agents completes in under 10 minutes (when all agents available)
SC-003: Fixture snapshots reduce test startup time by at least 70% compared to fresh setup
SC-004: Test suite correctly identifies orchestrator bugs with zero false positives over 10 consecutive runs
SC-005: All 5 core agents have full integration test coverage
SC-006: All 7 extended agents have smoke test coverage
SC-007: Test results clearly distinguish between "agent unavailable" skips and actual test failures
SC-008: Developers can run tests locally with 2 or more core agents installed

Assumptions

At least 2 core agents are installed for meaningful local test runs
Test machine has sufficient resources to run agent processes concurrently
Git is available and properly configured
Network connectivity is available for cloud-based agents during test runs
Orchestrator (feature 020) is fully implemented and merged
pytest is the test framework (already used in spec-kitty)

Dependencies

Feature 020: Autonomous Multi-Agent Orchestrator (must be complete and merged)
Existing spec-kitty test infrastructure (pytest, fixtures)
Existing agent invoker implementations from feature 020
Git worktree functionality for fixture state management

Out of Scope

Mocked agent tests (this feature explicitly uses real agents)
CI/CD configuration (tests designed for local execution)
Performance benchmarking beyond basic timing
Agent-specific bug testing (focus is on orchestrator, not individual agents)
Cost/token tracking during tests
Test coverage reporting integration
Deliberate retry/fallback testing (retry/fallback paths tested only when real failures occur naturally)