Implementation Plan: Orchestrator End-to-End Testing Suite

Branch: 021-orchestrator-end-to-end-testing-suite | Date: 2026-01-19 | Spec: spec.md Input: Feature specification from /kitty-specs/021-orchestrator-end-to-end-testing-suite/spec.md

Summary

Build comprehensive end-to-end testing infrastructure for the orchestrator (feature 020) with:

  • Tiered agent coverage (5 core agents with full integration tests, 7 extended agents with smoke tests)
  • Modular test paths based on agent count (1-agent, 2-agent, 3+-agent)
  • Checkpoint-based fixtures for fast test execution
  • Real agent CLI invocation (no mocks)

Technical Context

Language/Version: Python 3.11+ (matches existing spec-kitty codebase) Primary Dependencies: pytest, pytest-asyncio (for async orchestrator tests), existing orchestrator module Storage: Git-tracked fixtures in tests/fixtures/orchestrator/ Testing: pytest with custom markers for test categories Target Platform: Local developer machines with agents installed Project Type: Single project (test infrastructure added to existing codebase) Performance Goals: Full integration suite <30 min, smoke suite <10 min Constraints: Requires real agent CLIs installed and authenticated Scale/Scope: 5 core agents, 7 extended agents, 6 test categories

Constitution Check

No constitution file found. Section skipped.

Project Structure

Documentation (this feature)

kitty-specs/021-orchestrator-end-to-end-testing-suite/
├── plan.md              # This file
├── data-model.md        # Test entities and fixture schema
├── quickstart.md        # How to run the test suite
└── tasks.md             # Phase 2 output (created by /spec-kitty.tasks)

Source Code (repository root)

tests/
├── fixtures/
│   └── orchestrator/                    # NEW: Fixture snapshots
│       ├── checkpoint_wp_created/
│       │   ├── state.json               # OrchestrationRun state
│       │   ├── feature/                 # Minimal feature structure
│       │   │   ├── spec.md
│       │   │   └── tasks/
│       │   │       ├── WP01.md
│       │   │       └── WP02.md
│       │   └── worktrees.json           # Worktree metadata
│       ├── checkpoint_wp_implemented/
│       ├── checkpoint_review_pending/
│       ├── checkpoint_review_rejected/
│       ├── checkpoint_review_approved/
│       └── checkpoint_wp_merged/
│
├── specify_cli/
│   └── orchestrator/
│       ├── test_integration.py          # EXISTING: Basic integration tests
│       ├── test_e2e_happy_path.py       # NEW: Happy path tests
│       ├── test_e2e_review_cycles.py    # NEW: Review cycle tests
│       ├── test_e2e_parallel.py         # NEW: Parallel execution tests
│       ├── test_e2e_smoke.py            # NEW: Extended agent smoke tests
│       └── conftest.py                  # NEW: Fixtures and markers
│
└── conftest.py                          # Update: Add orchestrator markers

src/specify_cli/
└── orchestrator/
    └── testing/                         # NEW: Test utilities module
        ├── __init__.py
        ├── availability.py              # Agent detection with auth probe
        ├── fixtures.py                  # Fixture loader/creator
        └── paths.py                     # Test path selection (1/2/3+ agents)

Structure Decision: Extend existing tests/specify_cli/orchestrator/ with new e2e test files. Add tests/fixtures/orchestrator/ for checkpoint snapshots. Add src/specify_cli/orchestrator/testing/ for reusable test utilities.

Key Engineering Decisions

1. Fixture Storage (from planning Q1)

Decision: Git-tracked directory snapshots in tests/fixtures/orchestrator/

Structure per checkpoint:

checkpoint_<name>/
├── state.json           # Serialized OrchestrationRun
├── feature/             # Minimal kitty-specs structure
│   ├── spec.md
│   ├── plan.md
│   ├── meta.json
│   └── tasks/
│       ├── WP01.md      # With lane frontmatter
│       └── WP02.md
└── worktrees.json       # Metadata: [{wp_id, branch, path}]

Loader behavior: 1. Copy fixture to temp directory 2. Initialize git repo in temp dir 3. Create worktrees from worktrees.json metadata 4. Load state.json into OrchestrationRun object 5. Return ready-to-use test context

2. Auth Verification (from planning Q2)

Decision: Lightweight API probe per agent

Implementation:

class AgentAvailability:
    agent_id: str
    is_installed: bool      # CLI binary exists
    is_authenticated: bool  # Probe call succeeded
    tier: Literal["core", "extended"]
    failure_reason: str | None

Probe behavior:

  • Each agent invoker gets a probe() method
  • Makes minimal API call (e.g., list models, whoami)
  • Timeout: 10 seconds
  • Caches result for session duration

3. Test Path Model (from planning Q3)

Decision: Parameterized by agent count, not identity

Paths:

PathAgentsTests
1-agentSingle agentSame-agent implementation and review
2-agentTwo agentsCross-agent review (different impl vs review agent)
3+-agentThree+ agentsFallback scenarios (third agent used on failure)

Runtime selection:

@pytest.fixture
def available_agents() -> list[str]:
    """Detect available agents at test start."""
    return [a.agent_id for a in detect_all_agents() if a.is_authenticated]

@pytest.fixture
def test_path(available_agents) -> Literal["1-agent", "2-agent", "3+-agent"]:
    """Select test path based on agent count."""
    count = len(available_agents)
    if count >= 3:
        return "3+-agent"
    elif count == 2:
        return "2-agent"
    return "1-agent"

4. Smoke Test Task (from planning Q4)

Decision: File touch - agent creates empty file at specified path

Implementation:

SMOKE_TASK = """
Create an empty file at: {target_path}
Do not add any content. Just create the file.
"""

def verify_smoke_result(target_path: Path) -> bool:
    return target_path.exists()

Test Categories & Markers

# In conftest.py
pytest.mark.orchestrator_availability  # Agent detection tests
pytest.mark.orchestrator_fixtures      # Fixture management tests
pytest.mark.orchestrator_happy_path    # Basic orchestration flow
pytest.mark.orchestrator_review_cycles # Review rejection/re-impl
pytest.mark.orchestrator_parallel      # Parallel execution & deps
pytest.mark.orchestrator_smoke         # Extended agent smoke tests

# Tier markers for skip behavior
pytest.mark.core_agent                 # Fail if agent unavailable
pytest.mark.extended_agent             # Skip if agent unavailable

Agent Tiers

Core Tier (must be available - tests fail if missing):

  • Claude Code (claude)
  • GitHub Codex (codex)
  • GitHub Copilot (copilot)
  • Google Gemini (gemini)
  • OpenCode (opencode)

Extended Tier (skip gracefully if missing):

  • Cursor (cursor)
  • Qwen Code (qwen)
  • Augment Code (augment)
  • Kilocode (kilocode)
  • Roo Cline (roo)
  • Windsurf (windsurf)
  • Amazon Q (amazonq)

Complexity Tracking

No constitution violations to track.