Archive notice: This page documents historical Spec Kitty behavior and is not the current 3.2 workflow. Start with Spec Kitty 3.2 for current docs.
2.x Model Discipline and Cost-Aware Routing (Draft)
Status: Draft
Date: 2026-02-28
Scope: User journey + implementation research for model-aware task assignment
Problem Statement
In the current 2.x flow, the operator selects the agent for each task (--agent <name>), with optional static preferences in
.kittify/config.yaml. This is simple, but it does not optimize for:
- Task fit (which model is strongest for this task type)
- Quality/cost trade-offs
- Consistent governance when assignment decisions vary by operator
Current Baseline (What Exists Today)
- spec-kitty next requires --agent and does not recommend a model/tool automatically (src/specify_cli/cli/commands/next_cmd.py).
- Tool selection uses the agents.available list order with first-available fallback (src/specify_cli/core/tool_config.py).
- Agent profiles already support weighted matching by task context, but no model-cost dimension (src/doctrine/agent_profiles/repository.py).
- Doctrine artifacts already provide governance hooks (directives, toolguides) for 2.x (docs/archive/2x/doctrine-and-charter.md).
Proposed User Journey
Scenario
A team enables a Model Discipline doctrine rule so Spec Kitty can recommend the best available model for each task, balancing quality, cost tier, and known weaknesses, while still allowing explicit operator override.
Actors
| # | Actor | Type | Role in Journey |
|---|---|---|---|
| 1 | Project Operator | human | Chooses policy level and can override recommendations |
| 2 | Spec Kitty CLI | system | Computes task type, recommends tool/model pair, enforces policy |
| 3 | Model Catalog Updater | system | Refreshes capability/cost metadata from configured sources |
| 4 | Agent Runtime (Claude/Codex/etc.) | llm | Executes the assigned work package |
Preconditions
- Feature and work packages exist (kitty-specs/<mission>/tasks/WP*.md).
- .kittify/config.yaml has available tools configured.
- Doctrine bundle includes model-discipline directive/toolguide and task-type mapping file.
Journey Map
| Phase | Actor(s) | Activity | Key Events |
|---|---|---|---|
| 1. Configure Policy | Operator | Enables model-discipline mode in config/charter profile | ModelDisciplineConfigured |
| 2. Refresh Catalog | Model Catalog Updater | Pulls latest model ranking metadata and pricing snapshots | ModelCatalogRefreshed |
| 3. Classify Task | Spec Kitty CLI | Determines task type from mission step + WP metadata | TaskTypeClassified |
| 4. Recommend Assignment | Spec Kitty CLI | Scores candidates by quality/cost/risk and suggests tool+model | ModelAssignmentRecommended |
| 5. Execute/Override | Operator + Agent Runtime | Accepts recommendation or overrides with reason | ModelAssignmentAccepted, ModelAssignmentOverridden |
| 6. Capture Metrics | Spec Kitty CLI | Stores usage/cost/performance outcomes for future tuning | ModelExecutionMetricsCaptured |
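
The scoring in phase 4 could be sketched as a weighted sum. This is only an illustration: the `Candidate` fields, the three-level cost tier, the weight values, and the placeholder model names are all assumptions, not part of the current codebase or any existing mapping file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical tool+model candidate for one task type."""
    tool: str
    model: str
    quality: float        # 0..1, higher is better (assumed scale)
    cost_tier: int        # 1 = cheapest .. 3 = most expensive (assumed scale)
    weakness_risk: float  # 0..1, higher means riskier for this task type


# Assumed weights; in the proposal these would come from routing_policy.
WEIGHTS = {"quality": 0.6, "cost": 0.25, "risk": 0.15}


def score(c: Candidate) -> float:
    """Combine quality, cost tier, and weakness risk into a single score."""
    cost_score = (3 - c.cost_tier) / 2  # tier 1 -> 1.0, tier 3 -> 0.0
    return (
        WEIGHTS["quality"] * c.quality
        + WEIGHTS["cost"] * cost_score
        + WEIGHTS["risk"] * (1.0 - c.weakness_risk)
    )


def recommend(candidates: list) -> Candidate:
    """Return the highest-scoring candidate."""
    if not candidates:
        raise ValueError("no candidates to score")
    return max(candidates, key=score)


strong_but_costly = Candidate("claude", "model-a", 0.9, 3, 0.1)
cheaper = Candidate("codex", "model-b", 0.8, 1, 0.2)
best = recommend([strong_but_costly, cheaper])
```

With these assumed weights the cheaper candidate wins despite its lower quality score, which is the kind of trade-off the rationale field would need to explain to the operator.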
Coordination Rules
Default posture: Advisory (Phase 1), then Gated (Phase 2+)
- If policy mode is advisory, non-compliant selections warn but do not block.
- If policy mode is gated, non-compliant selections require explicit override reason.
- If policy mode is required, assignment blocks until a compliant model or override waiver is recorded.
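
The three policy modes above can be read as a small decision function. The following is a minimal sketch, assuming hypothetical names (`PolicyMode`, `Decision`, `check_assignment`) that do not exist in Spec Kitty today; it only mirrors the rules as written.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PolicyMode(Enum):
    ADVISORY = "advisory"
    GATED = "gated"
    REQUIRED = "required"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    message: str


def check_assignment(
    mode: PolicyMode,
    compliant: bool,
    override_reason: Optional[str] = None,
    waiver_recorded: bool = False,
) -> Decision:
    """Apply the coordination rules to one proposed assignment."""
    if compliant:
        return Decision(True, "compliant selection")
    if mode is PolicyMode.ADVISORY:
        # Advisory: warn, never block.
        return Decision(True, "warning: non-compliant selection")
    if mode is PolicyMode.GATED:
        # Gated: a non-empty override reason is required.
        if override_reason and override_reason.strip():
            return Decision(True, "override accepted: " + override_reason.strip())
        return Decision(False, "override reason required")
    # Required: block until an override waiver is recorded.
    if waiver_recorded:
        return Decision(True, "override waiver recorded")
    return Decision(False, "blocked: compliant model or waiver required")
```

Returning a `Decision` rather than raising keeps the advisory warning and the blocking cases on the same code path, so a caller can always display the message.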
Proposed 2.x Artifact Design
1. New Directive
Add a doctrine directive such as:
src/doctrine/directives/020-model-discipline-routing.directive.yaml
Purpose:
- Require model-to-task fit checks before assignment.
- Require cost-tier awareness and explicit override capture.
2. New Toolguide
Add a model discipline toolguide:
- src/doctrine/toolguides/model-discipline.toolguide.yaml
- src/doctrine/toolguides/MODEL_DISCIPLINE.md
Purpose:
- Define task type taxonomy (implementation, review, research, refactor, doc-authoring, etc.).
- Explain scoring dimensions (quality, weakness risk, cost tier, latency tier).
- Define override rules and audit expectations.
3. Model-to-Task Mapping Data
Add a machine-readable mapping file:
src/doctrine/toolguides/model-to-task_type.yml
Suggested schema sections:
- task_types
- models (strengths, weaknesses, supported tools, cost tier, optional price per 1M tokens)
- routing_policy (weights + thresholds)
- sources (where each metric came from + timestamp)
Recommended follow-up:
- Add src/doctrine/schemas/model-to-task_type.schema.yaml for validation.
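
To make the suggested sections concrete, here is a rough structural check over an already-parsed mapping. It is a sketch under assumptions: only the four top-level section names come from this document, while the shape of each section (a list of task types, a dict of models with `strengths` and `weaknesses` lists) and the function name are hypothetical.

```python
REQUIRED_SECTIONS = ("task_types", "models", "routing_policy", "sources")


def validate_mapping(data: dict) -> list:
    """Return human-readable problems found in a parsed mapping file."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if section not in data:
            errors.append(f"missing section: {section}")

    # Assumed shape: task_types is a list of names, models maps name -> attributes.
    known_types = set(data.get("task_types", []))
    for name, model in data.get("models", {}).items():
        referenced = model.get("strengths", []) + model.get("weaknesses", [])
        for task_type in referenced:
            if task_type not in known_types:
                errors.append(f"{name}: unknown task type {task_type!r}")
    return errors


sample = {
    "task_types": ["implementation", "review"],
    "models": {
        "model-a": {"strengths": ["implementation"], "weaknesses": ["research"]},
    },
    "routing_policy": {},
}
problems = validate_mapping(sample)
```

Collecting all problems instead of failing on the first one suits a doctrine file that operators edit by hand, since a single run reports everything that needs fixing.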
Proposed YAML Schema (v1.0)
Use this schema for validating model-to-task_type.yml:
Implementation Approach (Phased)
Phase 0: Advisory-only research slice
- Load mapping file and compute recommendations without changing assignment behavior.
- Expose recommendation in spec-kitty next --json payload as additional fields (recommended_agent, recommended_model, rationale).
- Keep --agent mandatory for backwards compatibility.
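
As an illustration of the Phase 0 idea, the recommendation can be attached to an existing payload without altering any of its fields. Only the three added field names come from this document; the helper name and the base payload contents below are hypothetical stand-ins, not the real output of spec-kitty next --json.

```python
import json


def with_recommendation(payload: dict, agent: str, model: str, rationale: str) -> dict:
    """Return a copy of the payload with advisory recommendation fields added."""
    enriched = dict(payload)  # shallow copy: existing keys stay untouched
    enriched["recommended_agent"] = agent
    enriched["recommended_model"] = model
    enriched["rationale"] = rationale
    return enriched


# Hypothetical base payload; the real fields are not specified here.
base = {"agent": "claude"}
enriched = with_recommendation(
    base, "codex", "model-b", "lower cost tier for a review task"
)
serialized = json.dumps(enriched, indent=2)
```

Because the fields are purely additive, consumers that ignore unknown keys keep working, which matches the backwards-compatibility goal of this phase.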
Phase 1: Optional auto-selection
- Add --agent auto support in next and workflow commands.
- Resolve tool+model from mapping and configured availability.
- Persist selected model in WP history metadata (or event payload) for traceability.
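
A minimal sketch of how auto could resolve, assuming a hypothetical `resolve_agent` helper and a pre-ranked list of (tool, model) pairs derived from the mapping. The fallback to the first configured tool is an assumption that mirrors the first-available behavior described in the baseline.

```python
from typing import List, Optional, Tuple


def resolve_agent(
    requested: str,
    available: List[str],
    ranked: List[Tuple[str, str]],
) -> Tuple[str, Optional[str]]:
    """Resolve the --agent value to a (tool, model) pair.

    requested: the --agent value, either a tool name or "auto".
    available: configured tools, in configured order.
    ranked:    (tool, model) recommendations, best first (assumed input).
    """
    if requested != "auto":
        # An explicit operator choice always wins; no model is inferred.
        return requested, None
    for tool, model in ranked:
        if tool in available:
            return tool, model
    if available:
        # Nothing recommended is configured: fall back to the first tool.
        return available[0], None
    raise ValueError("no tools configured")
```

For example, with `available = ["claude", "codex"]` and a ranking whose top entry names an unconfigured tool, the function skips that entry and returns the next recommended pair whose tool is configured.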
Phase 2: Directive enforcement
- Wire directive checks into execution path (pre-transition validation before moving WP to doing).
- Enforce advisory/gated/required behavior from config.
- Require override reason when non-compliant choices are used.
Phase 3: Feedback loop
- Record observed cost, latency, and acceptance outcomes.
- Use telemetry to tune weights and reduce static assumptions.
- Add periodic catalog sync command (e.g., spec-kitty model sync).
Integration Points in Current Code
- src/specify_cli/core/tool_config.py: Extend selection config with model-discipline routing dimensions.
- src/specify_cli/cli/commands/next_cmd.py: Add optional auto-routing input/output surface.
- src/specify_cli/next/runtime_bridge.py: Compute routing recommendation before decision output.
- src/specify_cli/cli/commands/agent/workflow.py: Consume resolved agent/model for implement/review handoff.
- src/doctrine/agent_profiles/repository.py: Extend weighted scoring with cost/quality dimensions from model catalog.
Data Source Research Notes
Arena data
- Arena’s public docs describe leaderboard evaluation as human preference voting, and publish open datasets monthly.
- Arena Terms include restrictions on bots/scraping/harvesting without authorization.
Implication:
- Prefer documented/official data channels (published datasets/APIs) over raw HTML scraping of arena.ai/leaderboard.
- Treat Arena rankings as one quality signal, not a direct task-specialization oracle.
Cost data
- Arena leaderboard data does not provide comprehensive pricing metadata.
- Cost-per-1M token fields should be sourced from provider pricing pages/APIs and versioned with timestamps.
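
A versioned price record and the per-1M-token arithmetic could look like the sketch below. The `PriceSnapshot` type, the split into input and output prices, and all numeric values are illustrative assumptions, not real provider prices.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PriceSnapshot:
    """Hypothetical versioned price record for one model."""
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens
    source: str           # where the price was read from
    as_of: date           # when it was read


def estimate_cost(price: PriceSnapshot, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one run: tokens / 1M * price per 1M."""
    total = input_tokens * price.input_per_1m + output_tokens * price.output_per_1m
    return total / 1_000_000


# Placeholder numbers chosen for easy arithmetic only.
snapshot = PriceSnapshot(2.0, 8.0, "provider pricing page", date(2026, 2, 28))
cost = estimate_cost(snapshot, input_tokens=500_000, output_tokens=250_000)
```

Here 500,000 input tokens at 2.00 per 1M cost 1.00 and 250,000 output tokens at 8.00 per 1M cost 2.00, so the estimate is 3.00. Keeping `source` and `as_of` on the record is what makes the figure auditable later.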
Risks and Mitigations
- Config schema drift: current save path rewrites tools.selection with only two fields.
  Mitigation: introduce explicit schema version + preserve/round-trip unknown fields during migration.
- False confidence from leaderboard rank: global rank may not match local task quality.
  Mitigation: include weakness flags and local telemetry feedback.
- Operator trust: opaque routing decisions reduce adoption.
  Mitigation: always show rationale and allow controlled override.
- Rapid model churn: rankings and prices change frequently.
  Mitigation: freshness timestamps + configurable staleness thresholds.
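
Two of these mitigations are simple enough to sketch. Both helpers and all key names below are hypothetical: the first keeps unknown keys when saving a selection block, and the second flags catalog data that is older than a configurable threshold.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def merge_selection(existing: dict, updates: dict) -> dict:
    """Apply updates while preserving keys the current code does not know about."""
    merged = dict(existing)
    merged.update(updates)
    return merged


def is_stale(fetched_at: datetime, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """True when catalog data is older than the configured threshold."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current - fetched_at > max_age


# Key names are placeholders, not the real tools.selection fields.
saved = merge_selection(
    {"known_field": "old", "future_routing_field": {"mode": "advisory"}},
    {"known_field": "new"},
)

fetched = datetime(2026, 2, 1, tzinfo=timezone.utc)
check_time = datetime(2026, 2, 28, tzinfo=timezone.utc)
stale = is_stale(fetched, timedelta(days=14), now=check_time)
```

Passing `now` explicitly keeps the staleness check deterministic in tests, and merging rather than rebuilding the dict is the simplest way to avoid dropping fields written by a newer version.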
Suggested MVP Decision
Start with advisory mode in 2.x:
- Add directive + toolguide + mapping file.
- Return recommendation metadata in next --json.
- Do not block current workflows yet.
This gives immediate value (visibility + consistency) with low migration risk.
External References
- Arena home/terms: https://lmarena.ai/
- Arena how-it-works and dataset notes: https://lmarena.ai/howitworks/
- Arena leaderboard policy: https://lmarena.ai/leaderboard-policy/
- Arena-rank package (public ranking/data access workflow): https://pypi.org/project/arena-rank/
- OpenAI API pricing: https://openai.com/api/pricing/
- Anthropic API pricing: https://www.anthropic.com/pricing