Archive notice: This page documents historical Spec Kitty behavior and is not the current 3.2 workflow. Start with Spec Kitty 3.2 for current docs.

2.x Model Discipline and Cost-Aware Routing (Draft)

Status: Draft
Date: 2026-02-28
Scope: User journey + implementation research for model-aware task assignment

Problem Statement

In the current 2.x flow, the operator selects the agent for each task (--agent <name>), with optional static preferences in .kittify/config.yaml. This is simple, but it does not optimize for:

Task fit (which model is strongest for this task type)
Quality/cost trade-offs
Consistent governance when assignment decisions vary by operator

Current Baseline (What Exists Today)

spec-kitty next requires --agent and does not recommend a model/tool automatically (src/specify_cli/cli/commands/next_cmd.py).
Tool selection uses the agents.available list order with first-available fallback (src/specify_cli/core/tool_config.py).
Agent profiles already support weighted matching by task context, but have no model-cost dimension (src/doctrine/agent_profiles/repository.py).
Doctrine artifacts already provide governance hooks (directives, toolguides) for 2.x (docs/archive/2x/doctrine-and-charter.md).

Proposed User Journey

Scenario

A team enables a Model Discipline doctrine rule so Spec Kitty can recommend the best available model for each task, balancing quality, cost tier, and known weaknesses, while still allowing explicit operator override.

Actors

#	Actor	Type	Role in Journey
1	Project Operator	`human`	Chooses policy level and can override recommendations
2	Spec Kitty CLI	`system`	Computes task type, recommends tool/model pair, enforces policy
3	Model Catalog Updater	`system`	Refreshes capability/cost metadata from configured sources
4	Agent Runtime (Claude/Codex/etc.)	`llm`	Executes the assigned work package

Preconditions

Feature and work packages exist (kitty-specs/<mission>/tasks/WP*.md).
.kittify/config.yaml has available tools configured.
Doctrine bundle includes model-discipline directive/toolguide and task-type mapping file.

Journey Map

Phase	Actor(s)	System	Key Events
1. Configure Policy	Operator	Enables model-discipline mode in config/charter profile	`ModelDisciplineConfigured`
2. Refresh Catalog	Model Catalog Updater	Pulls latest model ranking metadata and pricing snapshots	`ModelCatalogRefreshed`
3. Classify Task	Spec Kitty CLI	Determines task type from mission step + WP metadata	`TaskTypeClassified`
4. Recommend Assignment	Spec Kitty CLI	Scores candidates by quality/cost/risk and suggests tool+model	`ModelAssignmentRecommended`
5. Execute/Override	Operator + Agent Runtime	Accepts recommendation or overrides with reason	`ModelAssignmentAccepted`, `ModelAssignmentOverridden`
6. Capture Metrics	Spec Kitty CLI	Stores usage/cost/performance outcomes for future tuning	`ModelExecutionMetricsCaptured`

Coordination Rules

Default posture: Advisory (Phase 1), then Gated (Phase 2+)

If policy mode is advisory, non-compliant selections warn but do not block.
If policy mode is gated, non-compliant selections require explicit override reason.
If policy mode is required, assignment blocks until a compliant model or override waiver is recorded.
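
The three rules above can be read as a single decision function. The following is a minimal sketch, not the project's implementation: PolicyMode, AssignmentDecision, and check_assignment are hypothetical names, and "compliant" is assumed to have been computed elsewhere by comparing the operator's choice with the recommendation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PolicyMode(Enum):
    ADVISORY = "advisory"
    GATED = "gated"
    REQUIRED = "required"


@dataclass
class AssignmentDecision:
    allowed: bool
    message: str


def check_assignment(
    mode: PolicyMode,
    compliant: bool,
    override_reason: Optional[str] = None,
    waiver_recorded: bool = False,
) -> AssignmentDecision:
    """Apply the three coordination rules to one proposed assignment."""
    if compliant:
        return AssignmentDecision(True, "compliant selection")
    if mode is PolicyMode.ADVISORY:
        # Advisory: warn, never block.
        return AssignmentDecision(True, "warning: selection does not match the recommendation")
    if mode is PolicyMode.GATED:
        # Gated: proceed only with an explicit, non-empty override reason.
        if override_reason and override_reason.strip():
            return AssignmentDecision(True, f"override accepted: {override_reason.strip()}")
        return AssignmentDecision(False, "override reason required")
    # Required: block until an override waiver is on record.
    if waiver_recorded:
        return AssignmentDecision(True, "override waiver recorded")
    return AssignmentDecision(False, "blocked: compliant model or override waiver required")
```

Returning a decision object instead of raising keeps the advisory warning and the blocking cases on the same code path, so a caller can always print the message.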

Proposed 2.x Artifact Design

1. New Directive

Add a doctrine directive such as:

src/doctrine/directives/020-model-discipline-routing.directive.yaml

Purpose:

Require model-to-task fit checks before assignment.
Require cost-tier awareness and explicit override capture.

2. New Toolguide

Add a model discipline toolguide:

src/doctrine/toolguides/model-discipline.toolguide.yaml
src/doctrine/toolguides/MODEL_DISCIPLINE.md

Purpose:

Define task type taxonomy (implementation, review, research, refactor, doc-authoring, etc.).
Explain scoring dimensions (quality, weakness risk, cost tier, latency tier).
Define override rules and audit expectations.
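
To make the scoring dimensions concrete, here is a minimal sketch of one possible weighted score. The Candidate fields, the three-level tiers, the 0.0 to 1.0 scales, and the weight values are all assumptions for illustration; the toolguide would define the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # 0.0-1.0, higher is better
    weakness_risk: float  # 0.0-1.0, higher is worse
    cost_tier: int        # 1 (cheapest) to 3 (most expensive)
    latency_tier: int     # 1 (fastest) to 3 (slowest)


# Hypothetical weights; real values would come from the routing policy.
WEIGHTS = {"quality": 0.5, "weakness_risk": 0.2, "cost": 0.2, "latency": 0.1}


def tier_penalty(tier: int, max_tier: int = 3) -> float:
    """Map tier 1..max_tier onto a 0.0..1.0 penalty."""
    return (tier - 1) / (max_tier - 1)


def score(c: Candidate, weights: dict[str, float] = WEIGHTS) -> float:
    """Quality counts for the candidate; risk, cost, and latency count against it."""
    return (
        weights["quality"] * c.quality
        - weights["weakness_risk"] * c.weakness_risk
        - weights["cost"] * tier_penalty(c.cost_tier)
        - weights["latency"] * tier_penalty(c.latency_tier)
    )


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates best-first."""
    return sorted(candidates, key=score, reverse=True)
```

With these weights a cheaper, faster model can outrank a higher-quality one, which is the trade-off the directive is meant to surface.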

3. Model-to-Task Mapping Data

Add a machine-readable mapping file:

src/doctrine/toolguides/model-to-task_type.yml

Suggested schema sections:

task_types
models (strengths, weaknesses, supported tools, cost tier, optional price per 1M tokens)
routing_policy (weights + thresholds)
sources (where each metric came from + timestamp)
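
As an illustration of how the four sections might fit together, the sketch below shows the Python structure a YAML loader could produce for such a file, plus one lookup helper. Every key name beyond the four section names, and every model, tool, and value, is a placeholder assumption and not a proposed final schema.

```python
# Hypothetical contents of model-to-task_type.yml, shown as loaded data.
# All model names, tool names, and numbers are placeholders.
MAPPING = {
    "task_types": ["implementation", "review", "research", "refactor", "doc-authoring"],
    "models": {
        "example-model-large": {
            "strengths": ["implementation", "refactor"],
            "weaknesses": ["doc-authoring"],
            "supported_tools": ["example-tool-a"],
            "cost_tier": "high",
            "price_per_1m_tokens": {"input": 10.0, "output": 30.0},  # optional
        },
        "example-model-small": {
            "strengths": ["doc-authoring", "review"],
            "weaknesses": ["research"],
            "supported_tools": ["example-tool-a", "example-tool-b"],
            "cost_tier": "low",
        },
    },
    "routing_policy": {
        "weights": {"quality": 0.5, "weakness_risk": 0.2, "cost": 0.2, "latency": 0.1},
        "thresholds": {"min_score": 0.3},
    },
    "sources": [
        {
            "metric": "cost_tier",
            "origin": "provider pricing page",
            "retrieved_at": "2026-02-28T00:00:00Z",
        },
    ],
}


def models_for(task_type: str, mapping: dict = MAPPING) -> list[str]:
    """Return models that list task_type as a strength, in sorted order."""
    if task_type not in mapping["task_types"]:
        raise ValueError(f"unknown task type: {task_type}")
    return sorted(
        name
        for name, entry in mapping["models"].items()
        if task_type in entry["strengths"]
    )
```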

Recommended follow-up:

Add src/doctrine/schemas/model-to-task_type.schema.yaml for validation.

Proposed YAML Schema (v1.0)

Use this schema for validating model-to-task_type.yml:
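
A minimal sketch of the kind of structural checks such a schema could encode, written as plain Python rather than YAML. The required keys are assumptions derived from the suggested schema sections, not the project's actual schema, and validate_mapping is a hypothetical helper.

```python
REQUIRED_SECTIONS = ("task_types", "models", "routing_policy", "sources")
REQUIRED_MODEL_KEYS = ("strengths", "weaknesses", "supported_tools", "cost_tier")


def validate_mapping(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in data]
    if errors:
        # Later checks need every section, so stop at the first structural gap.
        return errors
    known = set(data["task_types"])
    for name, entry in data["models"].items():
        for key in REQUIRED_MODEL_KEYS:
            if key not in entry:
                errors.append(f"{name}: missing {key}")
        # Strengths and weaknesses must reference declared task types.
        for field in ("strengths", "weaknesses"):
            for task_type in entry.get(field, []):
                if task_type not in known:
                    errors.append(f"{name}: unknown task type {task_type!r} in {field}")
    for key in ("weights", "thresholds"):
        if key not in data["routing_policy"]:
            errors.append(f"routing_policy: missing {key}")
    return errors
```

Collecting all problems instead of failing on the first one lets a catalog author fix a file in one pass.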

Implementation Approach (Phased)

Phase 0: Advisory-only research slice

Load mapping file and compute recommendations without changing assignment behavior.
Expose recommendation in spec-kitty next --json payload as additional fields (recommended_agent, recommended_model, rationale).
Keep --agent mandatory for backwards compatibility.
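
A minimal sketch of the Phase 0 payload change, assuming the recommendation is added as extra keys next to whatever the command already emits. The field names recommended_agent, recommended_model, and rationale come from the list above; the existing payload shape and the with_recommendation helper are hypothetical.

```python
import json


def with_recommendation(payload: dict, agent: str, model: str, rationale: str) -> dict:
    """Return a copy of the existing payload with the advisory fields added."""
    enriched = dict(payload)  # shallow copy; existing keys are left untouched
    enriched.update(
        recommended_agent=agent,
        recommended_model=model,
        rationale=rationale,
    )
    return enriched


# Hypothetical existing payload; the real `next --json` shape may differ.
existing = {"agent": "claude", "work_package": "WP01"}
print(
    json.dumps(
        with_recommendation(
            existing, "codex", "example-model", "best fit for review at low cost tier"
        ),
        indent=2,
    )
)
```

Because the fields are purely additive, consumers that ignore unknown keys keep working, which matches the backwards-compatibility goal of this phase.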

Phase 1: Optional auto-selection

Add --agent auto support in next and workflow commands.
Resolve tool+model from mapping and configured availability.
Persist selected model in WP history metadata (or event payload) for traceability.
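
One way the auto value could be resolved is sketched below. It assumes a best-first list of (tool, model) pairs has already been produced from the mapping; resolve_agent and its fallback behaviour are illustrative assumptions, not existing Spec Kitty code.

```python
from typing import Optional


def resolve_agent(
    requested: str,
    available_tools: list[str],
    ranked: list[tuple[str, str]],
) -> tuple[str, Optional[str]]:
    """Resolve (tool, model) for a --agent value.

    ranked is a best-first list of (tool, model) pairs from the mapping.
    """
    if requested != "auto":
        # Explicit operator choice wins; no model is inferred.
        return requested, None
    for tool, model in ranked:
        if tool in available_tools:
            return tool, model
    if available_tools:
        # Nothing in the mapping is configured locally: fall back to the
        # first available tool, mirroring the current baseline behaviour.
        return available_tools[0], None
    raise LookupError("no tools configured")
```

For example, resolve_agent("auto", ["codex", "claude"], [("tool-x", "model-x"), ("claude", "model-c")]) skips the unavailable tool-x and returns ("claude", "model-c").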

Phase 2: Directive enforcement

Wire directive checks into execution path (pre-transition validation before moving WP to doing).
Enforce advisory/gated/required behavior from config.
Require override reason when non-compliant choices are used.

Phase 3: Feedback loop

Record observed cost, latency, and acceptance outcomes.
Use telemetry to tune weights and reduce static assumptions.
Add periodic catalog sync command (e.g., spec-kitty model sync).

Integration Points in Current Code

src/specify_cli/core/tool_config.py
Extend selection config with model-discipline routing dimensions.
src/specify_cli/cli/commands/next_cmd.py
Add optional auto-routing input/output surface.
src/specify_cli/next/runtime_bridge.py
Compute routing recommendation before decision output.
src/specify_cli/cli/commands/agent/workflow.py
Consume resolved agent/model for implement/review handoff.
src/doctrine/agent_profiles/repository.py
Extend weighted scoring with cost/quality dimensions from model catalog.

Data Source Research Notes

Arena data

Arena’s public docs describe leaderboard evaluation as based on human preference voting and state that open datasets are published monthly.
Arena Terms include restrictions on bots/scraping/harvesting without authorization.

Implication:

Prefer documented/official data channels (published datasets/APIs) over raw HTML scraping of arena.ai/leaderboard.
Treat Arena rankings as one quality signal, not a direct task-specialization oracle.

Cost data

Arena leaderboard data does not provide comprehensive pricing metadata.
Cost-per-1M token fields should be sourced from provider pricing pages/APIs and versioned with timestamps.
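
The arithmetic behind a per-1M-token price field is straightforward; the sketch below pairs it with the source and timestamp the note above calls for. PriceSnapshot, its field names, and the prices are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PriceSnapshot:
    input_per_1m: float   # currency units per 1,000,000 input tokens
    output_per_1m: float  # currency units per 1,000,000 output tokens
    source: str           # where the price was read from
    retrieved_at: datetime


def estimate_cost(p: PriceSnapshot, input_tokens: int, output_tokens: int) -> float:
    """Cost = tokens x price-per-1M / 1,000,000, summed over input and output."""
    return (input_tokens * p.input_per_1m + output_tokens * p.output_per_1m) / 1_000_000


# Placeholder prices for illustration only.
snapshot = PriceSnapshot(
    input_per_1m=2.0,
    output_per_1m=8.0,
    source="provider pricing page",
    retrieved_at=datetime(2026, 2, 28, tzinfo=timezone.utc),
)
```

With these placeholder prices, 500,000 input tokens and 250,000 output tokens cost (500,000 × 2.0 + 250,000 × 8.0) / 1,000,000 = 3.0 currency units.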

Risks and Mitigations

Config schema drift: current save path rewrites tools.selection with only two fields.
Mitigation: introduce explicit schema version + preserve/round-trip unknown fields during migration.
False confidence from leaderboard rank: global rank may not match local task quality.
Mitigation: include weakness flags and local telemetry feedback.
Operator trust: opaque routing decisions reduce adoption.
Mitigation: always show rationale and allow controlled override.
Rapid model churn: rankings and prices change frequently.
Mitigation: freshness timestamps + configurable staleness thresholds.
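
The freshness mitigation reduces to comparing a catalog timestamp with a configurable maximum age. A minimal sketch, assuming timezone-aware timestamps; is_stale and the 30-day threshold in the usage note are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    retrieved_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True when a catalog entry is older than the configured threshold."""
    now = now or datetime.now(timezone.utc)
    return now - retrieved_at > max_age
```

A caller might use is_stale(entry_timestamp, timedelta(days=30)) to decide whether to warn the operator or to trigger a catalog refresh before scoring.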

Suggested MVP Decision

Start with advisory mode in 2.x:

Add directive + toolguide + mapping file.
Return recommendation metadata in next --json.
Do not block current workflows yet.

This gives immediate value (visibility + consistency) with low migration risk.

External References

Arena home/terms: https://lmarena.ai/
Arena how-it-works and dataset notes: https://lmarena.ai/howitworks/
Arena leaderboard policy: https://lmarena.ai/leaderboard-policy/
Arena-rank package (public ranking/data access workflow): https://pypi.org/project/arena-rank/
OpenAI API pricing: https://openai.com/api/pricing/
Anthropic API pricing: https://www.anthropic.com/pricing
