Implementation Plan: Glossary Semantic Integrity Runtime for Mission Framework
Branch: 041-mission-glossary-semantic-integrity | Date: 2026-02-16 | Spec: spec.md Input: Feature specification from kitty-specs/041-mission-glossary-semantic-integrity/spec.md
Summary
Implement a glossary semantic integrity runtime system that enforces semantic consistency in mission execution by attaching term resolution and conflict detection to mission primitives. The system uses a middleware chain architecture to extract terms from step inputs/outputs, resolve against a 4-tier scope hierarchy, detect semantic conflicts, block LLM generation on high-severity issues, and prompt users for interactive clarification with checkpoint/resume capability. All state is stored in the event log using Feature 007 canonical contracts (no side-channel state).
Primary capabilities: Metadata-driven check attachment, automatic term extraction, scope-aware resolution (mission_local → team_domain → audience_domain → spec_kitty_core), 4 conflict types detection, 3 strictness modes (off/medium/max), hybrid interactive/async clarification flow, event-sourced checkpoint/resume.
Technical approach: Middleware pipeline with 5 layers (extraction → semantic check → generation gate → clarification → resume), deterministic heuristics + metadata hints for term extraction (no LLM in hot path), event sourcing for checkpoint (minimal payload at generation boundary), Typer prompts + Rich formatting for CLI interaction, Feature 007 canonical event contracts for observability/audit/replay.
Technical Context
Language/Version: Python 3.11+ (existing spec-kitty requirement from constitution) Primary Dependencies: typer (CLI), rich (console output), ruamel.yaml (seed file parsing), spec-kitty-events (Feature 007 canonical event contracts via Git dependency per ADR-11) Storage: Event log only (JSONL via spec-kitty-events), optional seed files (.kittify/glossaries/{scope}.yaml) Testing: pytest with 90%+ coverage (constitution requirement), mypy --strict (no type errors), integration tests for full middleware pipeline Target Platform: Cross-platform (Linux, macOS, Windows 10+ per constitution) Project Type: Single project (CLI library extension, not separate web/mobile app) Performance Goals:
Constraints:
Scale/Scope:
- Term extraction < 100ms per step (deterministic heuristics, no LLM)
- Clarification prompt display < 500ms (Rich rendering)
- Checkpoint/resume < 200ms (minimal payload)
- Event emission < 50ms (append-only JSONL)
- No side-channel state (event log is source of truth per Feature 007 invariant #6)
- No LLM in runtime hot path (async enrichment only)
- Default behavior "mostly invisible" (auto-capture, only block on high-severity)
- Strictness precedence: global → mission → step → runtime override
- Support 100+ mission primitives (metadata-driven attachment)
- Handle 1000+ terms per glossary scope (efficient lookup)
- 3 strictness modes, 4 scope levels, 4 conflict types
Integration Points:
- Mission primitive execution context (middleware injection point)
- Feature 007 event contracts (import from spec-kitty-events package)
- Existing typer CLI commands (extend with clarification prompts)
- Existing rich console output (extend with ranked candidate rendering)
Constitution Check
GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.
Dependencies
✅ Python 3.11+: Aligned with constitution requirement ✅ typer, rich: Existing spec-kitty dependencies (no new framework surface) ✅ ruamel.yaml: Already used for frontmatter parsing in spec-kitty ✅ spec-kitty-events: Git dependency per ADR-11 dual-repository pattern (commit-pinned for determinism)
New dependency: None (all dependencies already in spec-kitty or spec-kitty-events)
Testing Requirements
✅ pytest with 90%+ coverage: Will cover all middleware layers, term extraction, conflict detection, clarification flow ✅ mypy --strict: All new code will have type annotations ✅ Integration tests: Full pipeline tests (term extraction → gate → clarification → resume) ✅ Unit tests: Term extraction heuristics, conflict scoring, scope resolution
Performance Requirements
✅ CLI operations < 2 seconds: Term extraction and semantic check are deterministic (no LLM), designed for < 100ms per step ✅ No lag on 100+ work packages: Glossary checks are step-scoped, don't accumulate state across WPs
Architecture Alignment
✅ 2.x branch target: This is a 2.x feature (event sourcing, spec-kitty-events integration) ✅ Event sourcing with Lamport clocks: Uses Feature 007 canonical events from spec-kitty-events ✅ No 1.x compatibility constraints: Greenfield implementation for 2.x
Deployment Constraints
✅ Cross-platform: Pure Python, no platform-specific dependencies ✅ PyPI distribution: Will vendor spec-kitty-events per constitution Section 103 (until events goes public)
Code Quality
✅ Type annotations: All middleware classes, term extraction, conflict detection ✅ Docstrings: Public API (middleware interfaces, event emission helpers) ✅ No security issues: No credential handling, no secrets (glossary terms are user-visible domain language)
Breaking Changes
⚠️ Behavior change: Glossary checks enabled by default (opt-out model) - impacts all mission primitives unless strictness=off
Justification: Spec says "mostly invisible" and "automatic glossary capture" (FR-020). Default-enabled aligns with "automatic by default" design intent from Sprint S2 kickoff.
Migration: Provide global config option to disable: strictness: off in .kittify/config.yaml for users who want opt-in behavior.
Verdict
✅ PASS: All constitution requirements met. One behavior change (default-enabled checks) justified by spec requirements and mitigated with config option.
Project Structure
Documentation (this feature)
kitty-specs/041-mission-glossary-semantic-integrity/
├── spec.md # Feature specification (created by /spec-kitty.specify)
├── plan.md # This file (created by /spec-kitty.plan)
├── research.md # Phase 0 output (architectural research)
├── data-model.md # Phase 1 output (entities and event contracts)
├── quickstart.md # Phase 1 output (developer quickstart)
├── contracts/ # Phase 1 output (event schema examples)
│ ├── events.md # Canonical event reference (Feature 007)
│ └── middleware.md # Middleware interface contracts
└── tasks.md # Phase 2 output (/spec-kitty.tasks command - NOT created by /spec-kitty.plan)
Source Code (repository root)
src/specify_cli/
├── glossary/ # NEW: Glossary semantic integrity package
│ ├── __init__.py # Public API exports
│ ├── middleware.py # Middleware pipeline components
│ │ ├── GlossaryCandidateExtractionMiddleware
│ │ ├── SemanticCheckMiddleware
│ │ ├── GenerationGateMiddleware
│ │ ├── ClarificationMiddleware
│ │ └── ResumeMiddleware
│ ├── extraction.py # Term extraction (heuristics + metadata)
│ ├── resolution.py # Scope-aware term resolution
│ ├── conflict.py # Conflict detection and scoring
│ ├── clarification.py # Interactive prompt handling (Typer + Rich)
│ ├── checkpoint.py # Event-sourced checkpoint/resume
│ ├── scope.py # Scope hierarchy and seed file loading
│ ├── events.py # Event emission adapters (Feature 007 contracts)
│ └── strictness.py # Strictness policy (off/medium/max)
├── missions/ # EXISTING: Mission framework
│ └── [primitives integration point for middleware attachment]
├── cli/ # EXISTING: CLI commands
│ └── commands/
│ └── glossary.py # NEW: Glossary management commands (if needed)
└── [existing spec-kitty packages]
tests/specify_cli/glossary/ # NEW: Glossary tests
├── test_middleware.py # Middleware pipeline integration tests
├── test_extraction.py # Term extraction heuristics tests
├── test_resolution.py # Scope resolution tests
├── test_conflict.py # Conflict detection and scoring tests
├── test_clarification.py # Interactive prompt tests
├── test_checkpoint.py # Event-sourced checkpoint/resume tests
├── test_scope.py # Scope hierarchy and seed file tests
├── test_events.py # Event emission tests (Feature 007 contracts)
└── test_strictness.py # Strictness policy tests
.kittify/glossaries/ # NEW: Glossary seed files (optional)
├── spec_kitty_core.yaml # Spec Kitty canonical terms
├── team_domain.yaml # Team-specific terms (optional)
└── audience_domain.yaml # Audience-specific terms (optional)
Structure Decision: Single project (Option 1) selected. This is a CLI library extension, not a separate web/mobile app. All code lives in src/specify_cli/glossary/ package to keep glossary semantics modular and testable. Middleware integrates with existing mission primitive execution context. Seed files live in .kittify/glossaries/ for user-managed glossary bootstrapping.
Complexity Tracking
No violations requiring justification. All constitution requirements met.
Phase 0: Research
Prerequisites: Constitution Check passed ✅
Research Goals
1. Middleware Integration Points: Understand how mission primitives execute in spec-kitty 2.x to identify middleware injection points 2. Feature 007 Event Contracts: Extract canonical event schemas from spec-kitty-events package 3. Term Extraction Heuristics: Research deterministic patterns for domain term detection (quoted phrases, acronyms, casing patterns) 4. Scope Resolution Strategies: Best practices for hierarchical glossary lookup with fallback 5. Checkpoint State Minimization: Identify minimal state needed for deterministic resume
Research Outputs
See research.md for detailed findings.
Key Decisions:
1. Middleware Architecture (B + D Hybrid):
- Decision: Use middleware chain as primary control path, with event emission at boundaries
- Rationale: Primitives are config-defined and customizable (not stable class hierarchy). Middleware provides one consistent execution choke point. Events are for observability/audit/replay, not enforcement.
- Alternatives considered: (A) Decorator pattern (brittle for config-driven primitives), (C) Base class hooks (forces class model mission system doesn't require), (D) Pure event-driven (can't guarantee inline block/resume semantics)
2. Event Contracts (A + C: Reference Feature 007):
- Decision: CLI references/imports events package contracts from spec-kitty-events (not redefining)
- Rationale: Feature 007 is authoritative source for glossary event schemas. Prevents contract drift between CLI and SaaS.
- Alternatives considered: (B) Define schemas inline in CLI (causes contract drift), (D) Stub contracts temporarily (defers integration)
3. Term Extraction (D: Metadata hints + deterministic heuristics):
- Decision: Hybrid extraction with metadata hints (highest confidence) + deterministic heuristics (quoted phrases, acronyms, casing patterns) + scope-aware normalization
- Rationale: Keeps extraction automatic and mostly invisible, avoids LLM latency/cost in hot path, better precision than raw heuristics via mission metadata
- Alternatives considered: (A) NLP-based (spaCy/NLTK - adds dependency weight), (B) LLM-based (high quality but latency/cost), (C) Pattern-based only (low precision)
4. Checkpoint/Resume (A: Lightweight event sourcing):
- Decision: Emit
StepCheckpointedevent before generation gate with minimal payload (mission/run/step IDs, strictness, scope/version refs, input hash, cursor, retry token) - Rationale: Event log is source of truth (Feature 007 invariant #6), supports cross-session resume, deterministic replay
- Alternatives considered: (B) Filesystem state file (conflicts with "no side-channel state"), (C) In-memory cache (fails async defer + cross-session), (D) Re-run with inputs (fragile unless fully idempotent)
5. Interactive Prompts (B: Typer prompts + Rich formatting):
- Decision: Use typer.prompt() for input, rich for output formatting (no new dependencies)
- Rationale: Matches existing CLI patterns (typer.confirm/prompt in upgrade.py, orchestrate.py), enough capability for ranked options + custom input + defer flow
- Alternatives considered: (A) Rich + questionary (new dependency), (C) Custom Rich prompts (reinventing wheel), (D) AskUserQuestion tool (unclear if exists)
Phase 1: Design & Contracts
Prerequisites: Phase 0 research complete ✅
Data Model
See data-model.md for complete entity definitions and relationships.
Core Entities:
1. TermSurface: Raw string (e.g., "workspace") 2. TermSense: Meaning per scope (surface, scope, definition, provenance, confidence, status) 3. GlossaryScope: Enum (mission_local, team_domain, audience_domain, spec_kitty_core) 4. SemanticConflict: Conflict classification (term, type, severity, confidence, candidates, context) 5. StepCheckpoint: Minimal state for resume (mission/run/step IDs, strictness, scope refs, input hash, cursor, retry token)
Middleware Components:
1. GlossaryCandidateExtractionMiddleware: Extract terms from step I/O using metadata hints + heuristics 2. SemanticCheckMiddleware: Resolve terms against scope hierarchy, detect conflicts 3. GenerationGateMiddleware: Block generation on unresolved high-severity conflicts 4. ClarificationMiddleware: Render ranked candidates, prompt user (Typer + Rich) 5. ResumeMiddleware: Load checkpoint from events, restore step execution context
Event Contracts
See contracts/events.md for canonical schemas from Feature 007.
Canonical Events (from spec-kitty-events package):
1. GlossaryScopeActivated: (scope_id, glossary_version_id) 2. TermCandidateObserved: (term, source_step, actor_id, confidence) 3. SemanticCheckEvaluated: (step_id, mission_id, timestamp, findings, overall_severity, confidence, effective_strictness, recommended_action, blocked) 4. GlossaryClarificationRequested: (question, term, options, urgency) 5. GlossaryClarificationResolved: (conflict_id, term_surface, selected_sense, actor, timestamp, resolution_mode, provenance) 6. GlossarySenseUpdated: (term_surface, scope, new_sense, actor, timestamp, update_type, provenance) 7. GenerationBlockedBySemanticConflict: (step_id, mission_id, timestamp, conflicts, strictness_mode, effective_strictness)
CLI-specific events (may need to add if not in Feature 007):
8. StepCheckpointed: (mission_id, run_id, step_id, strictness, scope_refs, input_hash, cursor, retry_token, timestamp)
Middleware Interface Contracts
See contracts/middleware.md for interface definitions.
Middleware Base Interface:
class GlossaryMiddleware(Protocol):
def process(self, context: PrimitiveExecutionContext) -> PrimitiveExecutionContext:
"""Process primitive execution context, return modified context or raise BlockedByConflict."""
...
Pipeline Execution Order:
1. Extraction → 2. Semantic Check → 3. Generation Gate → 4. Clarification (if blocked) → 5. Resume (on retry)
Quickstart
See quickstart.md for developer setup and common workflows.
Developer Setup:
1. Install spec-kitty 2.x with glossary feature 2. (Optional) Create seed files in .kittify/glossaries/ 3. Configure strictness mode in .kittify/config.yaml (default: medium)
Common Workflows:
1. Mission author: Add glossary_check: enabled to step metadata in mission.yaml 2. Developer: Encounter conflict, resolve interactively via Typer prompt 3. Operator: Set strictness: off for local dev, strictness: max for production
Agent Context Update
This section will be auto-updated by the agent context script during /spec-kitty.plan execution.
Action: Run agent context update script to add new technology stack to appropriate agent-specific context file.
New technology to add:
- Python package:
src/specify_cli/glossary/(glossary semantic integrity runtime) - Dependencies: typer (CLI), rich (console), ruamel.yaml (seed files), spec-kitty-events (Feature 007 contracts)
- Testing: pytest (90%+ coverage), mypy --strict
- Event contracts: Feature 007 canonical glossary events (GlossaryScopeActivated, TermCandidateObserved, SemanticCheckEvaluated, etc.)
Agent-specific files:
- Claude Code:
.claude/context.md - GitHub Copilot:
.github/context.md - (Other agents as detected by script)
Phase 2: Task Generation
STOP: This is the end of /spec-kitty.plan workflow. Do NOT proceed to task generation.
The user must explicitly run /spec-kitty.tasks to generate work packages.
Implementation Notes
Risks:
- Over-questioning can reduce velocity if severity policy too aggressive (mitigate: cap at 3 questions per burst, severity-first prioritization)
- Weak confidence scoring can under-detect conflicts (mitigate: start conservative, tune based on user feedback)
- Feature 007 events package not yet published (mitigate: stub adapter boundaries, gate implementation on 007 availability)
Assumptions:
- Mission primitives in 2.x are configurable and can carry glossary-check metadata (validate during WP01 implementation)
- Participants will answer targeted clarifications when prompted (design escape hatch: defer to async)
- Term extraction heuristics have acceptable precision (validate with test corpus, iterate if needed)
Open Questions:
- Confidence scoring calibration strategy for v1 default policy (defer to WP03: conflict detection implementation)
- Exact dashboard ROI metrics beyond "prevented rework" (out of scope for CLI, defer to SaaS)
Next command: /spec-kitty.tasks (user must invoke explicitly)