Implementation Tasks: Glossary Semantic Integrity Runtime for Mission Framework
Feature: 041-mission-glossary-semantic-integrity
Target Branch: 2.x
Created: 2026-02-16
Status: Ready for implementation
Overview
This feature implements a glossary semantic integrity runtime that enforces semantic consistency during mission execution. A middleware pipeline extracts terms, resolves them against a 4-tier scope hierarchy, detects conflicts, blocks LLM generation on high-severity issues, and prompts for interactive clarification with checkpoint/resume capability.
Architecture: B + D Hybrid (middleware chain + event emission)
Dependencies: typer, rich, ruamel.yaml, spec-kitty-events (Feature 007 contracts)
Testing: pytest 90%+ coverage, mypy --strict
Work Package Summary
Total Work Packages: 11
Total Subtasks: 51
Estimated Timeline: 8-12 weeks (with parallel execution)
MVP Scope
MVP = WP01-WP05 (Foundation through Generation Gate)
Delivers core glossary enforcement:
- Term extraction (metadata hints + heuristics)
- Scope resolution (4-tier hierarchy)
- Conflict detection (4 types)
- Generation gate blocking (strictness modes)
NOT in MVP: Interactive clarification UI, checkpoint/resume, CLI commands (deferred to WP06-WP11)
Phase 1: Foundation & Infrastructure
WP01: Foundation & Data Models
Goal: Establish package structure, core data models, and test infrastructure.
Priority: P1 (blocking for all other WPs)
Independent Test: Can create TermSurface, TermSense, SemanticConflict objects and verify serialization.
Included Subtasks:
- ✅ T001: Create glossary package structure (src/specify_cli/glossary/)
- ✅ T002: Define core data models (TermSurface, TermSense, GlossaryScope, SemanticConflict)
- ✅ T003: Define exception hierarchy (BlockedByConflict, DeferredToAsync, AbortResume)
- ✅ T004: Set up test infrastructure (fixtures, mocks for PrimitiveExecutionContext)
- ✅ T005: Implement GlossaryScope enum with resolution order
Implementation Notes:
1. Create src/specify_cli/glossary/__init__.py with public API exports
2. Define dataclasses in models.py using Python 3.11+ features
3. Create exception base class GlossaryError with specific subclasses
4. Set up pytest fixtures in tests/specify_cli/glossary/conftest.py
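The notes above could translate into something like the following minimal sketch of models.py. The class names come from the subtask list; every field name (surface_text, definition, confidence) and the sense_to_dict helper are illustrative assumptions, not the actual schema.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from enum import Enum


class GlossaryScope(Enum):
    """The four scopes, declared in resolution order (most specific first)."""

    MISSION_LOCAL = "mission_local"
    TEAM_DOMAIN = "team_domain"
    AUDIENCE_DOMAIN = "audience_domain"
    SPEC_KITTY_CORE = "spec_kitty_core"

    @classmethod
    def resolution_order(cls) -> list[GlossaryScope]:
        # Enum iteration follows declaration order.
        return list(cls)


@dataclass(frozen=True)
class TermSurface:
    surface_text: str  # hypothetical field name


@dataclass(frozen=True)
class TermSense:
    surface: TermSurface
    scope: GlossaryScope
    definition: str  # hypothetical field name
    confidence: float  # hypothetical field name


class GlossaryError(Exception):
    """Base class for all glossary exceptions."""


class BlockedByConflict(GlossaryError):
    pass


class DeferredToAsync(GlossaryError):
    pass


class AbortResume(GlossaryError):
    pass


def sense_to_dict(sense: TermSense) -> dict:
    """Serialize a sense to plain JSON-friendly types (hypothetical helper)."""
    data = asdict(sense)
    data["scope"] = sense.scope.value  # replace the enum member with its string value
    return data
```

Declaring the enum members in lookup order keeps the resolution order and the scope names in one place, which is one way to satisfy T005.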
Dependencies: None (foundational WP)
Risks: None (data model definitions are straightforward)
Estimated Prompt Size: ~300 lines
WP02: Scope Management & Storage
Goal: Implement glossary scope loading, seed file parsing, and in-memory storage backed by event log.
Priority: P1 (blocking for WP03-WP05)
Independent Test: Can load seed files, activate scopes, query glossary store for terms.
Included Subtasks:
- ✅ T006: Implement seed file loader (YAML parsing for team_domain.yaml, audience_domain.yaml)
- ✅ T007: Implement scope activation (emit GlossaryScopeActivated events)
- ✅ T008: Implement glossary store (in-memory cache backed by event log)
- ✅ T009: Write scope resolution tests (hierarchical lookup)
- ✅ T048: Create spec_kitty_core.yaml seed file (canonical Spec Kitty terms)
Implementation Notes:
1. Seed files live in .kittify/glossaries/{scope}.yaml
2. Use ruamel.yaml for parsing (preserve comments, order)
3. Store uses LRU cache for performance (max 10,000 terms)
4. Event log is source of truth for glossary state
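As a rough illustration of notes 3 and 4, the store can be thought of as an LRU-bounded cache that is rebuilt by folding events. The sketch below uses only the standard library; the event shape (scope, term, definition keys) and the method names are assumptions made for the example.

```python
from __future__ import annotations

from collections import OrderedDict


class GlossaryStore:
    """In-memory term cache with least-recently-used eviction (illustrative)."""

    def __init__(self, max_terms: int = 10_000) -> None:
        self._max_terms = max_terms
        self._cache: OrderedDict[tuple[str, str], str] = OrderedDict()

    def apply(self, event: dict) -> None:
        """Fold one event into the cache; later events overwrite earlier ones."""
        key = (event["scope"], event["term"])
        self._cache[key] = event["definition"]
        self._cache.move_to_end(key)
        while len(self._cache) > self._max_terms:
            self._cache.popitem(last=False)  # evict the least recently used entry

    def lookup(self, scope: str, term: str) -> str | None:
        key = (scope, term)
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)  # mark as recently used
        return self._cache[key]

    @classmethod
    def from_events(cls, events: list[dict], max_terms: int = 10_000) -> GlossaryStore:
        """Rebuild the cache by replaying the event log in order."""
        store = cls(max_terms)
        for event in events:
            store.apply(event)
        return store
```

Because the event log is the source of truth, a cache miss on an evicted term would need to fall back to replaying the log; that fallback is left out here.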
Dependencies: WP01 (data models)
Parallel Opportunity: Can run in parallel with WP03 (different modules)
Risks: Seed file schema validation (mitigate: strict YAML schema, fail-fast)
Estimated Prompt Size: ~350 lines
Phase 2: Term Extraction
WP03: Term Extraction Implementation
Goal: Implement term extraction using metadata hints + deterministic heuristics, with scope-aware normalization and confidence scoring.
Priority: P1 (blocking for WP04)
Independent Test: Can extract terms from sample step inputs, verify confidence scores, validate normalization.
Included Subtasks:
- ✅ T010: Implement metadata hints extraction (glossary_watch_terms, aliases, exclude, fields)
- ✅ T011: Implement deterministic heuristics (quoted phrases, acronyms, casing patterns, repeats)
- ✅ T012: Implement scope-aware normalization (lowercase, trim, stem-light)
- ✅ T013: Implement confidence scoring (metadata > pattern > weak heuristic)
- ✅ T014: Implement GlossaryCandidateExtractionMiddleware
- ✅ T015: Write extraction tests (unit + integration with mocked context)
Implementation Notes:
1. Extraction logic in extraction.py (pure functions, no side effects)
2. Heuristic patterns: r'"([^"]+)"' (quoted), r'\b[A-Z]{2,5}\b' (acronyms), r'\b[a-z]+_[a-z]+\b' (snake_case)
3. Stem-light: simple plural→singular (workspaces→workspace), no full stemming
4. Middleware emits TermCandidateObserved for each extracted term
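A minimal sketch of how the three patterns and the stem-light rule might combine into a pure extraction function. The regexes are the ones listed above; the numeric confidence values, the tier assigned to each heuristic, and the function names are assumptions chosen only to respect the ordering metadata > pattern > weak heuristic.

```python
import re

QUOTED = re.compile(r'"([^"]+)"')
ACRONYM = re.compile(r'\b[A-Z]{2,5}\b')
SNAKE_CASE = re.compile(r'\b[a-z]+_[a-z]+\b')

# Assumed confidence tiers; only their relative order comes from the plan.
METADATA, PATTERN, WEAK = 1.0, 0.8, 0.5


def normalize(term: str) -> str:
    """Lowercase, trim, and apply a naive plural-to-singular rule."""
    text = term.strip().lower()
    if len(text) > 3 and text.endswith("s") and not text.endswith(("ss", "us", "is")):
        text = text[:-1]
    return text


def extract_candidates(text: str, watch_terms: tuple[str, ...] = ()) -> dict[str, float]:
    """Return normalized terms mapped to the highest confidence seen for each."""
    scores: dict[str, float] = {}

    def observe(raw: str, confidence: float) -> None:
        key = normalize(raw)
        if key:
            scores[key] = max(scores.get(key, 0.0), confidence)

    lowered = text.lower()
    for term in watch_terms:  # metadata hints (glossary_watch_terms)
        if term.lower() in lowered:
            observe(term, METADATA)
    for match in QUOTED.finditer(text):
        observe(match.group(1), PATTERN)
    for match in ACRONYM.finditer(text):
        observe(match.group(0), PATTERN)
    for match in SNAKE_CASE.finditer(text):  # treated as a weak heuristic here
        observe(match.group(0), WEAK)
    return scores
```

Keeping the function free of side effects means the middleware can call it and emit TermCandidateObserved separately for each returned term.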
Dependencies: WP01 (data models), WP02 (scope definitions)
Parallel Opportunity: Can run in parallel with WP02 after WP01 completes
Risks: False positives from heuristics (mitigate: confidence scoring; low-confidence terms are auto-added as drafts)
Estimated Prompt Size: ~400 lines
Phase 3: Semantic Check & Conflict Detection
WP04: Semantic Check & Conflict Detection
Goal: Implement term resolution against scope hierarchy, conflict classification, and severity scoring.
Priority: P1 (blocking for WP05)
Independent Test: Can resolve terms, detect all 4 conflict types, score severity correctly.
Included Subtasks:
- ✅ T016: Implement term resolution against scope hierarchy (mission_local → team_domain → audience_domain → spec_kitty_core)
- ✅ T017: Implement conflict classification (unknown, ambiguous, inconsistent, unresolved_critical)
- ✅ T018: Implement severity scoring (step criticality + confidence → low/medium/high)
- ✅ T019: Implement SemanticCheckMiddleware
- ✅ T020: Write semantic check tests (all conflict types, severity edge cases)
Implementation Notes:
1. Resolution logic in resolution.py (hierarchical lookup with fallback)
2. Conflict types: no match (unknown), 2+ matches (ambiguous), contradictory usage (inconsistent), critical + low confidence (unresolved_critical)
3. Severity: high (critical step + low confidence OR ambiguous), medium (non-critical + ambiguous), low (inconsistent OR unknown + high confidence)
4. Middleware emits SemanticCheckEvaluated with findings
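The rules in notes 1-3 leave some room for interpretation, so the sketch below fixes one reading and labels it as such: a term resolves to the senses of the first scope that defines it, "ambiguous" means two or more senses in that scope, and "unresolved_critical" means an unmatched term on a critical step with low confidence. The 0.5 threshold, the nested-dict glossary shape, and the function names are all assumptions. Detecting "inconsistent" usage needs usage history and is not covered.

```python
from __future__ import annotations

from enum import Enum

SCOPE_ORDER = ("mission_local", "team_domain", "audience_domain", "spec_kitty_core")


class ConflictType(str, Enum):
    UNKNOWN = "unknown"
    AMBIGUOUS = "ambiguous"
    INCONSISTENT = "inconsistent"
    UNRESOLVED_CRITICAL = "unresolved_critical"


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def resolve(term: str, glossary: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return the senses from the first scope (in lookup order) that defines the term."""
    for scope in SCOPE_ORDER:
        senses = glossary.get(scope, {}).get(term, [])
        if senses:
            return senses
    return []


def classify(senses: list[str], critical_step: bool, confidence: float,
             low_threshold: float = 0.5) -> ConflictType | None:
    """Classify a resolved term; None means exactly one sense and no conflict."""
    if len(senses) >= 2:
        return ConflictType.AMBIGUOUS
    if not senses:
        if critical_step and confidence < low_threshold:
            return ConflictType.UNRESOLVED_CRITICAL
        return ConflictType.UNKNOWN
    return None


def score_severity(conflict: ConflictType, critical_step: bool, confidence: float,
                   low_threshold: float = 0.5) -> Severity:
    """One reading of the severity rules in note 3."""
    low_confidence = confidence < low_threshold
    if critical_step and (low_confidence or conflict is ConflictType.AMBIGUOUS):
        return Severity.HIGH
    if conflict is ConflictType.AMBIGUOUS:
        return Severity.MEDIUM
    return Severity.LOW
```

Starting from a single explicit threshold makes the "start conservative, tune based on test corpus" mitigation a one-parameter change.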
Dependencies: WP02 (scope store), WP03 (extracted terms)
Parallel Opportunity: None (sequential after WP03)
Risks: Severity calibration (mitigate: start conservative, tune based on test corpus)
Estimated Prompt Size: ~350 lines
Phase 4: Generation Gate
WP05: Generation Gate & Strictness Policy
Goal: Implement generation gate that blocks LLM generation on unresolved high-severity conflicts, with configurable strictness policy.
Priority: P1 (MVP blocker)
Independent Test: Can block generation in medium/max modes, pass in off mode, respect precedence.
Included Subtasks:
- ✅ T021: Implement StrictnessPolicy (precedence resolution: global → mission → step → runtime)
- ✅ T022: Implement gate decision logic (off: pass, medium: block high-severity, max: block all)
- ✅ T023: Implement GenerationGateMiddleware
- ✅ T024: Write gate tests (strictness modes, blocking behavior, precedence)
Implementation Notes:
1. Strictness enum in strictness.py: off, medium, max
2. Precedence: runtime override > step metadata > mission config > global default
3. Gate raises BlockedByConflict exception if should block
4. Middleware emits GenerationBlockedBySemanticConflict when blocking
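The precedence chain and the three gate modes can be sketched as below. The enum values and the BlockedByConflict name follow the notes; the function names, their signatures, and the use of plain severity strings are assumptions for illustration.

```python
from __future__ import annotations

from enum import Enum


class Strictness(str, Enum):
    OFF = "off"
    MEDIUM = "medium"
    MAX = "max"


class BlockedByConflict(Exception):
    """Raised when the gate refuses to let generation proceed."""


def resolve_strictness(global_default: Strictness,
                       mission: Strictness | None = None,
                       step: Strictness | None = None,
                       runtime: Strictness | None = None) -> Strictness:
    """Pick the most specific setting: runtime > step > mission > global."""
    for candidate in (runtime, step, mission):
        if candidate is not None:
            return candidate
    return global_default


def should_block(strictness: Strictness, severities: list[str]) -> bool:
    """off: never block; medium: block on any high severity; max: block on any conflict."""
    if strictness is Strictness.OFF:
        return False
    if strictness is Strictness.MAX:
        return bool(severities)
    return "high" in severities


def gate(strictness: Strictness, severities: list[str]) -> None:
    """Raise BlockedByConflict when the policy says generation must stop."""
    if should_block(strictness, severities):
        raise BlockedByConflict(f"{len(severities)} unresolved conflict(s) under {strictness.value}")
```

Separating the pure decision (should_block) from the raising wrapper (gate) keeps the exhaustive precedence and mode matrix easy to test without exception handling.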
Dependencies: WP04 (semantic check)
Parallel Opportunity: None (sequential after WP04)
Risks: Precedence edge cases (mitigate: exhaustive test matrix)
Estimated Prompt Size: ~250 lines
Phase 5: Interactive Clarification
WP06: Interactive Clarification UI
Goal: Implement interactive clarification prompts using Typer + Rich, with ranked candidates and async defer option.
Priority: P2 (enhances UX, not blocking MVP)
Independent Test: Can render conflicts with Rich, prompt for user input, handle all choices (candidate, custom, defer).
Included Subtasks:
- ✅ T025: Implement conflict rendering with Rich (term, context, ranked candidates by confidence)
- ✅ T026: Implement Typer prompts (select candidate 1..N, C for custom, D for defer)
- ✅ T027: Implement non-interactive mode (auto-defer all conflicts)
- ✅ T028: Implement ClarificationMiddleware
- ✅ T029: Write clarification tests (interactive mocking, non-interactive mode)
Implementation Notes:
1. Rich tables for candidate display (term | scope | definition | confidence)
2. typer.prompt() for choice input with validation
3. Non-interactive detection: sys.stdin.isatty() returns False, or a CI environment variable is set
4. Middleware emits: GlossaryClarificationRequested (defer), GlossaryClarificationResolved (candidate), GlossarySenseUpdated (custom)
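The validation and detection logic in notes 2 and 3 can be kept separate from Typer and Rich so that it is testable without a terminal. The sketch below uses only the standard library; is_interactive, parse_choice, and the returned (action, index) tuple are hypothetical names and shapes, and treating any non-empty CI variable as non-interactive is an assumption.

```python
from __future__ import annotations

import os
import sys


def is_interactive(stdin=None, env=None) -> bool:
    """Interactive only when stdin is a TTY and no CI variable is set."""
    stream = sys.stdin if stdin is None else stdin
    environ = os.environ if env is None else env
    return bool(stream.isatty()) and not environ.get("CI")


def parse_choice(raw: str, candidate_count: int) -> tuple[str, int | None]:
    """Map prompt input to an action: 1..N picks a candidate, C is custom, D defers."""
    choice = raw.strip().upper()
    if choice == "C":
        return ("custom", None)
    if choice == "D":
        return ("defer", None)
    if choice.isdecimal() and 1 <= int(choice) <= candidate_count:
        return ("candidate", int(choice) - 1)  # zero-based index into the ranked list
    raise ValueError(f"expected 1..{candidate_count}, C, or D; got {raw!r}")
```

In non-interactive mode the middleware would skip the prompt entirely and treat every conflict as if the user had entered D.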
Dependencies: WP05 (conflicts exist)
Parallel Opportunity: Can run in parallel with WP07 (different concerns)
Risks: Terminal rendering issues (mitigate: test in CI, provide plain-text fallback)
Estimated Prompt Size: ~350 lines
Phase 6: Checkpoint/Resume
WP07: Checkpoint/Resume Mechanism
Goal: Implement event-sourced checkpoint/resume with input hash verification for cross-session recovery.
Priority: P2 (enhances UX, enables async workflow)
Independent Test: Can checkpoint before gate, resume after resolution, detect context changes.
Included Subtasks:
- ✅ T030: Implement StepCheckpoint data model (mission/run/step IDs, strictness, scope refs, input hash, cursor, retry token)
- ✅ T031: Implement checkpoint emission (before generation gate, minimal payload)
- ✅ T032: Implement checkpoint loading from event log (latest for step_id)
- ✅ T033: Implement input hash verification (SHA256, detect context changes, prompt for confirmation)
- ✅ T034: Implement ResumeMiddleware
- ✅ T035: Write checkpoint/resume tests (happy path, context changed, cross-session)
Implementation Notes:
1. Checkpoint emitted as StepCheckpointed event (may need to add to Feature 007 contracts)
2. Input hash: SHA256 of sorted JSON dump of step inputs
3. Resume flow: load checkpoint → verify hash → restore context → resume from cursor
4. typer.confirm() for context change confirmation
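Note 2 amounts to hashing a canonical JSON form of the step inputs. A minimal sketch, assuming the inputs are JSON-serializable; the function names and the choice of compact separators are illustrative, not taken from the implementation.

```python
import hashlib
import json


def compute_input_hash(step_inputs: dict) -> str:
    """SHA-256 hex digest of the step inputs serialized with sorted keys."""
    canonical = json.dumps(step_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def context_changed(checkpoint_hash: str, current_inputs: dict) -> bool:
    """True when the inputs no longer match the hash stored in the checkpoint."""
    return compute_input_hash(current_inputs) != checkpoint_hash
```

Sorting keys makes the hash independent of dict insertion order, so only a real change in the inputs triggers the confirmation prompt on resume.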
Dependencies: WP06 (clarification resolution)
Parallel Opportunity: Can run in parallel with WP06
Risks: StepCheckpointed event not in Feature 007 yet (mitigate: stub adapter, gate on package update)
Estimated Prompt Size: ~400 lines
Phase 7: Event Integration
WP08: Event Integration
Goal: Implement event emission adapters that import Feature 007 canonical contracts and emit at middleware boundaries.
Priority: P2 (enables replay, audit, SaaS sync)
Independent Test: Can emit all 7 canonical events + StepCheckpointed, events serialize correctly, persist to JSONL.
Included Subtasks:
- ✅ T036: Create event emission adapters (import from spec_kitty_events.glossary.events)
- ✅ T037: Implement event emission at middleware boundaries (extraction → check → gate → clarification → resume)
- ✅ T038: Implement event log persistence (JSONL via spec-kitty-events)
- ✅ T039: Write event emission tests (verify payloads, ordering, persistence)
Implementation Notes:
1. Events module: src/specify_cli/glossary/events.py
2. Import canonical events: GlossaryScopeActivated, TermCandidateObserved, SemanticCheckEvaluated, etc.
3. If StepCheckpointed not in package: stub adapter, document as pending Feature 007
4. Event log path: .kittify/events/glossary/{mission_id}.events.jsonl
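The plan persists events through spec-kitty-events, whose API is not shown here. As a stand-in, the sketch below shows the same append-only JSONL idea with the standard library alone; the record shape (type and payload keys) and all function names are assumptions, and only the path layout comes from note 4.

```python
import json
from pathlib import Path


def event_log_path(repo_root: Path, mission_id: str) -> Path:
    """Build the per-mission log path described in the notes."""
    return repo_root / ".kittify" / "events" / "glossary" / f"{mission_id}.events.jsonl"


def append_event(path: Path, event_type: str, payload: dict) -> None:
    """Append one event as a single JSON line, creating directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"type": event_type, "payload": payload}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(path: Path) -> list[dict]:
    """Read events back in emission order; a missing log yields an empty list."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A stub adapter for StepCheckpointed could write through the same functions until the event is available in the Feature 007 package.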
Dependencies: WP01-WP07 (all middleware components)
Parallel Opportunity: Can run in parallel with WP09
Risks: Feature 007 package not yet published (mitigate: stub adapters, gate implementation)
Estimated Prompt Size: ~300 lines
Phase 8: Middleware Pipeline Integration
WP09: Middleware Pipeline Integration
Goal: Integrate middleware pipeline into mission primitive execution, with metadata-driven attachment.
Priority: P2 (connects all pieces)
Independent Test: Can attach pipeline to primitive, execute full flow (extract → check → gate → clarify → resume), verify events.
Included Subtasks:
- ✅ T040: Implement PrimitiveExecutionContext extension (add glossary fields: extracted_terms, conflicts, strictness)
- ✅ T041: Implement middleware pipeline composition (GlossaryMiddlewarePipeline class)
- ✅ T042: Implement middleware attachment to primitives (read glossary_check metadata from mission.yaml)
- ✅ T043: Write full pipeline integration tests (end-to-end: spec-kitty specify with conflict)
Implementation Notes:
1. Context extension: add extracted_terms, conflicts, strictness, checkpoint fields
2. Pipeline: ordered list of middleware, execute sequentially, catch BlockedByConflict
3. Attachment: mission primitive base class hook or decorator (depends on 2.x architecture)
4. Metadata: glossary_check: enabled in mission.yaml step definitions
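Note 2 can be pictured as a small runner over callables. In the sketch below, GlossaryMiddlewarePipeline and BlockedByConflict are names from the plan; the Context dataclass stands in for the extended PrimitiveExecutionContext, and the blocked flag, the run method, and the callable-based middleware signature are assumptions. What should happen after the exception is caught is not specified above, so the sketch only records it.

```python
from dataclasses import dataclass, field
from typing import Callable


class BlockedByConflict(Exception):
    pass


@dataclass
class Context:
    """Stand-in for PrimitiveExecutionContext with the glossary fields added."""

    inputs: dict
    extracted_terms: list[str] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    strictness: str = "medium"
    blocked: bool = False  # hypothetical flag, set when the gate raises


Middleware = Callable[[Context], Context]


class GlossaryMiddlewarePipeline:
    def __init__(self, middleware: list[Middleware]) -> None:
        self._middleware = list(middleware)

    def run(self, ctx: Context) -> Context:
        """Run each middleware in order; stop and mark the context if the gate blocks."""
        try:
            for step in self._middleware:
                ctx = step(ctx)
        except BlockedByConflict:
            ctx.blocked = True
        return ctx
```

Usage would be along the lines of GlossaryMiddlewarePipeline([extract, check, gate]).run(Context(inputs=step_inputs)), with each stage reading and extending the fields written by the one before it.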
Dependencies: WP01-WP08 (all components)
Parallel Opportunity: None (integrates everything)
Risks: Primitive architecture changes in 2.x (mitigate: validate during WP implementation)
Estimated Prompt Size: ~300 lines
Phase 9: Glossary Management CLI (Optional)
WP10: Glossary Management CLI
Goal: Provide CLI commands for glossary inspection, conflict viewing, and async resolution.
Priority: P3 (nice-to-have, not blocking MVP)
Independent Test: Can list terms, view conflicts, resolve conflicts via CLI.
Included Subtasks:
- ✅ T044: Implement spec-kitty glossary list --scope <scope> command (table output with Rich)
- ✅ T045: Implement spec-kitty glossary conflicts --mission <mission> command (conflict history)
- ✅ T046: Implement spec-kitty glossary resolve <conflict_id> command (async resolution)
- ✅ T047: Write CLI command tests (mocked event log, Rich output verification)
Implementation Notes:
1. Commands in src/specify_cli/cli/commands/glossary.py
2. Use Typer @app decorators
3. Rich tables for output formatting
4. Read from event log (no separate state)
Dependencies: WP08 (events), WP09 (pipeline)
Parallel Opportunity: Can run in parallel with WP11
Risks: None (CLI commands are isolated)
Estimated Prompt Size: ~250 lines
Phase 10: Polish & Documentation
WP11: Type Safety & Integration Tests
Goal: Ensure mypy --strict compliance, write comprehensive integration tests, update user docs.
Priority: P3 (quality gate before release)
Independent Test: mypy passes with no errors, pytest coverage >90%, quickstart examples work.
Included Subtasks:
- ✅ T049: Update type annotations (mypy --strict compliance for all glossary modules)
- ✅ T050: Write integration tests (end-to-end workflows: specify with conflict, clarify, resume)
- ✅ T051: Update user documentation (quickstart examples, troubleshooting guide)
Implementation Notes:
1. Add type stubs for any untyped dependencies
2. Use pytest-cov for coverage reporting
3. Integration tests: simulate full mission runs with conflicts
4. Update quickstart.md with real-world examples
Dependencies: WP01-WP10 (all code complete)
Parallel Opportunity: None (final validation)
Risks: None (polish work)
Estimated Prompt Size: ~250 lines
Dependency Graph
WP01 (Foundation)
├─> WP02 (Scope Management)
│   ├─> WP03 (Term Extraction)
│   │   └─> WP04 (Semantic Check)
│   │       └─> WP05 (Generation Gate) [MVP END]
│   │           ├─> WP06 (Clarification) [P]
│   │           └─> WP07 (Checkpoint) [P]
│   │               └─> WP08 (Events)
│   │                   └─> WP09 (Pipeline)
│   │                       ├─> WP10 (CLI) [P]
│   │                       └─> WP11 (Polish)
[P] = Parallel opportunities
Parallelization Strategy
Wave 1 (after WP01):
- WP02 (Scope Management)
- WP03 (Term Extraction) - can start after WP01
Wave 2 (after WP05):
- WP06 (Clarification)
- WP07 (Checkpoint) - can run in parallel
Wave 3 (after WP09):
- WP10 (CLI)
- WP11 (Polish) - must wait for all code
Maximum parallelization: 2-3 agents simultaneously (Waves 1-2)
Risk Matrix
| WP | Risk | Severity | Mitigation |
|---|---|---|---|
| WP01 | None | - | Data models are straightforward |
| WP02 | Seed file schema validation | Low | Strict YAML schema, fail-fast |
| WP03 | False positives from heuristics | Medium | Confidence scoring, draft terms |
| WP04 | Severity calibration | Medium | Start conservative, tune with corpus |
| WP05 | Precedence edge cases | Low | Exhaustive test matrix |
| WP06 | Terminal rendering issues | Low | Test in CI, plain-text fallback |
| WP07 | StepCheckpointed not in Feature 007 | Medium | Stub adapter, gate on package |
| WP08 | Feature 007 package not published | High | Stub adapters, defer integration |
| WP09 | Primitive architecture unknown | High | Validate during implementation |
| WP10 | None | - | CLI commands isolated |
| WP11 | Coverage gaps | Low | Run pytest-cov, fill gaps |
Acceptance Criteria (from spec.md)
AC-001: medium strictness warns broadly and blocks only unresolved high severity
AC-002: off mode allows mission execution without glossary enforcement
AC-003: Step metadata can enable glossary checks for any custom primitive
AC-004: Replay reproduces glossary evolution and generation gate outcomes
All acceptance criteria are covered across WP01-WP11.
Next Steps
MVP Implementation:
spec-kitty implement WP01 # Foundation (no dependencies)
spec-kitty implement WP02 --base WP01 # Scope Management
spec-kitty implement WP03 --base WP01 # Term Extraction (parallel with WP02)
spec-kitty implement WP04 --base WP03 # Semantic Check
spec-kitty implement WP05 --base WP04 # Generation Gate [MVP COMPLETE]
Full Feature:
# After MVP:
spec-kitty implement WP06 --base WP05 # Clarification (parallel with WP07)
spec-kitty implement WP07 --base WP05 # Checkpoint (parallel with WP06)
spec-kitty implement WP08 --base WP07 # Events
spec-kitty implement WP09 --base WP08 # Pipeline
spec-kitty implement WP10 --base WP09 # CLI (parallel with WP11)
spec-kitty implement WP11 --base WP09 # Polish
Estimated Timeline
With 1 agent (sequential): 11-15 weeks
With 2 agents (parallel): 8-10 weeks
With 3 agents (max parallel): 6-8 weeks
MVP only (WP01-WP05): 4-6 weeks (1 agent), 3-4 weeks (2 agents)
<!-- status-model:start -->
Canonical Status (Generated)
- WP01: done
- WP02: done
- WP03: done
- WP04: done
- WP05: done
- WP06: done
- WP07: done
- WP08: done
- WP09: done
- WP10: done
- WP11: done
<!-- status-model:end -->