Implementation Tasks: Glossary Semantic Integrity Runtime for Mission Framework

Feature: 041-mission-glossary-semantic-integrity
Target Branch: 2.x
Created: 2026-02-16
Status: Ready for implementation

Overview

This feature implements a glossary semantic integrity runtime that enforces consistent terminology during mission execution. A middleware pipeline extracts terms, resolves them against a 4-tier scope hierarchy, detects conflicts, blocks LLM generation on high-severity issues, and prompts for interactive clarification with checkpoint/resume capability.

Architecture: B + D Hybrid (middleware chain + event emission)
Dependencies: typer, rich, ruamel.yaml, spec-kitty-events (Feature 007 contracts)
Testing: pytest 90%+ coverage, mypy --strict


Work Package Summary

Total Work Packages: 11
Total Subtasks: 51
Estimated Timeline: 8-12 weeks (with parallel execution)

MVP Scope

MVP = WP01-WP05 (Foundation through Generation Gate)

Delivers core glossary enforcement:

  • Term extraction (metadata hints + heuristics)
  • Scope resolution (4-tier hierarchy)
  • Conflict detection (4 types)
  • Generation gate blocking (strictness modes)

NOT in MVP: Interactive clarification UI, checkpoint/resume, CLI commands (deferred to WP06-WP11)


Phase 1: Foundation & Infrastructure

WP01: Foundation & Data Models

Goal: Establish package structure, core data models, and test infrastructure.

Priority: P1 (blocking for all other WPs)

Independent Test: Can create TermSurface, TermSense, SemanticConflict objects and verify serialization.

Included Subtasks:

  • T001: Create glossary package structure (src/specify_cli/glossary/)
  • T002: Define core data models (TermSurface, TermSense, GlossaryScope, SemanticConflict)
  • T003: Define exception hierarchy (BlockedByConflict, DeferredToAsync, AbortResume)
  • T004: Set up test infrastructure (fixtures, mocks for PrimitiveExecutionContext)
  • T005: Implement GlossaryScope enum with resolution order

Implementation Notes:

  1. Create src/specify_cli/glossary/__init__.py with public API exports
  2. Define dataclasses in models.py using Python 3.11+ features
  3. Create exception base class GlossaryError with specific subclasses
  4. Set up pytest fixtures in tests/specify_cli/glossary/conftest.py
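As a rough illustration of T002, T003, and T005, the sketch below shows one way the models and exception hierarchy could look. Only the class names come from this plan; every field name (surface_text, definition, confidence, conflict_type, severity, candidate_senses) and the resolution_order() helper are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from enum import Enum


class GlossaryScope(str, Enum):
    """The four scope tiers, declared in resolution order (first wins)."""

    MISSION_LOCAL = "mission_local"
    TEAM_DOMAIN = "team_domain"
    AUDIENCE_DOMAIN = "audience_domain"
    SPEC_KITTY_CORE = "spec_kitty_core"

    @classmethod
    def resolution_order(cls) -> list[GlossaryScope]:
        # Enum members iterate in declaration order.
        return list(cls)


@dataclass(frozen=True)
class TermSurface:
    surface_text: str  # hypothetical field: the normalized term as written


@dataclass(frozen=True)
class TermSense:
    surface: TermSurface
    scope: GlossaryScope
    definition: str  # hypothetical field
    confidence: float = 1.0  # hypothetical field


@dataclass
class SemanticConflict:
    term: TermSurface
    conflict_type: str  # unknown | ambiguous | inconsistent | unresolved_critical
    severity: str  # low | medium | high
    candidate_senses: list[TermSense] = field(default_factory=list)


class GlossaryError(Exception):
    """Base class for all glossary runtime errors."""


class BlockedByConflict(GlossaryError):
    """Raised when the generation gate refuses to proceed."""


class DeferredToAsync(GlossaryError):
    """Raised when a conflict is deferred for later resolution."""


class AbortResume(GlossaryError):
    """Raised when a resume attempt is abandoned."""
```

With this shape, dataclasses.asdict(sense) yields a plain nested dict, which is one simple way to satisfy the "verify serialization" part of the independent test.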

Dependencies: None (foundational WP)

Risks: None (data model definitions are straightforward)

Estimated Prompt Size: ~300 lines


WP02: Scope Management & Storage

Goal: Implement glossary scope loading, seed file parsing, and in-memory storage backed by event log.

Priority: P1 (blocking for WP03-WP05)

Independent Test: Can load seed files, activate scopes, query glossary store for terms.

Included Subtasks:

  • T006: Implement seed file loader (YAML parsing for team_domain.yaml, audience_domain.yaml)
  • T007: Implement scope activation (emit GlossaryScopeActivated events)
  • T008: Implement glossary store (in-memory cache backed by event log)
  • T009: Write scope resolution tests (hierarchical lookup)
  • T048: Create spec_kitty_core.yaml seed file (canonical Spec Kitty terms)

Implementation Notes:

  1. Seed files live in .kittify/glossaries/{scope}.yaml
  2. Use ruamel.yaml for parsing (preserve comments, order)
  3. Store uses LRU cache for performance (max 10,000 terms)
  4. Event log is source of truth for glossary state
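The idea of an in-memory store that is rebuilt from the event log can be sketched as follows. This is a minimal illustration using only the standard library: the event dictionary keys (event_type, scope, term, definition) and the choice of GlossarySenseUpdated as the event that carries definitions are assumptions, not the Feature 007 contract.

```python
from __future__ import annotations

from collections import OrderedDict

# Resolution order taken from T016.
SCOPE_ORDER = ("mission_local", "team_domain", "audience_domain", "spec_kitty_core")


class GlossaryStore:
    """In-memory view rebuilt by replaying events; the log stays authoritative."""

    def __init__(self, max_terms: int = 10_000) -> None:
        self._max_terms = max_terms
        # Keyed by (scope, term); insertion order doubles as LRU order.
        self._senses: OrderedDict[tuple[str, str], str] = OrderedDict()
        self.active_scopes: set[str] = set()

    def apply(self, event: dict) -> None:
        kind = event["event_type"]  # hypothetical payload key
        if kind == "GlossaryScopeActivated":
            self.active_scopes.add(event["scope"])
        elif kind == "GlossarySenseUpdated":
            key = (event["scope"], event["term"])
            self._senses[key] = event["definition"]
            self._senses.move_to_end(key)
            # Evict least recently used entries beyond the cap.
            while len(self._senses) > self._max_terms:
                self._senses.popitem(last=False)

    @classmethod
    def replay(cls, events: list[dict], max_terms: int = 10_000) -> GlossaryStore:
        store = cls(max_terms)
        for event in events:
            store.apply(event)
        return store

    def lookup(self, term: str) -> tuple[str, str] | None:
        """Return (scope, definition) from the first active scope that knows the term."""
        for scope in SCOPE_ORDER:
            key = (scope, term)
            if scope in self.active_scopes and key in self._senses:
                self._senses.move_to_end(key)  # mark as recently used
                return scope, self._senses[key]
        return None
```

Because state is derived purely from replay, an evicted entry is not lost; a fuller implementation would presumably re-read it from the log on a cache miss.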

Dependencies: WP01 (data models)

Parallel Opportunity: Can run in parallel with WP03 (different modules)

Risks: Seed file schema validation (mitigate: strict YAML schema, fail-fast)

Estimated Prompt Size: ~350 lines


Phase 2: Term Extraction

WP03: Term Extraction Implementation

Goal: Implement term extraction using metadata hints + deterministic heuristics, with scope-aware normalization and confidence scoring.

Priority: P1 (blocking for WP04)

Independent Test: Can extract terms from sample step inputs, verify confidence scores, validate normalization.

Included Subtasks:

  • T010: Implement metadata hints extraction (glossary_watch_terms, aliases, exclude, fields)
  • T011: Implement deterministic heuristics (quoted phrases, acronyms, casing patterns, repeats)
  • T012: Implement scope-aware normalization (lowercase, trim, stem-light)
  • T013: Implement confidence scoring (metadata > pattern > weak heuristic)
  • T014: Implement GlossaryCandidateExtractionMiddleware
  • T015: Write extraction tests (unit + integration with mocked context)

Implementation Notes:

  1. Extraction logic in extraction.py (pure functions, no side effects)
  2. Heuristic patterns: r'"([^"]+)"' (quoted), r'\b[A-Z]{2,5}\b' (acronyms), r'\b[a-z]+_[a-z]+\b' (snake_case)
  3. Stem-light: simple plural→singular (workspaces→workspace), no full stemming
  4. Middleware emits TermCandidateObserved for each extracted term
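The notes above can be sketched as a pair of pure functions. The three regular expressions are the ones listed in note 2; the confidence values, the function names, and the exact plural rule are assumptions chosen for illustration, and the weak-heuristic tier (repeats, casing) is left out for brevity.

```python
from __future__ import annotations

import re

QUOTED = re.compile(r'"([^"]+)"')
ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")
SNAKE_CASE = re.compile(r"\b[a-z]+_[a-z]+\b")

# Hypothetical scores; the plan only fixes the ordering metadata > pattern.
CONFIDENCE = {"metadata": 0.9, "pattern": 0.6}


def normalize(term: str) -> str:
    """Lowercase, trim, and apply a stem-light plural rule (no full stemming)."""
    term = term.strip().lower()
    # Drop one trailing "s", skipping short words and endings like "ss"/"us"/"is".
    if len(term) > 3 and term.endswith("s") and not term.endswith(("ss", "us", "is")):
        term = term[:-1]
    return term


def extract_candidates(text: str, watch_terms: tuple[str, ...] = ()) -> dict[str, float]:
    """Return {normalized term: best confidence} with no side effects."""
    found: dict[str, float] = {}

    def add(raw: str, confidence: float) -> None:
        key = normalize(raw)
        if key:
            found[key] = max(confidence, found.get(key, 0.0))

    for term in watch_terms:  # metadata hints (glossary_watch_terms)
        add(term, CONFIDENCE["metadata"])
    for pattern in (QUOTED, ACRONYM, SNAKE_CASE):
        for match in pattern.findall(text):
            add(match, CONFIDENCE["pattern"])
    return found
```

Keeping the highest confidence per normalized term means a metadata hint is never downgraded by a later heuristic match on the same term.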

Dependencies: WP01 (data models, including the GlossaryScope definitions)

Parallel Opportunity: Can run in parallel with WP02 after WP01 completes

Risks: False positives from heuristics (mitigate: confidence scoring, low-confidence terms auto-add as draft)

Estimated Prompt Size: ~400 lines


Phase 3: Semantic Check & Conflict Detection

WP04: Semantic Check & Conflict Detection

Goal: Implement term resolution against scope hierarchy, conflict classification, and severity scoring.

Priority: P1 (blocking for WP05)

Independent Test: Can resolve terms, detect all 4 conflict types, score severity correctly.

Included Subtasks:

  • T016: Implement term resolution against scope hierarchy (mission_local → team_domain → audience_domain → spec_kitty_core)
  • T017: Implement conflict classification (unknown, ambiguous, inconsistent, unresolved_critical)
  • T018: Implement severity scoring (step criticality + confidence → low/medium/high)
  • T019: Implement SemanticCheckMiddleware
  • T020: Write semantic check tests (all conflict types, severity edge cases)

Implementation Notes:

  1. Resolution logic in resolution.py (hierarchical lookup with fallback)
  2. Conflict types: no match (unknown), 2+ matches (ambiguous), contradictory usage (inconsistent), critical + low confidence (unresolved_critical)
  3. Severity: high (critical step + low confidence OR ambiguous), medium (non-critical + ambiguous), low (inconsistent OR unknown + high confidence)
  4. Middleware emits SemanticCheckEvaluated with findings
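A minimal sketch of the resolution and scoring rules follows. It rests on two readings that this plan does not spell out: that "2+ matches" is counted within the first scope containing the term, and that the high rule groups as "critical step AND (low confidence OR ambiguous)". The nested-dict glossary shape, the function names, and the 0.5 threshold are hypothetical. Only the unknown and ambiguous types are classified here.

```python
from __future__ import annotations

SCOPE_ORDER = ("mission_local", "team_domain", "audience_domain", "spec_kitty_core")


def resolve(term: str, glossary: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return the senses from the first scope that defines the term, else []."""
    for scope in SCOPE_ORDER:
        senses = glossary.get(scope, {}).get(term, [])
        if senses:
            return senses
    return []  # fell through every tier


def classify(senses: list[str]) -> str | None:
    """Map a resolution result to a conflict type, or None when it is clean."""
    if not senses:
        return "unknown"
    if len(senses) >= 2:
        return "ambiguous"
    return None


def score_severity(
    conflict_type: str,
    critical_step: bool,
    confidence: float,
    low_threshold: float = 0.5,  # hypothetical cut-off for "low confidence"
) -> str:
    low_confidence = confidence < low_threshold
    if critical_step and (low_confidence or conflict_type == "ambiguous"):
        return "high"
    if conflict_type == "ambiguous":
        return "medium"
    return "low"
```

For example, a term defined twice in team_domain resolves to both senses, classifies as ambiguous, and scores high on a critical step but only medium elsewhere.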

Dependencies: WP02 (scope store), WP03 (extracted terms)

Parallel Opportunity: None (sequential after WP03)

Risks: Severity calibration (mitigate: start conservative, tune based on test corpus)

Estimated Prompt Size: ~350 lines


Phase 4: Generation Gate

WP05: Generation Gate & Strictness Policy

Goal: Implement generation gate that blocks LLM generation on unresolved high-severity conflicts, with configurable strictness policy.

Priority: P1 (MVP blocker)

Independent Test: Can block generation in medium/max modes, pass in off mode, respect precedence.

Included Subtasks:

  • T021: Implement StrictnessPolicy (precedence resolution, lowest to highest: global → mission → step → runtime)
  • T022: Implement gate decision logic (off: pass, medium: block high-severity, max: block all)
  • T023: Implement GenerationGateMiddleware
  • T024: Write gate tests (strictness modes, blocking behavior, precedence)

Implementation Notes:

  1. Strictness enum in strictness.py: off, medium, max
  2. Precedence: runtime override > step metadata > mission config > global default
  3. Gate raises BlockedByConflict exception if should block
  4. Middleware emits GenerationBlockedBySemanticConflict when blocking
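The precedence and gate rules above can be sketched in a few lines. The enum values, the precedence order, and the per-mode blocking behavior come from this plan; the function names and the representation of conflicts as a list of severity strings are assumptions.

```python
from __future__ import annotations

from enum import Enum


class Strictness(str, Enum):
    OFF = "off"
    MEDIUM = "medium"
    MAX = "max"


class BlockedByConflict(Exception):
    """Raised by the gate when generation must not proceed."""


def resolve_strictness(
    global_default: Strictness,
    mission: Strictness | None = None,
    step: Strictness | None = None,
    runtime: Strictness | None = None,
) -> Strictness:
    """Most specific setting wins: runtime > step > mission > global."""
    for candidate in (runtime, step, mission):
        if candidate is not None:
            return candidate
    return global_default


def should_block(strictness: Strictness, severities: list[str]) -> bool:
    """off: never block; medium: block on any high severity; max: block on any conflict."""
    if strictness is Strictness.OFF:
        return False
    if strictness is Strictness.MAX:
        return len(severities) > 0
    return "high" in severities


def gate(strictness: Strictness, severities: list[str]) -> None:
    if should_block(strictness, severities):
        raise BlockedByConflict(f"{len(severities)} unresolved conflict(s)")
```

Separating should_block() from gate() keeps the decision a pure function, which makes the exhaustive precedence and mode test matrix cheap to write.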

Dependencies: WP04 (semantic check)

Parallel Opportunity: None (sequential after WP04)

Risks: Precedence edge cases (mitigate: exhaustive test matrix)

Estimated Prompt Size: ~250 lines


Phase 5: Interactive Clarification

WP06: Interactive Clarification UI

Goal: Implement interactive clarification prompts using Typer + Rich, with ranked candidates and async defer option.

Priority: P2 (enhances UX, not blocking MVP)

Independent Test: Can render conflicts with Rich, prompt for user input, handle all choices (candidate, custom, defer).

Included Subtasks:

  • T025: Implement conflict rendering with Rich (term, context, ranked candidates by confidence)
  • T026: Implement Typer prompts (select candidate 1..N, C for custom, D for defer)
  • T027: Implement non-interactive mode (auto-defer all conflicts)
  • T028: Implement ClarificationMiddleware
  • T029: Write clarification tests (interactive mocking, non-interactive mode)

Implementation Notes:

  1. Rich tables for candidate display (term | scope | definition | confidence)
  2. typer.prompt() for choice input with validation
  3. Non-interactive detection: stdin is not a TTY (sys.stdin.isatty() returns False) or the CI env var is set
  4. Middleware emits: GlossaryClarificationRequested (defer), GlossaryClarificationResolved (candidate), GlossarySenseUpdated (custom)
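The two pieces of logic that do not need a terminal, mode detection and choice validation, can be sketched with the standard library alone. The function names and the (kind, index) return shape are assumptions; rendering with Rich and prompting with typer.prompt() are deliberately left out.

```python
from __future__ import annotations

import os
import sys
from typing import Mapping, Optional, TextIO


def is_interactive(
    stdin: Optional[TextIO] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Interactive only when stdin is a TTY and no CI env var is set."""
    stdin = sys.stdin if stdin is None else stdin
    environ = os.environ if environ is None else environ
    return bool(stdin.isatty()) and not environ.get("CI")


def parse_choice(raw: str, n_candidates: int) -> tuple[str, Optional[int]]:
    """Validate prompt input: 1..N selects a candidate, C is custom, D defers."""
    text = raw.strip().upper()
    if text == "C":
        return ("custom", None)
    if text == "D":
        return ("defer", None)
    if text.isascii() and text.isdigit() and 1 <= int(text) <= n_candidates:
        return ("candidate", int(text) - 1)  # zero-based index into the ranked list
    raise ValueError(f"expected 1..{n_candidates}, C, or D; got {raw!r}")
```

In non-interactive mode the middleware would skip the prompt and treat every conflict as a defer, per T027. Passing stdin and environ as parameters lets tests exercise both modes without patching globals.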

Dependencies: WP05 (conflicts exist)

Parallel Opportunity: Can run in parallel with WP07 (different concerns)

Risks: Terminal rendering issues (mitigate: test in CI, provide plain-text fallback)

Estimated Prompt Size: ~350 lines


Phase 6: Checkpoint/Resume

WP07: Checkpoint/Resume Mechanism

Goal: Implement event-sourced checkpoint/resume with input hash verification for cross-session recovery.

Priority: P2 (enhances UX, enables async workflow)

Independent Test: Can checkpoint before gate, resume after resolution, detect context changes.

Included Subtasks:

  • T030: Implement StepCheckpoint data model (mission/run/step IDs, strictness, scope refs, input hash, cursor, retry token)
  • T031: Implement checkpoint emission (before generation gate, minimal payload)
  • T032: Implement checkpoint loading from event log (latest for step_id)
  • T033: Implement input hash verification (SHA256, detect context changes, prompt for confirmation)
  • T034: Implement ResumeMiddleware
  • T035: Write checkpoint/resume tests (happy path, context changed, cross-session)

Implementation Notes:

  1. Checkpoint emitted as StepCheckpointed event (may need to add to Feature 007 contracts)
  2. Input hash: SHA256 of sorted JSON dump of step inputs
  3. Resume flow: load checkpoint → verify hash → restore context → resume from cursor
  4. typer.confirm() for context change confirmation
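The input hash in note 2 can be sketched as below. The function names are hypothetical, and the compact separators are an assumption added so that whitespace differences cannot change the digest; the plan itself only specifies SHA256 over a sorted JSON dump.

```python
import hashlib
import json


def compute_input_hash(step_inputs: dict) -> str:
    """SHA-256 over a canonical JSON dump, stable across key ordering."""
    canonical = json.dumps(step_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def context_changed(checkpoint_hash: str, current_inputs: dict) -> bool:
    """True when the inputs no longer match what was checkpointed."""
    return compute_input_hash(current_inputs) != checkpoint_hash
```

On resume, a True result from context_changed() is the point where the flow would ask for confirmation before restoring context and continuing from the cursor.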

Dependencies: WP05 (generation gate); only the end-to-end resume flow relies on WP06 clarification resolution

Parallel Opportunity: Can run in parallel with WP06

Risks: StepCheckpointed event not in Feature 007 yet (mitigate: stub adapter, gate on package update)

Estimated Prompt Size: ~400 lines


Phase 7: Event Integration

WP08: Event Integration

Goal: Implement event emission adapters that import Feature 007 canonical contracts and emit at middleware boundaries.

Priority: P2 (enables replay, audit, SaaS sync)

Independent Test: Can emit all 7 canonical events + StepCheckpointed, events serialize correctly, persist to JSONL.

Included Subtasks:

  • T036: Create event emission adapters (import from spec_kitty_events.glossary.events)
  • T037: Implement event emission at middleware boundaries (extraction → check → gate → clarification → resume)
  • T038: Implement event log persistence (JSONL via spec-kitty-events)
  • T039: Write event emission tests (verify payloads, ordering, persistence)

Implementation Notes:

  1. Events module: src/specify_cli/glossary/events.py
  2. Import canonical events: GlossaryScopeActivated, TermCandidateObserved, SemanticCheckEvaluated, etc.
  3. If StepCheckpointed not in package: stub adapter, document as pending Feature 007
  4. Event log path: .kittify/events/glossary/{mission_id}.events.jsonl
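For orientation, append-only JSONL persistence at the path in note 4 might look like the sketch below. The plan routes persistence through spec-kitty-events, whose API is not shown here, so this is a standard-library stand-in with hypothetical function names and event dictionaries of assumed shape.

```python
from __future__ import annotations

import json
from pathlib import Path


def event_log_path(repo_root: Path, mission_id: str) -> Path:
    """Build .kittify/events/glossary/{mission_id}.events.jsonl under repo_root."""
    return repo_root / ".kittify" / "events" / "glossary" / f"{mission_id}.events.jsonl"


def append_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line, creating directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(path: Path) -> list[dict]:
    """Read events back in emission order; a missing log means no events yet."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because the file is only ever appended to, line order is emission order, which is what the ordering assertions in T039 would check.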

Dependencies: WP01-WP07 (all middleware components)

Parallel Opportunity: Can run in parallel with WP09

Risks: Feature 007 package not yet published (mitigate: stub adapters, gate implementation)

Estimated Prompt Size: ~300 lines


Phase 8: Middleware Pipeline Integration

WP09: Middleware Pipeline Integration

Goal: Integrate middleware pipeline into mission primitive execution, with metadata-driven attachment.

Priority: P2 (connects all pieces)

Independent Test: Can attach pipeline to primitive, execute full flow (extract → check → gate → clarify → resume), verify events.

Included Subtasks:

  • T040: Implement PrimitiveExecutionContext extension (add glossary fields: extracted_terms, conflicts, strictness)
  • T041: Implement middleware pipeline composition (GlossaryMiddlewarePipeline class)
  • T042: Implement middleware attachment to primitives (read glossary_check metadata from mission.yaml)
  • T043: Write full pipeline integration tests (end-to-end: spec-kitty specify with conflict)

Implementation Notes:

  1. Context extension: add extracted_terms, conflicts, strictness, checkpoint fields
  2. Pipeline: ordered list of middleware, execute sequentially, catch BlockedByConflict
  3. Attachment: mission primitive base class hook or decorator (depends on 2.x architecture)
  4. Metadata: glossary_check: enabled in mission.yaml step definitions
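The composition described in note 2 can be sketched as an ordered list of callables. The Context class here is a simplified stand-in for the extended PrimitiveExecutionContext, the blocked flag is an assumption about what happens after BlockedByConflict is caught, and the three toy middleware functions exist only to exercise the pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


class BlockedByConflict(Exception):
    """Raised by the gate middleware to stop the pipeline."""


@dataclass
class Context:  # stand-in for the extended PrimitiveExecutionContext
    step_input: str
    extracted_terms: list[str] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)
    strictness: str = "medium"
    blocked: bool = False  # hypothetical flag set when the gate blocks


Middleware = Callable[[Context], Context]


class GlossaryMiddlewarePipeline:
    def __init__(self, middleware: list[Middleware]) -> None:
        self._middleware = list(middleware)

    def process(self, ctx: Context) -> Context:
        """Run each middleware in order; stop at the first BlockedByConflict."""
        try:
            for step in self._middleware:
                ctx = step(ctx)
        except BlockedByConflict:
            ctx.blocked = True
        return ctx


# Toy middleware for illustration only.
KNOWN_TERMS = {"mission", "glossary"}


def extract(ctx: Context) -> Context:
    ctx.extracted_terms = ctx.step_input.lower().split()
    return ctx


def check(ctx: Context) -> Context:
    ctx.conflicts = [t for t in ctx.extracted_terms if t not in KNOWN_TERMS]
    return ctx


def gate(ctx: Context) -> Context:
    if ctx.strictness != "off" and ctx.conflicts:
        raise BlockedByConflict(", ".join(ctx.conflicts))
    return ctx
```

Usage is GlossaryMiddlewarePipeline([extract, check, gate]).process(Context("mission workspace")), which returns a context with blocked set and "workspace" recorded as the conflicting term.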

Dependencies: WP01-WP08 (all components)

Parallel Opportunity: None (integrates everything)

Risks: Primitive architecture changes in 2.x (mitigate: validate during WP implementation)

Estimated Prompt Size: ~300 lines


Phase 9: Glossary Management CLI (Optional)

WP10: Glossary Management CLI

Goal: Provide CLI commands for glossary inspection, conflict viewing, and async resolution.

Priority: P3 (nice-to-have, not blocking MVP)

Independent Test: Can list terms, view conflicts, resolve conflicts via CLI.

Included Subtasks:

  • T044: Implement spec-kitty glossary list --scope <scope> command (table output with Rich)
  • T045: Implement spec-kitty glossary conflicts --mission <mission> command (conflict history)
  • T046: Implement spec-kitty glossary resolve <conflict_id> command (async resolution)
  • T047: Write CLI command tests (mocked event log, Rich output verification)

Implementation Notes:

  1. Commands in src/specify_cli/cli/commands/glossary.py
  2. Use Typer @app decorators
  3. Rich tables for output formatting
  4. Read from event log (no separate state)

Dependencies: WP08 (events), WP09 (pipeline)

Parallel Opportunity: Can run in parallel with WP11

Risks: None (CLI commands are isolated)

Estimated Prompt Size: ~250 lines


Phase 10: Polish & Documentation

WP11: Type Safety & Integration Tests

Goal: Ensure mypy --strict compliance, write comprehensive integration tests, update user docs.

Priority: P3 (quality gate before release)

Independent Test: mypy passes with no errors, pytest coverage >90%, quickstart examples work.

Included Subtasks:

  • T049: Update type annotations (mypy --strict compliance for all glossary modules)
  • T050: Write integration tests (end-to-end workflows: specify with conflict, clarify, resume)
  • T051: Update user documentation (quickstart examples, troubleshooting guide)

Implementation Notes:

  1. Add type stubs for any untyped dependencies
  2. Use pytest-cov for coverage reporting
  3. Integration tests: simulate full mission runs with conflicts
  4. Update quickstart.md with real-world examples

Dependencies: WP01-WP10 (all code complete)

Parallel Opportunity: None (final validation)

Risks: None (polish work)

Estimated Prompt Size: ~250 lines


Dependency Graph

WP01 (Foundation)
  ├─> WP02 (Scope Management)
  │     ├─> WP03 (Term Extraction)
  │     │     └─> WP04 (Semantic Check)
  │     │           └─> WP05 (Generation Gate) [MVP END]
  │     │                 ├─> WP06 (Clarification) [P]
  │     │                 └─> WP07 (Checkpoint) [P]
  │     │                       └─> WP08 (Events)
  │     │                             └─> WP09 (Pipeline)
  │     │                                   ├─> WP10 (CLI) [P]
  │     │                                   └─> WP11 (Polish)

[P] = Parallel opportunities


Parallelization Strategy

Wave 1 (after WP01):

  • WP02 (Scope Management)
  • WP03 (Term Extraction) - can start after WP01

Wave 2 (after WP05):

  • WP06 (Clarification)
  • WP07 (Checkpoint) - can run in parallel

Wave 3 (after WP09):

  • WP10 (CLI)
  • WP11 (Polish) - must wait for all code

Maximum parallelization: 2-3 agents simultaneously (Waves 1-2)


Risk Matrix

WP    Risk                                 Severity  Mitigation
WP01  None                                 -         Data models are straightforward
WP02  Seed file schema validation          Low       Strict YAML schema, fail-fast
WP03  False positives from heuristics      Medium    Confidence scoring, draft terms
WP04  Severity calibration                 Medium    Start conservative, tune with corpus
WP05  Precedence edge cases                Low       Exhaustive test matrix
WP06  Terminal rendering issues            Low       Test in CI, plain-text fallback
WP07  StepCheckpointed not in Feature 007  Medium    Stub adapter, gate on package
WP08  Feature 007 package not published    High      Stub adapters, defer integration
WP09  Primitive architecture unknown       High      Validate during implementation
WP10  None                                 -         CLI commands isolated
WP11  Coverage gaps                        Low       Run pytest-cov, fill gaps

Acceptance Criteria (from spec.md)

AC-001: medium strictness warns broadly and blocks only unresolved high severity
AC-002: off mode allows mission execution without glossary enforcement
AC-003: Step metadata can enable glossary checks for any custom primitive
AC-004: Replay reproduces glossary evolution and generation gate outcomes

All acceptance criteria are covered across WP01-WP11.


Next Steps

MVP Implementation:

spec-kitty implement WP01  # Foundation (no dependencies)
spec-kitty implement WP02 --base WP01  # Scope Management
spec-kitty implement WP03 --base WP01  # Term Extraction (parallel with WP02)
spec-kitty implement WP04 --base WP03  # Semantic Check
spec-kitty implement WP05 --base WP04  # Generation Gate [MVP COMPLETE]

Full Feature:

# After MVP:
spec-kitty implement WP06 --base WP05  # Clarification (parallel with WP07)
spec-kitty implement WP07 --base WP05  # Checkpoint (parallel with WP06)
spec-kitty implement WP08 --base WP07  # Events
spec-kitty implement WP09 --base WP08  # Pipeline
spec-kitty implement WP10 --base WP09  # CLI (parallel with WP11)
spec-kitty implement WP11 --base WP09  # Polish

Estimated Timeline

With 1 agent (sequential): 11-15 weeks
With 2 agents (parallel): 8-10 weeks
With 3 agents (max parallel): 6-8 weeks

MVP only (WP01-WP05): 4-6 weeks (1 agent), 3-4 weeks (2 agents)

<!-- status-model:start -->

Canonical Status (Generated)

  • WP01: done
  • WP02: done
  • WP03: done
  • WP04: done
  • WP05: done
  • WP06: done
  • WP07: done
  • WP08: done
  • WP09: done
  • WP10: done
  • WP11: done

<!-- status-model:end -->