Spec Kitty

└─ kitty-specs
   └─ Glossary Semantic Integrity Runtime for Mission Framework

Mission Run:

📚 Docs ↗

Research: Glossary Semantic Integrity Runtime

Feature: 041-mission-glossary-semantic-integrity Date: 2026-02-16 Status: Complete

Research Goals

1. Understand middleware integration points in spec-kitty 2.x mission primitive execution 2. Extract canonical event schemas from Feature 007 (spec-kitty-events package) 3. Research deterministic term extraction patterns 4. Identify best practices for hierarchical glossary scope resolution 5. Define minimal checkpoint state for deterministic resume

Finding 1: Middleware Architecture (B + D Hybrid)

Question: How should glossary checks integrate with mission primitive execution?

Research:

Reviewed spec-kitty 2.x mission framework architecture
Primitives are config-defined (YAML/JSON), not stable Python class hierarchy
Need synchronous gate behavior (block generation, prompt, resume) - pure event-driven can't enforce this
Middleware provides one consistent execution choke point for all primitive types

Decision: B + D Hybrid - Middleware chain as primary control path + Event emission at boundaries

Rationale:

Middleware gives synchronous control (can block generation deterministically)
Config-driven primitives don't have stable decorators/base classes to attach to
Events provide observability/audit/replay (Feature 007 requirement)
Eventing is for telemetry, not enforcement mechanism

Middleware Pipeline Design:

PrimitiveExecutionContext (from mission config + step metadata)
    ↓
GlossaryCandidateExtractionMiddleware (pre-step)
    ↓ emits: TermCandidateObserved (for each extracted term)
SemanticCheckMiddleware (pre-generation)
    ↓ emits: SemanticCheckEvaluated (with findings)
GenerationGateMiddleware (block on unresolved high severity)
    ↓ emits: GenerationBlockedBySemanticConflict (if blocked)
ClarificationMiddleware (interactive or defer async)
    ↓ emits: GlossaryClarificationRequested, GlossaryClarificationResolved
ResumeMiddleware (checkpoint continue after resolution)
    ↓ emits: GlossarySenseUpdated (if custom sense provided)

Alternatives Rejected:

(A) Decorator pattern: Brittle for config-driven primitives, dynamic loading issues
(C) Base class hooks: Forces class model that mission system doesn't require
(D) Pure event-driven: Cannot guarantee inline block/resume semantics (async subscribers can't block)

Implementation Note: Middleware attaches to primitive execution via metadata flag: glossary_check: enabled in mission.yaml step definitions.

Finding 2: Event Contracts (A + C: Reference Feature 007)

Question: Where are canonical glossary event schemas defined?

Research:

Feature 007 spec reviewed: /Users/robert/ClaudeCowork/Spec-Kitty-Cowork/spec-kitty-planning/product-ideas/prd-mission-glossary-semantic-integrity-v1.md
Canonical events defined in Section 10 (Domain events)
Events live in spec-kitty-events package (private Git dependency per ADR-11)

Decision: A + C - CLI references/imports events from spec-kitty-events package (not redefining)

Canonical Event Schemas (from Feature 007):

1. GlossaryScopeActivated:

Trigger: Mission starts or scope selected
Required: scope_id, glossary_version_id

2. TermCandidateObserved:

Trigger: New/uncertain term appears in input
Required: term, source_step, actor_id, confidence

3. SemanticCheckEvaluated:

Trigger: Step-level pre-generation validation
Required: severity, confidence, conflict_list, recommended_action
Consumers: Execution gate, UX prompt

4. GlossaryClarificationRequested:

Trigger: Policy requires user clarification
Required: question, term, options, urgency

5. GlossaryClarificationResolved:

Trigger: User/participant answer accepted
Required: selected/entered meaning, actor_id

6. GlossarySenseUpdated:

Trigger: New sense or sense edit accepted
Required: before/after, reason, actor_id

7. GenerationBlockedBySemanticConflict:

Trigger: High-severity unresolved conflict at generation boundary
Required: step_id, conflicts, blocking_policy_mode

Rationale:

Feature 007 is authoritative source for glossary events
Prevents contract drift between CLI and SaaS (both use same events package)
Enables deterministic replay (events are append-only, immutable)

Alternatives Rejected:

(B) Define schemas inline in CLI: Causes contract drift, breaks SaaS integration
(D) Stub contracts temporarily: Defers integration, blocks replay/audit features

Implementation Note: Import events from spec_kitty_events.glossary.events module. If Feature 007 package not yet published, stub adapter boundaries and gate implementation on package availability.

Finding 3: Term Extraction (D: Metadata hints + deterministic heuristics)

Question: How do we extract candidate terms from step inputs/outputs?

Research:

NLP libraries (spaCy, NLTK): High accuracy but adds dependency weight (~100MB models)
LLM extraction: Best quality but adds latency (200-500ms) and cost
Pattern-based heuristics: Fast (< 10ms) but lower recall without context

Decision: D: Hybrid - Metadata hints (highest confidence) + Deterministic heuristics + Scope-aware normalization

S2 Extraction Stack:

1. Metadata hints (highest confidence):

glossary_watch_terms: Explicit terms to track (e.g., ["workspace", "mission", "primitive"])
glossary_aliases: Known synonyms (e.g., {"WP": "work package"})
glossary_exclude_terms: Common words to ignore (e.g., ["the", "and", "it"])
glossary_fields: Which input/output fields to scan (e.g., ["description", "requirements"])

2. Deterministic heuristics:

Quoted phrases: "workspace" → extract as term
Acronyms: WP, LLM, CLI → extract (uppercase 2-5 chars)
Casing patterns: snake_case, camelCase, kebab-case → extract
Repeated noun-like phrases: term appears 3+ times → extract
Existing glossary matches: term already in any scope → extract

3. Scope-aware normalization:

Lowercase + trim whitespace
Stem-light: workspace/workspaces → workspace (plural → singular)
Resolve against scope order: mission_local → team_domain → audience_domain → spec_kitty_core

4. Confidence scoring:

High: metadata hint, existing glossary match
Medium: quoted phrase, acronym, casing pattern
Low: repeated noun-like phrase (weak heuristic)

5. Escalation policy:

High-severity + low-confidence critical term → immediate clarification (blocks generation)
All other conflicts → auto-add as draft, continue execution (warn only)

Rationale:

Keeps extraction automatic and mostly invisible (no LLM latency)
Better precision than raw heuristics via mission metadata
No heavy NLP dependencies (stays lightweight)
Opt-in enrichment path: async LLM pass can enhance drafts later (not in hot path)

Alternatives Rejected:

(A) NLP-based (spaCy/NLTK): Adds 100MB+ dependency, slower (50-100ms), overkill for S2
(B) LLM-based extraction: High quality but 200-500ms latency + cost in hot path
(C) Pattern-based only: Low precision without metadata context, high false-positive rate

Implementation Note: Package in src/specify_cli/glossary/extraction.py. Heuristics are pluggable (can add NLP/LLM extractors later without breaking existing code).

Finding 4: Checkpoint/Resume (A: Lightweight event sourcing)

Question: How do we save/restore step execution state for resume after conflict resolution?

Research:

Feature 007 invariant #6: "Mission run state is reconstructable from event stream"
Event sourcing: State is derived from event log, not separate files
Checkpoint granularity: Generation boundary only (not every statement)

Decision: A: Lightweight event sourcing - Emit StepCheckpointed event before generation gate

Checkpoint Payload (minimal deterministic resume context):

StepCheckpointed(
    mission_id: str,           # Which mission
    run_id: str,               # Which run instance
    step_id: str,              # Which step
    strictness: Strictness,    # Resolved strictness mode (off/medium/max)
    scope_refs: List[ScopeRef],  # Active glossary scope versions
    input_hash: str,           # SHA256 of step inputs (detect context changes)
    cursor: str,               # Execution stage: "pre_generation_gate"
    retry_token: str,          # Unique token for this checkpoint (UUID)
    timestamp: datetime,       # When checkpoint created
)

Resume Flow:

1. User resolves conflict (selects sense or provides custom definition) 2. Emit GlossaryClarificationResolved or GlossarySenseUpdated event 3. Load StepCheckpointed event from log (latest for this step_id) 4. Verify input_hash matches current inputs (detect context changes)

5. Re-run generation gate with updated glossary state 6. If pass: proceed to generation; If fail: clarification loop continues

If changed: prompt user for confirmation before resuming
If unchanged: resume from cursor ("pre_generation_gate")

Cross-session resume:

State persists in event log (not in-memory)
User can close CLI, resolve conflict in SaaS, reopen CLI
Resume loads checkpoint from events, continues execution

Rationale:

Event log is source of truth (Feature 007 requirement)
Minimal payload (only IDs + refs, no full step state)
Supports async defer + cross-session resume
Deterministic replay (same events → same state)

Alternatives Rejected:

(B) Filesystem state file: Conflicts with "no side-channel state" invariant
(C) In-memory cache: Fails async defer + cross-session resume (state lost on CLI exit)
(D) Re-run with inputs: Fragile unless all steps fully idempotent and cheap (not safe default)

Implementation Note: Checkpoint only at generation boundary (not every middleware layer). Resume is opt-in (only if user resolves conflict, not automatic retry).

Finding 5: Interactive Prompts (B: Typer prompts + Rich formatting)

Question: How should CLI implement interactive clarification prompts?

Research:

Existing spec-kitty patterns: typer.confirm/prompt used in upgrade.py, orchestrate.py
Rich already used for console output (tables, progress bars, colors)
Questionary: Nice UX but adds new dependency

Decision: B: Typer prompts for input, Rich for formatting/output (no new dependencies)

Recommended Flow:

1. Sort conflicts by severity (high → medium → low), cap to 3 max

2. Render each conflict with Rich:

``` 🔴 High-severity conflict: "workspace"

Term: workspace Context: "The workspace contains the implementation files" Scope: mission_local (no match), team_domain (2 matches)

Candidate senses: 1. [team_domain] Git worktree directory for a work package (confidence: 0.9) 2. [team_domain] VS Code workspace configuration file (confidence: 0.7) ```

3. Prompt with typer.prompt():

``python choice = typer.prompt( "Select: 1-2 (candidate), C (custom sense), D (defer to async)", type=str ) ``

4. Handle choice:

1-N: Select candidate sense → emit GlossaryClarificationResolved
C: Prompt for custom sense text → emit GlossarySenseUpdated
D: Defer to async → emit GlossaryClarificationRequested + exit with blocked status

5. Resume confirmation (if context changed):

``python proceed = typer.confirm( "Context may have changed since conflict. Proceed with resolution?" ) ``

Non-interactive mode:

Auto-defer all conflicts
Emit GlossaryClarificationRequested for all high-severity conflicts
Keep generation blocked (exit with error code)

Rationale:

Matches existing CLI interaction patterns (consistency)
No new dependencies (typer, rich already in spec-kitty)
Enough capability for ranked options + custom input + defer flow
Simple to test (mock typer.prompt, assert Rich output)

Alternatives Rejected:

(A) Rich + questionary: Adds new dependency, overkill for 1-3 questions
(C) Custom Rich prompts: Reinventing wheel (typer.prompt works fine)
(D) AskUserQuestion tool: Unclear if exists in codebase, would need research

Implementation Note: Package in src/specify_cli/glossary/clarification.py. Use Rich tables for candidate rendering, typer.prompt for input, typer.confirm for resume confirmation.

Summary of Key Decisions

Decision Area	Choice	Rationale
Architecture	B + D Hybrid (middleware + events)	Synchronous gate control, config-driven primitives, event observability
Event Contracts	A + C (reference Feature 007)	Prevent contract drift, enable SaaS integration, deterministic replay
Term Extraction	D (metadata hints + heuristics)	Automatic, fast (< 100ms), no LLM latency, good precision with metadata
Checkpoint/Resume	A (lightweight event sourcing)	Event log is source of truth, supports cross-session, minimal payload
Interactive Prompts	B (Typer + Rich)	Matches existing patterns, no new dependencies, simple to test

All decisions align with Sprint S2 requirements: automatic glossary capture, mostly invisible process, low-friction clarifications, deterministic replay.

Next Steps

Proceed to Phase 1: Design & Contracts

Generate data-model.md (entity definitions)
Generate contracts/events.md (canonical event schemas)
Generate contracts/middleware.md (interface definitions)
Generate quickstart.md (developer setup)