Feature Specification: Glossary Semantic Integrity Runtime for Mission Framework
Feature Branch: 041-mission-glossary-semantic-integrity
Created: 2026-02-16
Status: Draft
Input: User description: Sprint S2 kickoff prompt for glossary semantic integrity runtime behavior in spec-kitty 2.x
User Scenarios & Testing
User Story 1 - Mission Author Enables Glossary Checks (Priority: P1)
A mission author wants to ensure semantic consistency in their custom mission by adding glossary checks to critical steps (e.g., specification generation, documentation writing). They add check metadata to step definitions in their mission.yaml, and the runtime automatically enforces glossary resolution before LLM generation.
Why this priority: This is the foundational capability - without metadata-driven check attachment, the entire glossary enforcement system cannot function. It's the entry point for all semantic integrity workflows.
Independent Test: Can be fully tested by creating a minimal mission with one step that has glossary_check metadata enabled, running the step, and verifying that semantic check events are emitted before generation.
Acceptance Scenarios:
1. Given a mission step has no glossary check metadata and strictness is medium or max, When the step executes, Then the runtime extracts candidate terms from step inputs and emits a SemanticCheckEvaluated event (checks enabled by default)
2. Given a mission step has glossary_check: disabled metadata, When the step executes, Then no semantic checks are performed and generation proceeds normally
3. Given a mission has default glossary_check: enabled in mission config, When a step inherits this default, Then glossary checks run for that step
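The scenarios above can be read as a small decision function. The sketch below is illustrative only and is not the runtime's implementation: the name checks_enabled and the dict-shaped step and mission config are assumptions, and only the glossary_check key with its enabled/disabled values comes from this spec. It also assumes that an explicit step-level opt-in still applies in off mode (see FR-003), while an inherited mission default does not.

```python
def checks_enabled(step_meta: dict, mission_config: dict, strictness: str) -> bool:
    """Decide whether glossary checks run for one step (hypothetical helper)."""
    step_setting = step_meta.get("glossary_check")
    if step_setting == "disabled":
        return False  # scenario 2: explicit opt-out on the step
    if step_setting == "enabled":
        return True   # explicit step-level opt-in, even in off mode (assumption from FR-003)
    if mission_config.get("glossary_check") == "disabled":
        return False  # assumption: a mission-level default can also opt out
    # No step metadata: checks are on by default unless strictness is off.
    return strictness != "off"


print(checks_enabled({}, {}, "medium"))                               # scenario 1
print(checks_enabled({"glossary_check": "disabled"}, {}, "max"))      # scenario 2
print(checks_enabled({}, {"glossary_check": "enabled"}, "medium"))    # scenario 3
```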
User Story 2 - Developer Resolves High-Severity Semantic Conflict Interactively (Priority: P2)
A developer runs a mission step (e.g., /spec-kitty.plan) that triggers a high-severity semantic conflict (e.g., term "workspace" is ambiguous between "git worktree directory" and "VS Code workspace"). The CLI blocks generation, shows up to 3 ranked candidate senses, and prompts the developer to pick one or provide a custom definition. After resolution, the step resumes from the checkpoint.
Why this priority: This is the primary user-facing workflow for resolving conflicts. Without interactive resolution, users would be blocked with no clear path forward. It delivers immediate value once P1 is implemented.
Independent Test: Can be tested by creating a test scenario with a known ambiguous term, triggering a mission step that uses it, and verifying that (1) generation is blocked, (2) ranked candidates are shown, (3) user selection updates the glossary, and (4) step resumes successfully.
Acceptance Scenarios:
1. Given a mission step encounters an unresolved high-severity conflict, When the runtime reaches the generation gate, Then generation is blocked and an interactive prompt shows 1-3 ranked candidate senses
2. Given the user is prompted for clarification, When they select a candidate sense from the list, Then the glossary is updated with the selected sense and the step resumes from the checkpoint
3. Given the user is prompted for clarification, When they provide a custom sense definition, Then the new sense is recorded with provenance (actor, timestamp) and the step resumes
4. Given the user is prompted for clarification, When they choose to defer resolution, Then the conflict is logged to the event stream, generation remains blocked, and the step exits with a clear error message
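One possible way to produce the ranked candidate list shown in the prompt is sketched below. It assumes the ordering described in FR-017 (scope precedence first, then confidence) and caps the list at 3 entries. The Candidate class and rank_candidates function are hypothetical names used only for illustration, and the confidence values are made up.

```python
from dataclasses import dataclass

# Resolution order taken from FR-001; earlier scopes rank first.
SCOPE_ORDER = ["mission_local", "team_domain", "audience_domain", "spec_kitty_core"]


@dataclass(frozen=True)
class Candidate:
    scope: str
    definition: str
    confidence: float


def rank_candidates(candidates, limit=3):
    """Order by scope precedence, then by descending confidence; keep at most `limit`."""
    ordered = sorted(
        candidates,
        key=lambda c: (SCOPE_ORDER.index(c.scope), -c.confidence),
    )
    return ordered[:limit]


senses = [
    Candidate("spec_kitty_core", "generic working area", 0.9),
    Candidate("team_domain", "VS Code workspace", 0.5),
    Candidate("team_domain", "git worktree directory", 0.8),
    Candidate("mission_local", "mission scratch directory", 0.4),
]
for rank, sense in enumerate(rank_candidates(senses), start=1):
    print(rank, sense.scope, sense.definition)
```

In this sketch the spec_kitty_core sense is dropped from the prompt despite its high confidence, because scope precedence is applied before confidence.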
User Story 3 - Team Bootstraps Domain Glossary with Seed File (Priority: P3)
A team working on a domain-specific project (e.g., healthcare) wants to establish shared terminology upfront. They create a team_domain.yaml seed file with key terms (e.g., "patient", "encounter", "claim"), place it in .kittify/glossaries/, and the runtime uses it during scope resolution while also auto-populating additional terms from mission I/O.
Why this priority: This improves the user experience by reducing initial conflict noise, but it's optional - the system works without seed files via auto-population. It's a nice-to-have for teams with established vocabularies.
Independent Test: Can be tested by creating a seed file with 3-5 terms, running a mission step that uses those terms, and verifying that (1) seed terms are loaded into team_domain scope, (2) scope resolution finds them, and (3) no conflicts are raised for those terms.
Acceptance Scenarios:
1. Given a team_domain.yaml seed file exists in .kittify/glossaries/, When a mission initializes, Then the runtime loads seed terms into the team_domain scope
2. Given a mission step uses a term defined in the team_domain seed file, When scope resolution runs, Then the term resolves to the team_domain sense without conflict
3. Given no seed file exists for team_domain, When scope resolution runs, Then the runtime cleanly skips team_domain and continues to audience_domain and spec_kitty_core
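The scope resolution behavior in these scenarios can be sketched as a first-match lookup over the scope order. This is a simplified illustration under stated assumptions: glossaries are modeled as plain in-memory dicts (seed file parsing is left out), the function name resolve_term is hypothetical, and the sample definitions are invented.

```python
from typing import Dict, Optional, Tuple

# Resolution order taken from FR-001.
SCOPE_ORDER = ["mission_local", "team_domain", "audience_domain", "spec_kitty_core"]


def resolve_term(term: str, glossaries: Dict[str, Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (scope, definition) from the first scope that defines `term`, else None."""
    for scope in SCOPE_ORDER:
        senses = glossaries.get(scope)
        if not senses:
            continue  # scope not configured (e.g. no seed file): skip without error
        if term in senses:
            return scope, senses[term]
    return None


# team_domain is absent here, mirroring scenario 3.
glossaries = {
    "audience_domain": {"patient": "person receiving care"},
    "spec_kitty_core": {"mission": "a configured workflow"},
}
print(resolve_term("patient", glossaries))
print(resolve_term("encounter", glossaries))
```

A None result corresponds to an unknown term (conflict type A in FR-002); how the runtime reacts to it depends on strictness.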
User Story 4 - Mission Replay Reproduces Glossary Evolution (Priority: P4)
A developer wants to reproduce a mission execution from a week ago to debug a regression. They run spec-kitty replay <feature>, and the event log deterministically recreates the glossary state at each step, including conflict resolutions and generation gate decisions.
Why this priority: This is critical for debugging and audit trails, but it depends on P1-P3 being complete. It's a quality-of-life improvement that ensures determinism, not a blocking requirement for initial rollout.
Independent Test: Can be tested by running a mission, recording glossary events (term extractions, conflict resolutions), then replaying the mission and verifying that the same glossary state and generation gate outcomes are reproduced.
Acceptance Scenarios:
1. Given a mission has been executed with glossary checks enabled, When the mission is replayed from the event log, Then the glossary state at each step matches the original execution
2. Given a conflict was resolved during the original execution, When the mission is replayed, Then the same conflict resolution is applied and the same generation gate outcome occurs
3. Given a term was added to the glossary via auto-population, When the mission is replayed, Then the term appears in the glossary at the same step
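Deterministic replay can be pictured as a pure fold over the ordered event log. The sketch below is a minimal illustration, not the feature 007 contract: events are modeled as flat dicts with a hypothetical "type" field, senses are reduced to plain definition strings, and the function name replay_glossary is an assumption.

```python
def replay_glossary(events):
    """Fold an ordered event log into one glossary snapshot per event."""
    state = {}      # term_surface -> active definition
    snapshots = []
    for event in events:
        kind = event["type"]
        if kind == "GlossarySenseUpdated":
            state[event["term_surface"]] = event["new_sense"]
        elif kind == "GlossaryClarificationResolved":
            state[event["term_surface"]] = event["selected_sense"]
        # Other event types do not change glossary state in this sketch.
        snapshots.append(dict(state))  # copy so later events cannot alter this snapshot
    return snapshots


log = [
    {"type": "GlossarySenseUpdated", "term_surface": "workspace",
     "new_sense": "VS Code workspace"},
    {"type": "SemanticCheckEvaluated"},
    {"type": "GlossaryClarificationResolved", "term_surface": "workspace",
     "selected_sense": "git worktree directory"},
]
for snapshot in replay_glossary(log):
    print(snapshot)
```

Because the function reads only the event list and never the filesystem, replaying the same log yields the same sequence of snapshots.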
User Story 5 - Operator Adjusts Strictness Mode for Different Environments (Priority: P5)
An operator wants to run missions with different enforcement levels in different environments. They set strictness: off for local development (fast iteration, no blocking), strictness: medium for CI (warn broadly, block only high-severity), and strictness: max for production (block any unresolved conflict). They can override at runtime via --strictness flag.
Why this priority: This is a configurability feature that provides flexibility for different workflows. It's valuable for advanced users but not essential for initial adoption - the default medium mode works for most cases.
Independent Test: Can be tested by running the same mission step with --strictness off, --strictness medium, and --strictness max, and verifying that (1) off skips all checks, (2) medium warns but only blocks high-severity, and (3) max blocks any unresolved conflict.
Acceptance Scenarios:
1. Given global strictness is set to off, When a mission step runs, Then no glossary checks are performed, no SemanticCheckEvaluated events are emitted, and generation proceeds without blocking
2. Given mission strictness is set to medium, When a low-severity conflict occurs, Then a warning is logged but generation is not blocked
3. Given mission strictness is set to medium, When a high-severity conflict occurs, Then generation is blocked until the conflict is resolved
4. Given runtime strictness override is max, When any unresolved conflict occurs, Then generation is blocked regardless of severity
5. Given strictness precedence (global → mission → step → runtime), When multiple levels are set, Then the most specific level takes precedence
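The precedence rule and the three modes can be sketched as two small functions. The names effective_strictness and should_block are hypothetical, and levels that are not set are modeled as None. The global default of medium follows the rationale in User Story 5.

```python
from typing import List, Optional


def effective_strictness(
    global_default: str = "medium",
    mission: Optional[str] = None,
    step: Optional[str] = None,
    runtime: Optional[str] = None,
) -> str:
    """Most specific level that is set wins: runtime > step > mission > global."""
    for level in (runtime, step, mission):
        if level is not None:
            return level
    return global_default


def should_block(strictness: str, unresolved_severities: List[str]) -> bool:
    """Apply the generation gate for a list of unresolved conflict severities."""
    if strictness == "off":
        return False                            # no enforcement
    if strictness == "max":
        return len(unresolved_severities) > 0   # any unresolved conflict blocks
    return "high" in unresolved_severities      # medium: only high severity blocks


mode = effective_strictness(mission="medium", runtime="max")
print(mode, should_block(mode, ["low"]))
```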
Edge Cases
- Scope resolution with missing scopes: What happens when team_domain.yaml does not exist? → Resolver skips team_domain cleanly and continues to audience_domain/spec_kitty_core.
- Conflicting sense updates from multiple participants: What happens when two users propose different senses for the same term simultaneously? → System records both proposals with provenance; mission owner acts as tie-breaker if unresolved tie remains.
- Context changes during async resolution: What happens when a user defers conflict resolution, makes code changes, then resolves the conflict? → System requests confirmation before resuming ("Context may have changed. Proceed with resolution?").
- Replay with manual glossary edits: What happens when a user manually edits a glossary file between original execution and replay? → Replay uses event log as source of truth, not filesystem state; manual edits are ignored during replay.
- Nested mission step with inherited strictness: What happens when a mission calls another mission with different strictness settings? → Child mission inherits parent's strictness unless explicitly overridden.
- LLM-hallucinated term not in any scope: What happens when the LLM generates output using a term that doesn't exist in any scope and has low confidence? → Classified as an "unresolved critical term" (type D conflict), severity set to high, generation blocked in medium/max modes.
Requirements
Functional Requirements
- FR-001: System MUST resolve terms against scope hierarchy in order: mission_local → team_domain → audience_domain → spec_kitty_core
- FR-002: System MUST detect 4 conflict types: (A) unknown terms (no match in scope stack), (B) ambiguous terms (multiple active senses, unqualified usage), (C) inconsistent usage (LLM output contradicts active glossary), (D) unresolved critical terms (low confidence, no resolved sense before generation)
- FR-003: System MUST emit SemanticCheckEvaluated events when glossary checks run (i.e., when strictness is medium or max, or when a primitive explicitly enables checks despite off mode), including severity (low/medium/high), confidence (0.0-1.0), effective_strictness, recommended_action, and findings (list of conflicts per term), conforming to feature 007 event contracts
- FR-004: System MUST block LLM generation on unresolved high-severity conflicts in medium and max strictness modes
- FR-005: System MUST support 3 strictness modes: off (no enforcement), medium (warn broadly, block high-severity only), max (block any unresolved conflict)
- FR-006: System MUST apply strictness precedence: global defaults → mission defaults → primitive/step metadata → runtime override
- FR-007: System MUST extract candidate terms from mission step inputs and outputs using metadata-driven extraction rules
- FR-008: System MUST show interactive clarification prompts with at most 3 questions per burst, prioritized by severity (high → medium → low)
- FR-009: System MUST allow users to defer conflict resolution to async mode, logging the conflict to the event stream while keeping generation blocked
- FR-010: System MUST resume mission step execution from checkpoint after conflict resolution, without requiring full re-run
- FR-011: System MUST store all glossary state (terms, senses, resolutions) in the event log using existing event architecture (no side-channel state files)
- FR-012: System MUST support optional seed files (team_domain.yaml, audience_domain.yaml) placed in .kittify/glossaries/
- FR-013: System MUST skip unconfigured scopes cleanly during resolution (e.g., if team_domain.yaml does not exist, resolver continues to next scope without error)
- FR-014: System MUST record custom sense definitions with provenance metadata (actor, timestamp, source: "user_clarification")
- FR-015: System MUST attach glossary checks to mission primitives via metadata in mission config files (e.g., glossary_check: enabled in step definition)
- FR-016: System MUST emit GenerationBlockedBySemanticConflict event when generation gate blocks due to unresolved high-severity conflict
- FR-017: System MUST present ranked candidate senses during clarification (ordered by scope precedence, then by confidence/frequency)
- FR-018: System MUST allow free-text custom sense input during clarification (not limited to pre-defined candidates)
- FR-019: System MUST request user confirmation before resuming if context has changed materially during async conflict resolution
- FR-020: System MUST enable glossary checks by default for all mission primitives unless (a) global/mission/step strictness is explicitly set to off, or (b) primitive metadata explicitly disables checks via glossary_check: disabled
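As an illustration of FR-008, the selection of which conflicts to ask about in one prompt burst could look like the sketch below. Conflicts are modeled as plain dicts with term and severity keys, and select_prompts is a hypothetical name. Python's sorted is stable, so conflicts of equal severity keep their original order.

```python
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


def select_prompts(conflicts, max_questions=3):
    """Pick the conflicts to ask about, highest severity first, capped at max_questions."""
    ordered = sorted(conflicts, key=lambda c: SEVERITY_RANK[c["severity"]])
    return ordered[:max_questions]


pending = [
    {"term": "step", "severity": "low"},
    {"term": "workspace", "severity": "high"},
    {"term": "mission", "severity": "medium"},
    {"term": "claim", "severity": "high"},
]
print([c["term"] for c in select_prompts(pending)])
```

What happens to conflicts left out of a burst (for example, whether they are deferred or asked in a later burst) is not specified here and is left open in this sketch.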
Key Entities
- TermSurface: Raw string representing a term (e.g., "workspace", "mission", "step")
- Attributes: surface_text (string)
- TermSense: Meaning of a term within a specific scope
- Attributes: surface (TermSurface), scope (GlossaryScope), definition (string), provenance (actor, timestamp, source), confidence (float 0.0-1.0), status (active/deprecated)
- GlossaryScope: Enumeration of scope levels
- Values: mission_local, team_domain, audience_domain, spec_kitty_core
- Resolution order: mission_local → team_domain → audience_domain → spec_kitty_core
- SemanticConflict: Classification of a term conflict
- Attributes: term (TermSurface), conflict_type (unknown/ambiguous/inconsistent/unresolved_critical), severity (low/medium/high), confidence (float), candidate_senses (list of TermSense), context (usage location)
- SemanticCheckEvaluated: Event emitted after semantic check runs (canonical contract from feature 007)
- Attributes: step_id, mission_id, timestamp, findings (list of SemanticConflict with detailed classification payload), overall_severity (low/medium/high), confidence (float 0.0-1.0), effective_strictness (off/medium/max), recommended_action (proceed/warn/block), blocked (boolean)
- Reference: Feature 007 glossary event contracts
- GenerationBlockedBySemanticConflict: Event emitted when generation gate blocks
- Attributes: step_id, mission_id, timestamp, conflicts (list of SemanticConflict), strictness_mode (off/medium/max), effective_strictness (off/medium/max)
- GlossaryClarificationResolved: Canonical event recording interactive/async conflict resolution (feature 007 contract)
- Attributes: conflict_id, term_surface, selected_sense (TermSense reference), actor, timestamp, resolution_mode (interactive/async), provenance
- GlossarySenseUpdated: Canonical event recording custom sense definition (feature 007 contract)
- Attributes: term_surface, scope (GlossaryScope), new_sense (TermSense with definition), actor, timestamp, update_type (create/update), provenance
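A few of the entities above could be modeled as follows. This is a sketch under stated assumptions rather than the canonical feature 007 contract: the Provenance class, the use of frozen dataclasses, the plain-string surface field, and the confidence range check are illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class GlossaryScope(Enum):
    # Declared in resolution order, so iterating the enum yields that order.
    MISSION_LOCAL = "mission_local"
    TEAM_DOMAIN = "team_domain"
    AUDIENCE_DOMAIN = "audience_domain"
    SPEC_KITTY_CORE = "spec_kitty_core"


@dataclass(frozen=True)
class Provenance:
    actor: str
    timestamp: datetime
    source: str  # e.g. "user_clarification" (FR-014)


@dataclass(frozen=True)
class TermSense:
    surface: str  # TermSurface reduced to its surface_text for brevity
    scope: GlossaryScope
    definition: str
    provenance: Provenance
    confidence: float
    status: str = "active"  # active or deprecated

    def __post_init__(self):
        # Reject confidence values outside the 0.0-1.0 range given in the spec.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


sense = TermSense(
    surface="workspace",
    scope=GlossaryScope.TEAM_DOMAIN,
    definition="git worktree directory",
    provenance=Provenance("alice", datetime(2026, 2, 16, tzinfo=timezone.utc), "user_clarification"),
    confidence=0.8,
)
print(sense.scope.value, sense.status)
```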
Success Criteria
Measurable Outcomes
- SC-001: Mission authors can enable glossary checks for any custom primitive by adding metadata to mission config (no code changes required)
- SC-002: Unresolved high-severity conflicts prevent LLM generation in medium and max strictness modes (100% enforcement)
- SC-003: Developers can resolve semantic conflicts interactively in under 2 minutes (measured from prompt display to resolution commit)
- SC-004: Mission replay reproduces identical glossary evolution and generation gate outcomes (deterministic event log replay)
- SC-005: off strictness mode allows mission execution without any glossary enforcement or blocking (0 generation blocks)
- SC-006: Scope resolution gracefully skips unconfigured team_domain and audience_domain scopes (no errors, continues to next scope)
- SC-007: 90% of semantic conflicts are auto-resolvable without user intervention (single candidate sense found in scope hierarchy)
- SC-008: Clarification prompt bursts are limited to 3 questions maximum (prevents user fatigue)
- SC-009: Async conflict resolution state persists across CLI sessions (users can defer and return later)
- SC-010: Custom sense definitions submitted by users are recorded with full provenance (actor, timestamp, source tracked in event log)