Glossary Seed File Schema Validation

Purpose

Glossary seed files (.kittify/glossaries/*.yaml) are the authoritative source of domain terminology for Spec Kitty projects. Today these files are loaded as loose YAML with minimal structural checks. A single malformed term — such as a non-normalized surface value — causes the entire scope to fail at load time. Dashboard handlers silently catch the error and report zero glossary terms, hiding the root cause from operators.

This mission adds schema-backed validation with DDD aggregate modeling so that invalid glossary state is rejected with actionable errors at every boundary where seed files are created, edited, loaded, or checked in CI.

Source

  • GitHub Issue: #1322
  • Observed failure: ValueError: TermSurface must be normalized: Sonar quality gate (PR #1321)

Actors

ActorRole
Mission authorCreates and edits glossary seed files during specify/plan phases
Spec Kitty CLILoads seed files at runtime for glossary resolution and dashboard rendering
CI pipelineValidates committed glossary seed files on every push
Dashboard consumerReads glossary data via /api/glossary-health and /api/glossary-terms endpoints

User Scenarios & Testing

Scenario 1: Author introduces invalid term

A mission author adds a term with a non-normalized surface (Sonar quality gate) to spec_kitty_core.yaml. When they run spec-kitty glossary validate, the CLI reports the exact file, term index, field, and the normalization rule violated. The author fixes the surface to sonar quality gate and re-validates successfully.

Scenario 2: CI catches invalid seed file

A contributor pushes a commit that modifies .kittify/glossaries/team_domain.yaml with a missing definition field. CI runs glossary validation and fails with an actionable error message naming the file, the offending term surface, and the missing field. The contributor fixes the file before merging.

Scenario 3: Dashboard surfaces validation failure

An operator starts the dashboard with an invalid seed file present. Instead of silently showing zero terms, the glossary health endpoint reports a validation error with the specific file and failure reason. The operator sees what needs fixing rather than a misleading empty state.

Scenario 4: Runtime load rejects invalid state

load_seed_file() validates the entire seed file against the schema before constructing domain objects. If any term violates an invariant, the entire scope is rejected with a structured validation error that names the file, term, field, and rule. No partial loads occur.

Scenario 5: Save path enforces invariants

When code writes a glossary seed file via save_seed_file(), the same validation runs before persisting. Invalid state cannot be written to disk through the programmatic API.

Functional Requirements

IDRequirementStatus
FR-001The system validates that the root document of a glossary seed file is a YAML mapping containing a terms keyProposed
FR-002The system validates that each term has a surface field whose value equals surface.lower().strip() (normalized)Proposed
FR-003The system validates that each term has a non-empty definition fieldProposed
FR-004The system validates that each term's confidence field, when present, is a number in the range 0.0 to 1.0 inclusiveProposed
FR-005The system validates that each term's status field, when present, is one of: active, draft, deprecatedProposed
FR-006The system validates that scope seed filenames map to known GlossaryScope values: mission_local, team_domain, audience_domain, spec_kitty_coreProposed
FR-007Validation errors include the file path, term index or surface, field name, and a human-readable message describing the violated ruleProposed
FR-008When any term in a seed file fails validation, the entire scope is rejected — no partial loadsProposed
FR-009A CLI command spec-kitty glossary validate <path> validates a single seed file when given a file pathProposed
FR-010A CLI command spec-kitty glossary validate <path> validates all *.yaml files under a directory when given a directory pathProposed
FR-011The CLI validate command exits with a non-zero status code when validation failsProposed
FR-012load_seed_file() validates the seed file before constructing TermSurface objects and raises structured validation errors instead of raw ValueErrorProposed
FR-013save_seed_file() validates the seed data before writing to diskProposed
FR-014Dashboard glossary endpoints surface validation failure details instead of silently returning empty termsProposed
FR-015The system enforces an explicit unknown-fields policy for glossary seed files (fail-closed on unrecognized fields unless an allowlist is defined)Proposed

Non-Functional Requirements

IDRequirementThresholdStatus
NFR-001Validation of a seed file with 100 terms completes in under 500ms< 500ms for 100 termsProposed
NFR-002Validation error messages are actionable without consulting documentation — they name the file, term, field, and rule100% of error messages include file + term + field + ruleProposed
NFR-003New validation code has 90%+ test coverage≥ 90% line coverageProposed

Constraints

IDConstraintStatus
C-001Validation models must follow the existing doctrine schema pattern (schemas under source, models as source of truth)Proposed
C-002The GlossarySeedFile aggregate root and GlossaryTerm child entity must be expressed as domain models with explicit invariants, not ad-hoc dictionary checksProposed
C-003CI validation must cover all files matching .kittify/glossaries/*/.yamlProposed
C-004Changes must not break existing valid seed files — the new validation must accept all currently valid glossary YAML structuresProposed

Domain Language

Canonical TermDefinitionAvoid
glossary seed fileA YAML file under .kittify/glossaries/ that defines canonical terms for a scope"glossary file", "term file", "seed data"
surfaceThe normalized lowercase string form of a term"term name", "label"
scopeOne of the four glossary resolution levels (mission_local, team_domain, audience_domain, spec_kitty_core)"context", "namespace"
aggregate rootGlossarySeedFile — the consistency boundary for a seed file"document", "container"

Assumptions

1. The existing validate_seed_file() function in scope.py can be replaced or extended rather than preserved as a separate code path. 2. The unknown-fields policy will be fail-closed (reject unknown fields) unless existing seed files in the wild contain metadata fields that must be allowed. 3. The confidence and status fields remain optional in seed files (defaulting to draft/0.5 when absent), but when present must pass validation. 4. The immediate bad data (surface: Sonar quality gate) will be fixed as part of this mission's first work package or as a prerequisite commit.

Success Criteria

1. surface: Sonar quality gate fails validation with an actionable error naming the file, term, field, and normalization rule. 2. surface: sonar quality gate passes validation for spec_kitty_core.yaml. 3. spec-kitty glossary validate .kittify/glossaries/spec_kitty_core.yaml exits non-zero on invalid YAML and zero on valid YAML. 4. spec-kitty glossary validate .kittify/glossaries/ validates all scope seed files in the directory. 5. Dashboard glossary endpoints no longer silently report zero terms when a seed file contains validation errors. 6. CI fails when committed glossary seed files contain invalid data. 7. All validation errors include file path, term identifier, field name, and human-readable rule description.

Dependencies

  • Existing glossary subsystem: src/specify_cli/glossary/ (scope.py, models.py)
  • Existing doctrine schema pattern: src/doctrine/schemas/ (for architectural alignment)
  • Dashboard handlers: src/specify_cli/dashboard/handlers/glossary.py
  • CLI glossary commands: src/specify_cli/cli/

Risks

RiskLikelihoodImpactMitigation
Existing valid seed files in user projects contain fields not covered by the schemaMediumHigh — false validation failures on upgradeAudit existing seed file structures before finalizing the unknown-fields policy; consider an allowlist for known metadata fields
Dashboard error surfacing changes the API contract for consumersLowMedium — downstream tools may parse the response differentlyEnsure error responses use a distinct field or status code so existing consumers degrade gracefully
Replacing validate_seed_file() breaks callers that catch the current ValueError shapeLowLow — internal API, few callersGrep all callers before changing the exception type; maintain backward-compatible error message format if needed