Research: Glossary Seed File Schema Validation
R1: Existing Validation Gap Analysis
Decision: The current validate_seed_file() is insufficient. It checks only that the terms key is present and that each term has surface and definition fields. It does not check surface normalization, confidence range, status enum membership, or unknown fields.
Rationale: The function at src/specify_cli/glossary/scope.py:64 performs structural checks only. Domain invariants (surface normalization, confidence bounds, status values) are enforced later during TermSurface/TermSense construction, causing fail-at-construction errors that are hard to diagnose.
Alternatives considered:
- Extend existing validate_seed_file() with manual checks → rejected: would duplicate Pydantic's validation capabilities and not produce structured errors
- Replace dataclass validation entirely with Pydantic → rejected: existing TermSurface/TermSense are the runtime domain model used throughout 20+ modules; replacing them is out of scope
R2: Doctrine Schema Pattern Analysis
Decision: Follow the doctrine Pydantic pattern — ConfigDict(frozen=True, extra="forbid"), field-level validators, model_validator for cross-field rules.
Rationale: src/doctrine/directives/models.py demonstrates the established pattern:
- Directive(BaseModel) with ConfigDict(frozen=True, extra="forbid", populate_by_name=True)
- Required fields with Field(pattern=...) constraints
- Optional fields with Field(default_factory=list)
- @model_validator(mode="after") for cross-field invariants
- Separate validation.py module for YAML-level schema validation via jsonschema
For glossary seed files, Pydantic validation alone is sufficient (no separate JSON Schema needed initially) because the invariants are simple field-level checks, not complex cross-document references.
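Applied to seed files, the pattern might look like the sketch below. It assumes Pydantic v2. The class name `SeedTerm`, the status values, the defaults, and the "trimmed lowercase" normalization rule are all illustrative assumptions, not taken from the codebase.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator


class SeedTerm(BaseModel):
    """One glossary entry as it appears in a seed file (illustrative schema)."""

    model_config = ConfigDict(frozen=True, extra="forbid")

    surface: str = Field(min_length=1)
    definition: str = Field(min_length=1)
    confidence: float = Field(default=1.0, ge=0.0, le=1.0)  # assumed default and range
    # Placeholder status values; the real enum belongs to the domain model.
    status: Literal["draft", "active", "deprecated"] = "draft"

    @field_validator("surface")
    @classmethod
    def surface_is_normalized(cls, value: str) -> str:
        # Assumed normalization rule: trimmed and lowercase.
        if value != value.strip().lower():
            raise ValueError("surface must be trimmed and lowercase")
        return value


class GlossarySeedFile(BaseModel):
    """Top-level seed document: a list of terms under the 'terms' key."""

    model_config = ConfigDict(frozen=True, extra="forbid")

    terms: list[SeedTerm] = Field(default_factory=list)


seed = GlossarySeedFile.model_validate(
    {"terms": [{"surface": "work package", "definition": "A unit of planned work."}]}
)

# A JSON Schema can be derived from the model if external tools need one later.
schema = GlossarySeedFile.model_json_schema()

error_fields: list[str] = []
try:
    SeedTerm(surface=" Work Package", definition="x", confidence=2.0)
except ValidationError as exc:
    error_fields = sorted(str(err["loc"][0]) for err in exc.errors())
```

Because the schema is generated from the models, the models stay the single source of truth even if a standalone schema file is added later.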
Alternatives considered:
- Add JSON Schema + jsonschema validation (like doctrine validate_directive()) → deferred: can be added later if CI or external tools need standalone schema files; Pydantic models are the source of truth and can generate JSON Schema on demand via model_json_schema()
R3: Unknown Fields Policy
Decision: Fail-closed (extra="forbid") — unknown fields in seed files are rejected.
Rationale: Glossary seed files have a well-defined structure (surface, definition, confidence, status). Unknown fields are likely typos or unsupported extensions. Fail-closed prevents silent data loss and aligns with the DDD aggregate boundary principle.
Alternatives considered:
- extra="allow" with warning → rejected: silently tolerating unknown data undermines the aggregate integrity the spec requires
- extra="ignore" → rejected: same problem as allow, plus data loss on round-trip through save_seed_file()
Risk: If existing seed files in the wild contain metadata fields beyond the four known ones, this will be a breaking change. Mitigation: audit .kittify/glossaries/ files in this repo and document any additional fields found.
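The difference between the policies can be seen in a small sketch, assuming Pydantic v2. `StrictTerm` and `LenientTerm` are hypothetical two-field models used only to contrast the `extra` settings, and the misspelled key stands in for a typical typo.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictTerm(BaseModel):
    model_config = ConfigDict(extra="forbid")
    surface: str
    definition: str


class LenientTerm(BaseModel):
    model_config = ConfigDict(extra="ignore")
    surface: str
    definition: str


# Hypothetical seed entry containing a misspelled, unsupported key.
raw = {"surface": "mission", "definition": "A scoped goal.", "defintion_source": "wiki"}

# extra="ignore": validation passes and the unknown key silently disappears.
lenient = LenientTerm.model_validate(raw)
dropped = set(raw) - set(lenient.model_dump())

# extra="forbid": the unknown key is reported by name.
rejected: list[tuple[tuple, str]] = []
try:
    StrictTerm.model_validate(raw)
except ValidationError as exc:
    rejected = [(err["loc"], err["type"]) for err in exc.errors()]

print(dropped)
print(rejected)
```

With the lenient model, a later save would write the entry back without the extra key, which is the round-trip data loss noted above.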
R4: Error Aggregation Strategy
Decision: Collect all validation errors per file before raising, rather than failing on the first error.
Rationale: When a seed file has multiple issues (e.g., three terms with non-normalized surfaces), the operator needs to see all problems at once to fix them in one pass. Pydantic v2 supports this natively: a single ValidationError aggregates every failing field, and its errors() method lists each one with its location.
Alternatives considered:
- Fail-fast on first error → rejected: poor developer experience when multiple terms need fixing
- Warning-only mode → rejected: conflicts with the fail-fast aggregate boundary decision (FR-008)
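A minimal sketch of per-file aggregation, assuming Pydantic v2, is shown below. The models, the surface pattern, and the helper `collect_errors` are hypothetical and exist only to show how one validation pass yields one message per problem.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class Term(BaseModel):
    model_config = ConfigDict(extra="forbid")
    surface: str = Field(pattern=r"^[a-z0-9][a-z0-9 ]*$")  # assumed normalization pattern
    definition: str
    confidence: float = Field(ge=0.0, le=1.0)


class SeedFile(BaseModel):
    model_config = ConfigDict(extra="forbid")
    terms: list[Term]


def collect_errors(data: dict) -> list[str]:
    """Return one readable message per problem, or an empty list if the data is valid."""
    try:
        SeedFile.model_validate(data)
    except ValidationError as exc:
        return [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
    return []


messages = collect_errors(
    {
        "terms": [
            {"surface": "Work Package", "definition": "x", "confidence": 0.9},
            {"surface": "mission", "definition": "y", "confidence": 1.5},
            {"surface": "Lane", "definition": "z", "confidence": 0.5},
        ]
    }
)
for message in messages:
    print(message)
```

Each message is prefixed with a path such as terms.1.confidence, so the operator can locate every offending entry from a single run.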
R5: save_seed_file() Validation Scope
Decision: Validate the term data that will be written by constructing a GlossarySeedFile model from the serialized form before writing to disk.
Rationale: save_seed_file() receives list[TermSense] — already-validated domain objects. The risk is low (data is valid by construction), but validating the serialized output catches edge cases like rounding errors in confidence serialization or encoding issues.
Alternatives considered:
- Skip validation in save_seed_file() since input is already validated domain objects → rejected: defense-in-depth; the spec requires validation at all write boundaries (FR-013)
- Validate the TermSense objects directly → insufficient: we need to validate the YAML-serialized form, not the in-memory objects
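The write-boundary check could be sketched as follows, assuming Pydantic v2. `TermSenseStub` is a hypothetical stand-in for the real TermSense, the rounding step is an assumed serialization detail, and the actual YAML write is left out so the example stays dependency-light.

```python
from dataclasses import dataclass

from pydantic import BaseModel, ConfigDict, Field, ValidationError


@dataclass(frozen=True)
class TermSenseStub:
    """Hypothetical stand-in for the real TermSense domain object."""

    surface: str
    definition: str
    confidence: float
    status: str


class SeedTerm(BaseModel):
    model_config = ConfigDict(extra="forbid")
    surface: str
    definition: str
    confidence: float = Field(ge=0.0, le=1.0)
    status: str


class GlossarySeedFile(BaseModel):
    model_config = ConfigDict(extra="forbid")
    terms: list[SeedTerm]


def build_validated_payload(senses: list[TermSenseStub]) -> dict:
    """Serialize senses to the on-disk shape and validate that shape before any write."""
    payload = {
        "terms": [
            {
                "surface": sense.surface,
                "definition": sense.definition,
                "confidence": round(sense.confidence, 2),  # assumed serialization step
                "status": sense.status,
            }
            for sense in senses
        ]
    }
    # Raises ValidationError before the caller gets anything to write.
    GlossarySeedFile.model_validate(payload)
    return payload


good = build_validated_payload([TermSenseStub("mission", "A scoped goal.", 0.9, "active")])

# A deliberately corrupted stub simulates a fault between domain object and file.
write_blocked = False
try:
    build_validated_payload([TermSenseStub("lane", "A workflow column.", 1.2, "active")])
except ValidationError:
    write_blocked = True
```

The caller would pass the returned payload to the YAML writer only after this function returns, so an invalid serialized form never reaches disk.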
R6: Dashboard API Backward Compatibility
Decision: Add optional validation_errors field to GlossaryHealthResponse. For /api/glossary-terms, return the existing array format when valid; include errors in a wrapper object only when validation fails.
Rationale: Existing dashboard JS consumers expect GlossaryHealthResponse shape for health and a bare array for terms. Adding an optional field to health is backward-compatible. For terms, the error case currently returns [], so changing it to {"terms": [], "validation_errors": [...]} is a semantic improvement but a structural change. Consumers already handle the empty-array case; the wrapper object is only returned on error.
Alternatives considered:
- Always return wrapper object for terms → rejected: breaks existing consumers that expect bare array
- Return errors via HTTP status code (4xx) → rejected: validation errors are about the data state, not the request; 200 with error details is more appropriate for a health-check pattern
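The response shaping for /api/glossary-terms can be sketched in a few lines. Both function names are hypothetical; `read_terms_body` is written in Python only to illustrate how a consumer could accept either shape, whereas the real consumers are dashboard JS.

```python
from typing import Any


def glossary_terms_body(terms: list[dict[str, Any]], validation_errors: list[str]) -> Any:
    """Bare array when the seed data is valid; wrapper object only on validation failure."""
    if not validation_errors:
        return terms
    return {"terms": [], "validation_errors": validation_errors}


def read_terms_body(body: Any) -> tuple[list[dict[str, Any]], list[str]]:
    """Consumer-side normalization that accepts both response shapes."""
    if isinstance(body, list):
        return body, []
    return body.get("terms", []), body.get("validation_errors", [])


ok_body = glossary_terms_body([{"surface": "mission"}], [])
error_body = glossary_terms_body(
    [{"surface": "mission"}], ["terms.0.confidence: out of range"]
)

print(read_terms_body(ok_body))
print(read_terms_body(error_body))
```

The valid path is byte-for-byte what consumers receive today; only the error path changes shape, from an empty array to an object carrying the reasons.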