for feat/pre-doctrine-stabilization-remediation (GitHub Actions run 26558837157).
Root causes identified via full CI run retrospective.
Context and Problem Statement
During a full CI pipeline run (run 26558837157) on the
feat/pre-doctrine-stabilization-remediation branch, 27 tests failed across four
distinct failure modes. All failures were silent in local development but visible in CI.
The root cause in every case was the same structural gap: the test environment in CI was
not equivalent to the local development environment, and the test surface did not enforce
the production contracts it claimed to verify.
Discovery Process
The failures were traced through the following chain:
- Observation:
integration-tests-agentandfast-tests-agentjobs showed 23 tests failing with empty output — "No JSON envelope found in output." - Hypothesis: Reproduced locally using typer 0.26.2 (installed in an isolated venv).
The local
uv.lockpins typer to 0.24.2; CI installs viapip install -e .[test]against the loosepyproject.tomlboundtyper>=0.24.1, resolving to 0.26.2. - Root cause: Typer 0.26 vendors click internally as
typer._click. Exceptions raised by typer internals (UsageError,Abort,Exit) now come from that vendored module and are unrelated to the standaloneclickpackage's exception hierarchy._JSONErrorGroup'sexcept click.UsageErrorclause silently missed every exception raised by typer 0.26+. - Secondary failures:
pytest.raises(click.exceptions.Exit)in four test files matched the wrong class for the same reason.test_quickstart_end_to_end.pyused a hand-rolled SHA-256 hash instead of the canonicalcharter.hasher.hash_content(), diverging silently. E2e tests failed because.kittify/charter/metadata.yamlis gitignored and absent in CI clones, but the e2e conftest did not synthesize it.
Why These Were Invisible Locally
- Local development uses
uv sync --locked, which installs exactly whatuv.lockspecifies (typer 0.24.2). - CI used
pip install -e .[test], which resolvestyper>=0.24.1to the latest available version at install time — currently 0.26.2. - No CI matrix exercised more than one typer version.
- No lint rule prevented direct
click.exceptions.*imports in test files. - No shared fixture synthesized CI-absent runtime state files.
- No lint rule required tests to import production helpers rather than re-implement them.
Decision Drivers
- Local development and CI must run against the same resolved dependency graph.
- The version drift that exposed these failures was not user error — it was a structural gap in how the project manages its environment contract.
- Linting and fixture discipline should enforce correct patterns before review, not after a CI failure.
- Tests that re-implement production algorithms test their own copy, not the system under test. Any deviation between the two silently breaks the test's validity as a contract check.
- E2e test infrastructure must be self-contained; it must not inherit implicit preconditions from the developer's machine state or gitignored runtime files.
Considered Options
Gap 1: CI vs. local dependency resolver mismatch
- Option A (chosen): Run CI via
uv sync --frozen, making the lockfile the single contract for both local and CI environments. - Option B: Add upper-bound pins in
pyproject.tomlfor sensitive transitive deps. - Option C: Keep
pip install, add a scheduled "latest-deps" shadow CI job.
Gap 2: No version-compatibility test for the typer/click interface
- Option A (chosen): Add a targeted smoke test that exercises
_JSONErrorGroup's JSON envelope output end-to-end, parameterized by importingtyper.Exitthrough the stable public surface. - Option B: Add a full CI matrix axis
typer: [pinned, latest]. - Option C: No test; rely on lockfile pinning from Gap 1.
Gap 3: Test helpers re-implement production algorithms
- Option A (chosen): Ruff
noqaallowlist ban — add a custom or per-file linter rule that flags inline re-implementations of known production helpers (detected by pattern, e.g.,hashlib.sha256insidetests/). - Option B: Code review checklist only (no automated enforcement).
- Option C: Extract a shared test-utilities module and require all tests to import from it.
Gap 4: E2e fixtures depend on gitignored runtime state
- Option A (chosen): E2e conftest synthesizes all required runtime state files
explicitly after
copytree, using the production helpers to produce valid content. - Option B: Disable preflight globally for test contexts (ad-hoc, per-conftest).
- Option C: Commit a minimal
metadata.yamlfixture file for tests only.
Gap 5: Direct click.exceptions.* imports in test files
- Option A (chosen): Ruff banned-import rule covering
click.exceptionsin thetests/tree. Any test that needs to assert on CLI exit/error types must import throughtyper's public surface. - Option B: Code review checklist only.
- Option C: Test-only compatibility shim (mirrors the production shim).
Decision Outcome
All five options A are accepted. They are the minimal set of changes that directly address the structural root cause in each gap with automated enforcement rather than convention.
Decision
CI runs
uv sync --frozen. Every test job inci-quality.ymlreplaces itspip install -e .[test]block withastral-sh/setup-uv+uv sync --frozen. The lockfile is the environment contract for both local and CI.A typer-surface smoke test is added. A single parameterized test in the
fast-tests-agentgroup verifies that invoking the CLI without a subcommand (the_JSONErrorGrouptrigger path) returns a JSON envelope with"ok": false, usingtyper.Exit(notclick.exceptions.Exit) as the expected exit-code exception class. This test is the canary for the specific regression mode we observed.hashlib.sha256usage insidetests/is banned by ruff. Tests that need to compute a production-format hash must import and call the canonical helper. The rule prevents future test helpers from diverging silently from the algorithm they are supposed to exercise.E2e conftest is self-contained. The conftest synthesizes
metadata.yamlusing the productioncharter.hasher.hash_content()helper immediately aftercopytree, rather than disabling preflight. This closes the "fixture inherits machine state" gap; any future preflight precondition must also be synthesized in the fixture.click.exceptionsis a banned import intests/. Ruffbanned-apirule coversclick.exceptions.Exit,click.exceptions.UsageError, andclick.exceptions.Abort. The allowed public surface for assertion istyper.Exit,typer.BadParameter, etc.
Consequences
Positive
- Local and CI environments are now described by the same lockfile contract. Any
dependency upgrade is visible, deliberate, and reviewed in a PR via
uv.lockdiff. - The
_JSONErrorGroupregression mode now has a test that would have caught it at the time it was introduced (i.e., at the firstuv.lockupdate that bumped typer). - Lint rules enforce both bans (re-implemented algorithms, direct click imports) without requiring human attention in review.
- The e2e fixture is reproducible across any fresh clone, eliminating a class of "works on my machine" CI failures.
Negative
- CI install time increases slightly (uv resolves and installs from lockfile rather than just latest-compatible). Measured impact is expected to be under 5 seconds per job given uv's speed.
uv.lockmust be kept current; PRs that add or change dependencies must also update the lockfile.uv lock --checkalready enforces this (NFR-005 gate, jobuv-lock-check).- The
hashlib.sha256ban intests/may produce false positives for tests that legitimately compute a raw SHA-256 (e.g., testing a hashing utility's output, PKCE challenges, body/file-integrity checksums). Those tests must carry an inline# noqa: TID251 — <justification>suppression (the configured rule isTID251, flake8-tidy-imports banned-api — there is noSC001rule). The ban is enforced per-call across the wholetests/tree by the dedicated[ENFORCED] banned-API lint gate (TID251)step inci-quality.yml; there are no whole-directory exemptions that could silently disarm it.
Neutral
- The typer version-compatibility smoke test does not replace the full envelope contract test suite; it is additive. Existing tests continue to run unchanged.
- No production code changes are required by this ADR; all changes are in CI configuration, test infrastructure, and lint configuration.
Confirmation
This ADR is considered fully implemented when:
- All
pip install -e .[test]blocks inci-quality.ymlare replaced withuv sync --frozenand CI continues to pass. - A failing test exists in
fast-tests-agentthat would have caught the_JSONErrorGroup/ typer 0.26 regression. ruff check tests/flags any new introduction ofhashlib.sha256orclick.exceptions.*in the test tree.tests/e2e/conftest.pysynthesizesmetadata.yamlusing the production helper and does not rely onpreflight.enabled: falseas the solution.- The full CI pipeline passes on a clean clone with no gitignored files present.
Assessment: Does This Remediate the Problem?
The five changes directly address the five structural gaps identified in the retrospective.
Gap 1 (resolver mismatch): Switching to uv sync --frozen eliminates the
divergence entirely. There is no longer a path where CI resolves a different version of
any dependency than local development. Future version upgrades are explicit lockfile
changes that go through PR review.
Implementation note (second-pass fix, CI run 26565789745): The initial implementation of
uv sync --frozenomitted the--all-extrasflag. Since pytest, pytest-cov, ruff, and mypy live in[project.optional-dependencies.test]and[project.optional-dependencies.lint],uv sync --frozenwithout--all-extrasproduced a venv that was missing pytest entirely, causing every test job to fail withNo module named pytest. All 37 non-infrastructureuv sync --frozeninvocations inci-quality.ymlwere corrected touv sync --frozen --all-extras. The three exempt infrastructure jobs (uv-lock-check,build-wheel,clean-install-verification) had nouv synccalls and were unaffected.
Gap 2 (no version-compatibility test): The smoke test is narrow by design — it
covers the exact trigger path (_JSONErrorGroup with no subcommand) that silently broke
in production. A broader matrix test would provide more coverage but is out of scope for
this ADR; the single test is the canary that was missing.
Gap 3 (re-implemented algorithms): The ruff ban is permanent and automated.
Copy-pasted algorithm implementations will be caught on the next ruff check run before
they are committed. The rule does not fix existing divergence in other tests; that
requires a separate audit.
Gap 4 (gitignored state in e2e): The conftest change makes the fixture
self-describing. This is a point fix for metadata.yaml; it does not generalize to
every possible gitignored file that might be needed in future. The design principle
established here — conftest synthesizes all required state — is the convention for
future e2e fixtures.
Gap 5 (direct click imports): The ruff rule catches the symptom (direct use of
click.exceptions.* in tests) but does not prevent future production code from using
the wrong exception class. The production-side fix (the _CLICK_USAGE_ERRORS shim in
_JSONErrorGroup) is already in place. Together, both layers are required: the
production shim handles version drift in runtime behavior; the lint rule prevents tests
from silently depending on the wrong class for assertions.
More Information
- GitHub Actions run 26558837157 — full CI run that surfaced all four failure modes.
src/specify_cli/orchestrator_api/commands.py—_JSONErrorGroupand the_CLICK_USAGE_ERRORS/_CLICK_ABORTScompatibility shim.tests/e2e/conftest.py— e2e fixture and thepreflight.enabled: falseworkaround being replaced by this ADR.tests/integration/test_quickstart_end_to_end.py— charter hash fix.architecture/adrs/2026-02-21-1-consistent-code-style-enforcement-via-git-hooks.md— related ADR on automated style enforcement.pyproject.toml[tool.ruff]section — location of banned-api rules.