Glossary Chokepoint p95 Latency Measurement

Context and Problem Statement

GlossaryChokepoint.run() is invoked on every agent step to scan request text for semantic conflicts. It must complete well within the human-perceptible threshold so that the scan never becomes a bottleneck in the agent loop.

The design target was p95 ≤ 50 ms for all realistic input sizes. This ADR records the measured performance to confirm or revise that threshold.

Benchmark Setup

Benchmark script: tests/specify_cli/glossary/bench_chokepoint.py

Parameter	Value
Synthetic term index	500 active SPEC_KITTY_CORE terms
Iterations per input size	1 000 (after 10 warm-up)
Input sizes tested	500 words, 2 000 words, 5 000 words
Term hit rate in texts	~20 % (deliberate over-representation)
Platform	macOS Darwin 24.6.0, Python 3.14.0

Approximately 20 % of tokens in the generated texts were drawn from the synthetic index, which is a deliberate over-representation of realistic inputs. Real agent requests typically contain far fewer glossary terms.

Measured Results

Benchmark executed on 2026-04-22.

Input size (words)	mean	p50	p95	p99
500	2.28 ms	2.19 ms	2.44 ms	4.03 ms
2 000	8.88 ms	8.79 ms	9.16 ms	9.67 ms
5 000	22.19 ms	21.91 ms	22.85 ms	26.70 ms

All three input sizes pass p95 ≤ 50 ms comfortably.

Decision

Threshold confirmed: p95 ≤ 50 ms.

The measured p95 values fall well below the 50 ms target at every tested input size. The design constraint is satisfied without modification.

Key observations:

Latency scales sub-linearly with input size. A 10× input increase (500 → 5 000 words) yields only a ~9× p95 increase (2.44 ms → 22.85 ms), thanks to the per-token deduplication and the O(1) dictionary lookup in GlossaryTermIndex.surface_to_senses.
Index load is amortised. _load_index() is called at most once per GlossaryChokepoint instance. The benchmark pre-loads the index before timed iterations, matching production behaviour where the chokepoint is constructed once and reused across many calls.
Headroom is substantial. Even at 5 000 words the p95 is 22.85 ms — less than half the threshold. This leaves room for future index growth (e.g. 2 000+ terms) before the threshold becomes a concern.

Consequences

Positive

The 50 ms p95 threshold is confirmed as a realistic operating constraint.
No performance-driven architecture changes are needed at this time.
CI tests can run without special hardware; the chokepoint adds negligible latency to the existing test suite (< 1 ms per call in tests).

Negative

None identified.

Neutral

The threshold should be re-measured if the index grows beyond ~5 000 terms or if a new text pre-processing step (e.g. NLP tokenisation) is added.
The benchmark script (bench_chokepoint.py) is not part of the CI suite; it must be run manually when the chokepoint implementation changes.

Re-measurement Trigger Conditions

Re-run tests/specify_cli/glossary/bench_chokepoint.py and update this ADR if:

The GlossaryTermIndex grows to more than 5 000 active terms.
_run_inner() is modified to call additional I/O-bound operations.
A new text pre-processing step (e.g. sentence segmentation, NLP) is added before the tokenisation loop.
The runtime target platform changes significantly (e.g. from macOS to a resource-constrained container).

src/specify_cli/glossary/chokepoint.py — implementation
tests/specify_cli/glossary/bench_chokepoint.py — benchmark script
tests/specify_cli/glossary/test_chokepoint.py — unit tests (T014)
WP01 ADR: architecture/adrs/2026-04-09-1-mission-identity-uses-ulid-not-sequential-prefix.md