Code as a Crime Scene — High-Level Overview
Pedagogical overview of the Code-as-a-Crime-Scene (CaaCS) auditing technique, with empirical observations from the 2026-05 run on spec-kitty (extended scope).
Last updated: 2026-05-09 (after tests/ + kitty-specs/ scope-expansion pass).
Companion artefacts:
- Doctrine:
src/doctrine/tactics/built-in/analysis/forensic-repository-audit.tactic.yaml,src/doctrine/procedures/built-in/legacy-codebase-triage.procedure.yaml - Empirical run:
docs/architecture/audits/2026-05-spec-kitty-caacs.md - Cross-check vs issue tracker:
docs/architecture/audits/2026-05-822-crosscheck.md - Phase-3 synthesis & meta:
docs/architecture/audits/2026-05-phase3-*.md,docs/architecture/audits/2026-05-caacs-meta-assessment.md
1. What CaaCS is
Code as a Crime Scene is a forensic technique for understanding a codebase you didn't write. The premise is borrowed from criminology: investigators don't read every report on every person ever in a city — they look for patterns of behavior that point to where the action is. CaaCS does the same thing for software, treating the version-control history as the behavioral log.
The output is a prioritized triage, not a "rewrite plan." CaaCS doesn't tell you what to fix — it tells you where to look first and what questions to ask when you get there.
2. The core insight
Behavior over structure. A static look at a codebase tells you what it currently is; the git history tells you what people did with it. Defects cluster, knowledge concentrates, and architectural debt accumulates in patterns that the file tree by itself cannot show. Forensic recipes make those patterns visible cheaply, before you commit to reading any file in depth.
A static reading answers what is here? CaaCS answers what mattered to the people who built this?
3. Origin and pedigree
The technique was articulated by Adam Tornhill in the books Your Code as a Crime Scene and Software Design X-Rays. Tornhill's code-maat tool implements the deeper analyses (temporal coupling, knowledge maps, complexity trends).
Piechowski's two blog posts (linked in the audit's methodology section) are a lighter, recipe-oriented adaptation aimed at Rails legacy audits. They strip the technique to five git invocations and a triage doc, which is the form most teams actually use day-to-day.
The Microsoft Research paper "Use of Relative Code Churn Measures to Predict System Defect Density" (Nagappan & Ball, 2005) is the foundational empirical justification: churn is a good defect predictor.
4. The five core recipes
Each recipe is a single git invocation answering a single question. Run them in order; each one sharpens the next.
| # | Question | Recipe (1y window) | Signal |
|---|---|---|---|
| 1 | Where is the action? | git log --format=format: --name-only --since="1 year ago" \| sort \| uniq -c \| sort -nr \| head -20 |
Churn hotspots — files most touched |
| 2 | Who knows this code? | git shortlog -sn --no-merges --since="1 year ago" |
Bus factor — contributor concentration |
| 3 | Where are recurring fixes? | git log -i -E --grep="fix\|bug\|broken\|regress" --name-only --format='' \| sort \| uniq -c \| sort -nr \| head -20 |
Bug hotspots — files repeatedly being patched |
| 4 | Is the project alive? | git log --format='%ad' --date=format:'%Y-%m' \| sort \| uniq -c |
Velocity over time |
| 5 | Does the team trust its pipeline? | git log --oneline --since="1 year ago" \| grep -iE 'revert\|hotfix\|emergency\|rollback' |
Firefighting frequency |
The interesting move is intersecting recipes 1 and 3. Files that appear in both the churn list and the bug-hotspot list are the highest-risk targets — they're unstable and known-defective. Tornhill calls this the "principal hot spot" overlay.
5. The question framework
The recipes are not load-bearing on their own; the questions are. CaaCS practitioners reframe every recipe as a question because that's how the technique stays useful when you're in a codebase whose tools differ from the source post:
- Where is risk concentrated? → recipes 1 + 3, intersected
- Who knows this code? → recipe 2 + per-file authorship
- Is the project alive? → recipe 4
- Does the team trust its pipeline? → recipe 5
- Is anything abandoned? → last-touch dates + zero-coverage overlay
- Is complexity load-bearing or accidental? → recipe 1 ∩ complexity overlay (radon for Python, rubycritic for Ruby, etc.)
When the language changes, the recipes barely change but the tools do (radon ⇄ rubycritic ⇄ rustdoc-coverage…). The questions stay constant.
6. The output: a triage document
CaaCS doesn't end with tables. It ends with a four-bucket triage in the spirit of an Eisenhower matrix:
┌─────────────────────────┬─────────────────────────┐
│ Fix this week │ Fix this quarter │
│ (hot ∩ buggy ∩ │ (hot OR buggy, │
│ critical) │ not both) │
├─────────────────────────┼─────────────────────────┤
│ Parallelisable │ Don't worry │
│ (hot but stable — │ (cold, not buggy, │
│ refactor candidates) │ peripheral) │
└─────────────────────────┴─────────────────────────┘
This shape forces a real prioritization decision and separates urgency from strategic value.
7. Limits and biases — why CaaCS can lie to you
Forensic methods only see what's in the data. Six biases consistently bite:
| Bias | Effect | Mitigation |
|---|---|---|
| Squash-merge distortion | Compressed authorship → bus factor under-reports | Inspect upstream merge convention before trusting recipe 2 |
| Weak commit messages | Bug-grep underperforms | Sanity-check by spot-reading 10 commits; aim for ≥80% true-positive rate |
| Vanity files | Lockfiles, CHANGELOGs dominate raw counts | Hard exclusion list before running recipes |
| No rename-following | History truncates at file rename | Pass --follow for per-file work |
| No complexity capture | Raw churn ≠ complexity | Pair with radon / rubycritic / cloc |
| Bus factor ≠ knowledge | Low contributor count can mean stable, not abandoned | Require interpretation, not numerical reading |
These six are encoded as a failure_modes block on the spec-kitty forensic-repository-audit tactic so that any contributor running the recipe sees the caveats inline.
8. Where CaaCS sits among adjacent techniques
| Technique | What it sees | What it misses | Relationship to CaaCS |
|---|---|---|---|
| Static analysis (linters, complexity) | Current structure | History, intent | Complementary overlay |
| DDD strategic classification | Intentional importance | Actual usage patterns | Complementary — what should be core vs what is hot |
| Code review | Specific change quality | Pattern-level risk | Different time scale; CaaCS scopes review, doesn't replace it |
| Brown-field interview (#665/#666) | Tribal knowledge, design intent | Quantitative behavior | Strict complement — CaaCS surfaces the what; the interview answers the why |
| Coverage analysis | Test gaps | Why coverage is low | Pair with recipes 1+3 for "hot ∩ buggy ∩ untested" red-list |
A useful rule: CaaCS tells you which questions to ask; the other techniques answer them.
9. What spec-kitty's doctrine extension adds
When CaaCS was codified into spec-kitty's doctrine system, three deliberate extensions:
- Wrapped the recipes in an entry/exit-conditioned procedure (
legacy-codebase-triage.procedure.yaml) so a contributor running it knows when it's the right tool. The Piechowski posts assume the reader has already decided. - Made the DDD overlay an explicit optional step, reusing the existing
strategic-domain-classificationtactic. The source posts don't mention DDD at all; treating the overlay as separate keeps provenance honest and lets the technique be used without DDD on projects that don't have a context map. - Encoded the six biases as a mandatory
failure_modesblock in the tactic so anyone running the recipe sees the caveats inline rather than discovering them by being misled.
10. Empirical observations from the spec-kitty run
Two passes, growing scope. The growth is what made the technique credible.
First pass — src/ only (757 files, 1y window)
| Finding | Reading |
|---|---|
| F1 — 89.5% single-author concentration | Bus factor or maturity? Could not tell |
F2 — cli/commands/agent/{tasks,workflow,mission}.py top of every list |
Unambiguous refactor candidate |
| Pipeline trust | ~0.3% reverts/hotfixes — healthy |
| Velocity | accelerating |
Cross-cutting kitty-specs/ ↔ src/ couplings |
invisible by scope |
| Test-coverage overlay | invisible by scope |
The audit-vs-issue-tracker crosscheck produced zero STRONG matches between the 14 catalogued findings and the 16 currently-open sub-issues under #822. The forensic surface and the operational backlog see different worlds.
Second pass — full corpus (src/ + tests/ + kitty-specs/)
| Finding | Update |
|---|---|
| F1 bus factor | 89.5% → 95.2%. Worse, not better. Tests and mission specs don't dilute the signal — single author wrote those too. F1 reclassified from "bus-factor or maturity" to definitively bus factor. |
| F2 hotspot list | Robust under scope expansion. Top-19 of the full-corpus table is identical to the src/-only list. First non-src/ entry doesn't appear until rank 20. |
| F15 (new) — test-update lag | F2 hotspots change with their tests only ~30% of the time. ~70% of changes ship without test updates. Empirically grounds the "untested invariants" gap audit (KC-WP3 in the F1 plan). |
| F16 (new) — glossary middleware | 13.9% test-to-src churn ratio — worst hot-file ratio in the corpus. |
| F17 (new) — mission ↔ src/ coupling | Max 5 co-changes/year for any pair. The SDD pipeline separates spec and code commits structurally — invisible in commit history by design. |
| F18 (new) — agent_utils/status.py | 19.0% test ratio — under-tested hot file. |
What the second pass taught about the technique itself
- Scope expansion served as an accidental validity test. If the bus-factor finding had been an artefact of looking only at production code, broadening to tests and specs would have diluted it. Instead, it sharpened. That's evidence the signal is robust, not measurement noise.
- The hotspot list survived the scope change unchanged. Top-19 identical. F2 is not an artefact of where we looked.
- Test coverage proxy via churn ratio works as a low-cost smoke test. It found four under-tested hot files in one pass without running pytest-cov.
- Cross-cutting coupling between mission specs and source is structurally limited in spec-kitty (max 5 co-changes/year for any pair) — the SDD pipeline separates them by design. This is itself a finding about the workflow, not just the code.
11. Further reading
Books
- Adam Tornhill — Your Code as a Crime Scene (Pragmatic Bookshelf)
- Adam Tornhill — Software Design X-Rays (Pragmatic Bookshelf)
Tools
code-maat— Tornhill's CLI for the deeper analyses (temporal coupling, knowledge maps)cloc— file/SLOC inventoryradon— Python complexity overlayrubycritic,brakeman— Ruby/Rails complementary overlays
Papers
- Nagappan & Ball (2005), "Use of Relative Code Churn Measures to Predict System Defect Density" — foundational empirical justification
Adaptations
- Piechowski, "How I audit a legacy Rails codebase"
- Piechowski, "Git commands before reading code"
Internal artefacts (this repo)
- Doctrine:
src/doctrine/tactics/built-in/analysis/forensic-repository-audit.tactic.yaml - Doctrine:
src/doctrine/procedures/built-in/legacy-codebase-triage.procedure.yaml - Audit (2026-05, two-pass):
docs/architecture/audits/2026-05-spec-kitty-caacs.md - Issue-tracker crosscheck:
docs/architecture/audits/2026-05-822-crosscheck.md - Phase-3 synthesis (issue drafts, backlog triage, F1 knowledge-capture plan, meta-assessment):
docs/architecture/audits/2026-05-phase3-*.mdanddocs/architecture/audits/2026-05-caacs-meta-assessment.md