Code as a Crime Scene — High-Level Overview

Pedagogical overview of the Code-as-a-Crime-Scene (CaaCS) auditing technique, with empirical observations from the 2026-05 run on spec-kitty (extended scope).

Last updated: 2026-05-09 (after tests/ + kitty-specs/ scope-expansion pass). Companion artefacts:

Doctrine: src/doctrine/tactics/built-in/analysis/forensic-repository-audit.tactic.yaml, src/doctrine/procedures/built-in/legacy-codebase-triage.procedure.yaml
Empirical run: docs/architecture/audits/2026-05-spec-kitty-caacs.md
Cross-check vs issue tracker: docs/architecture/audits/2026-05-822-crosscheck.md
Phase-3 synthesis & meta: docs/architecture/audits/2026-05-phase3-*.md, docs/architecture/audits/2026-05-caacs-meta-assessment.md

1. What CaaCS is

Code as a Crime Scene is a forensic technique for understanding a codebase you didn't write. The premise is borrowed from criminology: investigators don't read every report on every person who has ever lived in a city; they look for patterns of behavior that point to where the action is. CaaCS does the same thing for software, treating the version-control history as the behavioral log.

The output is a prioritized triage, not a "rewrite plan." CaaCS doesn't tell you what to fix — it tells you where to look first and what questions to ask when you get there.

2. The core insight

Behavior over structure. A static look at a codebase tells you what it currently is; the git history tells you what people did with it. Defects cluster, knowledge concentrates, and architectural debt accumulates in patterns that the file tree by itself cannot show. Forensic recipes make those patterns visible cheaply, before you commit to reading any file in depth.

A static reading answers "what is here?" CaaCS answers "what mattered to the people who built this?"

3. Origin and pedigree

The technique was articulated by Adam Tornhill in the books Your Code as a Crime Scene and Software Design X-Rays. Tornhill's code-maat tool implements the deeper analyses (temporal coupling, knowledge maps, complexity trends).

Piechowski's two blog posts (linked in the audit's methodology section) are a lighter, recipe-oriented adaptation aimed at Rails legacy audits. They strip the technique to five git invocations and a triage doc, which is the form most teams actually use day-to-day.

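The Microsoft Research paper "Use of Relative Code Churn Measures to Predict System Defect Density" (Nagappan & Ball, 2005) is the foundational empirical justification: in their Windows Server 2003 case study, relative churn measures (churn normalized by factors such as component size) were good predictors of defect density, while absolute churn counts were poor ones.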

4. The five core recipes

Each recipe is a one-line shell pipeline built around a single git command and answers a single question. Run them in order; each one sharpens the next.

#	Question	Recipe (1y window)	Signal
1	Where is the action?	`git log --format=format: --name-only --since="1 year ago" | sort | uniq -c | sort -nr | head -20`	Churn hotspots: files most touched
2	Who knows this code?	`git shortlog -sn --no-merges --since="1 year ago"`	Bus factor: contributor concentration
3	Where are recurring fixes?	`git log -i -E --grep="fix|bug|broken|regress" --name-only --format='' | sort | uniq -c | sort -nr | head -20`	Bug hotspots: files repeatedly being patched
4	Is the project alive?	`git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c`	Velocity over time
5	Does the team trust its pipeline?	`git log --oneline --since="1 year ago" | grep -iE 'revert|hotfix|emergency|rollback'`	Firefighting frequency

The interesting move is intersecting recipes 1 and 3. Files that appear in both the churn list and the bug-hotspot list are the highest-risk targets — they're unstable and known-defective. Tornhill calls this the "principal hot spot" overlay.
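The overlay described above can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual tooling: the function names `count_paths` and `overlap_hotspots` are hypothetical, and the sample strings stand in for the raw `--name-only` output of recipes 1 and 3 (before the `sort | uniq -c` stage).

```python
from collections import Counter


def count_paths(name_only_log: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    # Blank separator lines between commits are skipped so they never rank as a "file".
    return Counter(line.strip() for line in name_only_log.splitlines() if line.strip())


def overlap_hotspots(churn: Counter, bug_fixes: Counter, top_n: int = 20) -> list[str]:
    """Return the paths present in both top-N lists, ordered by churn rank."""
    top_bug = {path for path, _ in bug_fixes.most_common(top_n)}
    return [path for path, _ in churn.most_common(top_n) if path in top_bug]


# Made-up sample data standing in for real recipe output.
churn_log = "a.py\nb.py\n\na.py\nc.py\n\na.py\nb.py\n"
bug_log = "a.py\n\nc.py\n\na.py\n"

churn = count_paths(churn_log)
bugs = count_paths(bug_log)
print(overlap_hotspots(churn, bugs, top_n=2))  # prints ['a.py']
```

Ordering the result by churn rank is a design choice; ranking by the sum or product of the two counts would be an equally defensible variant.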

5. The question framework

The recipes are not load-bearing on their own; the questions are. CaaCS practitioners reframe every recipe as a question because that's how the technique stays useful when you're in a codebase whose tools differ from the source post:

Where is risk concentrated? → recipes 1 + 3, intersected
Who knows this code? → recipe 2 + per-file authorship
Is the project alive? → recipe 4
Does the team trust its pipeline? → recipe 5
Is anything abandoned? → last-touch dates + zero-coverage overlay
Is complexity load-bearing or accidental? → recipe 1 ∩ complexity overlay (radon for Python, rubycritic for Ruby, etc.)

When the language changes, the recipes barely change but the tools do (radon ⇄ rubycritic ⇄ rustdoc-coverage…). The questions stay constant.
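As one illustration of turning a recipe into an answer, the "who knows this code?" question can be reduced to a single number: the share of commits made by the most active contributor. The sketch below assumes the tab-separated `<count>\t<name>` layout that `git shortlog -sn` prints; the helper name `top_author_share` and the sample names are made up, and the audit may compute its concentration figures differently.

```python
def top_author_share(shortlog: str) -> float:
    """Share of commits made by the most active author in `git shortlog -sn` output."""
    counts = []
    for line in shortlog.splitlines():
        parts = line.split(None, 1)  # "<count>\t<author name>"
        if parts and parts[0].isdigit():
            counts.append(int(parts[0]))
    total = sum(counts)
    return max(counts) / total if total else 0.0


# Made-up sample output; real input would come from recipe 2.
sample = "   170\tAlice Example\n    20\tBob Example\n    10\tCarol Example\n"
print(f"{top_author_share(sample):.1%}")  # prints 85.0%
```

A high share is only a prompt for interpretation: it cannot by itself distinguish a bus-factor risk from a stable, mature module.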

6. The output: a triage document

CaaCS doesn't end with tables. It ends with a four-bucket triage in the spirit of an Eisenhower matrix:

┌─────────────────────────┬─────────────────────────┐
│ Fix this week           │ Fix this quarter        │
│ (hot ∩ buggy ∩          │ (hot OR buggy,          │
│  critical)              │  not both)              │
├─────────────────────────┼─────────────────────────┤
│ Parallelisable          │ Don't worry             │
│ (hot but stable —       │ (cold, not buggy,       │
│  refactor candidates)   │  peripheral)            │
└─────────────────────────┴─────────────────────────┘

This shape forces a real prioritization decision and separates urgency from strategic value.
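The matrix can be written down as a small decision function. The sketch below is one possible reading, not the procedure's definition: the matrix leaves two cases open (a file that is hot and buggy but not critical, and the overlap between "hot OR buggy, not both" and "hot but stable"), so both are resolved here by labeled assumptions. The `FileSignal` type and `triage` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSignal:
    path: str
    hot: bool       # appears in the churn list (recipe 1)
    buggy: bool     # appears in the bug-hotspot list (recipe 3)
    critical: bool  # human judgement, e.g. from a domain-classification overlay


def triage(sig: FileSignal) -> str:
    """Map one file's signals to a bucket of the four-way matrix."""
    if sig.hot and sig.buggy:
        # Assumption: hot and buggy but non-critical drops to the quarter bucket.
        return "fix this week" if sig.critical else "fix this quarter"
    if sig.buggy:
        return "fix this quarter"
    if sig.hot:
        # Assumption: hot-but-stable files are refactor candidates, not quarter fixes.
        return "parallelisable"
    return "don't worry"


print(triage(FileSignal("src/example.py", hot=True, buggy=True, critical=True)))
```

Keeping `critical` as an explicit input makes it visible that the urgent bucket depends on a human judgement that the git history cannot supply.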

7. Limits and biases — why CaaCS can lie to you

Forensic methods only see what's in the data. Six biases consistently bite:

Bias	Effect	Mitigation
Squash-merge distortion	Compressed authorship → contributor count is under-reported, so concentration looks worse than it is	Inspect upstream merge convention before trusting recipe 2
Weak commit messages	Bug-grep underperforms	Sanity-check by spot-reading 10 commits; aim for ≥80% true-positive rate
Vanity files	Lockfiles, CHANGELOGs dominate raw counts	Hard exclusion list before running recipes
No rename-following	History truncates at file rename	Pass `--follow` for per-file work
No complexity capture	Raw churn ≠ complexity	Pair with radon / rubycritic / cloc
Bus factor ≠ knowledge	Low contributor count can mean stable, not abandoned	Require interpretation, not numerical reading

These six are encoded as a failure_modes block on the spec-kitty forensic-repository-audit tactic so that any contributor running the recipe sees the caveats inline.
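The vanity-file mitigation is the easiest of the six to automate. A minimal sketch, assuming churn counts are already held in a `Counter` keyed by path; the pattern list is a hypothetical starting point rather than the tactic's actual exclusion list, and should be tuned per repository.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical exclusion list; adjust to the repository under audit.
VANITY_PATTERNS = ("*.lock", "*package-lock.json", "*CHANGELOG*")


def drop_vanity(counts: Counter, patterns=VANITY_PATTERNS) -> Counter:
    """Remove paths matching any exclusion pattern before ranking hotspots."""
    return Counter({
        path: n
        for path, n in counts.items()
        if not any(fnmatchcase(path, pat) for pat in patterns)
    })


# Made-up counts: the lockfile and changelog would otherwise top the list.
raw = Counter({"poetry.lock": 40, "CHANGELOG.md": 35, "src/app.py": 12})
print(drop_vanity(raw).most_common())  # prints [('src/app.py', 12)]
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.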

8. Where CaaCS sits among adjacent techniques

Technique	What it sees	What it misses	Relationship to CaaCS
Static analysis (linters, complexity)	Current structure	History, intent	Complementary overlay
DDD strategic classification	Intentional importance	Actual usage patterns	Complementary — what should be core vs what is hot
Code review	Specific change quality	Pattern-level risk	Different time scale; CaaCS scopes review, doesn't replace it
Brown-field interview (#665/#666)	Tribal knowledge, design intent	Quantitative behavior	Strict complement — CaaCS surfaces the what; the interview answers the why
Coverage analysis	Test gaps	Why coverage is low	Pair with recipes 1+3 for "hot ∩ buggy ∩ untested" red-list

A useful rule: CaaCS tells you which questions to ask; the other techniques answer them.

9. What spec-kitty's doctrine extension adds

When CaaCS was codified into spec-kitty's doctrine system, three deliberate extensions were made:

Wrapped the recipes in an entry/exit-conditioned procedure (legacy-codebase-triage.procedure.yaml) so a contributor running it knows when it's the right tool. The Piechowski posts assume the reader has already decided.
Made the DDD overlay an explicit optional step, reusing the existing strategic-domain-classification tactic. The source posts don't mention DDD at all; treating the overlay as separate keeps provenance honest and lets the technique be used without DDD on projects that don't have a context map.
Encoded the six biases as a mandatory failure_modes block in the tactic so anyone running the recipe sees the caveats inline rather than discovering them by being misled.

10. Empirical observations from the spec-kitty run

Two passes, growing scope. The growth is what made the technique credible.

First pass — `src/` only (757 files, 1y window)

Finding	Reading
F1 — 89.5% single-author concentration	Bus factor or maturity? Could not tell
F2 — `cli/commands/agent/{tasks,workflow,mission}.py` top of every list	Unambiguous refactor candidate
Pipeline trust	~0.3% reverts/hotfixes — healthy
Velocity	accelerating
Cross-cutting `kitty-specs/` ↔ `src/` couplings	invisible by scope
Test-coverage overlay	invisible by scope
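A pipeline-trust percentage like the one in the table can be derived from recipe 5 by dividing the matching commits by the total. The sketch below assumes that definition (the audit may count differently) and takes commit subject lines as input, for example from `git log --format=%s`; `firefighting_rate` is a hypothetical helper name.

```python
import re

# Same keywords as recipe 5, matched case-insensitively.
FIREFIGHT = re.compile(r"revert|hotfix|emergency|rollback", re.IGNORECASE)


def firefighting_rate(subjects: list[str]) -> float:
    """Fraction of commit subjects that mention a firefighting keyword."""
    if not subjects:
        return 0.0
    hits = sum(1 for subject in subjects if FIREFIGHT.search(subject))
    return hits / len(subjects)


# Made-up subjects standing in for real history.
subjects = [
    "Add mission scaffolding",
    'Revert "Add mission scaffolding"',
    "Refactor status module",
    "Update docs",
]
print(f"{firefighting_rate(subjects):.0%}")  # prints 25%
```

Keyword matching shares the weak-commit-message bias: a team that never writes "revert" in its messages will look healthier than it is.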

The audit-vs-issue-tracker crosscheck produced zero STRONG matches between the 14 catalogued findings and the 16 currently-open sub-issues under #822. The forensic surface and the operational backlog see different worlds.

Second pass — full corpus (`src/` + `tests/` + `kitty-specs/`)

Finding	Update
F1 bus factor	89.5% → 95.2%. Worse, not better. Tests and mission specs don't dilute the signal, because the same single author wrote those too. F1 reclassified from "bus-factor or maturity" to definitively bus factor.
F2 hotspot list	Robust under scope expansion. Top-19 of the full-corpus table is identical to the src/-only list. First non-`src/` entry doesn't appear until rank 20.
F15 (new) — test-update lag	F2 hotspots change with their tests only ~30% of the time. ~70% of changes ship without test updates. Empirically grounds the "untested invariants" gap audit (KC-WP3 in the F1 plan).
F16 (new) — glossary middleware	13.9% test-to-src churn ratio — worst hot-file ratio in the corpus.
F17 (new) — mission ↔ src/ coupling	Max 5 co-changes/year for any pair. The SDD pipeline separates spec and code commits structurally — invisible in commit history by design.
F18 (new) — agent_utils/status.py	19.0% test ratio — under-tested hot file.
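A test-update lag figure of the F15 kind can be approximated by asking, for each commit that touches a hot file, whether anything under the test tree changed in the same commit. The sketch below rests on assumptions: it expects the output of `git log --name-only --format=@@%H` (the `@@` marker is an arbitrary choice made here), it treats any path under `tests/` as a test update, and the function names are hypothetical. The audit's own matching rules may be stricter.

```python
def parse_commits(log_text: str, marker: str = "@@") -> list[set[str]]:
    """Split marker-delimited `git log --name-only` output into one path set per commit."""
    commits: list[set[str]] = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(marker):
            commits.append(set())   # a new commit begins
        elif line and commits:
            commits[-1].add(line)   # a path belonging to the current commit
    return commits


def cochange_rate_with_tests(commits, hot_file: str, test_prefix: str = "tests/") -> float:
    """Fraction of commits touching hot_file that also touch a path under test_prefix."""
    touching = [files for files in commits if hot_file in files]
    if not touching:
        return 0.0
    with_tests = sum(1 for files in touching if any(p.startswith(test_prefix) for p in files))
    return with_tests / len(touching)


# Made-up log: src/hot.py changes three times, once together with a test.
sample_log = """@@c1

src/hot.py
tests/test_hot.py
@@c2

src/hot.py
@@c3

src/hot.py
src/other.py
@@c4

docs/readme.md
"""
commits = parse_commits(sample_log)
print(round(cochange_rate_with_tests(commits, "src/hot.py"), 2))  # prints 0.33
```

A low rate is a smoke signal rather than proof of missing tests: tests updated in a follow-up commit are invisible to this measure.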

What the second pass taught about the technique itself

Scope expansion served as an accidental validity test. If the bus-factor finding had been an artefact of looking only at production code, broadening to tests and specs would have diluted it. Instead, it sharpened. That's evidence the signal is robust, not measurement noise.
The hotspot list survived the scope change unchanged. Top-19 identical. F2 is not an artefact of where we looked.
Using the churn ratio as a test-coverage proxy works as a low-cost smoke test. It found four under-tested hot files in one pass without running pytest-cov.
Cross-cutting coupling between mission specs and source is structurally limited in spec-kitty (max 5 co-changes/year for any pair) — the SDD pipeline separates them by design. This is itself a finding about the workflow, not just the code.
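The co-change observation in the last point can be checked with a pairwise count: for every commit, pair each spec path with each source path changed alongside it, then look at the largest count. This is a simplified stand-in for the temporal-coupling analysis that code-maat performs, written under assumptions: commits are supplied as sets of paths, the two prefixes follow the directory names used in this document, and `cross_cochanges` is a hypothetical name.

```python
from collections import Counter
from itertools import product


def cross_cochanges(commits, left_prefix: str = "kitty-specs/", right_prefix: str = "src/") -> Counter:
    """Count how often each (left, right) path pair changes in the same commit."""
    pairs: Counter = Counter()
    for files in commits:
        left = sorted(p for p in files if p.startswith(left_prefix))
        right = sorted(p for p in files if p.startswith(right_prefix))
        pairs.update(product(left, right))  # every spec/source combination in this commit
    return pairs


# Made-up commits; real input would be parsed from `git log --name-only`.
history = [
    {"kitty-specs/m1/spec.md", "src/a.py"},
    {"kitty-specs/m1/spec.md", "src/a.py", "src/b.py"},
    {"src/a.py"},
]
pairs = cross_cochanges(history)
print(max(pairs.values()))  # prints 2
```

Because commits that touch only one side contribute nothing, a workflow that commits specs and code separately will always score low here, which matches the "invisible by design" reading above.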

11. Further reading

Books

Adam Tornhill — Your Code as a Crime Scene (Pragmatic Bookshelf)
Adam Tornhill — Software Design X-Rays (Pragmatic Bookshelf)

Tools

code-maat — Tornhill's CLI for the deeper analyses (temporal coupling, knowledge maps)
cloc — file/SLOC inventory
radon — Python complexity overlay
rubycritic, brakeman — Ruby/Rails complementary overlays

Papers

Nagappan & Ball (2005), "Use of Relative Code Churn Measures to Predict System Defect Density" — foundational empirical justification

Adaptations

Piechowski, "How I audit a legacy Rails codebase"
Piechowski, "Git commands before reading code"

Internal artefacts (this repo)

Doctrine: src/doctrine/tactics/built-in/analysis/forensic-repository-audit.tactic.yaml
Doctrine: src/doctrine/procedures/built-in/legacy-codebase-triage.procedure.yaml
Audit (2026-05, two-pass): docs/architecture/audits/2026-05-spec-kitty-caacs.md
Issue-tracker crosscheck: docs/architecture/audits/2026-05-822-crosscheck.md
Phase-3 synthesis (issue drafts, backlog triage, F1 knowledge-capture plan, meta-assessment): docs/architecture/audits/2026-05-phase3-*.md and docs/architecture/audits/2026-05-caacs-meta-assessment.md