Check Stale — catch cassettes recorded against a retired model¶

A replayed cassette is a deterministic test: it passes as long as the recording is unchanged. But that's exactly the trap — a green replay says nothing about whether the recording still mirrors reality. In 2026, models get hard retirement dates (and providers silently update weights). When the model a cassette was recorded against is gone, your test keeps "passing" against a world that no longer exists.

evalcraft check-stale fixes the blind spot by activating the provenance every cassette already records (model set, prompt hash, timestamp) and turning it into a CI gate.

evalcraft check-stale tests/cassettes/*.json --models "gpt-5.1,claude-sonnet-4-5"

  staleness check  3 cassette(s)

  refund_flow
  CRITICAL  [model_retired] Recorded model 'gpt-4o' is not in the current model set —
            it may have been retired or swapped. This deterministic test no longer
            mirrors production.
  fresh  weather_agent
  fresh  search_agent

  CRITICAL staleness found — re-record the affected cassettes
# exit code 1

What it checks¶

Finding	Severity	Meaning	Exits CI?
`model_retired`	CRITICAL	A recorded model is absent from the current `--models` set (retired or swapped) — the cassette may now exercise an API that errors live.	Yes (exit 1)
`prompt_drift`	WARNING	The current prompt hash (`--prompts`) differs from the recorded one — still replays, but no longer mirrors the live prompt.	No
`age`	INFO	The recording is older than `--max-age-days`.	No
`no_provenance`	INFO	A legacy / hand-built cassette with no provenance — re-record to enable checks.	No

Only a retired model blocks the build — it's the one signal that means "your deterministic test is lying." Prompt drift and age are visible but non-blocking.

Flags¶

Flag	Description
`--models "a,b,c"`	The model set you ship today. Any recorded model not in this exact set → CRITICAL. Omit to skip the model check.
`--prompts <file>`	A file of your current prompts; its hash is compared to the recorded `prompt_hash`. Omit to skip.
`--max-age-days N`	Recorded-at age over `N` days → INFO. Defaults to `30` if no other check is given.
`--json`	Emit `{"cassettes": [report, ...]}` (severity strings `CRITICAL`/`WARNING`/`INFO`). Still exits 1 on any CRITICAL.

Matching is exact and case-sensitive — a swap from gpt-5.1 to gpt-5.1-mini should fire. No fuzzy family matching.

`--prompts` file shape¶

The hash basis is identical to what was recorded at capture time, so a file that reproduces the prompts matches byte-for-byte. Accepted shapes:

// 1. JSON object with the run's input + per-LLM-call inputs
{ "input_text": "refund order 123", "llm_inputs": ["system + user prompt...", "..."] }

// 2. JSON list → treated as llm_inputs (input_text = "")
["system + user prompt..."]

// 3. anything else → treated as input_text

Wire it into CI¶

Add it as a fast, deterministic gate next to your other checks — no API key, no network:

- name: Fail if any cassette was recorded against a retired model
  run: evalcraft check-stale tests/cassettes/*.json --models "${{ vars.CURRENT_MODELS }}"

When a model is retired, the gate goes red — re-record the affected cassettes (which refreshes their provenance), review the new behavior, and commit.

Python API¶

from evalcraft import StalenessChecker
from evalcraft.core.models import Cassette

report = StalenessChecker(max_age_days=30).check(
    Cassette.load("tests/cassettes/refund_flow.json"),
    current_models=["gpt-5.1", "claude-sonnet-4-5"],
)
assert not report.has_critical, report.to_dict()