Skip to content

Live Eval — catching drift

Replay tests a recorded run: it never re-executes the live model, so it cannot catch the regressions that come from model / prompt / retrieval drift (a provider updates a model, you edit a prompt, your retrieval corpus changes). Live-eval is the complementary layer that does.

Replay (per-commit) Live-eval (nightly / gated)
Runs the real model? No — replays a cassette Yes
Cost / determinism $0, deterministic, offline Paid, non-deterministic, online
Catches glue / logic regressions
Catches model / prompt / retrieval drift
Where it runs every push nightly or release gate

Use both: replay for fast per-commit checks of your agent's deterministic glue, live-eval to track real quality over time.

How it works

You define a set of golden inputs (not frozen outputs), supply a runner that executes your real agent, and attach scorers. Live-eval runs each input, scores the live output, and aggregates per-case scores you can compare against a saved baseline.

from evalcraft.eval.live import LiveEvalCase, LiveEvalResult, run_live_eval, compare_to_baseline
from evalcraft import assert_output_contains

cases = [
    LiveEvalCase(
        name="paris_weather",
        input="What's the weather in Paris?",
        scorers=[lambda c: assert_output_contains(c, "Paris")],
    ),
]

def runner(case):
    # Run your REAL agent. Return a str, or a Cassette / AgentRun for trace scoring.
    return my_agent.run(case.input)

result = run_live_eval(cases, runner)        # live, paid, non-deterministic
print(result.pass_rate, result.mean_score)
result.save("live-baseline.json")            # save the first good run as a baseline

Gating CI against a baseline

Re-run live-eval (e.g. nightly), then fail the build if any case regresses:

current = run_live_eval(cases, runner)
comparison = compare_to_baseline(
    current,
    LiveEvalResult.load("live-baseline.json"),
    max_score_drop=0.1,
)
assert comparison.passed, comparison.summary()

Or save the current result and gate from the CLI:

evalcraft live-eval current.json --baseline live-baseline.json --max-drop 0.1
# exits non-zero if any case's score dropped by more than 0.1

A case missing from the baseline is reported as new (never a regression); a case missing from the current run is reported as removed.

Warning

Live-eval makes real model calls. Keep it out of the $0 per-commit path — run it on a schedule or as a release gate.