Scorers¶
Evalcraft provides a set of assertion functions ("scorers") for evaluating cassettes and agent runs. Each function returns an AssertionResult rather than raising immediately, so you can collect all failures before reporting.
All scorers accept either a Cassette or an AgentRun as their first argument.
from evalcraft import (
assert_tool_called,
assert_tool_order,
assert_no_tool_called,
assert_output_contains,
assert_output_matches,
assert_cost_under,
assert_latency_under,
assert_token_count_under,
)
from evalcraft.eval.scorers import Evaluator
Offline vs. live scorers¶
Evalcraft has two kinds of scorers, and the difference matters: only the offline family delivers the "deterministic, $0, runs-on-every-commit" promise.
Offline / deterministic (no network, $0)¶
Evaluated entirely against the recorded cassette — no model is called. These run
fine inside replay's NetworkGuard:
assert_tool_called,assert_tool_order,assert_no_tool_calledassert_output_contains,assert_output_matchesassert_cost_under,assert_latency_under,assert_token_count_under
Cost / latency / token assertions read the recorded numbers — they gate the captured run, not a fresh live run.
Live / paid / non-deterministic (calls a real model at test time)¶
These send the recorded output to a judge model when invoked, so they incur
cost + latency, require an API key, and are not deterministic. Run them with
eval_n(...) + confidence intervals, and keep them out of the $0 per-commit path
(use nightly / gated jobs):
- LLM-as-Judge:
assert_output_semantic,assert_factual_consistency,assert_tone,assert_custom_criteria - RAG metrics:
assert_faithfulness,assert_context_relevance,assert_answer_relevance,assert_context_recall - Pairwise:
pairwise_compare,pairwise_rank - Multi-judge:
JuryScorer - Hallucination:
assert_no_hallucination,detect_hallucinations
The judge model is configurable (
provider=,model=). A judge call cannot run inside replay'sNetworkGuard— it needs the network. For deterministic, $0 CI you can opt in to record/replay of judge responses withevalcraft.eval.judge_cache.use_judge_cache(...)(or theEVALCRAFT_JUDGE_CACHEenv var) — at the cost of a frozen judgment that only updates when you re-record.
AssertionResult¶
Every scorer returns an AssertionResult:
result = assert_tool_called(cassette, "web_search")
print(result.passed) # True or False
print(result.name) # "assert_tool_called(web_search)"
print(result.expected) # "web_search"
print(result.actual) # ["web_search", "summarize"]
print(result.message) # "" if passed, error description if failed
In pytest, use assert result.passed, result.message to get descriptive failures:
result = assert_tool_called(run, "web_search")
assert result.passed, result.message
# If failed: "Tool 'web_search' was never called. Called tools: ['search', 'analyze']"
Tool assertions¶
assert_tool_called(cassette, tool_name, ...)¶
Assert that a specific tool was called.
from evalcraft import assert_tool_called, replay
run = replay("tests/cassettes/agent.json")
# Basic: tool was called at least once
result = assert_tool_called(run, "web_search")
assert result.passed
# Called exactly N times
result = assert_tool_called(run, "web_search", times=2)
# Called with specific arguments
result = assert_tool_called(run, "web_search", with_args={"query": "Paris weather"})
# Called before another tool
result = assert_tool_called(run, "web_search", before="summarize")
# Called after another tool
result = assert_tool_called(run, "summarize", after="web_search")
| Parameter | Type | Description |
|---|---|---|
cassette |
Cassette \| AgentRun |
The run to check |
tool_name |
str |
Name of the tool |
times |
int \| None |
Exact number of calls expected |
with_args |
dict \| None |
Arguments that must have been passed |
before |
str \| None |
Tool that should come AFTER this one |
after |
str \| None |
Tool that should come BEFORE this one |
assert_tool_order(cassette, expected_order, strict=False)¶
Assert tools were called in a specific order.
# Non-strict: the tools must appear in order, but other tools can be in between
result = assert_tool_order(run, ["web_search", "summarize", "send_email"])
assert result.passed
# Strict: the sequence must match exactly
result = assert_tool_order(
run,
["web_search", "summarize", "send_email"],
strict=True,
)
| Parameter | Type | Description |
|---|---|---|
cassette |
Cassette \| AgentRun |
The run to check |
expected_order |
list[str] |
Expected tool names in order |
strict |
bool |
If True, exact match required. If False, subsequence match. |
assert_no_tool_called(cassette, tool_name)¶
Assert a specific tool was NOT called.
result = assert_no_tool_called(run, "send_email")
assert result.passed, result.message
# If failed: "Tool 'send_email' was called 1 times, expected 0"
Output assertions¶
assert_output_contains(cassette, substring, case_sensitive=True)¶
Assert the agent's output contains a substring.
result = assert_output_contains(run, "Paris")
assert result.passed
# Case-insensitive
result = assert_output_contains(run, "paris", case_sensitive=False)
assert result.passed
assert_output_matches(cassette, pattern)¶
Assert the agent's output matches a regex pattern.
result = assert_output_matches(run, r"\d+°C")
assert result.passed, result.message
# If failed: "Output does not match pattern '\\d+°C'"
result = assert_output_matches(run, r"(sunny|cloudy|rainy)")
assert result.passed
Cost and performance assertions¶
assert_cost_under(cassette, max_usd)¶
Assert the total estimated cost is under a threshold.
result = assert_cost_under(run, max_usd=0.05)
assert result.passed
# If failed: "Cost $0.0823 exceeds limit $0.0500"
Note
Cost is only available if recorded during capture (e.g., via record_llm_call(cost_usd=...) or an adapter that estimates cost).
assert_latency_under(cassette, max_ms)¶
Assert total wall-clock time is under a threshold (in milliseconds).
result = assert_latency_under(run, max_ms=5000)
assert result.passed
# If failed: "Latency 6234.0ms exceeds limit 5000.0ms"
assert_token_count_under(cassette, max_tokens)¶
Assert total token count is under a threshold.
result = assert_token_count_under(run, max_tokens=2000)
assert result.passed
# If failed: "Token count 2341 exceeds limit 2000"
Evaluator¶
The Evaluator class lets you compose multiple assertions into a single evaluation.
from evalcraft.eval.scorers import Evaluator
from evalcraft import (
assert_tool_called,
assert_tool_order,
assert_cost_under,
assert_token_count_under,
assert_output_contains,
replay,
)
run = replay("tests/cassettes/agent.json")
evaluator = Evaluator()
evaluator.add(assert_tool_called, run, "web_search")
evaluator.add(assert_tool_order, run, ["web_search", "summarize"])
evaluator.add(assert_cost_under, run, max_usd=0.05)
evaluator.add(assert_token_count_under, run, max_tokens=2000)
evaluator.add(assert_output_contains, run, "summary")
result = evaluator.run()
print(result.passed) # True if all pass
print(result.score) # e.g. 0.8 = 4/5 passed
print(len(result.assertions)) # 5
for assertion in result.assertions:
status = "PASS" if assertion.passed else "FAIL"
print(f"[{status}] {assertion.name}")
if not assertion.passed:
print(f" {assertion.message}")
Evaluator.add(assertion_fn, *args, **kwargs)¶
Add an assertion to run.
Returns self for chaining.
Evaluator.run()¶
Run all assertions and return an EvalResult.
EvalResult fields¶
| Field | Type | Description |
|---|---|---|
passed |
bool |
True if all assertions passed |
score |
float |
Fraction of assertions that passed (0.0–1.0) |
assertions |
list[AssertionResult] |
Individual assertion results |
failed_assertions |
list[AssertionResult] |
Only failed assertions |
result = evaluator.run()
if not result.passed:
for a in result.failed_assertions:
print(f"FAIL: {a.name} — {a.message}")
Complete example¶
import pytest
from evalcraft import replay
from evalcraft import (
assert_tool_called,
assert_tool_order,
assert_no_tool_called,
assert_output_contains,
assert_cost_under,
assert_token_count_under,
assert_latency_under,
)
from evalcraft.eval.scorers import Evaluator
@pytest.fixture
def run():
return replay("tests/cassettes/research_agent.json")
def test_searches_before_summarizing(run):
result = assert_tool_called(run, "web_search", before="summarize")
assert result.passed, result.message
def test_no_email_sent(run):
result = assert_no_tool_called(run, "send_email")
assert result.passed, result.message
def test_output_is_informative(run):
result = assert_output_contains(run, "summary", case_sensitive=False)
assert result.passed, result.message
def test_budget(run):
evaluator = Evaluator()
evaluator.add(assert_cost_under, run, max_usd=0.10)
evaluator.add(assert_token_count_under, run, max_tokens=3000)
evaluator.add(assert_latency_under, run, max_ms=10_000)
result = evaluator.run()
assert result.passed, "\n".join(
f"{a.name}: {a.message}" for a in result.failed_assertions
)