Scorers¶

Evalcraft provides a set of assertion functions ("scorers") for evaluating cassettes and agent runs. Each function returns an AssertionResult rather than raising immediately, so you can collect all failures before reporting.

All scorers accept either a Cassette or an AgentRun as their first argument.

from evalcraft import (
    assert_tool_called,
    assert_tool_order,
    assert_no_tool_called,
    assert_output_contains,
    assert_output_matches,
    assert_cost_under,
    assert_latency_under,
    assert_token_count_under,
)
from evalcraft.eval.scorers import Evaluator

Offline vs. live scorers¶

Evalcraft has two kinds of scorers, and the difference matters: only the offline family delivers the "deterministic, $0, runs-on-every-commit" promise.

Offline / deterministic (no network, $0)¶

Evaluated entirely against the recorded cassette — no model is called. These run fine inside replay's NetworkGuard:

assert_tool_called, assert_tool_order, assert_no_tool_called
assert_output_contains, assert_output_matches
assert_cost_under, assert_latency_under, assert_token_count_under

Cost / latency / token assertions read the recorded numbers — they gate the captured run, not a fresh live run.

Live / paid / non-deterministic (calls a real model at test time)¶

These send the recorded output to a judge model when invoked, so they incur cost + latency, require an API key, and are not deterministic. Run them with eval_n(...) + confidence intervals, and keep them out of the $0 per-commit path (use nightly / gated jobs):

LLM-as-Judge: assert_output_semantic, assert_factual_consistency, assert_tone, assert_custom_criteria
RAG metrics: assert_faithfulness, assert_context_relevance, assert_answer_relevance, assert_context_recall
Pairwise: pairwise_compare, pairwise_rank
Multi-judge: JuryScorer
Hallucination: assert_no_hallucination, detect_hallucinations

The judge model is configurable (provider=, model=). A judge call cannot run inside replay's NetworkGuard — it needs the network. For deterministic, $0 CI you can opt in to record/replay of judge responses with evalcraft.eval.judge_cache.use_judge_cache(...) (or the EVALCRAFT_JUDGE_CACHE env var) — at the cost of a frozen judgment that only updates when you re-record.

AssertionResult¶

Every scorer returns an AssertionResult:

result = assert_tool_called(cassette, "web_search")

print(result.passed)    # True or False
print(result.name)      # "assert_tool_called(web_search)"
print(result.expected)  # "web_search"
print(result.actual)    # ["web_search", "summarize"]
print(result.message)   # "" if passed, error description if failed

In pytest, use assert result.passed, result.message to get descriptive failures:

result = assert_tool_called(run, "web_search")
assert result.passed, result.message
# If failed: "Tool 'web_search' was never called. Called tools: ['search', 'analyze']"

Tool assertions¶

`assert_tool_called(cassette, tool_name, ...)`¶

Assert that a specific tool was called.

from evalcraft import assert_tool_called, replay

run = replay("tests/cassettes/agent.json")

# Basic: tool was called at least once
result = assert_tool_called(run, "web_search")
assert result.passed

# Called exactly N times
result = assert_tool_called(run, "web_search", times=2)

# Called with specific arguments
result = assert_tool_called(run, "web_search", with_args={"query": "Paris weather"})

# Called before another tool
result = assert_tool_called(run, "web_search", before="summarize")

# Called after another tool
result = assert_tool_called(run, "summarize", after="web_search")

Parameter	Type	Description
`cassette`	`Cassette \\| AgentRun`	The run to check
`tool_name`	`str`	Name of the tool
`times`	`int \\| None`	Exact number of calls expected
`with_args`	`dict \\| None`	Arguments that must have been passed
`before`	`str \\| None`	Tool that should come AFTER this one
`after`	`str \\| None`	Tool that should come BEFORE this one

`assert_tool_order(cassette, expected_order, strict=False)`¶

Assert tools were called in a specific order.

# Non-strict: the tools must appear in order, but other tools can be in between
result = assert_tool_order(run, ["web_search", "summarize", "send_email"])
assert result.passed

# Strict: the sequence must match exactly
result = assert_tool_order(
    run,
    ["web_search", "summarize", "send_email"],
    strict=True,
)

Parameter	Type	Description
`cassette`	`Cassette \\| AgentRun`	The run to check
`expected_order`	`list[str]`	Expected tool names in order
`strict`	`bool`	If `True`, exact match required. If `False`, subsequence match.

`assert_no_tool_called(cassette, tool_name)`¶

Assert a specific tool was NOT called.

result = assert_no_tool_called(run, "send_email")
assert result.passed, result.message
# If failed: "Tool 'send_email' was called 1 times, expected 0"

Output assertions¶

`assert_output_contains(cassette, substring, case_sensitive=True)`¶

Assert the agent's output contains a substring.

result = assert_output_contains(run, "Paris")
assert result.passed

# Case-insensitive
result = assert_output_contains(run, "paris", case_sensitive=False)
assert result.passed

`assert_output_matches(cassette, pattern)`¶

Assert the agent's output matches a regex pattern.

result = assert_output_matches(run, r"\d+°C")
assert result.passed, result.message
# If failed: "Output does not match pattern '\\d+°C'"

result = assert_output_matches(run, r"(sunny|cloudy|rainy)")
assert result.passed

Cost and performance assertions¶

`assert_cost_under(cassette, max_usd)`¶

Assert the total estimated cost is under a threshold.

result = assert_cost_under(run, max_usd=0.05)
assert result.passed
# If failed: "Cost $0.0823 exceeds limit $0.0500"

Note

Cost is only available if recorded during capture (e.g., via record_llm_call(cost_usd=...) or an adapter that estimates cost).

`assert_latency_under(cassette, max_ms)`¶

Assert total wall-clock time is under a threshold (in milliseconds).

result = assert_latency_under(run, max_ms=5000)
assert result.passed
# If failed: "Latency 6234.0ms exceeds limit 5000.0ms"

`assert_token_count_under(cassette, max_tokens)`¶

Assert total token count is under a threshold.

result = assert_token_count_under(run, max_tokens=2000)
assert result.passed
# If failed: "Token count 2341 exceeds limit 2000"

Evaluator¶

The Evaluator class lets you compose multiple assertions into a single evaluation.

from evalcraft.eval.scorers import Evaluator
from evalcraft import (
    assert_tool_called,
    assert_tool_order,
    assert_cost_under,
    assert_token_count_under,
    assert_output_contains,
    replay,
)

run = replay("tests/cassettes/agent.json")

evaluator = Evaluator()
evaluator.add(assert_tool_called, run, "web_search")
evaluator.add(assert_tool_order, run, ["web_search", "summarize"])
evaluator.add(assert_cost_under, run, max_usd=0.05)
evaluator.add(assert_token_count_under, run, max_tokens=2000)
evaluator.add(assert_output_contains, run, "summary")

result = evaluator.run()

print(result.passed)          # True if all pass
print(result.score)           # e.g. 0.8 = 4/5 passed
print(len(result.assertions)) # 5

for assertion in result.assertions:
    status = "PASS" if assertion.passed else "FAIL"
    print(f"[{status}] {assertion.name}")
    if not assertion.passed:
        print(f"       {assertion.message}")

`Evaluator.add(assertion_fn, *args, **kwargs)`¶

Add an assertion to run.

evaluator.add(assert_cost_under, cassette, max_usd=0.10)

Returns self for chaining.

`Evaluator.run()`¶

Run all assertions and return an EvalResult.

`EvalResult` fields¶

Field	Type	Description
`passed`	`bool`	True if all assertions passed
`score`	`float`	Fraction of assertions that passed (0.0–1.0)
`assertions`	`list[AssertionResult]`	Individual assertion results
`failed_assertions`	`list[AssertionResult]`	Only failed assertions

result = evaluator.run()
if not result.passed:
    for a in result.failed_assertions:
        print(f"FAIL: {a.name} — {a.message}")

Complete example¶

import pytest
from evalcraft import replay
from evalcraft import (
    assert_tool_called,
    assert_tool_order,
    assert_no_tool_called,
    assert_output_contains,
    assert_cost_under,
    assert_token_count_under,
    assert_latency_under,
)
from evalcraft.eval.scorers import Evaluator


@pytest.fixture
def run():
    return replay("tests/cassettes/research_agent.json")


def test_searches_before_summarizing(run):
    result = assert_tool_called(run, "web_search", before="summarize")
    assert result.passed, result.message


def test_no_email_sent(run):
    result = assert_no_tool_called(run, "send_email")
    assert result.passed, result.message


def test_output_is_informative(run):
    result = assert_output_contains(run, "summary", case_sensitive=False)
    assert result.passed, result.message


def test_budget(run):
    evaluator = Evaluator()
    evaluator.add(assert_cost_under, run, max_usd=0.10)
    evaluator.add(assert_token_count_under, run, max_tokens=3000)
    evaluator.add(assert_latency_under, run, max_ms=10_000)

    result = evaluator.run()
    assert result.passed, "\n".join(
        f"{a.name}: {a.message}" for a in result.failed_assertions
    )

Scorers¶

Offline vs. live scorers¶

Offline / deterministic (no network, $0)¶

Live / paid / non-deterministic (calls a real model at test time)¶

AssertionResult¶

Tool assertions¶

assert_tool_called(cassette, tool_name, ...)¶

assert_tool_order(cassette, expected_order, strict=False)¶

assert_no_tool_called(cassette, tool_name)¶

Output assertions¶

assert_output_contains(cassette, substring, case_sensitive=True)¶

assert_output_matches(cassette, pattern)¶