Concepts¶
Understanding Evalcraft's core abstractions.
Cassette¶
A cassette is a complete recording of one agent run. It is serialized as plain JSON and is the fundamental unit in Evalcraft.
Named after VCR cassettes — you record something once, then play it back as many times as you want.
{
"evalcraft_version": "0.1.0",
"cassette": {
"id": "a1b2c3d4-...",
"name": "weather_agent_test",
"agent_name": "weather_agent",
"framework": "openai",
"input_text": "What's the weather in Paris?",
"output_text": "It's 18°C and cloudy in Paris.",
"total_tokens": 135,
"total_cost_usd": 0.0008,
"total_duration_ms": 450.0,
"llm_call_count": 1,
"tool_call_count": 1,
"fingerprint": "a3f1c2d4e5b6a7c8",
"metadata": {}
},
"spans": [...]
}
Key properties¶
| Property | Type | Description |
|---|---|---|
id |
str |
UUID, unique per run |
name |
str |
Human-readable test name |
agent_name |
str |
Name of the agent under test |
framework |
str |
e.g. "openai", "langgraph" |
input_text |
str |
User's input to the agent |
output_text |
str |
Agent's final output |
total_tokens |
int |
Sum of all token usage |
total_cost_usd |
float |
Estimated dollar cost |
total_duration_ms |
float |
Wall-clock time in milliseconds |
llm_call_count |
int |
Number of LLM calls made |
tool_call_count |
int |
Number of tool calls made |
fingerprint |
str |
SHA-256 of span content (first 16 hex chars) |
spans |
list[Span] |
Ordered list of all recorded events |
Cassettes are git-friendly¶
Cassettes are stored as plain JSON. This means:
- Diff them in PRs — review exactly which tools were called, what tokens were used
- Version them — each cassette represents a specific agent behavior
- Detect regressions — the
fingerprintfield changes if any span changes
Span¶
A span is a single recorded event in an agent run. Each LLM call, tool invocation, or agent step is a span.
Span kinds¶
| SpanKind | Value | Description |
|---|---|---|
LLM_REQUEST |
"llm_request" |
Before an LLM call |
LLM_RESPONSE |
"llm_response" |
After an LLM call (with output and tokens) |
TOOL_CALL |
"tool_call" |
A tool was invoked |
TOOL_RESULT |
"tool_result" |
Result of a tool call |
AGENT_STEP |
"agent_step" |
A node or chain step (LangGraph) |
USER_INPUT |
"user_input" |
The user's input message |
AGENT_OUTPUT |
"agent_output" |
The agent's final answer |
Span fields¶
| Field | Type | Description |
|---|---|---|
id |
str |
UUID |
kind |
SpanKind |
Type of event |
name |
str |
Human-readable label (e.g. "tool:get_weather") |
timestamp |
float |
Unix timestamp when the span started |
duration_ms |
float |
Duration in milliseconds |
input |
Any |
Input data (prompt, tool args) |
output |
Any |
Output data (response text, tool result) |
error |
str \| None |
Error message if the call failed |
model |
str \| None |
LLM model name (LLM spans only) |
token_usage |
TokenUsage \| None |
Token counts (LLM spans only) |
cost_usd |
float \| None |
Estimated cost (LLM spans only) |
tool_name |
str \| None |
Tool name (tool spans only) |
tool_args |
dict \| None |
Arguments passed to the tool |
tool_result |
Any |
Return value of the tool |
metadata |
dict |
Arbitrary extra data |
Capture¶
Capturing means running your agent with a CaptureContext active. All LLM calls and tool calls recorded during the context are collected into a cassette.
from evalcraft import CaptureContext
# Context manager — sync
with CaptureContext(name="my_test", save_path="cassettes/my_test.json") as ctx:
# ... run your agent ...
pass
# Context manager — async
async with CaptureContext(name="async_test") as ctx:
# ... run your async agent ...
pass
There are three ways to record events into a cassette:
- Manual recording — call
ctx.record_llm_call(...),ctx.record_tool_call(...)directly - MockLLM / MockTool — auto-record when
MockLLM.complete()orMockTool.call()is invoked inside an active context - Framework adapters —
OpenAIAdapter,AnthropicAdapter,LangGraphAdapter,CrewAIAdaptermonkey-patch the SDK to record automatically
Replay¶
Replaying means loading a cassette and running the recorded spans without making any real API calls.
from evalcraft import replay, ReplayEngine
# Simple replay
run = replay("cassettes/my_test.json")
assert run.replayed is True
# Replay with modifications
engine = ReplayEngine("cassettes/my_test.json")
engine.override_tool_result("get_weather", {"temp": 5, "condition": "snow"})
run = engine.run()
During replay: - LLM responses are returned from the cassette (no API calls) - Tool results are returned from the cassette (no real tool execution) - Overrides can substitute new values for specific tools or LLM calls
Fingerprint¶
The fingerprint is a 16-character hex digest (SHA-256) of the cassette's span content. It changes if any span's input, output, tool name, or model changes.
Use fingerprints to:
- Detect regressions — if your agent's behavior changes between runs, the fingerprint changes
- CI gates — fail a build if the fingerprint changes unexpectedly
- Diff — compare two cassettes with evalcraft diff
from evalcraft import replay
run = replay("cassettes/v1.json")
print(run.cassette.fingerprint) # e.g. "a3f1c2d4e5b6a7c8"
AgentRun¶
An AgentRun is the result object returned by replay() and ReplayEngine.run(). It wraps a Cassette with metadata about whether the run was live or replayed.
from evalcraft import replay
run = replay("cassettes/my_test.json")
print(run.cassette.output_text) # the agent's answer
print(run.replayed) # True
print(run.success) # True
print(run.error) # None
All scorer functions (assert_tool_called, etc.) accept either a Cassette or an AgentRun.