User Guide Overview

Welcome to the Evalcraft user guide. Use the links below to navigate to specific topics.

Core workflow

The typical Evalcraft workflow has three phases:

1. Capture (once)

Run your agent with a CaptureContext active. Every LLM call, tool invocation, and agent decision is recorded into a cassette (a plain JSON file).

from evalcraft import CaptureContext

with CaptureContext(name="my_test", save_path="tests/cassettes/my_test.json") as ctx:
    ctx.record_input("user prompt")
    result = my_agent.run("user prompt")
    ctx.record_output(result)

Commit the cassette to git. This is your ground truth.
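Because cassettes are ordinary JSON, they diff cleanly in code review and round-trip with the standard library. The structure below is purely illustrative — field names like `spans` are borrowed from the Concepts section, but this is not Evalcraft's actual schema:

```python
import json

# Hypothetical cassette structure for illustration only;
# the real Evalcraft schema may differ.
cassette = {
    "name": "my_test",
    "input": "user prompt",
    "spans": [
        {"type": "llm_call", "model": "gpt-4o", "cost_usd": 0.012},
        {"type": "tool_call", "tool": "web_search"},
    ],
    "output": "final answer",
}

with open("my_test.json", "w") as f:
    json.dump(cassette, f, indent=2)

# Reading it back yields the same structure, which is what makes
# a committed cassette a stable ground truth for replay.
with open("my_test.json") as f:
    assert json.load(f) == cassette
```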

2. Replay (every test run)

Load the cassette and replay it. No API calls. No cost. A full run takes about 200 ms.

from evalcraft import replay

run = replay("tests/cassettes/my_test.json")
assert run.replayed is True

3. Assert (in CI)

Use the built-in scorers to assert behavior.

from evalcraft import assert_tool_called, assert_cost_under

assert assert_tool_called(run, "web_search").passed
assert assert_cost_under(run, max_usd=0.05).passed
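Note that the examples assert on `.passed` rather than on the scorer's return value directly: each `assert_*` helper returns a result object instead of a bare boolean. A minimal sketch of that pattern (hypothetical classes, not Evalcraft's actual implementation):

```python
from dataclasses import dataclass

# Hypothetical sketch of the scorer-result pattern;
# Evalcraft's real classes may differ.
@dataclass
class ScorerResult:
    passed: bool
    detail: str

def assert_cost_under_sketch(total_usd: float, max_usd: float) -> ScorerResult:
    # A scorer returns a result object instead of raising, so callers
    # can inspect the detail message when a check fails.
    return ScorerResult(
        passed=total_usd <= max_usd,
        detail=f"total cost ${total_usd:.4f} vs budget ${max_usd:.2f}",
    )

result = assert_cost_under_sketch(0.031, max_usd=0.05)
assert result.passed  # mirrors: assert assert_cost_under(run, max_usd=0.05).passed
```

Returning a result object rather than raising keeps scorers composable: a test can collect several results and report all failures at once instead of stopping at the first.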

Guide sections

| Section | What you'll learn |
| --- | --- |
| Quickstart | Full working example in 5 minutes |
| Concepts | What cassettes, spans, and fingerprints are |
| Capture API | CaptureContext, record_llm_call, record_tool_call |
| Replay Engine | ReplayEngine, overrides, diffs |
| Mock LLM & Tools | MockLLM, MockTool |
| Scorers | All assert_* functions and Evaluator |
| pytest Plugin | Fixtures and markers for pytest integration |
| CLI Reference | capture, replay, diff, eval, info, mock commands |
| Adapters | Auto-capture for OpenAI, Anthropic, LangGraph, CrewAI |
| CI/CD | GitHub Actions workflows |
| Changelog | What's new in each release |