# User Guide Overview
Welcome to the Evalcraft user guide. Use the links below to navigate to specific topics.
## Core workflow
The typical Evalcraft workflow has three phases:
### 1. Capture (once)
Run your agent with a `CaptureContext` active. Every LLM call, tool invocation, and agent decision is recorded into a cassette (a plain JSON file).
```python
from evalcraft import CaptureContext

with CaptureContext(name="my_test", save_path="tests/cassettes/my_test.json") as ctx:
    ctx.record_input("user prompt")
    result = my_agent.run("user prompt")
    ctx.record_output(result)
```
Commit the cassette to git. This is your ground truth.
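Because a cassette is plain JSON, you can inspect and diff it with standard tooling. A minimal sketch of that idea — the field names below are made up for illustration and may differ from Evalcraft's actual cassette schema:

```python
import json
import os
import tempfile

# A cassette is a plain JSON file, so any JSON tooling can read or diff it.
# These field names are illustrative only, not Evalcraft's actual schema.
cassette = {
    "name": "my_test",
    "input": "user prompt",
    "spans": [],  # would hold recorded LLM calls and tool invocations
}

path = os.path.join(tempfile.mkdtemp(), "my_test.json")
with open(path, "w") as f:
    json.dump(cassette, f, indent=2)

# Round-trip: the file you commit to git is exactly what replay reads back.
with open(path) as f:
    loaded = json.load(f)
assert loaded == cassette
```

Committing the file means your test baseline is versioned alongside your code, and changes to recorded behavior show up in code review as ordinary diffs.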
### 2. Replay (every test run)
Load the cassette and replay it. No API calls. No cost. 200ms.
```python
from evalcraft import replay

run = replay("tests/cassettes/my_test.json")
assert run.replayed is True
```
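To make the replay idea concrete, here is a self-contained sketch of the underlying mechanism — illustrative only, not Evalcraft's implementation: capture appends each call to a recording, and replay serves those recordings back in order instead of hitting a live API.

```python
import json

# Illustrative sketch of capture-then-replay (not Evalcraft's internals).
recorded = []

def captured_llm(prompt):
    # Capture phase: a real LLM call would happen here; its result is recorded.
    response = "recorded answer"
    recorded.append({"prompt": prompt, "response": response})
    return response

captured_llm("user prompt")
cassette = json.dumps(recorded)  # persisted to disk in the real workflow

# Replay phase: responses come from the cassette, so no API is called.
spans = iter(json.loads(cassette))

def replayed_llm(prompt):
    span = next(spans)
    assert span["prompt"] == prompt  # each replayed call must match its recording
    return span["response"]

assert replayed_llm("user prompt") == "recorded answer"
```

This is why replay is fast and free: every response is a local lookup rather than a network round trip.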
### 3. Assert (in CI)
Use the built-in scorers to assert on the replayed run's behavior.
```python
from evalcraft import assert_tool_called, assert_cost_under

assert assert_tool_called(run, "web_search").passed
assert assert_cost_under(run, max_usd=0.05).passed
```
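The scorers return result objects rather than raising, which is why each call ends with `.passed`. A minimal mental model of that shape — using a made-up `ScorerResult` type, not Evalcraft's actual classes:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the result object that assert_* scorers return;
# the real Evalcraft types may carry more detail than this.
@dataclass
class ScorerResult:
    passed: bool
    reason: str = ""

def cost_under_sketch(total_usd: float, max_usd: float) -> ScorerResult:
    # Mirrors the shape of assert_cost_under: check, then report via .passed.
    ok = total_usd <= max_usd
    reason = "" if ok else f"cost ${total_usd} exceeds budget ${max_usd}"
    return ScorerResult(ok, reason)

result = cost_under_sketch(0.03, max_usd=0.05)
assert result.passed
assert not cost_under_sketch(0.10, max_usd=0.05).passed
```

Returning a result object instead of raising lets you collect every failure reason in one run before deciding whether the test passes.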
## Guide sections
| Section | What you'll learn |
|---|---|
| Quickstart | Full working example in 5 minutes |
| Concepts | What cassettes, spans, and fingerprints are |
| Capture API | `CaptureContext`, `record_llm_call`, `record_tool_call` |
| Replay Engine | `ReplayEngine`, overrides, diffs |
| Mock LLM & Tools | `MockLLM`, `MockTool` |
| Scorers | All `assert_*` functions and `Evaluator` |
| pytest Plugin | Fixtures and markers for pytest integration |
| CLI Reference | `capture`, `replay`, `diff`, `eval`, `info`, `mock` commands |
| Adapters | Auto-capture for OpenAI, Anthropic, LangGraph, CrewAI |
| CI/CD | GitHub Actions workflows |
| Changelog | What's new in each release |