Evalcraft

Deterministic tests for AI agents — generated from one real run.

Capture an agent run and evalcraft writes a pytest test that locks in its tool calls, output shape, and cost, then replays it in CI for $0. Like VCR for HTTP, but it writes the agent tests for you.

The problem

Agent testing is broken:

Expensive. Running 200 tests against GPT-4 costs real money. Every commit.
Non-deterministic. Tests fail randomly because LLMs aren't deterministic functions: the same prompt can return different output on every run.
No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft records agent runs as cassettes and replays them deterministically, so the tests that exercise your agent's plumbing (tool wiring, control flow, output shape, cost/latency budgets) drop from 10 minutes and $5 to 200ms and $0. For the questions that genuinely need a live model (quality, drift, LLM-judge, RAG), run live-eval on a schedule.

How it works

  Your Agent
      │
      ▼
┌─────────────┐    record     ┌──────────────┐
│  CaptureCtx │ ────────────► │   Cassette   │  (plain JSON, git-friendly)
│             │               │  (spans[])   │
└─────────────┘               └──────┬───────┘
                                     │
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    │                                 │
                    └──────────────┬──────────────────┘
                                   ▼
                            pytest / CI gate
                           (200ms, $0.00)
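
The record-then-replay flow in the diagram can be illustrated with a minimal, standard-library-only sketch. This is not evalcraft's implementation: the `Recorder` and `Replayer` classes, the span fields (`type`, `name`, `args`, `result`), and the JSON layout are all assumptions made for illustration. It only shows the general idea of storing tool calls in a JSON cassette and serving them back without touching the live tool.

```python
import json
import tempfile
from pathlib import Path


class Recorder:
    """Hypothetical stand-in for a capture context: collects spans in memory."""

    def __init__(self, name):
        self.name = name
        self.spans = []

    def call_tool(self, tool_name, fn, **kwargs):
        # Run the real tool once and remember its arguments and result.
        result = fn(**kwargs)
        self.spans.append(
            {"type": "tool_call", "name": tool_name, "args": kwargs, "result": result}
        )
        return result

    def save(self, path):
        # Plain JSON keeps the cassette diffable in version control.
        payload = {"name": self.name, "spans": self.spans}
        Path(path).write_text(json.dumps(payload, indent=2))


class Replayer:
    """Hypothetical replayer: serves recorded results in order, no live calls."""

    def __init__(self, path):
        self.spans = json.loads(Path(path).read_text())["spans"]
        self.cursor = 0

    def call_tool(self, tool_name, **kwargs):
        span = self.spans[self.cursor]
        # A mismatch means the agent's behavior diverged from the recording.
        if span["name"] != tool_name or span["args"] != kwargs:
            raise AssertionError(f"replay mismatch at span {self.cursor}: {tool_name}")
        self.cursor += 1
        return span["result"]


live_calls = []  # tracks how often the "real" tool is actually invoked


def get_weather(city):
    live_calls.append(city)
    return {"temp": 22, "condition": "sunny"}


cassette_path = Path(tempfile.mkdtemp()) / "weather.json"

# Record once against the live tool.
recorder = Recorder("weather_test")
recorded = recorder.call_tool("get_weather", get_weather, city="Paris")
recorder.save(cassette_path)

# Replay from the cassette; get_weather is never called again.
replayer = Replayer(cassette_path)
replayed = replayer.call_tool("get_weather", city="Paris")
```

In this sketch the replayed result equals the recorded one, and the live tool runs only during recording. Raising on a mismatched call is one possible design choice; it turns a change in tool wiring into a test failure rather than a silent fallback to the live tool.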

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"      # OpenAI SDK adapter
pip install "evalcraft[anthropic]"   # Anthropic SDK adapter
pip install "evalcraft[langchain]"   # LangGraph adapter

# Everything
pip install "evalcraft[all]"

Quick example

from evalcraft import CaptureContext, MockLLM, MockTool
from evalcraft import assert_tool_called, assert_cost_under

# 1. Record a run with mocks
llm = MockLLM()
llm.add_response("*", "It's 22°C and sunny in Paris.")

search = MockTool("get_weather")
search.returns({"temp": 22, "condition": "sunny"})

with CaptureContext(name="weather_test", save_path="tests/cassettes/weather.json") as ctx:
    ctx.record_input("What's the weather in Paris?")
    result = search.call(city="Paris")
    response = llm.complete(f"Weather data: {result}")
    ctx.record_output(response.content)

# 2. Replay from cassette — zero API calls
from evalcraft import replay
run = replay("tests/cassettes/weather.json")

# 3. Assert behavior
assert assert_tool_called(run, "get_weather").passed
assert assert_cost_under(run, max_usd=0.01).passed
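
To make the idea behind step 3 concrete, here is a rough sketch of what scorer-style checks over recorded spans might look like. It deliberately does not use evalcraft: the `Check` dataclass, the `tool_was_called` and `cost_is_under` helpers, and the `cost_usd` span field are hypothetical names chosen for illustration. The only detail mirrored from the example above is a result object exposing `.passed`.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """Hypothetical result object exposing a .passed flag."""
    passed: bool
    message: str


def tool_was_called(spans, tool_name):
    # Collect the names of every recorded tool call.
    names = [s["name"] for s in spans if s["type"] == "tool_call"]
    return Check(tool_name in names, f"tools called: {names}")


def cost_is_under(spans, max_usd):
    # Sum the per-span cost; spans without a cost count as free.
    total = sum(s.get("cost_usd", 0.0) for s in spans)
    return Check(total < max_usd, f"total cost ${total:.4f}, limit ${max_usd:.4f}")


# Assumed span layout; a real cassette may use different fields.
spans = [
    {"type": "tool_call", "name": "get_weather", "cost_usd": 0.0},
    {"type": "llm_call", "name": "complete", "cost_usd": 0.002},
]


def test_weather_agent_plumbing():
    # pytest-style test: fails the build if wiring or budget regresses.
    assert tool_was_called(spans, "get_weather").passed
    assert cost_is_under(spans, max_usd=0.01).passed
    assert not tool_was_called(spans, "send_email").passed
```

Returning a result object instead of raising lets a test report the message alongside the failure, which is one plausible reason for the `.passed` pattern in the example above.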

Documentation

Section	Description
Quickstart	Get running in 5 minutes
Case Study	How a team caught a $50/day regression
Concepts	Cassettes, spans, capture, replay explained
Capture API	Full capture API reference
Replay Engine	Replay and diff cassettes
Mock LLM & Tools	Deterministic mocks for testing
Scorers	Built-in assertion functions
pytest Plugin	Fixtures, markers, and CLI flags
CLI Reference	All 6 CLI commands
Adapters	OpenAI, Anthropic, LangGraph, CrewAI
CI/CD	GitHub Actions integration