
Evalcraft

Deterministic tests for AI agents — generated from one real run.

Capture an agent run and evalcraft writes a pytest that locks its tool calls, output shape, and cost — then replays it in CI for $0. Like VCR for HTTP, but it writes the agent tests for you.



The problem

Agent testing is broken:

  • Expensive. Running 200 tests against GPT-4 costs real money. Every commit.
  • Non-deterministic. Tests fail randomly because the same prompt can produce different outputs from one run to the next.
  • No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft records agent runs as cassettes (like VCR for HTTP) and replays them deterministically — so the tests that exercise your agent's plumbing (tool wiring, control flow, output shape, cost/latency budgets) drop from 10 minutes + $5 to 200ms + $0. For the questions that genuinely need a live model — quality, drift, LLM-judge, RAG — run live-eval on a schedule.
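The record-then-replay idea can be illustrated without any library. The sketch below is not evalcraft's implementation: `record_call`, `replay_call`, `fake_model`, and the JSON layout are hypothetical names chosen for illustration only. It calls a stand-in "model" function once, stores the input/output pair in a JSON file, and then serves the same answer from that file without calling the function again.

```python
import json
import os
import tempfile


def record_call(fn, prompt, path):
    """Call fn(prompt) once and save the input/output pair as JSON."""
    output = fn(prompt)
    with open(path, "w", encoding="utf-8") as f:
        # Hypothetical layout: a list of recorded spans.
        json.dump({"spans": [{"input": prompt, "output": output}]}, f)
    return output


def replay_call(prompt, path):
    """Return the recorded output for prompt without calling the model."""
    with open(path, encoding="utf-8") as f:
        cassette = json.load(f)
    for span in cassette["spans"]:
        if span["input"] == prompt:
            return span["output"]
    raise KeyError(f"no recorded span for prompt: {prompt!r}")


calls = []


def fake_model(prompt):
    # Stand-in for an expensive, non-deterministic LLM call.
    calls.append(prompt)
    return "It's 22°C and sunny in Paris."


cassette_path = os.path.join(tempfile.mkdtemp(), "weather.json")
recorded = record_call(fake_model, "What's the weather in Paris?", cassette_path)
replayed = replay_call("What's the weather in Paris?", cassette_path)
```

After this runs, `replayed` equals `recorded` and `calls` holds a single entry, which is the property that makes replayed tests cheap and repeatable: the model is only consulted during recording.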


How it works

  Your Agent
┌─────────────┐    record     ┌──────────────┐
│  CaptureCtx │ ────────────► │   Cassette   │  (plain JSON, git-friendly)
│             │               │  (spans[])   │
└─────────────┘               └──────┬───────┘
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    │                                 │
                    └──────────────┬──────────────────┘
                            pytest / CI gate
                           (200ms, $0.00)
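To make the scorer branch of the diagram concrete, here is a minimal sketch of assertions evaluated over a list of recorded spans. The span fields (`type`, `name`, `cost_usd`) and the helper names are assumptions made for this example; evalcraft's real cassette schema and its `assert_*` functions may differ.

```python
def tool_was_called(spans, tool_name):
    """True if any recorded span is a call to the named tool."""
    return any(
        s.get("type") == "tool_call" and s.get("name") == tool_name
        for s in spans
    )


def total_cost_usd(spans):
    """Sum the cost recorded on each span; spans without a cost count as 0."""
    return sum(s.get("cost_usd", 0.0) for s in spans)


# Hypothetical spans, shaped as they might appear after loading a cassette.
spans = [
    {"type": "tool_call", "name": "get_weather", "cost_usd": 0.0},
    {"type": "llm_call", "name": "complete", "cost_usd": 0.002},
]

weather_called = tool_was_called(spans, "get_weather")
within_budget = total_cost_usd(spans) < 0.01
```

Because the checks read only recorded data, they give the same result on every run and need no network access, which is what lets them act as a CI gate.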

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"      # OpenAI SDK adapter
pip install "evalcraft[anthropic]"   # Anthropic SDK adapter
pip install "evalcraft[langchain]"   # LangGraph adapter

# Everything
pip install "evalcraft[all]"

Quick example

from evalcraft import CaptureContext, MockLLM, MockTool
from evalcraft import assert_tool_called, assert_cost_under

# 1. Record a run with mocks
llm = MockLLM()
llm.add_response("*", "It's 22°C and sunny in Paris.")

search = MockTool("get_weather")
search.returns({"temp": 22, "condition": "sunny"})

with CaptureContext(name="weather_test", save_path="tests/cassettes/weather.json") as ctx:
    ctx.record_input("What's the weather in Paris?")
    result = search.call(city="Paris")
    response = llm.complete(f"Weather data: {result}")
    ctx.record_output(response.content)

# 2. Replay from cassette — zero API calls
from evalcraft import replay
run = replay("tests/cassettes/weather.json")

# 3. Assert behavior
assert assert_tool_called(run, "get_weather").passed
assert assert_cost_under(run, max_usd=0.01).passed

Documentation

Section             Description
Quickstart          Get running in 5 minutes
Case Study          How a team caught a $50/day regression
Concepts            Cassettes, spans, capture, replay explained
Capture API         Full capture API reference
Replay Engine       Replay and diff cassettes
Mock LLM & Tools    Deterministic mocks for testing
Scorers             Built-in assertion functions
pytest Plugin       Fixtures, markers, and CLI flags
CLI Reference       All 6 CLI commands
Adapters            OpenAI, Anthropic, LangGraph, CrewAI
CI/CD               GitHub Actions integration