
Evalcraft

The pytest for AI agents. Capture, replay, mock, and evaluate agent behavior — without burning API credits on every test run.



The problem

Agent testing is broken:

  • Expensive. Running 200 tests against GPT-4 costs real money, on every commit.
  • Non-deterministic. Tests fail randomly because LLM outputs aren't pure functions of their inputs.
  • No CI/CD story. You can't gate deploys on eval results if evals take 10 minutes and cost $5.

Evalcraft fixes this by recording agent runs as cassettes (like VCR for HTTP), then replaying them deterministically. Your test suite goes from 10 minutes + $5 to 200ms + $0.
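The record-once, replay-forever idea is the same pattern VCR uses for HTTP. A minimal self-contained sketch of that pattern (illustrative only, not Evalcraft's actual implementation; `VCRWrapper` and `fake_llm` are hypothetical names):

```python
import json
import os
import tempfile

class VCRWrapper:
    """Sketch of the VCR pattern: make the expensive call once,
    persist the result as plain JSON, replay it on every later run."""

    def __init__(self, cassette_path, live_fn):
        self.path = cassette_path
        self.live_fn = live_fn  # the expensive call, e.g. a real LLM request

    def call(self, prompt):
        if os.path.exists(self.path):
            # Replay mode: serve the recorded response, zero API calls.
            with open(self.path) as f:
                return json.load(f)[prompt]
        # Record mode: one real call, then persist it.
        result = self.live_fn(prompt)
        with open(self.path, "w") as f:
            json.dump({prompt: result}, f)
        return result

# Demo with a stand-in for a paid API: count how often it is hit.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return "It's 22°C and sunny in Paris."

path = os.path.join(tempfile.mkdtemp(), "weather.json")
wrapped = VCRWrapper(path, fake_llm)
first = wrapped.call("What's the weather in Paris?")   # live call, recorded
second = wrapped.call("What's the weather in Paris?")  # replayed from disk
assert first == second and len(calls) == 1             # the API was hit once
```

The second call never touches the "API" at all, which is what makes replayed suites both free and deterministic.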


How it works

  Your Agent
┌─────────────┐    record     ┌──────────────┐
│  CaptureCtx │ ────────────► │   Cassette   │  (plain JSON, git-friendly)
│             │               │  (spans[])   │
└─────────────┘               └──────┬───────┘
                    ┌────────────────┼────────────────┐
                    ▼                ▼                ▼
              replay()          MockLLM /        assert_*()
           (zero API calls)    MockTool()       (scorers)
                    │                                 │
                    └──────────────┬──────────────────┘
                            pytest / CI gate
                           (200ms, $0.00)
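The diagram describes a cassette only as "plain JSON" holding a spans[] list. The exact schema isn't documented here, so the following is a plausible sketch of what one might contain, round-tripped through json to show why it stays git-friendly:

```python
import json

# Hypothetical cassette layout; the real Evalcraft schema may differ.
cassette = {
    "name": "weather_test",
    "spans": [
        {"type": "input", "value": "What's the weather in Paris?"},
        {"type": "tool_call", "tool": "get_weather",
         "args": {"city": "Paris"},
         "result": {"temp": 22, "condition": "sunny"}},
        {"type": "llm_call",
         "response": "It's 22°C and sunny in Paris.",
         "cost_usd": 0.0021},
        {"type": "output", "value": "It's 22°C and sunny in Paris."},
    ],
}

# Plain JSON keeps cassettes human-readable and diffable in code review.
serialized = json.dumps(cassette, indent=2, ensure_ascii=False)
assert json.loads(serialized) == cassette  # lossless round trip
```

Because each span is an ordinary JSON object, a regression shows up in `git diff` as a changed span rather than an opaque binary blob.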

Install

pip install evalcraft

# With pytest plugin
pip install "evalcraft[pytest]"

# With framework adapters
pip install "evalcraft[openai]"      # OpenAI SDK adapter
pip install "evalcraft[anthropic]"   # Anthropic SDK adapter
pip install "evalcraft[langchain]"   # LangGraph adapter

# Everything
pip install "evalcraft[all]"

Quick example

from evalcraft import CaptureContext, MockLLM, MockTool
from evalcraft import assert_tool_called, assert_cost_under

# 1. Record a run with mocks
llm = MockLLM()
llm.add_response("*", "It's 22°C and sunny in Paris.")

weather = MockTool("get_weather")
weather.returns({"temp": 22, "condition": "sunny"})

with CaptureContext(name="weather_test", save_path="tests/cassettes/weather.json") as ctx:
    ctx.record_input("What's the weather in Paris?")
    result = weather.call(city="Paris")
    response = llm.complete(f"Weather data: {result}")
    ctx.record_output(response.content)

# 2. Replay from cassette — zero API calls
from evalcraft import replay
run = replay("tests/cassettes/weather.json")

# 3. Assert behavior
assert assert_tool_called(run, "get_weather").passed
assert assert_cost_under(run, max_usd=0.01).passed
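Evalcraft's internals aren't shown here, but conceptually replay plus scorers reduces to loading the cassette and walking its spans. A self-contained sketch under that assumption (the helper names and the span layout are illustrative, not the library's API):

```python
import json
import tempfile
from pathlib import Path

def load_spans(path):
    """Stand-in for evalcraft's replay(): parse the cassette, no API calls."""
    return json.loads(Path(path).read_text())["spans"]

def tool_called(spans, name):
    """Rough equivalent of the check behind assert_tool_called."""
    return any(s.get("type") == "tool_call" and s.get("tool") == name
               for s in spans)

def total_cost_usd(spans):
    """Rough equivalent of the sum behind assert_cost_under."""
    return sum(s.get("cost_usd", 0.0) for s in spans)

# Hypothetical cassette content, written to a temp file for the demo.
path = Path(tempfile.mkdtemp()) / "weather.json"
path.write_text(json.dumps({"spans": [
    {"type": "tool_call", "tool": "get_weather", "args": {"city": "Paris"}},
    {"type": "llm_call", "cost_usd": 0.0021},
]}))

spans = load_spans(path)
assert tool_called(spans, "get_weather")
assert total_cost_usd(spans) < 0.01
```

Since every scorer runs against data already on disk, the whole suite stays fast and free regardless of how many assertions you pile on.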

Documentation

  • Quickstart: Get running in 5 minutes
  • Case Study: How a team caught a $50/day regression
  • Concepts: Cassettes, spans, capture, replay explained
  • Capture API: Full capture API reference
  • Replay Engine: Replay and diff cassettes
  • Mock LLM & Tools: Deterministic mocks for testing
  • Scorers: Built-in assertion functions
  • pytest Plugin: Fixtures, markers, and CLI flags
  • CLI Reference: All 6 CLI commands
  • Adapters: OpenAI, Anthropic, LangGraph, CrewAI
  • CI/CD: GitHub Actions integration