Now in public beta · v0.1.0

CI/CD for AI agents.

Catch agent regressions before production.
Open source.

test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")

    assert run.replayed is True
    assert assert_tool_called(run, "get_weather").passed
    assert assert_cost_under(run, max_usd=0.05).passed
    # 200ms, $0.00 — zero API calls
View on GitHub

Agent testing is broken.

Running your agent test suite against live LLMs is slow, expensive, and non-deterministic. Evalcraft fixes all three.

Expensive

$0.01/eval × 200 tests = $2/commit

Running 200 tests against GPT-4 costs real money. Every commit, every PR, every CI run — it adds up fast.
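That math compounds fast. A quick back-of-envelope sketch — the per-eval price, suite size, and commit rate below are illustrative assumptions, not Evalcraft benchmarks:

```python
# Back-of-envelope CI cost for live-LLM evals (illustrative numbers only).
cost_per_eval = 0.01    # USD per test against a live model
tests = 200             # tests in the suite
commits_per_day = 30    # pushes + PRs + re-runs across a small team

per_commit = cost_per_eval * tests             # $2 per CI run
per_month = per_commit * commits_per_day * 22  # ~22 working days

print(f"${per_commit:.2f} per commit, ${per_month:,.2f} per month")
# $2.00 per commit, $1,320.00 per month
```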

Non-deterministic

Same prompt, different output every time

Tests fail randomly because LLMs aren't pure functions. Flaky tests erode trust and slow down your team.

No CI/CD

Agents ship on vibes

You can't gate deploys on eval results if evals take 10 minutes and cost $5. So teams skip testing entirely.

Record once. Replay forever.

Evalcraft records your agent runs as cassettes: plain JSON files you check into git. Then it replays them deterministically at zero cost.

1. Capture

Wrap your agent in CaptureContext. Every LLM call, tool use, and output is recorded.

2. Cassette

Agent runs are saved as .json cassettes. Plain text, git-friendly, human-readable.

3. Replay & Assert

Replay cassettes with zero API calls. Assert on tool calls, cost, latency, and output content.

4. CI Gate

Run pytest in CI. 200 tests in 200ms, $0.00. Gate deploys on eval results.
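The record/replay loop behind these steps can be sketched in a few lines of plain Python. The cassette fields below are invented for illustration and are not Evalcraft's real schema:

```python
import json

# A toy cassette: one recorded run, stored as plain JSON (invented schema).
cassette_json = json.dumps({
    "input": "What's the weather in Paris?",
    "spans": [
        {"type": "tool_call", "name": "get_weather",
         "args": {"city": "Paris"},
         "result": {"temp": 18, "condition": "cloudy"}},
        {"type": "llm_call", "model": "gpt-4o", "cost_usd": 0.0008,
         "output": "It's 18°C and cloudy in Paris right now."},
    ],
    "output": "It's 18°C and cloudy in Paris right now.",
})

def replay(raw: str) -> dict:
    """Return recorded results instead of calling any live API."""
    cassette = json.loads(raw)
    total_cost = sum(s.get("cost_usd", 0.0) for s in cassette["spans"])
    # Deterministic: same file in, same result out, every run, $0 spent live.
    return {"output": cassette["output"], "recorded_cost_usd": total_cost}

run = replay(cassette_json)
assert run["output"] == "It's 18°C and cloudy in Paris right now."
```

Because the file is plain text, a cassette diffs and reviews like any other fixture in git.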

Feels like writing pytest.

If you know Python testing, you already know Evalcraft.

from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    ctx.record_tool_call(
        "get_weather",
        args={"city": "Paris"},
        result={"temp": 18, "condition": "cloudy"},
    )
    ctx.record_llm_call(
        model="gpt-4o",
        input="User asked about weather. Tool returned: cloudy 18°C",
        output="It's 18°C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0008,
    )

    ctx.record_output("It's 18°C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0008

from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18°C and cloudy in Paris right now."

# Advanced: override tool results to test edge cases
from evalcraft import ReplayEngine

engine = ReplayEngine("tests/cassettes/weather.json")
engine.override_tool_result("get_weather", {"temp": -5, "condition": "blizzard"})
run = engine.run()
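Conceptually, overriding a tool result is just editing the cassette before replay. A minimal sketch with an invented dict-based cassette (not Evalcraft's internals):

```python
# Sketch: patch a recorded tool result before replaying it, so the same
# cassette can exercise an edge case. The structure here is invented.
cassette = {
    "spans": [
        {"type": "tool_call", "name": "get_weather",
         "result": {"temp": 18, "condition": "cloudy"}},
    ],
}

def override_tool_result(cassette: dict, tool_name: str, new_result: dict) -> dict:
    for span in cassette["spans"]:
        if span["type"] == "tool_call" and span["name"] == tool_name:
            span["result"] = new_result  # replay now sees the edge case
    return cassette

override_tool_result(cassette, "get_weather", {"temp": -5, "condition": "blizzard"})
assert cassette["spans"][0]["result"]["condition"] == "blizzard"
```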
from evalcraft import MockLLM, MockTool, CaptureContext

llm = MockLLM()
llm.add_response("*", "It's sunny and 22°C.")  # wildcard match

search = MockTool("web_search")
search.returns({"results": [{"title": "Weather Paris", "snippet": "Sunny, 22°C"}]})

with CaptureContext(name="mocked_run", save_path="tests/cassettes/mocked.json") as ctx:
    ctx.record_input("Weather in Paris?")

    search_result = search.call(query="Paris weather today")
    response = llm.complete(f"Search result: {search_result}")

    ctx.record_output(response.content)

search.assert_called(times=1)
search.assert_called_with(query="Paris weather today")
llm.assert_called(times=1)
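A wildcard-matching mock like this is simple machinery. Here is a self-contained sketch of the idea — a toy class, not Evalcraft's actual MockLLM implementation:

```python
import fnmatch

class TinyMockLLM:
    """Toy prompt-pattern mock: no network, fully deterministic."""

    def __init__(self):
        self._responses = []  # (glob pattern, canned reply), in priority order
        self.calls = 0

    def add_response(self, pattern: str, reply: str) -> None:
        self._responses.append((pattern, reply))

    def complete(self, prompt: str) -> str:
        self.calls += 1
        for pattern, reply in self._responses:
            if fnmatch.fnmatch(prompt, pattern):  # "*" matches any prompt
                return reply
        raise AssertionError(f"No mock response for prompt: {prompt!r}")

llm = TinyMockLLM()
llm.add_response("*", "It's sunny and 22°C.")
assert llm.complete("Weather in Paris?") == "It's sunny and 22°C."
assert llm.calls == 1
```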
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

run = replay("tests/cassettes/weather.json")

# Tool assertions
assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_tool_order(run, ["get_weather"]).passed

# Cost & performance
assert assert_cost_under(run, max_usd=0.05).passed

# Output assertions
from evalcraft import assert_output_contains
assert assert_output_contains(run, "Paris").passed

# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_tool_sequence():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_order(run, ["get_weather"])
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message

def test_agent_output():
    run = replay("tests/cassettes/weather.json")
    assert "Paris" in run.cassette.output_text

Works with your stack.

Drop-in adapters that auto-record LLM calls. Zero code changes to your agent logic.

LLM Providers

OpenAI

GPT-5.4 / o3

GPT-5.4, GPT-5-mini, o3, o4-mini, GPT-4.1

evalcraft[openai]

Anthropic

Opus 4.6 / Sonnet 4.6

claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5

evalcraft[anthropic]

Google Gemini

Gemini 3.1 / 2.5

gemini-3.1-pro, gemini-2.5-pro, gemini-2.5-flash

evalcraft[google]

Meta Llama

Llama 4

Llama 4 Scout, Llama 4 Maverick (MoE, 10M ctx)

evalcraft[openai]
Agent Frameworks

LangGraph

v1.0 GA

Graphs, chains, ReAct agents, streaming

evalcraft[langchain]

CrewAI

v1.10

Multi-agent crews, delegation, task callbacks

evalcraft[crewai]

OpenAI Agents SDK

v0.10

Responses API, tool use, realtime agents

evalcraft[openai]

Any LLM

Mistral, Cohere, +

Mistral Large 3, Command A, or any OpenAI-compatible API

evalcraft[openai]

How Evalcraft stacks up.

Evalcraft Braintrust LangSmith Promptfoo
Cassette-based replay
Zero-cost CI testing Partial
pytest-native
Mock LLM / Tools
Framework agnostic
Self-hostable Partial
Observability dashboard
Pricing Free / OSS Paid SaaS Paid SaaS Free / OSS

Start free. Scale when ready.

The core library is open source and always will be. Paid plans add team features and managed infrastructure.

Free

Open source forever

$0/mo
  • Unlimited cassettes
  • All assertions & scorers
  • MockLLM & MockTool
  • pytest plugin
  • CLI tools
Get started

Team

For growing teams

$199/mo
  • Everything in Free
  • Up to 10 seats
  • Shared cassette library
  • CI/CD integrations
  • Priority support

Enterprise

For large organizations

Custom
  • Everything in Team
  • Unlimited seats
  • SSO & RBAC
  • Self-hosted option
  • Dedicated support
Contact us
Star us on GitHub