Now in public beta · v0.1.0

CI/CD for AI agents.

Catch agent regressions before production.
Open source.

test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")

    assert run.replayed is True
    assert assert_tool_called(run, "get_weather").passed
    assert assert_cost_under(run, max_usd=0.05).passed
    # 200ms, $0.00 — zero API calls
View on GitHub

Agent testing is broken.

Running your agent test suite against live LLMs is slow, expensive, and non-deterministic. Evalcraft fixes all three.

Expensive

$0.01/eval × 200 tests = $2/commit

Running 200 tests against GPT-4 costs real money. Every commit, every PR, every CI run — it adds up fast.
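That math compounds fast. A quick back-of-envelope sketch — the per-eval price, suite size, and commit rate below are illustrative assumptions, not Evalcraft benchmarks:

```python
# Back-of-envelope CI cost for live-LLM evals (illustrative numbers only).
cost_per_eval = 0.01    # USD per test against a live model
tests = 200             # tests in the suite
commits_per_day = 30    # pushes + PRs + re-runs across a small team

per_commit = cost_per_eval * tests             # $2 per CI run
per_month = per_commit * commits_per_day * 22  # ~22 working days

print(f"${per_commit:.2f} per commit, ${per_month:,.2f} per month")
# $2.00 per commit, $1,320.00 per month
```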

Non-deterministic

Same prompt, different output every time

Tests fail randomly because LLMs aren't pure functions. Flaky tests erode trust and slow down your team.

No CI/CD

Agents ship on vibes

You can't gate deploys on eval results if evals take 10 minutes and cost $5. So teams skip testing entirely.

Record once. Replay forever.

Evalcraft records your agent runs as cassettes: plain JSON files you check into git. Then it replays them deterministically at zero cost.

1. Capture

Wrap your agent in CaptureContext. Every LLM call, tool use, and output is recorded.

2. Cassette

Agent runs are saved as .json cassettes. Plain text, git-friendly, human-readable.

3. Replay & Assert

Replay cassettes with zero API calls. Assert on tool calls, cost, latency, and output content.

4. CI Gate

Run pytest in CI. 200 tests in 200ms, $0.00. Gate deploys on eval results.
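The record/replay loop behind these steps can be sketched in a few lines of plain Python. The cassette fields below are invented for illustration and are not Evalcraft's real schema:

```python
import json

# A toy cassette: one recorded run, stored as plain JSON (invented schema).
cassette_json = json.dumps({
    "input": "What's the weather in Paris?",
    "spans": [
        {"type": "tool_call", "name": "get_weather",
         "args": {"city": "Paris"},
         "result": {"temp": 18, "condition": "cloudy"}},
        {"type": "llm_call", "model": "gpt-4o", "cost_usd": 0.0008,
         "output": "It's 18°C and cloudy in Paris right now."},
    ],
    "output": "It's 18°C and cloudy in Paris right now.",
})

def replay(raw: str) -> dict:
    """Return recorded results instead of calling any live API."""
    cassette = json.loads(raw)
    total_cost = sum(s.get("cost_usd", 0.0) for s in cassette["spans"])
    # Deterministic: same file in, same result out, every run, $0 spent live.
    return {"output": cassette["output"], "recorded_cost_usd": total_cost}

run = replay(cassette_json)
assert run["output"] == "It's 18°C and cloudy in Paris right now."
```

Because the file is plain text, a cassette diffs and reviews like any other fixture in git.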

Feels like writing pytest.

If you know Python testing, you already know Evalcraft.

from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")

    ctx.record_tool_call(
        "get_weather",
        args={"city": "Paris"},
        result={"temp": 18, "condition": "cloudy"},
    )
    ctx.record_llm_call(
        model="gpt-4o",
        input="User asked about weather. Tool returned: cloudy 18°C",
        output="It's 18°C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0008,
    )

    ctx.record_output("It's 18°C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0008

from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")

assert run.replayed is True
assert run.cassette.output_text == "It's 18°C and cloudy in Paris right now."

# Advanced: override tool results to test edge cases
from evalcraft import ReplayEngine

engine = ReplayEngine("tests/cassettes/weather.json")
engine.override_tool_result("get_weather", {"temp": -5, "condition": "blizzard"})
run = engine.run()
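Conceptually, overriding a tool result is just editing the cassette before replay. A minimal sketch with an invented dict-based cassette (not Evalcraft's internals):

```python
# Sketch: patch a recorded tool result before replaying it, so the same
# cassette can exercise an edge case. The structure here is invented.
cassette = {
    "spans": [
        {"type": "tool_call", "name": "get_weather",
         "result": {"temp": 18, "condition": "cloudy"}},
    ],
}

def override_tool_result(cassette: dict, tool_name: str, new_result: dict) -> dict:
    for span in cassette["spans"]:
        if span["type"] == "tool_call" and span["name"] == tool_name:
            span["result"] = new_result  # replay now sees the edge case
    return cassette

override_tool_result(cassette, "get_weather", {"temp": -5, "condition": "blizzard"})
assert cassette["spans"][0]["result"]["condition"] == "blizzard"
```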
from evalcraft import MockLLM, MockTool, CaptureContext

llm = MockLLM()
llm.add_response("*", "It's sunny and 22°C.")  # wildcard match

search = MockTool("web_search")
search.returns({"results": [{"title": "Weather Paris", "snippet": "Sunny, 22°C"}]})

with CaptureContext(name="mocked_run", save_path="tests/cassettes/mocked.json") as ctx:
    ctx.record_input("Weather in Paris?")

    search_result = search.call(query="Paris weather today")
    response = llm.complete(f"Search result: {search_result}")

    ctx.record_output(response.content)

search.assert_called(times=1)
search.assert_called_with(query="Paris weather today")
llm.assert_called(times=1)
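A wildcard-matching mock like this is simple machinery. Here is a self-contained sketch of the idea — a toy class, not Evalcraft's actual MockLLM implementation:

```python
import fnmatch

class TinyMockLLM:
    """Toy prompt-pattern mock: no network, fully deterministic."""

    def __init__(self):
        self._responses = []  # (glob pattern, canned reply), in priority order
        self.calls = 0

    def add_response(self, pattern: str, reply: str) -> None:
        self._responses.append((pattern, reply))

    def complete(self, prompt: str) -> str:
        self.calls += 1
        for pattern, reply in self._responses:
            if fnmatch.fnmatch(prompt, pattern):  # "*" matches any prompt
                return reply
        raise AssertionError(f"No mock response for prompt: {prompt!r}")

llm = TinyMockLLM()
llm.add_response("*", "It's sunny and 22°C.")
assert llm.complete("Weather in Paris?") == "It's sunny and 22°C."
assert llm.calls == 1
```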
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

run = replay("tests/cassettes/weather.json")

# Tool assertions
assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_tool_order(run, ["get_weather"]).passed

# Cost & performance
assert assert_cost_under(run, max_usd=0.05).passed

# Output assertions
from evalcraft import assert_output_contains
assert assert_output_contains(run, "Paris").passed

# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message

def test_agent_tool_sequence():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_order(run, ["get_weather"])
    assert result.passed, result.message

def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message

def test_agent_output():
    run = replay("tests/cassettes/weather.json")
    assert "Paris" in run.cassette.output_text

Works with your stack.

Drop-in adapters that auto-record LLM calls. Zero code changes to your agent logic.

LLM Providers

OpenAI

GPT-5.4 / o3

GPT-5.4, GPT-5-mini, o3, o4-mini, GPT-4.1

evalcraft[openai]

Anthropic

Opus 4.6 / Sonnet 4.6

claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5

evalcraft[anthropic]

Google Gemini

Gemini 3.1 / 2.5

gemini-3.1-pro, gemini-2.5-pro, gemini-2.5-flash

evalcraft[google]

Meta Llama

Llama 4

Llama 4 Scout, Llama 4 Maverick (MoE, 10M ctx)

evalcraft[openai]
Agent Frameworks

LangGraph

v1.0 GA

Graphs, chains, ReAct agents, streaming

evalcraft[langchain]

CrewAI

v1.10

Multi-agent crews, delegation, task callbacks

evalcraft[crewai]

OpenAI Agents SDK

v0.10

Responses API, tool use, realtime agents

evalcraft[openai]

Any LLM

Mistral, Cohere, +

Mistral Large 3, Command A, or any OpenAI-compatible API

evalcraft[openai]

How Evalcraft stacks up.

Evalcraft Braintrust LangSmith Promptfoo
Cassette-based replay
Zero-cost CI testing Partial
pytest-native
Mock LLM / Tools
Framework agnostic
Self-hostable Partial
Observability dashboard
Pricing Free / OSS Paid SaaS Paid SaaS Free / OSS

Start free. Scale when ready.

The core library is open source and always will be. Paid plans add team features and managed infrastructure.

Free

Open source forever

$0/mo
  • Unlimited cassettes
  • All assertions & scorers
  • MockLLM & MockTool
  • pytest plugin
  • CLI tools
Get started

Team

For growing teams

$199/mo
  • Everything in Free
  • Up to 10 seats
  • Shared cassette library
  • CI/CD integrations
  • Priority support

Enterprise

For large organizations

Custom
  • Everything in Team
  • Unlimited seats
  • SSO & RBAC
  • Self-hosted option
  • Dedicated support
Contact us
Star us on GitHub