```python
from evalcraft import replay, assert_tool_called, assert_cost_under

def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    assert run.replayed is True
    assert assert_tool_called(run, "get_weather").passed
    assert assert_cost_under(run, max_usd=0.05).passed

# 200ms, $0.00 — zero API calls
```
The problem
Running your agent test suite against live LLMs is slow, expensive, and non-deterministic. Evalcraft fixes all three.
$0.01/eval × 200 tests = $2/commit
Running 200 tests against GPT-4 costs real money. Every commit, every PR, every CI run — it adds up fast.
Same prompt, different output every time
Tests fail randomly because LLMs aren't pure functions. Flaky tests erode trust and slow down your team.
Agents ship on vibes
You can't gate deploys on eval results if evals take 10 minutes and cost $5. So teams skip testing entirely.
How it works
Evalcraft records your agent runs as cassettes — plain JSON files you check into git. It then replays them deterministically at zero cost.
Wrap your agent in CaptureContext. Every LLM call, tool use, and output is recorded.
Agent runs are saved as .json cassettes. Plain text, git-friendly, human-readable.
Replay cassettes with zero API calls. Assert on tool calls, cost, latency, output content.
Run pytest in CI. 200 tests in 200ms, $0.00. Gate deploys on eval results.
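Because cassettes are plain JSON, you can inspect them with nothing but the standard library. A minimal sketch of that idea (the field names below are illustrative, not Evalcraft's exact on-disk schema):

```python
import json

# Illustrative cassette layout -- the keys here are assumptions for the
# sketch, not the exact schema Evalcraft writes to disk.
cassette = {
    "name": "weather_agent_test",
    "spans": [
        {"type": "tool_call", "name": "get_weather",
         "args": {"city": "Paris"}, "result": {"temp": 18, "condition": "cloudy"}},
        {"type": "llm_call", "model": "gpt-4o", "cost_usd": 0.0008},
    ],
}

# Plain JSON means readable diffs in code review and easy ad-hoc checks.
tool_names = [s["name"] for s in cassette["spans"] if s["type"] == "tool_call"]
total_cost = sum(s.get("cost_usd", 0.0) for s in cassette["spans"])
print(tool_names, f"${total_cost:.4f}")
```

Anything you can compute from a dict, you can assert on in a test, with no network round trip involved.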
Code examples
If you know Python testing, you already know Evalcraft.
```python
from evalcraft import CaptureContext

with CaptureContext(
    name="weather_agent_test",
    agent_name="weather_agent",
    save_path="tests/cassettes/weather.json",
) as ctx:
    ctx.record_input("What's the weather in Paris?")
    ctx.record_tool_call(
        "get_weather",
        args={"city": "Paris"},
        result={"temp": 18, "condition": "cloudy"},
    )
    ctx.record_llm_call(
        model="gpt-4o",
        input="User asked about weather. Tool returned: cloudy 18°C",
        output="It's 18°C and cloudy in Paris right now.",
        prompt_tokens=120,
        completion_tokens=15,
        cost_usd=0.0008,
    )
    ctx.record_output("It's 18°C and cloudy in Paris right now.")

cassette = ctx.cassette
print(f"Captured {cassette.tool_call_count} tool calls, ${cassette.total_cost_usd:.4f}")
# Captured 1 tool calls, $0.0008
```
```python
from evalcraft import replay

# Loads the cassette and replays all spans — zero LLM calls
run = replay("tests/cassettes/weather.json")
assert run.replayed is True
assert run.cassette.output_text == "It's 18°C and cloudy in Paris right now."

# Advanced: override tool results to test edge cases
from evalcraft import ReplayEngine

engine = ReplayEngine("tests/cassettes/weather.json")
engine.override_tool_result("get_weather", {"temp": -5, "condition": "blizzard"})
run = engine.run()
```
```python
from evalcraft import MockLLM, MockTool, CaptureContext

llm = MockLLM()
llm.add_response("*", "It's sunny and 22°C.")  # wildcard match

search = MockTool("web_search")
search.returns({"results": [{"title": "Weather Paris", "snippet": "Sunny, 22°C"}]})

with CaptureContext(name="mocked_run", save_path="tests/cassettes/mocked.json") as ctx:
    ctx.record_input("Weather in Paris?")
    search_result = search.call(query="Paris weather today")
    response = llm.complete(f"Search result: {search_result}")
    ctx.record_output(response.content)

search.assert_called(times=1)
search.assert_called_with(query="Paris weather today")
llm.assert_called(times=1)
```
```python
from evalcraft import (
    replay,
    assert_tool_called,
    assert_tool_order,
    assert_cost_under,
    assert_output_contains,
)

run = replay("tests/cassettes/weather.json")

# Tool assertions
assert assert_tool_called(run, "get_weather").passed
assert assert_tool_called(run, "get_weather", with_args={"city": "Paris"}).passed
assert assert_tool_order(run, ["get_weather"]).passed

# Cost & performance
assert assert_cost_under(run, max_usd=0.05).passed

# Output assertions
assert assert_output_contains(run, "Paris").passed
```
```python
# tests/test_weather_agent.py
from evalcraft import replay, assert_tool_called, assert_tool_order, assert_cost_under


def test_agent_calls_weather_tool():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_called(run, "get_weather")
    assert result.passed, result.message


def test_agent_tool_sequence():
    run = replay("tests/cassettes/weather.json")
    result = assert_tool_order(run, ["get_weather"])
    assert result.passed, result.message


def test_agent_cost_budget():
    run = replay("tests/cassettes/weather.json")
    result = assert_cost_under(run, max_usd=0.01)
    assert result.passed, result.message


def test_agent_output():
    run = replay("tests/cassettes/weather.json")
    assert "Paris" in run.cassette.output_text
```
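As a suite grows, one hand-written test per cassette gets repetitive; `pytest.mark.parametrize` can fan a single test out over every cassette file. A sketch of the discovery side, using a throwaway directory so it runs standalone (in a real suite the files come from `CaptureContext` and live under `tests/cassettes/`):

```python
import json
import pathlib
import tempfile

# Stand-in cassette directory so the discovery logic below is runnable;
# in a real suite these files are recorded by CaptureContext.
cassette_dir = pathlib.Path(tempfile.mkdtemp())
for name in ("weather", "search", "calendar"):
    (cassette_dir / f"{name}.json").write_text(json.dumps({"name": name}))

CASSETTES = sorted(cassette_dir.glob("*.json"))

# In a pytest module this becomes one test case per cassette file:
#
# @pytest.mark.parametrize("cassette", CASSETTES, ids=lambda p: p.stem)
# def test_replay(cassette):
#     run = replay(str(cassette))
#     assert run.replayed is True

print([p.stem for p in CASSETTES])
```

Adding coverage for a new agent behavior then means recording one more cassette, not writing one more test.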
Integrations
Drop-in adapters that auto-record LLM calls. Zero code changes to your agent logic.
GPT-5.4, GPT-5-mini, o3, o4-mini, GPT-4.1 · `evalcraft[openai]`
claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5 · `evalcraft[anthropic]`
gemini-3.1-pro, gemini-2.5-pro, gemini-2.5-flash · `evalcraft[google]`
Llama 4 Scout, Llama 4 Maverick (MoE, 10M ctx) · `evalcraft[openai]`
Graphs, chains, ReAct agents, streaming · `evalcraft[langchain]`
Multi-agent crews, delegation, task callbacks · `evalcraft[crewai]`
Responses API, tool use, realtime agents · `evalcraft[openai]`
Mistral Large 3, Command A, or any OpenAI-compatible API · `evalcraft[openai]`
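One common way such drop-in adapters achieve "zero code changes" is by wrapping the client's completion method so every call is recorded as a side effect. An illustrative, framework-free sketch of that pattern, not Evalcraft's actual adapter code:

```python
import functools

# A stand-in client; in practice this would be an SDK class like an
# OpenAI or Anthropic client.
class FakeClient:
    def complete(self, prompt):
        return f"echo: {prompt}"

recorded = []  # spans captured by the adapter

def recording(method):
    @functools.wraps(method)
    def wrapper(self, prompt):
        result = method(self, prompt)
        # Capture input and output without the caller changing anything.
        recorded.append({"input": prompt, "output": result})
        return result
    return wrapper

# The adapter patches the method once; agent code calls the client as usual.
FakeClient.complete = recording(FakeClient.complete)

client = FakeClient()
client.complete("hello")
print(recorded)
```

The agent logic never sees the wrapper, which is why adapters can record calls without any edits to application code.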
Comparison
| | Evalcraft | Braintrust | LangSmith | Promptfoo |
|---|---|---|---|---|
| Cassette-based replay | ✓ | — | — | — |
| Zero-cost CI testing | ✓ | — | — | Partial |
| pytest-native | ✓ | — | — | — |
| Mock LLM / Tools | ✓ | — | — | — |
| Framework agnostic | ✓ | ✓ | ✓ | ✓ |
| Self-hostable | ✓ | — | Partial | ✓ |
| Observability dashboard | — | ✓ | ✓ | — |
| Pricing | Free / OSS | Paid SaaS | Paid SaaS | Free / OSS |
Pricing
The core library is open source and always will be. Paid plans add team features and managed infrastructure.
Open source forever
For solo developers
For growing teams
For large organizations