# Changelog
All notable changes to Evalcraft are documented here.
The format follows Keep a Changelog and Semantic Versioning.
## 0.1.0 — 2026-03-05
Initial public release of Evalcraft — the pytest for AI agents.
### Added

#### Core data model

- `Span` — atomic unit of capture, recording every LLM call, tool invocation, agent step, user input, and output with timing, token usage, and cost metadata
- `Cassette` — the fundamental recording unit that stores all spans from a single agent execution; supports fingerprinting for change detection, aggregate metrics, and JSON serialization/deserialization
- `AgentRun` — wrapper for live or replayed agent results
- `EvalResult` / `AssertionResult` — structured pass/fail results for assertions with score tracking
- `SpanKind` enum: `llm_request`, `llm_response`, `tool_call`, `tool_result`, `agent_step`, `user_input`, `agent_output`
- `TokenUsage` — dataclass tracking prompt, completion, and total tokens
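As a rough illustration of the token bookkeeping above, a `TokenUsage`-style record could be sketched in plain Python. The field and property names here are assumptions for illustration, not Evalcraft's exact definitions:

```python
from dataclasses import dataclass


# Illustrative sketch of a token-usage record like the one described
# above; tracks prompt and completion tokens and derives the total.
@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


usage = TokenUsage(prompt_tokens=120, completion_tokens=30)
print(usage.total_tokens)  # 150
```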
#### Capture

- `capture()` context manager — instrument any code block to record spans into a cassette
- `CaptureContext` — configurable capture session with name, agent name, framework tag, and optional auto-save path
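The capture flow can be sketched as a plain context manager that collects span records into an in-memory recording. This is a toy illustration of the idea only; the real `capture()` records rich `Span` objects and can auto-save:

```python
from contextlib import contextmanager


# Toy sketch: yield a mutable recording that code inside the block
# appends span records to. Names here are illustrative.
@contextmanager
def capture(name):
    cassette = {"name": name, "spans": []}
    try:
        yield cassette
    finally:
        pass  # a real implementation might fingerprint and auto-save here


with capture("checkout-agent") as cassette:
    cassette["spans"].append({"kind": "tool_call", "tool": "search"})

print(len(cassette["spans"]))  # 1
```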
#### Replay

- `ReplayEngine` — feeds recorded LLM responses back without making real API calls
- Tool result overriding for isolated replay testing
- `ReplayDiff` — compare two cassettes and detect changes in tool sequence, output text, token count, cost, and span count
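A toy sketch of the diffing idea: compare the tool sequences and token counts of two recorded runs. The function and field names here are hypothetical; the real `ReplayDiff` also covers output text, cost, and span count:

```python
# Compare two recorded runs and report which fields changed.
def diff_runs(old, new):
    changes = []
    if old["tools"] != new["tools"]:
        changes.append(("tool_sequence", old["tools"], new["tools"]))
    if old["tokens"] != new["tokens"]:
        changes.append(("token_count", old["tokens"], new["tokens"]))
    return changes


old_run = {"tools": ["search", "summarize"], "tokens": 900}
new_run = {"tools": ["search", "fetch", "summarize"], "tokens": 1100}

for field, before, after in diff_runs(old_run, new_run):
    print(f"{field}: {before} -> {after}")
```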
#### Mock

- `MockLLM` — deterministic LLM fake with pattern-based response matching (`"*"` wildcard), token usage simulation, cost tracking, and automatic span recording
- `MockTool` — configurable tool fake with `.returns()` / `.raises()` / `.side_effect()` control
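The pattern-based matching described for `MockLLM` can be sketched with `fnmatch`-style wildcards: the first registered pattern that matches the prompt wins. Class and method names below are illustrative, not Evalcraft's API:

```python
from fnmatch import fnmatch


# Sketch of wildcard-based response matching: patterns are tried in
# registration order, so a trailing "*" acts as a catch-all.
class TinyMockLLM:
    def __init__(self):
        self.responses = []  # (pattern, response) pairs, in priority order

    def when(self, pattern, respond):
        self.responses.append((pattern, respond))
        return self

    def complete(self, prompt):
        for pattern, respond in self.responses:
            if fnmatch(prompt, pattern):
                return respond
        raise LookupError(f"no mock response for prompt: {prompt!r}")


llm = TinyMockLLM().when("*weather*", "Sunny, 22C").when("*", "I don't know.")
print(llm.complete("What's the weather in Oslo?"))  # Sunny, 22C
print(llm.complete("Unrelated question"))           # I don't know.
```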
#### Eval scorers — 8 built-in assertions

| Assertion | Description |
|---|---|
| `assert_tool_called` | Verify a tool was invoked; supports `times`, `with_args`, `before`, `after` |
| `assert_tool_order` | Verify tool call sequence (strict or subsequence mode) |
| `assert_no_tool_called` | Verify a tool was never invoked |
| `assert_output_contains` | Verify agent output contains a substring |
| `assert_output_matches` | Verify agent output matches a regex pattern |
| `assert_cost_under` | Enforce a cost budget in USD |
| `assert_latency_under` | Enforce a latency budget in milliseconds |
| `assert_token_count_under` | Enforce a token budget |
`Evaluator` — compose multiple assertions into a single evaluation with aggregate scoring.
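A minimal sketch of the composition idea: run several boolean checks against a recorded run and report the fraction that pass. The check signatures and run fields here are assumptions, not Evalcraft's actual assertion API:

```python
# Hypothetical boolean checks over a recorded run.
def tool_called(run, tool):
    return tool in run["tools"]


def cost_under(run, budget_usd):
    return run["cost_usd"] < budget_usd


# Aggregate score: fraction of checks that pass, in [0, 1].
def evaluate(run, checks):
    results = [check(run) for check in checks]
    return sum(results) / len(results)


run = {"tools": ["search", "summarize"], "cost_usd": 0.004}
score = evaluate(run, [
    lambda r: tool_called(r, "search"),
    lambda r: tool_called(r, "book_flight"),  # fails: tool never called
    lambda r: cost_under(r, 0.01),
])
print(round(score, 2))  # 0.67 (2 of 3 checks pass)
```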
#### Framework adapters

| Adapter | Frameworks |
|---|---|
| `OpenAIAdapter` | OpenAI Python SDK (`chat.completions.create`, sync + async) |
| `AnthropicAdapter` | Anthropic Python SDK (`messages.create`, sync + async); built-in Claude pricing table |
| `LangGraphAdapter` | LangGraph compiled graphs — node executions, LLM calls, tool calls |
| `CrewAIAdapter` | CrewAI `Crew` — kickoff timing, per-agent tool calls, task completions, delegations |
#### pytest plugin (`pytest-evalcraft`)

Auto-registered via `entry_points` — zero-config activation when Evalcraft is installed.

Fixtures: `capture_context`, `mock_llm`, `mock_tool`, `cassette`, `replay_engine`, `evalcraft_cassette_dir`

Markers:

- `@pytest.mark.evalcraft_cassette(path)` — load a cassette for replay-based assertions
- `@pytest.mark.evalcraft_capture(name, save)` — auto-capture the test's agent run
- `@pytest.mark.evalcraft_agent` — tag tests as agent evaluation tests for filtering

CLI options: `--cassette-dir DIR`, `--evalcraft-record {none,new,all}`

Terminal summary: per-test agent run metrics table (tokens, cost, tools, latency, fingerprint) appended to pytest output.
#### CLI (`evalcraft`)

| Command | Description |
|---|---|
| `evalcraft capture <script>` | Run a Python script under capture and save the cassette |
| `evalcraft replay <cassette>` | Replay a cassette and display metrics (`--verbose` shows all spans) |
| `evalcraft diff <old> <new>` | Compare two cassettes side-by-side (`--json` for CI) |
| `evalcraft eval <cassette>` | Run assertions with cost/token/latency/tool thresholds; exits 1 on failure |
| `evalcraft info <cassette>` | Inspect cassette metadata, metrics, tool sequence, and spans |
| `evalcraft mock <cassette>` | Generate ready-to-use `MockLLM` and `MockTool` Python fixtures |
#### Project infrastructure
- MIT license, Python 3.9–3.13 support
- Optional dependency groups: `[pytest]`, `[openai]`, `[anthropic]`, `[langchain]`, `[crewai]`, `[all]`
- Hatchling build system, Ruff linting, mypy strict type checking
- GitHub Actions CI and PyPI publish workflows
- 260 tests at release