Changelog

All notable changes to Evalcraft are documented here.

The format follows Keep a Changelog, and the project adheres to Semantic Versioning.

[0.4.0] — 2026-06-16

Added

Deterministic structured-output & tool-call-argument scorers ($0, offline, no model call): assert_output_json, assert_output_json_schema (dict / .json path / inline JSON / pydantic model; pure-stdlib subset validator that upgrades to full Draft 2020-12 when jsonschema is installed), assert_output_has_keys, assert_output_field, assert_output_value_in, assert_output_value_in_range, assert_match_groups (regex capture groups), and assert_tool_args_match_schema (validate recorded tool-call arguments against a schema). See Structured Output.
generate-tests auto-emits assert_output_json + assert_output_has_keys tests when a recorded output is JSON.
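
To make the idea of a pure-stdlib subset validator concrete, here is a minimal sketch. It does not use Evalcraft's API: `validate_subset` and `check_output_json_schema` are hypothetical names, and the supported keywords (`type` as a single string, `required`, `properties`) are an assumed subset chosen for illustration, not the subset Evalcraft implements.

```python
import json

# Map JSON Schema type names to Python types (a small, illustrative subset).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def validate_subset(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    errors = []
    expected = schema.get("type")
    # Only single, known type names are checked in this sketch.
    if isinstance(expected, str) and expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if not isinstance(value, _TYPES[expected]) or bool_as_number:
            return [f"{path}: expected {expected}"]
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required key '{key}'")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate_subset(value[key], sub_schema, f"{path}.{key}"))
    return errors


def check_output_json_schema(output_text, schema):
    """Parse the output as JSON, then validate it against the schema subset."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError as exc:
        return [f"$: not valid JSON ({exc.msg})"]
    return validate_subset(parsed, schema)
```

Returning a list of errors rather than raising on the first one lets a test report every violation at once, which is one plausible design for an offline scorer.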

[0.3.1] — 2026-06-16

Fixed

README logo now renders on the PyPI project page (switched from a repo-relative image path to an absolute URL). No SDK or docs behavior changes.

[0.3.0] — 2026-06-01

Added

evalcraft check-stale — detect cassettes recorded against a retired or swapped model (CRITICAL, non-zero exit for CI) or a drifted prompt (WARNING), using the provenance each cassette records. See Check Stale.
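
The staleness check described above can be sketched as a comparison between recorded provenance and the current configuration. Everything here is an assumption for illustration: the `Provenance` fields, the use of a SHA-256 prompt hash, and the function names are hypothetical and may not match what Evalcraft stores in a cassette.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Provenance:
    """Hypothetical provenance record; the real cassette fields may differ."""
    model: str
    prompt_sha256: str


def prompt_hash(prompt: str) -> str:
    """Hash the prompt text so drift can be detected without storing it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def check_stale(recorded: Provenance, current_model: str, current_prompt: str):
    """Return (severity, message) findings for one cassette."""
    findings = []
    if recorded.model != current_model:
        findings.append(("CRITICAL", f"model changed: {recorded.model} -> {current_model}"))
    if recorded.prompt_sha256 != prompt_hash(current_prompt):
        findings.append(("WARNING", "prompt drifted since recording"))
    return findings


def exit_code(findings) -> int:
    """Non-zero only when a CRITICAL finding is present, so CI fails on model swaps."""
    return 1 if any(severity == "CRITICAL" for severity, _ in findings) else 0
```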

0.1.0 — 2026-03-05

Initial public release of Evalcraft — the pytest for AI agents.

Added

Core data model

Span — atomic unit of capture, recording every LLM call, tool invocation, agent step, user input, and output with timing, token usage, and cost metadata
Cassette — the fundamental recording unit that stores all spans from a single agent execution; supports fingerprinting for change detection, aggregate metrics, and JSON serialization/deserialization
AgentRun — wrapper for live or replayed agent results
EvalResult / AssertionResult — structured pass/fail results for assertions with score tracking
SpanKind enum: llm_request, llm_response, tool_call, tool_result, agent_step, user_input, agent_output
TokenUsage dataclass tracking prompt, completion, and total tokens
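
As a rough illustration of how these pieces could fit together, the sketch below models a token-usage record, a span, and a cassette with a fingerprint. The field names and the choice to fingerprint only behavior (span kind, name, payload) while ignoring timing are assumptions made for this example; the actual Evalcraft classes may be shaped differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


@dataclass
class Span:
    kind: str  # e.g. "tool_call" or "llm_response"
    name: str
    payload: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0
    usage: TokenUsage = field(default_factory=TokenUsage)


@dataclass
class Cassette:
    name: str
    spans: List[Span] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Aggregate metric: tokens summed over every span."""
        return sum(span.usage.total_tokens for span in self.spans)

    def fingerprint(self) -> str:
        # Hash only behavior (kind, name, payload), not timing, so that
        # identical runs with different latencies fingerprint the same.
        behavior = [[s.kind, s.name, s.payload] for s in self.spans]
        blob = json.dumps(behavior, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]

    def to_json(self) -> str:
        """Serialize the cassette and all nested spans to JSON."""
        return json.dumps(asdict(self), sort_keys=True)
```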

Capture

capture() context manager — instrument any code block to record spans into a cassette
CaptureContext — configurable capture session with name, agent name, framework tag, and optional auto-save path
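
One common way to build a capture context manager is to keep the active session in a context variable that instrumented code appends to. The sketch below shows that pattern under assumed names (`capture_sketch`, `record_span`, and the `get_weather` tool are all hypothetical); it is not Evalcraft's implementation of `capture()`.

```python
import contextvars
import time
from contextlib import contextmanager

# Holds the span list of the active capture session, if any.
_active_spans = contextvars.ContextVar("active_spans", default=None)


@contextmanager
def capture_sketch(name):
    """Collect spans recorded inside the with-block into a plain dict."""
    session = {"name": name, "spans": []}
    token = _active_spans.set(session["spans"])
    try:
        yield session
    finally:
        # Restore the previous state so nested or later code is unaffected.
        _active_spans.reset(token)


def record_span(kind, span_name, **data):
    """Append a span to the active session; a no-op outside of capture."""
    spans = _active_spans.get()
    if spans is not None:
        spans.append({"kind": kind, "name": span_name, "ts": time.time(), **data})


def get_weather(city):
    """Example instrumented tool that records a call span and a result span."""
    record_span("tool_call", "get_weather", args={"city": city})
    result = {"city": city, "temp_c": 21}
    record_span("tool_result", "get_weather", result=result)
    return result
```

Calling `get_weather` inside `with capture_sketch("demo") as session:` fills `session["spans"]`; calling it outside the block records nothing.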

Replay

ReplayEngine — feeds recorded LLM responses back without making real API calls
Tool result overriding for isolated replay testing
ReplayDiff — compare two cassettes and detect changes in tool sequence, output text, token count, cost, and span count
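
The replay and diff ideas above can be sketched in a few lines. This is an illustration under assumptions, not the `ReplayEngine` or `ReplayDiff` API: the class and function names are hypothetical, runs are represented as plain dicts, and the summary keys (`tool_sequence`, `cost_usd`, and so on) are invented for the example.

```python
class ReplaySketch:
    """Serve recorded LLM responses in order instead of calling a model."""

    def __init__(self, recorded_responses, tool_overrides=None):
        self._responses = iter(recorded_responses)
        self._tool_overrides = dict(tool_overrides or {})

    def next_llm_response(self):
        try:
            return next(self._responses)
        except StopIteration:
            raise RuntimeError("agent made more LLM calls than were recorded") from None

    def tool_result(self, tool_name, recorded_result):
        # An override replaces the recorded result for isolated testing.
        return self._tool_overrides.get(tool_name, recorded_result)


def diff_runs(old, new):
    """Compare two run summaries (plain dicts) and list the fields that changed."""
    changed = {}
    for key in ("tool_sequence", "output_text", "total_tokens", "cost_usd", "span_count"):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed
```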

Mock

MockLLM — deterministic LLM fake with pattern-based response matching ("*" wildcard), token usage simulation, cost tracking, and automatic span recording
MockTool — configurable tool fake with .returns() / .raises() / .side_effect() control
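
Pattern-based response matching with a `*` wildcard can be approximated with the standard library's `fnmatch` module. The sketch below assumes first-match-wins ordering and glob semantics (so `?` and `[...]` are also special); `MockLLMSketch`, `add_response`, and `complete` are hypothetical names, and the real `MockLLM` may match differently.

```python
from fnmatch import fnmatchcase


class MockLLMSketch:
    """Deterministic LLM fake: the first rule whose pattern matches wins."""

    def __init__(self, default="(no match)"):
        self._rules = []   # (pattern, response) pairs, checked in insertion order
        self._default = default
        self.calls = []    # prompts seen, kept for later assertions

    def add_response(self, pattern, response):
        self._rules.append((pattern, response))
        return self  # allow chaining

    def complete(self, prompt):
        self.calls.append(prompt)
        for pattern, response in self._rules:
            if fnmatchcase(prompt, pattern):
                return response
        return self._default
```

Registering a catch-all `"*"` rule last gives a fallback response while more specific patterns still take priority.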

Eval scorers — 8 built-in assertions

| Assertion | Description |
| --- | --- |
| `assert_tool_called` | Verify a tool was invoked; supports `times`, `with_args`, `before`, `after` |
| `assert_tool_order` | Verify tool call sequence (strict or subsequence mode) |
| `assert_no_tool_called` | Verify a tool was never invoked |
| `assert_output_contains` | Verify agent output contains a substring |
| `assert_output_matches` | Verify agent output matches a regex pattern |
| `assert_cost_under` | Enforce a cost budget in USD |
| `assert_latency_under` | Enforce a latency budget in milliseconds |
| `assert_token_count_under` | Enforce a token budget |

Evaluator — compose multiple assertions into a single evaluation with aggregate scoring.
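
The difference between strict and subsequence ordering for tool calls is easiest to see in code. The sketch below assumes that strict mode means the recorded sequence equals the expected one exactly, and that subsequence mode allows other calls in between; `tool_order_matches` is a hypothetical helper, not the signature of `assert_tool_order`.

```python
def tool_order_matches(actual, expected, strict=False):
    """Check the order of tool calls in either strict or subsequence mode."""
    if strict:
        # Strict: the two sequences must be identical.
        return list(actual) == list(expected)
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the match, which
    # enforces that expected names appear in order, with gaps allowed.
    return all(name in remaining for name in expected)
```

For a run that called `search`, `fetch`, then `summarize`, the expectation `["search", "summarize"]` passes in subsequence mode but fails in strict mode.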

Framework adapters — 4 adapters

| Adapter | Frameworks |
| --- | --- |
| `OpenAIAdapter` | OpenAI Python SDK (`chat.completions.create`, sync + async) |
| `AnthropicAdapter` | Anthropic Python SDK (`messages.create`, sync + async); built-in Claude pricing table |
| `LangGraphAdapter` | LangGraph compiled graphs — node executions, LLM calls, tool calls |
| `CrewAIAdapter` | CrewAI `Crew` — kickoff timing, per-agent tool calls, task completions, delegations |

pytest plugin (`pytest-evalcraft`)

Auto-registered via entry_points — zero-config activation when evalcraft is installed.

Fixtures: capture_context, mock_llm, mock_tool, cassette, replay_engine, evalcraft_cassette_dir

Markers:
- @pytest.mark.evalcraft_cassette(path) — load a cassette for replay-based assertions
- @pytest.mark.evalcraft_capture(name, save) — auto-capture the test's agent run
- @pytest.mark.evalcraft_agent — tag tests as agent evaluation tests for filtering

CLI options: --cassette-dir DIR, --evalcraft-record {none,new,all}

Terminal summary: per-test agent run metrics table (tokens, cost, tools, latency, fingerprint) appended to pytest output.
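
For readers unfamiliar with how a pytest plugin exposes options and markers, the sketch below uses pytest's standard `pytest_addoption` and `pytest_configure` hooks. The option names, choices, and marker descriptions come from the list above; the default values and help strings are placeholders, and this is not the plugin's actual source.

```python
# Marker descriptions shown by `pytest --markers`.
MARKERS = (
    "evalcraft_cassette(path): load a cassette for replay-based assertions",
    "evalcraft_capture(name, save): auto-capture the test's agent run",
    "evalcraft_agent: tag tests as agent evaluation tests",
)


def pytest_addoption(parser):
    """Register the command-line options under an 'evalcraft' group."""
    group = parser.getgroup("evalcraft")
    group.addoption(
        "--cassette-dir",
        metavar="DIR",
        default=None,  # placeholder default, assumed for this sketch
        help="directory where cassettes are stored",
    )
    group.addoption(
        "--evalcraft-record",
        choices=("none", "new", "all"),
        default="none",  # placeholder default, assumed for this sketch
        help="which cassettes to record",
    )


def pytest_configure(config):
    """Declare the markers so pytest does not warn about unknown marks."""
    for line in MARKERS:
        config.addinivalue_line("markers", line)
```

A package typically makes such hooks load automatically by declaring the module under the `pytest11` entry-point group, which is how pytest discovers installed plugins.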

CLI (`evalcraft`)

| Command | Description |
| --- | --- |
| `evalcraft capture <script>` | Run a Python script under capture and save the cassette |
| `evalcraft replay <cassette>` | Replay a cassette and display metrics (`--verbose` shows all spans) |
| `evalcraft diff <old> <new>` | Compare two cassettes side-by-side (`--json` for CI) |
| `evalcraft eval <cassette>` | Run assertions with cost/token/latency/tool thresholds; exits 1 on failure |
| `evalcraft info <cassette>` | Inspect cassette metadata, metrics, tool sequence, and spans |
| `evalcraft mock <cassette>` | Generate ready-to-use `MockLLM` and `MockTool` Python fixtures |
`evalcraft mock <cassette>`	Generate ready-to-use `MockLLM` and `MockTool` Python fixtures

Project infrastructure

MIT license, Python 3.9–3.13 support
Optional dependency groups: [pytest], [openai], [anthropic], [langchain], [crewai], [all]
Hatchling build system, Ruff linting, mypy strict type checking
GitHub Actions CI and PyPI publish workflows
260 tests at release
