Structured output ($0 shape checks)
Modern agents increasingly return structured JSON — function calling,
response_format, structured outputs. Evalcraft's structured-output scorers let
you lock the shape of that output, and of every tool call, with assertions
that are offline, deterministic, and $0: they read only what the cassette
already recorded and never call a model or the network.
This is the cheap half of agent testing made concrete. Instead of paying an
LLM judge to answer "did the agent return the right fields / call the tool with
the right arguments?", you answer it with a byte-stable pytest assertion that
runs in milliseconds on every commit.
When to reach for these vs. an LLM judge
Use these for shape — is it valid JSON, does it have the required keys, are the values the right type / in the allowed set / in range, did the tool get called with conforming arguments. Use an LLM-as-Judge scorer only for quality — is the prose helpful, faithful, on-tone — which genuinely needs a live model.
The output is valid JSON
from evalcraft import replay, assert_output_json
run = replay("tests/cassettes/extractor.json")
assert assert_output_json(run).passed
# Agents often wrap JSON in prose or a ```json fence — accept that too:
assert assert_output_json(run, embedded=True).passed
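To make the embedded case concrete, here is a minimal stdlib sketch of one way to pull a JSON value out of surrounding prose. It is not evalcraft's implementation: `extract_embedded_json` and the sample reply are hypothetical, and the real `embedded=True` logic may differ.

```python
import json


def extract_embedded_json(text: str):
    """Return the first JSON object or array that parses inside `text`.

    Hypothetical helper for illustration; not part of evalcraft's API.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at `start`
                # and ignores whatever text follows it.
                value, _end = decoder.raw_decode(text, start)
            except json.JSONDecodeError:
                continue  # not valid JSON from this position; keep scanning
            return value
    raise ValueError("no JSON object or array found in text")


# Made-up agent reply: JSON wrapped in prose.
reply = 'Here is the forecast: {"city": "Paris", "temp_c": 21.5} Anything else?'
parsed = extract_embedded_json(reply)
```

Because the scan starts at the first `{` or `[` that parses, the same approach also handles a fenced reply, where the fence marker simply precedes the opening brace.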
The output conforms to a schema
assert_output_json_schema accepts a schema as a dict, a path to a
committed .json file, an inline JSON string, or a pydantic model
class (pydantic is already a core dependency):
from evalcraft import replay, assert_output_json_schema
run = replay("tests/cassettes/weather.json")
assert assert_output_json_schema(run, {
    "type": "object",
    "required": ["city", "temp_c", "status"],
    "properties": {
        "city": {"type": "string", "minLength": 1},
        "temp_c": {"type": "number", "minimum": -90, "maximum": 60},
        "status": {"enum": ["ok", "error"]},
    },
}).passed
…or straight from a pydantic model you already have:
from pydantic import BaseModel
class Weather(BaseModel):
    city: str
    temp_c: float
    status: str
assert assert_output_json_schema(run, Weather).passed
Which schema engine runs
By default the validator uses a small, pure-stdlib JSON-Schema subset
(type, required, properties, enum, const, minimum/maximum/
exclusive bounds, minLength/maxLength, pattern, items, minItems/
maxItems, uniqueItems, anyOf/allOf/oneOf, and local $ref + $defs).
That covers the vast majority of agent-output schemas with zero new
dependencies.
If you install the optional jsonschema package, it is used transparently for
full Draft 2020-12 coverage.
The built-in validator is deliberately strict: if a schema uses a keyword it
does not implement, it raises a clear error pointing you at
evalcraft[schema] — so a test can never get a false PASS from a construct
that was silently ignored. Force a specific engine with engine="builtin" or
engine="jsonschema" (default "auto").
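The "fail loudly on unknown keywords" idea can be sketched with a toy validator. This is an illustration under stated assumptions, not evalcraft's built-in engine: it supports far fewer keywords than the list above, and `validate`, `check_supported`, and `UnsupportedKeyword` are hypothetical names.

```python
# Illustrative sketch only: a tiny strict validator, not evalcraft's code.
SUPPORTED = {"type", "required", "properties", "enum", "minimum", "maximum"}

TYPES = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}


class UnsupportedKeyword(Exception):
    """Raised when a schema uses a keyword this sketch does not implement."""


def check_supported(schema: dict) -> None:
    # Walk the whole schema up front so an unknown keyword is never skipped,
    # even when the property it sits under is absent from the data.
    unknown = set(schema) - SUPPORTED
    if unknown:
        raise UnsupportedKeyword(f"unsupported keyword(s): {sorted(unknown)}")
    for sub in schema.get("properties", {}).values():
        check_supported(sub)


def _validate(value, schema: dict) -> list:
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, TYPES[expected])
        # bool subclasses int in Python, so reject True/False as numbers.
        if isinstance(value, bool) and expected in ("number", "integer"):
            ok = False
        if not ok:
            return [f"expected {expected}, got {type(value).__name__}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{value!r} not in {schema['enum']}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{value} < minimum {schema['minimum']}")
        if "maximum" in schema and value > schema["maximum"]:
            errors.append(f"{value} > maximum {schema['maximum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"missing required key {key!r}")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(f"{key}: {e}" for e in _validate(value[key], sub))
    return errors


def validate(value, schema: dict) -> list:
    """Return a list of error strings; an empty list means the value conforms."""
    check_supported(schema)
    return _validate(value, schema)


schema = {
    "type": "object",
    "required": ["city", "temp_c", "status"],
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number", "minimum": -90, "maximum": 60},
        "status": {"enum": ["ok", "error"]},
    },
}
ok_errors = validate({"city": "Paris", "temp_c": 21.5, "status": "ok"}, schema)
bad_errors = validate({"city": "Paris", "temp_c": 120}, schema)
```

Checking keywords before validating anything is the design choice that rules out a false PASS: a schema containing, say, `"format"` raises `UnsupportedKeyword` in this sketch instead of being quietly accepted.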
Field-level assertions
Use these for quick checks that don't warrant a whole schema. Paths are dotted, with
bracket/index support ("user.id", "items.0.name", "items[0].name"):
from evalcraft import (
    assert_output_has_keys, assert_output_field,
    assert_output_value_in, assert_output_value_in_range,
)
assert assert_output_has_keys(run, ["city", "temp_c"]).passed
assert assert_output_field(run, "city", equals="Paris").passed
assert assert_output_value_in(run, "status", ["ok", "error"]).passed
assert assert_output_value_in_range(run, "temp_c", minimum=-90, maximum=60).passed
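The path syntax can be pictured with a short resolver. This is a sketch of one plausible reading of the dotted and bracketed forms, assuming list segments are integer indices; `resolve_path` and the sample data are hypothetical, and evalcraft's own resolver may handle edge cases differently.

```python
import re

# A path segment is any run of characters that is not ".", "[" or "]".
_SEGMENT = re.compile(r"[^.\[\]]+")


def resolve_path(data, path: str):
    """Follow a path like "items[0].name" or "items.0.name" through nested data.

    Hypothetical helper for illustration; raises KeyError, IndexError, or
    ValueError when the path does not exist in `data`.
    """
    current = data
    for segment in _SEGMENT.findall(path):
        if isinstance(current, list):
            current = current[int(segment)]  # list segments are indices
        else:
            current = current[segment]       # dict segments are keys
    return current


# Made-up parsed output.
output = {"user": {"id": 7}, "items": [{"name": "widget"}, {"name": "gadget"}]}

user_id = resolve_path(output, "user.id")
first_dotted = resolve_path(output, "items.0.name")
first_bracket = resolve_path(output, "items[0].name")
```

Both spellings of the index reduce to the same segments, so `"items.0.name"` and `"items[0].name"` resolve to the same value.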
Regex with capture groups
assert_output_matches tells you whether the output matches a pattern;
assert_match_groups checks what it captured:
from evalcraft import assert_match_groups
# order id "#4521" -> group "4521"
assert assert_match_groups(run, r"#(\d+)", expected_groups=("4521",)).passed
# named groups
assert assert_match_groups(run, r"status=(?P<s>\w+)", expected_named={"s": "done"}).passed
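For reference, this is what those two patterns capture using Python's standard `re` module directly. The sample output string is invented, and whether evalcraft applies `re.search` rather than `re.match` internally is an assumption here.

```python
import re

# Made-up agent output for illustration.
output = "Order #4521 shipped, status=done"

# Positional groups come back as a tuple, in pattern order.
positional = re.search(r"#(\d+)", output).groups()

# Named groups come back as a dict keyed by group name.
named = re.search(r"status=(?P<s>\w+)", output).groupdict()
```

The tuple and dict shapes line up with the `expected_groups=("4521",)` and `expected_named={"s": "done"}` arguments used above.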
Lock the shape of tool-call arguments
The agent-native one. Other tools spend a live LLM to judge whether a tool was
called with sensible arguments; evalcraft validates the recorded tool_args
against a schema deterministically, for $0:
from evalcraft import replay, assert_tool_args_match_schema
run = replay("tests/cassettes/booking.json")
assert assert_tool_args_match_schema(run, "book_flight", {
    "type": "object",
    "required": ["origin", "destination", "date"],
    "properties": {
        "origin": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "destination": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "date": {"type": "string"},
        "cabin": {"enum": ["economy", "business", "first"]},
    },
}).passed
which="all" (default) requires every recorded call to book_flight to
conform; which="any" passes if at least one does.
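The difference between the two modes comes down to `all` versus `any` over the per-call results. The sketch below uses invented call data and a hand-written `conforms` check standing in for schema validation; none of these names come from evalcraft.

```python
import re

# Three uppercase letters, mirroring the "^[A-Z]{3}$" pattern above.
IATA = re.compile(r"[A-Z]{3}")


def conforms(args: dict) -> bool:
    """Stand-in for schema validation: both airports must be 3-letter codes."""
    return all(
        isinstance(args.get(key), str) and IATA.fullmatch(args[key]) is not None
        for key in ("origin", "destination")
    )


# Made-up recorded arguments for two calls to the same tool.
recorded_calls = [
    {"origin": "SFO", "destination": "JFK", "date": "2025-03-01"},
    {"origin": "sfo", "destination": "JFK", "date": "2025-03-02"},  # lowercase
]

results = [conforms(call) for call in recorded_calls]
passes_all = all(results)  # every call must conform
passes_any = any(results)  # one conforming call is enough
```

One thing worth checking against the library itself: Python's `all()` over an empty list is `True`, and this page does not say how evalcraft treats a cassette that recorded no calls to the named tool.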
It's automatic in generate-tests
When you scaffold tests from a cassette whose output is JSON,
evalcraft generate-tests now emits assert_output_json and
assert_output_has_keys(...) tests for you, so the output's shape is locked
from the first run.
Why this is the right layer to own
- Deterministic & $0. No model, no network, no flakiness: runs on every commit in milliseconds and never bills you.
- Git-diffable. The cassette is committed; the schema is committed. A shape regression shows up as a failing test in the PR, not a surprise in production.
- Agent-shaped. Tool-call-argument validation tests the agent's plumbing (did it wire the right arguments into the right tool), which is exactly the layer replay + structural scorers are built to keep fast and committed.