CI/CD Integration¶
Evalcraft ships a reusable GitHub Action that gates pull requests on agent test results. Add it to your workflow in under five minutes.
Quick Start¶
# .github/workflows/agent-tests.yml
name: Agent Tests
on:
pull_request:
branches: [main]
jobs:
agent-tests:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: beyhangl/evalcraft@v0.1.0
with:
test-path: tests/agent_tests/
cassette-dir: tests/cassettes/
max-cost: '0.10'
max-regression: '5%'
This will: 1. Install evalcraft into the runner 2. Run your pytest agent tests in replay mode (no real LLM calls) 3. Post a results table as a PR comment 4. Fail the workflow if any test fails or a threshold is exceeded
Inputs¶
| Input | Default | Description |
|---|---|---|
test-path |
tests/ |
Path to agent test files or directory |
cassette-dir |
tests/cassettes |
Where cassettes are stored/loaded |
record-mode |
none |
none · new · all — see Record Modes |
max-cost |
(none) | Maximum total cost in USD. Fails if exceeded. |
max-regression |
(none) | Max % increase in any cassette metric vs. baseline. |
evalcraft-version |
latest |
Pin to a specific release (e.g. 0.1.0) |
python-version |
3.11 |
Python version to use |
extra-pytest-args |
(none) | Extra flags forwarded to pytest verbatim |
post-comment |
true |
Post PR comment with results table |
github-token |
${{ github.token }} |
Token for posting PR comments |
Outputs¶
| Output | Description |
|---|---|
passed |
"true" if all tests and thresholds passed |
total-cost |
Total USD cost across all cassettes |
total-tokens |
Total token count across all cassettes |
total-tool-calls |
Total tool call count |
cassette-count |
Number of cassette files evaluated |
Use outputs in downstream steps:
- uses: beyhangl/evalcraft@v0.1.0
id: evalcraft
with:
test-path: tests/agent_tests/
- name: Notify on regression
if: steps.evalcraft.outputs.passed == 'false'
run: echo "Agent regressions detected — total cost ${{ steps.evalcraft.outputs.total-cost }}"
Record Modes¶
| Mode | Description | Requires LLM keys |
|---|---|---|
none |
Replay only. Skips tests whose cassette is missing. | No |
new |
Record cassettes that don't exist yet; replay the rest. | Yes |
all |
Always re-record every cassette (live LLM calls). | Yes |
The default none keeps CI fast and deterministic — no real API calls are made.
Threshold Checks¶
max-cost¶
Fails the action if the total cost across all cassettes exceeds the limit.
- uses: beyhangl/evalcraft@v0.1.0
with:
max-cost: '0.10' # fail if agents collectively spend > $0.10
max-regression¶
Compares cassette metrics (tokens, cost) against the baseline versions committed in git. Fails if any metric increases by more than the specified percentage.
- uses: beyhangl/evalcraft@v0.1.0
with:
max-regression: '5%' # fail if any metric regresses > 5%
record-mode: all # re-record so fresh metrics are available
Note:
max-regressionis only meaningful whenrecord-modeisneworall— it compares the freshly-recorded cassettes against the baseline versions already committed to git.
PR Comment¶
When post-comment: true (default) and the action runs on a pull request, evalcraft posts a formatted results table:
## 🧪 Evalcraft Agent Test Results — ✅ Passed
| Metric | Value |
|------------------|-----------|
| Cassettes | 4 |
| Total tokens | 2,840 |
| Total cost | $0.0423 |
| Total tool calls | 12 |
### Per-Cassette Metrics
| Cassette | LLM Calls | Tool Calls | Tokens | Cost | Duration | Fingerprint |
|--------------------|:---------:|:----------:|-------:|--------:|:--------:|:-----------:|
| `math_agent` | 1 | 0 | 312 | $0.0041 | 210ms | `a1b2c3d4` |
| `search_agent` | 3 | 4 | 1,240 | $0.0180 | 1.24s | `e5f6g7h8` |
The action updates the same comment on each push (no comment spam).
Pinning Versions¶
Always pin evalcraft to a specific version in production workflows for reproducibility:
Full Example¶
See .github/workflows/example-ci-gate.yml for a complete workflow you can copy to your project.
name: Agent Tests (Evalcraft CI Gate)
on:
pull_request:
branches: [main]
concurrency:
group: evalcraft-${{ github.ref }}
cancel-in-progress: true
jobs:
agent-tests:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: beyhangl/evalcraft@v0.1.0
with:
test-path: tests/agent_tests/
cassette-dir: tests/cassettes/
max-cost: '0.10'
max-regression: '5%'
record-mode: none
post-comment: 'true'
Nightly Cassette Refresh¶
To keep cassettes up to date with model changes, run a nightly re-record job and open a PR if anything changed:
# .github/workflows/cassette-refresh.yml
name: Nightly cassette refresh
on:
schedule:
- cron: '0 3 * * *' # 3 AM UTC daily
jobs:
refresh:
runs-on: ubuntu-latest
permissions:
contents: write
pull-requests: write
steps:
- uses: actions/checkout@v4
- uses: beyhangl/evalcraft@v0.1.0
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
with:
test-path: tests/agent_tests/
cassette-dir: tests/cassettes/
record-mode: all # re-record every cassette
max-regression: '10%' # allow up to 10% model drift
post-comment: 'false'
- uses: peter-evans/create-pull-request@v6
with:
commit-message: 'chore: refresh evalcraft cassettes'
title: '🤖 Cassette refresh'
branch: cassette-refresh/nightly
labels: cassettes, automated
Required Permissions¶
The action only needs elevated permissions for PR comment posting:
permissions:
contents: read # read cassette files from the repo
pull-requests: write # post / update results comment
If you set post-comment: 'false' you can drop pull-requests: write.
Troubleshooting¶
Tests are skipped in CI
If tests are skipped with "Cassette not found", your cassette files aren't committed to the repo. Either commit the cassettes or switch to record-mode: new (requires LLM API keys as secrets).
PR comment not posting
Ensure permissions: pull-requests: write is set on the job and that the workflow trigger is pull_request (not push).
max-regression not triggering
Regression checks compare freshly-recorded cassettes to the git-committed baseline. Make sure record-mode is new or all so new cassettes are written before the comparison runs.
Pinning evalcraft version in eval step
If you use evalcraft-version: latest but need reproducible results, pin to a release: evalcraft-version: '0.1.0'.