Skip to content

CI/CD Integration

Evalcraft ships a reusable GitHub Action that gates pull requests on agent test results. Add it to your workflow in under five minutes.

Quick Start

# .github/workflows/agent-tests.yml
name: Agent Tests

on:
  pull_request:
    branches: [main]

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: beyhangl/evalcraft@v0.1.0
        with:
          test-path: tests/agent_tests/
          cassette-dir: tests/cassettes/
          max-cost: '0.10'
          max-regression: '5%'

This will: 1. Install evalcraft into the runner 2. Run your pytest agent tests in replay mode (no real LLM calls) 3. Post a results table as a PR comment 4. Fail the workflow if any test fails or a threshold is exceeded


Inputs

Input Default Description
test-path tests/ Path to agent test files or directory
cassette-dir tests/cassettes Where cassettes are stored/loaded
record-mode none none · new · all — see Record Modes
max-cost (none) Maximum total cost in USD. Fails if exceeded.
max-regression (none) Max % increase in any cassette metric vs. baseline.
evalcraft-version latest Pin to a specific release (e.g. 0.1.0)
python-version 3.11 Python version to use
extra-pytest-args (none) Extra flags forwarded to pytest verbatim
post-comment true Post PR comment with results table
github-token ${{ github.token }} Token for posting PR comments

Outputs

Output Description
passed "true" if all tests and thresholds passed
total-cost Total USD cost across all cassettes
total-tokens Total token count across all cassettes
total-tool-calls Total tool call count
cassette-count Number of cassette files evaluated

Use outputs in downstream steps:

- uses: beyhangl/evalcraft@v0.1.0
  id: evalcraft
  with:
    test-path: tests/agent_tests/

- name: Notify on regression
  if: steps.evalcraft.outputs.passed == 'false'
  run: echo "Agent regressions detected — total cost ${{ steps.evalcraft.outputs.total-cost }}"

Record Modes

Mode Description Requires LLM keys
none Replay only. Skips tests whose cassette is missing. No
new Record cassettes that don't exist yet; replay the rest. Yes
all Always re-record every cassette (live LLM calls). Yes

The default none keeps CI fast and deterministic — no real API calls are made.


Threshold Checks

max-cost

Fails the action if the total cost across all cassettes exceeds the limit.

- uses: beyhangl/evalcraft@v0.1.0
  with:
    max-cost: '0.10'   # fail if agents collectively spend > $0.10

max-regression

Compares cassette metrics (tokens, cost) against the baseline versions committed in git. Fails if any metric increases by more than the specified percentage.

- uses: beyhangl/evalcraft@v0.1.0
  with:
    max-regression: '5%'    # fail if any metric regresses > 5%
    record-mode: all        # re-record so fresh metrics are available

Note: max-regression is only meaningful when record-mode is new or all — it compares the freshly-recorded cassettes against the baseline versions already committed to git.


PR Comment

When post-comment: true (default) and the action runs on a pull request, evalcraft posts a formatted results table:

## 🧪 Evalcraft Agent Test Results — ✅ Passed

| Metric           | Value     |
|------------------|-----------|
| Cassettes        | 4         |
| Total tokens     | 2,840     |
| Total cost       | $0.0423   |
| Total tool calls | 12        |

### Per-Cassette Metrics

| Cassette           | LLM Calls | Tool Calls | Tokens | Cost    | Duration | Fingerprint |
|--------------------|:---------:|:----------:|-------:|--------:|:--------:|:-----------:|
| `math_agent`       | 1         | 0          | 312    | $0.0041 | 210ms    | `a1b2c3d4`  |
| `search_agent`     | 3         | 4          | 1,240  | $0.0180 | 1.24s    | `e5f6g7h8`  |

The action updates the same comment on each push (no comment spam).


Pinning Versions

Always pin evalcraft to a specific version in production workflows for reproducibility:

- uses: beyhangl/evalcraft@v0.1.0
  with:
    evalcraft-version: '0.1.0'

Full Example

See .github/workflows/example-ci-gate.yml for a complete workflow you can copy to your project.

name: Agent Tests (Evalcraft CI Gate)

on:
  pull_request:
    branches: [main]

concurrency:
  group: evalcraft-${{ github.ref }}
  cancel-in-progress: true

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: beyhangl/evalcraft@v0.1.0
        with:
          test-path: tests/agent_tests/
          cassette-dir: tests/cassettes/
          max-cost: '0.10'
          max-regression: '5%'
          record-mode: none
          post-comment: 'true'

Nightly Cassette Refresh

To keep cassettes up to date with model changes, run a nightly re-record job and open a PR if anything changed:

# .github/workflows/cassette-refresh.yml
name: Nightly cassette refresh

on:
  schedule:
    - cron: '0 3 * * *'   # 3 AM UTC daily

jobs:
  refresh:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - uses: beyhangl/evalcraft@v0.1.0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        with:
          test-path: tests/agent_tests/
          cassette-dir: tests/cassettes/
          record-mode: all          # re-record every cassette
          max-regression: '10%'    # allow up to 10% model drift
          post-comment: 'false'

      - uses: peter-evans/create-pull-request@v6
        with:
          commit-message: 'chore: refresh evalcraft cassettes'
          title: '🤖 Cassette refresh'
          branch: cassette-refresh/nightly
          labels: cassettes, automated

Required Permissions

The action only needs elevated permissions for PR comment posting:

permissions:
  contents: read        # read cassette files from the repo
  pull-requests: write  # post / update results comment

If you set post-comment: 'false' you can drop pull-requests: write.


Troubleshooting

Tests are skipped in CI

If tests are skipped with "Cassette not found", your cassette files aren't committed to the repo. Either commit the cassettes or switch to record-mode: new (requires LLM API keys as secrets).

PR comment not posting

Ensure permissions: pull-requests: write is set on the job and that the workflow trigger is pull_request (not push).

max-regression not triggering

Regression checks compare freshly-recorded cassettes to the git-committed baseline. Make sure record-mode is new or all so new cassettes are written before the comparison runs.

Pinning evalcraft version in eval step

If you use evalcraft-version: latest but need reproducible results, pin to a release: evalcraft-version: '0.1.0'.