Wire a published Verial benchmark into your CI so every pull request runs the same rollout and a regression fails the build. This guide shows a complete GitHub Actions workflow that starts a run against the v1 API, waits for it to finish, and gates the PR on the resulting score.

Prerequisites

  • A Verial Solver key stored as a repo secret named VERIAL_SOLVER_KEY. Create the Solver in your organization’s dashboard under Solvers, mint a key, and paste it into Settings > Secrets and variables > Actions in GitHub. See Solver Keys.
  • A published benchmark reference (slug@version), for example fax-referral@1. Browse public benchmarks in the dashboard or via the benchmarks API.
  • A way to start your agent in CI. The example below assumes an agent server you can launch with npm run start:agent. Replace that step with whatever is right for your agent.
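With those pieces in place, it can help to fail fast on missing configuration before any API call is made. A minimal sketch of a sanity check you could run as the first `run:` step (the hard-coded fallback values here are illustrative only; in CI both values come from secrets and job `env`):

```shell
# Example fallbacks - in CI these come from secrets/env, never hard-coded.
VERIAL_SOLVER_KEY="${VERIAL_SOLVER_KEY:-sk_example_not_a_real_key}"
BENCHMARK_REF="${BENCHMARK_REF:-fax-referral@1}"

# ":?" aborts the step with the given message if the variable is unset/empty,
# which surfaces a missing secret immediately instead of as a 401 later.
: "${VERIAL_SOLVER_KEY:?add it under Settings > Secrets and variables > Actions}"
: "${BENCHMARK_REF:?set a published benchmark reference, e.g. fax-referral@1}"
echo "config ok: benchmark=$BENCHMARK_REF"
```

A missing secret then fails the job with a readable message rather than an opaque API error several steps later.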

Complete Workflow

# .github/workflows/verial-benchmark.yml
name: Verial benchmark

on:
  pull_request:
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      VERIAL_SOLVER_KEY: ${{ secrets.VERIAL_SOLVER_KEY }}
      BENCHMARK_REF: fax-referral@1
      SCORE_THRESHOLD: "0.85"
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install agent dependencies
        run: npm ci

      - name: Start the agent in the background
        run: npm run start:agent &
        # Replace with your agent boot command, and wait until the agent
        # is ready to serve requests before the benchmark step runs.

      - name: Run Verial benchmark
        id: run
        run: |
          RESPONSE=$(curl -fsS -X POST https://api.verial.ai/v1/benchmark-runs \
            -H "Authorization: Bearer $VERIAL_SOLVER_KEY" \
            -H "Content-Type: application/json" \
            -d "{\"benchmark\": \"$BENCHMARK_REF\", \"scored\": true}")

          echo "$RESPONSE" > run.json
          RUN_ID=$(jq -r '.benchmark_run_id // empty' run.json)
          RUN_TOKEN=$(jq -r '.bearer_token // empty' run.json)

          if [ -z "$RUN_ID" ] || [ -z "$RUN_TOKEN" ]; then
            echo "Failed to create benchmark run:" >&2
            cat run.json >&2
            exit 1
          fi

          echo "run_id=$RUN_ID" >> "$GITHUB_OUTPUT"
          echo "::add-mask::$RUN_TOKEN"
          echo "RUN_TOKEN=$RUN_TOKEN" >> "$GITHUB_ENV"

      - name: Drive the rollout
        run: node scripts/drive-verial-run.js "${{ steps.run.outputs.run_id }}"
        # Your agent reads from /v1/benchmark-runs/{run_id}/... and calls
        # POST /v1/task-runs/{id}/complete when done. See the Quickstart.

      - name: Fetch final score
        id: score
        run: |
          RESULT=$(curl -fsS \
            "https://api.verial.ai/v1/benchmark-runs/${{ steps.run.outputs.run_id }}" \
            -H "Authorization: Bearer $RUN_TOKEN")
          echo "$RESULT" > result.json
          SCORE=$(jq -r '.score // 0' result.json)
          VERDICT=$(jq -r '.verdict // "fail"' result.json)
          echo "Run verdict=$VERDICT score=$SCORE"
          echo "score=$SCORE" >> "$GITHUB_OUTPUT"
          echo "verdict=$VERDICT" >> "$GITHUB_OUTPUT"

      - name: Fail if score dropped
        run: |
          awk -v s="${{ steps.score.outputs.score }}" -v t="$SCORE_THRESHOLD" \
            'BEGIN { if (s+0 < t+0) { exit 1 } }'
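The Fetch final score step assumes your driver blocks until the run has finished. If your driver instead returns as soon as work is submitted, poll the run until it leaves the running state before fetching the score. A sketch under that assumption (`wait_for_run` is a hypothetical helper, and the `status` field name is an assumption; check the Benchmark Runs reference for the real payload shape):

```shell
# Wait until a benchmark run is no longer "running".
# $1 is any command that prints the current status; the retry count and
# poll interval are bounded so a stuck run still fails the job.
wait_for_run() {
  local get_status=$1 tries=0
  local max_tries=${MAX_TRIES:-60} interval=${POLL_INTERVAL:-5}
  while [ "$(eval "$get_status")" = "running" ]; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "timed out waiting for run" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# In the workflow you would pass a curl+jq command, for example:
#   wait_for_run 'curl -fsS -H "Authorization: Bearer $RUN_TOKEN" \
#     "https://api.verial.ai/v1/benchmark-runs/$RUN_ID" | jq -r .status'
```

With the defaults this gives the run up to five minutes to finish; tune `MAX_TRIES` and `POLL_INTERVAL` to your benchmark's typical duration.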
The run is created with scored: true, which withholds evidence (details and field_results) from the agent at completion time so your agent cannot learn the rubric between CI runs. You can still fetch the full evidence after the fact from GET /criterion-runs/{id} using an organization API key. See Benchmark Runs.
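Once you have that evidence, jq makes triage scriptable, for example printing only the failing fields from a failing criterion. A sketch against a hypothetical payload (the field names here are assumptions; the real schema is in the Run Results guide):

```shell
# Hypothetical criterion-run payload - the real schema may differ.
cat > criterion-run.json <<'EOF'
{
  "verdict": "fail",
  "field_results": [
    { "field": "patient_name", "verdict": "pass" },
    { "field": "fax_number", "verdict": "fail", "details": "missing area code" }
  ]
}
EOF

# Print only the failing fields with their details.
jq -r '.field_results[] | select(.verdict == "fail") | "\(.field): \(.details)"' \
  criterion-run.json
```

Piping that output into the job log (or a PR comment) gives reviewers the failing fields without opening the dashboard.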

Using the SDK / CLI

The same flow works with the TypeScript SDK if your CI already installs Node dependencies:
- name: Run benchmark via SDK
  env:
    VERIAL_SOLVER_KEY: ${{ secrets.VERIAL_SOLVER_KEY }}
  run: |
    npm install @verial-ai/sdk
    node -e "
      const { Verial } = require('@verial-ai/sdk');
      const v = new Verial({ apiKey: process.env.VERIAL_SOLVER_KEY });
      // ... see /sdk/usage for the full run-driving pattern ...
    "
The Verial CLI ships with the SDK (npx @verial-ai/sdk <command>). See the CLI reference for available commands. The curl-based workflow above is the lowest-dependency path and works without any Node setup in the CI job where your agent runs.

Scheduled Regression Runs

GitHub Actions’ schedule trigger lets you re-run the same benchmark on a cadence, independent of pull requests:
on:
  schedule:
    - cron: "0 8 * * 1" # every Monday at 08:00 UTC
See Scheduled Runs for more on this pattern.

Next Steps

Webhooks

Get notified asynchronously when a long run finishes rather than polling.

Running a Benchmark

The deeper guide: browsing benchmarks, versions, comparing runs.

Solver Keys

Create, rotate, and scope the key your CI uses.

Run Results

Read back a failing run top-down with evidence.