A benchmark run is a single execution of a benchmark. The agent drives a rollout for each task against a provisioned playground. When a task run completes, the verification engine runs each criterion against the final sandbox state and produces Criterion Runs, a task score, and a verdict.
Lifecycle
- Created. Benchmark run is created, a playground is provisioned, TaskRun rows are created in phase `created`.
- Started. The agent calls `POST /v1/task-runs/{id}/start` on the first task.
- Completed. All task runs are complete; the benchmark run score is the mean of task scores.
Two Entry Points
| Flow | Auth | Typical caller |
|---|---|---|
| Internal `POST /benchmark-runs` | Organization API key (Bearer) | Verial tooling |
| Public `POST /v1/benchmark-runs` | Solver key (Bearer) | External developers running a published benchmark from their own organization’s Solver |
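As a rough illustration of the public entry point, the sketch below builds (but does not send) the two requests an external caller would issue: creating a run and starting its first task. The base URL, the `benchmark_id` body field, the IDs, and the use of the Solver key on the start call are all assumptions made for this example; only the paths and the Bearer scheme come from this page.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical host, used only for illustration; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_post(path: str, bearer_key: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated POST request without sending it."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer_key}",
            "Content-Type": "application/json",
        },
    )


# Public flow: authenticate with a Solver key.
# "benchmark_id" is a hypothetical body field, not a documented schema.
create_req = build_post("/v1/benchmark-runs", "SOLVER_KEY", {"benchmark_id": "bm_123"})

# Starting the first task moves the run from Created to Started.
# "tr_456" is a placeholder task-run ID.
start_req = build_post("/v1/task-runs/tr_456/start", "SOLVER_KEY")

print(create_req.get_method(), create_req.full_url)
print(start_req.get_method(), start_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; the internal flow would differ only in the path (`/benchmark-runs`) and in using an Organization API key.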
The v1 Flow
Scoring
Criterion Score
Every criterion produces a `score` in [0, 1].
Task Score
The task score is the weighted mean of its criterion scores.
Per-Axis Scores
Criteria sharing an `axis` are aggregated into a per-axis weighted mean, returned in `axes` on the completion response.
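The two aggregations can be sketched as follows, reading "weighted mean" in the usual sense: the sum of score times weight divided by the sum of weights. The `CriterionResult` type, its `weight` field name, and the zero-weight fallback are assumptions made for this example rather than the documented schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CriterionResult:
    score: float   # criterion score in [0, 1]
    weight: float  # hypothetical field name for the criterion's weight
    axis: str      # criteria sharing an axis are aggregated together


def weighted_mean(results):
    """Return sum(score * weight) / sum(weight)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0  # assumption: treat an all-zero-weight group as a zero score
    return sum(r.score * r.weight for r in results) / total_weight


def task_score(results):
    """Weighted mean across all criteria of a task."""
    return weighted_mean(results)


def axis_scores(results):
    """Weighted mean per axis, keyed by axis name."""
    by_axis = defaultdict(list)
    for r in results:
        by_axis[r.axis].append(r)
    return {axis: weighted_mean(group) for axis, group in by_axis.items()}


results = [
    CriterionResult(score=1.0, weight=2.0, axis="accuracy"),
    CriterionResult(score=0.5, weight=1.0, axis="accuracy"),
    CriterionResult(score=0.0, weight=1.0, axis="safety"),
]
print(task_score(results))   # 0.625
print(axis_scores(results))
```

Here the task score is (1.0·2 + 0.5·1 + 0.0·1) / 4 = 0.625, while the `accuracy` axis alone averages to 2.5 / 3 and the `safety` axis to 0.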
Benchmark Run Score
Mean of task scores.
Verdict
- `pass` when the aggregate is `>= 0.9`
- `partial` when it is above zero
- `fail` otherwise
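A minimal sketch of the run-level mean and the verdict thresholds is shown below. The function names are hypothetical, and since this page does not pin down which aggregate the verdict is computed from, `verdict` simply takes whichever aggregate score applies.

```python
from statistics import mean


def benchmark_run_score(task_scores):
    """Unweighted mean of task scores for a benchmark run."""
    if not task_scores:
        raise ValueError("a benchmark run needs at least one task score")
    return mean(task_scores)


def verdict(aggregate):
    """Map an aggregate score in [0, 1] to pass / partial / fail."""
    if aggregate >= 0.9:
        return "pass"
    if aggregate > 0:
        return "partial"
    return "fail"


run_score = benchmark_run_score([1.0, 0.5, 0.75])
print(run_score)           # 0.75
print(verdict(run_score))  # partial
print(verdict(0.9))        # pass (the threshold is inclusive)
print(verdict(0.0))        # fail
```

Note that the boundary cases follow the rules as written: exactly 0.9 passes, and only a score of exactly zero fails.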
Scored vs Unscored Runs
Benchmark runs can be created with `scored: true`. When the run is scored, completion responses omit details and per-field evidence to avoid leaking the rubric to the agent. You can always retrieve the full evidence later via `GET /criterion-runs/{id}`.
Cancellation
A run can be cancelled via `POST /benchmark-runs/{id}/cancel`. Completed task results are retained.
Comparing Runs
Run the same benchmark multiple times to measure improvements or regressions. Each run is independent.
Next Steps
- Criteria: Understand how typed assertions score task runs.
- Quick Start: Drive a v1 benchmark run end to end.