Benchmark Runs

A benchmark run is a single execution of a benchmark. The agent drives a rollout for each task against a provisioned playground. When a task run completes, the verification engine runs each criterion against the final sandbox state and produces Criterion Runs, a task score, and a verdict.

Lifecycle

Created. The benchmark run is created, a playground is provisioned, and TaskRun rows are created in the created phase.
Started. The agent calls POST /v1/task-runs/{id}/start on the first task.
Completed. All task runs are complete; the benchmark run score is the mean of task scores.
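
The lifecycle rules above can be sketched as a function that derives the benchmark run's phase from the phases of its task runs. This is a minimal illustration, not the service's implementation: the function name is hypothetical, and the phase strings "started" and "completed" are assumed (only "created" appears in the text above).

```python
def benchmark_run_phase(task_phases: list[str]) -> str:
    """Derive a benchmark run phase from its task run phases (illustrative)."""
    # Completed: every task run has finished.
    if task_phases and all(p == "completed" for p in task_phases):
        return "completed"
    # Started: at least one task run has left the initial phase.
    if any(p != "created" for p in task_phases):
        return "started"
    # Created: no task run has been started yet.
    return "created"
```

Under these assumptions, a run with one completed task and one untouched task is still reported as started.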

Two Entry Points

Flow	Auth	Typical caller
Internal `POST /benchmark-runs`	Organization API key (Bearer)	Verial tooling
Public `POST /v1/benchmark-runs`	Solver key (Bearer)	External developers running a published benchmark from their own organization’s Solver

External integrations should use the v1 flow. See the Quick Start for an end-to-end walkthrough.

The v1 Flow

Scoring

Criterion Score

Every criterion produces a score in [0, 1].

Task Score

Weighted mean across criteria:

task_score = sum(score_i * weight_i) / sum(weight_i)
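
As a worked example, the formula can be written as a short function. This is a sketch: the function name and the (score, weight) tuple layout are assumptions made for illustration, and the zero-weight check is an added safeguard that the text does not specify.

```python
def task_score(criteria: list[tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs, each score in [0, 1]."""
    total_weight = sum(weight for _, weight in criteria)
    if total_weight == 0:
        # Assumed behavior: the formula is undefined without any weight.
        raise ValueError("at least one criterion must have a nonzero weight")
    return sum(score * weight for score, weight in criteria) / total_weight


# Three criteria: (1.0 * 2 + 0.5 * 1 + 0.0 * 1) / (2 + 1 + 1) = 0.625
example = task_score([(1.0, 2.0), (0.5, 1.0), (0.0, 1.0)])
```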

Per-Axis Scores

Criteria sharing an axis are aggregated into a per-axis weighted mean, returned in axes on the completion response.
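
The per-axis aggregation can be sketched the same way, grouping criteria by axis before taking each weighted mean. The dictionary keys ("axis", "score", "weight") and the axis names in the example are hypothetical; the actual response shape may differ.

```python
from collections import defaultdict


def axis_scores(criteria: list[dict]) -> dict[str, float]:
    """Weighted mean score per axis (illustrative field names)."""
    weighted_sum: dict[str, float] = defaultdict(float)
    weight_sum: dict[str, float] = defaultdict(float)
    for criterion in criteria:
        axis = criterion["axis"]
        weighted_sum[axis] += criterion["score"] * criterion["weight"]
        weight_sum[axis] += criterion["weight"]
    # Skip axes whose total weight is zero to avoid dividing by zero.
    return {
        axis: weighted_sum[axis] / weight_sum[axis]
        for axis in weight_sum
        if weight_sum[axis] > 0
    }


example_axes = axis_scores([
    {"axis": "correctness", "score": 1.0, "weight": 3.0},
    {"axis": "correctness", "score": 0.0, "weight": 1.0},
    {"axis": "safety", "score": 0.5, "weight": 2.0},
])
# correctness: 3.0 / 4.0 = 0.75, safety: 1.0 / 2.0 = 0.5
```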

Benchmark Run Score

Mean of task scores.

Verdict

pass when the aggregate score is >= 0.9
partial when it is above zero but below 0.9
fail otherwise (the aggregate is zero)
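
The thresholds translate directly into code. In this sketch the function name is hypothetical, and the argument is whichever aggregate score the verdict is computed over.

```python
def verdict(aggregate: float) -> str:
    """Map an aggregate score in [0, 1] to a verdict string."""
    if aggregate >= 0.9:
        return "pass"
    if aggregate > 0:
        return "partial"
    return "fail"
```

Note that the pass threshold is inclusive, so an aggregate of exactly 0.9 passes.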

Scored vs Unscored Runs

Benchmark runs can be created with scored: true. When the run is scored, completion responses omit details and per-field evidence to avoid leaking the rubric to the agent. You can always retrieve the full evidence later via GET /criterion-runs/{id}.

Cancellation

A run can be cancelled via POST /benchmark-runs/{id}/cancel. Completed task results are retained.
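
A cancellation request might be built as follows. This is a sketch under stated assumptions: BASE_URL is a placeholder rather than the real host, the helper name is hypothetical, and Bearer authentication with an organization API key is assumed from the entry-point table rather than documented for this endpoint.

```python
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real API


def build_cancel_request(run_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /benchmark-runs/{id}/cancel request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/benchmark-runs/{run_id}/cancel",
        method="POST",
        # Assumed auth scheme: Bearer token, as for the other endpoints.
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Passing the returned object to urllib.request.urlopen would send the request.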

Comparing Runs

Run the same benchmark multiple times to measure improvements or regressions. Each run is independent.
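
One simple way to compare two runs is to subtract task scores pairwise. The sketch below assumes you have already collected each run's task scores into a dictionary keyed by a task identifier; the function name and the identifiers are hypothetical.

```python
def score_deltas(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Per-task score change (candidate minus baseline) for shared tasks."""
    shared = baseline.keys() & candidate.keys()
    return {task: candidate[task] - baseline[task] for task in sorted(shared)}


deltas = score_deltas(
    {"task-a": 0.5, "task-b": 1.0},
    {"task-a": 0.75, "task-b": 0.5},
)
# task-a improved by 0.25; task-b regressed by 0.5
```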

Next Steps

Criteria

Quick Start