A benchmark run is a single execution of a benchmark. The agent drives a rollout for each task against a provisioned playground. When a task run completes, the verification engine runs each criterion against the final sandbox state and produces Criterion Runs, a task score, and a verdict.

Lifecycle

  1. Created. The benchmark run is created, a playground is provisioned, and TaskRun rows are created in the created phase.
  2. Started. The agent calls POST /v1/task-runs/{id}/start on the first task.
  3. Completed. All task runs are complete; the benchmark run score is the mean of task scores.
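
The Started step above can be sketched as a request builder. This is a minimal illustration, not the official client: the base URL is a placeholder, the empty request body is an assumption, and using a Solver key as the Bearer token on this endpoint is inferred from the v1 flow rather than stated for this route. The function only builds the request; it does not send it.

```python
import urllib.request

# Assumption: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.invalid"


def build_start_request(
    task_run_id: str,
    solver_key: str,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) the POST that starts a task run."""
    url = f"{base_url}/v1/task-runs/{task_run_id}/start"
    return urllib.request.Request(
        url,
        data=b"",  # assumption: the start call needs no request body
        method="POST",
        # assumption: v1 task-run endpoints accept the Solver key as a Bearer token
        headers={"Authorization": f"Bearer {solver_key}"},
    )
```

Passing the returned object to urllib.request.urlopen would perform the call; keeping construction separate from sending makes the request easy to inspect or log first.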

Two Entry Points

| Flow | Auth | Typical caller |
| --- | --- | --- |
| Internal POST /benchmark-runs | Organization API key (Bearer) | Verial tooling |
| Public POST /v1/benchmark-runs | Solver key (Bearer) | External developers running a published benchmark from their own organization's Solver |

External integrations should use the v1 flow. See the Quick Start for an end-to-end walkthrough.

The v1 Flow

Scoring

Criterion Score

Every criterion produces a score in [0, 1].

Task Score

Weighted mean across criteria:
task_score = sum(score_i * weight_i) / sum(weight_i)
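
As a worked illustration of the formula above, the following sketch computes a task score from (score, weight) pairs. The function name, the input shape, and the zero-weight error are assumptions made for the example; the page does not say how the service handles a total weight of zero.

```python
def task_score(criteria):
    """Weighted mean of criterion scores.

    criteria: iterable of (score, weight) pairs, each score in [0, 1].
    """
    pairs = list(criteria)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        # Assumption: treat an all-zero weighting as an error.
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for score, weight in pairs) / total_weight
```

For example, scores 1.0, 0.5, and 0.0 with weights 2, 1, and 1 give (2.0 + 0.5 + 0.0) / 4 = 0.625.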

Per-Axis Scores

Criteria sharing an axis are aggregated into a per-axis weighted mean, returned in axes on the completion response.
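
A minimal sketch of that aggregation, assuming each criterion carries an axis label, a score, and a weight. The tuple shape and the axis names in the usage note are hypothetical; only the per-axis weighted mean comes from the description above.

```python
from collections import defaultdict


def axis_scores(criteria):
    """Group criteria by axis and return a weighted mean per axis.

    criteria: iterable of (axis, score, weight) tuples.
    """
    weighted_sum = defaultdict(float)
    weight_sum = defaultdict(float)
    for axis, score, weight in criteria:
        weighted_sum[axis] += score * weight
        weight_sum[axis] += weight
    # Assumption: axes whose total weight is zero are left out of the result.
    return {
        axis: weighted_sum[axis] / weight_sum[axis]
        for axis in weight_sum
        if weight_sum[axis] > 0
    }
```

With two equally weighted criteria on a hypothetical "accuracy" axis scoring 1.0 and 0.5, and one criterion on a hypothetical "safety" axis scoring 1.0, the result is 0.75 for accuracy and 1.0 for safety.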

Benchmark Run Score

Mean of task scores.
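
A short sketch of the run-level score as an unweighted mean of task scores. The function name and the empty-input error are assumptions; the page does not say how cancelled or incomplete task runs are counted.

```python
from statistics import fmean


def benchmark_run_score(task_scores):
    """Unweighted arithmetic mean of per-task scores."""
    scores = list(task_scores)
    if not scores:
        # Assumption: a run with no task scores has no defined score.
        raise ValueError("at least one task score is required")
    return fmean(scores)
```

For task scores 1.0, 0.5, and 0.0, the benchmark run score is 0.5.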

Verdict

  • pass when the aggregate score is >= 0.9
  • partial when it is above zero but below 0.9
  • fail otherwise (the aggregate score is zero)
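
The thresholds above map to a small function. This is a sketch of the rule as stated on this page; the function name is illustrative, and which aggregate the service passes in (task score or run score) is not specified here.

```python
def verdict(aggregate: float) -> str:
    """Map an aggregate score in [0, 1] to a verdict string."""
    if aggregate >= 0.9:
        return "pass"
    if aggregate > 0:
        return "partial"
    return "fail"
```

Note that the boundary is inclusive: an aggregate of exactly 0.9 is a pass, and only an aggregate of exactly zero is a fail.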

Scored vs Unscored Runs

Benchmark runs can be created with scored: true. When the run is scored, completion responses omit details and per-field evidence to avoid leaking the rubric to the agent. You can always retrieve the full evidence later via GET /criterion-runs/{id}.

Cancellation

A run can be cancelled via POST /benchmark-runs/{id}/cancel. Completed task results are retained.

Comparing Runs

Run the same benchmark multiple times to measure improvements or regressions. Each run is independent.

Next Steps

Criteria

Understand how typed assertions score task runs.

Quick Start

Drive a v1 benchmark run end to end.