The Three Levels
1. Benchmark Run
GET /v1/benchmark-runs/{id} returns the aggregate score, verdict, and a summary of every task run in the rollout. Authenticate with the run’s bearer token (returned when you created the run) or a Solver key.
2. Task Run
GET /task-runs/{id} returns a single task run’s verdict, score, axes, the frozen task snapshot, and an array of criterion_runs. This is the internal endpoint; callers use an organization API key.
Each entry in criterion_runs is a summary carrying only id, passed, and score. To pull the full evidence payload for a specific criterion, follow the drill-down in step 3. See Task Runs for the full shape.
3. Criterion Run
GET /criterion-runs/{id} is the authoritative record of how one criterion scored. Its evidence field holds the structured payload that justifies the passed and score values.
The evidence shape depends on the criterion’s check type. The Verification concept page lists every shape.
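The three lookups can be sketched as small URL helpers. BASE_URL and the helper names below are assumptions for illustration; only the endpoint paths and the bearer-token scheme come from the text above.

```python
# Hypothetical base URL; substitute your deployment's host.
BASE_URL = "https://api.example.com"

def benchmark_run_url(run_id: str) -> str:
    # Level 1: aggregate score, verdict, and a summary of every task run.
    return f"{BASE_URL}/v1/benchmark-runs/{run_id}"

def task_run_url(task_run_id: str) -> str:
    # Level 2: internal endpoint, called with an organization API key.
    return f"{BASE_URL}/task-runs/{task_run_id}"

def criterion_run_url(criterion_run_id: str) -> str:
    # Level 3: authoritative record, including the evidence payload.
    return f"{BASE_URL}/criterion-runs/{criterion_run_id}"

def bearer_headers(token: str) -> dict:
    # Bearer auth as described for benchmark runs; using the same header
    # for organization API keys is an assumption here.
    return {"Authorization": f"Bearer {token}"}
```

Pair these with any HTTP client; the helpers only build the request, they do not send it.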
Scoring
Every level applies the same arithmetic.

- Criterion score. A number in [0, 1] returned by the check.
- Task score. Weighted mean of the task’s criterion scores: sum(score_i * weight_i) / sum(weight_i).
- Axes. Criteria that declare an axis (for example correctness, safety, efficiency) are grouped into per-axis weighted means and returned in the axes object.
- Benchmark run score. Mean of task scores.

Each score maps to a verdict:

- pass when the score is >= 0.9
- partial when it is above zero and below that threshold
- fail when the score is zero or there are no criteria
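The arithmetic above fits in a few lines. The weights, the axis grouping, and the 0.9 threshold come from the rules above; the data shapes are illustrative, not the API’s wire format.

```python
def task_score(criteria):
    """Weighted mean of criterion scores: sum(score_i * weight_i) / sum(weight_i)."""
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criteria) / total_weight

def axes(criteria):
    """Per-axis weighted means for criteria that declare an axis."""
    grouped = {}
    for c in criteria:
        if c.get("axis"):
            grouped.setdefault(c["axis"], []).append(c)
    return {axis: task_score(group) for axis, group in grouped.items()}

def verdict(score, has_criteria=True):
    # fail on zero or no criteria; pass at >= 0.9; partial in between.
    if not has_criteria or score == 0:
        return "fail"
    return "pass" if score >= 0.9 else "partial"

criteria = [
    {"score": 1.0, "weight": 2, "axis": "correctness"},
    {"score": 0.5, "weight": 1, "axis": "safety"},
]
print(task_score(criteria))           # 2.5 / 3, about 0.833
print(axes(criteria))                 # {'correctness': 1.0, 'safety': 0.5}
print(verdict(task_score(criteria)))  # 'partial'
```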
Scored vs Unscored Runs
When a run is created with scored: true, the POST /v1/task-runs/{id}/complete response omits details and per-field evidence to avoid leaking the rubric to the agent. To retrieve the withheld evidence afterwards, call GET /criterion-runs/{id} (as shown above) with an organization API key.
Drill-Down Example
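The triage can be sketched over sample payloads. The field names (task_runs, criterion_runs, verdict, passed) follow the level descriptions above; every concrete value and id below is made up.

```python
benchmark_run = {  # what GET /v1/benchmark-runs/{id} might return
    "score": 0.5,  # mean of task scores
    "verdict": "partial",
    "task_runs": [
        {"id": "tr_1", "verdict": "pass", "score": 1.0},
        {"id": "tr_2", "verdict": "fail", "score": 0.0},
    ],
}

task_run = {  # what GET /task-runs/tr_2 might return
    "id": "tr_2",
    "verdict": "fail",
    "criterion_runs": [
        {"id": "cr_1", "passed": True, "score": 1.0},
        {"id": "cr_2", "passed": False, "score": 0.0},
    ],
}

# 1. Start at the benchmark run and pick a failing task run.
failing = next(t for t in benchmark_run["task_runs"] if t["verdict"] == "fail")

# 2. Fetch that task run and find the criterion run that failed.
failed = next(c for c in task_run["criterion_runs"] if not c["passed"])

# 3. GET /criterion-runs/{id} for that criterion then returns the evidence.
print(failing["id"], failed["id"])  # tr_2 cr_2
```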
A typical triage flow: start at the benchmark run, pick a failing task run, list its criterion runs, open the one you care about.

Next Steps
Interactions
Read the raw evidence captured per protocol during a rollout.
Verification
How the engine dispatches checks and builds the evidence payloads.