The Three Levels
1. Benchmark Run
GET /v1/benchmark-runs/{id} returns the aggregate score, verdict, and a summary of every task run in the rollout. Authenticate with the run’s bearer token (returned when you created the run) or a Solver key.
2. Task Run
GET /task-runs/{id} returns a single task run’s verdict, score, axes, the frozen task snapshot, and an array of criterion_runs. This is the internal endpoint; callers use an organization API key.
Each entry in criterion_runs is a summary carrying only id, passed, and score. To pull the full evidence payload for a specific criterion, follow the drill-down in step 3. See Task Runs for the full shape.
3. Criterion Run
GET /criterion-runs/{id} is the authoritative record of how one criterion scored. Its evidence field holds the structured payload that justifies the passed and score values.
The evidence shape depends on the criterion’s check type. The Verification concept page lists every shape.
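The three lookups can be sketched as small URL helpers. BASE_URL and the helper names below are assumptions for illustration; only the endpoint paths and the bearer-token scheme come from the text above.

```python
# Hypothetical base URL; substitute your deployment's host.
BASE_URL = "https://api.example.com"

def benchmark_run_url(run_id: str) -> str:
    # Level 1: aggregate score, verdict, and a summary of every task run.
    return f"{BASE_URL}/v1/benchmark-runs/{run_id}"

def task_run_url(task_run_id: str) -> str:
    # Level 2: internal endpoint, called with an organization API key.
    return f"{BASE_URL}/task-runs/{task_run_id}"

def criterion_run_url(criterion_run_id: str) -> str:
    # Level 3: authoritative record, including the evidence payload.
    return f"{BASE_URL}/criterion-runs/{criterion_run_id}"

def bearer_headers(token: str) -> dict:
    # Bearer auth as described for benchmark runs; using the same header
    # for organization API keys is an assumption here.
    return {"Authorization": f"Bearer {token}"}
```

Pair these with any HTTP client; the helpers only build the request, they do not send it.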
Scoring
Every level applies the same arithmetic.

- Criterion score. A number in [0, 1] returned by the check.
- Task score. Weighted mean of the task’s criterion scores: sum(score_i * weight_i) / sum(weight_i).
- Axes. Criteria that declare an axis (for example correctness, safety, efficiency) are grouped into per-axis weighted means and returned in the axes object.
- Benchmark run score. Mean of task scores.

Each score maps to a verdict:

- pass when the score is >= 0.9
- partial when it is above zero and below that threshold
- fail when the score is zero or there are no criteria
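The arithmetic above fits in a few lines. The weights, the axis grouping, and the 0.9 threshold come from the rules above; the data shapes are illustrative, not the API’s wire format.

```python
def task_score(criteria):
    """Weighted mean of criterion scores: sum(score_i * weight_i) / sum(weight_i)."""
    total_weight = sum(c["weight"] for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criteria) / total_weight

def axes(criteria):
    """Per-axis weighted means for criteria that declare an axis."""
    grouped = {}
    for c in criteria:
        if c.get("axis"):
            grouped.setdefault(c["axis"], []).append(c)
    return {axis: task_score(group) for axis, group in grouped.items()}

def verdict(score, has_criteria=True):
    # fail on zero or no criteria; pass at >= 0.9; partial in between.
    if not has_criteria or score == 0:
        return "fail"
    return "pass" if score >= 0.9 else "partial"

criteria = [
    {"score": 1.0, "weight": 2, "axis": "correctness"},
    {"score": 0.5, "weight": 1, "axis": "safety"},
]
print(task_score(criteria))           # 2.5 / 3, about 0.833
print(axes(criteria))                 # {'correctness': 1.0, 'safety': 0.5}
print(verdict(task_score(criteria)))  # 'partial'
```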
Scored vs Unscored Runs
When a run is created with scored: true, the POST /v1/task-runs/{id}/complete response omits details and per-field evidence to avoid leaking the rubric to the agent. To retrieve the withheld evidence afterwards, call GET /criterion-runs/{id} (as shown above) with an organization API key.
Drill-Down Example
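The triage can be sketched over sample payloads. The field names (task_runs, criterion_runs, verdict, passed) follow the level descriptions above; every concrete value and id below is made up.

```python
benchmark_run = {  # what GET /v1/benchmark-runs/{id} might return
    "score": 0.5,  # mean of task scores
    "verdict": "partial",
    "task_runs": [
        {"id": "tr_1", "verdict": "pass", "score": 1.0},
        {"id": "tr_2", "verdict": "fail", "score": 0.0},
    ],
}

task_run = {  # what GET /task-runs/tr_2 might return
    "id": "tr_2",
    "verdict": "fail",
    "criterion_runs": [
        {"id": "cr_1", "passed": True, "score": 1.0},
        {"id": "cr_2", "passed": False, "score": 0.0},
    ],
}

# 1. Start at the benchmark run and pick a failing task run.
failing = next(t for t in benchmark_run["task_runs"] if t["verdict"] == "fail")

# 2. Fetch that task run and find the criterion run that failed.
failed = next(c for c in task_run["criterion_runs"] if not c["passed"])

# 3. GET /criterion-runs/{id} for that criterion then returns the evidence.
print(failing["id"], failed["id"])  # tr_2 cr_2
```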
A typical triage flow: start at the benchmark run, pick a failing task run, list its criterion runs, open the one you care about.

Next Steps
Interactions
Read the raw evidence captured per protocol during a rollout.
Verification
How the engine dispatches checks and builds the evidence payloads.