A run is a single execution of a benchmark. When you start a run, Verial provisions simulators from the benchmark’s linked environment, executes tasks, evaluates results, and produces a score.

Lifecycle

  1. Created. The run is initialized. Verial begins provisioning simulator instances from the benchmark’s environment.
  2. Running. Simulators are live and tasks are being executed. Your agent is interacting with simulator endpoints.
  3. Completed. All tasks have finished (or timed out). Evaluations have been processed and scores are available.
  4. Canceled. The run was manually stopped before completion.
  5. Timed Out. The run exceeded its maximum duration.
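The last three states are terminal: once a run is completed, canceled, or timed out, its status no longer changes. A small helper to distinguish them might look like the sketch below. Note the status string values (in particular 'timed_out') are assumed from the lifecycle names above; confirm the exact enum against the API reference.

```typescript
// Status values assumed from the lifecycle list above; 'timed_out' in
// particular is a guess at the wire format -- check the API reference.
type RunStatus = 'created' | 'running' | 'completed' | 'canceled' | 'timed_out'

const TERMINAL_STATUSES: RunStatus[] = ['completed', 'canceled', 'timed_out']

function isTerminal(status: RunStatus): boolean {
  return TERMINAL_STATUSES.includes(status)
}
```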

Starting a Run

A run requires only a benchmark ID. The environment is resolved from the benchmark’s configuration:
const run = await verial.runs.create({
  benchmarkId: 'bench_clxyz456',
})
Verial provisions all simulators defined in the benchmark’s environment, loads data, and begins executing tasks.

Polling for Completion

Runs are asynchronous. Poll the run status until it reaches a terminal state (completed, canceled, or timed out). A run may still report the created status while simulators are provisioning, so check for both non-terminal states:
let run = await verial.runs.get({ id: runId })

while (run.status === 'created' || run.status === 'running') {
  await new Promise(resolve => setTimeout(resolve, 5000))
  run = await verial.runs.get({ id: runId })
}

console.log(run.status)  // "completed"
console.log(run.score)   // 0.85
console.log(run.verdict) // "pass"
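In practice you may want to wrap this loop in a reusable helper with a client-side deadline, so a stuck run doesn't block your pipeline indefinitely. A minimal sketch, assuming the `runs.get` call shown above and a trimmed-down Run shape (not the SDK's full type):

```typescript
// Minimal shapes assumed for this sketch; the real SDK types are richer.
interface Run {
  id: string
  status: string
  score?: number
  verdict?: string
}

interface RunsClient {
  runs: { get(args: { id: string }): Promise<Run> }
}

// Poll until the run leaves the non-terminal states, or give up after
// timeoutMs on the client side (independent of the run's own timeout).
async function waitForRun(
  client: RunsClient,
  runId: string,
  { intervalMs = 5000, timeoutMs = 30 * 60 * 1000 } = {},
): Promise<Run> {
  const deadline = Date.now() + timeoutMs
  let run = await client.runs.get({ id: runId })
  while (run.status === 'created' || run.status === 'running') {
    if (Date.now() > deadline) {
      throw new Error(`Gave up waiting for run ${runId} after ${timeoutMs}ms`)
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs))
    run = await client.runs.get({ id: runId })
  }
  return run
}
```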

Results

A completed run includes:
| Field | Type | Description |
| --- | --- | --- |
| score | number | Aggregate score from 0.0 to 1.0, weighted across all tasks |
| verdict | string | "pass" or "fail" based on benchmark thresholds |
| completed_at | string | ISO 8601 timestamp of completion |

Task Runs

Each task within a run produces a task run with its own score and verdict. Retrieve task-level results using the taskRuns resource:
const taskRuns = await verial.taskRuns.list({ runId })

for (const taskRun of taskRuns.data) {
  console.log(`Task ${taskRun.task_id}: ${taskRun.score} (${taskRun.verdict})`)

  // Get eval-level results
  const evalRuns = await verial.evalRuns.list({ taskRunId: taskRun.id })
  for (const evalRun of evalRuns.data) {
    console.log(`  ${evalRun.result}: ${evalRun.details}`)
  }
}

Eval Runs

Each eval within a task run produces an eval run with the LLM judge’s assessment:
| Field | Type | Description |
| --- | --- | --- |
| result | string | "pass" or "fail" |
| score | number | Score for this eval (0.0 or 1.0) |
| details | string \| null | The judge’s explanation for the result |

Scoring

The run score is a weighted average of all task scores. Each task score is computed from its evaluation results. See Evals for details on how individual evaluations are scored and weighted. The verdict is determined by comparing the aggregate score against the benchmark’s threshold:
  • If score >= threshold, the verdict is "pass"
  • If score < threshold, the verdict is "fail"
The default threshold is 0.7 (70%).
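The rule above can be sketched as a pure function. The task weights and the 0.7 default threshold follow the text; the field names here are illustrative, not the SDK's types, and the exact weighting scheme is defined in Evals:

```typescript
// Illustrative shapes; not the SDK's actual types.
interface TaskScore {
  score: number  // task score, 0.0 to 1.0
  weight: number // relative weight of this task in the run
}

// Weighted average of task scores, compared against the benchmark threshold.
function runVerdict(
  tasks: TaskScore[],
  threshold = 0.7,
): { score: number; verdict: 'pass' | 'fail' } {
  const totalWeight = tasks.reduce((sum, t) => sum + t.weight, 0)
  const score = tasks.reduce((sum, t) => sum + t.score * t.weight, 0) / totalWeight
  return { score, verdict: score >= threshold ? 'pass' : 'fail' }
}
```

For example, two equally weighted tasks scoring 1.0 and 0.5 average to 0.75, which passes the default 0.7 threshold.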

Completing or Canceling a Run

Use separate methods to complete or cancel an in-progress run:
// Cancel a run
await verial.runs.cancel({ id: runId })

// Mark a run as completed
await verial.runs.complete({ id: runId })
Canceled runs retain any task results that were completed before cancellation.
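Because canceled runs keep their completed task results, you can still inspect partial progress. A sketch, assuming a client shape that mirrors the `taskRuns.list({ runId })` call shown earlier:

```typescript
// Trimmed client shape mirroring the taskRuns.list call from this guide.
interface TaskRunsClient {
  taskRuns: { list(args: { runId: string }): Promise<{ data: { id: string }[] }> }
}

// Count how many task results survived a cancellation.
async function partialResultCount(
  client: TaskRunsClient,
  runId: string,
): Promise<number> {
  const taskRuns = await client.taskRuns.list({ runId })
  return taskRuns.data.length
}
```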

Comparing Runs

Run the same benchmark multiple times to track agent performance over time. Each run produces an independent score, making it easy to measure improvements or detect regressions.
const runs = await verial.runs.list()

for (const run of runs.data) {
  console.log(`${run.created_at}: ${run.score} (${run.verdict})`)
}
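Because each score is independent, a simple regression check compares the two most recent scored runs. A sketch, assuming the list endpoint returns runs newest-first (typical for list APIs, but confirm against the API reference) and that score may be null for runs that never completed:

```typescript
// Illustrative shape: only the fields this check needs.
interface ScoredRun {
  created_at: string
  score: number | null
}

// Difference between the two most recent scored runs; positive means the
// latest run improved, negative means a regression. Assumes newest-first
// ordering. Returns null if fewer than two runs have scores.
function scoreDelta(runs: ScoredRun[]): number | null {
  const scored = runs.filter(r => r.score !== null)
  if (scored.length < 2) return null
  return (scored[0].score as number) - (scored[1].score as number)
}
```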

Next Steps

Evals

Understand how evaluations produce scores.

API Reference

Create and manage runs via the API.