Lifecycle
- Created. The run is initialized. Verial begins provisioning simulator instances from the benchmark’s environment.
- Running. Simulators are live and tasks are being executed. Your agent is interacting with simulator endpoints.
- Completed. All tasks have finished (or timed out). Evaluations have been processed and scores are available.
- Canceled. The run was manually stopped before completion.
- Timed Out. The run exceeded its maximum duration.
Starting a Run
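As a sketch, creating a run might look like the following. The `make_run_request` helper and the payload shape are illustrative assumptions, not the actual SDK surface; the key point is that only the benchmark ID is supplied.

```python
import json

def make_run_request(benchmark_id: str) -> str:
    """Build the JSON body for creating a run (hypothetical payload shape).

    No environment field is needed: the server resolves the environment
    from the benchmark's own configuration.
    """
    return json.dumps({"benchmark_id": benchmark_id})

print(make_run_request("bm_demo_123"))  # {"benchmark_id": "bm_demo_123"}
```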
A run only requires a benchmark ID. The environment is resolved from the benchmark’s configuration.

Polling for Completion
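A minimal polling loop can be sketched as below. The `get_status` callable stands in for whatever SDK or HTTP call fetches the run’s current status; the function itself is an illustrative pattern, not part of the API.

```python
import time

# Terminal states from the lifecycle above.
TERMINAL_STATES = {"completed", "canceled", "timed_out"}

def wait_for_run(get_status, interval_s: float = 5.0, max_wait_s: float = 3600.0) -> str:
    """Poll until the run reaches a terminal state, then return that state."""
    waited = 0.0
    while True:
        status = get_status()  # e.g. an SDK call returning the run's status string
        if status in TERMINAL_STATES:
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"run still {status!r} after {max_wait_s}s")
        time.sleep(interval_s)
        waited += interval_s

# Usage with a stubbed status source:
statuses = iter(["created", "running", "running", "completed"])
print(wait_for_run(lambda: next(statuses), interval_s=0.0))  # completed
```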
Runs are asynchronous. Poll the run status until it reaches a terminal state (Completed, Canceled, or Timed Out).

Results
A completed run includes:

| Field | Type | Description |
|---|---|---|
| score | number | Aggregate score from 0.0 to 1.0, weighted across all tasks |
| verdict | string | "pass" or "fail" based on benchmark thresholds |
| completed_at | string | ISO 8601 timestamp of completion |
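For example, given a completed run with these fields (values invented for illustration), the completed_at timestamp parses with the standard library:

```python
from datetime import datetime

# Example result fields for a completed run (illustrative values).
run = {"score": 0.84, "verdict": "pass", "completed_at": "2025-06-01T12:30:00Z"}

# datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+;
# replacing it keeps the parse working on older versions too.
completed = datetime.fromisoformat(run["completed_at"].replace("Z", "+00:00"))
print(f"{run['verdict']} with score {run['score']:.2f} at {completed.isoformat()}")
```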
Task Runs
Each task within a run produces a task run with its own score and verdict. Retrieve task-level results using the taskRuns resource.
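Hypothetical example: if the taskRuns resource returns records shaped like the following (field names assumed from the run-level results above), failing tasks can be picked out directly:

```python
# Assumed shape of task-run records; the real resource may differ.
task_runs = [
    {"task_id": "task_1", "score": 1.0, "verdict": "pass"},
    {"task_id": "task_2", "score": 0.0, "verdict": "fail"},
    {"task_id": "task_3", "score": 1.0, "verdict": "pass"},
]

failing = [t["task_id"] for t in task_runs if t["verdict"] == "fail"]
print(failing)  # ['task_2']
```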
Eval Runs
Each eval within a task run produces an eval run with the LLM judge’s assessment:

| Field | Type | Description |
|---|---|---|
| result | string | "pass" or "fail" |
| score | number | Score for this eval (0.0 or 1.0) |
| details | string \| null | The judge’s explanation for the result |
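For failed evals, the details field carries the judge’s explanation. A small sketch, with invented records:

```python
# Assumed eval-run records (illustrative values).
eval_runs = [
    {"result": "pass", "score": 1.0, "details": None},
    {"result": "fail", "score": 0.0, "details": "Response omitted the required disclaimer."},
]

# Collect the judge's explanations for every failed eval.
for e in eval_runs:
    if e["result"] == "fail":
        print(e["details"])  # Response omitted the required disclaimer.
```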
Scoring
The run score is a weighted average of all task scores. Each task score is computed from its evaluation results. See Evals for details on how individual evaluations are scored and weighted. The verdict is determined by comparing the aggregate score against the benchmark’s threshold:
- If score >= threshold, the verdict is "pass"
- If score < threshold, the verdict is "fail"
The default threshold is 0.7 (70%).
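The scoring rule above can be sketched as follows. The specific weights are illustrative; how task weights are assigned is defined by the benchmark.

```python
def run_score(task_scores: list[float], weights: list[float]) -> float:
    """Weighted average of task scores."""
    return sum(s * w for s, w in zip(task_scores, weights)) / sum(weights)

def run_verdict(score: float, threshold: float = 0.7) -> str:
    """Compare the aggregate score against the benchmark's threshold."""
    return "pass" if score >= threshold else "fail"

# Three tasks, the last weighted double: (1.0 + 0.5 + 2.0) / 4 = 0.875
score = run_score([1.0, 0.5, 1.0], [1.0, 1.0, 2.0])
print(score, run_verdict(score))  # 0.875 pass
```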
Completing or Canceling a Run
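As a sketch with a stand-in client (the resource and method names here are assumptions, not the actual SDK), completing and canceling map to two separate calls:

```python
class FakeRuns:
    """Stand-in for a runs resource; records which method was called."""
    def __init__(self):
        self.calls = []

    def complete(self, run_id: str):
        self.calls.append(("complete", run_id))  # run ends in the Completed state

    def cancel(self, run_id: str):
        self.calls.append(("cancel", run_id))    # run ends in the Canceled state

runs = FakeRuns()
runs.complete("run_123")  # finish a run
runs.cancel("run_456")    # stop a run before completion
print(runs.calls)  # [('complete', 'run_123'), ('cancel', 'run_456')]
```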
Use separate methods to complete or cancel a running benchmark.

Comparing Runs
Run the same benchmark multiple times to track agent performance over time. Each run produces an independent score, making it easy to measure improvements or detect regressions.

Next Steps
Evals
Understand how evaluations produce scores.
API Reference
Create and manage runs via the API.