Evals define what success looks like for a Task. Each eval has a label, a natural language assertion, and a weight that determines its contribution to the overall score. During a Run, an LLM judge evaluates each assertion against the evidence collected from sandbox interactions.
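The exact scoring formula is not documented here; a minimal sketch, assuming the overall score is a weight-normalized sum of the judge's pass/fail verdicts (the `EvalResult` shape and `overallScore` helper are illustrative, not part of the SDK):

```typescript
// Illustrative result of an LLM judge evaluating one eval's assertion.
interface EvalResult {
  label: string
  weight: number
  passed: boolean // the judge's verdict on the assertion
}

// Weighted score: sum of weights for passed assertions, divided by total weight.
function overallScore(results: EvalResult[]): number {
  const total = results.reduce((sum, r) => sum + r.weight, 0)
  if (total === 0) return 0
  const passedWeight = results.reduce(
    (sum, r) => sum + (r.passed ? r.weight : 0),
    0,
  )
  return passedWeight / total
}
```

Under this assumption, an eval with weight 2.0 contributes twice as much to the score as one with weight 1.0.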
These endpoints are not yet included in the OpenAPI spec.
## Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | `/evals?task_id={taskId}` | List evals for a task |
| POST | `/evals` | Create an eval |
| GET | `/evals/{id}` | Get eval details |
| PATCH | `/evals/{id}` | Update an eval |
| DELETE | `/evals/{id}` | Delete an eval |
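If you are calling the REST endpoints directly rather than through the SDK, a list request can be sketched as follows. The base URL and bearer-token auth header are assumptions; substitute your deployment's values:

```typescript
// Hypothetical base URL; replace with your deployment's API origin.
const BASE_URL = "https://api.verial.example/v1"

// Build the list-evals URL, escaping the task id for use in a query string.
function listEvalsUrl(taskId: string): string {
  return `${BASE_URL}/evals?task_id=${encodeURIComponent(taskId)}`
}

// Raw HTTP equivalent of verial.evals.list (auth scheme is an assumption).
async function listEvals(taskId: string, apiKey: string): Promise<unknown> {
  const res = await fetch(listEvalsUrl(taskId), {
    headers: { Authorization: `Bearer ${apiKey}` },
  })
  if (!res.ok) throw new Error(`HTTP ${res.status}`)
  return res.json()
}
```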
## Eval Object
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier |
| `task_id` | string | Parent Task |
| `label` | string | Short label describing the assertion |
| `assert` | string | Natural language assertion the LLM judge evaluates |
| `weight` | number | Weight for scoring (higher = more important) |
| `organization_id` | string | Parent organization |
| `created_at` | datetime | Creation timestamp |
| `updated_at` | datetime | Last modification timestamp |
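The fields above can be expressed as a TypeScript type. Field names come from the table; the concrete types are an approximation (timestamps are assumed to be ISO 8601 strings on the wire):

```typescript
// Sketch of the Eval object shape described in the table above.
interface Eval {
  id: string               // Unique identifier
  task_id: string          // Parent Task
  label: string            // Short label describing the assertion
  assert: string           // Natural language assertion the LLM judge evaluates
  weight: number           // Weight for scoring (higher = more important)
  organization_id: string  // Parent organization
  created_at: string       // Creation timestamp (ISO 8601 assumed)
  updated_at: string       // Last modification timestamp (ISO 8601 assumed)
}
```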
## SDK Example
```typescript
// Create an eval
const eval_ = await verial.evals.create({
  taskId: 'task_abc123',
  label: 'Prior auth submitted',
  assert: 'The agent submitted a prior authorization request to the payer',
  weight: 1.0,
})

// List evals for a task
const evals = await verial.evals.list({ taskId: 'task_abc123' })

// Get a specific eval
const details = await verial.evals.get({ id: eval_.id })

// Update an eval
await verial.evals.update({
  id: eval_.id,
  weight: 2.0,
})

// Delete an eval
await verial.evals.delete({ id: eval_.id })
```