The Feedback Loop
You can traverse that loop top-down from any completed run: start at the aggregate verdict, drill into task verdicts and per-axis scores, open a failing criterion run to see its evidence payload, and drop down to raw sandbox events when you need the exact request or transcript turn.

What You Can Inspect Today
- Aggregate scores and verdicts on a Benchmark Run (`score`, `verdict`, list of task runs).
- Per-task scoring on a Task Run: `verdict`, `score`, `axes` (correctness, safety, efficiency, or whatever axes the benchmark declares), and a `criterion_runs` array.
- Per-criterion results and evidence on a Criterion Run: `passed`, `score`, `details`, and a structured `evidence` payload whose shape depends on the check type.
- Raw sandbox events via `GET /sandboxes/{id}/events`: the lowest-level record of FHIR calls, HL7 messages, portal actions, SFTP uploads, voice turns, fax documents, and X12 responses captured during the rollout.
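These four layers can be walked top-down with plain HTTP GETs. The sketch below (Python) takes a pluggable `fetch` callable so the traversal can be exercised offline; the endpoint paths and field names (`score`, `verdict`, `axes`, `criterion_runs`, `passed`, `details`) mirror this page, but the assumption that `task_runs` and `criterion_runs` hold IDs, and the exact response shapes, are illustrative guesses, not confirmed API contracts:

```python
from typing import Callable

def summarize_run(fetch: Callable[[str], dict], run_id: str) -> list[str]:
    """Walk a completed benchmark run top-down: run -> task runs ->
    criterion runs, collecting one summary line per level and flagging
    failing criteria. `fetch(path)` should GET the path and return
    parsed JSON. Response shapes here are assumptions for illustration."""
    lines = []
    run = fetch(f"/v1/benchmark-runs/{run_id}")
    lines.append(f"run {run_id}: {run['verdict']} ({run['score']})")
    for task_id in run["task_runs"]:          # assumed: list of task-run IDs
        task = fetch(f"/task-runs/{task_id}")
        lines.append(f"  task {task_id}: {task['verdict']} axes={task['axes']}")
        for crit_id in task["criterion_runs"]:  # assumed: list of criterion-run IDs
            crit = fetch(f"/criterion-runs/{crit_id}")
            if not crit["passed"]:
                lines.append(f"    FAIL {crit_id}: {crit['details']}")
    return lines
```

Injecting `fetch` keeps the drill-down logic independent of any particular HTTP client, so the same function works against the live API or recorded fixtures.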
Dashboards, real-time monitoring, trace search, and anomaly detection are
planned but not yet available. Today, observability in Verial means inspecting
completed runs and their recorded evidence through the API.
Where Evidence Lives
| Layer | Where it lives | Read with |
|---|---|---|
| Run-level score and verdict | Benchmark Run row | `GET /v1/benchmark-runs/{id}` |
| Task-level score, verdict, axes | Task Run row | `GET /task-runs/{id}` |
| Per-criterion pass/score + structured evidence | Criterion Run row | `GET /criterion-runs/{id}` |
| Raw protocol interactions | Sandbox event log | `GET /sandboxes/{id}/events` |
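The sandbox event log is a flat, time-ordered list covering every protocol at once, so per-protocol inspection is a simple group-by. A minimal sketch, assuming each event carries a `protocol` field naming its channel (the field name is an assumption about the event shape, not documented here):

```python
from collections import defaultdict

def events_by_protocol(events: list[dict]) -> dict[str, list[dict]]:
    """Group raw sandbox events (as returned by GET /sandboxes/{id}/events)
    by channel so each protocol's traffic can be read in order.
    Assumes each event has a `protocol` field (e.g. fhir, hl7, voice)."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        grouped[event.get("protocol", "unknown")].append(event)
    return dict(grouped)
```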
Full per-criterion detail remains available after the fact via `GET /criterion-runs/{id}`, including for runs created with `scored: true` (where the task-run completion response elides `details` and field-level evidence to avoid leaking the rubric to the agent).
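In practice this means a scored run's failures are triaged in two steps: read the elided completion response, then re-fetch only the failing criterion runs to get their evidence. A sketch under the same assumptions as above (`criterion_runs` holds IDs; each criterion run exposes `passed` and `evidence`):

```python
from typing import Callable

def failing_evidence(fetch: Callable[[str], dict], task_run: dict) -> dict[str, dict]:
    """Map each failing criterion-run ID to its structured evidence payload
    by re-fetching the full criterion runs, since a scored run's completion
    response elides them. Shapes are assumptions for illustration."""
    out: dict[str, dict] = {}
    for crit_id in task_run["criterion_runs"]:
        crit = fetch(f"/criterion-runs/{crit_id}")
        if not crit["passed"]:
            out[crit_id] = crit["evidence"]
    return out
```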
Next Steps
- Run Results: read a completed benchmark run top-down — verdict, axes, criteria, evidence.
- Interactions: inspect raw recorded evidence per protocol (FHIR, HL7, voice, fax, portal, SFTP, X12).