The Feedback Loop
You can traverse that loop top-down from any completed run: start at the aggregate verdict, drill into task verdicts and per-axis scores, open a failing criterion run to see its evidence payload, and drop down to raw sandbox events when you need the exact request or transcript turn.

What You Can Inspect Today
- Aggregate scores and verdicts on a Benchmark Run (`score`, `verdict`, list of task runs).
- Per-task scoring on a Task Run: `verdict`, `score`, `axes` (correctness, safety, efficiency, or whatever axes the benchmark declares), and a `criterion_runs` array.
- Per-criterion results and evidence on a Criterion Run: `passed`, `score`, `details`, and a structured `evidence` payload whose shape depends on the check type.
- Raw sandbox events via `GET /sandboxes/{id}/events`: the lowest-level record of FHIR calls, HL7 messages, portal actions, SFTP uploads, voice turns, fax documents, and X12 responses captured during the rollout.
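These four layers can be walked top-down with plain HTTP GETs. The sketch below (Python) takes a pluggable `fetch` callable so the traversal can be exercised offline; the endpoint paths and field names (`score`, `verdict`, `axes`, `criterion_runs`, `passed`, `details`) mirror this page, but the assumption that `task_runs` and `criterion_runs` hold IDs, and the exact response shapes, are illustrative guesses, not confirmed API contracts:

```python
from typing import Callable

def summarize_run(fetch: Callable[[str], dict], run_id: str) -> list[str]:
    """Walk a completed benchmark run top-down: run -> task runs ->
    criterion runs, collecting one summary line per level and flagging
    failing criteria. `fetch(path)` should GET the path and return
    parsed JSON. Response shapes here are assumptions for illustration."""
    lines = []
    run = fetch(f"/v1/benchmark-runs/{run_id}")
    lines.append(f"run {run_id}: {run['verdict']} ({run['score']})")
    for task_id in run["task_runs"]:          # assumed: list of task-run IDs
        task = fetch(f"/task-runs/{task_id}")
        lines.append(f"  task {task_id}: {task['verdict']} axes={task['axes']}")
        for crit_id in task["criterion_runs"]:  # assumed: list of criterion-run IDs
            crit = fetch(f"/criterion-runs/{crit_id}")
            if not crit["passed"]:
                lines.append(f"    FAIL {crit_id}: {crit['details']}")
    return lines
```

Injecting `fetch` keeps the drill-down logic independent of any particular HTTP client, so the same function works against the live API or recorded fixtures.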
Dashboards, real-time monitoring, trace search, and anomaly detection are
planned but not yet available. Today, observability in Verial means inspecting
completed runs and their recorded evidence through the API.
Where Evidence Lives
| Layer | Where it lives | Read with |
|---|---|---|
| Run-level score and verdict | Benchmark Run row | `GET /v1/benchmark-runs/{id}` |
| Task-level score, verdict, axes | Task Run row | `GET /task-runs/{id}` |
| Per-criterion pass/score + structured evidence | Criterion Run row | `GET /criterion-runs/{id}` |
| Raw protocol interactions | Sandbox event log | `GET /sandboxes/{id}/events` |
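The sandbox event log is a flat, time-ordered list covering every protocol at once, so per-protocol inspection is a simple group-by. A minimal sketch, assuming each event carries a `protocol` field naming its channel (the field name is an assumption about the event shape, not documented here):

```python
from collections import defaultdict

def events_by_protocol(events: list[dict]) -> dict[str, list[dict]]:
    """Group raw sandbox events (as returned by GET /sandboxes/{id}/events)
    by channel so each protocol's traffic can be read in order.
    Assumes each event has a `protocol` field (e.g. fhir, hl7, voice)."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        grouped[event.get("protocol", "unknown")].append(event)
    return dict(grouped)
```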
Full per-criterion detail remains available after the fact via `GET /criterion-runs/{id}`, including for runs created with `scored: true` (where the task-run completion response elides `details` and field-level evidence to avoid leaking the rubric to the agent).
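In practice this means a scored run's failures are triaged in two steps: read the elided completion response, then re-fetch only the failing criterion runs to get their evidence. A sketch under the same assumptions as above (`criterion_runs` holds IDs; each criterion run exposes `passed` and `evidence`):

```python
from typing import Callable

def failing_evidence(fetch: Callable[[str], dict], task_run: dict) -> dict[str, dict]:
    """Map each failing criterion-run ID to its structured evidence payload
    by re-fetching the full criterion runs, since a scored run's completion
    response elides them. Shapes are assumptions for illustration."""
    out: dict[str, dict] = {}
    for crit_id in task_run["criterion_runs"]:
        crit = fetch(f"/criterion-runs/{crit_id}")
        if not crit["passed"]:
            out[crit_id] = crit["evidence"]
    return out
```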
Next Steps
- Run Results: read a completed benchmark run top-down — verdict, axes, criteria, evidence.
- Interactions: inspect raw recorded evidence per protocol (FHIR, HL7, voice, fax, portal, SFTP, X12).