Every benchmark rollout produces a stack of structured evidence. The agent drives sandbox endpoints (FHIR, HL7, voice, fax, portal, SFTP, X12); Verial records every observable interaction; the verification engine scores that recorded state into per-criterion results. Observability in Verial means reading that stack back after the run.

The Feedback Loop

You can traverse that loop top-down from any completed run: start at the aggregate verdict, drill into task verdicts and per-axis scores, open a failing criterion run to see its evidence payload, and drop down to raw sandbox events when you need the exact request or transcript turn.
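That top-down traversal can be sketched as a walk over the payload shapes this page describes. The field names below (task_runs, criterion_runs, passed, score, details, id) are taken from this page's descriptions, but the exact response schema may differ, so treat this as an illustration rather than a client:

```python
# Sketch: drill from an aggregate verdict down to the failing criteria.
# Payload shapes are assumed from the field names on this page and may
# not match the live API exactly.

def failing_criteria(benchmark_run: dict) -> list[dict]:
    """Collect every failing criterion run, tagged with its task run."""
    failures = []
    for task_run in benchmark_run.get("task_runs", []):
        for crit in task_run.get("criterion_runs", []):
            if not crit.get("passed"):
                failures.append({
                    "task_run_id": task_run.get("id"),
                    "criterion_run_id": crit.get("id"),
                    "score": crit.get("score"),
                    "details": crit.get("details"),
                })
    return failures

# Illustrative run payload, shaped after the descriptions on this page.
run = {
    "verdict": "fail",
    "score": 0.62,
    "task_runs": [
        {
            "id": "tr_1",
            "verdict": "fail",
            "axes": {"correctness": 0.5, "safety": 1.0},
            "criterion_runs": [
                {"id": "cr_1", "passed": True, "score": 1.0},
                {"id": "cr_2", "passed": False, "score": 0.0,
                 "details": "Patient MRN mismatch in HL7 ADT message"},
            ],
        }
    ],
}

for f in failing_criteria(run):
    print(f["criterion_run_id"], f["details"])
```

From each failing criterion run's id you would then fetch its full evidence payload, and from there drop to the raw sandbox events if needed.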

What You Can Inspect Today

  • Aggregate scores and verdicts on a Benchmark Run (score, verdict, list of task runs).
  • Per-task scoring on a Task Run: verdict, score, axes (correctness, safety, efficiency, or any other axes the benchmark declares), and a criterion_runs array.
  • Per-criterion results and evidence on a Criterion Run: passed, score, details, and a structured evidence payload whose shape depends on the check type.
  • Raw sandbox events via GET /sandboxes/{id}/events: the lowest-level record of FHIR calls, HL7 messages, portal actions, SFTP uploads, voice turns, fax documents, and X12 responses captured during the rollout.
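When working with the raw event log, it usually helps to split the stream by protocol channel before reading it. The event shape below (a "protocol" field naming the channel) is an assumption for illustration; the actual keys returned by GET /sandboxes/{id}/events may differ:

```python
# Sketch: group raw sandbox events by protocol channel so FHIR calls,
# HL7 messages, voice turns, etc. can be read separately. The event
# key names here are illustrative assumptions, not the exact schema.
from collections import defaultdict

def events_by_protocol(events: list[dict]) -> dict[str, list[dict]]:
    grouped = defaultdict(list)
    for event in events:
        grouped[event.get("protocol", "unknown")].append(event)
    return dict(grouped)

events = [
    {"protocol": "fhir", "method": "GET", "path": "/Patient/123"},
    {"protocol": "hl7", "message_type": "ADT^A01"},
    {"protocol": "fhir", "method": "POST", "path": "/Observation"},
]

grouped = events_by_protocol(events)
print({k: len(v) for k, v in grouped.items()})  # {'fhir': 2, 'hl7': 1}
```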
The Run Results page walks through reading a completed run top-down. The Interactions page covers the raw evidence per protocol.
Dashboards, real-time monitoring, trace search, and anomaly detection are planned but not yet available. Today, observability in Verial means inspecting completed runs and their recorded evidence through the API.

Where Evidence Lives

| Layer | Where it lives | Read with |
| --- | --- | --- |
| Run-level score and verdict | Benchmark Run row | GET /v1/benchmark-runs/{id} |
| Task-level score, verdict, axes | Task Run row | GET /task-runs/{id} |
| Per-criterion pass/score + structured evidence | Criterion Run row | GET /criterion-runs/{id} |
| Raw protocol interactions | Sandbox event log | GET /sandboxes/{id}/events |
Criterion Run evidence is always available via GET /criterion-runs/{id}, including for runs created with scored: true (where the task-run completion response elides details and field-level evidence to avoid leaking the rubric to the agent).
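The scored: true elision can be pictured as a redaction step: the task-run completion response keeps the verdict-bearing fields but drops details and field-level evidence, all of which remain readable afterwards via GET /criterion-runs/{id}. A minimal sketch, with illustrative field names that are not the exact API schema:

```python
# Sketch of the scored-run elision described above: strip the fields
# that would leak the rubric to the agent mid-run, keeping only the
# pass/score signal. "details" and "evidence" are illustrative names.

ELIDED_FIELDS = {"details", "evidence"}

def elide_for_agent(criterion_run: dict) -> dict:
    """What a scored: true completion response would expose."""
    return {k: v for k, v in criterion_run.items() if k not in ELIDED_FIELDS}

full = {
    "id": "cr_2",
    "passed": False,
    "score": 0.0,
    "details": "Patient MRN mismatch in HL7 ADT message",
    "evidence": {"expected_mrn": "10234", "observed_mrn": "10243"},
}

print(elide_for_agent(full))  # {'id': 'cr_2', 'passed': False, 'score': 0.0}
```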

Next Steps

Run Results

Read a completed benchmark run top-down: verdict, axes, criteria, evidence.

Interactions

Inspect raw recorded evidence per protocol (FHIR, HL7, voice, fax, portal, SFTP, X12).