The verification engine is the component that scores a Task Run. After an agent finishes a rollout and calls POST /v1/task-runs/{id}/complete, the engine walks the task’s criteria, runs the matching check against the sandbox state, and writes a Criterion Run for each.

The Flow

Check Dispatch

The assertion.assert field is a discriminator. The engine maps each value to a check implementation:
Each assert value reads from a different data source:
  • fhir-resource-state: FHIR store attached to the playground (Google Cloud Healthcare API)
  • hl7-structural: SandboxEvent rows of type hl7_outbound
  • portal-state-match: sandbox state rows keyed by correlate_by.resource
  • sftp-file-present: GCS objects under the sandbox’s SFTP prefix
  • voice-transcript: recorded voice turns for the playground
  • x12-response: X12 response records for the playground
If assertion.assert does not parse against a known spec, the criterion run is marked failed with details indicating an unsupported assertion.
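The dispatch can be sketched as a lookup table keyed by the discriminator. This is illustrative only: the handler stubs, the CHECKS mapping, and the run_criterion signature are assumptions, not the engine's real API.

```python
# Sketch of discriminator-based check dispatch. Handler names and the
# run_criterion signature are hypothetical, not the engine's real API.

def _unsupported(assert_value):
    # Unknown assert values produce a failed criterion run with details.
    return {"passed": False, "score": 0.0,
            "details": {"error": f"unsupported assertion: {assert_value}"}}

# Each known assert value maps to a check implementation (stubbed here).
CHECKS = {
    "fhir-resource-state": lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "hl7-structural":      lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "portal-state-match":  lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "sftp-file-present":   lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "voice-transcript":    lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "x12-response":        lambda criterion, sandbox: {"passed": True, "score": 1.0},
}

def run_criterion(criterion, sandbox):
    assert_value = criterion["assertion"]["assert"]
    check = CHECKS.get(assert_value)
    if check is None:
        return _unsupported(assert_value)
    return check(criterion, sandbox)
```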

Scoring

Each criterion produces a score in [0, 1] and a passed boolean. The task score is a weighted mean:
task_score = sum(score_i * weight_i) / sum(weight_i)
The engine assigns a verdict:
  • pass when score >= 0.9
  • partial when 0 < score < 0.9
  • fail when score == 0 or there are no criteria
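The weighted mean and verdict rules translate directly to code. A minimal sketch (function names are illustrative):

```python
# Weighted-mean task score and verdict cutoffs, per the rules above.

def task_score(criterion_runs):
    total_weight = sum(c["weight"] for c in criterion_runs)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criterion_runs) / total_weight

def verdict(score, has_criteria):
    if not has_criteria or score == 0:
        return "fail"
    return "pass" if score >= 0.9 else "partial"

runs = [
    {"score": 1.0, "weight": 2},  # e.g. a correctness criterion
    {"score": 0.0, "weight": 1},  # e.g. a safety criterion
]
score = task_score(runs)  # (1.0*2 + 0.0*1) / 3 ≈ 0.667 → "partial"
```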

Per-Axis Scores

Criteria can declare an axis (for example correctness, safety, efficiency). The engine groups criteria by axis and produces a per-axis score using the same weighted mean. Criteria with no axis are collapsed into a __default__ bucket. The POST /v1/task-runs/{id}/complete response includes an axes object keyed by axis:
{
  "verdict": "partial",
  "score": 0.66,
  "axes": {
    "correctness": { "score": 1.0, "weight": 2 },
    "safety": { "score": 0.0, "weight": 1 }
  }
}
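The per-axis grouping can be sketched as follows; the run-dict field names (axis, score, weight) are assumptions based on the response shape above.

```python
# Group criterion runs by axis and apply the same weighted mean per group.
# Criteria with no axis fall into the "__default__" bucket.

def axis_scores(criterion_runs):
    buckets = {}
    for run in criterion_runs:
        axis = run.get("axis") or "__default__"
        buckets.setdefault(axis, []).append(run)
    axes = {}
    for axis, runs in buckets.items():
        weight = sum(r["weight"] for r in runs)
        score = sum(r["score"] * r["weight"] for r in runs) / weight
        axes[axis] = {"score": score, "weight": weight}
    return axes
```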

Evidence Payloads

Every check produces structured evidence stored on the Criterion Run. Shape varies by check type:
  • fhir-resource-state: fieldResults: [{ path, expected, actual, passed }]
  • hl7-structural: field_results: [{ path, expected, actual, passed }]
  • portal-state-match: assertionResults: [{ path, expected, actual, passed }]
  • sftp-file-present: fieldResults plus the matched path(s)
  • voice-transcript: phraseResults: [{ phrase, found, turn }]
  • x12-response: fieldResults: [{ path, expected, actual, passed }]
When a task run is marked scored: true the API elides details and field-level evidence in the completion response (to avoid leaking the scoring rubric to the agent). You can still read the full evidence later via GET /criterion-runs/{id}.
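The elision behaves like a redaction pass over the criterion-run payload. A rough sketch under stated assumptions: the details and evidence key names, and the redact_for_agent helper, are hypothetical.

```python
# Hypothetical redaction of rubric-revealing fields from the completion
# response when the task run is marked scored. Full evidence stays
# available later via the criterion-runs endpoint.

def redact_for_agent(criterion_run, scored):
    if not scored:
        return criterion_run
    public = dict(criterion_run)
    public.pop("details", None)   # rubric explanation elided
    public.pop("evidence", None)  # field-level evidence elided
    return public
```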

Benchmark Run Aggregation

Once every task run in a benchmark run is completed, the engine takes the mean of task scores to produce the benchmark run score and verdict (pass at >= 0.9, partial above zero, fail otherwise).
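The aggregation step in code, a minimal sketch (the benchmark_score helper is illustrative):

```python
# Benchmark run score: unweighted mean of task scores, with the same
# verdict cutoffs as task runs (pass >= 0.9, partial > 0, fail otherwise).

def benchmark_score(task_scores):
    if not task_scores:
        return 0.0, "fail"
    score = sum(task_scores) / len(task_scores)
    if score >= 0.9:
        v = "pass"
    elif score > 0:
        v = "partial"
    else:
        v = "fail"
    return score, v
```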

Next Steps

Criteria

Write typed assertions for each check type.

Criterion Runs API

Read per-criterion results and evidence.