The verification engine is the component that scores a Task Run. After an agent finishes a rollout and calls POST /v1/task-runs/{id}/complete, the engine walks the task’s criteria, runs the matching check against the sandbox state, and writes a Criterion Run for each.

The Flow

Check Dispatch

The assertion.assert field is a discriminator. The engine maps each value to a check implementation:
Each assert value reads from a different data source:
  • fhir-resource-state: FHIR store attached to the playground (Google Cloud Healthcare API)
  • hl7-structural: SandboxEvent rows of type hl7_outbound
  • portal-state-match: sandbox state rows keyed by correlate_by.resource
  • sftp-file-present: GCS objects under the sandbox’s SFTP prefix
  • voice-transcript: recorded voice turns for the playground
  • x12-response: X12 response records for the playground
If assertion.assert does not parse against a known spec, the criterion run is marked failed with details indicating an unsupported assertion.
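The dispatch can be sketched as a lookup table keyed by the discriminator. This is illustrative only: the handler stubs, the CHECKS mapping, and the run_criterion signature are assumptions, not the engine's real API.

```python
# Sketch of discriminator-based check dispatch. Handler names and the
# run_criterion signature are hypothetical, not the engine's real API.

def _unsupported(assert_value):
    # Unknown assert values produce a failed criterion run with details.
    return {"passed": False, "score": 0.0,
            "details": {"error": f"unsupported assertion: {assert_value}"}}

# Each known assert value maps to a check implementation (stubbed here).
CHECKS = {
    "fhir-resource-state": lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "hl7-structural":      lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "portal-state-match":  lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "sftp-file-present":   lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "voice-transcript":    lambda criterion, sandbox: {"passed": True, "score": 1.0},
    "x12-response":        lambda criterion, sandbox: {"passed": True, "score": 1.0},
}

def run_criterion(criterion, sandbox):
    assert_value = criterion["assertion"]["assert"]
    check = CHECKS.get(assert_value)
    if check is None:
        return _unsupported(assert_value)
    return check(criterion, sandbox)
```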

Scoring

Each criterion produces a score in [0, 1] and a passed boolean. The task score is a weighted mean:
task_score = sum(score_i * weight_i) / sum(weight_i)
The engine assigns a verdict:
  • pass when score >= 0.9
  • partial when 0 < score < 0.9
  • fail when score == 0 or there are no criteria
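The weighted mean and verdict rules translate directly to code. A minimal sketch (function names are illustrative):

```python
# Weighted-mean task score and verdict cutoffs, per the rules above.

def task_score(criterion_runs):
    total_weight = sum(c["weight"] for c in criterion_runs)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in criterion_runs) / total_weight

def verdict(score, has_criteria):
    if not has_criteria or score == 0:
        return "fail"
    return "pass" if score >= 0.9 else "partial"

runs = [
    {"score": 1.0, "weight": 2},  # e.g. a correctness criterion
    {"score": 0.0, "weight": 1},  # e.g. a safety criterion
]
score = task_score(runs)  # (1.0*2 + 0.0*1) / 3 ≈ 0.667 → "partial"
```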

Per-Axis Scores

Criteria can declare an axis (for example correctness, safety, efficiency). The engine groups criteria by axis and produces a per-axis score using the same weighted mean. Criteria with no axis are collapsed into a __default__ bucket. The POST /v1/task-runs/{id}/complete response includes an axes object keyed by axis:
{
  "verdict": "partial",
  "score": 0.66,
  "axes": {
    "correctness": { "score": 1.0, "weight": 2 },
    "safety": { "score": 0.0, "weight": 1 }
  }
}
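The per-axis grouping can be sketched as follows; the run-dict field names (axis, score, weight) are assumptions based on the response shape above.

```python
# Group criterion runs by axis and apply the same weighted mean per group.
# Criteria with no axis fall into the "__default__" bucket.

def axis_scores(criterion_runs):
    buckets = {}
    for run in criterion_runs:
        axis = run.get("axis") or "__default__"
        buckets.setdefault(axis, []).append(run)
    axes = {}
    for axis, runs in buckets.items():
        weight = sum(r["weight"] for r in runs)
        score = sum(r["score"] * r["weight"] for r in runs) / weight
        axes[axis] = {"score": score, "weight": weight}
    return axes
```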

Evidence Payloads

Every check produces structured evidence stored on the Criterion Run. Shape varies by check type:
  • fhir-resource-state: fieldResults: [{ path, expected, actual, passed }]
  • hl7-structural: field_results: [{ path, expected, actual, passed }]
  • portal-state-match: assertionResults: [{ path, expected, actual, passed }]
  • sftp-file-present: fieldResults plus the matched path(s)
  • voice-transcript: phraseResults: [{ phrase, found, turn }]
  • x12-response: fieldResults: [{ path, expected, actual, passed }]
When a task run is marked scored: true the API elides details and field-level evidence in the completion response (to avoid leaking the scoring rubric to the agent). You can still read the full evidence later via GET /criterion-runs/{id}.
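The elision behaves like a redaction pass over the criterion-run payload. A rough sketch under stated assumptions: the details and evidence key names, and the redact_for_agent helper, are hypothetical.

```python
# Hypothetical redaction of rubric-revealing fields from the completion
# response when the task run is marked scored. Full evidence stays
# available later via the criterion-runs endpoint.

def redact_for_agent(criterion_run, scored):
    if not scored:
        return criterion_run
    public = dict(criterion_run)
    public.pop("details", None)   # rubric explanation elided
    public.pop("evidence", None)  # field-level evidence elided
    return public
```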

Benchmark Run Aggregation

Once every task run in a benchmark run is completed, the engine takes the mean of task scores to produce the benchmark run score and verdict (pass at >= 0.9, partial above zero, fail otherwise).
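The aggregation step in code, a minimal sketch (the benchmark_score helper is illustrative):

```python
# Benchmark run score: unweighted mean of task scores, with the same
# verdict cutoffs as task runs (pass >= 0.9, partial > 0, fail otherwise).

def benchmark_score(task_scores):
    if not task_scores:
        return 0.0, "fail"
    score = sum(task_scores) / len(task_scores)
    if score >= 0.9:
        v = "pass"
    elif score > 0:
        v = "partial"
    else:
        v = "fail"
    return score, v
```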

Next Steps

Criteria

Write typed assertions for each check type.

Criterion Runs API

Read per-criterion results and evidence.