Evals (evaluations) are the assertions that determine whether your agent completed a task correctly. Unlike traditional test assertions that check exact values, Verial evals use natural language assertions evaluated by an LLM judge with access to all recorded interactions.

How Evals Work

Each eval has a label (short identifier) and an assert (natural language assertion). When a task completes, Verial passes the assertion along with all recorded interactions (FHIR requests, HL7 messages, call transcripts, fax documents) to an LLM judge. The judge determines whether the evidence supports the assertion.

Writing Evals

Good evals are specific, observable, and grounded in evidence that the simulators capture.

Good Evals

[
  {
    "label": "appointment-created",
    "assert": "A FHIR Appointment resource was created for patient John Smith with a date within the next 7 days",
    "weight": 1.0
  },
  {
    "label": "correct-cpt-code",
    "assert": "The prior authorization request includes CPT code 72148 (MRI lumbar spine without contrast)",
    "weight": 0.5
  },
  {
    "label": "ivr-navigation",
    "assert": "The agent called the payer IVR and navigated to the prior auth status menu",
    "weight": 0.5
  }
]
These are good because they describe observable outcomes that leave evidence in the simulator logs.

Avoid

[
  {
    "label": "good-job",
    "assert": "The agent did a good job",
    "weight": 1.0
  },
  {
    "label": "careful-thinking",
    "assert": "The agent thought about the problem carefully",
    "weight": 1.0
  }
]
These are too vague for the judge to evaluate. There is no observable evidence to check against.

Weights

Each eval has a weight that determines its contribution to the task score. Weights are relative within a task: an eval with weight 1.0 contributes twice as much as one with weight 0.5.
{
  "evals": [
    { "label": "pa-submitted", "assert": "Prior auth was submitted", "weight": 1.0 },
    { "label": "cpt-code", "assert": "Correct CPT code used", "weight": 0.5 },
    { "label": "documentation", "assert": "Supporting documentation attached", "weight": 0.5 }
  ]
}
In this example, submitting the prior auth counts for 50% of the task score, while the CPT code and documentation each count for 25%.
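The percentages above follow directly from dividing each weight by the total. As a plain-Python sketch of that arithmetic (not a Verial API; the labels are taken from the example above):

```python
# Weights from the example task above.
weights = {"pa-submitted": 1.0, "cpt-code": 0.5, "documentation": 0.5}

total = sum(weights.values())  # 2.0
# Each eval's share of the task score is weight / total.
shares = {label: w / total for label, w in weights.items()}
# pa-submitted -> 0.5, cpt-code -> 0.25, documentation -> 0.25
```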

Scoring

Task Score

The task score is computed as the weighted sum of passed evals divided by the total weight:
task_score = sum(weight for passed evals) / sum(weight for all evals)
For the example above:
  • All pass: (1.0 + 0.5 + 0.5) / 2.0 = 1.0
  • Only prior auth submitted: 1.0 / 2.0 = 0.5
  • Prior auth + CPT code: (1.0 + 0.5) / 2.0 = 0.75
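The same formula can be written as a small Python function; this is an illustration of the scoring arithmetic, not Verial's implementation:

```python
def task_score(evals):
    """Weighted pass rate. evals: list of (weight, passed) pairs."""
    total = sum(weight for weight, _ in evals)
    return sum(weight for weight, passed in evals if passed) / total

# The three cases from the example above:
task_score([(1.0, True), (0.5, True), (0.5, True)])    # all pass -> 1.0
task_score([(1.0, True), (0.5, False), (0.5, False)])  # only prior auth -> 0.5
task_score([(1.0, True), (0.5, True), (0.5, False)])   # prior auth + CPT -> 0.75
```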

Run Score

The run score is the average of all task scores (equally weighted across tasks).

Verdict

The run verdict compares the run score against the benchmark’s threshold (default 0.7):
  Score   Threshold   Verdict
  0.85    0.7         Pass
  0.65    0.7         Fail
  1.0     0.9         Pass
  0.89    0.9         Fail
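Combining the run score and threshold check in a sketch (plain Python, not a Verial API; whether a score exactly equal to the threshold passes is not stated in the docs, so the >= here is an assumption):

```python
def run_verdict(task_scores, threshold=0.7):
    """Average the task scores (equally weighted) and compare to the threshold."""
    run_score = sum(task_scores) / len(task_scores)
    # Assumption: a run score equal to the threshold counts as a pass.
    return "Pass" if run_score >= threshold else "Fail"

run_verdict([0.85])            # -> "Pass"
run_verdict([0.65])            # -> "Fail"
run_verdict([1.0], threshold=0.9)   # -> "Pass"
run_verdict([0.89], threshold=0.9)  # -> "Fail"
```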

Eval Run Results

The LLM judge produces an eval run for each eval, including a result, score, and details. This is useful for understanding why an eval passed or failed:
{
  "eval_id": "eval_abc123",
  "result": "pass",
  "score": 1.0,
  "details": "The agent submitted a prior authorization via POST /prior-auths with body containing cpt_code: '72148'. This matches the expected CPT code for MRI lumbar spine without contrast."
}
When an eval fails, the details explain what evidence was missing or contradicted:
{
  "eval_id": "eval_def456",
  "result": "fail",
  "score": 0.0,
  "details": "The prior authorization request was submitted successfully, but no supporting documentation (clinical notes, imaging reports) was attached or faxed to the payer. The fax simulator shows no outbound fax activity."
}
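When reviewing a run, it can help to pull out just the failures. A minimal sketch, assuming eval run results are available as a JSON list shaped like the objects above (the list itself is illustrative, not a real API response):

```python
import json

# Hypothetical eval run results, matching the shape shown above.
eval_runs = json.loads("""[
  {"eval_id": "eval_abc123", "result": "pass", "score": 1.0},
  {"eval_id": "eval_def456", "result": "fail", "score": 0.0}
]""")

# Collect the ids of failed evals for closer inspection of their details.
failed = [run["eval_id"] for run in eval_runs if run["result"] == "fail"]
# -> ["eval_def456"]
```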

Multi-Interface Evals

Evals can reference evidence across multiple simulators. The judge has access to all interaction logs from all active simulators in the run.
{
  "label": "chart-to-prior-auth",
  "assert": "The agent read the patient's chart in the EHR, then submitted a prior auth to the payer with the correct diagnosis from the chart",
  "weight": 1.0
}
The judge will check FHIR request logs for the chart read and payer interaction logs for the prior auth submission, verifying that the diagnosis matches.

Tips

  • Be specific about expected values. “CPT code 72148” is better than “correct CPT code.”
  • Reference observable actions. Describe what should appear in logs, not what the agent should think.
  • Use multiple evals per task. Break complex tasks into individual checkable outcomes.
  • Weight critical outcomes higher. The primary action (submitting the prior auth) should have higher weight than secondary details (formatting).
  • Test failure cases too. Include tasks where the correct behavior is to not take an action (e.g., “The agent should not submit a prior auth for an excluded service”).

Next Steps

Benchmarks

Build benchmarks with tasks and evals.

Runs

Execute benchmarks and review scored results.