Metrics & Evaluation

Metrics are how RubricHQ turns a conversation into a score. After each call, the transcript and audio are evaluated against the metrics attached to the scenario, plus the standard metrics that run on every voice call by default.

Two kinds of metric

  • Custom metrics — judged by an LLM against the transcript. You define what to check in plain language (e.g. “Did the agent verify the caller’s identity before sharing account details?”). These can be boolean, scored, or categorical.
  • Standard metrics — computed from the call audio by RubricHQ’s analysis pipeline. They apply automatically to every voice run unless disabled on the agent.

Standard metrics

Standard metrics measure objective qualities of the conversation, including:

  • Latency — how long the agent takes to respond.
  • User interruptions / AI interruptions — who interrupted whom, and how often.
  • False interruptions — the agent stopping when it shouldn’t have.
  • Interruption non-adherence — the agent talking over the caller after being interrupted.
  • Silence detection — notable gaps of dead air.
  • Stop time after interruption — how quickly the agent yields when interrupted.
  • Words per minute — speaking pace.
  • Transcription accuracy — how cleanly the agent’s speech transcribes.
  • Voice change / tone clarity — voice quality signals.
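
To make two of these concrete, here is a minimal Python sketch of how response latency and words per minute could be derived from timestamped turns. It is an illustration only: the `Turn` record, the function names, and the exact definitions (latency as the gap between the end of a user turn and the start of the next agent turn) are assumptions, not a description of RubricHQ's analysis pipeline.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    # Hypothetical record for illustration; not a RubricHQ type.
    speaker: str   # "user" or "agent"
    start: float   # seconds from the start of the call
    end: float
    text: str


def response_latencies(turns: list[Turn]) -> list[float]:
    """Gap between the end of each user turn and the start of the agent turn that follows it."""
    return [
        max(0.0, nxt.start - prev.end)  # clamp overlaps (interruptions) to zero
        for prev, nxt in zip(turns, turns[1:])
        if prev.speaker == "user" and nxt.speaker == "agent"
    ]


def words_per_minute(turns: list[Turn], speaker: str = "agent") -> float:
    """Words spoken by `speaker` divided by that speaker's talk time in minutes."""
    own = [t for t in turns if t.speaker == speaker]
    seconds = sum(t.end - t.start for t in own)
    words = sum(len(t.text.split()) for t in own)
    return words / (seconds / 60) if seconds > 0 else 0.0


turns = [
    Turn("user", 0.0, 2.0, "Hi, I need help with my bill"),
    Turn("agent", 3.0, 6.0, "Sure, can you confirm your account number"),
    Turn("user", 6.0, 8.0, "It is one two three"),
    Turn("agent", 9.5, 11.5, "Thanks, one moment"),
]
print(response_latencies(turns))        # [1.0, 1.5]
print(round(words_per_minute(turns)))   # 120
```

Here the agent speaks 10 words in 5 seconds of talk time, which works out to 120 words per minute.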

Critical metrics & call success

Each metric can be marked critical. A run’s call success is determined by its critical metrics:

  • If all critical metrics pass (or no metric is marked critical), the run is a success.
  • If any critical metric fails, the run fails.

The Batch Run verdict aggregates call success across its runs into a pass rate. Non-critical metrics are still scored and reported; they just don’t fail the run on their own.
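
The success rule and the pass rate can be sketched in a few lines of Python. This is an illustration only: `MetricResult`, `call_success`, and `pass_rate` are hypothetical names, not part of RubricHQ's API, and the sketch assumes every metric has already been reduced to a pass/fail result.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    # Hypothetical record for illustration; not a RubricHQ type.
    name: str
    passed: bool
    critical: bool = False


def call_success(results: list[MetricResult]) -> bool:
    """True when no critical metric failed; also true when there are no critical metrics."""
    return all(r.passed for r in results if r.critical)


def pass_rate(runs: list[list[MetricResult]]) -> float:
    """Fraction of runs in a batch whose call success is true."""
    if not runs:
        return 0.0  # assumption: an empty batch reports 0 rather than raising
    return sum(call_success(run) for run in runs) / len(runs)


batch = [
    # Only a non-critical metric fails, so the run still succeeds.
    [MetricResult("identity_verified", True, critical=True),
     MetricResult("words_per_minute", False)],
    # A critical metric fails, so the run fails.
    [MetricResult("identity_verified", False, critical=True)],
    # No metrics at all: success by the "or there are none" rule.
    [],
]
print([call_success(run) for run in batch])  # [True, False, True]
```

Two of the three runs succeed, so `pass_rate(batch)` is two thirds.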

How evaluation runs

Evaluation happens after the call, asynchronously:

  1. The call completes and the transcript and recording are saved.
  2. Custom metrics are scored by an LLM against the transcript.
  3. Standard metrics are computed from the audio.
  4. Once every metric has a result, the run finalizes and call success is set.

Because evaluation runs after the call, a run can be “completed” while its metrics are still “pending.” The run only reaches a final verdict once all metric results are in.
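
The gap between “completed” and a final verdict can be modeled as below. This is a minimal sketch under stated assumptions: `MetricStatus` and `final_verdict` are hypothetical names, and the status values are illustrative rather than the ones RubricHQ reports.

```python
from enum import Enum
from typing import Optional


class MetricStatus(Enum):
    # Hypothetical statuses for illustration; not RubricHQ's actual values.
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


def final_verdict(metrics: dict[str, tuple[MetricStatus, bool]]) -> Optional[bool]:
    """Return None while any metric is pending, otherwise the call-success verdict.

    `metrics` maps a metric name to (status, is_critical).
    """
    if any(status is MetricStatus.PENDING for status, _ in metrics.values()):
        return None  # the call has finished, but evaluation has not
    return all(
        status is MetricStatus.PASSED
        for status, is_critical in metrics.values()
        if is_critical
    )


run = {
    "identity_verified": (MetricStatus.PASSED, True),   # critical
    "latency": (MetricStatus.PENDING, False),           # still being computed
}
print(final_verdict(run))  # None

run["latency"] = (MetricStatus.FAILED, False)
print(final_verdict(run))  # True
```

A client polling for results would treat `None` as "check again later" rather than as a failure.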

Managing metrics

  • Attach the custom metrics you care about to each Scenario.
  • Mark the ones that should fail a run as critical.
  • Disable any standard metrics that don’t apply, per agent.

Keep critical metrics tight — reserve “critical” for the checks that genuinely mean the agent failed, and leave informational checks non-critical so they surface without blocking.