Batch Runs | RubricHQ Docs

A Batch Run executes a set of scenarios against an agent and collects the results in one place. Each scenario in the batch becomes one or more runs. A run is a single simulated conversation with its own transcript, recording, and scores.

Starting a batch

Under Batch Runs → New Batch:

Select the agent to test.
Select the scenarios (directly, or by tag).
Choose a channel — phone, web, or text (defaults to phone if the agent has a number, otherwise web).
Set a frequency — how many times to run each scenario. Higher frequency smooths out flakiness from real-world voice variability.
Set a success threshold — the minimum pass rate required for the batch to be considered passing.

Click Run. RubricHQ launches the runs in parallel.
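The settings above can be pictured as a small data structure. The following is a minimal sketch, not RubricHQ's API: `BatchConfig`, `resolve_channel`, and `total_runs` are hypothetical names introduced for illustration, and the run count assumes each selected scenario is run exactly `frequency` times.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchConfig:
    """Hypothetical container for the New Batch form fields."""
    agent_id: str
    scenario_ids: list[str]
    frequency: int               # how many times to run each scenario
    success_threshold: float     # minimum pass rate, as a percentage
    channel: Optional[str] = None  # "phone", "web", or "text"; None means "use the default"


def resolve_channel(config: BatchConfig, agent_has_number: bool) -> str:
    """Apply the documented default: phone if the agent has a number, otherwise web."""
    if config.channel is not None:
        return config.channel
    return "phone" if agent_has_number else "web"


def total_runs(config: BatchConfig) -> int:
    """Assumption: every scenario is run `frequency` times."""
    return len(config.scenario_ids) * config.frequency


config = BatchConfig(
    agent_id="agent-1",
    scenario_ids=["refund", "reschedule", "cancel"],
    frequency=4,
    success_threshold=80.0,
)
print(resolve_channel(config, agent_has_number=True))   # phone
print(resolve_channel(config, agent_has_number=False))  # web
print(total_runs(config))                               # 12
```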

What happens during a run

For each run, RubricHQ:

Connects to the agent over the chosen channel (dials the number, joins the room, or opens a text session).
Drives the conversation as the scenario’s caller until the goal is met, the call ends, or a time limit is reached.
Captures the transcript and, for voice channels, a stereo recording (agent on one channel, simulated caller on the other).
Evaluates the transcript against the scenario’s metrics, and runs standard metrics on the audio.

Verdict & pass rate

Once every run finishes evaluating, the batch settles to a verdict:

A run passes when the call completes and all of its critical metrics pass. A run with no critical metrics passes as long as its call completes; a call that errors out counts as a failed run.
The pass rate is (passed runs ÷ total runs) × 100, expressed as a percentage.
The verdict is passed when the pass rate meets or exceeds your success threshold, and failed otherwise.
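The verdict rules above can be worked through in a few lines. This is an illustrative sketch of the arithmetic only: `RunResult`, `run_passed`, `pass_rate`, and `verdict` are hypothetical names, not part of RubricHQ.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of one run's outcome."""
    completed: bool  # False if the call errored out
    critical_metrics: dict[str, bool] = field(default_factory=dict)


def run_passed(run: RunResult) -> bool:
    # all() over an empty collection is True, so a completed run
    # with no critical metrics passes.
    return run.completed and all(run.critical_metrics.values())


def pass_rate(runs: list[RunResult]) -> float:
    if not runs:
        raise ValueError("a batch needs at least one run")
    passed = sum(1 for run in runs if run_passed(run))
    return passed / len(runs) * 100


def verdict(runs: list[RunResult], success_threshold: float) -> str:
    return "passed" if pass_rate(runs) >= success_threshold else "failed"


runs = [
    RunResult(completed=True, critical_metrics={"greets_caller": True}),
    RunResult(completed=True, critical_metrics={"greets_caller": False}),
    RunResult(completed=True),   # no critical metrics: passes
    RunResult(completed=True, critical_metrics={"greets_caller": True}),
]
print(pass_rate(runs))      # 75.0
print(verdict(runs, 75.0))  # passed (meets the threshold exactly)
print(verdict(runs, 80.0))  # failed
```

Because the comparison is "meets or exceeds", a batch whose pass rate lands exactly on the threshold passes.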

A batch stays pending until every call has completed and all metric evaluations have finished — metric scoring runs asynchronously after each call ends. Wait for the verdict rather than relying on call completion alone.
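In a script, waiting for the verdict might look like the polling loop below. This is a sketch under stated assumptions: `fetch_status` is a hypothetical callable that you supply (however you look up a batch's state), and the status strings `pending`, `passed`, and `failed` are taken from the wording on this page rather than from a documented API.

```python
import time
from typing import Callable

SETTLED = ("passed", "failed")


def wait_for_verdict(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the batch leaves the pending state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in SETTLED:
            return status
        if clock() >= deadline:
            raise TimeoutError("batch did not settle before the timeout")
        sleep(poll_seconds)


# Example with a stand-in for the real status lookup:
statuses = iter(["pending", "pending", "passed"])
print(wait_for_verdict(lambda: next(statuses), poll_seconds=0.0))  # passed
```

The loop keys off the verdict, not off call completion, because metric scoring can still be in progress after the last call has ended.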

Reviewing results

Open a completed batch to see, per run:

Pass / fail and the reason (e.g. which critical metric failed).
The full conversation transcript.
The call recording, with the agent and simulated caller separated.
Every metric score, custom and standard.

Batches can also be triggered from CI to gate deploys on agent quality — see the GitHub Actions CI/CD integration.