Batch Runs

A Batch Run executes a set of scenarios against an agent and collects the results in one place. Each scenario in the batch produces one or more runs, and each run is a single simulated conversation with its own transcript, recording, and scores.

Starting a batch

Under Batch Runs → New Batch:

  1. Select the agent to test.
  2. Select the scenarios (directly, or by tag).
  3. Choose a channel — phone, web, or text (defaults to phone if the agent has a number, otherwise web).
  4. Set a frequency — how many times to run each scenario. Higher frequency smooths out flakiness from real-world voice variability.
  5. Set a success threshold — the minimum pass rate required for the batch to be considered passing.

Click Run. RubricHQ launches the runs in parallel.
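
To see how frequency affects the size of a batch, the sketch below estimates the total number of runs, assuming every selected scenario is run exactly `frequency` times. `BatchConfig`, its field names, and the sample agent and scenario names are hypothetical and exist only for illustration; they are not a RubricHQ API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchConfig:
    """Hypothetical container mirroring the New Batch form fields."""
    agent: str
    scenarios: List[str]
    channel: str              # "phone", "web", or "text"
    frequency: int            # times each scenario is run
    success_threshold: float  # minimum pass rate, in percent


def total_runs(config: BatchConfig) -> int:
    """Assumes every scenario is run exactly `frequency` times."""
    return len(config.scenarios) * config.frequency


config = BatchConfig(
    agent="support-agent",
    scenarios=["refund-request", "reschedule", "angry-caller"],
    channel="phone",
    frequency=5,
    success_threshold=80.0,
)
print(total_runs(config))  # 3 scenarios x 5 runs each = 15
```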

What happens during a run

For each run, RubricHQ:

  1. Connects to the agent over the chosen channel (dials the number, joins the room, or opens a text session).
  2. Drives the conversation as the scenario’s caller until the goal is met, the call ends, or a time limit is reached.
  3. Captures the transcript and, for voice channels, a stereo recording (agent on one channel, simulated caller on the other).
  4. Evaluates the transcript against the scenario’s metrics, and runs standard metrics on the audio.

Verdict & pass rate

Once every run finishes evaluating, the batch settles to a verdict:

  • A run passes when the call completes and all of its critical metrics pass. A run with no critical metrics passes as long as its call completes; a call that errors out counts as a failed run.
  • The pass rate is passed runs ÷ total runs × 100.
  • The verdict is passed when the pass rate meets or exceeds your success threshold, and failed otherwise.
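
The rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RubricHQ's implementation: the `Run` record, its field names, and the metric name `verified_identity` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical run record; field names are illustrative."""
    call_completed: bool
    # Maps critical metric name -> whether it passed. May be empty.
    critical_metrics: Dict[str, bool] = field(default_factory=dict)


def run_passed(run: Run) -> bool:
    # all() over an empty collection is True, so a run with no critical
    # metrics passes whenever its call completed.
    return run.call_completed and all(run.critical_metrics.values())


def pass_rate(runs: List[Run]) -> float:
    """Passed runs / total runs x 100."""
    if not runs:
        raise ValueError("a batch needs at least one run")
    passed = sum(1 for r in runs if run_passed(r))
    return passed / len(runs) * 100


def verdict(runs: List[Run], success_threshold: float) -> str:
    """'passed' when the pass rate meets or exceeds the threshold."""
    return "passed" if pass_rate(runs) >= success_threshold else "failed"


runs = [
    Run(True, {"verified_identity": True}),   # passes
    Run(True, {"verified_identity": False}),  # critical metric failed
    Run(True),                                # no critical metrics: passes
    Run(False),                               # call errored out: fails
]
print(pass_rate(runs))      # 2 of 4 runs pass -> 50.0
print(verdict(runs, 50.0))  # passed
print(verdict(runs, 75.0))  # failed
```

Note that the comparison is inclusive: a pass rate exactly equal to the threshold yields a passed verdict.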

A batch stays pending until every call has completed and all metric evaluations have finished — metric scoring runs asynchronously after each call ends. Wait for the verdict rather than relying on call completion alone.
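
A script that waits on a batch might poll until the status leaves pending, as in the sketch below. It assumes the batch status can be read as one of "pending", "passed", or "failed"; `fetch_status` is a hypothetical caller-supplied function, since this page does not describe an API for reading batch status, and the poll interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_verdict(
    fetch_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
) -> str:
    """Poll until the batch leaves "pending", then return the verdict.

    `fetch_status` is a hypothetical caller-supplied function returning
    "pending", "passed", or "failed" for the batch.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status != "pending":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("batch is still pending")
        time.sleep(poll_seconds)


# Usage with a stand-in status source that settles on the third poll.
statuses = iter(["pending", "pending", "passed"])
result = wait_for_verdict(lambda: next(statuses), poll_seconds=0.01)
print(result)  # passed
```

Checking the status rather than call completion keeps the script from reading a result before asynchronous metric scoring has finished.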

Reviewing results

Open a completed batch to see, per run:

  • Pass / fail and the reason (e.g. which critical metric failed).
  • The full conversation transcript.
  • The call recording, with the agent and simulated caller separated.
  • Every metric score, custom and standard.

Batches can also be triggered from CI to gate deploys on agent quality — see the GitHub Actions CI/CD integration.