Batch Runs
A Batch Run executes a set of scenarios against an agent and collects the results in one place. Each scenario in the batch produces one or more runs. A run is a single simulated conversation with its own transcript, recording, and scores.
Starting a batch
Under Batch Runs → New Batch:
- Select the agent to test.
- Select the scenarios (directly, or by tag).
- Choose a channel — phone, web, or text (defaults to phone if the agent has a number, otherwise web).
- Set a frequency — how many times to run each scenario. Higher frequency smooths out flakiness from real-world voice variability.
- Set a success threshold — the minimum pass rate required for the batch to be considered passing.
Click Run. RubricHQ launches the runs in parallel.
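As a rough illustration of how these settings combine, the sketch below models the form fields in Python. `BatchConfig` and its field names are hypothetical and are not part of any RubricHQ API. It also assumes that the total number of runs is the number of scenarios multiplied by the frequency, which follows from the description above but is not stated explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchConfig:
    """Hypothetical stand-in for the New Batch form (not a RubricHQ API)."""

    agent_has_phone_number: bool
    scenario_count: int
    frequency: int = 1                # times each scenario is run
    success_threshold: float = 80.0   # minimum pass rate in percent (illustrative value)
    channel: Optional[str] = None     # "phone", "web", or "text"; None means "use the default"

    def resolved_channel(self) -> str:
        """Apply the documented default: phone if the agent has a number, otherwise web."""
        if self.channel is not None:
            return self.channel
        return "phone" if self.agent_has_phone_number else "web"

    def total_runs(self) -> int:
        """Assumed: every scenario is run `frequency` times."""
        return self.scenario_count * self.frequency


config = BatchConfig(agent_has_phone_number=True, scenario_count=4, frequency=3)
print(config.resolved_channel(), config.total_runs())
```

Under these assumptions, four scenarios at a frequency of 3 launch twelve runs, so raising the frequency increases both confidence in the pass rate and the number of calls placed.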
What happens during a run
For each run, RubricHQ:
- Connects to the agent over the chosen channel (dials the number, joins the room, or opens a text session).
- Drives the conversation as the scenario’s caller until the goal is met, the call ends, or a time limit is reached.
- Captures the transcript and, for voice channels, a stereo recording (agent on one channel, simulated caller on the other).
- Evaluates the transcript against the scenario’s metrics, and runs standard metrics on the audio.
Verdict & pass rate
Once every run finishes evaluating, the batch settles to a verdict:
- A run passes when the call completes and all of its critical metrics pass. Runs with no critical metrics pass on that dimension; calls that error out count as failures.
- The pass rate is passed runs ÷ total runs × 100.
- The verdict is passed when the pass rate meets or exceeds your success threshold, and failed otherwise.
A batch stays pending until every call has completed and all metric evaluations have finished — metric scoring runs asynchronously after each call ends. Wait for the verdict rather than relying on call completion alone.
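The verdict rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic. The `Run` type and function names are hypothetical, and the sketch assumes every run has already finished evaluating.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical record of one finished run."""

    completed: bool  # False if the call errored out
    critical_metrics: Dict[str, bool] = field(default_factory=dict)  # metric name -> passed?


def run_passes(run: Run) -> bool:
    # A run passes when the call completes and all critical metrics pass.
    # all() over an empty dict is True, so runs with no critical metrics
    # pass on that dimension.
    return run.completed and all(run.critical_metrics.values())


def pass_rate(runs: List[Run]) -> float:
    if not runs:
        raise ValueError("a batch needs at least one run")
    passed = sum(1 for run in runs if run_passes(run))
    return passed / len(runs) * 100


def verdict(runs: List[Run], success_threshold: float) -> str:
    return "passed" if pass_rate(runs) >= success_threshold else "failed"


runs = [
    Run(completed=True, critical_metrics={"greeting": True}),
    Run(completed=True),                                      # no critical metrics
    Run(completed=True, critical_metrics={"greeting": False}),
    Run(completed=False),                                     # call errored out
]
print(pass_rate(runs))    # 50.0
print(verdict(runs, 50))  # passed
print(verdict(runs, 75))  # failed
```

Because the comparison is "meets or exceeds", a batch whose pass rate equals the threshold exactly still passes.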
Reviewing results
Open a completed batch to see, per run:
- Pass / fail and the reason (e.g. which critical metric failed).
- The full conversation transcript.
- The call recording, with the agent and simulated caller separated.
- Every metric score, custom and standard.
Batches can also be triggered from CI to gate deploys on agent quality — see the GitHub Actions CI/CD integration.
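For a custom CI script, the key point is to wait for the verdict instead of treating call completion as the end of the batch. The sketch below shows one possible polling loop. It does not use any real RubricHQ API: `fetch_status` is a hypothetical callable that you would implement yourself, assumed to return "pending", "passed", or "failed". The poll interval and timeout are illustrative values.

```python
import time
from typing import Callable


def wait_for_verdict(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the batch leaves the pending state, then return its verdict.

    `fetch_status` is a hypothetical hook; how the status is actually
    retrieved depends on your integration.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in ("passed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("batch was still pending when the timeout expired")
        sleep(poll_seconds)


def exit_code_for(verdict: str) -> int:
    # A non-zero exit code fails the CI job and blocks the deploy.
    return 0 if verdict == "passed" else 1


# Simulated status sequence standing in for a real lookup.
statuses = iter(["pending", "pending", "passed"])
result = wait_for_verdict(lambda: next(statuses), sleep=lambda _: None)
print(result, exit_code_for(result))
```

Passing `sleep` as a parameter keeps the loop testable without real delays. In a pipeline, the script would end with `sys.exit(exit_code_for(result))`.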