Prompt Optimizer
The Prompt Optimizer is in Beta. The workflow and results may change as we refine it.
The Prompt Optimizer helps you improve an agent’s instructions and prove the improvement with data. Instead of editing a prompt and hoping it’s better, you compare versions against the same scenarios and watch the metrics move.
The optimization loop
- Start from a baseline — run a batch against your current agent instructions and note the pass rate and metric scores.
- Propose a new prompt — edit the instructions, or have RubricHQ suggest a revision based on where the agent failed.
- Compare — view a side-by-side diff of the old and new instructions so every change is explicit.
- Re-run — test the new version against the same scenarios for an apples-to-apples comparison.
- Apply or reset — apply the new prompt to the agent if it wins, or reset to the previous version if it doesn’t.
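The final step of the loop can be sketched in a few lines of Python. This is a minimal illustration, not RubricHQ's API: `BatchResult` and `decide` are hypothetical names, and "wins" is assumed here to mean a strictly higher pass rate on an identical scenario set. In practice you would weigh the metric scores as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchResult:
    """Outcome of one prompt version on a scenario set (hypothetical shape)."""
    scenario_ids: tuple[str, ...]
    passed: tuple[bool, ...]

    def __post_init__(self) -> None:
        # Every scenario needs exactly one pass/fail outcome.
        if not self.passed or len(self.passed) != len(self.scenario_ids):
            raise ValueError("need one pass/fail outcome per scenario")

    @property
    def pass_rate(self) -> float:
        # Fraction of scenarios that passed.
        return sum(self.passed) / len(self.passed)


def decide(baseline: BatchResult, candidate: BatchResult) -> str:
    """Return "apply" if the candidate beats the baseline, else "reset"."""
    # Refuse to compare runs that did not use the same scenarios.
    if baseline.scenario_ids != candidate.scenario_ids:
        raise ValueError("both versions must run against the same scenarios")
    return "apply" if candidate.pass_rate > baseline.pass_rate else "reset"


baseline = BatchResult(("s1", "s2", "s3", "s4"), (True, False, True, False))
candidate = BatchResult(("s1", "s2", "s3", "s4"), (True, True, True, False))
print(decide(baseline, candidate))  # apply
```

The scenario check comes first on purpose: a higher pass rate on a different test set says nothing about the prompt.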
Comparing versions
The optimizer shows the two prompt versions and their results next to each other:
- A diff of the instructions, highlighting exactly what changed.
- The metric scores for each version, so regressions are as visible as improvements.
- The pass rate across the scenario set for each version.
Compare versions on the same scenarios at the same frequency (the number of times each scenario is run). If the test set changes along with the prompt, you cannot attribute a difference in results to the prompt change.
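To reproduce this kind of comparison outside the product, for example in a review note, the sketch below uses Python's standard `difflib` module for the instruction diff and a plain subtraction for the metric scores. The function names, the example instructions, and the metric names (`latency_ms`, `interruptions`) are illustrative assumptions, not RubricHQ identifiers.

```python
import difflib


def instruction_diff(old: str, new: str) -> list[str]:
    """Unified diff of two instruction texts, one list entry per line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="baseline", tofile="candidate", lineterm="",
    ))


def metric_deltas(old: dict[str, float], new: dict[str, float]) -> dict[str, float]:
    """Change in each metric present in both versions (candidate minus baseline)."""
    return {name: new[name] - old[name] for name in old if name in new}


old_prompt = "Greet the caller.\nAsk for the order number."
new_prompt = "Greet the caller.\nAsk for the order number before anything else."

for line in instruction_diff(old_prompt, new_prompt):
    print(line)

# Hypothetical metric scores for the two versions.
print(metric_deltas(
    {"latency_ms": 900.0, "interruptions": 3.0},
    {"latency_ms": 820.0, "interruptions": 3.0},
))
```

Whether a negative delta is an improvement depends on the metric. For latency, lower is better, so each delta has to be read against the metric's direction.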
Getting reliable comparisons
- Hold the test set fixed — same scenarios, same channel, same frequency across versions.
- Raise the frequency — running each scenario several times averages out voice flakiness so a small real improvement isn’t lost in noise.
- Change one thing at a time — isolate a single prompt change per comparison to know what caused the result.
- Watch the metrics, not just pass/fail — a version can keep the same pass rate while improving latency or reducing interruptions.
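To see why raising the frequency helps, here is a rough sketch of how the noise in a measured pass rate shrinks as the number of runs grows. It assumes each run is an independent pass/fail trial with the same underlying pass probability, which is a simplification. The function name is illustrative, and `runs` stands for the total number of runs (scenarios times frequency).

```python
import math


def pass_rate_standard_error(pass_rate: float, runs: int) -> float:
    """Standard error of an observed pass rate over `runs` independent runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    # Binomial standard error: sqrt(p * (1 - p) / n).
    return math.sqrt(pass_rate * (1.0 - pass_rate) / runs)


# Same 10 scenarios, run 1, 4, and 16 times each, at a true pass rate of 80%.
for runs in (10, 40, 160):
    se = pass_rate_standard_error(0.8, runs)
    print(f"{runs:>3} runs: pass rate 0.80 +/- {se:.3f}")
```

Under these assumptions, quadrupling the number of runs halves the noise. A difference between versions that is smaller than about two standard errors is hard to distinguish from chance.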
Treat prompt changes like code changes: baseline, change one thing, measure, and keep the version that the metrics — not intuition — say is better.