GitHub Actions CI/CD

Run your RubricHQ agent test suites automatically on every push or pull request and block deploys when agents regress. The integration works via a published GitHub Action (RubricHQ/agent-test-action) or through the raw REST API — use the API path for GitLab CI, Jenkins, CircleCI, or any other CI platform.


Prerequisites

  • An agent with at least one scenario in RubricHQ.
  • A RubricHQ API key — Settings → API Keys → Create key.
  • In your GitHub repo:
    • Secret RUBRICHQ_API_KEY — your API key (Settings → Secrets and variables → Actions → New repository secret).
    • Variable RUBRICHQ_AGENT_ID — the numeric ID of the agent to test (Settings → Secrets and variables → Actions → Variables tab → New repository variable).

Using the GitHub Action

Quick-start workflow

Add this file to .github/workflows/agent-tests.yml in your repository:

name: Agent Tests
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Run RubricHQ agent tests
        uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID }}
          scenario_ids: "12,15,22"

The step exits 0 when the verdict is passed and 1 when the verdict is failed or the timeout is reached, so the job fails whenever your agents regress or the run does not finish in time.

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| api_key | yes | | RubricHQ API key. Use a secret: ${{ secrets.RUBRICHQ_API_KEY }}. |
| agent_id | yes | | Numeric ID of the agent to test. |
| scenario_ids | yes | | Comma-separated list of scenario IDs to run (e.g. 12,15,22). |
| tags | no | | Optional. Tags to also include, on top of scenario_ids (union). |
| frequency | no | 1 | How many times to run each scenario. Accepts 1–5. |
| success_threshold | no | 100 | Minimum pass rate (0–100) required for the verdict to be passed. |
| timeout | no | 3600 | Seconds to wait for the run to complete before failing the step. |
| poll_interval | no | 15 | Seconds between status-poll requests. |
| channel | no | phone if the agent has a phone number, otherwise web | How each scenario is run: web, phone, or text. |
| api_url | no | https://api.rubrichq.io | Override for self-hosted or staging deployments. |

scenario_ids is required. tags is optional — when provided, matching scenarios are added on top (union).
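The union rule can be pictured with a small Python sketch. It is illustrative only: `select_scenarios` and the `scenario_tags` mapping are hypothetical names, and matching a scenario when it carries any one of the requested tags is an assumption, since the text above only says that matching scenarios are added on top.

```python
def select_scenarios(scenario_ids, tags, scenario_tags):
    """Return the sorted union of explicit IDs and tag-matched IDs.

    scenario_tags is a hypothetical local mapping of scenario ID -> set of tags.
    Assumption: a scenario matches when it carries at least one requested tag.
    Tags are compared as exact strings, so case and whitespace matter.
    """
    selected = set(scenario_ids)
    wanted = set(tags or [])
    for scenario_id, its_tags in scenario_tags.items():
        if wanted & set(its_tags):
            selected.add(scenario_id)
    return sorted(selected)
```

For example, with scenario 30 tagged `billing` and scenario 31 tagged `Billing`, requesting `scenario_ids` 12 and 15 plus the tag `billing` would select 12, 15, and 30 under these assumptions.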

Outputs

| Output | Description |
| --- | --- |
| test_run_id | Numeric ID of the test run created. |
| verdict | passed or failed. |
| pass_rate | Percentage of runs that passed (e.g. 80.0). |
| report_url | Link to the full run report in the RubricHQ dashboard. |

Testing over web, phone, or text

The channel input controls how each scenario is run:

  • web — a browser/WebSocket voice call.
  • phone — a real phone call placed over Twilio (the agent must have a phone number configured).
  • text — a text-only conversation.

When channel is omitted, it defaults to phone if the agent has a phone number, otherwise web.
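The defaulting rule above can be written as a short Python sketch. The function name `resolve_channel` and the `agent_has_phone_number` flag are hypothetical; the service's real resolution logic is not shown in this guide.

```python
VALID_CHANNELS = ("web", "phone", "text")


def resolve_channel(channel, agent_has_phone_number):
    """Mirror the documented default: phone if the agent has a number, else web."""
    if channel is None:
        return "phone" if agent_has_phone_number else "web"
    if channel not in VALID_CHANNELS:
        raise ValueError(f"channel must be one of {VALID_CHANNELS}, got {channel!r}")
    return channel
```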

Run a single channel by setting channel on the step:

- uses: RubricHQ/agent-test-action@v1
  with:
    api_key: ${{ secrets.RUBRICHQ_API_KEY }}
    agent_id: ${{ vars.RUBRICHQ_AGENT_ID }}
    scenario_ids: "12,15,22"
    channel: web # or: phone, text

To exercise both web and phone on every push, use two jobs — each reports its own verdict, and either failing blocks the deploy:

jobs:
  web-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID }}
          scenario_ids: "12,15,22"
          channel: web

  phone-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID }}
          scenario_ids: "12,15,22"
          channel: phone

The two jobs run in parallel. To stop a flaky phone job from blocking the deploy while you stabilize it, add continue-on-error: true to the phone job’s step — it still reports a verdict but won’t fail the workflow.

Gating deploys

Add a deploy job that only runs after agent-tests passes:

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Run RubricHQ agent tests
        uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID }}
          scenario_ids: "12,15,22"

  deploy:
    needs: agent-tests
    runs-on: ubuntu-latest
    steps:
      - name: Deploy
        run: ./scripts/deploy.sh

If agent-tests fails, deploy is skipped automatically.

Staging → production pipeline

Use two sequential jobs with different secrets to promote only builds that pass in staging first:

jobs:
  test-staging:
    runs-on: ubuntu-latest
    steps:
      - name: Run agent tests against staging
        uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY_STAGING }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID_STAGING }}
          scenario_ids: "12,15,22"
          success_threshold: 90

  test-production:
    needs: test-staging
    runs-on: ubuntu-latest
    steps:
      - name: Run agent tests against production
        uses: RubricHQ/agent-test-action@v1
        with:
          api_key: ${{ secrets.RUBRICHQ_API_KEY_PROD }}
          agent_id: ${{ vars.RUBRICHQ_AGENT_ID_PROD }}
          scenario_ids: "12,15,22"
          success_threshold: 100

Using the API directly

For GitLab CI, Jenkins, CircleCI, or any CI system that can run shell commands, call the REST API directly.

Trigger a test run

curl -sS -X POST https://api.rubrichq.io/api/public/v1/test_runs \
  -H "Authorization: Bearer $RUBRICHQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"agent_id": 1, "scenario_ids": [12, 15, 22], "success_threshold": 90}'

The API returns 202 Accepted immediately — the run is queued, not yet complete:

{
  "test_run_id": 42,
  "status": "pending",
  "run_count": 5,
  "scenario_count": 5,
  "frequency": 1,
  "success_threshold": 90,
  "status_url": "https://api.rubrichq.io/api/public/v1/test_runs/42",
  "report_url": "https://app.rubrichq.io/batch-run/42"
}

Request fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agent_id | integer | yes | The agent to test. |
| scenario_ids | int array or comma-separated string | yes | Scenarios to run. |
| tags | string array or comma-separated string | no | Optional. Also include scenarios matching these tags (union). |
| frequency | integer | no | Runs per scenario (1–5, default 1). |
| success_threshold | integer | no | Minimum pass rate to pass (0–100, default 100). |
| testing_mode | string | no | voice or text (default voice). |
| channel | string | no | phone, web, or text. Defaults to phone when the agent has a phone number, otherwise web. |
| name | string | no | Human-readable label for the run (shows in the dashboard). |
| ci_metadata | object | no | Arbitrary JSON stored with the run and echoed back in status responses. Put commit SHA, branch name, and CI run URL here for traceability. |

scenario_ids is required; tags is an optional additional filter.
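For CI systems where a script is easier to maintain than a curl one-liner, the same request could be built with Python's standard library. This is a minimal sketch: `build_trigger_request` and `trigger_test_run` are hypothetical helper names, and the optional keyword arguments are assumed to be passed through unchanged as the request fields listed above.

```python
import json
import urllib.request

API_URL = "https://api.rubrichq.io"


def build_trigger_request(api_key, agent_id, scenario_ids, *, api_url=API_URL, **optional_fields):
    """Build (but do not send) the POST request that queues a test run.

    optional_fields may carry any documented request field, e.g. frequency,
    success_threshold, channel, name, or ci_metadata. None values are dropped.
    """
    payload = {"agent_id": agent_id, "scenario_ids": list(scenario_ids)}
    payload.update({key: value for key, value in optional_fields.items() if value is not None})
    return urllib.request.Request(
        f"{api_url}/api/public/v1/test_runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def trigger_test_run(request):
    """Send the request and return the parsed JSON body of the 202 response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A CI script would read the key from the `RUBRICHQ_API_KEY` environment variable, call `trigger_test_run(build_trigger_request(...))`, and keep the returned `test_run_id` or `status_url` for polling.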

Poll for the verdict

Voice scenarios take several minutes each. Poll the status_url until verdict is no longer pending:

TEST_RUN_ID=42

# Poll every 15 seconds until the verdict leaves "pending".
while true; do
  verdict=$(curl -sS "https://api.rubrichq.io/api/public/v1/test_runs/$TEST_RUN_ID" \
    -H "Authorization: Bearer $RUBRICHQ_API_KEY" | python3 -c 'import json,sys; print(json.load(sys.stdin)["verdict"])')
  [ "$verdict" != "pending" ] && break
  sleep 15
done

# Fail the CI job unless the run passed.
[ "$verdict" = "passed" ] || exit 1
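The shell loop polls with no upper bound. A bounded variant, similar in spirit to the Action's timeout and poll_interval inputs, might look like the Python sketch below. `wait_for_verdict` is a hypothetical helper, not part of any RubricHQ client; `fetch_status` is any callable that returns the parsed status JSON, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_for_verdict(fetch_status, timeout=3600.0, poll_interval=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the verdict is no longer "pending" or the timeout elapses.

    Returns the final status dict; raises TimeoutError if the deadline passes
    while the verdict is still pending.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["verdict"] != "pending":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"verdict still pending after {timeout} seconds")
        sleep(poll_interval)
```

In a CI script, the caller would exit non-zero when the returned verdict is not `passed` or when `TimeoutError` is raised, matching the exit behavior of the shell version.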

Status response

{
  "id": 42,
  "status": "completed",
  "verdict": "failed",
  "pass_rate": 80.0,
  "success_threshold": 90,
  "runs": {
    "total": 5,
    "completed": 5,
    "running": 0,
    "pending": 0,
    "failed": 0,
    "passed": 4
  },
  "failed_runs": [
    {
      "run_id": 207,
      "scenario_name": "Angry refund caller",
      "status": "completed",
      "reason": "critical metric failed: Greeting Check"
    }
  ],
  "ci_metadata": { "sha": "abc123", "branch": "main" },
  "report_url": "https://app.rubrichq.io/batch-run/42",
  "created_at": "2026-06-11T10:00:00Z",
  "updated_at": "2026-06-11T10:12:31Z"
}
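To surface failures directly in the CI log, a script could turn this response into a few readable lines. The sketch below is one possible formatting, assuming only the fields shown in the example response; `summarize_status` is a hypothetical name.

```python
def summarize_status(status):
    """Format a status response as plain text suitable for a CI log."""
    lines = [
        f"verdict={status['verdict']} "
        f"pass_rate={status['pass_rate']}% "
        f"(threshold {status['success_threshold']}%)"
    ]
    # One line per failed run: run ID, scenario name, and the reported reason.
    for run in status.get("failed_runs", []):
        lines.append(f"- run {run['run_id']} ({run['scenario_name']}): {run['reason']}")
    lines.append(f"report: {status['report_url']}")
    return "\n".join(lines)
```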

How the verdict works

A run passes when all of its critical metrics pass (runs with no critical metrics always pass on that dimension). Calls that error out count as failed runs.

The pass rate is passed_runs / total_runs × 100. The verdict is passed when pass_rate >= success_threshold, and failed otherwise.
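As a worked instance of that formula, the example status response has 4 passed runs out of 5 with a threshold of 90, which gives a pass rate of 80.0 and a failed verdict. The Python sketch below reproduces the arithmetic; `compute_verdict` is a hypothetical name used only for illustration, not the service's implementation.

```python
def compute_verdict(passed_runs, total_runs, success_threshold=100):
    """Return (pass_rate, verdict) using pass_rate = passed / total * 100."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    pass_rate = 100 * passed_runs / total_runs
    verdict = "passed" if pass_rate >= success_threshold else "failed"
    return pass_rate, verdict
```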

The verdict stays "pending" until every call in the batch has completed and all metric evaluations for those calls have finished. Poll until the verdict is no longer "pending" — don’t rely on status == "completed" alone, because metric evaluation runs asynchronously after call completion.


Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Step times out before run completes | Large suite with long voice scenarios (each can take several minutes). | Raise the timeout input (Action) or extend your CI job’s timeout. A 20-scenario suite at frequency: 2 can easily run for 40+ minutes. |
| "No scenarios matched tags" error | Tags are exact-match strings; case and whitespace matter. | Check the scenario tags in the RubricHQ app under Scenarios and make sure they match exactly what you’re passing. |
| HTTP 402 Payment Required | The workspace has run out of credits. | Top up credits in Settings → Billing, then re-run. |
| HTTP 422 Unprocessable Entity listing specific scenario IDs | Those scenario IDs are not attached to the agent, or the scenarios are archived. | Verify the scenario IDs under Agent → Scenarios; archived scenarios must be restored before they can be run. |
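When calling the API from a script, the two HTTP errors in the table can be mapped to actionable log messages. This is a small sketch with hypothetical names (`ERROR_HINTS`, `hint_for_status`); the hint text paraphrases the table and is not returned by the API.

```python
ERROR_HINTS = {
    402: "Workspace is out of credits: top up in Settings → Billing, then re-run.",
    422: "Some scenario IDs are not attached to the agent or are archived: "
         "verify them under Agent → Scenarios.",
}


def hint_for_status(status_code):
    """Return a troubleshooting hint for a failed API call's HTTP status code."""
    return ERROR_HINTS.get(status_code, f"Unexpected HTTP {status_code}; see the response body.")
```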

Use ci_metadata to attach {"sha": "$COMMIT_SHA", "branch": "$BRANCH", "ci_run": "$CI_RUN_URL"} to every test run. It’s stored on the run and echoed back in status responses, so you can trace any dashboard report back to the exact commit that triggered it.
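On GitHub Actions, those three values can be derived from the runner's default environment variables (`GITHUB_SHA`, `GITHUB_REF_NAME`, `GITHUB_SERVER_URL`, `GITHUB_REPOSITORY`, `GITHUB_RUN_ID`). The Python sketch below shows one way to assemble them; `ci_metadata_from_env` is a hypothetical helper, and other CI platforms expose different variable names.

```python
import os


def ci_metadata_from_env(env=None):
    """Build a ci_metadata dict from GitHub Actions default environment variables.

    Keys whose source variables are missing are simply omitted.
    """
    env = os.environ if env is None else env
    metadata = {}
    if env.get("GITHUB_SHA"):
        metadata["sha"] = env["GITHUB_SHA"]
    if env.get("GITHUB_REF_NAME"):
        metadata["branch"] = env["GITHUB_REF_NAME"]
    repository, run_id = env.get("GITHUB_REPOSITORY"), env.get("GITHUB_RUN_ID")
    if repository and run_id:
        server = env.get("GITHUB_SERVER_URL", "https://github.com")
        metadata["ci_run"] = f"{server}/{repository}/actions/runs/{run_id}"
    return metadata
```

The resulting dict can be sent as the `ci_metadata` field of the trigger request.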