Metrics & Evaluation

RubricHQ evaluates every call using three types of metrics, each with different cost and capability trade-offs.

| Type | How it works | Cost | Best for |
| --- | --- | --- | --- |
| Standard (Audio) | Signal processing on the audio recording | Free | Latency, silence, interruptions, speech speed |
| LLM-as-Judge | An LLM evaluates the transcript against your criteria | ~$0.003/metric | Compliance, tone, process adherence |
| Code-as-Judge | Your Python code runs against call data | Free | Rule-based checks, threshold comparisons, custom logic |

Evaluation Pipeline

When a call is analyzed, metrics execute in this order:

1. Audio Analysis (parallel: latency, silence, interruptions, WPM, voice tone)
2. LLM Evaluation (batched: custom LLM metrics + system LLM metrics)
3. Code-as-Judge (runs after audio + LLM are done, has access to all results)

Code-as-Judge metrics execute last so they can reference results from audio and LLM metrics.


Code-as-Judge

Code-as-Judge lets you write Python code that evaluates calls programmatically. Your code runs in a secure sandbox with no imports, filesystem, or network access: just pure Python logic against the call data.

How It Works

  1. You write Python code in the metric editor
  2. Your code receives a context dict with all call data
  3. You set metric["result"] and metric["explanation"]
  4. Optionally set structured_output for classification
  5. The code executes within a 5-second timeout per metric
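Put together, a minimal metric body looks like the sketch below. In production the sandbox injects `context`, `metric`, and `structured_output`; the stubbed sample data and the `user_spoke` check here are purely illustrative.

```python
# Stubs standing in for the sandbox-injected variables (illustration only)
context = {"transcript_json": [{"role": "User", "content": "Hi, I need help."}]}
metric = {}
structured_output = {}

# Hypothetical boolean check: did the caller say anything at all?
user_turns = [t for t in context["transcript_json"]
              if t.get("role", "").lower() in ["user", "human", "customer", "caller"]]

metric["result"] = len(user_turns) > 0
metric["explanation"] = str(len(user_turns)) + " user turn(s) found"
structured_output["name"] = "user_spoke"
structured_output["value"] = "pass" if user_turns else "fail"
structured_output["classification"] = "meets_expectations" if user_turns else "requires_attention"
```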

Output Variables

Your code must set these:

    metric["result"]       # The metric value; must match your Result Type:
                           #   boolean: True / False
                           #   rating:  1-5 (integer)
                           #   enum:    one of your defined values
                           #   numeric: any number

    metric["explanation"]  # A short string explaining the result

Optionally, set structured output for classification:

    structured_output["name"]            # Field name (e.g. "silence_compliance")
    structured_output["value"]           # Field value (e.g. "pass", "fail")
    structured_output["classification"]  # One of:
                                         #   meets_expectations
                                         #   exceeds_expectations
                                         #   requires_attention
                                         #   goal_achieved
                                         #   goal_missed

Available Context

Your code accesses call data via the context dict. Press / in the code editor to browse all available attributes.

| Attribute | Type | Description |
| --- | --- | --- |
| `context["transcript"]` | `str` | Full transcript as text (`[Role] message` per line) |
| `context["transcript_json"]` | `list[dict]` | Structured transcript: `[{"role": "...", "content": "..."}]` |
| `context["call_duration"]` | `float` | Call duration in seconds |
| `context["call_end_reason"]` | `str` | Why the call ended (e.g. "hangup", "completed") |
| `context["metadata"]` | `dict` | Custom metadata key-value pairs |
| `context["latency"]` | `dict` | `avg_ms`, `p95_ms`, `count`, `turns[]` |
| `context["silence"]` | `dict` | `count`, `total_silence_ms`, `silences[]` (each has `duration_ms`, `start_ms`, `end_ms`) |
| `context["interruptions"]` | `dict` | `user: {count, events[]}`, `ai: {count, events[]}` |
| `context["wpm"]` | `dict` | `avg_wpm`, `total_words`, `assessment` |
| `context["voice_tone"]` | `dict` | `overall_score`, `clarity_score`, `tone_score` (1-5) |
| `context["voice_change"]` | `dict` | `count`, `avg_similarity`, `events[]` |
| `context["transcription_accuracy"]` | `dict` | `wer`, `mer` (0-1 word error rate) |
| `context["metrics_results"]` | `dict` | All computed metric results: `{"Metric Name": {value, explanation}}` |
| `context["recording_metadata"]` | `dict` | Recording info (duration, channels, sample_rate) |
| `context["agent_name"]` | `str` | Name of the AI agent |
| `context["agent_instructions"]` | `str` | Agent's system prompt |

Transcript roles vary by source. Common values: `"assistant"`, `"AI Assistant"`, `"user"`, `"User"`, `"bot"`. Use case-insensitive matching: `t.get("role", "").lower() in ["assistant", "ai assistant", "bot", "agent"]`.
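One way to keep that matching rule in one place is a pair of small helpers; this is a sketch, and the role lists are assumptions drawn from the common values above:

```python
AGENT_ROLES = ["assistant", "ai assistant", "bot", "agent"]
USER_ROLES = ["user", "human", "customer", "caller"]

def is_agent(turn):
    # Case-insensitive, so "AI Assistant", "assistant", and "Bot" all match
    return turn.get("role", "").lower() in AGENT_ROLES

def is_user(turn):
    return turn.get("role", "").lower() in USER_ROLES
```

The examples below inline the same check; a helper like this simply avoids repeating the role lists.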

Examples

Silence Detection (Boolean)

Check if any silence period exceeds 5 seconds:

    silence = context["silence"]
    silences = silence.get("silences", [])
    threshold_ms = 5000  # 5 seconds
    long_silences = [s for s in silences if s.get("duration_ms", 0) > threshold_ms]

    if len(long_silences) == 0:
        metric["result"] = True
        metric["explanation"] = "No silence periods exceeded 5 seconds"
        structured_output["name"] = "silence_compliance"
        structured_output["value"] = "pass"
        structured_output["classification"] = "meets_expectations"
    else:
        worst_ms = max(s["duration_ms"] for s in long_silences)
        metric["result"] = False
        metric["explanation"] = str(len(long_silences)) + " silence(s) exceeded 5s (worst: " + str(worst_ms / 1000) + "s)"
        structured_output["name"] = "silence_compliance"
        structured_output["value"] = "fail"
        structured_output["classification"] = "requires_attention"

Average Latency Check (Boolean)

Fail if average response latency exceeds thresholds:

    lat = context["latency"]
    avg = lat.get("avg_ms", 0)

    if avg < 1000:
        metric["result"] = True
        metric["explanation"] = "Average latency " + str(avg) + "ms is under 1 second"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "pass"
        structured_output["classification"] = "meets_expectations"
    elif avg < 2000:
        metric["result"] = False
        metric["explanation"] = "Average latency " + str(avg) + "ms is between 1 and 2 seconds"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "warning"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = False
        metric["explanation"] = "Average latency " + str(avg) + "ms exceeds 2 seconds"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "fail"
        structured_output["classification"] = "goal_missed"

Agent Greeting Check (Boolean)

Verify the agent greets the caller:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    first_agent = ""
    for t in transcript:
        if t.get("role", "").lower() in agent_roles:
            first_agent = t.get("content", "")
            break

    greetings = ["hello", "hi", "good morning", "good afternoon", "welcome", "thank you for calling"]
    greeted = any(g in first_agent.lower() for g in greetings)

    metric["result"] = greeted
    metric["explanation"] = ("Agent opened with: " + first_agent[:80]) if first_agent else "No agent message found"
    structured_output["name"] = "greeting_check"
    structured_output["value"] = "pass" if greeted else "fail"
    structured_output["classification"] = "meets_expectations" if greeted else "requires_attention"

Speech Speed Assessment (Rating 1-5)

Rate the agent’s speech speed:

    w = context["wpm"]
    avg_wpm = w.get("avg_wpm", 0)

    if 120 <= avg_wpm <= 180:
        metric["result"] = 5
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is in ideal range (120-180)"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "ideal"
        structured_output["classification"] = "meets_expectations"
    elif 100 <= avg_wpm < 120 or 180 < avg_wpm <= 200:
        metric["result"] = 3
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is slightly outside ideal range"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = 1
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is too slow or too fast"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "poor"
        structured_output["classification"] = "goal_missed"

Interruption Count (Numeric)

Count total interruptions and classify severity:

    ints = context["interruptions"]
    user_count = ints.get("user", {}).get("count", 0)
    ai_count = ints.get("ai", {}).get("count", 0)
    total = user_count + ai_count

    metric["result"] = total
    metric["explanation"] = "Total " + str(total) + " interruptions (user: " + str(user_count) + ", AI: " + str(ai_count) + ")"
    structured_output["name"] = "interruption_level"
    if total <= 3:
        structured_output["value"] = "low"
        structured_output["classification"] = "meets_expectations"
    elif total <= 8:
        structured_output["value"] = "moderate"
        structured_output["classification"] = "requires_attention"
    else:
        structured_output["value"] = "high"
        structured_output["classification"] = "goal_missed"

Voice Tone Quality (Rating 1-5)

Check voice clarity and tone scores:

    vt = context["voice_tone"]
    score = vt.get("overall_score", 0)
    clarity = vt.get("clarity_score", 0)
    tone = vt.get("tone_score", 0)

    metric["result"] = round(score)
    metric["explanation"] = "Voice tone score: " + str(score) + "/5 (clarity: " + str(clarity) + ", tone: " + str(tone) + ")"
    structured_output["name"] = "voice_quality"
    if score >= 4:
        structured_output["value"] = "excellent"
        structured_output["classification"] = "exceeds_expectations"
    elif score >= 3:
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "meets_expectations"
    else:
        structured_output["value"] = "poor"
        structured_output["classification"] = "requires_attention"

Call Duration Check (Boolean)

Ensure call duration is within acceptable range:

    dur = context["call_duration"] or 0

    if dur < 30:
        metric["result"] = False
        metric["explanation"] = "Call too short at " + str(dur) + "s"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "too_short"
        structured_output["classification"] = "requires_attention"
    elif dur > 600:
        metric["result"] = False
        metric["explanation"] = "Call too long at " + str(dur) + "s"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "too_long"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = True
        metric["explanation"] = "Call duration " + str(dur) + "s is within acceptable range"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "meets_expectations"

Conversation Turn Count (Numeric)

Count conversation turns and flag abnormal lengths:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    user_roles = ["user", "human", "customer", "caller"]
    total = len(transcript)
    agent_turns = len([t for t in transcript if t.get("role", "").lower() in agent_roles])
    user_turns = len([t for t in transcript if t.get("role", "").lower() in user_roles])

    metric["result"] = total
    metric["explanation"] = str(total) + " total turns (agent: " + str(agent_turns) + ", user: " + str(user_turns) + ")"
    structured_output["name"] = "turn_count"
    if total < 4:
        structured_output["value"] = "too_short"
        structured_output["classification"] = "requires_attention"
    elif total > 50:
        structured_output["value"] = "too_long"
        structured_output["classification"] = "requires_attention"
    else:
        structured_output["value"] = "normal"
        structured_output["classification"] = "meets_expectations"

Call End Reason Check (Boolean)

Verify the call ended normally:

    reason = context.get("call_end_reason", "")
    normal_reasons = ["user disconnected", "hangup", "completed", "end_turn", "natural_conclusion", "objective_met"]
    is_normal = any(r in str(reason).lower() for r in normal_reasons) if reason else False

    metric["result"] = is_normal
    metric["explanation"] = "Call ended with: " + str(reason)
    structured_output["name"] = "end_reason"
    structured_output["value"] = "normal" if is_normal else "abnormal"
    structured_output["classification"] = "meets_expectations" if is_normal else "requires_attention"

Agent Identifies Themselves (Boolean)

Check if the agent introduces themselves in the first few turns:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    first_3 = [t.get("content", "").lower() for t in transcript if t.get("role", "").lower() in agent_roles][:3]
    combined = " ".join(first_3)

    identified = any(p in combined for p in ["my name is", "this is", "i'm ", "i am "])

    metric["result"] = identified
    metric["explanation"] = "Agent introduced themselves" if identified else "Agent did not introduce themselves in first 3 turns"
    structured_output["name"] = "agent_identification"
    structured_output["value"] = "pass" if identified else "fail"
    structured_output["classification"] = "meets_expectations" if identified else "requires_attention"

Metadata Presence Check (Boolean)

Verify that call metadata is attached:

    meta = context.get("metadata", {})
    has_meta = len(meta) > 0

    metric["result"] = has_meta
    keys = list(meta.keys())
    metric["explanation"] = ("Metadata present: " + ", ".join(keys[:5])) if has_meta else "No metadata attached to this call"
    structured_output["name"] = "metadata_check"
    structured_output["value"] = "present" if has_meta else "missing"
    structured_output["classification"] = "meets_expectations" if has_meta else "requires_attention"

Cross-Metric Check (Using Other Metric Results)

Check if multiple metrics passed together:

    results = context["metrics_results"]
    latency_result = results.get("Response Latency", {})
    silence_result = results.get("Silence Detection", {})

    avg_latency = latency_result.get("avg_ms", 0) if isinstance(latency_result, dict) else 0
    silence_count = silence_result.get("count", 0) if isinstance(silence_result, dict) else 0

    both_good = avg_latency < 2000 and silence_count < 10
    metric["result"] = both_good
    metric["explanation"] = "Latency: " + str(avg_latency) + "ms, Silences: " + str(silence_count)
    structured_output["name"] = "overall_quality"
    structured_output["value"] = "pass" if both_good else "fail"
    structured_output["classification"] = "meets_expectations" if both_good else "requires_attention"

Testing Your Code

Use the Test with Call button in the code editor to run your code against a real call before saving. The test:

  1. Loads a call from your selected agent (with all computed metrics)
  2. Executes your code in the same sandbox used in production
  3. Shows the result, explanation, structured output, and execution time
  4. Highlights errors if your code has bugs

Tips

  • Use `context.get("key", default)` for safe access when data may not be available
  • Transcript roles vary by source; always match case-insensitively against multiple role names
  • Silence uses `duration_ms` (milliseconds), not seconds
  • Latency uses `avg_ms` for the average
  • Avoid f-strings with quotes inside; use string concatenation instead (sandbox limitation)
  • `next()`, `iter()`, and `reversed()` are available alongside standard builtins
  • Keep code simple and fast: there is a 5-second timeout per metric
  • Code metrics have access to all other computed metric results via `context["metrics_results"]`
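To illustrate the f-string tip: build explanation strings by concatenating `str(...)` pieces, as the examples above do. The values here are made up for the demonstration.

```python
count = 3
worst = {"duration_ms": 7200}

# Sandbox-safe: plain concatenation, no quotes nested inside an f-string
explanation = str(count) + " silence(s) exceeded 5s (worst: " + str(worst["duration_ms"] / 1000) + "s)"
# explanation == "3 silence(s) exceeded 5s (worst: 7.2s)"
```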

When to Use Code-as-Judge vs LLM-as-Judge

| Use Code-as-Judge when… | Use LLM-as-Judge when… |
| --- | --- |
| Checking numeric thresholds (latency < 2s) | Evaluating tone, empathy, professionalism |
| Verifying specific phrases exist | Assessing complex compliance requirements |
| Counting occurrences | Interpreting nuanced conversation flow |
| Cross-referencing other metric results | Determining if the agent followed a process |
| You need zero cost per evaluation | Subjective quality assessment |
| The rule is deterministic (same input = same output) | The evaluation requires judgment |