Metrics & Evaluation

RubricHQ evaluates every call using three types of metrics, each with different cost and capability trade-offs.

| Type | How it works | Cost | Best for |
| --- | --- | --- | --- |
| Standard (Audio) | Signal processing on the audio recording | Free | Latency, silence, interruptions, speech speed |
| LLM-as-Judge | An LLM evaluates the transcript against your criteria | ~$0.003/metric | Compliance, tone, process adherence |
| Code-as-Judge | Your Python code runs against call data | Free | Rule-based checks, threshold comparisons, custom logic |

Evaluation Pipeline

When a call is analyzed, metrics execute in this order:

1. Audio Analysis (parallel: latency, silence, interruptions, WPM, voice tone)
2. LLM Evaluation (batched: custom LLM metrics + system LLM metrics)
3. Code-as-Judge (runs after audio + LLM are done, has access to all results)

Code-as-Judge metrics execute last so they can reference results from audio and LLM metrics.


Code-as-Judge

Code-as-Judge lets you write Python code that evaluates calls programmatically. Your code runs in a secure sandbox with no imports, filesystem, or network access: just pure Python logic against the call data.

How It Works

  1. You write Python code in the metric editor
  2. Your code receives a context dict with all call data
  3. You set metric["result"] and metric["explanation"]
  4. Optionally set structured_output for classification
  5. The code executes within a 5-second timeout per metric
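Put together, a minimal metric body looks like the sketch below. In production the sandbox injects `context`, `metric`, and `structured_output`; the stubbed sample data and the `user_spoke` check here are purely illustrative.

```python
# Stubs standing in for the sandbox-injected variables (illustration only)
context = {"transcript_json": [{"role": "User", "content": "Hi, I need help."}]}
metric = {}
structured_output = {}

# Hypothetical boolean check: did the caller say anything at all?
user_turns = [t for t in context["transcript_json"]
              if t.get("role", "").lower() in ["user", "human", "customer", "caller"]]

metric["result"] = len(user_turns) > 0
metric["explanation"] = str(len(user_turns)) + " user turn(s) found"
structured_output["name"] = "user_spoke"
structured_output["value"] = "pass" if user_turns else "fail"
structured_output["classification"] = "meets_expectations" if user_turns else "requires_attention"
```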

Output Variables

Your code must set these:

    metric["result"]       # The metric value; must match your Result Type:
                           #   boolean: True / False
                           #   rating:  1-5 (integer)
                           #   enum:    one of your defined values
                           #   numeric: any number

    metric["explanation"]  # A short string explaining the result

Optionally, set structured output for classification:

    structured_output["name"]            # Field name (e.g. "silence_compliance")
    structured_output["value"]           # Field value (e.g. "pass", "fail")
    structured_output["classification"]  # One of:
                                         #   meets_expectations
                                         #   exceeds_expectations
                                         #   requires_attention
                                         #   goal_achieved
                                         #   goal_missed

Available Context

Your code accesses call data via the context dict. Press / in the code editor to browse all available attributes.

| Attribute | Type | Description |
| --- | --- | --- |
| `context["transcript"]` | `str` | Full transcript as text (`[Role] message` per line) |
| `context["transcript_json"]` | `list[dict]` | Structured transcript: `[{"role": "...", "content": "..."}]` |
| `context["call_duration"]` | `float` | Call duration in seconds |
| `context["call_end_reason"]` | `str` | Why the call ended (e.g. "hangup", "completed") |
| `context["metadata"]` | `dict` | Custom metadata key-value pairs |
| `context["latency"]` | `dict` | `avg_ms`, `p95_ms`, `count`, `turns[]` |
| `context["silence"]` | `dict` | `count`, `total_silence_ms`, `silences[]` (each has `duration_ms`, `start_ms`, `end_ms`) |
| `context["interruptions"]` | `dict` | `user: {count, events[]}`, `ai: {count, events[]}` |
| `context["wpm"]` | `dict` | `avg_wpm`, `total_words`, `assessment` |
| `context["voice_tone"]` | `dict` | `overall_score`, `clarity_score`, `tone_score` (1-5) |
| `context["voice_change"]` | `dict` | `count`, `avg_similarity`, `events[]` |
| `context["transcription_accuracy"]` | `dict` | `wer`, `mer` (0-1 word error rate) |
| `context["metrics_results"]` | `dict` | All computed metric results: `{"Metric Name": {value, explanation}}` |
| `context["recording_metadata"]` | `dict` | Recording info (duration, channels, sample_rate) |
| `context["agent_name"]` | `str` | Name of the AI agent |
| `context["agent_instructions"]` | `str` | Agent's system prompt |

Transcript roles vary by source. Common values: `"assistant"`, `"AI Assistant"`, `"user"`, `"User"`, `"bot"`. Use case-insensitive matching: `t.get("role", "").lower() in ["assistant", "ai assistant", "bot", "agent"]`.
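One way to keep that matching rule in one place is a pair of small helpers; this is a sketch, and the role lists are assumptions drawn from the common values above:

```python
AGENT_ROLES = ["assistant", "ai assistant", "bot", "agent"]
USER_ROLES = ["user", "human", "customer", "caller"]

def is_agent(turn):
    # Case-insensitive, so "AI Assistant", "assistant", and "Bot" all match
    return turn.get("role", "").lower() in AGENT_ROLES

def is_user(turn):
    return turn.get("role", "").lower() in USER_ROLES
```

The examples below inline the same check; a helper like this simply avoids repeating the role lists.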

Examples

Silence Detection (Boolean)

Check if any silence period exceeds 5 seconds:

    silence = context["silence"]
    silences = silence.get("silences", [])
    threshold_ms = 5000  # 5 seconds
    long_silences = [s for s in silences if s.get("duration_ms", 0) > threshold_ms]

    if len(long_silences) == 0:
        metric["result"] = True
        metric["explanation"] = "No silence periods exceeded 5 seconds"
        structured_output["name"] = "silence_compliance"
        structured_output["value"] = "pass"
        structured_output["classification"] = "meets_expectations"
    else:
        worst_ms = max(s["duration_ms"] for s in long_silences)
        metric["result"] = False
        metric["explanation"] = str(len(long_silences)) + " silence(s) exceeded 5s (worst: " + str(worst_ms / 1000) + "s)"
        structured_output["name"] = "silence_compliance"
        structured_output["value"] = "fail"
        structured_output["classification"] = "requires_attention"

Average Latency Check (Boolean)

Fail if average response latency exceeds thresholds:

    lat = context["latency"]
    avg = lat.get("avg_ms", 0)

    if avg < 1000:
        metric["result"] = True
        metric["explanation"] = "Average latency " + str(avg) + "ms is under 1 second"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "pass"
        structured_output["classification"] = "meets_expectations"
    elif avg < 2000:
        metric["result"] = False
        metric["explanation"] = "Average latency " + str(avg) + "ms is between 1 and 2 seconds"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "warning"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = False
        metric["explanation"] = "Average latency " + str(avg) + "ms exceeds 2 seconds"
        structured_output["name"] = "latency_check"
        structured_output["value"] = "fail"
        structured_output["classification"] = "goal_missed"

Agent Greeting Check (Boolean)

Verify the agent greets the caller:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    first_agent = ""
    for t in transcript:
        if t.get("role", "").lower() in agent_roles:
            first_agent = t.get("content", "")
            break

    greetings = ["hello", "hi", "good morning", "good afternoon", "welcome", "thank you for calling"]
    greeted = any(g in first_agent.lower() for g in greetings)

    metric["result"] = greeted
    metric["explanation"] = ("Agent opened with: " + first_agent[:80]) if first_agent else "No agent message found"
    structured_output["name"] = "greeting_check"
    structured_output["value"] = "pass" if greeted else "fail"
    structured_output["classification"] = "meets_expectations" if greeted else "requires_attention"

Speech Speed Assessment (Rating 1-5)

Rate the agent’s speech speed:

    w = context["wpm"]
    avg_wpm = w.get("avg_wpm", 0)

    if 120 <= avg_wpm <= 180:
        metric["result"] = 5
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is in ideal range (120-180)"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "ideal"
        structured_output["classification"] = "meets_expectations"
    elif 100 <= avg_wpm < 120 or 180 < avg_wpm <= 200:
        metric["result"] = 3
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is slightly outside ideal range"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = 1
        metric["explanation"] = "Speech speed " + str(avg_wpm) + " WPM is too slow or too fast"
        structured_output["name"] = "speech_speed"
        structured_output["value"] = "poor"
        structured_output["classification"] = "goal_missed"

Interruption Count (Numeric)

Count total interruptions and classify severity:

    ints = context["interruptions"]
    user_count = ints.get("user", {}).get("count", 0)
    ai_count = ints.get("ai", {}).get("count", 0)
    total = user_count + ai_count

    metric["result"] = total
    metric["explanation"] = "Total " + str(total) + " interruptions (user: " + str(user_count) + ", AI: " + str(ai_count) + ")"
    structured_output["name"] = "interruption_level"
    if total <= 3:
        structured_output["value"] = "low"
        structured_output["classification"] = "meets_expectations"
    elif total <= 8:
        structured_output["value"] = "moderate"
        structured_output["classification"] = "requires_attention"
    else:
        structured_output["value"] = "high"
        structured_output["classification"] = "goal_missed"

Voice Tone Quality (Rating 1-5)

Check voice clarity and tone scores:

    vt = context["voice_tone"]
    score = vt.get("overall_score", 0)
    clarity = vt.get("clarity_score", 0)
    tone = vt.get("tone_score", 0)

    metric["result"] = round(score)
    metric["explanation"] = "Voice tone score: " + str(score) + "/5 (clarity: " + str(clarity) + ", tone: " + str(tone) + ")"
    structured_output["name"] = "voice_quality"
    if score >= 4:
        structured_output["value"] = "excellent"
        structured_output["classification"] = "exceeds_expectations"
    elif score >= 3:
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "meets_expectations"
    else:
        structured_output["value"] = "poor"
        structured_output["classification"] = "requires_attention"

Call Duration Check (Boolean)

Ensure call duration is within acceptable range:

    dur = context["call_duration"] or 0

    if dur < 30:
        metric["result"] = False
        metric["explanation"] = "Call too short at " + str(dur) + "s"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "too_short"
        structured_output["classification"] = "requires_attention"
    elif dur > 600:
        metric["result"] = False
        metric["explanation"] = "Call too long at " + str(dur) + "s"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "too_long"
        structured_output["classification"] = "requires_attention"
    else:
        metric["result"] = True
        metric["explanation"] = "Call duration " + str(dur) + "s is within acceptable range"
        structured_output["name"] = "duration_check"
        structured_output["value"] = "acceptable"
        structured_output["classification"] = "meets_expectations"

Conversation Turn Count (Numeric)

Count conversation turns and flag abnormal lengths:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    user_roles = ["user", "human", "customer", "caller"]
    total = len(transcript)
    agent_turns = len([t for t in transcript if t.get("role", "").lower() in agent_roles])
    user_turns = len([t for t in transcript if t.get("role", "").lower() in user_roles])

    metric["result"] = total
    metric["explanation"] = str(total) + " total turns (agent: " + str(agent_turns) + ", user: " + str(user_turns) + ")"
    structured_output["name"] = "turn_count"
    if total < 4:
        structured_output["value"] = "too_short"
        structured_output["classification"] = "requires_attention"
    elif total > 50:
        structured_output["value"] = "too_long"
        structured_output["classification"] = "requires_attention"
    else:
        structured_output["value"] = "normal"
        structured_output["classification"] = "meets_expectations"

Call End Reason Check (Boolean)

Verify the call ended normally:

    reason = context.get("call_end_reason", "")
    normal_reasons = ["user disconnected", "hangup", "completed", "end_turn", "natural_conclusion", "objective_met"]
    is_normal = any(r in str(reason).lower() for r in normal_reasons) if reason else False

    metric["result"] = is_normal
    metric["explanation"] = "Call ended with: " + str(reason)
    structured_output["name"] = "end_reason"
    structured_output["value"] = "normal" if is_normal else "abnormal"
    structured_output["classification"] = "meets_expectations" if is_normal else "requires_attention"

Agent Identifies Themselves (Boolean)

Check if the agent introduces themselves in the first few turns:

    transcript = context["transcript_json"]
    agent_roles = ["assistant", "ai assistant", "bot", "agent"]
    first_3 = [t.get("content", "").lower() for t in transcript if t.get("role", "").lower() in agent_roles][:3]
    combined = " ".join(first_3)

    identified = any(p in combined for p in ["my name is", "this is", "i'm ", "i am "])

    metric["result"] = identified
    metric["explanation"] = "Agent introduced themselves" if identified else "Agent did not introduce themselves in first 3 turns"
    structured_output["name"] = "agent_identification"
    structured_output["value"] = "pass" if identified else "fail"
    structured_output["classification"] = "meets_expectations" if identified else "requires_attention"

Metadata Presence Check (Boolean)

Verify that call metadata is attached:

    meta = context.get("metadata", {})
    has_meta = len(meta) > 0

    metric["result"] = has_meta
    keys = list(meta.keys())
    metric["explanation"] = ("Metadata present: " + ", ".join(keys[:5])) if has_meta else "No metadata attached to this call"
    structured_output["name"] = "metadata_check"
    structured_output["value"] = "present" if has_meta else "missing"
    structured_output["classification"] = "meets_expectations" if has_meta else "requires_attention"

Cross-Metric Check (Using Other Metric Results)

Check if multiple metrics passed together:

    results = context["metrics_results"]
    latency_result = results.get("Response Latency", {})
    silence_result = results.get("Silence Detection", {})

    avg_latency = latency_result.get("avg_ms", 0) if isinstance(latency_result, dict) else 0
    silence_count = silence_result.get("count", 0) if isinstance(silence_result, dict) else 0

    both_good = avg_latency < 2000 and silence_count < 10
    metric["result"] = both_good
    metric["explanation"] = "Latency: " + str(avg_latency) + "ms, Silences: " + str(silence_count)
    structured_output["name"] = "overall_quality"
    structured_output["value"] = "pass" if both_good else "fail"
    structured_output["classification"] = "meets_expectations" if both_good else "requires_attention"

Testing Your Code

Use the Test with Call button in the code editor to run your code against a real call before saving. The test:

  1. Loads a call from your selected agent (with all computed metrics)
  2. Executes your code in the same sandbox used in production
  3. Shows the result, explanation, structured output, and execution time
  4. Highlights errors if your code has bugs

Tips

  • Use `context.get("key", default)` for safe access when data may not be available
  • Transcript roles vary by source; always match case-insensitively against multiple role names
  • Silence uses `duration_ms` (milliseconds), not seconds
  • Latency uses `avg_ms` for the average
  • Avoid f-strings with quotes inside; use string concatenation instead (sandbox limitation)
  • `next()`, `iter()`, and `reversed()` are available alongside standard builtins
  • Keep code simple and fast: there is a 5-second timeout per metric
  • Code metrics have access to all other computed metric results via `context["metrics_results"]`
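To illustrate the f-string tip: build explanation strings by concatenating `str(...)` pieces, as the examples above do. The values here are made up for the demonstration.

```python
count = 3
worst = {"duration_ms": 7200}

# Sandbox-safe: plain concatenation, no quotes nested inside an f-string
explanation = str(count) + " silence(s) exceeded 5s (worst: " + str(worst["duration_ms"] / 1000) + "s)"
# explanation == "3 silence(s) exceeded 5s (worst: 7.2s)"
```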

When to Use Code-as-Judge vs LLM-as-Judge

| Use Code-as-Judge when… | Use LLM-as-Judge when… |
| --- | --- |
| Checking numeric thresholds (latency < 2s) | Evaluating tone, empathy, professionalism |
| Verifying specific phrases exist | Assessing complex compliance requirements |
| Counting occurrences | Interpreting nuanced conversation flow |
| Cross-referencing other metric results | Determining if the agent followed a process |
| You need zero cost per evaluation | Subjective quality assessment |
| The rule is deterministic (same input = same output) | The evaluation requires judgment |