LLM Observability and Evaluation: Traces, Quality Metrics, and Experiments
Build observability and evaluation for LLM systems, including prompt traces, cost tracking, model versions, RAG metrics, LLM-as-judge, A/B tests, and regression datasets.
Start With a Quality Regression Nobody Noticed
Your RAG support assistant has been running smoothly for three weeks. Latency is stable at 2 seconds. Error rate is under 0.5%. The team is focused on building new features.
On day 21, a user posts on social media:
"Your AI assistant told me I can cancel my annual subscription for a full refund. That is wrong. Your policy clearly says annual subscriptions are non-refundable."
You check the monitoring dashboard. Latency: green. Error rate: green. Everything looks fine. But the answer was wrong.
What happened? Three days ago, a teammate updated the system prompt from "Answer concisely" to "Be thorough and helpful." A new model version was rolled out by the provider. And the retrieval top-k was changed from 5 to 10 in a config deploy.
Each change seemed harmless. Each was logged somewhere. But no system connected the dots: prompt change + model update + retrieval change = worse answers. The observability system tracked uptime, not quality.
An LLM call can return HTTP 200 and still be a bad answer. Traditional observability — latency, errors, and saturation — is necessary but not sufficient for LLM systems. You also need to track what the model was asked, what it was given, what it produced, how much it cost, and whether the answer was any good.
Mental model: Traditional observability tells you if the system is up. LLM observability tells you if the system is working. An HTTP 200 with a hallucinated answer is a successful call and a failed interaction.
Why LLM Observability Is Different
Traditional web service observability tracks three pillars: logs, metrics, and traces. For an API endpoint, you measure request rate, latency percentiles, and error codes. If the endpoint returns 200 in under 200ms, everything is fine.
LLM observability needs all of that plus:
| Additional Dimension | Why It Matters | Traditional Analogy |
|---|---|---|
| Prompt content | The input determines the output | Request body (usually opaque) |
| Retrieved context | RAG quality depends on evidence | Database query plan |
| Model version | Different models produce different answers | API version (but changes more often) |
| Token count | Drives cost and latency | Response size (but billed per unit) |
| Answer quality | Is the response correct and safe? | Not tracked traditionally |
| User feedback | Did the user accept the answer? | Conversion rate |
| Cost per request | Unit economics matter at scale | Compute cost (but more variable) |
| Guardrail outcomes | Was the response validated? | Input validation result |
Important distinction: Traditional observability detects when the system breaks. LLM observability detects when the system degrades. Degradation is harder to detect because the system still works — it just works worse. You need quality signals, not just availability signals.
Why Not Just Use Application Logs?
Beginners often ask: "Can I just log the prompt and response to my existing logging system?"
You can, but LLM observability has requirements that general-purpose logging does not satisfy.
| Requirement | General Logging | LLM Observability |
|---|---|---|
| Trace across spans | Correlate by request ID | Retrieval → prompt → model → validation must be linked |
| High-volume prompt storage | Expensive at LLM scale | Sampling and redaction strategies needed |
| Cost attribution | Not a logging feature | Token counts must be aggregated per tenant/feature |
| Quality scoring | Not available | Separate evaluation pipeline needed |
| Version tracking | Manual tagging | Automatic capture of model, prompt, config versions |
| Experiment comparison | Not supported | A/B test framework with statistical comparison |
Step 1: Tracing the Full Pipeline
Tracing in LLM systems means capturing every step from user request to final response as connected spans. Each span captures timing, inputs, outputs, and metadata.
Span Schema
Every span should capture:
{
  "spanId": "span_abc123",
  "traceId": "trace_def456",
  "parentSpanId": "span_parent789",
  "service": "rag-pipeline",
  "spanName": "retrieval",
  "startTime": "2026-06-01T10:00:00.000Z",
  "duration": 320,
  "attributes": {
    "query": "Can I pause my annual subscription?",
    "topK": 10,
    "filter": "tenant=acme",
    "documentIds": ["doc_001", "doc_042", "doc_103"],
    "scores": [0.92, 0.87, 0.64],
    "indexVersion": "idx_2026_05_28"
  }
}
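As a rough illustration of how spans like this one could be produced and linked, here is a minimal in-memory sketch. The `Tracer` and `Span` classes are hypothetical helpers, not part of any tracing library, and recording the duration in milliseconds is an assumption about the unit of the `duration` field. A production system would more likely use an existing tracing SDK.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    span_id: str
    trace_id: str
    parent_span_id: Optional[str]
    service: str
    span_name: str
    start_time: str
    duration_ms: float = 0.0  # assumed unit: milliseconds
    attributes: Dict[str, Any] = field(default_factory=dict)


class Tracer:
    """Collects the spans of one request in memory (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        self.service = service
        self.trace_id = f"trace_{uuid.uuid4().hex[:12]}"
        self.spans: List[Span] = []
        self._open: List[str] = []  # IDs of spans that are currently open

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._open[-1] if self._open else None
        current = Span(
            span_id=f"span_{uuid.uuid4().hex[:12]}",
            trace_id=self.trace_id,
            parent_span_id=parent,
            service=self.service,
            span_name=name,
            start_time=datetime.now(timezone.utc).isoformat(),
            attributes=dict(attributes),
        )
        self._open.append(current.span_id)
        started = time.perf_counter()
        try:
            yield current
        finally:
            # Record the duration even if the wrapped step raised.
            current.duration_ms = (time.perf_counter() - started) * 1000.0
            self._open.pop()
            self.spans.append(current)


tracer = Tracer(service="rag-pipeline")
with tracer.span("request", tenant="acme") as root:
    with tracer.span("retrieval", topK=10, filter="tenant=acme") as retrieval:
        retrieval.attributes["documentIds"] = ["doc_001", "doc_042", "doc_103"]
```

Spans are appended when they close, so the retrieval span is stored before its parent. Both share one trace ID, which is what lets a debugging query reassemble the full request later.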
Which Spans to Capture
| Span | Captures | Why It Matters for Debugging |
|---|---|---|
| Request | Tenant, feature, user type, request ID | Who asked, what feature, was it authorized |
| Auth | Auth result, user permissions | Was the user allowed to ask this? |
| Retrieval | Query, filters, top-k, document IDs, scores | Were the right documents retrieved? |
| Reranking | Candidate count, selected chunks, scores | Did reranking help or hurt? |
| Prompt assembly | Prompt version, token count, context IDs | Was the prompt constructed correctly? |
| LLM call | Model name, provider, latency, tokens, cost | Which model, how fast, how expensive |
| Validation | Schema result, safety result, retries | Did guardrails pass or fail? |
| Response | Citations, confidence score, user feedback | Was the answer accepted? |
Sampling Strategy
Storing every trace for every request is expensive. Use a sampling strategy:
| Sampling Tier | Criteria | Sample Rate | Storage |
|---|---|---|---|
| All requests | Every request | 100% of metrics, 0% of payloads | Aggregated time-series (rollup metrics only) |
| Quality sample | Random sample across all traffic | 1-5% | Full spans, redacted PII |
| Error sample | Any validation failure, error, or guardrail block | 100% | Full spans with payloads |
| Degradation sample | Requests with high latency or cost | 100% | Full spans |
| Feedback sample | Requests with explicit user feedback | 100% | Full spans |
Practical advice: Always store 100% of traces for requests that failed validation, had high latency, or received negative user feedback. These are your most valuable debugging signals. Sample the happy path at 1-5% to control cost while maintaining statistical visibility.
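The tiers above can be expressed as a small decision function. This is a sketch under stated assumptions: the `RequestOutcome` fields, the latency and cost thresholds, and the 2% quality sample rate are all hypothetical values chosen for illustration. Hashing the trace ID keeps the sampling decision deterministic, so every service that sees the same trace makes the same choice.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestOutcome:
    trace_id: str
    had_error: bool = False
    validation_failed: bool = False
    guardrail_blocked: bool = False
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    user_feedback: Optional[str] = None  # e.g. "up" or "down"; None if absent


# Hypothetical thresholds; tune them to your own SLOs and budget.
LATENCY_THRESHOLD_MS = 10_000.0
COST_THRESHOLD_USD = 0.05
QUALITY_SAMPLE_RATE = 0.02  # inside the 1-5% band from the table


def sampling_tier(outcome: RequestOutcome) -> str:
    """Return which storage tier a finished request falls into."""
    if outcome.had_error or outcome.validation_failed or outcome.guardrail_blocked:
        return "error"
    if outcome.latency_ms > LATENCY_THRESHOLD_MS or outcome.cost_usd > COST_THRESHOLD_USD:
        return "degradation"
    if outcome.user_feedback is not None:
        return "feedback"
    # Map the trace ID to a stable number in [0, 1) and sample the happy path.
    digest = hashlib.sha256(outcome.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "quality" if bucket < QUALITY_SAMPLE_RATE else "metrics_only"
```

Only the `metrics_only` tier drops payloads; every other tier keeps full spans.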
Step 2: Metrics — Reliability, Cost, and Quality
Metrics are aggregated measurements over time. LLM systems need three categories of metrics.
Reliability and Performance
| Metric | What It Measures | Why It Matters | Alert Threshold |
|---|---|---|---|
| Request rate | Requests per minute per feature | Capacity planning and adoption | Sudden drop or spike |
| Error rate | % of requests with provider errors or exceptions | Provider and application reliability | > 1% |
| TTFT (P50/P95) | Time to first token | Interactive user experience | P95 > 3s |
| Total latency (P50/P95) | End-to-end response time | User patience and SLO compliance | P95 > 10s |
| Token generation speed | Tokens per second | Model serving efficiency | < 10 tok/s |
| Retry rate | % of requests that needed a retry | Provider instability or validation issues | > 5% |
| Timeout rate | % of requests that timed out | SLO risk | > 0.5% |
| Cache hit rate | % of requests served from prefix or exact cache | Cost efficiency | < 20% indicates problem |
Cost Metrics
| Metric | What It Measures | Why It Matters | Action if High |
|---|---|---|---|
| Input tokens per day | Total prompt and context tokens | Baseline cost driver | Optimize context size, implement caching |
| Output tokens per day | Total generation tokens | Generation cost | Shorter answers, smaller model |
| Cost per request | Average cost per LLM call | Unit economics | Compare to value per request |
| Cost per successful task | Cost per completed user goal | Business value of LLM spend | Flag if > value of task |
| Spend by tenant | Cost attribution per customer | Budget ownership and billing | Large tenants may need caps |
| Spend by feature | Cost attribution per product feature | Investment decisions | High-cost, low-value features need optimization |
| Daily/weekly cost trend | Cost trajectory | Budget forecasting | Unusual growth needs investigation |
Cost per request calculation:
Cost = (input_tokens × input_price_per_token)
     + (output_tokens × output_price_per_token)
     + retrieval_cost
     + reranking_cost
Example for a RAG request:
Input: 3500 tokens × $3/M tokens = $0.0105
Output: 450 tokens × $15/M tokens = $0.0068
Retrieval: 1 vector search = $0.0002
Reranking: 20 candidates reranked = $0.0005
Total: $0.018 per request
At 100K requests/day: $1,800/day = $54,000/month
A 10% reduction in input tokens saves about $0.00105 per request, roughly $3,150/month, because only the input cost shrinks
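The arithmetic above can be checked with a few lines of code. This sketch reuses the example prices from this section; the `Pricing` class and `cost_per_request` function are hypothetical names, and the 30-day month is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_usd_per_million: float
    output_usd_per_million: float
    retrieval_usd: float   # flat cost of one vector search
    reranking_usd: float   # flat cost of one reranking pass


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Sum token costs and the flat retrieval and reranking costs."""
    input_cost = input_tokens * pricing.input_usd_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_usd_per_million / 1_000_000
    return input_cost + output_cost + pricing.retrieval_usd + pricing.reranking_usd


pricing = Pricing(
    input_usd_per_million=3.0,
    output_usd_per_million=15.0,
    retrieval_usd=0.0002,
    reranking_usd=0.0005,
)
baseline = cost_per_request(3500, 450, pricing)
trimmed = cost_per_request(3150, 450, pricing)  # 10% fewer input tokens

requests_per_day = 100_000
days_per_month = 30  # assumption
monthly_saving = (baseline - trimmed) * requests_per_day * days_per_month
print(f"baseline ${baseline:.5f}, trimmed ${trimmed:.5f}, saving ${monthly_saving:,.0f}/month")
```

Trimming context only reduces the input term, so the saving is 10% of the input cost rather than 10% of the total bill.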
Quality Metrics
| Metric | Meaning | How to Measure | Target |
|---|---|---|---|
| Answer relevance | Does the answer address the question? | LLM-as-judge or human rating | > 4/5 |
| Faithfulness | Is the answer supported by the retrieved context? | Claim extraction + citation check | > 95% |
| Context precision | Were the retrieved chunks actually useful? | Rank of relevant chunks in results | Mean rank < 3 |
| Context recall | Did retrieval include the needed evidence? | Was the expected source in top-k? | > 90% |
| Citation accuracy | Do the citations actually support the claims? | Semantic similarity check | > 95% |
| Abstention rate | How often does the model say "I do not know" instead of hallucinating? | Share of abstentions vs guesses on unanswerable questions | Depends on domain |
| Task success rate | Did the user complete their goal? | User feedback, follow-up rate | > 85% |
| User satisfaction | Explicit user rating | Thumbs up/down, survey | > 90% positive |
Step 3: RAG Evaluation
RAG has two subsystems that fail differently. Retrieval can miss relevant documents. Generation can misinterpret or ignore the retrieved evidence. They must be evaluated separately.
Retrieval Evaluation Metrics
| Metric | Definition | How to Measure |
|---|---|---|
| Recall@K | Was the expected document in the top-K results? | Check if expected doc ID is in result set |
| MRR (Mean Reciprocal Rank) | How high was the first relevant result ranked? | 1 / rank of first relevant doc, averaged |
| Precision@K | How many of top-K were relevant? | Relevant docs in top-K / K |
| NDCG (Normalized Discounted Cumulative Gain) | Quality of ranking weighted by position | Standard information retrieval metric |
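These four metrics are short enough to compute directly. The sketch below assumes binary relevance (a document is either expected or not) and uses hypothetical function names. `recall_at_k` returns the fraction of expected documents found, which reduces to the hit-or-miss check in the table when there is a single expected document.

```python
import math
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the expected documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top-k divided by k (k must be positive)."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(values: List[float]) -> float:
    """Average of per-query reciprocal ranks."""
    return sum(values) / len(values) if values else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["doc_042", "doc_001", "doc_103"]
relevant = {"doc_001", "doc_777"}  # doc_777 was expected but never retrieved
print(recall_at_k(retrieved, relevant, 3), precision_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant), ndcg_at_k(retrieved, relevant, 3))
```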
Generation Evaluation Metrics
| Metric | Definition | How to Measure |
|---|---|---|
| Faithfulness | Does the answer stay within the retrieved evidence? | Check each claim against source documents |
| Answer relevance | Does the answer directly address the question? | LLM-as-judge scoring |
| Completeness | Does the answer cover all aspects of the question? | Multi-facet question decomposition |
| Conciseness | Is the answer appropriately brief? | Token count relative to question complexity |
Building a Golden Dataset
Question: "Can I pause my annual subscription?"
Expected relevant docs: ["billing-faq.md#annual-pause", "subscription-policy.pdf#section-4.2"]
Expected answer: "Annual subscriptions cannot be paused. Customers on annual plans can contact
support for a credit review if there was a billing mistake or exceptional case."
Expected citations: [billing-faq.md#annual-pause]
Expected confidence: high
Acceptable variants: Mentions credit review, directs to support, does not promise pause
Unacceptable: Says annual subscriptions can be paused, does not mention exception path
Question: "What is the refund policy for enterprise customers?"
Expected relevant docs: ["enterprise-agreement-template.pdf#refund-clause"]
Expected answer: "Enterprise refunds are governed by the terms in your specific agreement.
Please contact your account manager for details."
Expected citations: [enterprise-agreement-template.pdf#refund-clause]
Expected behavior: Should NOT guess specific refund terms, should direct to account manager
Dataset Requirements
| Requirement | Why | Count |
|---|---|---|
| Real user questions | Representative of production traffic | 200+ |
| Edge cases | Unusual or tricky questions | 50+ |
| Unanswerable questions | Questions outside the knowledge base | 30+ |
| Unsafe or out-of-scope | Test guardrails and moderation | 20+ |
| Multi-tenant cases | Permission boundary testing | 20+ |
| Regression cases | Incidents that occurred in production | All past incidents |
A golden dataset is a living artifact: Add every production incident as a new test case. If a bad answer reached a user, your golden dataset should include that exact query and the expected correct answer. Run the full evaluation suite after every prompt change, model update, or retrieval config change.
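One possible way to encode golden cases and score them is sketched below. The `GoldenCase` structure and the phrase-matching check are assumptions made for illustration: substring matching is a crude stand-in for the "acceptable variants" and "unacceptable" criteria, and a real suite would likely combine it with a judge model or human review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoldenCase:
    question: str
    expected_docs: List[str]
    required_phrases: List[str] = field(default_factory=list)
    forbidden_phrases: List[str] = field(default_factory=list)


@dataclass
class CaseResult:
    question: str
    retrieval_ok: bool
    generation_ok: bool

    @property
    def passed(self) -> bool:
        return self.retrieval_ok and self.generation_ok


def check_case(case: GoldenCase, retrieved_docs: List[str], answer: str) -> CaseResult:
    """Score retrieval and generation separately so failures can be attributed."""
    retrieval_ok = any(doc in retrieved_docs for doc in case.expected_docs)
    text = answer.lower()
    has_required = all(p.lower() in text for p in case.required_phrases)
    has_forbidden = any(p.lower() in text for p in case.forbidden_phrases)
    return CaseResult(case.question, retrieval_ok, has_required and not has_forbidden)


pause_case = GoldenCase(
    question="Can I pause my annual subscription?",
    expected_docs=["billing-faq.md#annual-pause", "subscription-policy.pdf#section-4.2"],
    required_phrases=["cannot be paused", "support"],
    forbidden_phrases=["can be paused"],
)
good = check_case(
    pause_case,
    ["billing-faq.md#annual-pause"],
    "Annual subscriptions cannot be paused. Contact support for a credit review.",
)
bad = check_case(
    pause_case,
    ["billing-faq.md#annual-pause"],
    "Yes, annual subscriptions can be paused at any time.",
)
print(good.passed, bad.passed)
```

Keeping `retrieval_ok` and `generation_ok` as separate fields means a failing case immediately shows which subsystem regressed.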
Step 4: LLM-as-Judge
LLM-as-judge uses a separate model to evaluate the quality of your primary model's outputs. It scales evaluation to thousands of examples but introduces its own biases.
What to Judge
| Criterion | Judge Prompt Instruction | Scale |
|---|---|---|
| Relevance | "Does the answer address the user's question directly?" | 1-5 |
| Faithfulness | "Is every claim in the answer supported by the provided context?" | 1-5 |
| Completeness | "Does the answer cover all aspects of the question?" | 1-5 |
| Safety | "Does the answer contain unsafe, harmful, or policy-violating content?" | Pass/Fail |
| Conciseness | "Is the answer appropriately brief without omitting important information?" | 1-5 |
Judge Prompt Template
You are evaluating a support assistant's response.
User question: {question}
Retrieved context: {context}
Assistant response: {answer}
Rate the following dimensions on a scale of 1-5:
1. Relevance: Does the answer address the user's question?
2. Faithfulness: Are all claims supported by the context?
3. Completeness: Does the answer cover all aspects of the question?
Provide a score and a brief rationale for each dimension.
If any claim in the answer is NOT supported by the context, mark faithfulness as 1.
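A template like this is usually wrapped in code that fills it in and parses the scores. The sketch below makes two assumptions that are not part of the template above: it appends an instruction asking for JSON output, and it takes the model client as a plain `call_model` function so any provider can be plugged in. `fake_model` is a stand-in used only to make the example runnable.

```python
import json
from typing import Callable, Dict

JUDGE_TEMPLATE = (
    "You are evaluating a support assistant's response.\n"
    "User question: {question}\n"
    "Retrieved context: {context}\n"
    "Assistant response: {answer}\n"
    "Rate the following dimensions on a scale of 1-5:\n"
    "1. Relevance: Does the answer address the user's question?\n"
    "2. Faithfulness: Are all claims supported by the context?\n"
    "3. Completeness: Does the answer cover all aspects of the question?\n"
    "Provide a score and a brief rationale for each dimension.\n"
    "If any claim in the answer is NOT supported by the context, mark faithfulness as 1.\n"
    # Assumption: a JSON reply format is requested so the scores can be parsed.
    'Respond with JSON only, for example: {{"relevance": {{"score": 5, "rationale": "..."}}, '
    '"faithfulness": {{"score": 5, "rationale": "..."}}, '
    '"completeness": {{"score": 5, "rationale": "..."}}}}'
)

DIMENSIONS = ("relevance", "faithfulness", "completeness")


def parse_judge_reply(reply: str) -> Dict[str, int]:
    """Extract one integer score per dimension and reject out-of-range values."""
    data = json.loads(reply)
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        score = int(data[dim]["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{dim} score {score} is outside 1-5")
        scores[dim] = score
    return scores


def judge(question: str, context: str, answer: str,
          call_model: Callable[[str], str]) -> Dict[str, int]:
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    return parse_judge_reply(call_model(prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model call; returns a fixed reply."""
    return json.dumps({
        "relevance": {"score": 5, "rationale": "Addresses the question."},
        "faithfulness": {"score": 1, "rationale": "Account number is not in context."},
        "completeness": {"score": 4, "rationale": "Covers the main point."},
    })


scores = judge("What is my account number?", "(no account data)",
               "Your account number is ACC-784512.", fake_model)
print(scores)
```

Rejecting malformed or out-of-range replies keeps bad judge output from silently entering the quality metrics.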
Risks and Calibration
| Risk | How It Manifests | Mitigation |
|---|---|---|
| Verbosity bias | Judge prefers longer or more confident answers | Use length-normalized scoring |
| Inconsistent scoring | Same answer gets different scores on different runs | Average across 3 judge calls |
| Self-enhancement bias | Judge rates the same model family higher | Use a different model as judge |
| Position bias | Judge favors answers shown first | Randomize answer order |
| Domain blindness | Judge misses domain-specific errors | Include domain rules in judge prompt |
Calibration with Human Labels
Human-labeled example 1:
Question: "Can I pause my annual subscription?"
Context: [billing-faq: annual subscriptions cannot be paused]
Answer: "No, annual subscriptions cannot be paused. You can contact support for a credit review."
Human score: relevance=5, faithfulness=5
Judge score: relevance=5, faithfulness=5 ✓
Human-labeled example 2:
Question: "What is my account number?"
Context: [no context contains the user's account number]
Answer: "Your account number is ACC-784512."
Human score: faithfulness=1 (hallucination, not in context)
Judge score: faithfulness=5 ✗ (judge failed to detect hallucination)
If the judge consistently disagrees with human labels on certain criteria, adjust the judge prompt or switch to a different judge model.
Practical advice: Run a calibration set of 50-100 human-labeled examples before deploying a judge. If the judge's accuracy on the calibration set is below 90% for any criterion, fix the prompt or model before using it for evaluations. Recalibrate monthly or after any judge model update.
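A calibration check of this kind can be sketched as a per-criterion agreement rate. The code below assumes "accuracy" means exact agreement between judge and human scores (an optional `tolerance` allows near-misses); that definition and the function names are illustrative assumptions, not a standard.

```python
from typing import Dict, List

Label = Dict[str, int]  # criterion name -> score


def agreement_by_criterion(human: List[Label], judge: List[Label],
                           tolerance: int = 0) -> Dict[str, float]:
    """Share of examples where the judge is within `tolerance` of the human score."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    totals: Dict[str, int] = {}
    matches: Dict[str, int] = {}
    for h, j in zip(human, judge):
        for criterion, h_score in h.items():
            if criterion not in j:
                continue  # the judge did not score this criterion
            totals[criterion] = totals.get(criterion, 0) + 1
            if abs(h_score - j[criterion]) <= tolerance:
                matches[criterion] = matches.get(criterion, 0) + 1
    return {c: matches.get(c, 0) / n for c, n in totals.items()}


def failing_criteria(agreement: Dict[str, float], threshold: float = 0.90) -> List[str]:
    """Criteria whose agreement falls below the deployment threshold."""
    return sorted(c for c, rate in agreement.items() if rate < threshold)


# The two human-labeled examples from this section.
human_labels = [{"relevance": 5, "faithfulness": 5}, {"faithfulness": 1}]
judge_labels = [{"relevance": 5, "faithfulness": 5}, {"faithfulness": 5}]

agreement = agreement_by_criterion(human_labels, judge_labels)
print(agreement, failing_criteria(agreement))
```

With only two examples the rates are not meaningful; the point is the shape of the check, which would run over the full 50-100 example calibration set.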
Step 5: A/B Testing and Experiments
LLM systems change constantly: prompts, models, retrieval parameters, rerankers, tools, and guardrails. Every change is a potential regression. A/B testing measures whether the change actually improves quality, cost, or latency.
What to Test
| Experiment | Variant A (Control) | Variant B (Treatment) | What to Measure |
|---|---|---|---|
| Prompt version | "Answer concisely" | "Be thorough and include all relevant details" | Quality, latency, cost, token count |
| Model | GPT-4o-mini | GPT-4o | Quality, latency, cost |
| Retrieval top-k | top-5 | top-10 | Context recall, precision, latency |
| Reranking | No reranker | With reranker (top-5 of 20) | Answer quality, latency |
| Guardrail strictness | Loose citation requirement | Strict citation (every claim needs source) | Quality, refusal rate |
| Context window size | 2000 tokens | 4000 tokens | Quality, cost, latency |
| Temperature | 0.0 | 0.3 | Quality, variability |
Experiment Design
Experiment Requirements
| Requirement | Why | How |
|---|---|---|
| Consistent splitting | Same user should see the same variant | Hash user ID or tenant ID |
| Sufficient sample size | Statistical significance | Minimum 1000 requests per variant |
| Measure all dimensions | Quality, cost, latency, safety | Collect all metrics for both variants |
| Duration | Account for time-of-day effects | Run for at least 24-48 hours |
| Rollback plan | Revert immediately if metrics degrade | Feature flag control |
| Guardrail monitoring | Experiment should not reduce safety | Independent monitoring of guardrails |
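Consistent splitting by hashing can be sketched in a few lines. The function name and the choice of SHA-256 are assumptions for illustration. Including the experiment name in the hash key means a user's bucket in one experiment does not determine their bucket in the next.

```python
import hashlib


def assign_variant(experiment: str, unit_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user or tenant ID to variant "A" or "B"."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    # Turn the first 8 bytes of the digest into a stable number in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "B" if bucket < treatment_share else "A"


print(assign_variant("prompt-thorough-v2", "user_123"))
print(assign_variant("prompt-thorough-v2", "user_123"))  # same variant every time
```

Setting `treatment_share` to 0.0 through a feature flag sends all traffic back to the control, which doubles as the rollback path.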
Statistical Significance
Before declaring a winner, verify:
- At least 1000 samples per variant
- The difference in key metrics is statistically significant (p < 0.05)
- No metric degraded significantly (even if primary metric improved)
- The effect is consistent across tenants and user segments
- The experiment ran through at least one full business cycle
Example result:
Variant A (control): 85% quality score, $0.018/request, 2.1s latency
Variant B (treatment): 91% quality score, $0.022/request, 2.4s latency
Decision: Quality improved by 6 points (p < 0.01). Cost increased 22%.
Latency increased 14%. Deploy if the quality improvement justifies
the cost increase. Revert if not.
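For a metric that is a pass rate, one common significance check is a two-proportion z-test, which needs only the standard library. The sketch below assumes the quality score is the share of responses that passed evaluation and that each variant received 1000 requests; neither detail is stated in the example above, so treat the numbers as illustrative.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pass rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Assumed counts: 85% of 1000 control and 91% of 1000 treatment responses passed.
z, p_value = two_proportion_z_test(850, 1000, 910, 1000)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A score on a 1-5 scale is not a proportion; for that kind of metric a t-test or a bootstrap comparison would be a better fit.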
Step 6: Prompt and Model Versioning
Every production response should be traceable to the exact versions of every artifact that produced it. Without versioning, you cannot debug regressions.
| Artifact | Version It Because | Version Strategy |
|---|---|---|
| System prompt | Behavior changes | Git-tracked, semantic version, hash |
| Tool schema | Output and action changes | Schema file with version tag |
| Retrieval config | Evidence changes | Config file hash |
| Model | Quality and cost change | Provider model name + version |
| Embedding model | Index compatibility changes | Model name + training date |
| Guardrail policy | Safety behavior changes | Policy file hash |
| Reranker model | Ranking quality changes | Model name + version |
Traceable Response Metadata
Every response should carry version metadata that links back to the artifacts that produced it:
{
  "response": "Annual subscriptions cannot be paused...",
  "metadata": {
    "promptVersion": "v2.4.1",
    "promptHash": "a1b2c3d4",
    "modelName": "gpt-4o",
    "modelVersion": "2026-05-15",
    "embeddingModel": "text-embedding-3-small@002",
    "retrievalConfig": "retrieval_config_v3.json",
    "indexVersion": "idx_2026_05_28",
    "guardrailPolicy": "guardrails_v2.json",
    "requestTimestamp": "2026-06-01T10:00:00Z"
  }
}
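Metadata like this can be assembled at request time. In the sketch below, `prompt_hash` and `build_metadata` are hypothetical helpers, and deriving the hash from a truncated SHA-256 of the prompt text is an assumption about how the 8-character `promptHash` might be produced. Hashing the actual prompt text catches edits that skipped the version bump.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict


def prompt_hash(prompt_text: str, length: int = 8) -> str:
    """Short content hash of the exact prompt text that was sent."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:length]


def build_metadata(prompt_text: str, prompt_version: str,
                   versions: Dict[str, str]) -> Dict[str, str]:
    """Combine prompt identity, artifact versions, and a UTC timestamp."""
    metadata = {
        "promptVersion": prompt_version,
        "promptHash": prompt_hash(prompt_text),
    }
    metadata.update(versions)
    metadata["requestTimestamp"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return metadata


metadata = build_metadata(
    prompt_text="Answer concisely.",
    prompt_version="v2.4.1",
    versions={
        "modelName": "gpt-4o",
        "retrievalConfig": "retrieval_config_v3.json",
        "indexVersion": "idx_2026_05_28",
        "guardrailPolicy": "guardrails_v2.json",
    },
)
print(metadata["promptVersion"], metadata["promptHash"])
```

If the stored hash changes while `promptVersion` stays the same, someone edited the prompt outside the versioning process.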
Versioning Workflow
Step 7: Alerting and Dashboards
Alerts should catch regressions before users notice them.
Alert Rules
| Alert | Trigger | Possible Cause | Severity |
|---|---|---|---|
| Cost spike | Daily cost > 2x normal | Prompt grew, abuse, routing bug, context bloat | P1 |
| Latency spike | P95 TTFT > 3x normal | Provider issue, retrieval slowdown, model change | P2 |
| Validation failure rate | > 5% of requests fail validation | Prompt or schema regression | P1 |
| No-citation rate | > 10% of answers lack citations | RAG failure, retrieval config change | P2 |
| Quality drop | LLM-as-judge score drops > 10% | Prompt, model, or retrieval regression | P1 |
| Increased refusals | Refusal rate > 2x normal | Policy or guardrail config issue | P2 |
| Error rate spike | Provider error rate > 2% | Provider outage or throttling | P1 |
| Zero traffic | Request rate drops > 90% | Routing or deployment failure | P1 |
| Token consumption anomaly | Tokens per request > 3x normal | Context bloat, prompt template bug | P2 |
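A few of these rules are sketched below as data plus a small evaluator. The metric key names and the `AlertRule` class are hypothetical, and "score drops > 10%" is interpreted as a relative drop against the baseline, which is an assumption. Real deployments would normally express these rules in their monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    triggered: Callable[[Metrics, Metrics], bool]  # (current, baseline) -> bool


RULES: List[AlertRule] = [
    AlertRule("Cost spike", "P1",
              lambda cur, base: cur["daily_cost"] > 2 * base["daily_cost"]),
    AlertRule("Latency spike", "P2",
              lambda cur, base: cur["p95_ttft_ms"] > 3 * base["p95_ttft_ms"]),
    AlertRule("Validation failure rate", "P1",
              lambda cur, base: cur["validation_failure_rate"] > 0.05),
    AlertRule("No-citation rate", "P2",
              lambda cur, base: cur["no_citation_rate"] > 0.10),
    # Assumption: a "> 10% drop" is measured relative to the baseline score.
    AlertRule("Quality drop", "P1",
              lambda cur, base: cur["judge_score"] < 0.90 * base["judge_score"]),
]


def evaluate(current: Metrics, baseline: Metrics) -> List[str]:
    """Return the alerts that fire for the current metrics window."""
    return [f"{rule.severity}: {rule.name}" for rule in RULES
            if rule.triggered(current, baseline)]


baseline = {"daily_cost": 1800.0, "p95_ttft_ms": 900.0, "judge_score": 4.5}
current = {
    "daily_cost": 4000.0,
    "p95_ttft_ms": 1000.0,
    "validation_failure_rate": 0.02,
    "no_citation_rate": 0.15,
    "judge_score": 3.9,
}
print(evaluate(current, baseline))
```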
Dashboard Design
Organize dashboards by audience:
Operations Dashboard (on-call engineers):
- Request rate, error rate, latency (P50/P95) over time
- Provider status and error breakdown
- Cost per hour
- Active experiments
Quality Dashboard (ML engineers):
- LLM-as-judge quality scores over time by dimension
- Retrieval recall and precision trends
- Golden dataset pass/fail rate per version
- User feedback (thumbs up/down) over time
Business Dashboard (product managers):
- Cost per feature and per tenant
- Task success rate
- User satisfaction score
- Monthly cost trend vs budget
- Experiment results and recommendations
Common Failure Stories
The Prompt Changed But Nobody Logged It
A developer edits the system prompt directly in the production configuration file. No version bump. No pull request. No golden dataset run. Three days later, quality metrics drop. The team spends a week debugging retrieval, model, and infrastructure before someone notices the prompt changed.
The fix: prompts must be versioned and deployed through the same CI/CD pipeline as code. Every prompt change should trigger the golden dataset evaluation suite. If quality drops, the deployment should block.
The Model Update Broke Citations
The provider releases a new model version that changes the output format. The structured output schema still validates, but the model stops including citations. The validation layer passes because the schema allows empty citation arrays. For three days, the system serves uncited answers.
The fix: set a minimum citation count in the validation schema. If the model produces fewer than the required number of citations, the validation should fail and trigger an alert. Never rely on the model alone to follow formatting instructions.
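A minimum-citation check of this kind might look like the sketch below. The response shape (a `citations` list of document IDs) and the function name are assumptions. The second check, that every citation points at a document that was actually retrieved, goes one step beyond the fix described above and is included as an optional extra.

```python
from typing import Any, Dict, List


def validate_citations(response: Dict[str, Any], retrieved_ids: List[str],
                       min_citations: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the response passes."""
    problems: List[str] = []
    citations = response.get("citations") or []
    if len(citations) < min_citations:
        problems.append(f"expected at least {min_citations} citation(s), got {len(citations)}")
    # Extra check (assumption): citations must reference retrieved documents.
    unknown = [c for c in citations if c not in retrieved_ids]
    if unknown:
        problems.append(f"citations not in retrieved context: {unknown}")
    return problems


retrieved_ids = ["billing-faq.md#annual-pause"]
uncited = {"answer": "Annual subscriptions cannot be paused.", "citations": []}
cited = {"answer": "Annual subscriptions cannot be paused.",
         "citations": ["billing-faq.md#annual-pause"]}

print(validate_citations(uncited, retrieved_ids))
print(validate_citations(cited, retrieved_ids))
```

An empty array satisfies a schema that only checks types, which is exactly how the incident above slipped through; counting the entries closes that gap.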
The A/B Test Was Inconclusive Because of Traffic Skew
An experiment splits traffic 50/50 by user ID hash. But Variant B gets 80% of the traffic from enterprise tenants with complex questions. Variant B looks worse because it handles harder questions, not because the change is bad.
The fix: stratify experiment splits by tenant type, question complexity, and feature, and randomize assignment within each segment. Measure metrics per segment, not just in aggregate.
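The effect of traffic skew is easy to reproduce with made-up numbers. In the sketch below, all scores and segment sizes are fabricated for illustration and the function names are hypothetical. Variant B is slightly better within each segment but looks worse overall because it received most of the harder enterprise traffic.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, float]  # (segment, variant, quality score)


def mean_by_segment(rows: Iterable[Row]) -> Dict[Tuple[str, str], float]:
    """Average score for each (segment, variant) pair."""
    sums: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for segment, variant, score in rows:
        sums[(segment, variant)] += score
        counts[(segment, variant)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def mean_by_variant(rows: Iterable[Row]) -> Dict[str, float]:
    """Aggregate average per variant, ignoring segments."""
    merged = [("all", variant, score) for _, variant, score in rows]
    return {variant: mean for (_, variant), mean in mean_by_segment(merged).items()}


# Illustrative data only: B receives 80% of the enterprise traffic.
rows: List[Row] = (
    [("smb", "A", 0.90)] * 8 + [("smb", "B", 0.92)] * 2
    + [("enterprise", "A", 0.70)] * 2 + [("enterprise", "B", 0.72)] * 8
)

per_segment = mean_by_segment(rows)
overall = mean_by_variant(rows)
print(per_segment)
print(overall)
```

Reading only the aggregate would reject a change that helps every segment, which is why per-segment reporting belongs in the experiment readout.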
The LLM-as-Judge Missed a Domain-Specific Error
A medical question about drug interactions gets a plausible-sounding but wrong answer. The judge model, also a general-purpose LLM, does not have enough medical knowledge to detect the error. It scores faithfulness as 5 because the answer sounds reasonable.
The fix: for high-stakes domains, use a domain-specific judge model or include domain rules in the judge prompt. Better yet, do not rely solely on LLM-as-judge for domains where errors have serious consequences. Use human review for a sample.
The Golden Dataset Did Not Catch the Regression
The team updates the retrieval config from top-5 to top-10. The golden dataset passes because all expected documents appear in the top-10. But the extra 5 documents add noise, and the model starts hallucinating from irrelevant context. The quality score drops in production but the golden dataset still passes.
The fix: the golden dataset should measure not just retrieval recall but also generation quality. Add test cases where irrelevant context is present, and measure whether the model correctly ignores it. The evaluation should mirror the full production pipeline, not just individual components.
Evaluating the Observability System
The observability system itself needs evaluation. Can it actually detect regressions?
| Question | How to Answer | Target |
|---|---|---|
| Can you reproduce any past incident from traces? | Pick a past incident, check if all relevant spans exist | 100% of incidents traceable |
| How long does it take to detect a regression? | Inject a known bad change, measure detection time | < 5 minutes |
| Are golden dataset results correlated with production quality? | Compare golden dataset pass rate vs user satisfaction | Correlation > 0.8 |
| Is the alert false-positive rate acceptable? | Track alerts that did not correspond to real issues | < 10% |
| Is the observability cost within budget? | Total observability cost as % of LLM infrastructure cost | < 5% |
| Can you explain any cost change from last week? | Query cost breakdown per dimension | Always explainable |
Debugging rule: When investigating a quality regression, start with the timeline. Plot quality scores, latency, cost, and error rate on the same timeline. Then overlay version changes: prompt deploys, model updates, retrieval config changes, guardrail updates. The regression cause is almost always visible as a change in one of these dimensions at the time quality dropped.
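The overlay step of this rule can be approximated by listing every version change that landed shortly before the quality drop. The sketch below uses a hypothetical `VersionChange` record, made-up timestamps, and an assumed three-day lookback window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class VersionChange:
    at: datetime
    artifact: str
    detail: str


def suspects(drop_at: datetime, changes: List[VersionChange],
             window: timedelta = timedelta(days=3)) -> List[VersionChange]:
    """Changes deployed within `window` before the drop, most recent first."""
    in_window = [c for c in changes if drop_at - window <= c.at <= drop_at]
    return sorted(in_window, key=lambda c: c.at, reverse=True)


# Illustrative change log; the timestamps are invented for the example.
changes = [
    VersionChange(datetime(2026, 5, 20, 9, 0), "guardrail policy", "v1 -> v2"),
    VersionChange(datetime(2026, 5, 30, 9, 0), "system prompt", "concise -> thorough"),
    VersionChange(datetime(2026, 5, 31, 15, 0), "retrieval config", "top-k 5 -> 10"),
]
drop_at = datetime(2026, 6, 1, 10, 0)

for change in suspects(drop_at, changes):
    print(change.at.isoformat(), change.artifact, change.detail)
```

The function only narrows the candidate list; confirming the cause still means rerunning the golden dataset against each suspect change.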
A Complete Observability Pipeline, End to End
Observability data flows from a single request to actionable insight through every component: tracing captures what happened, metrics quantify it, evaluation scores quality, versioning provides context, and alerts surface regressions. If any one of these pieces is missing, you are flying blind.
What to Remember for Interviews
When explaining LLM observability, tell the story in order:
- Traditional observability is not enough: HTTP 200 with a bad answer is a successful call and a failed interaction. LLM systems need quality signals, not just uptime signals.
- Trace the full pipeline: Every span — retrieval, reranking, prompt assembly, LLM call, validation — must be connected by a trace ID. Sample happy paths, store 100% of failures.
- Track three categories of metrics: Reliability (latency, error rate), cost (tokens, spend per feature), and quality (relevance, faithfulness, citation accuracy).
- Evaluate retrieval and generation separately: They fail differently. Use a golden dataset with real user questions, expected sources, and acceptable answer criteria.
- Use LLM-as-judge carefully: It scales evaluation but has biases (self-enhancement, position, length, domain blindness). Calibrate against human labels. Recalibrate regularly.
- Version everything: Prompts, models, embedding models, retrieval configs, guardrail policies. Every response should carry the versions that produced it.
- A/B test every change: Prompt edits, model upgrades, retrieval changes — measure quality, cost, and latency simultaneously. A quality improvement that doubles cost may not be worth it.
- Alert on quality, not just errors: Cost spikes, quality drops, no-citation rates, and validation failures are the signals that matter for LLM systems.
Practice: Design an observability and evaluation system for a RAG support assistant. Include tracing spans for the full pipeline, golden datasets with retrieval and generation test cases, LLM-as-judge with calibration against human labels, A/B test design for a prompt change, version tracking for every artifact, and alerts for cost spikes and quality drops.