LLM Observability and Evaluation: Traces, Quality Metrics, and Experiments

Build observability and evaluation for LLM systems, including prompt traces, cost tracking, model versions, RAG metrics, LLM-as-judge, A/B tests, and regression datasets.

LLM observability, evaluation, LLM-as-judge, tracing, A/B testing

Start With a Quality Regression Nobody Noticed

Your RAG support assistant has been running smoothly for three weeks. Latency is stable at 2 seconds. Error rate is under 0.5%. The team is focused on building new features.

On day 21, a user posts on social media:

"Your AI assistant told me I can cancel my annual subscription for a full refund. That is wrong. Your policy clearly says annual subscriptions are non-refundable."

You check the monitoring dashboard. Latency: green. Error rate: green. Everything looks fine. But the answer was wrong.

What happened? Three days ago, a teammate updated the system prompt from "Answer concisely" to "Be thorough and helpful." The provider rolled out a new model version. And a config deploy changed the retrieval top-k from 5 to 10.

Each change seemed harmless. Each was logged somewhere. But no system connected the dots: prompt change + model update + retrieval change = worse answers. The observability system tracked uptime, not quality.

An LLM call can return HTTP 200 and still be a bad answer. Traditional observability — latency, errors, and saturation — is necessary but not sufficient for LLM systems. You also need to track what the model was asked, what it was given, what it produced, how much it cost, and whether the answer was any good.

✅

Mental model: Traditional observability tells you if the system is up. LLM observability tells you if the system is working. An HTTP 200 with a hallucinated answer is a successful call and a failed interaction.

Why LLM Observability Is Different

Traditional web service observability tracks three pillars: logs, metrics, and traces. For an API endpoint, you measure request rate, latency percentiles, and error codes. If the endpoint returns 200 in under 200ms, everything is fine.

LLM observability needs all of that plus:

Additional Dimension	Why It Matters	Traditional Analogy
Prompt content	The input determines the output	Request body (usually opaque)
Retrieved context	RAG quality depends on evidence	Database query plan
Model version	Different models produce different answers	API version (but changes more often)
Token count	Drives cost and latency	Response size (but billed per unit)
Answer quality	Is the response correct and safe?	Not tracked traditionally
User feedback	Did the user accept the answer?	Conversion rate
Cost per request	Unit economics matter at scale	Compute cost (but more variable)
Guardrail outcomes	Was the response validated?	Input validation result

⚠️

Important distinction: Traditional observability detects when the system breaks. LLM observability detects when the system degrades. Degradation is harder to detect because the system still works — it just works worse. You need quality signals, not just availability signals.

Why Not Just Use Application Logs?

Beginners often ask: "Can I just log the prompt and response to my existing logging system?"

You can, but LLM observability has requirements that general-purpose logging does not satisfy.

Requirement	General Logging	LLM Observability
Trace across spans	Correlate by request ID	Retrieval → prompt → model → validation must be linked
High-volume prompt storage	Expensive at LLM scale	Sampling and redaction strategies needed
Cost attribution	Not a logging feature	Token counts must be aggregated per tenant/feature
Quality scoring	Not available	Separate evaluation pipeline needed
Version tracking	Manual tagging	Automatic capture of model, prompt, config versions
Experiment comparison	Not supported	A/B test framework with statistical comparison

Step 1: Tracing the Full Pipeline

Tracing in LLM systems means capturing every step from user request to final response as connected spans. Each span captures timing, inputs, outputs, and metadata.

Span Schema

Every span should capture:

json

{
  "spanId": "span_abc123",
  "traceId": "trace_def456",
  "parentSpanId": "span_parent789",
  "service": "rag-pipeline",
  "spanName": "retrieval",
  "startTime": "2026-06-01T10:00:00.000Z",
  "durationMs": 320,
  "attributes": {
    "query": "Can I pause my annual subscription?",
    "topK": 10,
    "filter": "tenant=acme",
    "documentIds": ["doc_001", "doc_042", "doc_103"],
    "scores": [0.92, 0.87, 0.64],
    "indexVersion": "idx_2026_05_28"
  }
}
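
To make the linkage between spans concrete, here is a minimal in-memory sketch of how a tracer might stamp every span with a shared trace ID and a parent span ID. The `Tracer` and `Span` classes are hypothetical and written only for illustration. A production system would more likely use an existing tracing SDK and export spans rather than keep them in a list.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str
    span_name: str
    start_time: float                      # epoch seconds
    duration_ms: Optional[float] = None    # filled in when the span closes
    attributes: dict[str, Any] = field(default_factory=dict)


class Tracer:
    """Hypothetical in-memory tracer; one instance per incoming request."""

    def __init__(self, service: str) -> None:
        self.service = service
        self.trace_id = f"trace_{uuid.uuid4().hex[:12]}"
        self.finished: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        # The innermost open span becomes the parent of the new one.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(
            trace_id=self.trace_id,
            span_id=f"span_{uuid.uuid4().hex[:12]}",
            parent_span_id=parent,
            service=self.service,
            span_name=name,
            start_time=time.time(),
            attributes=dict(attributes),
        )
        self._stack.append(current)
        started = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - started) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer(service="rag-pipeline")
with tracer.span("request", tenant="acme") as request_span:
    with tracer.span("retrieval", topK=10, indexVersion="idx_2026_05_28") as retrieval_span:
        # Attributes discovered during the step can be attached before it closes.
        retrieval_span.attributes["documentIds"] = ["doc_001", "doc_042", "doc_103"]
```

Because spans are appended when they close, the inner `retrieval` span is recorded before the outer `request` span. Both carry the same `trace_id`, which is what lets a debugging query reassemble the full pipeline for one request.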

Which Spans to Capture

Span	Captures	Why It Matters for Debugging
Request	Tenant, feature, user type, request ID	Who asked, what feature, was it authorized
Auth	Auth result, user permissions	Was the user allowed to ask this?
Retrieval	Query, filters, top-k, document IDs, scores	Were the right documents retrieved?
Reranking	Candidate count, selected chunks, scores	Did reranking help or hurt?
Prompt assembly	Prompt version, token count, context IDs	Was the prompt constructed correctly?
LLM call	Model name, provider, latency, tokens, cost	Which model, how fast, how expensive
Validation	Schema result, safety result, retries	Did guardrails pass or fail?
Response	Citations, confidence score, user feedback	Was the answer accepted?

Sampling Strategy

Storing every trace for every request is expensive. Use a sampling strategy:

Sampling Tier	Criteria	Sample Rate	Storage
All requests	Every request, rollup metrics only	100% of metrics, 0% of payloads	Aggregated time-series
Quality sample	Random sample across all traffic	1-5%	Full spans, redacted PII
Error sample	Any validation failure, error, or guardrail block	100%	Full spans with payloads
Degradation sample	Requests with high latency or cost	100%	Full spans
Feedback sample	Requests with explicit user feedback	100%	Full spans

✅

Practical advice: Always store 100% of traces for requests that failed validation, had high latency, or received negative user feedback. These are your most valuable debugging signals. Sample the happy path at 1-5% to control cost while maintaining statistical visibility.
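
The tiers above can be sketched as a single decision function. This is an illustration under stated assumptions: the `RequestSummary` fields, the latency and cost thresholds, and the 2% happy-path rate are all made-up values, not recommendations. Hashing the trace ID (rather than calling a random number generator) is one way to keep the decision identical in every service that sees the same request.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestSummary:
    trace_id: str
    validation_failed: bool = False
    errored: bool = False
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    user_feedback: Optional[str] = None   # "up", "down", or None


# Assumed thresholds; tune them to your own SLOs and budget.
LATENCY_THRESHOLD_MS = 10_000
COST_THRESHOLD_USD = 0.10
HAPPY_PATH_SAMPLE_RATE = 0.02   # inside the 1-5% range


def sampling_tier(req: RequestSummary) -> Optional[str]:
    """Return the tier that keeps the full trace, or None for rollup-only."""
    if req.validation_failed or req.errored:
        return "error"
    if req.latency_ms > LATENCY_THRESHOLD_MS or req.cost_usd > COST_THRESHOLD_USD:
        return "degradation"
    if req.user_feedback is not None:
        return "feedback"
    # Map the trace ID to a stable bucket in [0, 1) and sample the low end.
    digest = hashlib.sha256(req.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "quality" if bucket < HAPPY_PATH_SAMPLE_RATE else None
```

Requests that return `None` still contribute to the aggregated metrics; only their payloads are dropped.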

Step 2: Metrics — Reliability, Cost, and Quality

Metrics are aggregated measurements over time. LLM systems need three categories of metrics.

Reliability and Performance

Metric	What It Measures	Why It Matters	Alert Threshold
Request rate	Requests per minute per feature	Capacity planning and adoption	Sudden drop or spike
Error rate	% of requests with provider errors or exceptions	Provider and application reliability	> 1%
TTFT (P50/P95)	Time to first token	Interactive user experience	P95 > 3s
Total latency (P50/P95)	End-to-end response time	User patience and SLO compliance	P95 > 10s
Token generation speed	Tokens per second	Model serving efficiency	< 10 tok/s
Retry rate	% of requests that needed a retry	Provider instability or validation issues	> 5%
Timeout rate	% of requests that timed out	SLO risk	> 0.5%
Cache hit rate	% of requests served from prefix or exact cache	Cost efficiency	< 20% indicates a problem

Cost Metrics

Metric	What It Measures	Why It Matters	Action if High
Input tokens per day	Total prompt and context tokens	Baseline cost driver	Optimize context size, implement caching
Output tokens per day	Total generation tokens	Generation cost	Shorter answers, smaller model
Cost per request	Average cost per LLM call	Unit economics	Compare to value per request
Cost per successful task	Cost per completed user goal	Business value of LLM spend	Flag if > value of task
Spend by tenant	Cost attribution per customer	Budget ownership and billing	Large tenants may need caps
Spend by feature	Cost attribution per product feature	Investment decisions	High-cost, low-value features need optimization
Daily/weekly cost trend	Cost trajectory	Budget forecasting	Unusual growth needs investigation

txt

Cost per request calculation:
  Cost = (input_tokens × input_price_per_token)
       + (output_tokens × output_price_per_token)
       + retrieval_cost
       + reranking_cost

Example for a RAG request:
  Input: 3500 tokens × $3/M tokens = $0.0105
  Output: 450 tokens × $15/M tokens = $0.0068
  Retrieval: 1 vector search = $0.0002
  Reranking: 20 candidates reranked = $0.0005
  Total: $0.018 per request

  At 100K requests/day: $1,800/day = $54,000/month
  A 10% cut in input tokens saves $0.00105/request, about $3,150/month
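
The same calculation can be expressed as a small function so it can be applied to real token counts. This is a sketch: the `Pricing` class and its field names are hypothetical, and the prices are the illustrative figures from the example above, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float      # USD per 1M input tokens
    output_per_million: float     # USD per 1M output tokens
    retrieval_per_request: float  # USD for the vector search
    reranking_per_request: float  # USD for reranking the candidates


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Token costs plus fixed per-request retrieval and reranking costs."""
    return (
        input_tokens * pricing.input_per_million / 1_000_000
        + output_tokens * pricing.output_per_million / 1_000_000
        + pricing.retrieval_per_request
        + pricing.reranking_per_request
    )


pricing = Pricing(
    input_per_million=3.0,
    output_per_million=15.0,
    retrieval_per_request=0.0002,
    reranking_per_request=0.0005,
)
per_request = cost_per_request(input_tokens=3500, output_tokens=450, pricing=pricing)
per_month = per_request * 100_000 * 30   # 100K requests/day, 30-day month
print(f"${per_request:.5f} per request, ${per_month:,.0f} per month")
```

Without rounding, the per-request cost is $0.01795, which gives roughly $53,850 per month; the $54,000 figure above comes from rounding to $0.018 first.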

Quality Metrics

Metric	Meaning	How to Measure	Target
Answer relevance	Does the answer address the question?	LLM-as-judge or human rating	> 4/5
Faithfulness	Is the answer supported by the retrieved context?	Claim extraction + citation check	> 95%
Context precision	Were the retrieved chunks actually useful?	Rank of relevant chunks in results	Mean rank < 3
Context recall	Did retrieval include the needed evidence?	Was the expected source in top-k?	> 90%
Citation accuracy	Do the citations actually support the claims?	Semantic similarity check	> 95%
Abstention rate	How often does the model say "I do not know" instead of hallucinating?	Count abstentions vs. unsupported guesses on unanswerable questions	Depends on domain
Task success rate	Did the user complete their goal?	User feedback, follow-up rate	> 85%
User satisfaction	Explicit user rating	Thumbs up/down, survey	> 90% positive

Step 3: RAG Evaluation

RAG has two subsystems that fail differently. Retrieval can miss relevant documents. Generation can misinterpret or ignore the retrieved evidence. They must be evaluated separately.

Retrieval Evaluation Metrics

Metric	Definition	How to Measure
Recall@K	Was the expected document in the top-K results?	Check if expected doc ID is in result set
MRR (Mean Reciprocal Rank)	How high was the first relevant result ranked?	1 / rank of first relevant doc, averaged
Precision@K	How many of top-K were relevant?	Relevant docs in top-K / K
NDCG (Normalized Discounted Cumulative Gain)	Quality of ranking weighted by position	Standard information retrieval metric
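
As a rough sketch, the four retrieval metrics can be computed from a ranked list of document IDs and a set of relevant IDs. The function names are my own, and relevance is treated as binary here. Note that `recall_at_k` returns the fraction of relevant documents found, which reduces to the yes/no check in the table when there is a single expected document.

```python
import math
from typing import Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant documents in the top k, divided by k."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["doc_042", "doc_001", "doc_103"]
relevant = {"doc_001", "doc_999"}   # doc_999 was never retrieved
scores = {
    "recall@3": recall_at_k(retrieved, relevant, 3),
    "precision@3": precision_at_k(retrieved, relevant, 3),
    "rr": reciprocal_rank(retrieved, relevant),
    "ndcg@3": ndcg_at_k(retrieved, relevant, 3),
}
```

In this made-up example one of two relevant documents is retrieved, at rank 2, so recall@3 is 0.5, precision@3 is 1/3, and the reciprocal rank is 0.5.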

Generation Evaluation Metrics

Metric	Definition	How to Measure
Faithfulness	Does the answer stay within the retrieved evidence?	Check each claim against source documents
Answer relevance	Does the answer directly address the question?	LLM-as-judge scoring
Completeness	Does the answer cover all aspects of the question?	Multi-facet question decomposition
Conciseness	Is the answer appropriately brief?	Token count relative to question complexity

Building a Golden Dataset

txt

Question: "Can I pause my annual subscription?"
Expected relevant docs: ["billing-faq.md#annual-pause", "subscription-policy.pdf#section-4.2"]
Expected answer: "Annual subscriptions cannot be paused. Customers on annual plans can contact
  support for a credit review if there was a billing mistake or exceptional case."
Expected citations: ["billing-faq.md#annual-pause"]
Expected confidence: high
Acceptable variants: Mentions credit review, directs to support, does not promise pause
Unacceptable: Says annual subscriptions can be paused, does not mention exception path

Question: "What is the refund policy for enterprise customers?"
Expected relevant docs: ["enterprise-agreement-template.pdf#refund-clause"]
Expected answer: "Enterprise refunds are governed by the terms in your specific agreement.
  Please contact your account manager for details."
Expected citations: ["enterprise-agreement-template.pdf#refund-clause"]
Expected behavior: Should NOT guess specific refund terms, should direct to account manager

Dataset Requirements

Requirement	Why	Count
Real user questions	Representative of production traffic	200+
Edge cases	Unusual or tricky questions	50+
Unanswerable questions	Questions outside the knowledge base	30+
Unsafe or out-of-scope	Test guardrails and moderation	20+
Multi-tenant cases	Permission boundary testing	20+
Regression cases	Incidents that occurred in production	All past incidents

⚠️

A golden dataset is a living artifact: Add every production incident as a new test case. If a bad answer reached a user, your golden dataset should include that exact query and the expected correct answer. Run the full evaluation suite after every prompt change, model update, or retrieval config change.
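
One way to make the dataset executable is to store each case as a record and run the whole suite against the pipeline. The sketch below is deliberately simple and makes several assumptions: `GoldenCase`, `PipelineResult`, and `run_suite` are hypothetical names, the pipeline is any callable that takes a question, and answers are checked with crude phrase matching. A real suite would likely replace the phrase checks with a calibrated judge or human review.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    expected_docs: list[str]
    must_mention: list[str] = field(default_factory=list)
    must_not_mention: list[str] = field(default_factory=list)


@dataclass
class PipelineResult:
    retrieved_docs: list[str]
    answer: str


def check_case(case: GoldenCase, result: PipelineResult) -> list[str]:
    """Return failure reasons for one case; an empty list means it passed."""
    failures = []
    missing = [d for d in case.expected_docs if d not in result.retrieved_docs]
    if missing:
        failures.append(f"retrieval missed: {missing}")
    answer = result.answer.lower()
    for phrase in case.must_mention:
        if phrase.lower() not in answer:
            failures.append(f"answer does not mention: {phrase!r}")
    for phrase in case.must_not_mention:
        if phrase.lower() in answer:
            failures.append(f"answer contains forbidden phrase: {phrase!r}")
    return failures


def run_suite(cases: list[GoldenCase], pipeline: Callable[[str], PipelineResult]) -> float:
    """Pass rate of the pipeline over the golden dataset."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if not check_case(case, pipeline(case.question)))
    return passed / len(cases)


cases = [
    GoldenCase(
        question="Can I pause my annual subscription?",
        expected_docs=["billing-faq.md#annual-pause"],
        must_mention=["support"],
        must_not_mention=["can be paused"],
    )
]


def good_pipeline(question: str) -> PipelineResult:
    # Stand-in for the real RAG pipeline.
    return PipelineResult(
        retrieved_docs=["billing-faq.md#annual-pause"],
        answer="Annual subscriptions cannot be paused. Contact support for a credit review.",
    )


def regressed_pipeline(question: str) -> PipelineResult:
    return PipelineResult(retrieved_docs=[], answer="Yes, annual subscriptions can be paused.")


baseline_pass_rate = run_suite(cases, good_pipeline)
regressed_pass_rate = run_suite(cases, regressed_pipeline)
```

Comparing the pass rate of a candidate against the current baseline gives a simple signal for whether a prompt, model, or retrieval change is safe to ship.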

Step 4: LLM-as-Judge

LLM-as-judge uses a separate model to evaluate the quality of your primary model's outputs. It scales evaluation to thousands of examples but introduces its own biases.

What to Judge

Criterion	Judge Prompt Instruction	Scale
Relevance	"Does the answer address the user's question directly?"	1-5
Faithfulness	"Is every claim in the answer supported by the provided context?"	1-5
Completeness	"Does the answer cover all aspects of the question?"	1-5
Safety	"Does the answer contain unsafe, harmful, or policy-violating content?"	Pass/Fail
Conciseness	"Is the answer appropriately brief without omitting important information?"	1-5

Judge Prompt Template

txt

You are evaluating a support assistant's response.

User question: {question}
Retrieved context: {context}
Assistant response: {answer}

Rate the following dimensions on a scale of 1-5:

1. Relevance: Does the answer address the user's question?
2. Faithfulness: Are all claims supported by the context?
3. Completeness: Does the answer cover all aspects of the question?

Provide a score and a brief rationale for each dimension.
If any claim in the answer is NOT supported by the context, mark faithfulness as 1.

Risks and Calibration

Risk	How It Manifests	Mitigation
Length bias	Judge prefers longer or more confident-sounding answers	Include length-normalized scoring
Inconsistent scoring	Same answer gets different scores on different runs	Average across 3 judge calls
Self-enhancement bias	Judge rates the same model family higher	Use a different model as judge
Position bias	In side-by-side comparisons, judge favors the answer shown first	Randomize answer order
Domain blindness	Judge misses domain-specific errors	Include domain rules in judge prompt
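
The "average across 3 judge calls" mitigation can be sketched as a thin wrapper around whatever function calls the judge model. Everything here is an assumption made for illustration: `JudgeFn` stands for your own client code that sends the rendered judge prompt and returns parsed integer scores, and the scripted judge below merely replays fixed scores so the example runs without a model.

```python
from statistics import mean
from typing import Callable

# Hypothetical signature: rendered judge prompt in, score per dimension out.
JudgeFn = Callable[[str], dict[str, int]]


def averaged_scores(judge: JudgeFn, prompt: str, runs: int = 3) -> dict[str, dict[str, float]]:
    """Call the judge several times; report the mean and spread per dimension."""
    samples = [judge(prompt) for _ in range(runs)]
    summary = {}
    for dimension in samples[0]:
        values = [sample[dimension] for sample in samples]
        summary[dimension] = {
            "mean": mean(values),
            "spread": max(values) - min(values),   # large spread = unstable judge
        }
    return summary


def make_scripted_judge(responses: list[dict[str, int]]) -> JudgeFn:
    """Stand-in for a real judge call: replays canned scores in order."""
    remaining = iter(responses)
    return lambda prompt: next(remaining)


fake_judge = make_scripted_judge([
    {"relevance": 5, "faithfulness": 5},
    {"relevance": 5, "faithfulness": 4},
    {"relevance": 5, "faithfulness": 5},
])
summary = averaged_scores(fake_judge, "rendered judge prompt goes here")
```

Tracking the spread alongside the mean makes inconsistent scoring visible instead of hiding it inside an average.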

Calibration with Human Labels

txt

Human-labeled example 1:
  Question: "Can I pause my annual subscription?"
  Context: [billing-faq: annual subscriptions cannot be paused]
  Answer: "No, annual subscriptions cannot be paused. You can contact support for a credit review."
  Human score: relevance=5, faithfulness=5
  Judge score: relevance=5, faithfulness=5 ✓

Human-labeled example 2:
  Question: "What is my account number?"
  Context: [no context contains the user's account number]
  Answer: "Your account number is ACC-784512."
  Human score: faithfulness=1 (hallucination, not in context)
  Judge score: faithfulness=5 ✗ (judge failed to detect hallucination)

If the judge consistently disagrees with human labels on certain criteria, adjust the judge prompt or switch to a different judge model.

✅

Practical advice: Run a calibration set of 50-100 human-labeled examples before deploying a judge. If the judge's agreement with the human labels is below 90% for any criterion, fix the prompt or model before using it for evaluations. Recalibrate monthly, and again after any judge model update.
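
A calibration check along these lines can be sketched as a per-criterion agreement rate between judge and human scores. The record layout and function names are assumptions, and "agreement" is taken to mean an exact score match unless a tolerance is passed; the two records reuse the human-labeled examples shown earlier.

```python
def agreement_by_criterion(examples: list[dict], tolerance: int = 0) -> dict[str, float]:
    """Share of examples where the judge is within `tolerance` of the human score."""
    totals: dict[str, int] = {}
    matches: dict[str, int] = {}
    for example in examples:
        for criterion, human_score in example["human"].items():
            if criterion not in example["judge"]:
                continue   # the judge did not score this criterion
            totals[criterion] = totals.get(criterion, 0) + 1
            if abs(example["judge"][criterion] - human_score) <= tolerance:
                matches[criterion] = matches.get(criterion, 0) + 1
    return {criterion: matches.get(criterion, 0) / n for criterion, n in totals.items()}


def criteria_needing_work(agreement: dict[str, float], threshold: float = 0.90) -> list[str]:
    """Criteria whose agreement rate falls below the calibration threshold."""
    return sorted(c for c, rate in agreement.items() if rate < threshold)


calibration_set = [
    {   # "Can I pause my annual subscription?"
        "human": {"relevance": 5, "faithfulness": 5},
        "judge": {"relevance": 5, "faithfulness": 5},
    },
    {   # "What is my account number?" (hallucinated answer)
        "human": {"faithfulness": 1},
        "judge": {"faithfulness": 5},
    },
]
agreement = agreement_by_criterion(calibration_set)
blocked = criteria_needing_work(agreement)
```

With only two records the numbers are not meaningful; the point is the shape of the check. On a real 50-100 example set, any criterion listed in `blocked` would need a prompt or model fix before the judge is trusted for that dimension.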

Step 5: A/B Testing and Experiments

LLM systems change constantly: prompts, models, retrieval parameters, rerankers, tools, and guardrails. Every change is a potential regression. A/B testing measures whether the change actually improves quality, cost, or latency.

What to Test

Experiment	Variant A (Control)	Variant B (Treatment)	What to Measure
Prompt version	"Answer concisely"	"Be thorough and include all relevant details"	Quality, latency, cost, token count
Model	GPT-4o-mini	GPT-4o	Quality, latency, cost
Retrieval top-k	top-5	top-10	Context recall, precision, latency
Reranking	No reranker	With reranker (top-5 of 20)	Answer quality, latency
Guardrail strictness	Loose citation requirement	Strict citation (every claim needs source)	Quality, refusal rate
Context window size	2000 tokens	4000 tokens	Quality, cost, latency
Temperature	0.0	0.3	Quality, variability

Experiment Design

Experiment Requirements

Requirement	Why	How
Consistent splitting	Same user should see the same variant	Hash user ID or tenant ID
Sufficient sample size	Statistical significance	Minimum 1000 requests per variant
Measure all dimensions	Quality, cost, latency, safety	Collect all metrics for both variants
Duration	Account for time-of-day and day-of-week effects	Run for at least 24-48 hours; longer if traffic varies by weekday
Rollback plan	Revert immediately if metrics degrade	Feature flag control
Guardrail monitoring	Experiment should not reduce safety	Independent monitoring of guardrails
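
Consistent splitting by hashing can be sketched in a few lines. The function name and the experiment label are hypothetical. Including the experiment name in the hash input is a design choice worth noting: it keeps a user's bucket in one experiment independent of their bucket in another.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user or tenant ID to variant "A" or "B"."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    # First 8 bytes of the digest, scaled to a stable bucket in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "B" if bucket < treatment_share else "A"


variant = assign_variant("user_12345", experiment="prompt-v2.4.1")
```

The same ID always lands in the same variant for a given experiment, with no assignment table to store. Hashing alone does not balance tenant types or question difficulty across variants, so per-segment reporting is still needed.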

Statistical Significance

txt

Before declaring a winner, verify:
- At least 1000 samples per variant
- The difference in key metrics is statistically significant (p < 0.05)
- No metric degraded significantly (even if primary metric improved)
- The effect is consistent across tenants and user segments
- The experiment ran through at least one full business cycle

Example result:
  Variant A (control):  85% quality score, $0.018/request, 2.1s latency
  Variant B (treatment): 91% quality score, $0.022/request, 2.4s latency

  Decision: Quality improved by 6 points (p < 0.01). Cost increased 22%.
  Latency increased 14%. Deploy if the quality improvement justifies
  the cost increase. Revert if not.
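
For a quality metric that is a pass rate, one common way to check significance is a two-proportion z-test. The sketch below uses only the standard library and rests on two assumptions that are not in the example above: the "quality score" is read as the fraction of responses judged acceptable, and each variant has exactly 1000 samples. Averaged 1-5 ratings would call for a different test, such as a t-test.

```python
import math
from typing import Tuple


def two_proportion_z_test(passes_a: int, n_a: int, passes_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in pass rates, B minus A."""
    rate_a = passes_a / n_a
    rate_b = passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0   # both variants passed (or failed) every sample
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts: 85% of 1000 for control, 91% of 1000 for treatment.
z, p_value = two_proportion_z_test(passes_a=850, n_a=1000, passes_b=910, n_b=1000)
significant = p_value < 0.05
```

Under these assumed counts z is a little above 4, so the p-value is far below 0.01. The test only addresses the quality metric; cost and latency changes still need their own comparison before a decision.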

Step 6: Prompt and Model Versioning

Every production response should be traceable to the exact versions of every artifact that produced it. Without versioning, you cannot debug regressions.

Artifact	Version It Because	Version Strategy
System prompt	Behavior changes	Git-tracked, semantic version, hash
Tool schema	Output and action changes	Schema file with version tag
Retrieval config	Evidence changes	Config file hash
Model	Quality and cost change	Provider model name + version
Embedding model	Index compatibility changes	Model name + training date
Guardrail policy	Safety behavior changes	Policy file hash
Reranker model	Ranking quality changes	Model name + version

Traceable Response Metadata

Every response should carry version metadata that links back to the artifacts that produced it:

json

{
  "response": "Annual subscriptions cannot be paused...",
  "metadata": {
    "promptVersion": "v2.4.1",
    "promptHash": "a1b2c3d4",
    "modelName": "gpt-4o",
    "modelVersion": "2026-05-15",
    "embeddingModel": "text-embedding-3-small@002",
    "retrievalConfig": "retrieval_config_v3.json",
    "indexVersion": "idx_2026_05_28",
    "guardrailPolicy": "guardrails_v2.json",
    "requestTimestamp": "2026-06-01T10:00:00Z"
  }
}
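
A sketch of how such metadata might be assembled at request time follows. The helper names are hypothetical, and the 8-character truncated SHA-256 is just one plausible way to produce a short `promptHash`; the article does not say how the hash is computed.

```python
import hashlib
from datetime import datetime, timezone


def prompt_hash(prompt_text: str, length: int = 8) -> str:
    """Short content hash: changes whenever the prompt text changes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:length]


def build_metadata(
    prompt_text: str,
    prompt_version: str,
    model_name: str,
    model_version: str,
    **artifact_versions: str,
) -> dict[str, str]:
    """Collect the versions of every artifact that produced a response."""
    return {
        "promptVersion": prompt_version,
        "promptHash": prompt_hash(prompt_text),
        "modelName": model_name,
        "modelVersion": model_version,
        **artifact_versions,   # e.g. indexVersion, guardrailPolicy
        "requestTimestamp": datetime.now(timezone.utc).isoformat(),
    }


metadata = build_metadata(
    "Answer concisely",
    prompt_version="v2.4.1",
    model_name="gpt-4o",
    model_version="2026-05-15",
    indexVersion="idx_2026_05_28",
    guardrailPolicy="guardrails_v2.json",
)
```

Hashing the prompt content, in addition to recording a human-assigned version, catches the case where someone edits the prompt without bumping the version number.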

Versioning Workflow

Step 7: Alerting and Dashboards

Alerts should catch regressions before users notice them.

Alert Rules

Alert	Trigger	Possible Cause	Severity
Cost spike	Daily cost > 2x normal	Prompt grew, abuse, routing bug, context bloat	P1
Latency spike	P95 TTFT > 3x normal	Provider issue, retrieval slowdown, model change	P2
Validation failure rate	> 5% of requests fail validation	Prompt or schema regression	P1
No-citation rate	> 10% of answers lack citations	RAG failure, retrieval config change	P2
Quality drop	LLM-as-judge score drops > 10%	Prompt, model, or retrieval regression	P1
Increased refusals	Refusal rate > 2x normal	Policy or guardrail config issue	P2
Error rate spike	Provider error rate > 2%	Provider outage or throttling	P1
Zero traffic	Request rate drops > 90%	Routing or deployment failure	P1
Token consumption anomaly	Tokens per request > 3x normal	Context bloat, prompt template bug	P2
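
A few of these rules can be sketched as data plus a small evaluator that compares current metrics with a baseline. The metric keys and the `AlertRule` class are hypothetical, and "drops > 10%" is interpreted here as a relative drop against the baseline score. In practice these rules would usually live in your monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    fires: Callable[[dict, dict], bool]   # (current, baseline) -> should alert


RULES = [
    AlertRule("Cost spike", "P1",
              lambda cur, base: cur["daily_cost"] > 2 * base["daily_cost"]),
    AlertRule("Validation failure rate", "P1",
              lambda cur, base: cur["validation_failure_rate"] > 0.05),
    AlertRule("No-citation rate", "P2",
              lambda cur, base: cur["no_citation_rate"] > 0.10),
    AlertRule("Quality drop", "P1",
              lambda cur, base: cur["judge_score"] < 0.90 * base["judge_score"]),
]


def firing_alerts(current: dict, baseline: dict) -> list[tuple[str, str]]:
    """Return (severity, name) for every rule whose trigger condition holds."""
    return [(rule.severity, rule.name) for rule in RULES if rule.fires(current, baseline)]


# Made-up numbers for illustration.
baseline = {"daily_cost": 1800.0, "judge_score": 4.5}
current = {
    "daily_cost": 4000.0,              # more than 2x baseline
    "validation_failure_rate": 0.01,
    "no_citation_rate": 0.15,          # above the 10% threshold
    "judge_score": 4.4,                # about a 2% drop, below the 10% trigger
}
alerts = firing_alerts(current, baseline)
```

Here the cost spike and the no-citation rate fire, while the small quality dip does not.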

Dashboard Design

Organize dashboards by audience:

Operations Dashboard (on-call engineers):

Request rate, error rate, latency (P50/P95) over time
Provider status and error breakdown
Cost per hour
Active experiments

Quality Dashboard (ML engineers):

LLM-as-judge quality scores over time by dimension
Retrieval recall and precision trends
Golden dataset pass/fail rate per version
User feedback (thumbs up/down) over time

Business Dashboard (product managers):

Cost per feature and per tenant
Task success rate
User satisfaction score
Monthly cost trend vs budget
Experiment results and recommendations

Common Failure Stories

The Prompt Changed But Nobody Logged It

A developer edits the system prompt directly in the production configuration file. No version bump. No pull request. No golden dataset run. Three days later, quality metrics drop. The team spends a week debugging retrieval, model, and infrastructure before someone notices the prompt changed.

The fix: prompts must be versioned and deployed through the same CI/CD pipeline as code. Every prompt change should trigger the golden dataset evaluation suite. If quality drops, the pipeline should block the deployment.

The Model Update Broke Citations

The provider releases a new model version that changes the output format. The structured output schema still validates, but the model stops including citations. The validation layer passes because the schema allows empty citation arrays. For three days, the system serves uncited answers.

The fix: set a minimum citation count in the validation schema. If the model produces fewer than the required number of citations, the validation should fail and trigger an alert. Never rely on the model alone to follow formatting instructions.
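
A minimum-citation check can be sketched as a plain validation function that runs after schema validation. The response shape (a dict with a `citations` list of source IDs) and the function name are assumptions for illustration; the optional second check, that every citation points at a document that was actually retrieved, goes slightly beyond the fix described above.

```python
from typing import Optional, Set


def validate_citations(
    response: dict,
    min_citations: int = 1,
    retrieved_sources: Optional[Set[str]] = None,
) -> list[str]:
    """Return validation errors; an empty list means the response passes."""
    citations = response.get("citations")
    if not isinstance(citations, list):
        return ["citations must be a list"]
    errors = []
    if len(citations) < min_citations:
        errors.append(f"expected at least {min_citations} citation(s), got {len(citations)}")
    if retrieved_sources is not None:
        unknown = [c for c in citations if c not in retrieved_sources]
        if unknown:
            errors.append(f"citations not found in retrieved context: {unknown}")
    return errors


uncited = {"answer": "Annual subscriptions cannot be paused.", "citations": []}
cited = {
    "answer": "Annual subscriptions cannot be paused.",
    "citations": ["billing-faq.md#annual-pause"],
}
uncited_errors = validate_citations(uncited)
cited_errors = validate_citations(cited, retrieved_sources={"billing-faq.md#annual-pause"})
```

An empty citation array is schema-valid but fails this check, which is exactly the gap that let the uncited answers through.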

The A/B Test Was Inconclusive Because of Traffic Skew

An experiment splits traffic 50/50 by user ID hash. But by chance, 80% of the traffic from enterprise tenants, who ask more complex questions, lands in Variant B. Variant B looks worse because it handles harder questions, not because the change is bad.

The fix: stratify experiment splits by tenant type, question complexity, and feature. Measure metrics per segment, not just aggregated. Or use a true randomized assignment within each segment.
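
The effect is easy to reproduce with synthetic numbers. In the sketch below, every count is invented for illustration: Variant B passes more often than Variant A inside each segment, yet looks worse in the aggregate because it received most of the harder enterprise traffic.

```python
from collections import defaultdict


def pass_rates(records: list[tuple[str, str, bool]], by_segment: bool) -> dict:
    """Pass rate per variant, or per (segment, variant) when by_segment is True."""
    counts = defaultdict(lambda: [0, 0])   # key -> [passed, total]
    for variant, segment, passed in records:
        key = (segment, variant) if by_segment else variant
        counts[key][0] += int(passed)
        counts[key][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def synthetic(variant: str, segment: str, passed: int, total: int) -> list[tuple[str, str, bool]]:
    """Generate `total` fake records, the first `passed` of which succeed."""
    return [(variant, segment, i < passed) for i in range(total)]


records = (
    synthetic("A", "smb", passed=72, total=80)
    + synthetic("B", "smb", passed=19, total=20)
    + synthetic("A", "enterprise", passed=12, total=20)
    + synthetic("B", "enterprise", passed=52, total=80)
)
overall = pass_rates(records, by_segment=False)
per_segment = pass_rates(records, by_segment=True)
```

Overall, A passes 84% of requests and B only 71%. Per segment, B wins both times: 95% vs 90% for SMB and 65% vs 60% for enterprise. Reporting only the aggregate would have led to the wrong decision.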

The LLM-as-Judge Missed a Domain-Specific Error

A medical question about drug interactions gets a plausible-sounding but wrong answer. The judge model, also a general-purpose LLM, does not have enough medical knowledge to detect the error. It scores faithfulness as 5 because the answer sounds reasonable.

The fix: for high-stakes domains, use a domain-specific judge model or include domain rules in the judge prompt. Better yet, do not rely solely on LLM-as-judge for domains where errors have serious consequences. Use human review for a sample.

The Golden Dataset Did Not Catch the Regression

The team updates the retrieval config from top-5 to top-10. The golden dataset passes because all expected documents appear in the top-10. But the extra 5 documents add noise, and the model starts hallucinating from irrelevant context. The quality score drops in production but the golden dataset still passes.

The fix: the golden dataset should measure not just retrieval recall but also generation quality. Add test cases where irrelevant context is present, and measure whether the model correctly ignores it. The evaluation should mirror the full production pipeline, not just individual components.

Evaluating the Observability System

The observability system itself needs evaluation. Can it actually detect regressions?

Question	How to Answer	Target
Can you reproduce any past incident from traces?	Pick a past incident, check if all relevant spans exist	100% of incidents traceable
How long does it take to detect a regression?	Inject a known bad change, measure detection time	< 5 minutes
Are golden dataset results correlated with production quality?	Compare golden dataset pass rate vs user satisfaction	Correlation > 0.8
Is the alert false-positive rate acceptable?	Track alerts that did not correspond to real issues	< 10%
Is observability spend within budget?	Total observability cost as % of LLM infrastructure cost	< 5%
Can you explain any cost change from last week?	Query cost breakdown per dimension	Always explainable

✅

Debugging rule: When investigating a quality regression, start with the timeline. Plot quality scores, latency, cost, and error rate on the same timeline. Then overlay version changes: prompt deploys, model updates, retrieval config changes, guardrail updates. The regression cause is almost always visible as a change in one of these dimensions at the time quality dropped.
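
The overlay step can be approximated in code by listing every version change that landed shortly before the drop. This is a sketch under assumptions: the change-log format, the three-day lookback window, and all dates and entries below are made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before(
    drop_time: datetime,
    change_log: list[dict],
    window: timedelta = timedelta(days=3),
) -> list[dict]:
    """Version changes inside the lookback window, most recent first."""
    recent = [c for c in change_log if drop_time - window <= c["at"] <= drop_time]
    return sorted(recent, key=lambda change: change["at"], reverse=True)


change_log = [
    {"at": datetime(2026, 5, 10, 9, 0), "artifact": "guardrail policy", "detail": "guardrails_v2.json"},
    {"at": datetime(2026, 5, 29, 14, 0), "artifact": "system prompt", "detail": "v2.4.0 -> v2.4.1"},
    {"at": datetime(2026, 5, 30, 8, 30), "artifact": "retrieval config", "detail": "topK 5 -> 10"},
]
suspects = changes_before(datetime(2026, 6, 1, 10, 0), change_log)
```

The result is a short, ordered list of suspects to check against the quality timeline, rather than a search through every system at once.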

A Complete Observability Pipeline, End to End

Observability data flows from a single request to actionable insight through five connected pieces: tracing captures what happened, metrics quantify it, evaluation scores quality, versioning provides context, and alerts surface regressions. If any one of these pieces is missing, you are flying blind.

What to Remember for Interviews

When explaining LLM observability, tell the story in order:

Traditional observability is not enough: HTTP 200 with a bad answer is a successful call and a failed interaction. LLM systems need quality signals, not just uptime signals.
Trace the full pipeline: Every span — retrieval, reranking, prompt assembly, LLM call, validation — must be connected by a trace ID. Sample happy paths, store 100% of failures.
Track three categories of metrics: Reliability (latency, error rate), cost (tokens, spend per feature), and quality (relevance, faithfulness, citation accuracy).
Evaluate retrieval and generation separately: They fail differently. Use a golden dataset with real user questions, expected sources, and acceptable answer criteria.
Use LLM-as-judge carefully: It scales evaluation but has biases (self-enhancement, position, length, domain blindness). Calibrate against human labels. Recalibrate regularly.
Version everything: Prompts, models, embedding models, retrieval configs, guardrail policies. Every response should carry the versions that produced it.
A/B test every change: Prompt edits, model upgrades, retrieval changes — measure quality, cost, and latency simultaneously. A quality improvement that doubles cost may not be worth it.
Alert on quality, not just errors: Cost spikes, quality drops, no-citation rates, and validation failures are the signals that matter for LLM systems.

✅

Practice: Design an observability and evaluation system for a RAG support assistant. Include tracing spans for the full pipeline, golden datasets with retrieval and generation test cases, LLM-as-judge with calibration against human labels, A/B test design for a prompt change, version tracking for every artifact, and alerts for cost spikes and quality drops.
