
LLM Observability and Evaluation: Traces, Quality Metrics, and Experiments

Build observability and evaluation for LLM systems, including prompt traces, cost tracking, model versions, RAG metrics, LLM-as-judge, A/B tests, and regression datasets.

Tags: LLM observability, evaluation, LLM-as-judge, tracing, A/B testing

Start With a Quality Regression Nobody Noticed

Your RAG support assistant has been running smoothly for three weeks. Latency is stable at 2 seconds. Error rate is under 0.5%. The team is focused on building new features.

On day 21, a user posts on social media:

"Your AI assistant told me I can cancel my annual subscription for a full refund. That is wrong. Your policy clearly says annual subscriptions are non-refundable."

You check the monitoring dashboard. Latency: green. Error rate: green. Everything looks fine. But the answer was wrong.

What happened? Three days ago, a teammate updated the system prompt from "Answer concisely" to "Be thorough and helpful." A new model version was rolled out by the provider. And the retrieval top-k was changed from 5 to 10 in a config deploy.

Each change seemed harmless. Each was logged somewhere. But no system connected the dots: prompt change + model update + retrieval change = worse answers. The observability system tracked uptime, not quality.

An LLM call can return HTTP 200 and still be a bad answer. Traditional observability — latency, errors, and saturation — is necessary but not sufficient for LLM systems. You also need to track what the model was asked, what it was given, what it produced, how much it cost, and whether the answer was any good.

Mental model: Traditional observability tells you if the system is up. LLM observability tells you if the system is working. An HTTP 200 with a hallucinated answer is a successful call and a failed interaction.


Why LLM Observability Is Different

Traditional web service observability tracks three pillars: logs, metrics, and traces. For an API endpoint, you measure request rate, latency percentiles, and error codes. If the endpoint returns 200 in under 200ms, everything is fine.

LLM observability needs all of that plus:

| Additional Dimension | Why It Matters | Traditional Analogy |
| --- | --- | --- |
| Prompt content | The input determines the output | Request body (usually opaque) |
| Retrieved context | RAG quality depends on evidence | Database query plan |
| Model version | Different models produce different answers | API version (but changes more often) |
| Token count | Drives cost and latency | Response size (but billed per unit) |
| Answer quality | Is the response correct and safe? | Not tracked traditionally |
| User feedback | Did the user accept the answer? | Conversion rate |
| Cost per request | Unit economics matter at scale | Compute cost (but more variable) |
| Guardrail outcomes | Was the response validated? | Input validation result |

Important distinction: Traditional observability detects when the system breaks. LLM observability detects when the system degrades. Degradation is harder to detect because the system still works — it just works worse. You need quality signals, not just availability signals.


Why Not Just Use Application Logs?

Beginners often ask: "Can I just log the prompt and response to my existing logging system?"

You can, but LLM observability has requirements that general-purpose logging does not satisfy.

| Requirement | General Logging | LLM Observability |
| --- | --- | --- |
| Trace across spans | Correlate by request ID | Retrieval → prompt → model → validation must be linked |
| High-volume prompt storage | Expensive at LLM scale | Sampling and redaction strategies needed |
| Cost attribution | Not a logging feature | Token counts must be aggregated per tenant/feature |
| Quality scoring | Not available | Separate evaluation pipeline needed |
| Version tracking | Manual tagging | Automatic capture of model, prompt, config versions |
| Experiment comparison | Not supported | A/B test framework with statistical comparison |

Step 1: Tracing the Full Pipeline

Tracing in LLM systems means capturing every step from user request to final response as connected spans. Each span captures timing, inputs, outputs, and metadata.

Span Schema

Every span should capture:

json
{
  "spanId": "span_abc123",
  "traceId": "trace_def456",
  "parentSpanId": "span_parent789",
  "service": "rag-pipeline",
  "spanName": "retrieval",
  "startTime": "2026-06-01T10:00:00.000Z",
  "durationMs": 320,
  "attributes": {
    "query": "Can I pause my annual subscription?",
    "topK": 10,
    "filter": "tenant=acme",
    "documentIds": ["doc_001", "doc_042", "doc_103"],
    "scores": [0.92, 0.87, 0.64],
    "indexVersion": "idx_2026_05_28"
  }
}
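
As a rough illustration of how spans like this get linked, here is a minimal in-memory sketch. The `Span` and `Trace` classes and their field names are hypothetical, written for this example; a production system would more likely use an existing tracing library than hand-rolled classes.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One step of the pipeline, linked to its trace and parent span."""
    trace_id: str
    span_name: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: f"span_{uuid.uuid4().hex[:12]}")
    start_time: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None
    attributes: dict[str, Any] = field(default_factory=dict)

    def finish(self) -> None:
        # Record elapsed wall-clock time in milliseconds.
        self.duration_ms = (time.time() - self.start_time) * 1000.0


class Trace:
    """Collects every span created for a single user request."""

    def __init__(self, service: str) -> None:
        self.service = service
        self.trace_id = f"trace_{uuid.uuid4().hex[:12]}"
        self.spans: list[Span] = []

    def start_span(self, name: str, parent: Optional[Span] = None, **attributes: Any) -> Span:
        span = Span(
            trace_id=self.trace_id,
            span_name=name,
            parent_span_id=parent.span_id if parent else None,
            attributes=attributes,
        )
        self.spans.append(span)
        return span


# Usage: a request span with a child retrieval span.
trace = Trace(service="rag-pipeline")
root = trace.start_span("request", tenant="acme")
retrieval = trace.start_span("retrieval", parent=root, topK=10,
                             documentIds=["doc_001", "doc_042", "doc_103"])
retrieval.finish()
root.finish()
```

Because every span carries the same `trace_id` and a `parent_span_id`, the retrieval step can later be joined to the prompt, model call, and validation result of the same request.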

Which Spans to Capture

| Span | Captures | Why It Matters for Debugging |
| --- | --- | --- |
| Request | Tenant, feature, user type, request ID | Who asked, what feature, was it authorized |
| Auth | Auth result, user permissions | Was the user allowed to ask this? |
| Retrieval | Query, filters, top-k, document IDs, scores | Were the right documents retrieved? |
| Reranking | Candidate count, selected chunks, scores | Did reranking help or hurt? |
| Prompt assembly | Prompt version, token count, context IDs | Was the prompt constructed correctly? |
| LLM call | Model name, provider, latency, tokens, cost | Which model, how fast, how expensive |
| Validation | Schema result, safety result, retries | Did guardrails pass or fail? |
| Response | Citations, confidence score, user feedback | Was the answer accepted? |

Sampling Strategy

Storing every trace for every request is expensive. Use a sampling strategy:

| Sampling Tier | Criteria | Sample Rate | Storage |
| --- | --- | --- | --- |
| All requests | Rollup metrics only | 0% payloads | Aggregated time-series |
| Quality sample | Random sample across all traffic | 1-5% | Full spans, redacted PII |
| Error sample | Any validation failure, error, or guardrail block | 100% | Full spans with payloads |
| Degradation sample | Requests with high latency or cost | 100% | Full spans |
| Feedback sample | Requests with explicit user feedback | 100% | Full spans |

Practical advice: Always store 100% of traces for requests that failed validation, had high latency, or received negative user feedback. These are your most valuable debugging signals. Sample the happy path at 1-5% to control cost while maintaining statistical visibility.
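
The tiers above can be sketched as a single decision function. This is an illustrative sketch only: the function name, the 2% happy-path rate, and the 10-second latency cutoff are assumptions chosen for the example, not values prescribed by any tool.

```python
import hashlib

HAPPY_PATH_SAMPLE_RATE = 0.02   # assumption: 2%, inside the 1-5% range
LATENCY_THRESHOLD_MS = 10_000   # assumption: treat > 10s as "high latency"


def should_store_full_trace(
    trace_id: str,
    *,
    had_error: bool,
    validation_failed: bool,
    latency_ms: float,
    has_user_feedback: bool,
    sample_rate: float = HAPPY_PATH_SAMPLE_RATE,
) -> bool:
    """Return True if the full spans for this request should be stored."""
    # Failures, slow requests, and requests with feedback are always kept.
    if had_error or validation_failed:
        return True
    if latency_ms > LATENCY_THRESHOLD_MS:
        return True
    if has_user_feedback:
        return True
    # Happy path: deterministic sampling, so every service that sees the
    # same trace ID makes the same keep/drop decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(sample_rate * 10_000)
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent across services, so a sampled trace is never stored with half of its spans missing.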


Step 2: Metrics — Reliability, Cost, and Quality

Metrics are aggregated measurements over time. LLM systems need three categories of metrics.

Reliability and Performance

| Metric | What It Measures | Why It Matters | Alert Threshold |
| --- | --- | --- | --- |
| Request rate | Requests per minute per feature | Capacity planning and adoption | Sudden drop or spike |
| Error rate | % of requests with provider errors or exceptions | Provider and application reliability | > 1% |
| TTFT (P50/P95) | Time to first token | Interactive user experience | P95 > 3s |
| Total latency (P50/P95) | End-to-end response time | User patience and SLO compliance | P95 > 10s |
| Token generation speed | Tokens per second | Model serving efficiency | < 10 tok/s |
| Retry rate | % of requests that needed a retry | Provider instability or validation issues | > 5% |
| Timeout rate | % of requests that timed out | SLO risk | > 0.5% |
| Cache hit rate | % of requests served from prefix or exact cache | Cost efficiency | < 20% indicates problem |

Cost Metrics

| Metric | What It Measures | Why It Matters | Action if High |
| --- | --- | --- | --- |
| Input tokens per day | Total prompt and context tokens | Baseline cost driver | Optimize context size, implement caching |
| Output tokens per day | Total generation tokens | Generation cost | Shorter answers, smaller model |
| Cost per request | Average cost per LLM call | Unit economics | Compare to value per request |
| Cost per successful task | Cost per completed user goal | Business value of LLM spend | Flag if > value of task |
| Spend by tenant | Cost attribution per customer | Budget ownership and billing | Large tenants may need caps |
| Spend by feature | Cost attribution per product feature | Investment decisions | High-cost, low-value features need optimization |
| Daily/weekly cost trend | Cost trajectory | Budget forecasting | Unusual growth needs investigation |

txt
Cost per request calculation:
  Cost = (input_tokens × input_price_per_token)
       + (output_tokens × output_price_per_token)
       + retrieval_cost
       + reranking_cost

Example for a RAG request:
  Input: 3500 tokens × $3/M tokens = $0.0105
  Output: 450 tokens × $15/M tokens = $0.0068
  Retrieval: 1 vector search = $0.0002
  Reranking: 20 candidates reranked = $0.0005
  Total: $0.018 per request

  At 100K requests/day: $1,800/day = $54,000/month
  A 10% reduction in input tokens saves $0.00105/request = $105/day = $3,150/month
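
The same arithmetic as a small function, which makes it easy to test "what if" scenarios. The prices and per-call retrieval and reranking costs are the illustrative figures from the example above, not real provider pricing, and the function name is hypothetical.

```python
INPUT_PRICE_PER_M = 3.00    # dollars per million input tokens (example figure)
OUTPUT_PRICE_PER_M = 15.00  # dollars per million output tokens (example figure)


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    retrieval_cost: float = 0.0002,
    reranking_cost: float = 0.0005,
) -> float:
    """Cost of one RAG request in dollars."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost + retrieval_cost + reranking_cost


REQUESTS_PER_DAY = 100_000
DAYS_PER_MONTH = 30

baseline = cost_per_request(input_tokens=3500, output_tokens=450)
# Same request with 10% fewer input tokens (3500 -> 3150).
reduced = cost_per_request(input_tokens=3150, output_tokens=450)
monthly_savings = (baseline - reduced) * REQUESTS_PER_DAY * DAYS_PER_MONTH
```

With these figures the unrounded baseline is $0.01795 per request. Trimming input tokens by 10% lowers only the input term, so the saving is about $3,150 per month at 100K requests per day, not 10% of the total bill.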

Quality Metrics

| Metric | Meaning | How to Measure | Target |
| --- | --- | --- | --- |
| Answer relevance | Does the answer address the question? | LLM-as-judge or human rating | > 4/5 |
| Faithfulness | Is the answer supported by the retrieved context? | Claim extraction + citation check | > 95% |
| Context precision | Were the retrieved chunks actually useful? | Rank of relevant chunks in results | Mean rank < 3 |
| Context recall | Did retrieval include the needed evidence? | Was the expected source in top-k? | > 90% |
| Citation accuracy | Do the citations actually support the claims? | Semantic similarity check | > 95% |
| Abstention rate | How often does the model say "I do not know" vs hallucinate? | Count abstentions vs unsupported guesses on unanswerable questions | Depends on domain |
| Task success rate | Did the user complete their goal? | User feedback, follow-up rate | > 85% |
| User satisfaction | Explicit user rating | Thumbs up/down, survey | > 90% positive |

Step 3: RAG Evaluation

RAG has two subsystems that fail differently. Retrieval can miss relevant documents. Generation can misinterpret or ignore the retrieved evidence. They must be evaluated separately.

Retrieval Evaluation Metrics

| Metric | Definition | How to Measure |
| --- | --- | --- |
| Recall@K | Was the expected document in the top-K results? | Check if expected doc ID is in result set |
| MRR (Mean Reciprocal Rank) | How high was the first relevant result ranked? | 1 / rank of first relevant doc, averaged |
| Precision@K | How many of top-K were relevant? | Relevant docs in top-K / K |
| NDCG (Normalized Discounted Cumulative Gain) | Quality of ranking weighted by position | Standard information retrieval metric |
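
These four metrics are short enough to write out directly. The sketch below assumes binary relevance (a document is either relevant or not) and uses hypothetical function names. Recall here is the fraction of expected documents found, which reduces to the hit/miss check in the table when there is one expected document.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Usage: both expected docs were retrieved, but at ranks 2 and 3.
retrieved = ["doc_103", "doc_001", "doc_042", "doc_200", "doc_300"]
relevant = {"doc_001", "doc_042"}
recall = recall_at_k(retrieved, relevant, k=5)
precision = precision_at_k(retrieved, relevant, k=5)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, k=5)
```

MRR is the mean of `reciprocal_rank` over all questions in the evaluation set.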

Generation Evaluation Metrics

| Metric | Definition | How to Measure |
| --- | --- | --- |
| Faithfulness | Does the answer stay within the retrieved evidence? | Check each claim against source documents |
| Answer relevance | Does the answer directly address the question? | LLM-as-judge scoring |
| Completeness | Does the answer cover all aspects of the question? | Multi-facet question decomposition |
| Conciseness | Is the answer appropriately brief? | Token count relative to question complexity |

Building a Golden Dataset

txt
Question: "Can I pause my annual subscription?"
Expected relevant docs: ["billing-faq.md#annual-pause", "subscription-policy.pdf#section-4.2"]
Expected answer: "Annual subscriptions cannot be paused. Customers on annual plans can contact
  support for a credit review if there was a billing mistake or exceptional case."
Expected citations: [billing-faq.md#annual-pause]
Expected confidence: high
Acceptable variants: Mentions credit review, directs to support, does not promise pause
Unacceptable: Says annual subscriptions can be paused, does not mention exception path

Question: "What is the refund policy for enterprise customers?"
Expected relevant docs: ["enterprise-agreement-template.pdf#refund-clause"]
Expected answer: "Enterprise refunds are governed by the terms in your specific agreement.
  Please contact your account manager for details."
Expected citations: [enterprise-agreement-template.pdf#refund-clause]
Expected behavior: Should NOT guess specific refund terms, should direct to account manager

Dataset Requirements

| Requirement | Why | Count |
| --- | --- | --- |
| Real user questions | Representative of production traffic | 200+ |
| Edge cases | Unusual or tricky questions | 50+ |
| Unanswerable questions | Questions outside the knowledge base | 30+ |
| Unsafe or out-of-scope | Test guardrails and moderation | 20+ |
| Multi-tenant cases | Permission boundary testing | 20+ |
| Regression cases | Incidents that occurred in production | All past incidents |

A golden dataset is a living artifact: Add every production incident as a new test case. If a bad answer reached a user, your golden dataset should include that exact query and the expected correct answer. Run the full evaluation suite after every prompt change, model update, or retrieval config change.
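
A regression suite over the golden dataset can start as simple as the sketch below. Everything here is an assumption made for illustration: the `GoldenCase` and `PipelineResult` shapes, the `run_suite` name, and especially the phrase matching, which is a crude stand-in for the judge-based or human checks a real suite would use for answer quality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    expected_docs: set[str]
    required_phrases: list[str]
    forbidden_phrases: list[str]


@dataclass
class PipelineResult:
    answer: str
    retrieved_docs: list[str]


def evaluate_case(case: GoldenCase, result: PipelineResult) -> list[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    answer = result.answer.lower()
    failures = []
    # Retrieval check: at least one expected source must be retrieved.
    if not case.expected_docs & set(result.retrieved_docs):
        failures.append("retrieval: no expected document retrieved")
    # Generation checks: crude phrase matching on the answer text.
    for phrase in case.required_phrases:
        if phrase.lower() not in answer:
            failures.append(f"answer is missing: {phrase!r}")
    for phrase in case.forbidden_phrases:
        if phrase.lower() in answer:
            failures.append(f"answer contains forbidden claim: {phrase!r}")
    return failures


def run_suite(cases: list[GoldenCase],
              pipeline: Callable[[str], PipelineResult]) -> dict[str, list[str]]:
    return {case.question: evaluate_case(case, pipeline(case.question)) for case in cases}


# Usage with a stubbed pipeline standing in for the real RAG system.
cases = [GoldenCase(
    question="Can I pause my annual subscription?",
    expected_docs={"billing-faq.md#annual-pause", "subscription-policy.pdf#section-4.2"},
    required_phrases=["cannot be paused", "support"],
    forbidden_phrases=["can be paused"],
)]


def stub_pipeline(question: str) -> PipelineResult:
    return PipelineResult(
        answer="Annual subscriptions cannot be paused. Contact support for a credit review.",
        retrieved_docs=["billing-faq.md#annual-pause"],
    )


report = run_suite(cases, stub_pipeline)
```

Running this in CI on every prompt, model, or retrieval change turns each past incident into a permanent check.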


Step 4: LLM-as-Judge

LLM-as-judge uses a separate model to evaluate the quality of your primary model's outputs. It scales evaluation to thousands of examples but introduces its own biases.

What to Judge

| Criterion | Judge Prompt Instruction | Scale |
| --- | --- | --- |
| Relevance | "Does the answer address the user's question directly?" | 1-5 |
| Faithfulness | "Is every claim in the answer supported by the provided context?" | 1-5 |
| Completeness | "Does the answer cover all aspects of the question?" | 1-5 |
| Safety | "Does the answer contain unsafe, harmful, or policy-violating content?" | Pass/Fail |
| Conciseness | "Is the answer appropriately brief without omitting important information?" | 1-5 |

Judge Prompt Template

txt
You are evaluating a support assistant's response.

User question: {question}
Retrieved context: {context}
Assistant response: {answer}

Rate the following dimensions on a scale of 1-5:

1. Relevance: Does the answer address the user's question?
2. Faithfulness: Are all claims supported by the context?
3. Completeness: Does the answer cover all aspects of the question?

Provide a score and a brief rationale for each dimension.
If any claim in the answer is NOT supported by the context, mark faithfulness as 1.

Risks and Calibration

| Risk | How It Manifests | Mitigation |
| --- | --- | --- |
| Judge bias | Judge prefers longer or more confident answers | Include length-normalized scoring |
| Inconsistent scoring | Same answer gets different scores on different runs | Average across 3 judge calls |
| Self-enhancement bias | Judge rates the same model family higher | Use a different model as judge |
| Position bias | Judge favors answers shown first | Randomize answer order |
| Domain blindness | Judge misses domain-specific errors | Include domain rules in judge prompt |
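
The "average across 3 judge calls" mitigation might look like the sketch below. The `judge` callable is a placeholder for whatever function actually calls the judge model; its signature and the stub used in the example are assumptions.

```python
import statistics
from typing import Callable

# Assumed signature: judge(question, context, answer) -> integer score 1-5.
JudgeFn = Callable[[str, str, str], int]


def averaged_judge_score(judge: JudgeFn, question: str, context: str,
                         answer: str, runs: int = 3) -> tuple[float, int]:
    """Call the judge several times; return the mean score and the spread."""
    scores = [judge(question, context, answer) for _ in range(runs)]
    return statistics.mean(scores), max(scores) - min(scores)


# Usage with a stub judge that returns a different score on each call.
_fake_scores = iter([4, 5, 3])


def stub_judge(question: str, context: str, answer: str) -> int:
    return next(_fake_scores)


mean_score, spread = averaged_judge_score(stub_judge, "q", "ctx", "ans")
```

A large spread is itself a useful signal: examples where the judge disagrees with itself are good candidates for human review.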

Calibration with Human Labels

txt
Human-labeled example 1:
  Question: "Can I pause my annual subscription?"
  Context: [billing-faq: annual subscriptions cannot be paused]
  Answer: "No, annual subscriptions cannot be paused. You can contact support for a credit review."
  Human score: relevance=5, faithfulness=5
  Judge score: relevance=5, faithfulness=5 ✓

Human-labeled example 2:
  Question: "What is my account number?"
  Context: [no context contains the user's account number]
  Answer: "Your account number is ACC-784512."
  Human score: faithfulness=1 (hallucination, not in context)
  Judge score: faithfulness=5 ✗ (judge failed to detect hallucination)

If the judge consistently disagrees with human labels on certain criteria, adjust the judge prompt or switch to a different judge model.

Practical advice: Run a calibration set of 50-100 human-labeled examples before deploying a judge. If the judge agrees with the human labels on fewer than 90% of examples for any criterion, fix the prompt or model before using it for evaluations. Recalibrate monthly or after any judge model update.
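
One way to run that check is to compute per-criterion agreement between judge and human scores. The data layout and function names below are hypothetical, and "agreement" is defined here as an exact score match, which is only one reasonable choice (a tolerance of one point is another).

```python
from collections import defaultdict


def agreement_by_criterion(examples: list[dict], tolerance: int = 0) -> dict[str, float]:
    """Share of examples where the judge score is within `tolerance` of the human score."""
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for example in examples:
        for criterion, human_score in example["human"].items():
            judge_score = example["judge"].get(criterion)
            if judge_score is None:
                continue  # the judge did not score this criterion
            totals[criterion] += 1
            if abs(judge_score - human_score) <= tolerance:
                matches[criterion] += 1
    return {criterion: matches[criterion] / totals[criterion] for criterion in totals}


def criteria_needing_work(agreement: dict[str, float], threshold: float = 0.90) -> list[str]:
    """Criteria where judge-human agreement falls below the threshold."""
    return sorted(c for c, rate in agreement.items() if rate < threshold)


# Usage with the two labeled examples above.
calibration_set = [
    {"human": {"relevance": 5, "faithfulness": 5}, "judge": {"relevance": 5, "faithfulness": 5}},
    {"human": {"faithfulness": 1}, "judge": {"faithfulness": 5}},
]
agreement = agreement_by_criterion(calibration_set)
flagged = criteria_needing_work(agreement)
```

On these two examples faithfulness agreement is 50%, so the judge would be flagged on that criterion. A real calibration set needs far more than two examples for the rate to mean anything.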


Step 5: A/B Testing and Experiments

LLM systems change constantly: prompts, models, retrieval parameters, rerankers, tools, and guardrails. Every change is a potential regression. A/B testing measures whether the change actually improves quality, cost, or latency.

What to Test

| Experiment | Variant A (Control) | Variant B (Treatment) | What to Measure |
| --- | --- | --- | --- |
| Prompt version | "Answer concisely" | "Be thorough and include all relevant details" | Quality, latency, cost, token count |
| Model | GPT-4o-mini | GPT-4o | Quality, latency, cost |
| Retrieval top-k | top-5 | top-10 | Context recall, precision, latency |
| Reranking | No reranker | With reranker (top-5 of 20) | Answer quality, latency |
| Guardrail strictness | Loose citation requirement | Strict citation (every claim needs source) | Quality, refusal rate |
| Context window size | 2000 tokens | 4000 tokens | Quality, cost, latency |
| Temperature | 0.0 | 0.3 | Quality, variability |

Experiment Design

Experiment Requirements

| Requirement | Why | How |
| --- | --- | --- |
| Consistent splitting | Same user should see the same variant | Hash user ID or tenant ID |
| Sufficient sample size | Statistical significance | Minimum 1000 requests per variant |
| Measure all dimensions | Quality, cost, latency, safety | Collect all metrics for both variants |
| Duration | Account for time-of-day effects | Run for at least 24-48 hours |
| Rollback plan | Revert immediately if metrics degrade | Feature flag control |
| Guardrail monitoring | Experiment should not reduce safety | Independent monitoring of guardrails |
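
Consistent splitting by hashing can be sketched in a few lines. The function name and the experiment key format are assumptions for illustration. Including the experiment name in the hash is a design choice that prevents the same users from landing in the treatment group of every experiment.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user or tenant ID to variant "A" or "B"."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 10_000
    return "B" if bucket < round(treatment_share * 10_000) else "A"


# Usage: the same user always gets the same variant for a given experiment.
variant = assign_variant("user_42", experiment="prompt-v2.4.1-vs-v2.5.0")
```

Hashing gives stable assignment without storing any state, and the feature flag from the rollback plan can simply set `treatment_share` to 0.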

Statistical Significance

txt
Before declaring a winner, verify:
- At least 1000 samples per variant
- The difference in key metrics is statistically significant (p < 0.05)
- No metric degraded significantly (even if primary metric improved)
- The effect is consistent across tenants and user segments
- The experiment ran through at least one full business cycle

Example result:
  Variant A (control):  85% quality score, $0.018/request, 2.1s latency
  Variant B (treatment): 91% quality score, $0.022/request, 2.4s latency

  Decision: Quality improved by 6 points (p < 0.01). Cost increased 22%.
  Latency increased 14%. Deploy if the quality improvement justifies
  the cost increase. Revert if not.
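
To check a result like this one, a two-proportion z-test is a common starting point. The sketch below assumes the "quality score" is a pass rate (the share of responses rated acceptable) and that each variant received 1000 requests. Both are assumptions, since the example above does not say how the score was computed.

```python
import math


def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Usage: 85% vs 91% pass rate, assuming 1000 requests per variant.
z, p_value = two_proportion_z_test(850, 1000, 910, 1000)
```

Under these assumptions z is a little above 4 and the p-value is well below 0.01, consistent with the decision above. If the quality score is an average rating and not a pass rate, a t-test or bootstrap on the per-request scores would be the more appropriate tool.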

Step 6: Prompt and Model Versioning

Every production response should be traceable to the exact versions of every artifact that produced it. Without versioning, you cannot debug regressions.

| Artifact | Version It Because | Version Strategy |
| --- | --- | --- |
| System prompt | Behavior changes | Git-tracked, semantic version, hash |
| Tool schema | Output and action changes | Schema file with version tag |
| Retrieval config | Evidence changes | Config file hash |
| Model | Quality and cost change | Provider model name + version |
| Embedding model | Index compatibility changes | Model name + training date |
| Guardrail policy | Safety behavior changes | Policy file hash |
| Reranker model | Ranking quality changes | Model name + version |

Traceable Response Metadata

Every response should carry version metadata that links back to the artifacts that produced it:

json
{
  "response": "Annual subscriptions cannot be paused...",
  "metadata": {
    "promptVersion": "v2.4.1",
    "promptHash": "a1b2c3d4",
    "modelName": "gpt-4o",
    "modelVersion": "2026-05-15",
    "embeddingModel": "text-embedding-3-small@002",
    "retrievalConfig": "retrieval_config_v3.json",
    "indexVersion": "idx_2026_05_28",
    "guardrailPolicy": "guardrails_v2.json",
    "requestTimestamp": "2026-06-01T10:00:00Z"
  }
}
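
A minimal sketch of how such metadata could be attached at response time. The helper names are hypothetical, and truncating a SHA-256 digest to 8 hex characters is an assumption chosen to match the short hash shown above.

```python
import hashlib
from datetime import datetime, timezone


def prompt_hash(prompt_text: str) -> str:
    """Short content hash: changes whenever the prompt text changes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:8]


def attach_metadata(answer: str, prompt_text: str, versions: dict[str, str]) -> dict:
    """Wrap an answer with the versions of every artifact that produced it."""
    return {
        "response": answer,
        "metadata": {
            **versions,
            "promptHash": prompt_hash(prompt_text),
            "requestTimestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        },
    }


# Usage: version values would normally come from deploy-time configuration.
result = attach_metadata(
    answer="Annual subscriptions cannot be paused...",
    prompt_text="Answer concisely.",
    versions={
        "promptVersion": "v2.4.1",
        "modelName": "gpt-4o",
        "indexVersion": "idx_2026_05_28",
    },
)
```

The hash catches the case where someone edits the prompt text without bumping the version number: the version stays the same but the hash changes.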

Versioning Workflow


Step 7: Alerting and Dashboards

Alerts should catch regressions before users notice them.

Alert Rules

| Alert | Trigger | Possible Cause | Severity |
| --- | --- | --- | --- |
| Cost spike | Daily cost > 2x normal | Prompt grew, abuse, routing bug, context bloat | P1 |
| Latency spike | P95 TTFT > 3x normal | Provider issue, retrieval slowdown, model change | P2 |
| Validation failure rate | > 5% of requests fail validation | Prompt or schema regression | P1 |
| No-citation rate | > 10% of answers lack citations | RAG failure, retrieval config change | P2 |
| Quality drop | LLM-as-judge score drops > 10% | Prompt, model, or retrieval regression | P1 |
| Increased refusals | Refusal rate > 2x normal | Policy or guardrail config issue | P2 |
| Error rate spike | Provider error rate > 2% | Provider outage or throttling | P1 |
| Zero traffic | Request rate drops > 90% | Routing or deployment failure | P1 |
| Token consumption anomaly | Tokens per request > 3x normal | Context bloat, prompt template bug | P2 |
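
Several of these rules can be expressed as data and evaluated against current and baseline metrics. This is a sketch only: the metric key names are hypothetical, "normal" is assumed to be a precomputed baseline, and only five of the nine rules are shown.

```python
from typing import Callable

Metrics = dict[str, float]
Rule = tuple[str, str, Callable[[Metrics, Metrics], bool]]

# (alert name, severity, predicate over current and baseline metrics)
ALERT_RULES: list[Rule] = [
    ("Cost spike", "P1", lambda cur, base: cur["daily_cost"] > 2 * base["daily_cost"]),
    ("Latency spike", "P2", lambda cur, base: cur["p95_ttft_s"] > 3 * base["p95_ttft_s"]),
    ("Validation failure rate", "P1", lambda cur, base: cur["validation_failure_rate"] > 0.05),
    ("No-citation rate", "P2", lambda cur, base: cur["no_citation_rate"] > 0.10),
    # "Drops > 10%" is read as a relative drop against the baseline score.
    ("Quality drop", "P1", lambda cur, base: cur["judge_score"] < 0.90 * base["judge_score"]),
]


def evaluate_alerts(current: Metrics, baseline: Metrics) -> list[tuple[str, str]]:
    """Return (alert name, severity) for every rule that fires."""
    return [(name, severity) for name, severity, rule in ALERT_RULES if rule(current, baseline)]


# Usage: cost more than doubled and citations are missing; quality is steady.
baseline = {"daily_cost": 1800.0, "p95_ttft_s": 1.0, "judge_score": 4.5}
current = {
    "daily_cost": 4000.0,
    "p95_ttft_s": 1.2,
    "validation_failure_rate": 0.02,
    "no_citation_rate": 0.15,
    "judge_score": 4.4,
}
fired = evaluate_alerts(current, baseline)
```

Keeping the rules in a list makes thresholds easy to review and version alongside the prompts and configs they protect.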

Dashboard Design

Organize dashboards by audience:

Operations Dashboard (on-call engineers):

  • Request rate, error rate, latency (P50/P95) over time
  • Provider status and error breakdown
  • Cost per hour
  • Active experiments

Quality Dashboard (ML engineers):

  • LLM-as-judge quality scores over time by dimension
  • Retrieval recall and precision trends
  • Golden dataset pass/fail rate per version
  • User feedback (thumbs up/down) over time

Business Dashboard (product managers):

  • Cost per feature and per tenant
  • Task success rate
  • User satisfaction score
  • Monthly cost trend vs budget
  • Experiment results and recommendations

Common Failure Stories

The Prompt Changed But Nobody Logged It

A developer edits the system prompt directly in the production configuration file. No version bump. No pull request. No golden dataset run. Three days later, quality metrics drop. The team spends a week debugging retrieval, model, and infrastructure before someone notices the prompt changed.

The fix: prompts must be versioned and deployed through the same CI/CD pipeline as code. Every prompt change should trigger the golden dataset evaluation suite. If quality drops, the deployment should be blocked.

The Model Update Broke Citations

The provider releases a new model version that changes the output format. The structured output schema still validates, but the model stops including citations. The validation layer passes because the schema allows empty citation arrays. For three days, the system serves uncited answers.

The fix: set a minimum citation count in the validation schema. If the model produces fewer than the required number of citations, the validation should fail and trigger an alert. Never rely on the model alone to follow formatting instructions.
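
A minimum-citation check is a few lines of validation code. The response shape (a dict with a `citations` list) and the function name are assumptions for this sketch. The same rule can often be expressed in the schema itself if the schema language supports a minimum array length.

```python
def validate_citations(response: dict, min_citations: int = 1) -> list[str]:
    """Return validation errors; an empty list means the response passes."""
    errors = []
    citations = response.get("citations")
    if not isinstance(citations, list):
        errors.append("citations must be a list")
    elif len(citations) < min_citations:
        errors.append(
            f"expected at least {min_citations} citation(s), got {len(citations)}"
        )
    return errors


# Usage: an empty citations array is schema-valid but fails this check.
uncited = {"answer": "Annual subscriptions cannot be paused.", "citations": []}
cited = {"answer": "Annual subscriptions cannot be paused.",
         "citations": ["billing-faq.md#annual-pause"]}
uncited_errors = validate_citations(uncited)
cited_errors = validate_citations(cited)
```

Any non-empty error list should fail validation and feed the no-citation alert, so the regression surfaces within minutes and not after three days.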

The A/B Test Was Inconclusive Because of Traffic Skew

An experiment splits traffic 50/50 by user ID hash. But Variant B gets 80% of the traffic from enterprise tenants with complex questions. Variant B looks worse because it handles harder questions, not because the change is bad.

The fix: stratify experiment splits by tenant type, question complexity, and feature, and report metrics per segment as well as in aggregate. Alternatively, randomize assignment within each segment so that both variants see the same mix of traffic.

The LLM-as-Judge Missed a Domain-Specific Error

A medical question about drug interactions gets a plausible-sounding but wrong answer. The judge model, also a general-purpose LLM, does not have enough medical knowledge to detect the error. It scores faithfulness as 5 because the answer sounds reasonable.

The fix: for high-stakes domains, use a domain-specific judge model or include domain rules in the judge prompt. Better yet, do not rely solely on LLM-as-judge for domains where errors have serious consequences. Use human review for a sample.

The Golden Dataset Did Not Catch the Regression

The team updates the retrieval config from top-5 to top-10. The golden dataset passes because all expected documents appear in the top-10. But the extra 5 documents add noise, and the model starts hallucinating from irrelevant context. The quality score drops in production but the golden dataset still passes.

The fix: the golden dataset should measure not just retrieval recall but also generation quality. Add test cases where irrelevant context is present, and measure whether the model correctly ignores it. The evaluation should mirror the full production pipeline, not just individual components.


Evaluating the Observability System

The observability system itself needs evaluation. Can it actually detect regressions?

| Question | How to Answer | Target |
| --- | --- | --- |
| Can you reproduce any past incident from traces? | Pick a past incident, check if all relevant spans exist | 100% of incidents traceable |
| How long does it take to detect a regression? | Inject a known bad change, measure detection time | < 5 minutes |
| Are golden dataset results correlated with production quality? | Compare golden dataset pass rate vs user satisfaction | Correlation > 0.8 |
| Is the alert false-positive rate acceptable? | Track alerts that did not correspond to real issues | < 10% |
| Are all spans within budget? | Total observability cost as % of LLM infrastructure cost | < 5% |
| Can you explain any cost change from last week? | Query cost breakdown per dimension | Always explainable |

Debugging rule: When investigating a quality regression, start with the timeline. Plot quality scores, latency, cost, and error rate on the same timeline. Then overlay version changes: prompt deploys, model updates, retrieval config changes, guardrail updates. The regression cause is almost always visible as a change in one of these dimensions at the time quality dropped.


A Complete Observability Pipeline, End to End

Observability data flows from a single request to actionable insight in five stages: tracing captures what happened, metrics quantify it, evaluation scores quality, versioning provides context, and alerts surface regressions. If any one of these pieces is missing, you have a blind spot.


What to Remember for Interviews

When explaining LLM observability, tell the story in order:

  1. Traditional observability is not enough: HTTP 200 with a bad answer is a successful call and a failed interaction. LLM systems need quality signals, not just uptime signals.
  2. Trace the full pipeline: Every span — retrieval, reranking, prompt assembly, LLM call, validation — must be connected by a trace ID. Sample happy paths, store 100% of failures.
  3. Track three categories of metrics: Reliability (latency, error rate), cost (tokens, spend per feature), and quality (relevance, faithfulness, citation accuracy).
  4. Evaluate retrieval and generation separately: They fail differently. Use a golden dataset with real user questions, expected sources, and acceptable answer criteria.
  5. Use LLM-as-judge carefully: It scales evaluation but has biases (self-enhancement, position, length, domain blindness). Calibrate against human labels. Recalibrate regularly.
  6. Version everything: Prompts, models, embedding models, retrieval configs, guardrail policies. Every response should carry the versions that produced it.
  7. A/B test every change: Prompt edits, model upgrades, retrieval changes — measure quality, cost, and latency simultaneously. A quality improvement that doubles cost may not be worth it.
  8. Alert on quality, not just errors: Cost spikes, quality drops, no-citation rates, and validation failures are the signals that matter for LLM systems.

Practice: Design an observability and evaluation system for a RAG support assistant. Include tracing spans for the full pipeline, golden datasets with retrieval and generation test cases, LLM-as-judge with calibration against human labels, A/B test design for a prompt change, version tracking for every artifact, and alerts for cost spikes and quality drops.