Generative AI Systems: Architecture, LLMs, RAG, and Production Considerations
Learn the architecture of generative AI systems, including LLM fundamentals, retrieval-augmented generation (RAG), prompt engineering, and deploying AI at scale.
The Rise of Generative AI in System Design
Generative AI has transformed how we build software. Understanding the architecture behind these systems is essential for any system designer working with AI-powered features.
Key insight: LLMs are powerful but have limitations (hallucinations, knowledge cutoff, latency). Production AI systems combine LLMs with retrieval, grounding, and guardrails to build reliable products.
How LLMs Work
Transformer Architecture
Large Language Models are based on the Transformer architecture, introduced in the "Attention Is All You Need" paper.
Key Concepts
| Concept | Description |
|---|---|
| Token | Text broken into pieces (~4 chars each) |
| Embedding | Numeric vector representing token meaning |
| Attention | Mechanism to relate different positions |
| Context Window | Max tokens the model can "see" |
| Temperature | Controls sampling randomness (0 ≈ deterministic, higher = more varied) |
Token math: a typical sentence is 20-50 tokens, and one page of text ≈ 300 tokens. GPT-4 Turbo supports a 128K-token context window.
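To check these numbers yourself, you can count tokens with OpenAI's tiktoken library (a quick sketch; exact counts depend on the model's tokenizer):

```python
# pip install tiktoken
import tiktoken

# Tokenizer used by GPT-4-family models
enc = tiktoken.encoding_for_model("gpt-4")

text = "Generative AI has transformed how we build software."
tokens = enc.encode(text)
print(f"{len(text)} characters -> {len(tokens)} tokens")  # roughly 4 chars per token
```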
LLM Deployment Options
Cloud APIs (OpenAI, Anthropic, Google)
Pros: Easy, powerful, always up-to-date models
Cons: Latency, cost, data privacy concerns
Self-Hosted (Llama, Mistral, etc.)
Pros: Full control, privacy, cost-effective at scale
Cons: Infrastructure complexity, GPU costs
Deployment Frameworks
| Framework | Use Case |
|---|---|
| vLLM | High-throughput serving, PagedAttention |
| TensorRT-LLM | NVIDIA-optimized, highest performance |
| Ollama | Local development, easy setup |
| LM Studio | Desktop inference, experimentation |
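For a taste of self-hosting, vLLM's offline inference API is only a few lines. A minimal sketch, assuming a GPU is available (the model name here is just an example):

```python
# pip install vllm  (requires a GPU and downloads the model weights)
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a context window is."], params)
print(outputs[0].outputs[0].text)
```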
Retrieval-Augmented Generation (RAG)
RAG combines LLMs with your data. The model can "look up" information rather than relying solely on training data.
How RAG Works
1. Ingest: documents are split into chunks, each chunk is embedded, and the vectors are stored in a vector database.
2. Retrieve: the user's query is embedded and the most similar chunks are fetched.
3. Augment: the retrieved chunks are inserted into the prompt as context.
4. Generate: the LLM answers grounded in that context.
RAG Architecture
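In code, the pipeline above collapses to a few calls. A minimal sketch using Chroma as the vector store with its default embedding model; `llm` stands in for whatever model call you use:

```python
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for disk storage
collection = client.create_collection("docs")

# 1. Ingest: embed and store document chunks
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our API rate limit is 100 requests per minute.",
        "Refunds are processed within 5 business days.",
    ],
)

# 2. Retrieve: embed the query and fetch the most similar chunks
question = "How fast are refunds?"
results = collection.query(query_texts=[question], n_results=1)
context = "\n".join(results["documents"][0])

# 3-4. Augment and generate (`llm` is a placeholder for your model call)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = llm(prompt)
```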
Chunking Strategies
| Strategy | Description | Best For |
|---|---|---|
| Fixed Size | Split by character/word count | General purpose |
| Semantic | Split by paragraphs/sections | Long documents |
| Recursive | Split by a hierarchy of separators (sections, then paragraphs, then sentences) | Complex docs |
| Agentic | Use LLM to determine chunks | High-quality needs |
Chunk size matters: Too small = missing context. Too large = diluted relevance. 512 tokens is a common starting point; tune based on your use case.
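A minimal fixed-size chunker with overlap (a sketch that measures size in words for simplicity; production systems usually count tokens):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size chunking with overlap so context isn't lost at boundaries.
    Assumes overlap < chunk_size."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```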
Vector Databases
Vector databases store embeddings and enable similarity search.
Popular Vector DBs
| Database | Strengths | Best For |
|---|---|---|
| Pinecone | Managed, easy, scalable | Production SaaS |
| Weaviate | Hybrid search (vector + keyword) | Combined search |
| Chroma | Simple, great for dev | Prototyping, local |
| Qdrant | High performance, Rust | Production self-hosted |
| pgvector | Postgres extension | Existing Postgres users |
| Milvus | Scalable, cloud-native | Large-scale deployments |
Vector Search Algorithm
Similarity is usually measured with cosine similarity or dot product over the embeddings. Comparing the query against every stored vector is exact but slow, so production databases use approximate nearest neighbor (ANN) indexes such as HNSW, trading a little recall for orders-of-magnitude speedups.
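For intuition, here is the exact brute-force version of vector search; the ANN index replaces this linear scan at scale:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Exact cosine-similarity search over an (n, d) matrix of embeddings."""
    q = query / np.linalg.norm(query)                            # normalize query
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True) # normalize rows
    scores = v @ q                                               # cosine similarity per row
    return np.argsort(scores)[::-1][:k]                          # indices of best matches
```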
Prompt Engineering
Basic Patterns
```python
# `llm` is a placeholder for your model call (API client or local model)

# Simple prompting
response = llm("What is Python?")

# Few-shot prompting: show the model examples of the task
response = llm("""
Translate to French:
English: Hello -> French: Bonjour
English: Goodbye -> French: Au revoir
English: Thank you -> French:
""")

# Chain-of-thought: ask the model to reason step by step
response = llm("""
Problem: If a train leaves at 2pm traveling 60mph, and another leaves at 3pm traveling 80mph, when do they meet?
Let's think through this step by step:
1. First train travels for t hours
2. Second train travels for t-1 hours
3. ...
""")
```
Structured Output
```python
# Force structured responses (LangChain-style: with_structured_output()
# binds a Pydantic schema so the reply is parsed into typed fields)
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    skills: list[str]

structured_llm = llm.with_structured_output(UserProfile)
user = structured_llm.invoke("John is 30 with Python and Java skills")
# user.name == "John", user.age == 30, user.skills == ["Python", "Java"]
```
Guardrails and Safety
Input/Output Validation
Validate both what enters the model and what it returns; the table below lists common techniques at each layer.
Common Guardrails
| Layer | Technique | Purpose |
|---|---|---|
| Input | PII detection | Remove personal data |
| Input | Topic classification | Detect off-topic requests |
| LLM | System prompts | Define behavior boundaries |
| Output | Toxicity detection | Block harmful content |
| Output | Fact verification | Reduce hallucinations |
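A minimal sketch of the input layer from the table above (the regex patterns are illustrative; real guardrails use dedicated PII and toxicity classifiers):

```python
import re

# Illustrative patterns only -- production systems detect far more than this
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Input layer: strip detected PII before the text reaches the LLM."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(redact_pii("My email is jane@example.com and my SSN is 123-45-6789"))
# -> My email is [REDACTED_EMAIL] and my SSN is [REDACTED_SSN]
```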
Production Considerations
Latency Optimization
| Technique | Impact |
|---|---|
| Streaming responses | Lower perceived latency: users see the first tokens sooner (optimizes time-to-first-token, not total time) |
| Caching (semantic) | Skip the LLM for semantically repeated queries (sketched below) |
| Smaller models for simple tasks | Faster + cheaper |
| Speculative decoding | A small draft model proposes tokens; the large model verifies them |
| Quantization | Less memory, faster inference |
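Semantic caching in miniature (a sketch with a placeholder `embed` function; a production cache would use a vector index rather than a linear scan):

```python
import numpy as np

class SemanticCache:
    """Reuse an earlier answer when a new query's embedding is close enough
    to a cached one. `embed` is a placeholder for your embedding model."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str) -> str | None:
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(vec @ q) >= self.threshold:  # cosine-similarity hit
                return answer
        return None  # cache miss: call the LLM, then put() the result

    def put(self, query: str, answer: str) -> None:
        q = self.embed(query)
        self.entries.append((q / np.linalg.norm(q), answer))
```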
Cost Management
The main levers: route simple tasks to smaller models, cache repeated queries, cap maximum output tokens, and batch offline work. Self-hosting replaces per-token API fees with GPU and operations costs, which vary heavily by hardware.
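API pricing is usually quoted per million input and output tokens, so cost estimates are simple arithmetic (the rates below are hypothetical; check your provider's current pricing):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost of one request, given per-million-token prices in dollars."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# e.g. a 2,000-token RAG prompt and a 500-token answer at illustrative rates
print(request_cost(2_000, 500, price_in=3.00, price_out=15.00))  # 0.0135 -> ~1.4 cents
```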
Multi-Modal Architectures
Modern models increasingly accept images, audio, and video alongside text. The same system-design patterns apply: modality-specific encoders produce embeddings, retrieval grounds responses, and guardrails validate inputs and outputs.
What to Remember for Interviews
- RAG is essential: Ground LLM responses in your data to reduce hallucinations
- Choose the right model: Not every task needs GPT-4; smaller models are faster and cheaper
- Vector databases: Enable semantic search over your documents
- Guardrails: Production systems need input/output validation and monitoring
- Trade-offs: Latency, cost, quality, and privacy all pull in different directions
Practice: Design a RAG system for a customer support chatbot. How would you ingest documentation? How would you retrieve relevant answers? How would you handle questions outside the knowledge base?