
Generative AI Systems: Architecture, LLMs, RAG, and Production Considerations

Learn the architecture of generative AI systems, including LLM fundamentals, retrieval-augmented generation (RAG), prompt engineering, and deploying AI at scale.


The Rise of Generative AI in System Design

Generative AI has transformed how we build software. Understanding the architecture behind these systems is essential for any system designer working with AI-powered features.

Key insight: LLMs are powerful but have limitations (hallucinations, knowledge cutoff, latency). Production AI systems combine LLMs with retrieval, grounding, and guardrails to build reliable products.


How LLMs Work

Transformer Architecture

Large Language Models are based on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.).

Key Concepts

| Concept | Description |
| --- | --- |
| Token | Text broken into pieces (~4 chars each) |
| Embedding | Numeric vector representing token meaning |
| Attention | Mechanism to relate different positions |
| Context Window | Max tokens the model can "see" |
| Temperature | Controls randomness (0 = deterministic, 1 = creative) |
💡 Token math: A typical sentence is 20-50 tokens. GPT-4 Turbo supports a 128K-token context window. One page of text ≈ 300 tokens.
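
You can check token counts directly with a tokenizer, e.g. OpenAI's tiktoken (a quick sketch; the model name just selects the matching encoding):

```python
# Counting tokens with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Generative AI has transformed how we build software.")
print(len(tokens))             # number of tokens in the sentence
print(enc.decode(tokens[:3]))  # decoding round-trips back to text
```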


LLM Deployment Options

Cloud APIs (OpenAI, Anthropic, Google)

Pros: Easy, powerful, always up-to-date models
Cons: Latency, cost, data privacy concerns

Self-Hosted (Llama, Mistral, etc.)

Pros: Full control, privacy, cost-effective at scale
Cons: Infrastructure complexity, GPU costs

Deployment Frameworks

| Framework | Use Case |
| --- | --- |
| vLLM | High-throughput serving, PagedAttention |
| TensorRT-LLM | NVIDIA-optimized, highest performance |
| Ollama | Local development, easy setup |
| LM Studio | Desktop inference, experimentation |
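
To give a feel for the self-hosted path, here is a minimal vLLM offline-inference sketch. The model name is illustrative; substitute any Hugging Face model you have access to and a GPU to run it on:

```python
# Minimal vLLM offline inference (pip install vllm; requires a GPU).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # downloads weights from Hugging Face
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```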

Retrieval-Augmented Generation (RAG)

RAG combines LLMs with your data. The model can "look up" information rather than relying solely on training data.

How RAG Works

A typical RAG pipeline has two phases. At ingestion time, documents are chunked, embedded, and stored in a vector database. At query time, the user's question is embedded, the most similar chunks are retrieved, and those chunks are injected into the prompt so the model answers from them.
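
A minimal sketch of the query phase, with `embed`, `vector_search`, and `llm` as hypothetical stand-ins for your embedding model, vector store, and LLM client:

```python
# Query phase of RAG. Every function here is a hypothetical stand-in:
#   embed(text)          -> embedding vector for the text
#   vector_search(v, k)  -> top-k most similar chunks from the vector store
#   llm(prompt)          -> completion from your model of choice

def answer(question: str, k: int = 4) -> str:
    query_vector = embed(question)
    chunks = vector_search(query_vector, k)  # retrieve grounding context
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```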

Chunking Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Fixed Size | Split by character/word count | General purpose |
| Semantic | Split by paragraphs/sections | Long documents |
| Recursive | Split by a hierarchy of separators until chunks are small enough | Complex docs |
| Agentic | Use an LLM to determine chunk boundaries | High-quality needs |

Chunk size matters: Too small = missing context. Too large = diluted relevance. 512 tokens is a common starting point; tune based on your use case.
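
A fixed-size chunker with overlap is only a few lines; this sketch counts characters for simplicity, though production chunkers usually count tokens:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping boundaries."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Overlap keeps a sentence that straddles a boundary visible in both
# neighboring chunks, improving retrieval at the cost of some duplication.
```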


Vector Databases

Vector databases store embeddings and enable similarity search.

| Database | Strengths | Best For |
| --- | --- | --- |
| Pinecone | Managed, easy, scalable | Production SaaS |
| Weaviate | Hybrid search (vector + keyword) | Combined search |
| Chroma | Simple, great for dev | Prototyping, local |
| Qdrant | High performance, Rust | Production self-hosted |
| pgvector | Postgres extension | Existing Postgres users |
| Milvus | Scalable, cloud-native | Large-scale deployments |
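
As a concrete example, Chroma keeps the store-and-query loop small (a sketch; the documents and IDs are illustrative, and Chroma embeds text with a default model unless you supply your own embedding function):

```python
# Prototype-scale vector search with Chroma (pip install chromadb).
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("support_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Refunds are processed within 5 business days.",
        "Password resets are available from the account settings page.",
    ],
)

results = collection.query(query_texts=["How long do refunds take?"], n_results=1)
print(results["documents"][0][0])  # -> the refund policy chunk
```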

Vector Search Algorithm
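
Under the hood, nearest-neighbor search is conceptually simple. Here is a brute-force cosine-similarity search in NumPy (a sketch; production systems use approximate indexes such as HNSW or IVF instead of scanning every vector):

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored vectors most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                       # one similarity score per stored vector
    return np.argsort(scores)[::-1][:k]  # highest scores first

# Brute force is O(n*d) per query; approximate indexes trade a little
# recall for orders-of-magnitude faster search at scale.
```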


Prompt Engineering

Basic Patterns

```python
# `llm` below is a placeholder for whatever completion call you use
# (an OpenAI/Anthropic client, a local model, etc.).

# Simple prompting
response = llm("What is Python?")

# Few-shot prompting: show the pattern, let the model complete it
response = llm("""
Translate to French:
English: Hello -> French: Bonjour
English: Goodbye -> French: Au revoir
English: Thank you -> French:
""")

# Chain-of-thought: ask the model to reason step by step before answering
response = llm("""
Problem: If a train leaves at 2pm traveling 60mph, and another leaves at 3pm traveling 80mph, when do they meet?

Let's think through this step by step:
1. First train travels for t hours
2. Second train travels for t-1 hours
3. ...
""")
```

Structured Output

Libraries such as LangChain can constrain the model to return JSON that matches a schema and parse it into a typed object:

```python
# LangChain-style structured output: the model is forced to return
# JSON matching the Pydantic schema, parsed into a typed object.
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    skills: list[str]

structured_llm = llm.with_structured_output(UserProfile)  # `llm` is a LangChain chat model
user = structured_llm.invoke("John is 30 with Python and Java skills")
# user.name == "John", user.age == 30, user.skills == ["Python", "Java"]
```

Guardrails and Safety

Input/Output Validation

Every request should be validated on the way into the model and every response checked on the way out; the table below lists common layers, and a pipeline sketch follows it.

Common Guardrails

| Layer | Technique | Purpose |
| --- | --- | --- |
| Input | PII detection | Remove personal data |
| Input | Topic classification | Detect off-topic requests |
| LLM | System prompts | Define behavior boundaries |
| Output | Toxicity detection | Block harmful content |
| Output | Fact verification | Reduce hallucinations |
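
Wiring these together is usually a simple pipeline. A sketch in which `redact_pii`, `is_on_topic`, `is_toxic`, and `llm` are hypothetical stand-ins for whatever classifiers and model client you use:

```python
# Hypothetical guardrail pipeline: each helper below stands in for a real
# classifier or service (e.g. a PII detector such as Presidio).
SYSTEM_PROMPT = "You are a support assistant. Only answer product questions."

def guarded_answer(question: str) -> str:
    cleaned = redact_pii(question)        # input layer: strip personal data
    if not is_on_topic(cleaned):          # input layer: topic classification
        return "Sorry, I can only help with questions about our product."

    answer = llm(SYSTEM_PROMPT + "\n\n" + cleaned)

    if is_toxic(answer):                  # output layer: block harmful content
        return "I can't help with that."
    return answer
```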

Production Considerations

Latency Optimization

| Technique | Impact |
| --- | --- |
| Streaming responses | Faster perceived latency (time-to-first-token matters more than total time) |
| Semantic caching | Skip the LLM for repeated or near-duplicate queries |
| Smaller models for simple tasks | Faster and cheaper |
| Speculative decoding | A smaller model drafts tokens; a bigger model verifies them |
| Quantization | Less memory, faster inference |
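
Streaming is usually the cheapest win. With the OpenAI Python SDK, for example (a sketch; other providers expose equivalent streaming options):

```python
# Streaming with the OpenAI Python SDK (pip install openai; needs OPENAI_API_KEY).
# Tokens print as they arrive, so the user experiences time-to-first-token
# rather than total generation time.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```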

Cost Management

LLM API pricing is typically per token, with input and output tokens billed separately, so prompt and response length drive cost directly. Self-hosted cost works differently: it depends on hardware and utilization rather than token counts, and varies widely. The latency techniques above double as cost levers, since caching, smaller models, and shorter prompts all reduce spend.

Multi-Modal Architectures

Multi-modal systems handle more than one input or output type, for example vision-language models that accept images alongside text prompts. The same production concerns (latency, cost, guardrails) apply to each modality.

What to Remember for Interviews

  1. RAG is essential: Ground LLM responses in your data to reduce hallucinations
  2. Choose the right model: Not every task needs GPT-4; smaller models are faster and cheaper
  3. Vector databases: Enable semantic search over your documents
  4. Guardrails: Production systems need input/output validation and monitoring
  5. Trade-offs: Latency, cost, quality, and privacy all pull in different directions

Practice: Design a RAG system for a customer support chatbot. How would you ingest documentation? How would you retrieve relevant answers? How would you handle questions outside the knowledge base?