Foundations

Basic System Concepts: Latency, Throughput, Scaling, Load Balancing, and Caching

Understand the foundational metrics and patterns of distributed systems: latency numbers, vertical vs horizontal scaling, load balancing algorithms, and caching strategies.

13 min read · Tags: latency, throughput, scaling, load balancing, caching, foundations

The Building Blocks of System Design

Every system design decision ultimately comes down to trade-offs between a few fundamental concepts. Understanding these deeply is what separates a good engineer from a great one.

Let's start with the numbers that every system designer should have memorized.


Latency Numbers Everyone Should Know

These approximate numbers (from Peter Norvig's famous slide and Jeff Dean's talk) give you an intuition for the relative cost of different operations:

| Operation | Latency | Human Comparison |
|---|---|---|
| L1 cache access | 0.5 ns | |
| L2 cache access | 7 ns | |
| L3 cache access | 20 ns | |
| Main memory access (RAM) | 100 ns | |
| SSD read | 150 µs | 1,500× slower than RAM |
| Network: same data center | 500 µs | 5,000× slower than RAM |
| TCP round trip (same data center) | 1 ms | |
| Disk seek (HDD) | 10 ms | ~67× slower than an SSD read |
| Network: US to Europe (one way) | 80 ms | |
| TCP round trip (US to Europe) | 150 ms | 1,500,000× slower than RAM |

Mental model: Accessing RAM is like picking up a piece of paper from your desk. Accessing a remote server on another continent is like ordering a book from a library in a different country. This is why caching is so powerful — it brings data closer to where it's needed.
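
To make the orders of magnitude concrete, here is a minimal Python sketch using the approximate numbers from the table above; it prints how many times slower each operation is than a RAM access:

```python
# Approximate latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "Main memory (RAM)": 100,
    "SSD read": 150_000,                  # 150 µs
    "Same-DC network hop": 500_000,       # 500 µs
    "HDD disk seek": 10_000_000,          # 10 ms
    "US-Europe round trip": 150_000_000,  # 150 ms
}

RAM_NS = LATENCY_NS["Main memory (RAM)"]

for name, ns in LATENCY_NS.items():
    print(f"{name:>21}: {ns:>13,.0f} ns  ({ns / RAM_NS:>11,.0f}x RAM)")
```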

Latency vs Throughput

These two terms are often confused:

  • Latency — How long it takes to do one operation (measured in time: ms, µs, ns)
  • Throughput — How many operations you can do per unit of time (measured in ops/sec, MB/s, req/s)

┌────────────────────────────────────────────┐
│              Highway Analogy               │
├────────────────────────────────────────────┤
│                                            │
│  Latency = Time for one car to go          │
│            from A to B                     │
│            ───────🚗──────────>            │
│                                            │
│  Throughput = Number of cars passing       │
│               a point per hour             │
│               🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗→        │
│                                            │
│  You can have low latency + low throughput │
│  (one-lane highway, no traffic)            │
│                                            │
│  Or high latency + high throughput         │
│  (10-lane highway, but A to B is far)      │
└────────────────────────────────────────────┘
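
One way to connect the two numbers is Little's law (throughput ≈ concurrency / latency), which isn't stated above but is the standard way to relate them. A quick Python sketch:

```python
# Little's law: throughput ≈ concurrency / latency.
# One worker that takes 10 ms per request tops out at 100 req/s;
# adding concurrent workers raises throughput, while each request
# still takes 10 ms (its latency is unchanged).

latency_s = 0.010  # 10 ms per operation

for workers in (1, 10, 100):
    throughput = workers / latency_s
    print(f"{workers:>3} workers -> {throughput:>7,.0f} req/s at {latency_s * 1000:.0f} ms latency")
```
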
💡 Real-world example: Google's Borg/Omega/Kubernetes paper discusses this extensively. Batch processing systems (like MapReduce) optimize for throughput. Real-time serving systems (like Google Search) optimize for latency.


Vertical vs Horizontal Scaling

When your system can't handle the load, you have two options:

Vertical Scaling (Scale Up)

Buy a bigger machine. More CPU, more RAM, more disk.

Single Server
┌──────────────────────────┐
│                          │
│  CPU: 4 cores → 64 cores │
│  RAM: 16 GB → 512 GB     │
│  Disk: 500 GB → 10 TB    │
│                          │
└──────────────────────────┘

Pros:

  • Simple — no code changes needed
  • No distributed systems complexity
  • Works for databases that don't shard easily (PostgreSQL scales very well vertically — see Citus Data's research)

Cons:

  • Hardware has a ceiling (you can't buy infinite RAM)
  • Expensive — the cost curve is non-linear
  • Single point of failure
  • Downtime required for upgrades

Horizontal Scaling (Scale Out)

Add more machines. Distribute the load across them.

Pros:

  • Theoretically infinite scaling
  • Commodity hardware = cheaper
  • Fault-tolerant (one node fails, others take over)

Cons:

  • Distributed systems complexity (consistency, coordination, debugging)
  • Requires architectural changes (statelessness, data partitioning)
  • More operational overhead

The Decision Matrix

| Factor | Scale Up | Scale Out |
|---|---|---|
| Cost | High (premium hardware) | Lower (commodity machines) |
| Complexity | Low | High |
| Fault tolerance | None | Built-in |
| Max capacity | Hardware limit | Theoretically infinite |
| Best for | Monoliths, databases | Microservices, web servers |
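
To make the scale-out column concrete, a back-of-envelope estimate helps. The peak traffic and per-server throughput below are invented purely for illustration:

```python
import math

# Hypothetical numbers, purely for illustration.
peak_rps = 50_000        # expected peak requests per second
rps_per_server = 4_000   # measured capacity of one commodity server

servers = math.ceil(peak_rps / rps_per_server)
print(f"{servers} servers cover peak load; provision {servers + 1} for N+1 redundancy")
```
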
⚠️ Common mistake: Many teams jump to horizontal scaling too early. If 2-3 vertical upgrades solve your problem, do that first. Horizontal scaling introduces complexity you may not need. As Instagram's engineering blog documents, they ran a single PostgreSQL instance for years before sharding.


Load Balancing

Once you have multiple servers, you need a way to distribute traffic across them. That's what a load balancer does.

What Is a Load Balancer?

A load balancer sits between clients and your servers, distributing incoming requests across a pool of backend machines.

Layer 4 vs Layer 7 Load Balancing

| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| What it sees | IP + port | HTTP headers, URL, cookies |
| Decision based on | IP address, port, protocol | URL path, headers, cookies, content |
| Speed | Faster (less processing) | Slower (must parse HTTP) |
| Intelligence | Dumb routing | Smart routing (e.g., /api → service A, /static → CDN) |
| Examples | HAProxy (TCP mode), NGINX (stream) | NGINX (HTTP mode), Envoy, ALB |
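
A toy Python sketch of the difference (the backend and service names are made up): an L4 balancer can only use connection metadata, while an L7 balancer can inspect the parsed HTTP request.

```python
# Layer 4: only connection metadata (IP, port, protocol) is visible.
def l4_route(client_ip: str, client_port: int, backends: list) -> str:
    return backends[hash((client_ip, client_port)) % len(backends)]

# Layer 7: the parsed HTTP request (path, headers, cookies) is visible.
def l7_route(path: str, headers: dict) -> str:
    if path.startswith("/api/"):
        return "api-service"
    if path.startswith("/static/"):
        return "cdn-origin"
    if headers.get("X-Canary") == "true":
        return "canary-service"
    return "web-service"

print(l4_route("203.0.113.7", 54321, ["backend-1", "backend-2"]))
print(l7_route("/api/users/42", {"X-Canary": "false"}))
```
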
💡 Deep dive: Netflix's Zuul is a Layer 7 load balancer / API gateway that routes millions of requests per second. Their engineering blog has extensive write-ups on how they manage traffic distribution. Cloudflare's blog also has excellent articles on load balancing at scale.

Load Balancing Algorithms

| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Distribute requests evenly in order | Homogeneous servers, simple setups |
| Weighted Round Robin | Assign weights based on server capacity | Mixed hardware (some servers are bigger) |
| Least Connections | Send to the server with the fewest active connections | Long-lived connections (WebSockets, APIs) |
| IP Hash | Route based on client IP; same client → same server | Session affinity without sticky cookies |
| Consistent Hashing | Hash the key to a position on a ring | Distributed caches, databases (minimizes reshuffling when nodes change) |
| Random | Pick a random server | Simple, surprisingly effective at scale |
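
A minimal sketch of two of the algorithms above, round robin and least connections (the server names are placeholders):

```python
import itertools
from collections import defaultdict

SERVERS = ["server-a", "server-b", "server-c"]

# Round robin: hand out servers in a fixed rotating order.
_cycle = itertools.cycle(SERVERS)
def round_robin() -> str:
    return next(_cycle)

# Least connections: pick the server with the fewest active connections.
active = defaultdict(int)
def least_connections() -> str:
    server = min(SERVERS, key=lambda s: active[s])
    active[server] += 1   # caller should decrement when the connection closes
    return server

print([round_robin() for _ in range(4)])        # a, b, c, a
print([least_connections() for _ in range(4)])  # spreads load evenly here
```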

Consistent Hashing

Consistent hashing is one of the most important algorithms in distributed systems. It maps both servers and data to positions on a hash ring.

For example, on a ring with nodes A, B, and C: when Node A is added, only the keys that fall between Node C and Node A move to A; when Node B is removed, its keys transfer to the next node clockwise.

Unlike simple modulo hashing (key % N), where changing N remaps almost every key, consistent hashing ensures that adding or removing a server redistributes only roughly 1/N of the keys.
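
A simplified consistent-hash ring in Python to illustrate the mechanics; production implementations (including the one described in the Dynamo paper) also place multiple virtual nodes per server to even out key distribution:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node sits at a fixed position on the ring, determined by its hash.
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after hash(key), wrapping around.
        i = bisect.bisect_left(self._positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing a node only remaps keys in the arc next to that node.
```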

Real-world example: Amazon DynamoDB uses consistent hashing to partition data; the original Dynamo paper from AWS describes the technique in detail and is essential reading. Discord's write-up on their consistent hashing implementation is a practical guide.

Load Balancers in Practice

| Product | Type | Notable Users |
|---|---|---|
| NGINX | L7 reverse proxy / LB | Netflix, WordPress, GitHub |
| HAProxy | L4/L7 LB | Twitter, Reddit, GitHub |
| Envoy | L7 proxy (service mesh) | Lyft, Airbnb, Stripe |
| AWS ALB | Managed L7 LB | Countless AWS workloads |
| AWS NLB | Managed L4 LB | High-throughput workloads |

💡 Further reading: NGINX's official load balancing guide and the Envoy proxy architecture documentation are excellent resources for understanding modern load balancing.


Caching Basics

Caching is the most effective technique for improving read performance. The core idea: store expensive-to-compute or expensive-to-fetch results in fast storage, and serve them from there on subsequent requests.

Why Cache?

If a database query takes 10 ms but a lookup in an in-memory cache (Redis/Memcached) takes 0.5 ms, caching gives you a 20× speedup on every read that hits the cache.

Where to Cache

| Layer | What to Cache | Technology | TTL |
|---|---|---|---|
| Browser | Static assets, API responses | HTTP headers (Cache-Control) | Hours to days |
| CDN | Images, videos, static pages | Cloudflare, CloudFront, Fastly | Hours to weeks |
| Application | Computed results, session data | Redis, Memcached | Seconds to minutes |
| Database | Query results | PostgreSQL shared buffers, query cache | Managed by DB |
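
For the browser and CDN rows, caching behavior is driven by HTTP headers. A small sketch of plausible Cache-Control values; the paths and max-age numbers are illustrative, not recommendations:

```python
# Illustrative Cache-Control headers per content type.
CACHE_CONTROL = {
    # Fingerprinted static assets can be cached "forever" by browsers and CDNs.
    "/static/app.9f3b2c.js": "public, max-age=31536000, immutable",
    # HTML should be revalidated so new deploys show up quickly.
    "/index.html": "public, max-age=0, must-revalidate",
    # Per-user API responses must not be stored in shared caches (CDNs).
    "/api/me": "private, max-age=30",
}

for path, value in CACHE_CONTROL.items():
    print(f"{path:<25} Cache-Control: {value}")
```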

Case study: Cloudflare's caching infrastructure is one of the best examples of edge caching. Their blog posts on cache hit ratios and cache everything mode are essential reading for understanding CDN caching at scale.

Cache Invalidation Strategies

"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton

| Strategy | How It Works | When to Use |
|---|---|---|
| TTL (Time To Live) | Data expires after a set duration | Most common; good when staleness is acceptable |
| LRU (Least Recently Used) | Evict the least recently accessed items when the cache is full | Memory-constrained caches (Redis maxmemory-policy allkeys-lru) |
| Write-Through | Write to cache AND database simultaneously | Read-heavy data that must stay fresh |
| Write-Behind (Write-Back) | Write to cache first, asynchronously flush to database | Write-heavy systems where eventual consistency is fine |
| Cache Aside (Lazy Loading) | Application checks cache first; on miss, loads from DB and caches | Most common pattern; simple and effective |
| Invalidation on Write | Delete or update the cache entry when data changes | When you need strong consistency for specific keys |
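
As a quick illustration of the LRU row, here is a toy in-process LRU cache; Redis implements (approximate) LRU eviction natively via maxmemory-policy allkeys-lru:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```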

Cache Aside Pattern (Most Common)

Read Request:
  1. Check cache for key
  2. If HIT → return cached value
  3. If MISS → query database
  4. Store result in cache (with TTL)
  5. Return value

Write Request:
  1. Update database
  2. Delete cache entry (NOT update it — delete!)
  3. Next read will be a cache miss and repopulate

⚠️ Critical pitfall: When updating data, delete the cache entry, don't update it. Why? Because concurrent writes can create a race condition where the cache is set to a stale value. Deleting ensures the next read gets the fresh value from the database. See Martin Fowler's article on caching patterns for the full explanation.
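
A minimal cache-aside sketch, assuming the redis-py client and a local Redis instance; the USERS_DB dict stands in for a real database:

```python
import json
import redis  # assumes the redis-py client and a local Redis server

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300

# Stand-in for a real database table, just for this sketch.
USERS_DB = {42: {"id": 42, "name": "Ada"}}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                            # cache HIT -> return it
        return json.loads(cached)
    user = USERS_DB[user_id]                          # cache MISS -> query the "database"
    cache.set(key, json.dumps(user), ex=TTL_SECONDS)  # store with a TTL
    return user

def update_user(user_id: int, fields: dict) -> None:
    USERS_DB[user_id].update(fields)                  # 1. update the database
    cache.delete(f"user:{user_id}")                   # 2. delete (not update!) the cache entry
```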


What to Remember for Interviews

  1. Latency numbers — Memorize the orders of magnitude: RAM (100ns) < SSD (150µs) < same-DC network (500µs) < cross-DC (80ms)
  2. Latency vs Throughput — Latency is time per operation; throughput is operations per time
  3. Scale up vs scale out — Start vertical, go horizontal when forced
  4. Load balancing — L4 (fast, dumb) vs L7 (slower, smart); know consistent hashing
  5. Caching — Know the cache aside pattern, TTL vs LRU, and the golden rule: delete on write, don't update