Basic System Concepts: Latency, Throughput, Scaling, Load Balancing, and Caching
Understand the foundational metrics and patterns of distributed systems: latency numbers, vertical vs horizontal scaling, load balancing algorithms, and caching strategies.
The Building Blocks of System Design
Every system design decision ultimately comes down to trade-offs between a few fundamental concepts. Understanding these deeply is what separates a good engineer from a great one.
Let's start with the numbers that every system designer should have memorized.
Latency Numbers Everyone Should Know
These approximate numbers (from Peter Norvig's famous slide and Jeff Dean's talk) give you an intuition for the relative cost of different operations:
| Operation | Latency | Human Comparison |
|---|---|---|
| L1 cache access | 0.5 ns | — |
| L2 cache access | 7 ns | — |
| L3 cache access | 20 ns | — |
| Main memory access (RAM) | 100 ns | — |
| SSD read | 150 µs | 1,500× slower than RAM |
| Network: same data center | 500 µs | 5,000× slower than RAM |
| TCP round trip (same data center) | 1 ms | — |
| Disk seek (HDD) | 10 ms | ~70× slower than SSD |
| Network: US to Europe (one way) | 80 ms | — |
| TCP round trip (US to Europe) | 150 ms | 1,500,000× slower than RAM |
Mental model: Accessing RAM is like picking up a piece of paper from your desk. Accessing a remote server in another continent is like ordering a book from a library in a different country. This is why caching is so powerful — it brings data closer to where it's needed.
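To make the ratios concrete, here is a quick back-of-envelope sketch in Python. The constants are the approximate figures from the table above; the 20-lookup page view is a hypothetical scenario.

```python
# Approximate latencies from the table above, in nanoseconds.
RAM_NS = 100          # main memory access
SSD_NS = 150_000      # SSD read (150 µs)
SAME_DC_NS = 500_000  # same-data-center network hop (500 µs)

# A hypothetical page view that needs 20 sequential lookups:
print(f"20 same-DC network calls: {20 * SAME_DC_NS / 1e6:.1f} ms")  # 10.0 ms
print(f"20 in-memory lookups:     {20 * RAM_NS / 1e6:.3f} ms")      # 0.002 ms
print(f"SSD is {SSD_NS // RAM_NS}x slower than RAM")                # 1500x
```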
Latency vs Throughput
These two terms are often confused:
- Latency — How long it takes to do one operation (measured in time: ms, µs, ns)
- Throughput — How many operations you can do per unit of time (measured in ops/sec, MB/s, req/s)
┌────────────────────────────────────────────┐
│ Highway Analogy │
├────────────────────────────────────────────┤
│ │
│ Latency = Time for one car to go │
│ from A to B │
│ ───────🚗──────────> │
│ │
│ Throughput = Number of cars passing │
│ a point per hour │
│ 🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗→ │
│ │
│ You can have low latency + low throughput │
│ (one-lane highway, no traffic) │
│ │
│ Or high latency + high throughput │
│ (10-lane highway, but A to B is far) │
└────────────────────────────────────────────┘
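The two metrics are linked by Little's Law: average concurrency = throughput × average latency. A minimal sketch, using made-up worker counts and latencies:

```python
# Little's Law: concurrency (L) = throughput (lambda) * latency (W).
# Rearranged: the throughput a fixed worker pool can sustain is workers / latency.

def max_throughput(workers: int, latency_s: float) -> float:
    """Requests/sec a pool can serve if each request occupies one worker for latency_s."""
    return workers / latency_s

print(max_throughput(100, 0.050))  # 100 workers, 50 ms/request -> 2000.0 req/s
print(max_throughput(100, 0.025))  # halve the latency, double the throughput -> 4000.0
```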
Real-world example: Google's "Borg, Omega, and Kubernetes" paper discusses this trade-off in the context of running mixed workloads on shared clusters. Batch processing systems (like MapReduce) optimize for throughput; real-time serving systems (like Google Search) optimize for latency.
Vertical vs Horizontal Scaling
When your system can't handle the load, you have two options:
Vertical Scaling (Scale Up)
Buy a bigger machine. More CPU, more RAM, more disk.
Single Server
┌──────────────────────────┐
│ │
│ CPU: 4 cores → 64 cores │
│ RAM: 16 GB → 512 GB │
│ Disk: 500 GB → 10 TB │
│ │
└──────────────────────────┘
Pros:
- Simple — no code changes needed
- No distributed systems complexity
- Works for databases that don't shard easily (PostgreSQL scales very well vertically — see Citus Data's research)
Cons:
- Hardware has a ceiling (you can't buy infinite RAM)
- Expensive — the cost curve is non-linear
- Single point of failure
- Downtime required for upgrades
Horizontal Scaling (Scale Out)
Add more machines. Distribute the load across them.
Pros:
- Theoretically infinite scaling
- Commodity hardware = cheaper
- Fault-tolerant (one node fails, others take over)
Cons:
- Distributed systems complexity (consistency, coordination, debugging)
- Requires architectural changes (statelessness, data partitioning)
- More operational overhead
The Decision Matrix
| Factor | Scale Up | Scale Out |
|---|---|---|
| Cost | High (premium hardware) | Lower (commodity machines) |
| Complexity | Low | High |
| Fault tolerance | None | Built-in |
| Max capacity | Hardware limit | Theoretically infinite |
| Best for | Monoliths, databases | Microservices, web servers |
Common mistake: Many teams jump to horizontal scaling too early. If 2-3 vertical upgrades solve your problem, do that first. Horizontal scaling introduces complexity you may not need. As Instagram's engineering blog documents, they ran a single PostgreSQL instance for years before sharding.
Real-World Scaling Stories
- Instagram: Ran a single PostgreSQL instance for years before moving to sharding. They proved that vertical scaling with a well-tuned database goes much further than most people think.
- Twitter (X): Migrated from monolithic Ruby on Rails to JVM-based services as horizontal scaling became necessary for their real-time feed.
- Slack: Grew from a single MySQL instance to a sharded architecture as message volume exploded — a classic "start vertical, go horizontal when forced" story.
Load Balancing
Once you have multiple servers, you need a way to distribute traffic across them. That's what a load balancer does.
What Is a Load Balancer?
A load balancer sits between clients and your servers, distributing incoming requests across a pool of backend machines.
Layer 4 vs Layer 7 Load Balancing
| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| What it sees | IP + port | HTTP headers, URL, cookies |
| Decision based on | IP address, port, protocol | URL path, headers, cookies, content |
| Speed | Faster (less processing) | Slower (must parse HTTP) |
| Intelligence | Dumb routing | Smart routing (e.g., /api → service A, /static → CDN) |
| Examples | HAProxy (TCP mode), Nginx (stream) | Nginx (HTTP mode), Envoy, ALB |
Deep dive: Netflix's Zuul is a Layer 7 load balancer / API gateway that routes millions of requests per second. Their engineering blog has extensive write-ups on how they manage traffic distribution. Cloudflare's blog also has excellent articles on load balancing at scale.
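To see what "smart routing" means in practice, here is a toy sketch of L7 path-based routing in Python. The pool names, ports, and paths are hypothetical, chosen only for illustration:

```python
# Toy L7 routing: choose a backend pool by URL path prefix.
# An L4 balancer never sees the path; it only sees IP, port, and protocol.
ROUTES = {
    "/api/":    ["api-1:8080", "api-2:8080"],  # hypothetical service A pool
    "/static/": ["edge-cache:443"],            # hypothetical static/CDN pool
}
DEFAULT_POOL = ["web-1:8080", "web-2:8080"]

def pick_pool(path: str) -> list[str]:
    for prefix, pool in ROUTES.items():
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

print(pick_pool("/api/users"))   # ['api-1:8080', 'api-2:8080']
print(pick_pool("/index.html"))  # ['web-1:8080', 'web-2:8080']
```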
Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Distribute requests evenly in order | Homogeneous servers, simple setups |
| Weighted Round Robin | Assign weights based on server capacity | Mixed hardware (some servers are bigger) |
| Least Connections | Send to the server with fewest active connections | Long-lived connections (WebSockets, APIs) |
| IP Hash | Route based on client IP — same client → same server | Session affinity without sticky cookies |
| Consistent Hashing | Hash the key to a position on a ring | Distributed caches, databases (minimizes reshuffling when nodes change) |
| Random | Pick a random server | Simple, surprisingly effective at scale |
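As a minimal illustration of the first few rows, here are round robin and least connections in a few lines of Python (the server names and connection counts are made up):

```python
import itertools

servers = ["s1", "s2", "s3"]

# Round robin: hand out servers in a fixed rotation.
rr = itertools.cycle(servers)
print([next(rr) for _ in range(5)])  # ['s1', 's2', 's3', 's1', 's2']

# Least connections: pick the server with the fewest active connections.
active = {"s1": 12, "s2": 3, "s3": 7}

def least_connections() -> str:
    return min(active, key=active.get)

target = least_connections()  # 's2'
active[target] += 1           # the new connection counts against it
```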
Consistent Hashing
Consistent hashing is one of the most important algorithms in distributed systems. It maps both servers and data to positions on a hash ring.
When a node is added, it takes over only the keys that fall between its ring position and the previous node's position; when a node is removed, its keys transfer to the next node clockwise.
Unlike simple modulo hashing (key % N), where changing N remaps almost every key, consistent hashing ensures that adding or removing a server redistributes only about 1/N of the keys.
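A minimal hash ring with virtual nodes might look like this in Python. This is a sketch of the general technique, not Dynamo's or Discord's actual implementation; real systems add replication, weighting, and node membership changes.

```python
import bisect
import hashlib

class HashRing:
    """Sketch of a consistent hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each physical node gets `vnodes` ring positions to smooth the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node on the ring."""
        i = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.get("user:42"))  # the same key always maps to the same node
```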
Real-world example: Amazon's Dynamo (the design ancestor of DynamoDB) is built on consistent hashing; the original Dynamo paper from AWS describes the technique in detail and is essential reading. Discord's write-up on their consistent hashing implementation is a practical complement.
Load Balancers in Practice
| Product | Type | Notable Users |
|---|---|---|
| NGINX | L7 reverse proxy / LB | Netflix, WordPress, GitHub |
| HAProxy | L4/L7 LB | Twitter, Reddit, GitHub |
| Envoy | L7 proxy (service mesh) | Lyft, Airbnb, Stripe |
| AWS ALB | Managed L7 LB | Countless AWS workloads |
| AWS NLB | Managed L4 LB | High-throughput workloads |
Further reading: NGINX's official load balancing guide and the Envoy proxy architecture documentation are excellent resources for understanding modern load balancing.
Caching Basics
Caching is the most effective technique for improving read performance. The core idea: store expensive-to-compute or expensive-to-fetch results in fast storage, and serve them from there on subsequent requests.
Why Cache?
If latency to your database is 10 ms but latency to an in-memory cache (Redis/Memcached) is 0.5 ms, every cache hit is a 20× speedup for that read.
Where to Cache
| Layer | What to Cache | Technology | TTL |
|---|---|---|---|
| Browser | Static assets, API responses | HTTP headers (Cache-Control) | Hours to days |
| CDN | Images, videos, static pages | Cloudflare, CloudFront, Fastly | Hours to weeks |
| Application | Computed results, session data | Redis, Memcached | Seconds to minutes |
| Database | Query results | PostgreSQL shared buffers, query cache | Managed by DB |
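At the browser and CDN layers, caching behavior is driven by HTTP response headers. A few typical Cache-Control values; the durations here are illustrative choices, not rules:

```python
# Illustrative Cache-Control headers per asset type.
CACHE_CONTROL = {
    "fingerprinted_asset": "public, max-age=31536000, immutable",  # 1 year: the URL changes on deploy
    "api_response":        "private, max-age=60",                  # per-user, fresh for 1 minute
    "html_page":           "no-cache",                             # cache, but revalidate on every use
}
```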
Case study: Cloudflare's caching infrastructure is one of the best examples of edge caching. Their blog posts on cache hit ratios and cache everything mode are essential reading for understanding CDN caching at scale.
Cache Invalidation and Eviction Strategies
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton
| Strategy | How It Works | When to Use |
|---|---|---|
| TTL (Time To Live) | Data expires after a set duration | Most common; good when staleness is acceptable |
| LRU (Least Recently Used) | Evict the least recently accessed items when cache is full | Memory-constrained caches (Redis maxmemory-policy allkeys-lru) |
| Write-Through | Write to cache AND database simultaneously | Read-heavy data that must stay fresh |
| Write-Behind (Write-Back) | Write to cache first, asynchronously flush to database | Write-heavy systems where eventual consistency is fine |
| Cache Aside (Lazy Loading) | Application checks cache first; on miss, loads from DB and caches | Most common pattern; simple and effective |
| Invalidation on Write | Delete or update the cache entry when data changes | When you need strong consistency for specific keys |
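Note that LRU in the table above is an eviction policy (what to drop when full) rather than an invalidation rule. A minimal exact-LRU sketch in Python; Redis's allkeys-lru approximates this behavior by sampling rather than tracking exact order:

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU eviction using an ordered dict (most recently used at the end)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # touch: mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")         # "a" is now most recent, so "b" becomes the LRU entry
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```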
Cache Aside Pattern (Most Common)
Read Request:
1. Check cache for key
2. If HIT → return cached value
3. If MISS → query database
4. Store result in cache (with TTL)
5. Return value
Write Request:
1. Update database
2. Delete cache entry (NOT update it — delete!)
3. Next read will be a cache miss and repopulate
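A sketch of both paths, assuming a Redis-like client with get/setex/delete (as in redis-py); db.query_user and db.update_user are hypothetical stand-ins for your data layer:

```python
import json

CACHE_TTL_S = 300  # step 4 of the read path: repopulate with a 5-minute TTL

def read_user(cache, db, user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                           # HIT: serve from cache
        return json.loads(cached)
    user = db.query_user(user_id)                    # MISS: go to the database
    cache.setex(key, CACHE_TTL_S, json.dumps(user))  # store with TTL
    return user

def write_user(cache, db, user_id: int, fields: dict) -> None:
    db.update_user(user_id, fields)       # 1. update the database
    cache.delete(f"user:{user_id}")       # 2. delete the entry, do NOT set it
    # 3. the next read misses and repopulates with fresh data
```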
Critical pitfall: When updating data, delete the cache entry, don't update it. Why? Concurrent writes can race: if writes A and B reach the database in that order but their cache-set operations land as B then A, the cache is left holding stale value A until it expires. Deleting instead guarantees the next read repopulates from the database's current state. See Martin Fowler's article on caching patterns for the full explanation.
Real-World Caching Patterns
- GitHub: Uses a multi-layer caching strategy with Memcached and Redis. Their blog has deep dives on reducing database load through caching.
- Reddit: Relies heavily on Memcached for caching — their open-source architecture has influenced many caching patterns.
- Pinterest: Built a custom caching layer on top of Redis to handle their social graph. Their engineering blog has excellent posts on caching at scale.
What to Remember for Interviews
- Latency numbers — Memorize the orders of magnitude: RAM (100ns) < SSD (150µs) < same-DC network (500µs) < cross-DC (80ms)
- Latency vs Throughput — Latency is time per operation; throughput is operations per time
- Scale up vs scale out — Start vertical, go horizontal when forced
- Load balancing — L4 (fast, dumb) vs L7 (slower, smart); know consistent hashing
- Caching — Know the cache aside pattern, TTL vs LRU, and the golden rule: delete on write, don't update