Foundations

Basic System Concepts: Latency, Throughput, Scaling, Load Balancing, and Caching

Understand the foundational metrics and patterns of distributed systems: latency numbers, vertical vs horizontal scaling, load balancing algorithms, and caching strategies.

13 min read · Tags: latency, throughput, scaling, load balancing, caching, foundations

The Building Blocks of System Design

Every system design decision ultimately comes down to trade-offs between a few fundamental concepts. Understanding these deeply is what separates a good engineer from a great one.

Let's start with the numbers that every system designer should have memorized.


Latency Numbers Everyone Should Know

These approximate numbers (from Peter Norvig's famous slide and Jeff Dean's talk) give you an intuition for the relative cost of different operations:

| Operation | Latency | Human Comparison |
|---|---|---|
| L1 cache access | 0.5 ns | |
| L2 cache access | 7 ns | |
| L3 cache access | 20 ns | |
| Main memory access (RAM) | 100 ns | |
| SSD read | 150 µs | 1,500× slower than RAM |
| Network: same data center | 500 µs | 5,000× slower than RAM |
| TCP round trip (same data center) | 1 ms | |
| Disk seek (HDD) | 10 ms | ~67× slower than an SSD read |
| Network: US to Europe (one way) | 80 ms | |
| TCP round trip (US to Europe) | 150 ms | 1,500,000× slower than RAM |

Mental model: Accessing RAM is like picking up a piece of paper from your desk. Accessing a remote server on another continent is like ordering a book from a library in a different country. This is why caching is so powerful — it brings data closer to where it's needed.
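
To make the orders of magnitude concrete, here is a minimal Python sketch using the approximate numbers from the table above; it prints how many times slower each operation is than a RAM access:

```python
# Approximate latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "Main memory (RAM)": 100,
    "SSD read": 150_000,                  # 150 µs
    "Same-DC network hop": 500_000,       # 500 µs
    "HDD disk seek": 10_000_000,          # 10 ms
    "US-Europe round trip": 150_000_000,  # 150 ms
}

RAM_NS = LATENCY_NS["Main memory (RAM)"]

for name, ns in LATENCY_NS.items():
    print(f"{name:>21}: {ns:>13,.0f} ns  ({ns / RAM_NS:>11,.0f}x RAM)")
```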

Latency vs Throughput

These two terms are often confused:

  • Latency — How long it takes to do one operation (measured in time: ms, µs, ns)
  • Throughput — How many operations you can do per unit of time (measured in ops/sec, MB/s, req/s)

┌────────────────────────────────────────────┐
│              Highway Analogy               │
├────────────────────────────────────────────┤
│                                            │
│  Latency = Time for one car to go          │
│            from A to B                     │
│            ───────🚗──────────>            │
│                                            │
│  Throughput = Number of cars passing       │
│               a point per hour             │
│               🚗🚗🚗🚗🚗🚗🚗🚗🚗🚗→        │
│                                            │
│  You can have low latency + low throughput │
│  (one-lane highway, no traffic)            │
│                                            │
│  Or high latency + high throughput         │
│  (10-lane highway, but A to B is far)      │
└────────────────────────────────────────────┘
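
One way to connect the two numbers is Little's law (throughput ≈ concurrency / latency), which isn't stated above but is the standard way to relate them. A quick Python sketch:

```python
# Little's law: throughput ≈ concurrency / latency.
# One worker that takes 10 ms per request tops out at 100 req/s;
# adding concurrent workers raises throughput, while each request
# still takes 10 ms (its latency is unchanged).

latency_s = 0.010  # 10 ms per operation

for workers in (1, 10, 100):
    throughput = workers / latency_s
    print(f"{workers:>3} workers -> {throughput:>7,.0f} req/s at {latency_s * 1000:.0f} ms latency")
```
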
💡 Real-world example: Google's Borg/Omega/Kubernetes paper discusses this extensively. Batch processing systems (like MapReduce) optimize for throughput. Real-time serving systems (like Google Search) optimize for latency.


Vertical vs Horizontal Scaling

When your system can't handle the load, you have two options:

Vertical Scaling (Scale Up)

Buy a bigger machine. More CPU, more RAM, more disk.

Single Server
┌──────────────────────────┐
│                          │
│  CPU: 4 cores → 64 cores │
│  RAM: 16 GB → 512 GB     │
│  Disk: 500 GB → 10 TB    │
│                          │
└──────────────────────────┘

Pros:

  • Simple — no code changes needed
  • No distributed systems complexity
  • Works for databases that don't shard easily (PostgreSQL scales very well vertically — see Citus Data's research)

Cons:

  • Hardware has a ceiling (you can't buy infinite RAM)
  • Expensive — the cost curve is non-linear
  • Single point of failure
  • Downtime required for upgrades

Horizontal Scaling (Scale Out)

Add more machines. Distribute the load across them.

Pros:

  • Theoretically infinite scaling
  • Commodity hardware = cheaper
  • Fault-tolerant (one node fails, others take over)

Cons:

  • Distributed systems complexity (consistency, coordination, debugging)
  • Requires architectural changes (statelessness, data partitioning)
  • More operational overhead

The Decision Matrix

| Factor | Scale Up | Scale Out |
|---|---|---|
| Cost | High (premium hardware) | Lower (commodity machines) |
| Complexity | Low | High |
| Fault tolerance | None | Built-in |
| Max capacity | Hardware limit | Theoretically infinite |
| Best for | Monoliths, databases | Microservices, web servers |
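
To make the scale-out column concrete, a back-of-envelope estimate helps. The peak traffic and per-server throughput below are invented purely for illustration:

```python
import math

# Hypothetical numbers, purely for illustration.
peak_rps = 50_000        # expected peak requests per second
rps_per_server = 4_000   # measured capacity of one commodity server

servers = math.ceil(peak_rps / rps_per_server)
print(f"{servers} servers cover peak load; provision {servers + 1} for N+1 redundancy")
```
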
⚠️ Common mistake: Many teams jump to horizontal scaling too early. If 2-3 vertical upgrades solve your problem, do that first. Horizontal scaling introduces complexity you may not need. As Instagram's engineering blog documents, they ran a single PostgreSQL instance for years before sharding.


Load Balancing

Once you have multiple servers, you need a way to distribute traffic across them. That's what a load balancer does.

What Is a Load Balancer?

A load balancer sits between clients and your servers, distributing incoming requests across a pool of backend machines.

Layer 4 vs Layer 7 Load Balancing

| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| What it sees | IP + port | HTTP headers, URL, cookies |
| Decision based on | IP address, port, protocol | URL path, headers, cookies, content |
| Speed | Faster (less processing) | Slower (must parse HTTP) |
| Intelligence | Dumb routing | Smart routing (e.g., /api → service A, /static → CDN) |
| Examples | HAProxy (TCP mode), NGINX (stream) | NGINX (HTTP mode), Envoy, ALB |
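
A toy Python sketch of the difference (the backend and service names are made up): an L4 balancer can only use connection metadata, while an L7 balancer can inspect the parsed HTTP request.

```python
# Layer 4: only connection metadata (IP, port, protocol) is visible.
def l4_route(client_ip: str, client_port: int, backends: list) -> str:
    return backends[hash((client_ip, client_port)) % len(backends)]

# Layer 7: the parsed HTTP request (path, headers, cookies) is visible.
def l7_route(path: str, headers: dict) -> str:
    if path.startswith("/api/"):
        return "api-service"
    if path.startswith("/static/"):
        return "cdn-origin"
    if headers.get("X-Canary") == "true":
        return "canary-service"
    return "web-service"

print(l4_route("203.0.113.7", 54321, ["backend-1", "backend-2"]))
print(l7_route("/api/users/42", {"X-Canary": "false"}))
```
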
💡 Deep dive: Netflix's Zuul is a Layer 7 load balancer / API gateway that routes millions of requests per second. Their engineering blog has extensive write-ups on how they manage traffic distribution. Cloudflare's blog also has excellent articles on load balancing at scale.

Load Balancing Algorithms

| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Distribute requests evenly in order | Homogeneous servers, simple setups |
| Weighted Round Robin | Assign weights based on server capacity | Mixed hardware (some servers are bigger) |
| Least Connections | Send to the server with the fewest active connections | Long-lived connections (WebSockets, APIs) |
| IP Hash | Route based on client IP; same client → same server | Session affinity without sticky cookies |
| Consistent Hashing | Hash the key to a position on a ring | Distributed caches, databases (minimizes reshuffling when nodes change) |
| Random | Pick a random server | Simple, surprisingly effective at scale |
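
A minimal sketch of two of the algorithms above, round robin and least connections (the server names are placeholders):

```python
import itertools
from collections import defaultdict

SERVERS = ["server-a", "server-b", "server-c"]

# Round robin: hand out servers in a fixed rotating order.
_cycle = itertools.cycle(SERVERS)
def round_robin() -> str:
    return next(_cycle)

# Least connections: pick the server with the fewest active connections.
active = defaultdict(int)
def least_connections() -> str:
    server = min(SERVERS, key=lambda s: active[s])
    active[server] += 1   # caller should decrement when the connection closes
    return server

print([round_robin() for _ in range(4)])        # a, b, c, a
print([least_connections() for _ in range(4)])  # spreads load evenly here
```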

Consistent Hashing

Consistent hashing is one of the most important algorithms in distributed systems. It maps both servers and data to positions on a hash ring.

For example, on a ring with nodes A, B, and C: when Node A is added, only the keys that fall between Node C and Node A move to A; when Node B is removed, its keys transfer to the next node clockwise.

Unlike simple modulo hashing (key % N), where changing N remaps almost every key, consistent hashing ensures that adding or removing a server redistributes only roughly 1/N of the keys.
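
A simplified consistent-hash ring in Python to illustrate the mechanics; production implementations (including the one described in the Dynamo paper) also place multiple virtual nodes per server to even out key distribution:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node sits at a fixed position on the ring, determined by its hash.
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after hash(key), wrapping around.
        i = bisect.bisect_left(self._positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding or removing a node only remaps keys in the arc next to that node.
```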

Real-world example: Amazon DynamoDB uses consistent hashing to partition data; the original Dynamo paper from AWS describes the technique in detail and is essential reading. Discord's write-up on their consistent hashing implementation is a practical guide.

Load Balancers in Practice

| Product | Type | Notable Users |
|---|---|---|
| NGINX | L7 reverse proxy / LB | Netflix, WordPress, GitHub |
| HAProxy | L4/L7 LB | Twitter, Reddit, GitHub |
| Envoy | L7 proxy (service mesh) | Lyft, Airbnb, Stripe |
| AWS ALB | Managed L7 LB | Countless AWS workloads |
| AWS NLB | Managed L4 LB | High-throughput workloads |

💡 Further reading: NGINX's official load balancing guide and the Envoy proxy architecture documentation are excellent resources for understanding modern load balancing.


Caching Basics

Caching is the most effective technique for improving read performance. The core idea: store expensive-to-compute or expensive-to-fetch results in fast storage, and serve them from there on subsequent requests.

Why Cache?

If a database query takes 10 ms but a lookup in an in-memory cache (Redis/Memcached) takes 0.5 ms, caching gives you a 20× speedup on every read that hits the cache.

Where to Cache

| Layer | What to Cache | Technology | TTL |
|---|---|---|---|
| Browser | Static assets, API responses | HTTP headers (Cache-Control) | Hours to days |
| CDN | Images, videos, static pages | Cloudflare, CloudFront, Fastly | Hours to weeks |
| Application | Computed results, session data | Redis, Memcached | Seconds to minutes |
| Database | Query results | PostgreSQL shared buffers, query cache | Managed by DB |
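
For the browser and CDN rows, caching behavior is driven by HTTP headers. A small sketch of plausible Cache-Control values; the paths and max-age numbers are illustrative, not recommendations:

```python
# Illustrative Cache-Control headers per content type.
CACHE_CONTROL = {
    # Fingerprinted static assets can be cached "forever" by browsers and CDNs.
    "/static/app.9f3b2c.js": "public, max-age=31536000, immutable",
    # HTML should be revalidated so new deploys show up quickly.
    "/index.html": "public, max-age=0, must-revalidate",
    # Per-user API responses must not be stored in shared caches (CDNs).
    "/api/me": "private, max-age=30",
}

for path, value in CACHE_CONTROL.items():
    print(f"{path:<25} Cache-Control: {value}")
```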

Case study: Cloudflare's caching infrastructure is one of the best examples of edge caching. Their blog posts on cache hit ratios and cache everything mode are essential reading for understanding CDN caching at scale.

Cache Invalidation Strategies

"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton

| Strategy | How It Works | When to Use |
|---|---|---|
| TTL (Time To Live) | Data expires after a set duration | Most common; good when staleness is acceptable |
| LRU (Least Recently Used) | Evict the least recently accessed items when the cache is full | Memory-constrained caches (Redis maxmemory-policy allkeys-lru) |
| Write-Through | Write to cache AND database simultaneously | Read-heavy data that must stay fresh |
| Write-Behind (Write-Back) | Write to cache first, asynchronously flush to database | Write-heavy systems where eventual consistency is fine |
| Cache Aside (Lazy Loading) | Application checks cache first; on miss, loads from DB and caches | Most common pattern; simple and effective |
| Invalidation on Write | Delete or update the cache entry when data changes | When you need strong consistency for specific keys |
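
As a quick illustration of the LRU row, here is a toy in-process LRU cache; Redis implements (approximate) LRU eviction natively via maxmemory-policy allkeys-lru:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
```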

Cache Aside Pattern (Most Common)

Read Request:
  1. Check cache for key
  2. If HIT → return cached value
  3. If MISS → query database
  4. Store result in cache (with TTL)
  5. Return value

Write Request:
  1. Update database
  2. Delete cache entry (NOT update it — delete!)
  3. Next read will be a cache miss and repopulate

⚠️ Critical pitfall: When updating data, delete the cache entry, don't update it. Why? Because concurrent writes can create a race condition where the cache is set to a stale value. Deleting ensures the next read gets the fresh value from the database. See Martin Fowler's article on caching patterns for the full explanation.
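
A minimal cache-aside sketch, assuming the redis-py client and a local Redis instance; the USERS_DB dict stands in for a real database:

```python
import json
import redis  # assumes the redis-py client and a local Redis server

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300

# Stand-in for a real database table, just for this sketch.
USERS_DB = {42: {"id": 42, "name": "Ada"}}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                            # cache HIT -> return it
        return json.loads(cached)
    user = USERS_DB[user_id]                          # cache MISS -> query the "database"
    cache.set(key, json.dumps(user), ex=TTL_SECONDS)  # store with a TTL
    return user

def update_user(user_id: int, fields: dict) -> None:
    USERS_DB[user_id].update(fields)                  # 1. update the database
    cache.delete(f"user:{user_id}")                   # 2. delete (not update!) the cache entry
```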


What to Remember for Interviews

  1. Latency numbers — Memorize the orders of magnitude: RAM (100ns) < SSD (150µs) < same-DC network (500µs) < cross-DC (80ms)
  2. Latency vs Throughput — Latency is time per operation; throughput is operations per time
  3. Scale up vs scale out — Start vertical, go horizontal when forced
  4. Load balancing — L4 (fast, dumb) vs L7 (slower, smart); know consistent hashing
  5. Caching — Know the cache aside pattern, TTL vs LRU, and the golden rule: delete on write, don't update