Distributed Systems

The 8 Fallacies of Distributed Computing: A Deep Dive

Understand why distributed systems fail in unexpected ways. Peter Deutsch's 8 fallacies, real-world examples, and practical strategies to design resilient systems from the ground up.

28 min read · distributed systems · fallacies · failure modes · reliability · network · resilience

Introduction: Why This Matters

I've been building distributed systems for over two decades, and I've seen the same mistakes made over and over again. Not because the engineers were careless—far from it—but because distributed systems behave in ways that counter our intuitions.

Peter Deutsch codified seven of the fallacies of distributed computing at Sun Microsystems in 1994, and James Gosling added the eighth a few years later. Remarkably, they're just as relevant today as they were then, maybe more so. Each fallacy is an assumption that sounds reasonable in isolation but will bite you once your system reaches production scale.

Why read this as a system designer? Because every distributed system I've helped debug—every outage I've seen, every cascade failure I had to explain to leadership—traces back to one or more of these fallacies. Understanding them isn't academic; it's survival.


The 8 Fallacies, At a Glance

Before we dive deep, here's the full list:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Let me walk you through each one, not just explaining what they are, but showing you how they manifest in real systems and what to do about them.


Fallacy 1: The Network is Reliable

This is the most fundamental fallacy, and the one that causes the most damage when violated.

What We Assume

"We'll send a message, and it'll arrive. Maybe it'll take a bit longer, but it'll get there."

Reality

Networks fail. Routers fail. Cables get cut (fiber links are severed by construction crews and farm equipment with surprising regularity, sometimes knocking out traffic for entire regions). DNS servers go down. Load balancers misbehave. Switches reboot.

But here's what most engineers miss: partial failures are more dangerous than total failures. When a service goes completely down, you notice immediately. When it's responding slowly or dropping 5% of requests, users report "weird behavior" and you spend days debugging before you realize it's a network issue.

A Real Scenario I've Seen

At one company, we had a payment service calling a fraud detection service. Both were in the same data center. One day, a network misconfiguration caused 2% of calls to the fraud service to hang. Sounds minor, right?

But our timeout was set to 30 seconds, so every hung fraud check held its payment in limbo for the full 30 seconds. With 10,000 payments per minute and a 2% failure rate, we had 200 payments per minute getting stuck in a pending state. Within an hour, we had 12,000 pending payments.

The fix? Three lines of configuration change. The debugging? Three days.

What to Do

Practical steps:

  • Set timeouts conservatively (I typically start with P99 latency × 2)
  • Always make operations idempotent
  • Use circuit breakers to prevent cascade failures
  • Implement bulkheads to isolate failures

Key Principle

Design as if the network will fail, because it will. Every call to another service must have a timeout. Every timeout must have a fallback strategy.


Fallacy 2: Latency is Zero

This fallacy leads to some of the most subtle performance problems in distributed systems.

What We Assume

"A call to another service is fast. Let's just call multiple services in parallel."

The Numbers That Will Change How You Think

Let me give you concrete numbers from production systems I've worked with:

| Operation | Latency |
| --- | --- |
| L1 cache reference | 1 nanosecond |
| L2 cache reference | 4 nanoseconds |
| Main memory reference | 100 nanoseconds |
| SSD read | 100 microseconds |
| Network within same datacenter | 500 microseconds (0.5 ms) |
| Network cross-continent | 100 milliseconds |
| Human blink | 150 milliseconds |

Notice the jumps? Main memory to SSD is a 1000× increase. Local network to cross-continent is another 200×.

Why This Matters for Your API Calls

A call to a service in the same data center might be 1-5 ms. That sounds fast. But if you're making 10 such calls sequentially, that's 10-50 ms of pure network time before any business logic runs. Add a slow dependency deep in the chain, a retry or two, and an upstream caller doing the same, and you're suddenly in timeout territory.

The N+1 Query Problem (Distributed Edition)

I've seen this play out in countless systems:

Your service calls Service A (2ms)
  └─ Service A calls Service B (3ms)
      └─ Service B queries Database (5ms)

Looks innocent, right? But if you have 1000 concurrent requests, and Service B's database is under load and takes 50ms instead of 5ms, suddenly:

  • Service A: 52ms per request × 1000 = 52 seconds of processing time
  • Service B: 50ms per request × 1000 = 50 seconds
  • You: waiting for Service A, which is waiting for Service B

What to Do

  1. Measure latency distributions, not averages. P50 is meaningless. Look at P95, P99, and P99.9.
  2. Set appropriate timeouts based on actual latency, not wishful thinking.
  3. Use adaptive timeouts. If a service is normally fast but occasionally slow, calculate timeouts dynamically.
  4. Parallelize where possible, but understand the cost.
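
As an illustration of points 1-3, here's a small Python sketch that computes tail percentiles from observed call latencies and derives a timeout from the tail rather than from a guess. The sample data and the "P99 × 2" rule are illustrative assumptions.

```python
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Pretend these are the last 1,000 observed call latencies in milliseconds.
latencies_ms = [random.lognormvariate(1.0, 0.6) for _ in range(1000)]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Adaptive timeout: derived from the observed tail, not wishful thinking.
timeout_ms = p99 * 2
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  timeout={timeout_ms:.1f} ms")
```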

Key Principle

Latency is not free, and it's not constant. Assume the worst, measure the actual distribution, and design accordingly.


Fallacy 3: Bandwidth is Infinite

What We Assume

"We can send as much data as we want between services."

The Reality

Every network link has finite bandwidth. And here's what happens in practice: bandwidth gets consumed first by the largest users.

I've seen this destroy systems:

  • One team's report generation job started fetching massive datasets across the network
  • It saturated the 10Gbps link between two data centers
  • All other services sharing that link started experiencing timeouts
  • The entire platform went down

All because one team wasn't aware they were competing for bandwidth with critical services.

What to Do

  1. Compress data in transit. Use gzip, Brotli, or binary protocols.
  2. Paginate aggressively. No endpoint should return unlimited records.
  3. Cache responses. If a service fetches the same data repeatedly, cache it.
  4. Use efficient serialization. JSON is convenient but verbose. Consider Protocol Buffers or MessagePack for internal communication.
  5. Monitor bandwidth consumption per service.
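
Here's a quick Python sketch of the first two points: compressing a JSON payload in transit and paginating instead of returning everything at once. The record shape and page size are made up for illustration.

```python
import gzip
import json

# Illustrative payload: 10,000 records, the kind of thing a report job pulls.
records = [{"id": i, "status": "ok", "note": "x" * 50} for i in range(10_000)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw:     {len(raw) / 1024:.0f} KiB")
print(f"gzipped: {len(compressed) / 1024:.0f} KiB")

# Pagination: never hand the whole dataset back in one response.
PAGE_SIZE = 500

def get_page(page: int) -> list[dict]:
    start = page * PAGE_SIZE
    return records[start:start + PAGE_SIZE]
```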

Key Principle

Bandwidth is a shared, finite resource. Every service is competing for it. Design for efficiency.


Fallacy 4: The Network is Secure

What We Assume

"Our services are inside our data center, so they're secure."

The Threat Model That Doesn't Exist Anymore

The "perimeter security" model—where we protect the outer edge and trust everything inside—is fundamentally broken. Here's why:

  1. Lateral movement: Once an attacker gets inside (through a phishing email, compromised laptop, etc.), they have access to everything.
  2. Internal threats: Not all threats are external. A misconfigured service, an insider with malicious intent, or even just someone accidentally exposing a service to the internet.
  3. Supply chain attacks: Your dependencies might be compromised. Remember Log4j?

Real Attack Vector I've Observed

A team deployed a new microservice. They used an internal DNS name like analytics.internal.company.com. This was fine—internal services could call it. But they also exposed the management port on an internal IP address, not realizing that other services in the same network segment could access it.

One misconfiguration later, and sensitive metrics were accessible to any service that knew where to look.

What to Do

  1. Zero trust networking. Verify every request, even from internal services.
  2. Encrypt everything in transit. mTLS between all services, not just external endpoints.
  3. Principle of least privilege. Services should only have access to what they absolutely need.
  4. Network segmentation. Isolate sensitive workloads.
  5. Mutual authentication. Both client and server should verify each other's certificates.
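
To make points 2 and 5 concrete, here's a minimal sketch using Python's ssl module: the server refuses connections without a valid client certificate signed by the internal CA, and the client both verifies the server and presents its own certificate. The file names and the internal CA are illustrative assumptions.

```python
import ssl

# Server side: present our certificate AND require a valid client certificate
# (mutual TLS). Certificate and CA file paths are assumed for illustration.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")
server_ctx.load_verify_locations(cafile="internal-ca.pem")
server_ctx.verify_mode = ssl.CERT_REQUIRED        # no client cert, no connection
server_ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Client side: verify the server against the same internal CA AND present
# our own certificate so the server can authenticate us.
client_ctx = ssl.create_default_context(cafile="internal-ca.pem")
client_ctx.load_cert_chain(certfile="payments.crt", keyfile="payments.key")
```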

Key Principle

There is no "inside" that's more trusted than "outside." Design your security assuming the network is hostile.


Fallacy 5: Topology Doesn't Change

What We Assume

"We know our network topology. Services are at these IPs. Load balancers are there. We're good."

Why This Assumption Fails

In modern cloud-native environments, topology changes constantly:

  • Services scale up and down (auto-scaling groups, Kubernetes replicas)
  • Services get deployed to new data centers
  • Load balancers get reconfigured
  • Services are moved between environments (dev → staging → prod)
  • Containers get new IPs when they restart

I've seen outages caused by nothing more than a restarted service coming back with a new IP while a downstream caller still had the old address hardcoded.

The Challenge of Dynamic Environments

In a traditional data center, we knew where things were. In Kubernetes:

  • Pods get IPs from a CNI range
  • Pods can be scheduled on any node
  • Services are routed via service discovery, not IP addresses
  • DNS entries can change

What to Do

  1. Never hardcode IPs. Use service discovery instead.
  2. Use DNS-based service discovery. Services should find each other by name, not address.
  3. Implement health checks. Take unhealthy instances out of rotation automatically.
  4. Design for rolling deployments. New instances come up, old ones go down, topology changes—but your service should handle this transparently.
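
Here's a small Python sketch of points 1-3: look instances up by DNS name on every call instead of caching an IP, and keep only the ones whose health endpoint answers quickly. The service name, port, and /healthz path are illustrative assumptions.

```python
import socket
import requests

SERVICE_NAME = "fraud-check.internal"   # hypothetical DNS name, never an IP

def resolve_instances(name: str, port: int = 8080) -> list[str]:
    """Look the service up by name on every call; addresses are temporary."""
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def healthy_instances(name: str) -> list[str]:
    """Keep only instances whose health endpoint answers quickly."""
    alive = []
    for ip in resolve_instances(name):
        try:
            r = requests.get(f"http://{ip}:8080/healthz", timeout=0.5)
            if r.ok:
                alive.append(ip)
        except requests.RequestException:
            pass   # unhealthy or gone; topology changed under us
    return alive
```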

Key Principle

Topology is fluid, not static. Design for an environment where addresses are temporary and services come and go.


Fallacy 6: There is One Administrator

What We Assume

"We control our systems. One team manages everything."

The Reality of Modern Organizations

Large organizations have:

  • Multiple teams managing different services
  • Different teams responsible for networking, security, databases
  • DevOps/SRE teams managing infrastructure
  • Platform teams providing shared services
  • Vendors managing some components
  • Cloud provider managing the underlying infrastructure

I've been in situations where fixing a problem required coordinating between 5 different teams, each with their own processes, deployment pipelines, and change management procedures.

The Blameless Postmortem Culture

When something goes wrong in a distributed system, pointing fingers is counterproductive. But understanding which team owns what is crucial for:

  • Incident response (who do you wake up at 3am?)
  • Change coordination (who needs to approve what?)
  • Troubleshooting (who knows the internals of component X?)

What to Do

  1. Clear ownership. Every service, every dependency, every piece of infrastructure should have a clear owner.
  2. SLAs between teams. If your service depends on another team's service, have a documented SLA.
  3. Shared on-call rotation. Consider having teams share on-call for critical cross-team workflows.
  4. Documentation of ownership. Keep a service catalog that maps services to owners.
  5. Design for loose coupling. Minimize cross-team dependencies where possible.
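
A service catalog doesn't need to be elaborate to be useful. Here's a deliberately tiny Python sketch of points 1, 2, and 4: every service maps to an owning team, an on-call channel, and the availability its consumers can rely on. All names and values are invented for illustration.

```python
# Minimal service-catalog sketch: who owns what, who gets paged, what's promised.
CATALOG = {
    "payments":    {"owner": "payments-team", "oncall": "#payments-oncall", "slo_availability": "99.95%"},
    "fraud-check": {"owner": "risk-team",     "oncall": "#risk-oncall",     "slo_availability": "99.9%"},
    "analytics":   {"owner": "data-platform", "oncall": "#data-oncall",     "slo_availability": "99.5%"},
}

def who_do_i_page(service: str) -> str:
    """Answer the 3am question: which channel do I escalate to?"""
    entry = CATALOG.get(service)
    return entry["oncall"] if entry else "#platform-oncall"  # default escalation
```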

Key Principle

You don't control the whole system. Design for operating in an environment with multiple stakeholders, each controlling their piece.


Fallacy 7: Transport Cost is Zero

What We Assume

"Sending data across the network is free."

Why This Matters

There's actually significant cost to network transport:

  1. CPU cost of serialization/deserialization. Converting objects to bytes and back takes CPU cycles.
  2. Memory cost of buffering. Data needs to be held in memory while being sent or received.
  3. Infrastructure cost. Load balancers, proxies, service meshes—they all cost money to run.
  4. Latency cost. Every hop adds latency.

The Serialization Problem

I've seen services that were CPU-bound not because of complex business logic, but because they were serializing and deserializing massive amounts of JSON. The overhead compounds when services are exchanging data constantly across the network.

What to Do

  1. Choose efficient serialization formats. Protocol Buffers, FlatBuffers, or MessagePack for internal communication. JSON is fine for external APIs.
  2. Send only what you need. Don't fetch entire objects if you only need a few fields.
  3. Consider streaming for large datasets. Don't buffer everything in memory.
  4. Profile your serialization. It's often a hidden bottleneck.
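
To see point 4 in practice, here's a tiny Python sketch that measures what JSON encode/decode alone costs for a moderately sized internal message; the payload is invented for illustration. If numbers like these dominate your CPU profile, a binary format or slimmer messages is usually the cheapest win.

```python
import json
import time

# Illustrative payload: the kind of object two internal services pass around.
payload = {"items": [{"id": i, "qty": i % 10, "price_cents": 999} for i in range(5_000)]}

start = time.perf_counter()
for _ in range(100):
    body = json.dumps(payload).encode("utf-8")
    json.loads(body)
elapsed = time.perf_counter() - start

print(f"100 encode/decode round trips: {elapsed * 1000:.0f} ms, "
      f"{len(body) / 1024:.0f} KiB per message")
```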

Key Principle

Network transport has real costs—CPU, memory, latency, and money. Design for efficiency.


Fallacy 8: The Network is Homogeneous

What We Assume

"All parts of our network use the same protocols, the same configurations, the same everything."

Why This Assumption Fails

In reality, networks are heterogeneous:

  1. Different protocols. Some services might use HTTP, others gRPC, others might use message queues.
  2. Different security requirements. Payment services need stricter security than analytics services.
  3. Different latency profiles. Services in the same data center vs. services across regions.
  4. Different capabilities. Some nodes might have SSDs, others HDDs; some have GPU, others don't.
  5. Technology diversity. You might have services in Java, Python, Go, and Node.js all communicating.

What to Do

  1. Use standard protocols. REST/gRPC for synchronous communication, Apache Kafka/RabbitMQ for async.
  2. Have translation layers. API gateways can translate between protocols.
  3. Design for evolution. Your services will be called by clients you haven't written yet, in languages you haven't chosen yet.
  4. Maintain compatibility. Don't break clients when you evolve your APIs.
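
One concrete way to design for evolution and maintain compatibility (points 3 and 4) is a tolerant reader: parse only the fields you need, ignore unknown ones, and default whatever older or differently written producers omit. A minimal Python sketch, with invented field names:

```python
from dataclasses import dataclass

@dataclass
class OrderEvent:
    order_id: str
    amount_cents: int
    currency: str = "USD"          # default for producers that omit it

def parse_order_event(payload: dict) -> OrderEvent:
    """Tolerant reader: take what we need, ignore the rest."""
    return OrderEvent(
        order_id=str(payload["order_id"]),
        amount_cents=int(payload.get("amount_cents", 0)),
        currency=payload.get("currency", "USD"),
        # Unknown fields from newer or non-homogeneous producers are ignored,
        # so a Java, Go, or Node producer can evolve without breaking us.
    )
```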

Key Principle

The network contains diverse components with different capabilities and requirements. Design for interoperability.


Putting It All Together: The Resilient Distributed System

Now that we understand all the fallacies, what does a system designed with these lessons in mind look like?

The Resilient Service Checklist

| Fallacy | Defense |
| --- | --- |
| Network is reliable | Circuit breakers, retries, timeouts |
| Latency is zero | Adaptive timeouts, parallelization, caching |
| Bandwidth is infinite | Compression, pagination, efficient serialization |
| Network is secure | mTLS, zero trust, least privilege |
| Topology doesn't change | Service discovery, health checks, rolling deploys |
| One administrator | Clear ownership, SLAs, on-call rotations |
| Transport cost is zero | Efficient formats, profiling, streaming |
| Network is homogeneous | Standard protocols, translation layers |

What to Remember for Interviews

  1. Know all 8 fallacies. Recite them without hesitation.
  2. Give real examples. When explaining each fallacy, be ready to describe a time you saw it cause problems in production.
  3. Explain mitigations. For each fallacy, know practical strategies to defend against it.
  4. Connect to patterns. Link fallacies to patterns like circuit breakers, retries, service discovery, etc.
  5. Understand the underlying principles. Why do these assumptions fail? What can we prove mathematically?

Interview tip: When asked about any system design question, proactively address how your design handles network failures, latency, security, and dynamic topology. This shows depth of understanding.


Further Reading

I've found these resources invaluable throughout my career:

  1. "Fallacies of Distributed Computing" — Wikipedia
    The original list by Peter Deutsch, with explanations from multiple practitioners.

  2. "A Note on Distributed Computing" — Sun Microsystems
    This paper by Waldo et al. explains why local and distributed computing are fundamentally different. Essential reading.

  3. "Production Ready Microservices" by Susan Fowler
    Practical guide to building resilient microservices, with excellent coverage of these fallacies in practice.

  4. "Chaos Engineering" — Principles and Practice
    Netflix's approach to testing resilience by intentionally breaking things.

💡 Final thought: These 8 fallacies aren't just theoretical. Every production outage I've experienced has been a violation of at least one of them. Internalize these lessons, and you'll be ahead of most engineers when debugging distributed systems.