Distributed Systems

The 8 Fallacies of Distributed Computing: A Deep Dive

Understand why distributed systems fail in unexpected ways. Peter Deutsch's 8 fallacies, real-world examples, and practical strategies to design resilient systems from the ground up.

28 min read · distributed systems · fallacies · failure modes · reliability · network · resilience

Introduction: Why This Matters

I've been building distributed systems for over two decades, and I've seen the same mistakes made over and over again. Not because the engineers were careless—far from it—but because distributed systems behave in ways that counter our intuitions.

Peter Deutsch codified seven of the fallacies of distributed computing at Sun Microsystems in 1994, and James Gosling added the eighth a few years later. Remarkably, they're just as relevant today as they were then, maybe more so. Each fallacy is an assumption that sounds reasonable in isolation but will bite you once your system reaches production scale.

Why read this as a system designer? Because every distributed system I've helped debug—every outage I've seen, every cascade failure I had to explain to leadership—traces back to one or more of these fallacies. Understanding them isn't academic; it's survival.


The 8 Fallacies, At a Glance

Before we dive deep, here's the full list:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Let me walk you through each one, not just explaining what they are, but showing you how they manifest in real systems and what to do about them.


Fallacy 1: The Network is Reliable

This is the most fundamental fallacy, and the one that causes the most damage when violated.

What We Assume

"We'll send a message, and it'll arrive. Maybe it'll take a bit longer, but it'll get there."

Reality

Networks fail. Routers fail. Cables get cut (fiber links are severed by construction crews and farm equipment with surprising regularity, sometimes knocking out traffic for entire regions). DNS servers go down. Load balancers misbehave. Switches reboot.

But here's what most engineers miss: partial failures are more dangerous than total failures. When a service goes completely down, you notice immediately. When it's responding slowly or dropping 5% of requests, users report "weird behavior" and you spend days debugging before you realize it's a network issue.

A Real Scenario I've Seen

At one company, we had a payment service calling a fraud detection service. Both were in the same data center. One day, a network misconfiguration caused 2% of calls to the fraud service to hang. Sounds minor, right?

But our timeout was set to 30 seconds, so every hung fraud check held its payment in limbo for the full 30 seconds. With 10,000 payments per minute and a 2% failure rate, we had 200 payments per minute getting stuck in a pending state. Within an hour, we had 12,000 pending payments.

The fix? Three lines of configuration change. The debugging? Three days.

What to Do

Practical steps:

  • Set timeouts conservatively (I typically start with P99 latency × 2)
  • Always make operations idempotent
  • Use circuit breakers to prevent cascade failures
  • Implement bulkheads to isolate failures

Key Principle

Design as if the network will fail, because it will. Every call to another service must have a timeout. Every timeout must have a fallback strategy.


Fallacy 2: Latency is Zero

This fallacy leads to some of the most subtle performance problems in distributed systems.

What We Assume

"A call to another service is fast. Let's just call multiple services in parallel."

The Numbers That Will Change How You Think

Let me give you concrete numbers from production systems I've worked with:

| Operation | Latency |
| --- | --- |
| L1 cache reference | 1 nanosecond |
| L2 cache reference | 4 nanoseconds |
| Main memory reference | 100 nanoseconds |
| SSD read | 100 microseconds |
| Network within same datacenter | 500 microseconds (0.5 ms) |
| Network cross-continent | 100 milliseconds |
| Human blink | 150 milliseconds |

Notice the jumps? Main memory to SSD is a 1000× increase. Local network to cross-continent is another 200×.

Why This Matters for Your API Calls

A call to a service in the same data center might be 1-5 ms. That sounds fast. But if you're making 10 such calls sequentially, that's 10-50 ms of pure network time before any business logic runs. Add a slow dependency deep in the chain, a retry or two, and an upstream caller doing the same, and you're suddenly in timeout territory.

The N+1 Query Problem (Distributed Edition)

I've seen this play out in countless systems:

Your service calls Service A (2ms)
  └─ Service A calls Service B (3ms)
      └─ Service B queries Database (5ms)

Looks innocent, right? But if you have 1000 concurrent requests, and Service B's database is under load and takes 50ms instead of 5ms, suddenly:

  • Service A: 52ms per request × 1000 = 52 seconds of processing time
  • Service B: 50ms per request × 1000 = 50 seconds
  • You: waiting for Service A, which is waiting for Service B

What to Do

  1. Measure latency distributions, not averages. P50 is meaningless. Look at P95, P99, and P99.9.
  2. Set appropriate timeouts based on actual latency, not wishful thinking.
  3. Use adaptive timeouts. If a service is normally fast but occasionally slow, calculate timeouts dynamically.
  4. Parallelize where possible, but understand the cost.
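
As an illustration of points 1-3, here's a small Python sketch that computes tail percentiles from observed call latencies and derives a timeout from the tail rather than from a guess. The sample data and the "P99 × 2" rule are illustrative assumptions.

```python
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Pretend these are the last 1,000 observed call latencies in milliseconds.
latencies_ms = [random.lognormvariate(1.0, 0.6) for _ in range(1000)]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Adaptive timeout: derived from the observed tail, not wishful thinking.
timeout_ms = p99 * 2
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  timeout={timeout_ms:.1f} ms")
```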

Key Principle

Latency is not free, and it's not constant. Assume the worst, measure the actual distribution, and design accordingly.


Fallacy 3: Bandwidth is Infinite

What We Assume

"We can send as much data as we want between services."

The Reality

Every network link has finite bandwidth. And here's what happens in practice: bandwidth gets consumed first by the largest users.

I've seen this destroy systems:

  • One team's report generation job started fetching massive datasets across the network
  • It saturated the 10Gbps link between two data centers
  • All other services sharing that link started experiencing timeouts
  • The entire platform went down

All because one team wasn't aware they were competing for bandwidth with critical services.

What to Do

  1. Compress data in transit. Use gzip, Brotli, or binary protocols.
  2. Paginate aggressively. No endpoint should return unlimited records.
  3. Cache responses. If a service fetches the same data repeatedly, cache it.
  4. Use efficient serialization. JSON is convenient but verbose. Consider Protocol Buffers or MessagePack for internal communication.
  5. Monitor bandwidth consumption per service.
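
Here's a quick Python sketch of the first two points: compressing a JSON payload in transit and paginating instead of returning everything at once. The record shape and page size are made up for illustration.

```python
import gzip
import json

# Illustrative payload: 10,000 records, the kind of thing a report job pulls.
records = [{"id": i, "status": "ok", "note": "x" * 50} for i in range(10_000)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw:     {len(raw) / 1024:.0f} KiB")
print(f"gzipped: {len(compressed) / 1024:.0f} KiB")

# Pagination: never hand the whole dataset back in one response.
PAGE_SIZE = 500

def get_page(page: int) -> list[dict]:
    start = page * PAGE_SIZE
    return records[start:start + PAGE_SIZE]
```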

Key Principle

Bandwidth is a shared, finite resource. Every service is competing for it. Design for efficiency.


Fallacy 4: The Network is Secure

What We Assume

"Our services are inside our data center, so they're secure."

The Threat Model That Doesn't Exist Anymore

The "perimeter security" model—where we protect the outer edge and trust everything inside—is fundamentally broken. Here's why:

  1. Lateral movement: Once an attacker gets inside (through a phishing email, compromised laptop, etc.), they have access to everything.
  2. Internal threats: Not all threats are external. A misconfigured service, an insider with malicious intent, or even just someone accidentally exposing a service to the internet.
  3. Supply chain attacks: Your dependencies might be compromised. Remember Log4j?

Real Attack Vector I've Observed

A team deployed a new microservice. They used an internal DNS name like analytics.internal.company.com. This was fine—internal services could call it. But they also exposed the management port on an internal IP address, not realizing that other services in the same network segment could access it.

One misconfiguration later, and sensitive metrics were accessible to any service that knew where to look.

What to Do

  1. Zero trust networking. Verify every request, even from internal services.
  2. Encrypt everything in transit. mTLS between all services, not just external endpoints.
  3. Principle of least privilege. Services should only have access to what they absolutely need.
  4. Network segmentation. Isolate sensitive workloads.
  5. Mutual authentication. Both client and server should verify each other's certificates.
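
To make points 2 and 5 concrete, here's a minimal sketch using Python's ssl module: the server refuses connections without a valid client certificate signed by the internal CA, and the client both verifies the server and presents its own certificate. The file names and the internal CA are illustrative assumptions.

```python
import ssl

# Server side: present our certificate AND require a valid client certificate
# (mutual TLS). Certificate and CA file paths are assumed for illustration.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain(certfile="server.crt", keyfile="server.key")
server_ctx.load_verify_locations(cafile="internal-ca.pem")
server_ctx.verify_mode = ssl.CERT_REQUIRED        # no client cert, no connection
server_ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Client side: verify the server against the same internal CA AND present
# our own certificate so the server can authenticate us.
client_ctx = ssl.create_default_context(cafile="internal-ca.pem")
client_ctx.load_cert_chain(certfile="payments.crt", keyfile="payments.key")
```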

Key Principle

There is no "inside" that's more trusted than "outside." Design your security assuming the network is hostile.


Fallacy 5: Topology Doesn't Change

What We Assume

"We know our network topology. Services are at these IPs. Load balancers are there. We're good."

Why This Assumption Fails

In modern cloud-native environments, topology changes constantly:

  • Services scale up and down (auto-scaling groups, Kubernetes replicas)
  • Services get deployed to new data centers
  • Load balancers get reconfigured
  • Services are moved between environments (dev → staging → prod)
  • Containers get new IPs when they restart

I've seen outages caused by nothing more than a restarted service coming back with a new IP while a downstream caller still had the old address hardcoded.

The Challenge of Dynamic Environments

In a traditional data center, we knew where things were. In Kubernetes:

  • Pods get IPs from a CNI range
  • Pods can be scheduled on any node
  • Services are routed via service discovery, not IP addresses
  • DNS entries can change

What to Do

  1. Never hardcode IPs. Use service discovery instead.
  2. Use DNS-based service discovery. Services should find each other by name, not address.
  3. Implement health checks. Take unhealthy instances out of rotation automatically.
  4. Design for rolling deployments. New instances come up, old ones go down, topology changes—but your service should handle this transparently.
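
Here's a small Python sketch of points 1-3: look instances up by DNS name on every call instead of caching an IP, and keep only the ones whose health endpoint answers quickly. The service name, port, and /healthz path are illustrative assumptions.

```python
import socket
import requests

SERVICE_NAME = "fraud-check.internal"   # hypothetical DNS name, never an IP

def resolve_instances(name: str, port: int = 8080) -> list[str]:
    """Look the service up by name on every call; addresses are temporary."""
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def healthy_instances(name: str) -> list[str]:
    """Keep only instances whose health endpoint answers quickly."""
    alive = []
    for ip in resolve_instances(name):
        try:
            r = requests.get(f"http://{ip}:8080/healthz", timeout=0.5)
            if r.ok:
                alive.append(ip)
        except requests.RequestException:
            pass   # unhealthy or gone; topology changed under us
    return alive
```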

Key Principle

Topology is fluid, not static. Design for an environment where addresses are temporary and services come and go.


Fallacy 6: There is One Administrator

What We Assume

"We control our systems. One team manages everything."

The Reality of Modern Organizations

Large organizations have:

  • Multiple teams managing different services
  • Different teams responsible for networking, security, databases
  • DevOps/SRE teams managing infrastructure
  • Platform teams providing shared services
  • Vendors managing some components
  • Cloud provider managing the underlying infrastructure

I've been in situations where fixing a problem required coordinating between 5 different teams, each with their own processes, deployment pipelines, and change management procedures.

The Blameless Postmortem Culture

When something goes wrong in a distributed system, pointing fingers is counterproductive. But understanding which team owns what is crucial for:

  • Incident response (who do you wake up at 3am?)
  • Change coordination (who needs to approve what?)
  • Troubleshooting (who knows the internals of component X?)

What to Do

  1. Clear ownership. Every service, every dependency, every piece of infrastructure should have a clear owner.
  2. SLAs between teams. If your service depends on another team's service, have a documented SLA.
  3. Shared on-call rotation. Consider having teams share on-call for critical cross-team workflows.
  4. Documentation of ownership. Keep a service catalog that maps services to owners.
  5. Design for loose coupling. Minimize cross-team dependencies where possible.
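
A service catalog doesn't need to be elaborate to be useful. Here's a deliberately tiny Python sketch of points 1, 2, and 4: every service maps to an owning team, an on-call channel, and the availability its consumers can rely on. All names and values are invented for illustration.

```python
# Minimal service-catalog sketch: who owns what, who gets paged, what's promised.
CATALOG = {
    "payments":    {"owner": "payments-team", "oncall": "#payments-oncall", "slo_availability": "99.95%"},
    "fraud-check": {"owner": "risk-team",     "oncall": "#risk-oncall",     "slo_availability": "99.9%"},
    "analytics":   {"owner": "data-platform", "oncall": "#data-oncall",     "slo_availability": "99.5%"},
}

def who_do_i_page(service: str) -> str:
    """Answer the 3am question: which channel do I escalate to?"""
    entry = CATALOG.get(service)
    return entry["oncall"] if entry else "#platform-oncall"  # default escalation
```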

Key Principle

You don't control the whole system. Design for operating in an environment with multiple stakeholders, each controlling their piece.


Fallacy 7: Transport Cost is Zero

What We Assume

"Sending data across the network is free."

Why This Matters

There's actually significant cost to network transport:

  1. CPU cost of serialization/deserialization. Converting objects to bytes and back takes CPU cycles.
  2. Memory cost of buffering. Data needs to be held in memory while being sent or received.
  3. Infrastructure cost. Load balancers, proxies, service meshes—they all cost money to run.
  4. Latency cost. Every hop adds latency.

The Serialization Problem

I've seen services that were CPU-bound not because of complex business logic, but because they were serializing and deserializing massive amounts of JSON. The overhead compounds when services are exchanging data constantly across the network.

What to Do

  1. Choose efficient serialization formats. Protocol Buffers, FlatBuffers, or MessagePack for internal communication. JSON is fine for external APIs.
  2. Send only what you need. Don't fetch entire objects if you only need a few fields.
  3. Consider streaming for large datasets. Don't buffer everything in memory.
  4. Profile your serialization. It's often a hidden bottleneck.
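
To see point 4 in practice, here's a tiny Python sketch that measures what JSON encode/decode alone costs for a moderately sized internal message; the payload is invented for illustration. If numbers like these dominate your CPU profile, a binary format or slimmer messages is usually the cheapest win.

```python
import json
import time

# Illustrative payload: the kind of object two internal services pass around.
payload = {"items": [{"id": i, "qty": i % 10, "price_cents": 999} for i in range(5_000)]}

start = time.perf_counter()
for _ in range(100):
    body = json.dumps(payload).encode("utf-8")
    json.loads(body)
elapsed = time.perf_counter() - start

print(f"100 encode/decode round trips: {elapsed * 1000:.0f} ms, "
      f"{len(body) / 1024:.0f} KiB per message")
```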

Key Principle

Network transport has real costs—CPU, memory, latency, and money. Design for efficiency.


Fallacy 8: The Network is Homogeneous

What We Assume

"All parts of our network use the same protocols, the same configurations, the same everything."

Why This Assumption Fails

In reality, networks are heterogeneous:

  1. Different protocols. Some services might use HTTP, others gRPC, others might use message queues.
  2. Different security requirements. Payment services need stricter security than analytics services.
  3. Different latency profiles. Services in the same data center vs. services across regions.
  4. Different capabilities. Some nodes might have SSDs, others HDDs; some have GPU, others don't.
  5. Technology diversity. You might have services in Java, Python, Go, and Node.js all communicating.

What to Do

  1. Use standard protocols. REST/gRPC for synchronous communication, Apache Kafka/RabbitMQ for async.
  2. Have translation layers. API gateways can translate between protocols.
  3. Design for evolution. Your services will be called by clients you haven't written yet, in languages you haven't chosen yet.
  4. Maintain compatibility. Don't break clients when you evolve your APIs.
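
One concrete way to design for evolution and maintain compatibility (points 3 and 4) is a tolerant reader: parse only the fields you need, ignore unknown ones, and default whatever older or differently written producers omit. A minimal Python sketch, with invented field names:

```python
from dataclasses import dataclass

@dataclass
class OrderEvent:
    order_id: str
    amount_cents: int
    currency: str = "USD"          # default for producers that omit it

def parse_order_event(payload: dict) -> OrderEvent:
    """Tolerant reader: take what we need, ignore the rest."""
    return OrderEvent(
        order_id=str(payload["order_id"]),
        amount_cents=int(payload.get("amount_cents", 0)),
        currency=payload.get("currency", "USD"),
        # Unknown fields from newer or non-homogeneous producers are ignored,
        # so a Java, Go, or Node producer can evolve without breaking us.
    )
```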

Key Principle

The network contains diverse components with different capabilities and requirements. Design for interoperability.


Putting It All Together: The Resilient Distributed System

Now that we understand all the fallacies, what does a system designed with these lessons in mind look like?

The Resilient Service Checklist

| Fallacy | Defense |
| --- | --- |
| Network is reliable | Circuit breakers, retries, timeouts |
| Latency is zero | Adaptive timeouts, parallelization, caching |
| Bandwidth is infinite | Compression, pagination, efficient serialization |
| Network is secure | mTLS, zero trust, least privilege |
| Topology doesn't change | Service discovery, health checks, rolling deploys |
| One administrator | Clear ownership, SLAs, on-call rotations |
| Transport cost is zero | Efficient formats, profiling, streaming |
| Network is homogeneous | Standard protocols, translation layers |

What to Remember for Interviews

  1. Know all 8 fallacies. Recite them without hesitation.
  2. Give real examples. When explaining each fallacy, be ready to describe a time you saw it cause problems in production.
  3. Explain mitigations. For each fallacy, know practical strategies to defend against it.
  4. Connect to patterns. Link fallacies to patterns like circuit breakers, retries, service discovery, etc.
  5. Understand the underlying principles. Why do these assumptions fail? What can we prove mathematically?

Interview tip: When asked about any system design question, proactively address how your design handles network failures, latency, security, and dynamic topology. This shows depth of understanding.


Further Reading

I've found these resources invaluable throughout my career:

  1. "Fallacies of Distributed Computing" — Wikipedia
    The original list by Peter Deutsch, with explanations from multiple practitioners.

  2. "A Note on Distributed Computing" — Sun Microsystems
    This paper by Waldo et al. explains why local and distributed computing are fundamentally different. Essential reading.

  3. "Production Ready Microservices" by Susan Fowler
    Practical guide to building resilient microservices, with excellent coverage of these fallacies in practice.

  4. "Chaos Engineering" — Principles and Practice
    Netflix's approach to testing resilience by intentionally breaking things.

💡 Final thought: These 8 fallacies aren't just theoretical. Every production outage I've experienced has been a violation of at least one of them. Internalize these lessons, and you'll be ahead of most engineers when debugging distributed systems.