Service Discovery & Configuration Management in Distributed Systems

Learn how services find each other in dynamic environments, implement client-side vs server-side discovery, and manage configuration across microservices with practical patterns and tools.

service discovery, consul, eureka, etcd, configuration management, microservices, kubernetes

Introduction: The Problem of Finding Services

Picture this: it's 2015, and you're running a microservices architecture. Your Order Service needs to call the Payment Service. Simple, right? You know the Payment Service's IP address... wait, do you?

In a modern distributed system:

Services scale up and down based on demand
Services get deployed to new machines when old ones fail
Services move between data centers
Services get new IP addresses on every restart
Different environments (dev, staging, prod) have different addresses

So how does Order Service know where Payment Service is? Service discovery.

Why this matters: I once spent three days debugging an outage that turned out to be a service whose IP address had changed after a restart, but nobody had updated the configuration file. Service discovery exists to prevent exactly this kind of silent failure.

What is Service Discovery?

Service discovery is the process by which services find each other in a distributed system. It answers the question: "Where is service X running right now?"

The Core Components

The Registration Patterns

There are two main approaches to service registration:

Self-Registration (The "Hey, I'm Here!" Pattern): Services register themselves with the service registry.
Third-Party Registration (The "Someone Else Tells Everyone" Pattern): An external system registers services on their behalf.
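
The difference between the two patterns is mostly about who makes the register call. Here is a minimal in-memory sketch; `Instance`, `ServiceRegistry`, `self_register`, and `registrar_sync` are hypothetical names for illustration, not the API of any real registry:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    service: str
    host: str
    port: int


class ServiceRegistry:
    """Toy in-memory registry mapping a service name to its instances."""

    def __init__(self):
        self._instances = {}

    def register(self, inst):
        self._instances.setdefault(inst.service, set()).add(inst)

    def deregister(self, inst):
        self._instances.get(inst.service, set()).discard(inst)

    def lookup(self, service):
        found = self._instances.get(service, set())
        return sorted(found, key=lambda i: (i.host, i.port))


def self_register(registry, inst):
    """Self-registration: the service itself calls the registry on startup."""
    registry.register(inst)


def registrar_sync(registry, service, observed):
    """Third-party registration: an external agent reconciles the registry
    with the set of instances it observes on the platform."""
    current = set(registry.lookup(service))
    for inst in observed - current:
        registry.register(inst)
    for inst in current - observed:
        registry.deregister(inst)
```

In the self-registration case the service needs the registry client in its own code; in the third-party case only the agent does.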

Why Self-Registration is Usually Better

In my experience, self-registration is more common, mostly because it is simpler to set up. The two approaches compare as follows:

Aspect	Self-Registration	Third-Party Registration
Coupling	Service knows about registry	Service doesn't know about registry
Complexity	Simple (just add code to service)	Complex (need external agent)
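Failure isolation	A crashed service can leave a stale entry until a health check or TTL removes it	Agent failure delays registry updates but doesn't affect the service
Startup	Service must handle registration failures itself	Registration timing depends on the agent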

Client-Side vs Server-Side Discovery

There are two fundamental approaches to service discovery, each with trade-offs.

Client-Side Discovery

In client-side discovery, the client (service making the call) is responsible for finding available service instances.

How it works:

Service needs to call another service
Service queries the service registry
Service chooses an instance (simple round-robin, random, or more sophisticated)
Service makes the request
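
The four steps above can be sketched in a few lines. This is a minimal illustration, assuming the registry lookup is any callable that returns a list of `host:port` strings; `RoundRobinClient` is a hypothetical name, not a real library class:

```python
class RoundRobinClient:
    """Client-side discovery: query the registry, then pick an instance locally."""

    def __init__(self, lookup):
        self._lookup = lookup      # callable: service name -> list of "host:port"
        self._counters = {}        # per-service request counter

    def choose(self, service):
        instances = self._lookup(service)
        if not instances:
            raise LookupError(f"no instances registered for {service}")
        n = self._counters.get(service, 0)
        self._counters[service] = n + 1
        # Simple round-robin: rotate through the current instance list.
        return instances[n % len(instances)]
```

The caller would then send its request to the address returned by `choose`. Swapping round-robin for random or least-connections selection only changes the last line.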

Pros:

No additional network hop (compared to server-side)
Client can use sophisticated load balancing
Registry doesn't become a bottleneck

Cons:

Client must know about service registry
Client library needed in every service
More coupling between services and discovery mechanism

Server-Side Discovery

In server-side discovery, the client makes a request to a router/gateway, which queries the service registry and forwards the request.

How it works:

Client makes request to a known endpoint (API gateway)
Gateway queries the service registry
Gateway chooses an instance and forwards the request
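
A rough sketch of the gateway's side of this flow, assuming routing by path prefix and random instance selection. `Gateway`, its `routes` table, and the `lookup` callable are all hypothetical; a real gateway would also proxy the request and apply auth, logging, and retries:

```python
import random


class Gateway:
    """Server-side discovery: the gateway, not the client, talks to the registry."""

    def __init__(self, routes, lookup, rng=None):
        self._routes = routes      # path prefix -> service name
        self._lookup = lookup      # callable: service name -> list of "host:port"
        self._rng = rng or random.Random()

    def resolve(self, path):
        # The longest matching prefix wins, so "/payments" beats "/".
        matches = [prefix for prefix in self._routes if path.startswith(prefix)]
        if not matches:
            raise LookupError(f"no route for {path}")
        service = self._routes[max(matches, key=len)]
        instances = self._lookup(service)
        if not instances:
            raise LookupError(f"no instances available for {service}")
        return service, self._rng.choice(instances)
```

The client only ever needs to know the gateway's address; everything about instances stays behind `resolve`.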

Pros:

Services don't need discovery client library
Easier to manage cross-cutting concerns (auth, logging)
Single entry point for all clients

Cons:

Additional network hop
Gateway can become a bottleneck
Gateway must be highly available

When to Use Which

Factor	Client-Side	Server-Side
Number of services	Many, with complex routing	Fewer, simpler routing
Language diversity	Different languages need different clients	Single gateway handles all
Routing needs	Sophisticated per-service routing	Standard routing + common policies
Operational complexity	No gateway to run, but every service carries a discovery client	Gateway must be deployed, scaled, and kept highly available

My experience: For most microservices architectures, I recommend server-side discovery with Kubernetes or an API gateway. It's simpler to manage, provides a natural place for cross-cutting concerns, and decouples services from the discovery mechanism.

Service Registries: The Heart of Service Discovery

What Makes a Good Service Registry?

A service registry needs to handle several challenges:

Service registration: How do services register?
Health checking: How do we know if a service instance is healthy?
Failure detection: How do we detect and remove unhealthy instances?
Distributed operation: How does the registry itself stay available?

Popular Service Registries

Registry	Developed By	Best For	Key Features
Consul	HashiCorp	General microservices	DNS interface, health checks, KV store
Eureka	Netflix	AWS microservices	Built for cloud, peer-to-peer replication
etcd	CoreOS (now a CNCF project)	Kubernetes ecosystem	Distributed key-value store, used by K8s
ZooKeeper	Apache	Legacy systems	Mature, proven, but complex

Consul Architecture

Consul is my go-to for most service discovery use cases. Here's why:

Key features:

DNS interface: Services can be discovered via names like payment-service.service.consul
HTTP API: For programmatic discovery
Health checks: HTTP, TCP, script, or TTL-based
Key-Value store: For configuration data
Multi-datacenter: Native support for federated clusters

Service Registration Flow

Discovery methods:

DNS queries: dig payment-service.service.consul
HTTP API: GET /v1/catalog/service/payment-service
Blocking queries: Wait for changes without polling
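
To make the HTTP side more concrete, here is a sketch that builds a registration payload for the agent's `/v1/agent/service/register` endpoint and a catalog URL with the `index` and `wait` parameters used by blocking queries. It only constructs data, it does not call Consul. The agent address, the `/health` path, and the interval values are assumptions chosen for illustration:

```python
from urllib.parse import urlencode

# Assumption: a local Consul agent listening on its default HTTP port.
CONSUL = "http://127.0.0.1:8500"


def registration_payload(name, address, port, health_path="/health"):
    """JSON body for PUT /v1/agent/service/register, with an HTTP health check."""
    return {
        "ID": f"{name}-{address}-{port}",
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",
            "Timeout": "2s",
            "DeregisterCriticalServiceAfter": "1m",
        },
    }


def catalog_url(name, index=None, wait="30s"):
    """URL for GET /v1/catalog/service/<name>.

    Passing the last seen X-Consul-Index as `index` turns the call into a
    blocking query that returns when the data changes or `wait` elapses.
    """
    url = f"{CONSUL}/v1/catalog/service/{name}"
    if index is not None:
        url += "?" + urlencode({"index": index, "wait": wait})
    return url
```

A client would PUT the payload once at startup, then loop on `catalog_url` with the index from the previous response instead of polling on a timer.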

Health Checking: The Unsung Hero

Service discovery only works if the registry knows which services are healthy. Health checking is how services prove they're alive.

Types of Health Checks

1. Active Health Checks (Registry Checks Service)

Types of active checks:

HTTP: Send HTTP GET to health endpoint
TCP: Open TCP connection to port
Script: Run a script to check health
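
The TCP and HTTP variants are easy to sketch with the standard library. This is a simplified illustration of what a registry-side checker might do, with hypothetical function names and a one-second timeout picked arbitrarily:

```python
import socket
import urllib.request


def tcp_check(host, port, timeout=1.0):
    """Pass if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, timeout=1.0):
    """Pass if GET on the health endpoint returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError (4xx/5xx) and timeouts are OSError subclasses;
        # ValueError covers a malformed URL.
        return False
```

Note that `tcp_check` only proves the port is open, which is exactly the weakness discussed under "What Makes a Good Health Check?".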

2. Passive Health Checks (Service Reports to Registry)
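
One common form of this is a TTL heartbeat: the service pings the registry periodically, and the registry treats silence as failure. A minimal sketch, assuming a single-process registry; `TTLRegistry` is a hypothetical name, and the clock is injectable so the behavior can be tested without sleeping:

```python
import time


class TTLRegistry:
    """Instances stay healthy only while their heartbeats keep arriving."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}       # instance id -> time of last heartbeat

    def heartbeat(self, instance_id):
        self._last_seen[instance_id] = self._clock()

    def healthy(self):
        now = self._clock()
        return sorted(
            instance_id
            for instance_id, seen in self._last_seen.items()
            if now - seen <= self._ttl
        )
```

The TTL should be a few multiples of the heartbeat interval so that one delayed heartbeat doesn't take an instance out of rotation.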

3. Cassandra-Style (Gossip Protocol)

Services tell each other what they know, and health information spreads through the cluster via gossip.
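
A toy simulation shows how quickly knowledge spreads. Each node keeps a map of node name to the highest heartbeat counter it has seen, and merging two maps keeps the larger counter. Real gossip implementations pick peers at random; this sketch uses a fixed peer function so the result is deterministic, and all names are hypothetical:

```python
def merge(local, remote):
    """Keep the highest heartbeat counter seen for each node."""
    merged = dict(local)
    for node, beat in remote.items():
        if beat > merged.get(node, -1):
            merged[node] = beat
    return merged


def gossip_round(views, peers_of):
    """Every node pushes its current view to its peers; returns the new views."""
    new_views = {node: dict(view) for node, view in views.items()}
    for node, view in views.items():
        for peer in peers_of(node):
            new_views[peer] = merge(new_views[peer], view)
    return new_views
```

With three nodes arranged in a ring, where each node initially knows only itself, two rounds are enough for every node to hold the full picture. A node whose counter stops increasing would eventually be suspected as failed.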

What Makes a Good Health Check?

A good health check should:

Check what matters: Not just "is the port open" but "can this service do its job"
Be fast: Don't make clients wait for slow checks
Be deterministic: Same state should always pass or fail
Be lightweight: Health checks shouldn't stress the service

Health Check Levels

Level	Checks	Use When
L1: Liveness	Process is running	Basic availability
L2: Readiness	Can handle requests; dependencies (DB, cache) are healthy	Deciding whether an instance should receive traffic
L3: Deep	Business operations work	Critical services
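
The first two levels might look like this inside a service. This is a framework-free sketch that returns a status code and a body; the function names and the shape of `dependency_checks` (a dict of name to zero-argument callable) are assumptions for illustration:

```python
def liveness():
    """L1: if this code runs at all, the process is alive."""
    return 200, {"status": "alive"}


def readiness(dependency_checks):
    """L2: ready only if every dependency check passes."""
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False   # a crashing check counts as a failure
    status = 200 if all(results.values()) else 503
    return status, results
```

A deep (L3) check would follow the same shape but exercise a real business operation, which is why it is usually reserved for critical services and run less often.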

Handling Unhealthy Instances

Key concepts:

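Grace period: How long (or how many failed checks) to wait before marking an instance unhealthy
Recovery period: How many successful checks are required before marking it healthy again
Deregistration delay: How long to wait before removing an unhealthy instance from the registry (prevents flapping)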
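
These thresholds amount to a small state machine. The sketch below expresses both the grace and recovery periods as counts of consecutive check results, which is one common way to configure them; the class name and default thresholds are assumptions:

```python
class InstanceHealth:
    """Flip state only after enough consecutive results disagree with it."""

    def __init__(self, failures_to_unhealthy=3, successes_to_healthy=2):
        self._down = failures_to_unhealthy
        self._up = successes_to_healthy
        self.healthy = True
        self._streak = 0   # consecutive results contradicting the current state

    def record(self, passed):
        if passed == self.healthy:
            self._streak = 0            # result agrees with state: reset
            return self.healthy
        self._streak += 1
        threshold = self._down if self.healthy else self._up
        if self._streak >= threshold:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

A single failed check between successes never marks the instance unhealthy, which is what keeps a briefly slow instance from flapping in and out of rotation.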

Configuration Management: The Second Half

Service discovery finds services. Configuration management distributes configuration to them. They're often handled by the same tools.

The Configuration Problem

In a microservices architecture, you might have:

50 microservices
3 environments (dev, staging, prod)
Multiple teams changing configuration
Secrets that shouldn't be in code
Configuration that changes at runtime

How do you manage this?

Configuration Patterns

1. Environment Variables (The Simplest Approach)

Environment variables injected at runtime are the simplest option. Here is how they compare with the alternatives:

Approach	Pros	Cons
Env vars	Simple, universal	Hard to manage many variables
Config files	Structured, supports hierarchies	Per-service, no centralized updates
Configuration server	Central control, dynamic updates, versioning	Another service to manage

What to store in environment variables:

Database connection strings (without embedded credentials)
API endpoints
Feature flags
Log levels
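
Reading these values in one place at startup keeps the rest of the code free of `os.environ` lookups. A minimal sketch; the variable names `PAYMENT_API_URL`, `LOG_LEVEL`, and `FEATURE_NEW_CHECKOUT` are made up for this example:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    payment_api_url: str
    log_level: str
    new_checkout_enabled: bool


def load_settings(env=os.environ):
    """Build typed settings from environment variables, failing fast on gaps."""
    return Settings(
        # Required: a missing value raises KeyError at startup, not mid-request.
        payment_api_url=env["PAYMENT_API_URL"],
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        new_checkout_enabled=env.get("FEATURE_NEW_CHECKOUT", "false").lower()
        in {"1", "true", "yes"},
    )
```

Passing a plain dict as `env` makes the loader easy to test and lets each environment (dev, staging, prod) supply its own values without code changes.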

2. Centralized Configuration Server

Pros: Central control, dynamic updates, versioning
Cons: Another service to manage, network dependency

Configuration Refresh
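
One way to pick up changes without a restart is to re-fetch configuration periodically and notify listeners only when the version changes. The sketch below assumes the configuration server can return a `(version, values)` pair; `RefreshingConfig` and its `fetch` callable are hypothetical and stand in for whatever client your configuration server provides:

```python
class RefreshingConfig:
    """Holds the latest config and notifies listeners when it changes."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable returning (version, dict of values)
        self._version = None
        self._values = {}
        self._listeners = []

    def on_change(self, callback):
        self._listeners.append(callback)

    def refresh(self):
        """Fetch once; return True if a new version was applied."""
        version, values = self._fetch()
        if version == self._version:
            return False
        self._version, self._values = version, dict(values)
        for callback in self._listeners:
            callback(self._values)
        return True

    def get(self, key, default=None):
        return self._values.get(key, default)
```

In a running service, `refresh` would be called from a timer or a long-poll loop, and listeners would do things like adjust the log level.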

Feature Flags: Beyond Configuration

Feature flags take configuration a step further—they control behavior, not just values.

Flag Type	Purpose
Release flags	Enable/disable features
Experiment flags	A/B testing
Ops flags	Kill switches, rate limits
Permission flags	User-specific features
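
A release flag with a gradual rollout and an ops kill switch can be sketched with a stable hash, so the same user always lands in the same bucket. The function names and the choice of SHA-256 are illustrative assumptions, not how any particular flag service works:

```python
import hashlib


def bucket(flag, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag, user_id, rollout_percent, kill_switch=False):
    """Release flag with percentage rollout; the kill switch always wins."""
    if kill_switch:
        return False
    return bucket(flag, user_id) < rollout_percent
```

Including the flag name in the hash means a user who is in the first 10% for one flag is not automatically in the first 10% for every other flag.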

Secrets Management

Never put secrets in configuration files or environment variables directly!

The principle: Applications should fetch secrets at startup or runtime, never hardcode them.
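
In code, that principle usually means going through a small accessor rather than reading a constant. The sketch below caches fetched secrets for a short time so that rotated values are picked up; `SecretCache` is hypothetical, and `fetch` stands in for a call to whatever secrets manager you use (such as Vault):

```python
import time


class SecretCache:
    """Fetch secrets on demand and re-fetch them after a TTL."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch        # callable: secret name -> secret value
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # name -> (value, expiry time)

    def get(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        value = self._fetch(name)  # never logged, never written to disk
        self._cache[name] = (value, now + self._ttl)
        return value
```

Because the value is re-fetched after the TTL, a rotated database password reaches the service without a redeploy.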

Kubernetes Service Discovery: A Case Study

If you're running Kubernetes, service discovery is built in and beautifully simple.

How Kubernetes DNS Works

Service names:

payment-service (short name)
payment-service.default (namespace-qualified)
payment-service.default.svc.cluster.local (fully qualified)
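
The three forms follow a fixed pattern, which a small helper makes explicit. This sketch assumes the default cluster domain `cluster.local`; the function names are made up for illustration:

```python
def service_dns_names(service, namespace="default", cluster_domain="cluster.local"):
    """Return the short, namespace-qualified, and fully qualified names."""
    return [
        service,
        f"{service}.{namespace}",
        f"{service}.{namespace}.svc.{cluster_domain}",
    ]


def name_for_caller(service, service_namespace, caller_namespace):
    """Shortest name that resolves from the caller's namespace."""
    if caller_namespace == service_namespace:
        return service             # the short name works within one namespace
    return f"{service}.{service_namespace}"
```

The short name only resolves for callers in the same namespace, so a pod in an `orders` namespace would need `payment-service.default` to reach a Service in `default`.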

Kubernetes Service Types

Type	Use Case	How It Works
ClusterIP	Internal only	Stable IP within cluster
NodePort	Simple external access	Exposes a static port on each node's IP
LoadBalancer	Cloud provider LB	External LB routes to service
ExternalName	CNAME to external service	Maps to external DNS

DNS Resolution Flow

Endpoints and EndpointSlices

Behind every Service that has a selector, Kubernetes maintains an Endpoints object that lists the actual pod IPs:

Component	Purpose
Service	Stable name + ClusterIP
Endpoints	List of pod IPs + ports
EndpointSlices	Endpoints split into smaller chunks so updates scale in large clusters
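
The idea behind slicing is simple chunking: instead of one object that must be rewritten whenever any pod changes, the endpoints are spread across several smaller objects. A sketch of that idea, not of the actual controller logic; the limit of 100 endpoints per slice is used here as an assumed default:

```python
def to_slices(endpoints, max_per_slice=100):
    """Split a flat endpoint list into chunks of at most max_per_slice."""
    return [
        endpoints[i:i + max_per_slice]
        for i in range(0, len(endpoints), max_per_slice)
    ]
```

For a Service with 250 pods this yields three slices, so a single pod restart touches one small object rather than the whole list.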

Putting It All Together: A Practical Architecture

Here's how I'd set up service discovery and configuration for a typical microservices architecture:

Implementation Checklist

Component	Tool	Key Config
Service Registry	Consul	Health checks, DNS
Service Mesh	Istio / Linkerd	mTLS, traffic management
Configuration	Apollo / Spring Cloud Config	Git-backed, versioned
Secrets	HashiCorp Vault	PKI, dynamic secrets
Service Discovery (K8s)	CoreDNS	Built-in

What to Remember for Interviews

Client vs server-side discovery: Know the trade-offs and when to use each.
Health checks: Understand active vs passive and what makes a good health check.
Consul, Eureka, etcd: Be familiar with at least one service registry.
Configuration management: Know patterns for managing config across environments.
Kubernetes DNS: Understand how service discovery works in K8s.
Secrets management: Know that secrets should never be in config files.

Interview tip: When designing any microservices system, always address service discovery. Say "we'll use Consul for service discovery with health checks" or "we'll use Kubernetes DNS for service-to-service communication." This shows operational awareness.