Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability

Understand the WAL algorithm - the foundation of crash recovery in PostgreSQL, durability in Kafka, and replication in databases. Learn fsync ordering, checkpointing, and point-in-time recovery.

WALcrash-recoverydurabilityPostgreSQLKafkafsynccheckpoint

Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability

I once spent a weekend recovering a PostgreSQL database that lost 15 minutes of transactions because the WAL disk filled up. That weekend taught me more about durability, fsync semantics, and checkpointing than any textbook ever could. The Write-Ahead Log is the single most important piece of your database's crash recovery story, and understanding it deeply separates engineers who build reliable systems from those who just hope they work.

Every database you use—PostgreSQL, MySQL, SQLite, MongoDB, Cassandra—uses a WAL (or its equivalent, the journal or commit log). Kafka's entire architecture is a distributed WAL. If you understand WAL, you understand how data survives crashes, how replication works, and how point-in-time recovery is possible.

✅

The core insight: A Write-Ahead Log implements one simple rule: before you modify a data page on disk, write a record describing that modification to the log. If the system crashes, you replay the log to reconstruct lost writes. This is the foundation of crash recovery in every serious storage engine.

The Fundamental WAL Protocol

The WAL protocol has three rules that every system must follow:

Write to the log first: Before modifying a data page, append a redo record to the log
Flush the log to disk: Ensure the log record is durable (fsync) before applying the change
Apply the change: Modify the data page in memory or on disk

This guarantees that even if the system crashes mid-operation, you can replay the log to recover.

Crash Scenarios

Crash Timing	Recovery Outcome	Why
Before WAL write	Transaction lost (atomicity)	Data never made it to any durable store
Before WAL fsync	Transaction lost (atomicity)	OS buffer cache not flushed to disk
After WAL fsync, before page write	Transaction recovered (durability)	WAL has the redo record
After page write, before WAL fsync	Partial page write	WAL replay fixes torn pages

REDO and UNDO Logging

There are two flavors of WAL records:

Record Type	Purpose	Example	Crash Recovery
REDO only	Can redo the change from log	"Set page 5, offset 100 to value 42"	Replay all REDO records
UNDO only	Can undo the change from log	"Page 5 originally had value 7"	Rollback uncommitted transactions
REDO + UNDO	Both (MySQL InnoDB)	Full before/after image	Replay committed, rollback uncommitted

PostgreSQL uses REDO-only WAL (it stores the new page image). MySQL InnoDB uses REDO + UNDO. The tradeoff is WAL size vs. recovery flexibility.

PostgreSQL WAL: Streaming Replication and PITR

PostgreSQL's WAL is the most sophisticated implementation in open-source databases. It enables three critical features:

WAL Segments and LSN

PostgreSQL writes WAL in 16 MB segments. Each record has a Log Sequence Number (LSN)—a byte offset into the WAL stream. Every data page stores the LSN of the last WAL record that modified it. During recovery, PostgreSQL replays WAL records with LSN greater than the page's LSN.

WAL Concept	Description
WAL segment	16 MB file, named by timeline and LSN
LSN	Log Sequence Number (file offset + byte)
REDO point	LSN where recovery starts (after checkpoint)
Checkpoint	Flushes all dirty pages up to a given LSN
Timeline	Branch identifier after PITR or recovery

Streaming Replication

Standby replicas connect to the primary and continuously receive WAL records. PostgreSQL supports three replication modes:

Mode	Durability	Performance	Use Case
Synchronous	Zero data loss	Higher latency	Financial systems
Asynchronous	Small window of loss	Lowest latency	Most applications
Quorum commit	Configurable	Configurable	Multi-DC deployments

In synchronous replication, the primary waits for at least one standby to acknowledge receiving the WAL before reporting commit to the client. This guarantees zero data loss.

Point-in-Time Recovery (PITR)

PITR lets you restore a database to any point in time by:

Restoring a base backup (full data directory snapshot)
Replaying all WAL segments up to the target time or LSN

This is how you recover from "DROP TABLE" at 2:30 PM—restore the backup from 2:00 AM and replay WAL up to 2:29:59 PM.

Kafka: The Distributed WAL

Kafka takes the WAL concept and makes it the entire system. A Kafka topic partition is just a WAL (called a "log")—an ordered, immutable sequence of records appended to the end.

Kafka's WAL Properties

Kafka WAL Property	How It Works
Immutable	Records are never modified after append
Offset-addressed	Each record has a unique, monotonically increasing offset
Segment-based	Log is divided into segments (default 1 GB or 7 days)
Durable on ack=all	All in-sync replicas acknowledge before commit
Compacted	Keeps the latest value for each key (log compaction)

Kafka fsync Configuration

Kafka's durability is controlled by the acks setting:

acks Value	Behavior	Crash Safety	Latency
`acks=0`	Fire and forget	Data loss on leader crash	Lowest
`acks=1`	Leader writes to page cache	Loss if leader crashes before fsync	Medium
`acks=all`	All in-sync replicas ack	Durable (no loss)	Highest

Unlike PostgreSQL which fsyncs on every commit, Kafka batches fsync calls periodically (controlled by log.flush.interval.ms). This gives Kafka absurdly high throughput—millions of messages per second on a single broker—at the cost of a small window of potential data loss.

⚠️

The durability-throughput tradeoff: Kafka can write 2+ million messages per second with acks=1 because it relies on OS page cache and batched fsync. With acks=all, throughput drops by 30-50% due to network round trips. Understand your tolerance for data loss before choosing a configuration. For analytics pipelines, acks=1 is often acceptable. For payment systems, always use acks=all.

fsync Semantics and Ordering Guarantees

The WAL's durability guarantee ultimately rests on one system call: fsync (or fdatasync on Linux). Understanding what fsync actually promises is critical.

What fsync Guarantees

Ordering: All writes issued before fsync are on disk before the call returns
Durability: Data survives power loss and OS crash
Atomicity: fsync of a file ensures metadata (size, timestamps) is also flushed

What fsync Does NOT Guarantee

Ordering between files: fsync on file A doesn't order writes to file B (use sync_file_range for ordering)
Storage controller cache: Some RAID controllers and SSDs lie about fsync (battery-backed write cache mitigates this)
Filesystem crash consistency: If you write to multiple files, fsync each one or use a journaling filesystem

Batched fsync and Group Commit

PostgreSQL uses group commit to amortize fsync costs across multiple concurrent transactions:

Optimization	What It Does	Benefit
Group commit	Batch fsync for multiple transactions	Up to 10x more transactions per second
WAL buffer	Keep WAL records in memory before flush	Reduces fsync frequency
Unlogged tables	Skip WAL for transient data	Up to 5x faster inserts
WAL compression	Compress WAL records	Reduces I/O and archive storage

Checkpointing

The WAL grows forever unless you periodically truncate it. Checkpointing is the process of:

Flushing all dirty data pages (modified by WAL records) to disk
Recording the REDO point—the LSN from which recovery must start
Truncating WAL segments before the REDO point

Checkpoint Frequency Tradeoff

Frequent Checkpoints	Infrequent Checkpoints
Faster crash recovery	Slower crash recovery
More I/O during normal operation	Less I/O overhead
Smaller WAL on disk	Larger WAL segments
More predictable performance	Recovery can take hours

PostgreSQL's checkpoint configuration:

Parameter	Default	Effect
`checkpoint_timeout`	5 minutes	Max time between checkpoints
`max_wal_size`	1 GB	Target max WAL size
`checkpoint_completion_target`	0.5	Spread checkpoint I/O over 50% of interval

WAL in PostgreSQL vs Kafka vs LevelDB

While all three use the WAL concept, their implementations differ dramatically based on their workload:

Property	PostgreSQL	Kafka	LevelDB / RocksDB
Unit of write	Row/tuple changes	Messages (variable size)	Key-value pairs
WAL format	Binary, page-oriented	Binary, message set	Binary, record batch
Segment size	16 MB	Configurable (default 1 GB)	Configurable (default 2 MB)
fsync policy	Every commit (group commit)	Periodic + on commit (acks=all)	Every write (sync=true) or periodic
Crash recovery	REDO from checkpoint	Replay from last checkpoint	REPLAY from MANIFEST
Replication	WAL shipping / streaming	ISR (In-Sync Replicas)	Raft-based (via etcd)
Compaction	Not native (use archive)	Log compaction	Compaction (merge)
Archive	WAL archive to S3/NFS	Tiered storage (KIP-405)	Backup via checkpoint

LevelDB / RocksDB WAL

LevelDB and RocksDB use a WAL called the "log" (confusingly, the same term Kafka uses). Each write is appended to the current WAL file. When the WAL reaches a threshold, the memtable is flushed to an SSTable, and the WAL is recycled.

WAL in MySQL InnoDB

MySQL InnoDB uses a REDO log (its name for WAL) combined with an UNDO log for MVCC (Multi-Version Concurrency Control). The REDO log stores physical page-level changes; the UNDO log stores logical rollback information.

InnoDB REDO Log Architecture

MySQL WAL Parameter	Default	Description
`innodb_log_buffer_size`	16 MB	In-memory buffer before flushing to disk
`innodb_log_file_size`	48 MB per file	Size of each REDO log file
`innodb_log_files_in_group`	2	Number of REDO log files (total = size * count)
`innodb_flush_log_at_trx_commit`	1	Controls fsync behavior (see table below)
`innodb_flush_log_at_timeout`	1 second	Time-based flush interval

innodb_flush_log_at_trx_commit Settings

Value	Behavior	Durability	Write Ops/sec
1	fsync on every commit (default)	Full ACID	~500 (NVMe)
2	Write to OS cache on commit, flush every second	Loss of 1 second of data	~5000
0	Write to OS cache every second	Loss of 1 second of data	~8000

The difference between 1 and 2 is the difference between "zero data loss" and "at most 1 second of data loss." For most web applications, setting 2 is acceptable and provides a 5-10x write throughput improvement.

Doublewrite Buffer

InnoDB has an additional protection: the doublewrite buffer. Before writing a page to its final location, InnoDB writes a copy to the doublewrite buffer. This prevents torn pages—situations where only part of a 16 KB page is written due to a crash.

Physical vs Logical WAL

WAL implementations fall into two categories: physical and logical logging.

Property	Physical Logging	Logical Logging
What is logged	Byte-level changes to pages	SQL statements or high-level operations
Example	"Page 42, offset 128: set bytes to X"	"UPDATE users SET balance = 100 WHERE id = 42"
Replay	Bit-identical page reconstruction	Re-execute the statement
Size per write	Small (just the changed bytes)	Larger (full statement text)
Replication	Must be exact replica (same page format)	Works across different versions
Parallel replay	Hard (page conflicts)	Easy (statement-level isolation)
Use case	PostgreSQL, InnoDB	MySQL binlog (statement mode), logical replication

PostgreSQL uses physical WAL. MySQL supports both: physical REDO log + logical binlog. The binlog is used for replication and PITR because it can be replayed on a different MySQL version.

Hybrid Logical-Physical Logging (MySQL Binlog)

MySQL's binary log (binlog) records logical changes at the row level:

text

### UPDATE `db`.`users`
### WHERE
###   @1=42 /* id */
###   @2='old@email.com' /* email */
### SET
###   @2='new@email.com' /* email */

This format is schema-independent and can be replayed on replicas with different hardware or MySQL versions.

WAL Archiving Strategies

WAL archiving separates the durability concern from the capacity concern. The live WAL is on fast local storage (NVMe). The archive is on cheaper, durable storage (S3, HDFS).

Strategy	How It Works	Recovery Time	Storage Cost
Continuous archiving	Every WAL segment is copied to archive immediately	Minutes (last segment)	Low (S3)
Batch archiving	Archive every N segments or every M minutes	Up to M minutes of loss	Low (S3)
No archive	Keep WAL on local disk only	Limited to local WAL retention	Lowest (no archive cost)
WAL-E / WAL-G	PostgreSQL-specific, compresses and encrypts WAL to S3	Seconds per GB replayed	Low + compression

PostgreSQL's archive_command is a simple shell command that runs after each WAL segment is filled:

text

archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'

Archive Tool	Compression	Encryption	Parallel Upload	Supported Backends
pgBackRest	Yes (zstd, lz4)	Yes	Yes	S3, Azure, GCS, local
WAL-G	Yes (zstd, lzma)	Yes	Yes	S3, GCS, Azure, Swift
barman	Yes (gzip, bzip2)	Yes	No (rsync)	SSH, S3

ARIES Recovery Algorithm

The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery algorithm is the theoretical foundation used by most WAL-based systems. It has three phases:

Phase 1: Analysis

Scan the WAL from the last checkpoint to find:

Dirty pages (modified but not yet flushed)
In-progress transactions (started but not committed)

Phase 2: REDO

Replay all REDO records from the REDO point forward. This is idempotent—replaying a change that was already applied is safe because the page LSN is checked.

Phase 3: UNDO

Rollback uncommitted transactions by applying UNDO records. The UNDO actions are also logged to the WAL to ensure idempotent recovery (a process called "compensation log records" or CLRs).

Compensation Log Records (CLRs)

When the UNDO phase rolls back a transaction, it writes CLRs to the WAL. This ensures that if the system crashes during UNDO, recovery can resume from where it left off rather than redoing the entire undo. CLRs are the mechanism that guarantees idempotent recovery.

Observability and WAL Monitoring

In production, you need to monitor WAL health to prevent the disaster I described in the introduction.

Metric	What It Reveals	Warning Threshold	Critical Threshold
WAL write latency (P99)	Disk I/O pressure	10 ms	100 ms
WAL generation rate	Write throughput	Steady increase	Sudden spike
Replication lag	Health of streaming replicas	10 seconds	5 minutes
Archive lag	Health of WAL archiving	5 minutes	30 minutes
WAL disk usage	Checkpoint frequency	70%	90%
Checkpoint age	Time since last checkpoint	2x checkpoint_timeout	3x checkpoint_timeout

PostgreSQL Specific Monitoring

Run these queries proactively:

sql

-- Check WAL generation rate
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / 1024 / 1024 AS wal_mb;

-- Check replication lag
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- Check checkpoint stats
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time
FROM pg_stat_bgwriter;

⚠️

The WAL disk full scenario: When the WAL disk reaches 100%, the database PANICS and shuts down. There is no graceful degradation. PostgreSQL's pg_ctl will refuse to start until you free WAL space. This is why monitoring WAL disk usage is critical—set alerts at 70% full, page at 85%, and treat 90% as a critical incident. I learned this the hard way on that weekend I mentioned at the start of this article.

WAL Tradeoffs and Optimizations

The fsync Cost

fsync is expensive—typically 1-10 ms per call. On spinning disks, it can be 10+ ms as the disk must physically flush its cache. On SSDs, it's 1-3 ms but still blocks the calling thread.

Storage	Typical fsync latency	Transactions/second (single-threaded)
HDD (7200 RPM)	8-12 ms	~100
SATA SSD	1-3 ms	~500
NVMe SSD	0.1-0.5 ms	~5000
RAM disk (battery-backed)	0.01-0.05 ms	~20000

WAL Compression

PostgreSQL and Kafka support WAL compression. This reduces I/O and storage at the cost of CPU:

Workload	WAL Size (uncompressed)	WAL Size (compressed)	CPU Overhead
OLTP (many small writes)	100 GB	30-40 GB	5-10%
Bulk load	500 GB	100-150 GB	15-20%
Analytics (wide rows)	200 GB	50-60 GB	8-12%

WAL on Separate Disk

For maximum performance, put the WAL on a separate physical disk or NVMe device. This prevents:

WAL writes competing with data page reads for I/O bandwidth
WAL fsync waiting on data file writeback

Synchronous vs Asynchronous WAL Writes

Write Mode	Behavior	Crash Guarantee	Best For
Synchronous	fsync before COMMIT returns	Zero data loss	Financial, transactional
Deferred sync	fsync on timer (Kafka default)	At most N ms of data loss	Analytics, streaming
Async (no fsync)	No fsync at all	Data loss on crash	Test environments, caching
Group commit	Batch fsync across transactions	Zero data loss (same as sync)	High-throughput OLTP

WAL File System Choices

The filesystem hosting the WAL matters more than you might think:

Filesystem	WAL Performance	Crash Recovery	Notes
XFS	Excellent (delayed allocation)	Fast	PostgreSQL recommended
ext4	Very good	Fast	Good default on Linux
ZFS	Good (needs tuning)	Fast	Enables compression at filesystem level
NTFS	Good	Moderate	Windows deployments

Filesystem optimization tips:

Use noatime mount option (skip access time updates)
Use a separate filesystem for WAL (tune inode size for many small WAL files)
Consider O_DIRECT for WAL writes (PostgreSQL: wal_sync_method = open_datasync)
Disable filesystem journaling on the WAL volume if using battery-backed RAID (trade durability for performance)

WAL in Cloud Environments

When running databases in the cloud, WAL placement requires special consideration:

Cloud	WAL Placement Strategy	Performance
AWS	gp3 or io2 EBS volume dedicated to WAL	100-500 MB/s
AWS	Instance store (NVMe) for WAL, EBS for data	1-3 GB/s (ephemeral!)
GCP	Local SSD for WAL, persistent disk for data	1+ GB/s
Azure	Premium SSD v2 for WAL	200-500 MB/s

⚠️

If your WAL disk fails: If the WAL device fails and you can't archive WAL segments, you cannot recover past the last checkpoint. This is a single point of failure. In production, use RAID-1 (mirroring) for WAL devices, or use a distributed filesystem with redundancy. In cloud environments, use a multi-AZ deployment where the WAL is synchronously replicated across availability zones.

Summary

The Write-Ahead Log is the foundation of durability in every storage system that matters. The rule is simple—write to the log before modifying data—but the implementations range from PostgreSQL's feature-rich WAL to Kafka's high-throughput distributed log to LevelDB's minimalist approach.

Concept	Key Takeaway
Core protocol	Write WAL before data page; fsync WAL before acknowledging commit
REDO records	Describe how to replay the change after crash
Checkpoint	Flush dirty pages, truncate old WAL
fsync	The system call that guarantees durability
Group commit	Batch fsync across transactions
WAL in PostgreSQL	Streaming replication, PITR, archive
WAL in Kafka	Distributed commit log, topic partitions

Understanding WAL gives you a mental model for how data survives crashes, how replication works, and how point-in-time recovery operates. Whether you are configuring PostgreSQL's checkpoint intervals, tuning Kafka's fsync settings, or designing your own storage engine, the WAL is the foundation you build on.

"A database without a WAL is like flying without a parachute. It works great until it doesn't, and then recovery is catastrophic." — That database admin who learned the hard way

Bloom Filters: The Probabilistic Data Structure You Can't Ignore

Raft Consensus Algorithm: Leader Election, Log Replication, and Safety