Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability
Understand the WAL algorithm - the foundation of crash recovery in PostgreSQL, durability in Kafka, and replication in databases. Learn fsync ordering, checkpointing, and point-in-time recovery.
Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability
I once spent a weekend recovering a PostgreSQL database that lost 15 minutes of transactions because the WAL disk filled up. That weekend taught me more about durability, fsync semantics, and checkpointing than any textbook ever could. The Write-Ahead Log is the single most important piece of your database's crash recovery story, and understanding it deeply separates engineers who build reliable systems from those who just hope they work.
Every database you use—PostgreSQL, MySQL, SQLite, MongoDB, Cassandra—uses a WAL (or its equivalent, the journal or commit log). Kafka's entire architecture is a distributed WAL. If you understand WAL, you understand how data survives crashes, how replication works, and how point-in-time recovery is possible.
The core insight: A Write-Ahead Log implements one simple rule: before you modify a data page on disk, write a record describing that modification to the log. If the system crashes, you replay the log to reconstruct lost writes. This is the foundation of crash recovery in every serious storage engine.
The Fundamental WAL Protocol
The WAL protocol has three rules that every system must follow:
- Write to the log first: Before modifying a data page, append a redo record to the log
- Flush the log to disk: Ensure the log record is durable (fsync) before applying the change
- Apply the change: Modify the data page in memory or on disk
This guarantees that even if the system crashes mid-operation, you can replay the log to recover.
Crash Scenarios
| Crash Timing | Recovery Outcome | Why |
|---|---|---|
| Before WAL write | Transaction lost (atomicity) | Data never made it to any durable store |
| Before WAL fsync | Transaction lost (atomicity) | OS buffer cache not flushed to disk |
| After WAL fsync, before page write | Transaction recovered (durability) | WAL has the redo record |
| After page write, before WAL fsync | Partial page write | WAL replay fixes torn pages |
REDO and UNDO Logging
There are two flavors of WAL records:
| Record Type | Purpose | Example | Crash Recovery |
|---|---|---|---|
| REDO only | Can redo the change from log | "Set page 5, offset 100 to value 42" | Replay all REDO records |
| UNDO only | Can undo the change from log | "Page 5 originally had value 7" | Rollback uncommitted transactions |
| REDO + UNDO | Both (MySQL InnoDB) | Full before/after image | Replay committed, rollback uncommitted |
PostgreSQL uses REDO-only WAL (it stores the new page image). MySQL InnoDB uses REDO + UNDO. The tradeoff is WAL size vs. recovery flexibility.
PostgreSQL WAL: Streaming Replication and PITR
PostgreSQL's WAL is the most sophisticated implementation in open-source databases. It enables three critical features:
WAL Segments and LSN
PostgreSQL writes WAL in 16 MB segments. Each record has a Log Sequence Number (LSN)—a byte offset into the WAL stream. Every data page stores the LSN of the last WAL record that modified it. During recovery, PostgreSQL replays WAL records with LSN greater than the page's LSN.
| WAL Concept | Description |
|---|---|
| WAL segment | 16 MB file, named by timeline and LSN |
| LSN | Log Sequence Number (file offset + byte) |
| REDO point | LSN where recovery starts (after checkpoint) |
| Checkpoint | Flushes all dirty pages up to a given LSN |
| Timeline | Branch identifier after PITR or recovery |
Streaming Replication
Standby replicas connect to the primary and continuously receive WAL records. PostgreSQL supports three replication modes:
| Mode | Durability | Performance | Use Case |
|---|---|---|---|
| Synchronous | Zero data loss | Higher latency | Financial systems |
| Asynchronous | Small window of loss | Lowest latency | Most applications |
| Quorum commit | Configurable | Configurable | Multi-DC deployments |
In synchronous replication, the primary waits for at least one standby to acknowledge receiving the WAL before reporting commit to the client. This guarantees zero data loss.
Point-in-Time Recovery (PITR)
PITR lets you restore a database to any point in time by:
- Restoring a base backup (full data directory snapshot)
- Replaying all WAL segments up to the target time or LSN
This is how you recover from "DROP TABLE" at 2:30 PM—restore the backup from 2:00 AM and replay WAL up to 2:29:59 PM.
Kafka: The Distributed WAL
Kafka takes the WAL concept and makes it the entire system. A Kafka topic partition is just a WAL (called a "log")—an ordered, immutable sequence of records appended to the end.
Kafka's WAL Properties
| Kafka WAL Property | How It Works |
|---|---|
| Immutable | Records are never modified after append |
| Offset-addressed | Each record has a unique, monotonically increasing offset |
| Segment-based | Log is divided into segments (default 1 GB or 7 days) |
| Durable on ack=all | All in-sync replicas acknowledge before commit |
| Compacted | Keeps the latest value for each key (log compaction) |
Kafka fsync Configuration
Kafka's durability is controlled by the acks setting:
| acks Value | Behavior | Crash Safety | Latency |
|---|---|---|---|
acks=0 | Fire and forget | Data loss on leader crash | Lowest |
acks=1 | Leader writes to page cache | Loss if leader crashes before fsync | Medium |
acks=all | All in-sync replicas ack | Durable (no loss) | Highest |
Unlike PostgreSQL which fsyncs on every commit, Kafka batches fsync calls periodically (controlled by log.flush.interval.ms). This gives Kafka absurdly high throughput—millions of messages per second on a single broker—at the cost of a small window of potential data loss.
The durability-throughput tradeoff: Kafka can write 2+ million messages per second with acks=1 because it relies on OS page cache and batched fsync. With acks=all, throughput drops by 30-50% due to network round trips. Understand your tolerance for data loss before choosing a configuration. For analytics pipelines, acks=1 is often acceptable. For payment systems, always use acks=all.
fsync Semantics and Ordering Guarantees
The WAL's durability guarantee ultimately rests on one system call: fsync (or fdatasync on Linux). Understanding what fsync actually promises is critical.
What fsync Guarantees
- Ordering: All writes issued before fsync are on disk before the call returns
- Durability: Data survives power loss and OS crash
- Atomicity: fsync of a file ensures metadata (size, timestamps) is also flushed
What fsync Does NOT Guarantee
- Ordering between files: fsync on file A doesn't order writes to file B (use
sync_file_rangefor ordering) - Storage controller cache: Some RAID controllers and SSDs lie about fsync (battery-backed write cache mitigates this)
- Filesystem crash consistency: If you write to multiple files, fsync each one or use a journaling filesystem
Batched fsync and Group Commit
PostgreSQL uses group commit to amortize fsync costs across multiple concurrent transactions:
| Optimization | What It Does | Benefit |
|---|---|---|
| Group commit | Batch fsync for multiple transactions | Up to 10x more transactions per second |
| WAL buffer | Keep WAL records in memory before flush | Reduces fsync frequency |
| Unlogged tables | Skip WAL for transient data | Up to 5x faster inserts |
| WAL compression | Compress WAL records | Reduces I/O and archive storage |
Checkpointing
The WAL grows forever unless you periodically truncate it. Checkpointing is the process of:
- Flushing all dirty data pages (modified by WAL records) to disk
- Recording the REDO point—the LSN from which recovery must start
- Truncating WAL segments before the REDO point
Checkpoint Frequency Tradeoff
| Frequent Checkpoints | Infrequent Checkpoints |
|---|---|
| Faster crash recovery | Slower crash recovery |
| More I/O during normal operation | Less I/O overhead |
| Smaller WAL on disk | Larger WAL segments |
| More predictable performance | Recovery can take hours |
PostgreSQL's checkpoint configuration:
| Parameter | Default | Effect |
|---|---|---|
checkpoint_timeout | 5 minutes | Max time between checkpoints |
max_wal_size | 1 GB | Target max WAL size |
checkpoint_completion_target | 0.5 | Spread checkpoint I/O over 50% of interval |
WAL in PostgreSQL vs Kafka vs LevelDB
While all three use the WAL concept, their implementations differ dramatically based on their workload:
| Property | PostgreSQL | Kafka | LevelDB / RocksDB |
|---|---|---|---|
| Unit of write | Row/tuple changes | Messages (variable size) | Key-value pairs |
| WAL format | Binary, page-oriented | Binary, message set | Binary, record batch |
| Segment size | 16 MB | Configurable (default 1 GB) | Configurable (default 2 MB) |
| fsync policy | Every commit (group commit) | Periodic + on commit (acks=all) | Every write (sync=true) or periodic |
| Crash recovery | REDO from checkpoint | Replay from last checkpoint | REPLAY from MANIFEST |
| Replication | WAL shipping / streaming | ISR (In-Sync Replicas) | Raft-based (via etcd) |
| Compaction | Not native (use archive) | Log compaction | Compaction (merge) |
| Archive | WAL archive to S3/NFS | Tiered storage (KIP-405) | Backup via checkpoint |
LevelDB / RocksDB WAL
LevelDB and RocksDB use a WAL called the "log" (confusingly, the same term Kafka uses). Each write is appended to the current WAL file. When the WAL reaches a threshold, the memtable is flushed to an SSTable, and the WAL is recycled.
WAL in MySQL InnoDB
MySQL InnoDB uses a REDO log (its name for WAL) combined with an UNDO log for MVCC (Multi-Version Concurrency Control). The REDO log stores physical page-level changes; the UNDO log stores logical rollback information.
InnoDB REDO Log Architecture
| MySQL WAL Parameter | Default | Description |
|---|---|---|
innodb_log_buffer_size | 16 MB | In-memory buffer before flushing to disk |
innodb_log_file_size | 48 MB per file | Size of each REDO log file |
innodb_log_files_in_group | 2 | Number of REDO log files (total = size * count) |
innodb_flush_log_at_trx_commit | 1 | Controls fsync behavior (see table below) |
innodb_flush_log_at_timeout | 1 second | Time-based flush interval |
innodb_flush_log_at_trx_commit Settings
| Value | Behavior | Durability | Write Ops/sec |
|---|---|---|---|
| 1 | fsync on every commit (default) | Full ACID | ~500 (NVMe) |
| 2 | Write to OS cache on commit, flush every second | Loss of 1 second of data | ~5000 |
| 0 | Write to OS cache every second | Loss of 1 second of data | ~8000 |
The difference between 1 and 2 is the difference between "zero data loss" and "at most 1 second of data loss." For most web applications, setting 2 is acceptable and provides a 5-10x write throughput improvement.
Doublewrite Buffer
InnoDB has an additional protection: the doublewrite buffer. Before writing a page to its final location, InnoDB writes a copy to the doublewrite buffer. This prevents torn pages—situations where only part of a 16 KB page is written due to a crash.
Physical vs Logical WAL
WAL implementations fall into two categories: physical and logical logging.
| Property | Physical Logging | Logical Logging |
|---|---|---|
| What is logged | Byte-level changes to pages | SQL statements or high-level operations |
| Example | "Page 42, offset 128: set bytes to X" | "UPDATE users SET balance = 100 WHERE id = 42" |
| Replay | Bit-identical page reconstruction | Re-execute the statement |
| Size per write | Small (just the changed bytes) | Larger (full statement text) |
| Replication | Must be exact replica (same page format) | Works across different versions |
| Parallel replay | Hard (page conflicts) | Easy (statement-level isolation) |
| Use case | PostgreSQL, InnoDB | MySQL binlog (statement mode), logical replication |
PostgreSQL uses physical WAL. MySQL supports both: physical REDO log + logical binlog. The binlog is used for replication and PITR because it can be replayed on a different MySQL version.
Hybrid Logical-Physical Logging (MySQL Binlog)
MySQL's binary log (binlog) records logical changes at the row level:
### UPDATE `db`.`users`
### WHERE
### @1=42 /* id */
### @2='old@email.com' /* email */
### SET
### @2='new@email.com' /* email */
This format is schema-independent and can be replayed on replicas with different hardware or MySQL versions.
WAL Archiving Strategies
WAL archiving separates the durability concern from the capacity concern. The live WAL is on fast local storage (NVMe). The archive is on cheaper, durable storage (S3, HDFS).
| Strategy | How It Works | Recovery Time | Storage Cost |
|---|---|---|---|
| Continuous archiving | Every WAL segment is copied to archive immediately | Minutes (last segment) | Low (S3) |
| Batch archiving | Archive every N segments or every M minutes | Up to M minutes of loss | Low (S3) |
| No archive | Keep WAL on local disk only | Limited to local WAL retention | Lowest (no archive cost) |
| WAL-E / WAL-G | PostgreSQL-specific, compresses and encrypts WAL to S3 | Seconds per GB replayed | Low + compression |
PostgreSQL's archive_command is a simple shell command that runs after each WAL segment is filled:
archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
| Archive Tool | Compression | Encryption | Parallel Upload | Supported Backends |
|---|---|---|---|---|
| pgBackRest | Yes (zstd, lz4) | Yes | Yes | S3, Azure, GCS, local |
| WAL-G | Yes (zstd, lzma) | Yes | Yes | S3, GCS, Azure, Swift |
| barman | Yes (gzip, bzip2) | Yes | No (rsync) | SSH, S3 |
ARIES Recovery Algorithm
The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery algorithm is the theoretical foundation used by most WAL-based systems. It has three phases:
Phase 1: Analysis
Scan the WAL from the last checkpoint to find:
- Dirty pages (modified but not yet flushed)
- In-progress transactions (started but not committed)
Phase 2: REDO
Replay all REDO records from the REDO point forward. This is idempotent—replaying a change that was already applied is safe because the page LSN is checked.
Phase 3: UNDO
Rollback uncommitted transactions by applying UNDO records. The UNDO actions are also logged to the WAL to ensure idempotent recovery (a process called "compensation log records" or CLRs).
Compensation Log Records (CLRs)
When the UNDO phase rolls back a transaction, it writes CLRs to the WAL. This ensures that if the system crashes during UNDO, recovery can resume from where it left off rather than redoing the entire undo. CLRs are the mechanism that guarantees idempotent recovery.
Observability and WAL Monitoring
In production, you need to monitor WAL health to prevent the disaster I described in the introduction.
| Metric | What It Reveals | Warning Threshold | Critical Threshold |
|---|---|---|---|
| WAL write latency (P99) | Disk I/O pressure | 10 ms | 100 ms |
| WAL generation rate | Write throughput | Steady increase | Sudden spike |
| Replication lag | Health of streaming replicas | 10 seconds | 5 minutes |
| Archive lag | Health of WAL archiving | 5 minutes | 30 minutes |
| WAL disk usage | Checkpoint frequency | 70% | 90% |
| Checkpoint age | Time since last checkpoint | 2x checkpoint_timeout | 3x checkpoint_timeout |
PostgreSQL Specific Monitoring
Run these queries proactively:
-- Check WAL generation rate
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / 1024 / 1024 AS wal_mb;
-- Check replication lag
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
-- Check checkpoint stats
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time
FROM pg_stat_bgwriter;
The WAL disk full scenario: When the WAL disk reaches 100%, the database PANICS and shuts down. There is no graceful degradation. PostgreSQL's pg_ctl will refuse to start until you free WAL space. This is why monitoring WAL disk usage is critical—set alerts at 70% full, page at 85%, and treat 90% as a critical incident. I learned this the hard way on that weekend I mentioned at the start of this article.
WAL Tradeoffs and Optimizations
The fsync Cost
fsync is expensive—typically 1-10 ms per call. On spinning disks, it can be 10+ ms as the disk must physically flush its cache. On SSDs, it's 1-3 ms but still blocks the calling thread.
| Storage | Typical fsync latency | Transactions/second (single-threaded) |
|---|---|---|
| HDD (7200 RPM) | 8-12 ms | ~100 |
| SATA SSD | 1-3 ms | ~500 |
| NVMe SSD | 0.1-0.5 ms | ~5000 |
| RAM disk (battery-backed) | 0.01-0.05 ms | ~20000 |
WAL Compression
PostgreSQL and Kafka support WAL compression. This reduces I/O and storage at the cost of CPU:
| Workload | WAL Size (uncompressed) | WAL Size (compressed) | CPU Overhead |
|---|---|---|---|
| OLTP (many small writes) | 100 GB | 30-40 GB | 5-10% |
| Bulk load | 500 GB | 100-150 GB | 15-20% |
| Analytics (wide rows) | 200 GB | 50-60 GB | 8-12% |
WAL on Separate Disk
For maximum performance, put the WAL on a separate physical disk or NVMe device. This prevents:
- WAL writes competing with data page reads for I/O bandwidth
- WAL fsync waiting on data file writeback
Synchronous vs Asynchronous WAL Writes
| Write Mode | Behavior | Crash Guarantee | Best For |
|---|---|---|---|
| Synchronous | fsync before COMMIT returns | Zero data loss | Financial, transactional |
| Deferred sync | fsync on timer (Kafka default) | At most N ms of data loss | Analytics, streaming |
| Async (no fsync) | No fsync at all | Data loss on crash | Test environments, caching |
| Group commit | Batch fsync across transactions | Zero data loss (same as sync) | High-throughput OLTP |
WAL File System Choices
The filesystem hosting the WAL matters more than you might think:
| Filesystem | WAL Performance | Crash Recovery | Notes |
|---|---|---|---|
| XFS | Excellent (delayed allocation) | Fast | PostgreSQL recommended |
| ext4 | Very good | Fast | Good default on Linux |
| ZFS | Good (needs tuning) | Fast | Enables compression at filesystem level |
| NTFS | Good | Moderate | Windows deployments |
Filesystem optimization tips:
- Use
noatimemount option (skip access time updates) - Use a separate filesystem for WAL (tune inode size for many small WAL files)
- Consider
O_DIRECTfor WAL writes (PostgreSQL:wal_sync_method = open_datasync) - Disable filesystem journaling on the WAL volume if using battery-backed RAID (trade durability for performance)
WAL in Cloud Environments
When running databases in the cloud, WAL placement requires special consideration:
| Cloud | WAL Placement Strategy | Performance |
|---|---|---|
| AWS | gp3 or io2 EBS volume dedicated to WAL | 100-500 MB/s |
| AWS | Instance store (NVMe) for WAL, EBS for data | 1-3 GB/s (ephemeral!) |
| GCP | Local SSD for WAL, persistent disk for data | 1+ GB/s |
| Azure | Premium SSD v2 for WAL | 200-500 MB/s |
If your WAL disk fails: If the WAL device fails and you can't archive WAL segments, you cannot recover past the last checkpoint. This is a single point of failure. In production, use RAID-1 (mirroring) for WAL devices, or use a distributed filesystem with redundancy. In cloud environments, use a multi-AZ deployment where the WAL is synchronously replicated across availability zones.
Summary
Summary
The Write-Ahead Log is the foundation of durability in every storage system that matters. The rule is simple—write to the log before modifying data—but the implementations range from PostgreSQL's feature-rich WAL to Kafka's high-throughput distributed log to LevelDB's minimalist approach.
| Concept | Key Takeaway |
|---|---|
| Core protocol | Write WAL before data page; fsync WAL before acknowledging commit |
| REDO records | Describe how to replay the change after crash |
| Checkpoint | Flush dirty pages, truncate old WAL |
| fsync | The system call that guarantees durability |
| Group commit | Batch fsync across transactions |
| WAL in PostgreSQL | Streaming replication, PITR, archive |
| WAL in Kafka | Distributed commit log, topic partitions |
Understanding WAL gives you a mental model for how data survives crashes, how replication works, and how point-in-time recovery operates. Whether you are configuring PostgreSQL's checkpoint intervals, tuning Kafka's fsync settings, or designing your own storage engine, the WAL is the foundation you build on.
"A database without a WAL is like flying without a parachute. It works great until it doesn't, and then recovery is catastrophic." — That database admin who learned the hard way