Core Algorithms

Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability

Understand the WAL algorithm - the foundation of crash recovery in PostgreSQL, durability in Kafka, and replication in databases. Learn fsync ordering, checkpointing, and point-in-time recovery.

WALcrash-recoverydurabilityPostgreSQLKafkafsynccheckpoint

Write-Ahead Log (WAL): The Foundation of Crash Recovery and Durability

I once spent a weekend recovering a PostgreSQL database that lost 15 minutes of transactions because the WAL disk filled up. That weekend taught me more about durability, fsync semantics, and checkpointing than any textbook ever could. The Write-Ahead Log is the single most important piece of your database's crash recovery story, and understanding it deeply separates engineers who build reliable systems from those who just hope they work.

Every database you use—PostgreSQL, MySQL, SQLite, MongoDB, Cassandra—uses a WAL (or its equivalent, the journal or commit log). Kafka's entire architecture is a distributed WAL. If you understand WAL, you understand how data survives crashes, how replication works, and how point-in-time recovery is possible.

The core insight: A Write-Ahead Log implements one simple rule: before you modify a data page on disk, write a record describing that modification to the log. If the system crashes, you replay the log to reconstruct lost writes. This is the foundation of crash recovery in every serious storage engine.


The Fundamental WAL Protocol

The WAL protocol has three rules that every system must follow:

  1. Write to the log first: Before modifying a data page, append a redo record to the log
  2. Flush the log to disk: Ensure the log record is durable (fsync) before applying the change
  3. Apply the change: Modify the data page in memory or on disk

This guarantees that even if the system crashes mid-operation, you can replay the log to recover.

Crash Scenarios

Crash TimingRecovery OutcomeWhy
Before WAL writeTransaction lost (atomicity)Data never made it to any durable store
Before WAL fsyncTransaction lost (atomicity)OS buffer cache not flushed to disk
After WAL fsync, before page writeTransaction recovered (durability)WAL has the redo record
After page write, before WAL fsyncPartial page writeWAL replay fixes torn pages

REDO and UNDO Logging

There are two flavors of WAL records:

Record TypePurposeExampleCrash Recovery
REDO onlyCan redo the change from log"Set page 5, offset 100 to value 42"Replay all REDO records
UNDO onlyCan undo the change from log"Page 5 originally had value 7"Rollback uncommitted transactions
REDO + UNDOBoth (MySQL InnoDB)Full before/after imageReplay committed, rollback uncommitted

PostgreSQL uses REDO-only WAL (it stores the new page image). MySQL InnoDB uses REDO + UNDO. The tradeoff is WAL size vs. recovery flexibility.


PostgreSQL WAL: Streaming Replication and PITR

PostgreSQL's WAL is the most sophisticated implementation in open-source databases. It enables three critical features:

WAL Segments and LSN

PostgreSQL writes WAL in 16 MB segments. Each record has a Log Sequence Number (LSN)—a byte offset into the WAL stream. Every data page stores the LSN of the last WAL record that modified it. During recovery, PostgreSQL replays WAL records with LSN greater than the page's LSN.

WAL ConceptDescription
WAL segment16 MB file, named by timeline and LSN
LSNLog Sequence Number (file offset + byte)
REDO pointLSN where recovery starts (after checkpoint)
CheckpointFlushes all dirty pages up to a given LSN
TimelineBranch identifier after PITR or recovery

Streaming Replication

Standby replicas connect to the primary and continuously receive WAL records. PostgreSQL supports three replication modes:

ModeDurabilityPerformanceUse Case
SynchronousZero data lossHigher latencyFinancial systems
AsynchronousSmall window of lossLowest latencyMost applications
Quorum commitConfigurableConfigurableMulti-DC deployments

In synchronous replication, the primary waits for at least one standby to acknowledge receiving the WAL before reporting commit to the client. This guarantees zero data loss.

Point-in-Time Recovery (PITR)

PITR lets you restore a database to any point in time by:

  1. Restoring a base backup (full data directory snapshot)
  2. Replaying all WAL segments up to the target time or LSN

This is how you recover from "DROP TABLE" at 2:30 PM—restore the backup from 2:00 AM and replay WAL up to 2:29:59 PM.


Kafka: The Distributed WAL

Kafka takes the WAL concept and makes it the entire system. A Kafka topic partition is just a WAL (called a "log")—an ordered, immutable sequence of records appended to the end.

Kafka's WAL Properties

Kafka WAL PropertyHow It Works
ImmutableRecords are never modified after append
Offset-addressedEach record has a unique, monotonically increasing offset
Segment-basedLog is divided into segments (default 1 GB or 7 days)
Durable on ack=allAll in-sync replicas acknowledge before commit
CompactedKeeps the latest value for each key (log compaction)

Kafka fsync Configuration

Kafka's durability is controlled by the acks setting:

acks ValueBehaviorCrash SafetyLatency
acks=0Fire and forgetData loss on leader crashLowest
acks=1Leader writes to page cacheLoss if leader crashes before fsyncMedium
acks=allAll in-sync replicas ackDurable (no loss)Highest

Unlike PostgreSQL which fsyncs on every commit, Kafka batches fsync calls periodically (controlled by log.flush.interval.ms). This gives Kafka absurdly high throughput—millions of messages per second on a single broker—at the cost of a small window of potential data loss.

⚠️

The durability-throughput tradeoff: Kafka can write 2+ million messages per second with acks=1 because it relies on OS page cache and batched fsync. With acks=all, throughput drops by 30-50% due to network round trips. Understand your tolerance for data loss before choosing a configuration. For analytics pipelines, acks=1 is often acceptable. For payment systems, always use acks=all.


fsync Semantics and Ordering Guarantees

The WAL's durability guarantee ultimately rests on one system call: fsync (or fdatasync on Linux). Understanding what fsync actually promises is critical.

What fsync Guarantees

  • Ordering: All writes issued before fsync are on disk before the call returns
  • Durability: Data survives power loss and OS crash
  • Atomicity: fsync of a file ensures metadata (size, timestamps) is also flushed

What fsync Does NOT Guarantee

  • Ordering between files: fsync on file A doesn't order writes to file B (use sync_file_range for ordering)
  • Storage controller cache: Some RAID controllers and SSDs lie about fsync (battery-backed write cache mitigates this)
  • Filesystem crash consistency: If you write to multiple files, fsync each one or use a journaling filesystem

Batched fsync and Group Commit

PostgreSQL uses group commit to amortize fsync costs across multiple concurrent transactions:

OptimizationWhat It DoesBenefit
Group commitBatch fsync for multiple transactionsUp to 10x more transactions per second
WAL bufferKeep WAL records in memory before flushReduces fsync frequency
Unlogged tablesSkip WAL for transient dataUp to 5x faster inserts
WAL compressionCompress WAL recordsReduces I/O and archive storage

Checkpointing

The WAL grows forever unless you periodically truncate it. Checkpointing is the process of:

  1. Flushing all dirty data pages (modified by WAL records) to disk
  2. Recording the REDO point—the LSN from which recovery must start
  3. Truncating WAL segments before the REDO point

Checkpoint Frequency Tradeoff

Frequent CheckpointsInfrequent Checkpoints
Faster crash recoverySlower crash recovery
More I/O during normal operationLess I/O overhead
Smaller WAL on diskLarger WAL segments
More predictable performanceRecovery can take hours

PostgreSQL's checkpoint configuration:

ParameterDefaultEffect
checkpoint_timeout5 minutesMax time between checkpoints
max_wal_size1 GBTarget max WAL size
checkpoint_completion_target0.5Spread checkpoint I/O over 50% of interval

WAL in PostgreSQL vs Kafka vs LevelDB

While all three use the WAL concept, their implementations differ dramatically based on their workload:

PropertyPostgreSQLKafkaLevelDB / RocksDB
Unit of writeRow/tuple changesMessages (variable size)Key-value pairs
WAL formatBinary, page-orientedBinary, message setBinary, record batch
Segment size16 MBConfigurable (default 1 GB)Configurable (default 2 MB)
fsync policyEvery commit (group commit)Periodic + on commit (acks=all)Every write (sync=true) or periodic
Crash recoveryREDO from checkpointReplay from last checkpointREPLAY from MANIFEST
ReplicationWAL shipping / streamingISR (In-Sync Replicas)Raft-based (via etcd)
CompactionNot native (use archive)Log compactionCompaction (merge)
ArchiveWAL archive to S3/NFSTiered storage (KIP-405)Backup via checkpoint

LevelDB / RocksDB WAL

LevelDB and RocksDB use a WAL called the "log" (confusingly, the same term Kafka uses). Each write is appended to the current WAL file. When the WAL reaches a threshold, the memtable is flushed to an SSTable, and the WAL is recycled.


WAL in MySQL InnoDB

MySQL InnoDB uses a REDO log (its name for WAL) combined with an UNDO log for MVCC (Multi-Version Concurrency Control). The REDO log stores physical page-level changes; the UNDO log stores logical rollback information.

InnoDB REDO Log Architecture

MySQL WAL ParameterDefaultDescription
innodb_log_buffer_size16 MBIn-memory buffer before flushing to disk
innodb_log_file_size48 MB per fileSize of each REDO log file
innodb_log_files_in_group2Number of REDO log files (total = size * count)
innodb_flush_log_at_trx_commit1Controls fsync behavior (see table below)
innodb_flush_log_at_timeout1 secondTime-based flush interval

innodb_flush_log_at_trx_commit Settings

ValueBehaviorDurabilityWrite Ops/sec
1fsync on every commit (default)Full ACID~500 (NVMe)
2Write to OS cache on commit, flush every secondLoss of 1 second of data~5000
0Write to OS cache every secondLoss of 1 second of data~8000

The difference between 1 and 2 is the difference between "zero data loss" and "at most 1 second of data loss." For most web applications, setting 2 is acceptable and provides a 5-10x write throughput improvement.

Doublewrite Buffer

InnoDB has an additional protection: the doublewrite buffer. Before writing a page to its final location, InnoDB writes a copy to the doublewrite buffer. This prevents torn pages—situations where only part of a 16 KB page is written due to a crash.


Physical vs Logical WAL

WAL implementations fall into two categories: physical and logical logging.

PropertyPhysical LoggingLogical Logging
What is loggedByte-level changes to pagesSQL statements or high-level operations
Example"Page 42, offset 128: set bytes to X""UPDATE users SET balance = 100 WHERE id = 42"
ReplayBit-identical page reconstructionRe-execute the statement
Size per writeSmall (just the changed bytes)Larger (full statement text)
ReplicationMust be exact replica (same page format)Works across different versions
Parallel replayHard (page conflicts)Easy (statement-level isolation)
Use casePostgreSQL, InnoDBMySQL binlog (statement mode), logical replication

PostgreSQL uses physical WAL. MySQL supports both: physical REDO log + logical binlog. The binlog is used for replication and PITR because it can be replayed on a different MySQL version.

Hybrid Logical-Physical Logging (MySQL Binlog)

MySQL's binary log (binlog) records logical changes at the row level:

text
### UPDATE `db`.`users`
### WHERE
###   @1=42 /* id */
###   @2='old@email.com' /* email */
### SET
###   @2='new@email.com' /* email */

This format is schema-independent and can be replayed on replicas with different hardware or MySQL versions.


WAL Archiving Strategies

WAL archiving separates the durability concern from the capacity concern. The live WAL is on fast local storage (NVMe). The archive is on cheaper, durable storage (S3, HDFS).

StrategyHow It WorksRecovery TimeStorage Cost
Continuous archivingEvery WAL segment is copied to archive immediatelyMinutes (last segment)Low (S3)
Batch archivingArchive every N segments or every M minutesUp to M minutes of lossLow (S3)
No archiveKeep WAL on local disk onlyLimited to local WAL retentionLowest (no archive cost)
WAL-E / WAL-GPostgreSQL-specific, compresses and encrypts WAL to S3Seconds per GB replayedLow + compression

PostgreSQL's archive_command is a simple shell command that runs after each WAL segment is filled:

text
archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'
Archive ToolCompressionEncryptionParallel UploadSupported Backends
pgBackRestYes (zstd, lz4)YesYesS3, Azure, GCS, local
WAL-GYes (zstd, lzma)YesYesS3, GCS, Azure, Swift
barmanYes (gzip, bzip2)YesNo (rsync)SSH, S3

ARIES Recovery Algorithm

The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) recovery algorithm is the theoretical foundation used by most WAL-based systems. It has three phases:

Phase 1: Analysis

Scan the WAL from the last checkpoint to find:

  • Dirty pages (modified but not yet flushed)
  • In-progress transactions (started but not committed)

Phase 2: REDO

Replay all REDO records from the REDO point forward. This is idempotent—replaying a change that was already applied is safe because the page LSN is checked.

Phase 3: UNDO

Rollback uncommitted transactions by applying UNDO records. The UNDO actions are also logged to the WAL to ensure idempotent recovery (a process called "compensation log records" or CLRs).

Compensation Log Records (CLRs)

When the UNDO phase rolls back a transaction, it writes CLRs to the WAL. This ensures that if the system crashes during UNDO, recovery can resume from where it left off rather than redoing the entire undo. CLRs are the mechanism that guarantees idempotent recovery.


Observability and WAL Monitoring

In production, you need to monitor WAL health to prevent the disaster I described in the introduction.

MetricWhat It RevealsWarning ThresholdCritical Threshold
WAL write latency (P99)Disk I/O pressure10 ms100 ms
WAL generation rateWrite throughputSteady increaseSudden spike
Replication lagHealth of streaming replicas10 seconds5 minutes
Archive lagHealth of WAL archiving5 minutes30 minutes
WAL disk usageCheckpoint frequency70%90%
Checkpoint ageTime since last checkpoint2x checkpoint_timeout3x checkpoint_timeout

PostgreSQL Specific Monitoring

Run these queries proactively:

sql
-- Check WAL generation rate
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') / 1024 / 1024 AS wal_mb;

-- Check replication lag
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- Check checkpoint stats
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time
FROM pg_stat_bgwriter;
⚠️

The WAL disk full scenario: When the WAL disk reaches 100%, the database PANICS and shuts down. There is no graceful degradation. PostgreSQL's pg_ctl will refuse to start until you free WAL space. This is why monitoring WAL disk usage is critical—set alerts at 70% full, page at 85%, and treat 90% as a critical incident. I learned this the hard way on that weekend I mentioned at the start of this article.


WAL Tradeoffs and Optimizations

The fsync Cost

fsync is expensive—typically 1-10 ms per call. On spinning disks, it can be 10+ ms as the disk must physically flush its cache. On SSDs, it's 1-3 ms but still blocks the calling thread.

StorageTypical fsync latencyTransactions/second (single-threaded)
HDD (7200 RPM)8-12 ms~100
SATA SSD1-3 ms~500
NVMe SSD0.1-0.5 ms~5000
RAM disk (battery-backed)0.01-0.05 ms~20000

WAL Compression

PostgreSQL and Kafka support WAL compression. This reduces I/O and storage at the cost of CPU:

WorkloadWAL Size (uncompressed)WAL Size (compressed)CPU Overhead
OLTP (many small writes)100 GB30-40 GB5-10%
Bulk load500 GB100-150 GB15-20%
Analytics (wide rows)200 GB50-60 GB8-12%

WAL on Separate Disk

For maximum performance, put the WAL on a separate physical disk or NVMe device. This prevents:

  • WAL writes competing with data page reads for I/O bandwidth
  • WAL fsync waiting on data file writeback

Synchronous vs Asynchronous WAL Writes

Write ModeBehaviorCrash GuaranteeBest For
Synchronousfsync before COMMIT returnsZero data lossFinancial, transactional
Deferred syncfsync on timer (Kafka default)At most N ms of data lossAnalytics, streaming
Async (no fsync)No fsync at allData loss on crashTest environments, caching
Group commitBatch fsync across transactionsZero data loss (same as sync)High-throughput OLTP

WAL File System Choices

The filesystem hosting the WAL matters more than you might think:

FilesystemWAL PerformanceCrash RecoveryNotes
XFSExcellent (delayed allocation)FastPostgreSQL recommended
ext4Very goodFastGood default on Linux
ZFSGood (needs tuning)FastEnables compression at filesystem level
NTFSGoodModerateWindows deployments

Filesystem optimization tips:

  • Use noatime mount option (skip access time updates)
  • Use a separate filesystem for WAL (tune inode size for many small WAL files)
  • Consider O_DIRECT for WAL writes (PostgreSQL: wal_sync_method = open_datasync)
  • Disable filesystem journaling on the WAL volume if using battery-backed RAID (trade durability for performance)

WAL in Cloud Environments

When running databases in the cloud, WAL placement requires special consideration:

CloudWAL Placement StrategyPerformance
AWSgp3 or io2 EBS volume dedicated to WAL100-500 MB/s
AWSInstance store (NVMe) for WAL, EBS for data1-3 GB/s (ephemeral!)
GCPLocal SSD for WAL, persistent disk for data1+ GB/s
AzurePremium SSD v2 for WAL200-500 MB/s
⚠️

If your WAL disk fails: If the WAL device fails and you can't archive WAL segments, you cannot recover past the last checkpoint. This is a single point of failure. In production, use RAID-1 (mirroring) for WAL devices, or use a distributed filesystem with redundancy. In cloud environments, use a multi-AZ deployment where the WAL is synchronously replicated across availability zones.


Summary


Summary

The Write-Ahead Log is the foundation of durability in every storage system that matters. The rule is simple—write to the log before modifying data—but the implementations range from PostgreSQL's feature-rich WAL to Kafka's high-throughput distributed log to LevelDB's minimalist approach.

ConceptKey Takeaway
Core protocolWrite WAL before data page; fsync WAL before acknowledging commit
REDO recordsDescribe how to replay the change after crash
CheckpointFlush dirty pages, truncate old WAL
fsyncThe system call that guarantees durability
Group commitBatch fsync across transactions
WAL in PostgreSQLStreaming replication, PITR, archive
WAL in KafkaDistributed commit log, topic partitions

Understanding WAL gives you a mental model for how data survives crashes, how replication works, and how point-in-time recovery operates. Whether you are configuring PostgreSQL's checkpoint intervals, tuning Kafka's fsync settings, or designing your own storage engine, the WAL is the foundation you build on.

"A database without a WAL is like flying without a parachute. It works great until it doesn't, and then recovery is catastrophic." — That database admin who learned the hard way