TL;DR
LSM-Trees achieve high write throughput by converting random writes into sequential I/O: writes go to an in-memory buffer (memtable) first, which is periodically flushed to immutable disk segments (SSTables). Reads are slower — they may check multiple levels — so reads are amplified while writes are optimized.
A bottom-up explanation of LSM-Trees: memtables, SSTables, compaction strategies, and why write-heavy databases love them. In this post, I'll walk through the key concepts with simplified code examples that mirror what production implementations do.
Why Not B-Trees?
B-Trees are the dominant structure in read-optimized databases like PostgreSQL. They support O(log n) reads and update pages in place. But in-place updates mean random I/O: updating a key deep in the tree requires a read-modify-write cycle at an arbitrary disk location. For write-heavy workloads like logging, time-series data, and metrics ingestion, that random I/O becomes the bottleneck.
The LSM-Tree Insight
Log-Structured Merge Trees (LSM-Trees) replace random writes with sequential appends. Instead of modifying data in place, every write is appended to an in-memory buffer called a memtable. When the memtable reaches a size threshold, it is flushed to disk as an immutable, sorted file called an SSTable (Sorted String Table). Deletes are handled as tombstone markers rather than actual deletions.
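Tombstones follow directly from immutability: since SSTables are never modified, a delete is just another write. A toy sketch, assuming the MemTable type implemented later in this post; the sentinel value is an assumption for illustration (real engines tag entries with a type flag in the on-disk encoding instead):

```go
// tombstone is a reserved sentinel value marking a key as deleted.
// An assumption for this sketch; real engines use a type byte in
// the entry encoding rather than a magic string.
const tombstone = "\x00__tombstone__"

// Delete appends a tombstone. Older versions of the key in SSTables
// are shadowed until compaction physically drops them. Like Put, it
// reports whether the memtable now needs flushing.
func (m *MemTable) Delete(key string) bool {
	return m.Put(key, tombstone)
}
```

Any read that finds a tombstone must report the key as absent rather than falling through to older levels, which would resurrect deleted data.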
Write Path
- Append the write to a Write-Ahead Log (WAL) on disk — this is crash recovery insurance.
- Insert the key-value pair into the in-memory memtable (typically a skip list or red-black tree for sorted order).
- When the memtable exceeds its size threshold, freeze it and flush it to a new L0 SSTable on disk (these first three steps are sketched in code after this list).
- Background compaction merges SSTables, removing deleted keys and pushing data down to deeper levels (L1, L2, etc.).
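Put together, the foreground portion of this path is only a few lines. A minimal sketch, assuming the MemTable type implemented later in this post; the DB wrapper, its fields, and the file-naming scheme are invented for illustration:

```go
import (
	"fmt"
	"os"
)

// DB is a hypothetical wrapper tying a WAL to the memtables and
// on-disk levels. (Level is defined in the read-path sketch below.)
type DB struct {
	wal       *os.File    // append-only write-ahead log
	memtable  *MemTable   // active write buffer
	immutable []*MemTable // frozen memtables awaiting flush
	levels    []Level     // on-disk SSTable levels, L0 first
	flushSeq  int         // monotonically increasing SSTable file number
}

func (db *DB) Put(key, value string) error {
	// 1. Durability first: append to the WAL and sync before
	//    acknowledging, so a crash before the flush can be replayed.
	//    (Tab-delimited records are a toy encoding.)
	if _, err := fmt.Fprintf(db.wal, "%s\t%s\n", key, value); err != nil {
		return err
	}
	if err := db.wal.Sync(); err != nil {
		return err
	}
	// 2. Insert into the in-memory memtable; Put reports whether the
	//    size threshold was crossed.
	if db.memtable.Put(key, value) {
		// 3. Freeze-and-flush to a new L0 SSTable. (A real engine
		//    swaps in a fresh memtable and flushes in the background.)
		db.flushSeq++
		return db.memtable.Flush(fmt.Sprintf("sstable-L0-%06d.gob", db.flushSeq))
	}
	return nil
}
```

Step 4, compaction, runs in background threads; a sketch of its core merge loop appears in the Compaction Strategies section below.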
Read Path
Reads are more complex: check the active memtable first (most recent writes), then any immutable memtables (frozen, awaiting flush to L0), then L0 SSTables (unmerged, may overlap), then L1 and deeper (sorted, non-overlapping key ranges). A read that touches every level is expensive; this is read amplification.
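In code, the read path is a cascade from newest data to oldest, stopping at the first hit. A minimal sketch, building on the hypothetical DB wrapper from the write-path sketch and the tombstone sentinel defined earlier:

```go
// Level abstracts lookup within one on-disk level. For L0 this
// would scan each (possibly overlapping) file newest-first; for L1+
// it can pick the single file whose key range covers the key.
type Level interface {
	Search(key string) (string, bool)
}

// found interprets a hit: a tombstone means "definitely deleted",
// so the search stops instead of falling through to older levels.
func found(v string) (string, bool) {
	if v == tombstone {
		return "", false
	}
	return v, true
}

func (db *DB) Get(key string) (string, bool) {
	// 1. Active memtable: the most recent writes.
	if v, ok := db.memtable.Get(key); ok {
		return found(v)
	}
	// 2. Immutable memtables, newest first.
	for i := len(db.immutable) - 1; i >= 0; i-- {
		if v, ok := db.immutable[i].Get(key); ok {
			return found(v)
		}
	}
	// 3. On-disk levels: L0 first, then L1 and deeper.
	for _, level := range db.levels {
		if v, ok := level.Search(key); ok {
			return found(v)
		}
	}
	return "", false // touched every level: worst-case read amplification
}
```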
Bloom Filters
Each SSTable has a Bloom filter — a probabilistic data structure that can say "definitely not in this file" or "probably in this file." Reads skip SSTables where the Bloom filter returns negative, dramatically reducing I/O on point lookups.
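The mechanism fits in a few dozen lines. A toy sketch using double hashing over FNV-1a; the sizing and hash-count choices are illustrative, and production filters (e.g., RocksDB's block-based filters) are considerably more tuned:

```go
import "hash/fnv"

// Bloom is a toy Bloom filter: k derived hash functions set/test k
// bits per key. False positives are possible; false negatives are not.
type Bloom struct {
	bits []uint64 // bit array packed into 64-bit words
	k    uint64   // number of probes per key
}

func NewBloom(nBits, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (nBits+63)/64), k: k}
}

// probes derives two base hashes from one FNV-1a hash via the common
// double-hashing trick: position_i = h1 + i*h2 (mod nBits).
func (b *Bloom) probes(key string) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 = h.Sum64()
	h2 = h1>>33 | h1<<31 // cheap derived second hash
	return
}

func (b *Bloom) Add(key string) {
	h1, h2 := b.probes(key)
	n := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % n
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MightContain returns false only if Add was never called for key.
func (b *Bloom) MightContain(key string) bool {
	h1, h2 := b.probes(key)
	n := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % n
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false // definitely not in this SSTable
		}
	}
	return true // probably present; read the file to confirm
}
```

Wiring it in: Flush would build one filter per SSTable as it writes keys, and the read path would consult MightContain before opening each file.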
Compaction Strategies
| Strategy | Write Amp | Read Amp | Space Amp | Best For |
|---|---|---|---|---|
| Leveled | High | Low | Low | Read-heavy workloads |
| Size-Tiered | Low | High | High | Write-heavy, append-only |
| FIFO | None | High | High | Time-series, recent data only |
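Whichever strategy triggers it, the heart of compaction is the same operation: merge sorted runs, keep only the newest version of each duplicated key, and drop tombstones once no older data can exist below them. A minimal two-way merge sketch over in-memory entries, assuming the tombstone sentinel from earlier; Entry is introduced here for illustration:

```go
// Entry is one key/value pair from a sorted run (SSTable).
type Entry struct{ Key, Value string }

// mergeRuns merges two sorted runs into one; on duplicate keys the
// entry from `newer` shadows the one from `older`. Tombstones are
// dropped only when compacting into the bottom level, where no
// older version of the key can still exist beneath them.
func mergeRuns(newer, older []Entry, bottomLevel bool) []Entry {
	out := make([]Entry, 0, len(newer)+len(older))
	i, j := 0, 0
	for i < len(newer) || j < len(older) {
		var e Entry
		switch {
		case j == len(older) || (i < len(newer) && newer[i].Key < older[j].Key):
			e, i = newer[i], i+1
		case i == len(newer) || older[j].Key < newer[i].Key:
			e, j = older[j], j+1
		default: // same key in both runs: newer version wins
			e, i, j = newer[i], i+1, j+1
		}
		if bottomLevel && e.Value == tombstone {
			continue // safe to garbage-collect the deletion marker
		}
		out = append(out, e)
	}
	return out
}
```

Real engines do a k-way merge across many files with iterators and write size-capped output files, but the shadowing and tombstone logic is the same idea.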
Minimal Go Implementation
```go
package main

import (
	"encoding/gob"
	"os"
	"sort"
	"sync"
)

// MemTable is the in-memory write buffer. A plain map keeps this
// sketch short; production engines use a skip list or red-black
// tree so entries are already sorted at flush time.
type MemTable struct {
	data    map[string]string
	mu      sync.RWMutex
	size    int // approximate bytes buffered
	maxSize int // flush threshold in bytes
}

func NewMemTable(maxSize int) *MemTable {
	return &MemTable{data: make(map[string]string), maxSize: maxSize}
}

// Put buffers a write and reports whether a flush is now needed.
func (m *MemTable) Put(key, value string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	m.size += len(key) + len(value)
	return m.size >= m.maxSize // true = flush needed
}

func (m *MemTable) Get(key string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

// Flush writes the memtable to a sorted SSTable file.
func (m *MemTable) Flush(path string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // SSTables must be sorted by key
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for _, k := range keys {
		if err := enc.Encode([2]string{k, m.data[k]}); err != nil {
			return err
		}
	}
	return nil
}
```

Real-World Usage
RocksDB (Facebook), LevelDB (Google), Cassandra, ScyllaDB, and ClickHouse all use LSM-Tree variants. The structure is particularly dominant in time-series and event-log storage where writes are append-heavy and reads are primarily range scans over recent time windows.
The fundamental tradeoff — write-optimized at the cost of read amplification — means LSM-Trees shine in workloads where write throughput is the bottleneck and reads can tolerate slightly higher latency. For workloads that need the lowest possible read latency at a moderate write rate, B-Trees remain superior.