Case Study · Feb 2026 · 14 min read

How LSM-Trees Work: The Write-Optimized Structure Behind RocksDB

Storage Engines · RocksDB · Go

TL;DR

LSM-Trees achieve high write throughput by converting random writes into sequential I/O: writes go to an in-memory buffer (memtable) first, which is periodically flushed to immutable disk segments (SSTables). Reads are slower — they may check multiple levels — so reads are amplified while writes are optimized.

A bottom-up explanation of LSM-Trees — memtables, SSTables, compaction strategies, and why write-heavy databases love them. In this post, I'll walk through the key concepts with code examples drawn from real production implementations.

Why Not B-Trees?

B-Trees are the dominant structure in read-heavy databases like PostgreSQL. They support O(log n) reads and updates in-place. But in-place updates mean random I/O: updating a key deep in the tree requires a read-modify-write cycle at an arbitrary disk location. For write-heavy workloads — logging, time-series, metrics — this is a bottleneck.

The LSM-Tree Insight

Log-Structured Merge Trees (LSM-Trees) replace random writes with sequential appends. Instead of modifying data in place, every write is appended to an in-memory buffer called a memtable. When the memtable reaches a size threshold, it is flushed to disk as an immutable, sorted file called an SSTable (Sorted String Table). Deletes are handled as tombstone markers rather than actual deletions.
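To make the tombstone idea concrete, here is a minimal sketch using a plain map as the memtable. The `tombstone` sentinel value is a hypothetical choice for illustration; real engines mark deletes with a record-type byte rather than a magic value.

```go
package main

import "fmt"

// tombstone is a hypothetical sentinel marking a deleted key.
// Production engines use a type flag in the record header instead.
const tombstone = "\x00__TOMBSTONE__"

type memtable map[string]string

// Delete writes a tombstone rather than removing the key; the
// physical removal happens later, during compaction.
func (m memtable) Delete(key string) { m[key] = tombstone }

func (m memtable) Get(key string) (string, bool) {
	v, ok := m[key]
	if !ok || v == tombstone {
		return "", false // deleted or never written
	}
	return v, true
}

func main() {
	m := memtable{}
	m["a"] = "1"
	m.Delete("a")
	_, ok := m.Get("a")
	fmt.Println(ok) // false: the tombstone shadows the old value
}
```

The key point: a delete is just another append. Older copies of the key may still exist in SSTables on disk, so the tombstone must survive until compaction can guarantee no older version remains.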

Write Path

  1. Append the write to a Write-Ahead Log (WAL) on disk — this is crash recovery insurance.
  2. Insert the key-value pair into the in-memory memtable (typically a skip list or red-black tree for sorted order).
  3. When the memtable exceeds the size threshold, freeze it and flush it to a new L0 SSTable on disk.
  4. Background compaction merges SSTables, removing deleted keys and pushing data down to deeper levels (L1, L2, etc.).

Read Path

Reads are more complex: check the memtable first (most recent writes), then immutable memtables (recently flushed but not yet L0), then L0 SSTables (unmerged, may overlap), then L1 and deeper (sorted, no overlap). A read that touches all levels is expensive — this is read amplification.
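The lookup order above can be sketched as a newest-first scan over a slice of levels. Here each "level" is just a map standing in for a memtable or SSTable; this is an illustration of the ordering, not of real SSTable file access.

```go
package main

import "fmt"

// get checks levels newest-first: memtable, immutable memtables,
// L0 tables, then deeper levels. The first hit wins, because it
// is by construction the most recent version of the key.
func get(key string, levels []map[string]string) (string, bool) {
	for _, lvl := range levels {
		if v, ok := lvl[key]; ok {
			return v, true
		}
	}
	return "", false // checked every level: read amplification
}

func main() {
	memtable := map[string]string{"a": "3"}
	immutable := map[string]string{"a": "2", "b": "2"}
	l0 := map[string]string{"a": "1", "c": "1"}
	levels := []map[string]string{memtable, immutable, l0}

	v, _ := get("a", levels)
	fmt.Println(v) // "3": the memtable shadows older versions below it
	v, _ = get("c", levels)
	fmt.Println(v) // "1": found only at L0, after checking every level above
}
```

A miss is the worst case: the key must be looked for at every level before the engine can say it does not exist — which is exactly the cost Bloom filters attack.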

Bloom Filters

Each SSTable has a Bloom filter — a probabilistic data structure that can say "definitely not in this file" or "probably in this file." Reads skip SSTables where the Bloom filter returns negative, dramatically reducing I/O on point lookups.
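A toy Bloom filter fits in a few lines. This sketch uses double hashing over two FNV hashes to derive k probe positions — a common trick, though real SSTable filters are tuned (roughly 10 bits per key for about a 1% false-positive rate) and use faster hashes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy Bloom filter: k hash probes into a fixed bit array.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// probe derives the i-th position from two FNV hashes (double hashing),
// avoiding the need for k independent hash functions.
func (b *bloom) probe(key string, i int) int {
	h1, h2 := fnv.New64(), fnv.New64a()
	h1.Write([]byte(key))
	h2.Write([]byte(key))
	return int((h1.Sum64() + uint64(i)*h2.Sum64()) % uint64(len(b.bits)))
}

func (b *bloom) Add(key string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.probe(key, i)] = true
	}
}

// MayContain returns false only when the key is definitely absent;
// true means "probably present" (false positives are possible).
func (b *bloom) MayContain(key string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.probe(key, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.Add("user:42")
	fmt.Println(f.MayContain("user:42")) // true: an added key always hits
	// A lookup for an absent key is usually false; Bloom filters
	// allow rare false positives but never false negatives.
	fmt.Println(f.MayContain("user:99"))
}
```

On the read path, this check runs before opening an SSTable: a `false` answer lets the read skip that file entirely.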

Compaction Strategies

| Strategy | Write Amp | Read Amp | Space Amp | Best For |
| --- | --- | --- | --- | --- |
| Leveled | High | Low | Low | Read-heavy workloads |
| Size-Tiered | Low | High | High | Write-heavy, append-only |
| FIFO | None | High | High | Time-series, recent data only |
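Whatever the strategy, the core compaction step is a merge of sorted runs. Here is a sketch of merging two SSTables represented as sorted slices: the newer run shadows the older one, and tombstones are dropped only when compacting into the bottom level (my `entry`, `tombstone`, and `compact` names are invented for the example).

```go
package main

import "fmt"

type entry struct{ key, val string }

const tombstone = "\x00DEL" // hypothetical delete marker

// compact merges two sorted runs; on equal keys the newer version wins.
// Tombstones are discarded only when bottom is true, i.e. no older data
// exists anywhere deeper that the tombstone still needs to shadow.
func compact(newer, older []entry, bottom bool) []entry {
	out := []entry{}
	i, j := 0, 0
	for i < len(newer) || j < len(older) {
		var e entry
		switch {
		case j >= len(older) || (i < len(newer) && newer[i].key < older[j].key):
			e = newer[i]
			i++
		case i >= len(newer) || older[j].key < newer[i].key:
			e = older[j]
			j++
		default: // same key: keep the newer version, drop the older
			e = newer[i]
			i++
			j++
		}
		if bottom && e.val == tombstone {
			continue // safe to discard: nothing deeper can resurrect it
		}
		out = append(out, e)
	}
	return out
}

func main() {
	newer := []entry{{"a", tombstone}, {"c", "3"}}
	older := []entry{{"a", "1"}, {"b", "2"}}
	fmt.Println(compact(newer, older, true)) // [{b 2} {c 3}]
}
```

This merge is sequential I/O on both input and output files, which is why compaction, though it rewrites data (write amplification), still avoids the random-write pattern that hurts B-Trees.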

Minimal Go Implementation

lsm.go
```go
package main

import (
	"encoding/gob"
	"os"
	"sort"
	"sync"
)

type MemTable struct {
	data    map[string]string
	mu      sync.RWMutex
	size    int
	maxSize int
}

func NewMemTable(maxSize int) *MemTable {
	return &MemTable{data: make(map[string]string), maxSize: maxSize}
}

// Put inserts a key-value pair; the return value signals that the
// memtable has reached its threshold and should be flushed.
func (m *MemTable) Put(key, value string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
	m.size += len(key) + len(value)
	return m.size >= m.maxSize // true = flush needed
}

func (m *MemTable) Get(key string) (string, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.data[key]
	return v, ok
}

// Flush writes the memtable to a sorted SSTable file.
func (m *MemTable) Flush(path string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for _, k := range keys {
		if err := enc.Encode([2]string{k, m.data[k]}); err != nil {
			return err
		}
	}
	return nil
}
```

Real-World Usage

RocksDB (Facebook), LevelDB (Google), Cassandra, ScyllaDB, and ClickHouse all use LSM-Tree variants. The structure is particularly dominant in time-series and event-log storage where writes are append-heavy and reads are primarily range scans over recent time windows.

The fundamental tradeoff — write-optimized at the cost of read amplification — means LSM-Trees shine in workloads where write throughput is the bottleneck and reads can tolerate slightly higher latency. For workloads requiring the lowest possible read latency with moderate write rate, B-Trees remain superior.