Build Your Own Key-Value Storage Engine—Week 8
Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual), hosted by ScyllaDB, the monstrously fast and scalable database. The conference starts today and lasts two days. Tomorrow, I'm giving a talk called Working on Complex Systems. I'd be glad to see you there 🙂.

Agenda

- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations
- Week 3: Durability with Write-Ahead Logging
- Week 4: Deletes, Tombstones, and Compaction
- Week 5: Leveling and Key-Range Partitioning
- Week 6: Block-Based SSTables and Indexing
- Week 7: Bloom Filters and Trie Memtable
- Week 8: Concurrency

Over this series, you built a working LSM tree: you flush to persist the memtable to disk and compact to reclaim space. Yet, you've been single-threaded so far. This week, we lift that constraint: flush and compaction will run in the background while you keep serving requests.

There are many ways to add concurrency. The approach here is to introduce a versioned, ref-counted catalog that lets readers take a stable snapshot while background flush/compaction proceeds. A catalog holds references to:

- The current memtable.
- The current WAL.
- The current MANIFEST.

Each request pins one catalog version for the duration of the operation. When a flush or compaction completes, the system creates a new catalog version. Old resources (e.g., obsolete SSTables) are not deleted immediately. Instead, each catalog tracks a refcount of in-flight requests. Once an old catalog's refcount drops to zero and a newer catalog exists, you can safely garbage-collect the resources that appear in the old version but not in the new one.

For example, with two catalog versions (red = older version's elements, blue = newer version's elements):

Once we can guarantee that catalog v1 is no longer referenced, we can delete the old MANIFEST, SST-2, and SST-3.
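To make the pin/unpin and garbage-collection logic concrete, here is a minimal sketch, assuming a Python implementation (the names `Catalog`, `CatalogManager`, `acquire`, `release`, and `publish` are my own, not from the series). It tracks only SSTable paths; the old WAL and MANIFEST would be collected the same way:

```python
import threading

class Catalog:
    """One immutable version of the engine's live resources."""
    def __init__(self, version, memtable, wal_path, manifest_path, sstables):
        self.version = version
        self.memtable = memtable
        self.wal_path = wal_path
        self.manifest_path = manifest_path
        self.sstables = set(sstables)  # paths of the SSTables this version references
        self.refcount = 0              # in-flight requests pinning this version

class CatalogManager:
    """Keeps catalog versions in memory; GC runs when old versions are unpinned."""
    def __init__(self, first_catalog):
        self._lock = threading.Lock()
        self._catalogs = [first_catalog]  # ordered oldest -> newest
        self.obsolete = []  # paths safe to delete; a real engine would os.remove them

    def acquire(self):
        # Pin the latest catalog for the duration of one request.
        with self._lock:
            latest = self._catalogs[-1]
            latest.refcount += 1
            return latest

    def release(self, catalog):
        # Unpin, then try to garbage-collect superseded versions.
        with self._lock:
            catalog.refcount -= 1
            self._gc()

    def publish(self, catalog):
        # Install a new version after a flush or compaction completes.
        with self._lock:
            assert catalog.version > self._catalogs[-1].version  # monotonic versions
            self._catalogs.append(catalog)
            self._gc()

    def _gc(self):
        # Drop the oldest version while it is unreferenced and superseded,
        # collecting the files it references that no remaining version does.
        while len(self._catalogs) > 1 and self._catalogs[0].refcount == 0:
            old = self._catalogs.pop(0)
            live = set().union(*(c.sstables for c in self._catalogs))
            self.obsolete.extend(sorted(old.sstables - live))
```

A request would then `acquire()` a catalog, serve the read or write using that version's memtable and file references, and `release()` it in a `finally` block so the refcount always drops.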
Another example: a flush produced a new memtable and WAL file:

In this case, once catalog v1 has no remaining references, we can free the old memtable and delete the old WAL file.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord

Your Tasks

Catalog

Add a data structure that tracks:

- Memtable reference.
- MANIFEST path.
- Version (monotonic).
- Refcount of active readers.

Implement a manager that keeps catalog versions in memory:

- Acquire: Pick the latest catalog. Increment its refcount.
- Release: Decrement the refcount of the catalog. If the refcount is zero and there's a newer catalog version: Remove the current catalog. Remove elements present in the current catalog but not in the latest version (files, WAL, etc.).
- Create: Create a new catalog based on the provided data. Assign a unique, monotonic version.

At startup:

- Read from the authoritative MANIFEST (latest MANIFEST file).
- Treat any files not listed in the MANIFEST as orphans and delete them.
- Read all WAL files you still have on disk, in order, to rebuild the in-memory state.
- Create the current catalog version from the reconstructed state.
- Start the background worker.

In a nutshell, flush and compaction will move to the background. You'll use internal queues plus worker pools to ensure no overlapping work on the same resources: at most one flush running at a time, and at most one compaction running at a time.

Compaction:

- Keep the same trigger: Every 10,000 update requests.
- Behavior: Do not run compaction in the request path. On compaction trigger: Post a notification to an internal queue and return.
- Worker: A single background thread listens on the queue and runs the actual work. Similar compaction process, except: Do not overwrite the existing WAL file. Instead, create a new file. Create a new catalog that references the new WAL.

Flush:

- Keep the same trigger: When the memtable contains 2,000 entries.
- Behavior: Do not run flush in the request path.
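The "internal queue plus single background thread" pattern is the same for both compaction and flush. A minimal sketch, assuming Python (`BackgroundWorker` and its methods are illustrative names, not from the series):

```python
import queue
import threading

class BackgroundWorker:
    """One instance per job kind, so at most one flush and one compaction run at a time."""
    def __init__(self, name, job_fn):
        self._queue = queue.Queue()
        self._job_fn = job_fn
        self._thread = threading.Thread(target=self._run, name=name, daemon=True)
        self._thread.start()

    def notify(self, payload):
        # Called from the request path: enqueue the work and return immediately.
        self._queue.put(payload)

    def _run(self):
        # Single consumer: jobs of this kind never overlap.
        while True:
            payload = self._queue.get()
            if payload is None:        # shutdown sentinel
                return
            self._job_fn(payload)      # the actual flush or compaction work
            self._queue.task_done()

    def stop(self):
        # Drain remaining jobs, then stop the thread.
        self._queue.put(None)
        self._thread.join()
```

You would create one worker for flush and one for compaction; the request path only calls `notify()` when a trigger fires.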
On flush trigger:

- Allocate a new memtable and create a new WAL file for subsequent writes.
- Post a notification to an internal queue.
- Return immediately to the caller.

Worker: A single background thread listens on the queue and runs the actual work. Similar flush process, except: Do not overwrite the existing MANIFEST file. Instead, create a new file. Create a new catalog referencing the new MANIFEST.

Every request follows the same pattern:

- Acquire a catalog from the manager.
- Do the operation using paths/refs from that catalog.
- Release the catalog.

Concurrent requests make deterministic assertions harder. For example, suppose the validation file contains the following requests that can run in parallel:

What should you assert for : , , or ?

To make validation deterministic, you will handle barriers: all requests before a barrier must finish before starting the next block. You will also relax checks: a GET is valid if it returns any value written for a key before the last barrier. A similar example with barriers:

- The first two requests run in parallel.
- The first barrier waits for both to complete.
- The first GET should accept either  or .
- The second request should accept only .

The new validation file is a sequence of blocks separated by barrier instructions:

- All the lines between two barriers form a block.
- On a barrier instruction, wait for all in-flight requests in the current block to finish before starting the next block.
- / lines are issued in parallel within their block.
-  lines are also issued in parallel within their block.
-  means the response must be any one of the list values.

Download and run your client against a new file: concurrency.txt.

When the memtable reaches 80% of capacity:

- Pre-allocate the next memtable in memory.
- Pre-create/rotate to the next WAL on disk.

That's it for the whole series. You implemented a fully functional LSM tree:

- Started with a memtable (hashtable) and a flush that writes immutable SSTables to disk.
- Added a WAL for durability.
- Handled deletes and compaction to reclaim space.
- Introduced leveling and key-range partitioning to speed up reads.
- Switched to block-based SSTables with indexing.
- Added Bloom filters and replaced the memtable with a radix trie for faster lookups.
- Finally, introduced concurrency: a simple, single-threaded foreground path with flush and compaction running in the background.

I hope you had fun building it. Thank you for following the series, and special thanks to our partner, ScyllaDB!

To get more information on how things work in production databases, you can read how RocksDB keeps track of live SST files. The structure is inspired by RocksDB's .

Conflict resolution is one aspect we're missing in the series (maybe as a follow-up?). A versioned catalog is enough for reads, but what about conflicting writes? Suppose two clients, Alice and Bob, update the same key around the same time. A simple policy to resolve conflicts is latest wins. The database can serialize operations for the same key to ensure the latest request wins:

In this example, the database ends up with  as the latest state. This approach works with one node. But what about databases composed of multiple nodes? Say the two requests go to two different nodes at roughly the same time:

With multiple nodes, the database must resolve conflicts consistently. There are two common ways:

- Coordination via a leader (consensus): Route both writes to the same leader node, which resolves the conflict and determines the end state.
- Reconcile with comparable timestamps: Attach a timestamp to each write and store it with the key. By timestamp, we don't mean relying on wall-clock time but a logical clock, so that "later" is well-defined across nodes.

If we go with the second approach and start storing data, we also unlock something production systems use: consistent snapshots. A read can include a timestamp, and the database returns the last version at or before that time, hence providing a consistent view of the data, even while flush/compaction runs in the background.
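To make timestamped versions concrete, here is a hedged sketch, assuming Python (`MVCCStore` and its API are my own illustration, not from the series): each key keeps every (timestamp, value) pair, writes carry a logical timestamp with latest-wins semantics, and reads can target a point in time:

```python
import bisect
from collections import defaultdict

class MVCCStore:
    """Per-key version lists: writes append, reads pick a point in time."""
    def __init__(self):
        self._ts = defaultdict(list)    # key -> sorted logical timestamps
        self._vals = defaultdict(list)  # key -> values, parallel to _ts
        self._clock = 0                 # logical clock: "later" is well-defined

    def put(self, key, value, ts=None):
        # A write may carry its own logical timestamp (e.g. assigned by
        # another node); otherwise we advance the local clock.
        if ts is None:
            self._clock += 1
            ts = self._clock
        else:
            self._clock = max(self._clock, ts)
        i = bisect.bisect_right(self._ts[key], ts)  # keep versions sorted
        self._ts[key].insert(i, ts)
        self._vals[key].insert(i, value)

    def get(self, key, ts=None):
        """Return the latest value at or before ts (default: the latest overall)."""
        ts_list = self._ts.get(key)
        if not ts_list:
            return None
        if ts is None:
            ts = self._clock
        i = bisect.bisect_right(ts_list, ts)
        return self._vals[key][i - 1] if i > 0 else None
```

Old versions can be pruned once no reader can target them anymore, with the same refcount idea used for the catalog.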
This pattern has a name: Multi-Version Concurrency Control (MVCC). It involves keeping multiple versions per key instead of only the last one, reading using a chosen point in time, and deleting old versions once they are no longer needed. See how ScyllaDB handles timestamp conflict resolution for more information.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

❤️ If you enjoyed this post, please hit the like button.