GreatReads - Blog Aggregator · Phoenix Framework

DevOps


The Coder Cafe 1 week ago

LSM Trees Explained

☕ Welcome to The Coder Cafe! Some people reached out after I published the Build Your Own Key-Value Storage Engine series to say they hadn’t gone through all eight posts, but they were curious about the core ideas. So I distilled everything into a single post. No implementation, no exercises, just the core concepts behind LSM trees. Get cozy, grab a coffee, and let’s begin!

Fundamental Insights

To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place. When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection.

LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially. Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward.

The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around.

The in-memory buffer is called the memtable. The sorted files on disk are called SSTables. We’ll look at each in detail.

The Memtable

Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store.
When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput.

Because the memtable is eventually flushed as a sorted file, it needs to support in-order iteration. A hashtable works for lookups but not for in-order iteration: its entries would have to be sorted at flush time, which takes O(n log n). A better choice is an ordered data structure. The most common in practice is a skip list; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too.

Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order.

The memtable doesn’t grow forever. At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL.

The Write-Ahead Log

There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability.

The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk. Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client.
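To make that ordering concrete, here is a minimal sketch of a write path in Python. Everything in it is an assumption made for illustration: the `TinyEngine` name, the one-JSON-object-per-line log format, and the plain dict standing in for an ordered memtable. It forces each log entry to disk with `os.fsync` before touching in-memory state, and it rebuilds the memtable by replaying the log on startup.

```python
import json
import os
import tempfile


class TinyEngine:
    """Toy write path: WAL append and sync first, memtable update second."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memtable = {}   # plain dict for brevity; a real memtable is ordered
        self.next_seq = 0    # monotonic sequence number, one per write
        self._replay()
        self.wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the memtable from the log after a restart."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                self.memtable[rec["key"]] = (rec["seq"], rec["value"])
                self.next_seq = rec["seq"] + 1

    def put(self, key, value):
        seq = self.next_seq
        record = {"seq": seq, "key": key, "value": value}
        # 1. Append to the log and push it all the way to storage.
        self.wal.write(json.dumps(record) + "\n")
        self.wal.flush()              # Python buffer -> operating system
        os.fsync(self.wal.fileno())   # operating system -> disk
        # 2. Only now change in-memory state.
        self.memtable[key] = (seq, value)
        self.next_seq = seq + 1
        # 3. Returning is the acknowledgment to the caller.
        return seq

    def close(self):
        self.wal.close()
```

Writing two values for the same key, closing the engine, and opening a new `TinyEngine` on the same path should bring back the latest value together with its sequence number. A real engine would also have to cope with a torn final record after a crash, which this sketch ignores.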
This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes.

The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet.

One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. To guarantee durability, the engine must call fsync after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though: fsync adds latency to every write. Production systems often use fdatasync instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call fsync once for the group.

The WAL introduces write amplification: the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability.

SSTables

As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable. An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: immutability makes SSTables safe to read concurrently without locking, and sorted order makes lookups inside a file efficient.

In a simple implementation, an SSTable is just a JSON array of key-value pairs, sorted by key. Production systems use a binary block-based format instead. The SSTable is divided into fixed-size blocks, typically 4 KB, though the exact size varies by implementation. Data blocks hold the actual key-value entries.
The SSTable also contains an index block storing the first key of each data block, which makes it possible to binary search for the right block without reading the entire file. In most implementations, the index block is written at the end of the file, since block boundaries are only known after all data blocks have been streamed out.

To look up a key, we read the index block, binary search it to find the right data block, fetch that single block from disk, verify its integrity with a checksum, and then binary search within the block. When the index block is not cached, this means most lookups read two disk pages: the index block and one data block. In practice, index blocks are typically kept in memory, so most lookups require only one disk read.

Each data block also carries a checksum computed over the block’s bytes. Before using the data, the engine verifies the checksum. If they don’t match, the block is corrupted, and the read fails safely rather than returning garbage.

As SSTables accumulate, the engine maintains a catalog file (often called a MANIFEST in systems like RocksDB), which is an append-only log listing all existing SSTables in order of creation. This catalog is the engine’s source of truth for what files exist on disk. On startup, the engine reads it to know which files are live, and replays the WAL to restore the memtable. After a successful flush, the old WAL can be discarded. The data is now safely in an SSTable.

Production systems also compress data blocks, typically with a fast algorithm like Snappy, LZ4, or zstd. Compression reduces disk footprint and I/O at the cost of CPU, and it interacts with block sizing: a compressed block may be smaller than a disk page, so implementations often track both logical and physical block sizes.

LSM trees are optimized for writes. Reads are where the trade-off shows. To look up a key, the engine searches in order of recency: first the memtable, then SSTables from newest to oldest. The first match wins.
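As a rough illustration of that search order, here is a small Python sketch. The `SSTable` class is a hypothetical in-memory stand-in for a sorted file, and `lookup` is a name made up for this example. Besides the value, `lookup` returns how many SSTables it had to search, which makes the cost of a read easy to see.

```python
from bisect import bisect_left


class SSTable:
    """Immutable sorted run, held in memory here to stand in for a file."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, key):
        """Binary search the sorted keys; return (found, value)."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return True, self.values[i]
        return False, None


def lookup(key, memtable, sstables_newest_first):
    """Return (value, number of SSTables searched); value is None if absent."""
    # The memtable always holds the most recent writes, so it goes first.
    if key in memtable:
        return memtable[key], 0
    searched = 0
    for table in sstables_newest_first:
        searched += 1
        found, value = table.get(key)
        if found:
            return value, searched   # first match wins
    return None, searched
```

With a newer table holding one key and an older table holding that key plus another, a lookup of the shared key stops at the newer table, while a lookup of a key that exists nowhere searches both tables before giving up.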
This ordering matters because the same key can appear multiple times across different SSTables. Each write to a key produces a new entry rather than updating the existing one. The newest version is the correct one.

The problem becomes clear as SSTables accumulate. A key that was written once and never updated might still require the engine to search through dozens of SSTables before finding it, or confirming it doesn’t exist. Each SSTable search is a disk read. This is called read amplification: a single logical read triggers multiple physical reads.

For a key that doesn’t exist at all, the engine must check every SSTable before returning a not-found error. That’s the worst case for read amplification, and it gets worse the more SSTables there are.

This is a fundamental tension in LSM trees, and it reflects a deeper principle known as the RUM conjecture: a storage engine can excel at two of reads, updates, and memory efficiency, but not all three at once. LSM trees make a deliberate choice: optimize for updates, accept read amplification as the cost.

The sorted structure also enables efficient range scans. To retrieve all keys between a start key and an end key, the engine scans the memtable in order, then merges sorted streams from the relevant SSTables.

The answer to accumulating SSTables is compaction. Compaction is a background process that takes multiple SSTables, merges them into fewer, cleaner ones, and discards the originals. The result is fewer files to search through, which directly reduces read amplification. It also reclaims disk space consumed by redundant entries: if the same key appears in three different SSTables, compaction keeps only the newest version and discards the rest.

One common algorithm is a k-way merge. The engine opens iterators over all SSTables being compacted, each positioned at the first entry. It uses a min-heap to always pull the smallest key across all iterators.
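Here is a minimal sketch of that heap-based merge in Python, using the standard `heapq` module. The `compact` function and its input shape are assumptions for illustration: each SSTable is modeled as an already sorted list of `(key, value)` pairs with unique keys, and the list of tables is ordered newest first. The sketch also applies the usual rule for duplicates, where the version from the newest table wins.

```python
import heapq


def compact(sstables_newest_first):
    """Merge sorted (key, value) runs, keeping only the newest value per key."""
    iterators = [iter(table) for table in sstables_newest_first]
    heap = []
    # Seed the heap with the first entry of every run. `age` is the run's
    # position in the input (0 = newest), so equal keys pop newest first.
    for age, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], age, first[1]))

    merged = []
    last_key = object()   # sentinel that equals no real key
    while heap:
        key, age, value = heapq.heappop(heap)
        if key != last_key:
            merged.append((key, value))   # first pop for a key is the newest
            last_key = key
        # Refill the heap from the run we just consumed from.
        following = next(iterators[age], None)
        if following is not None:
            heapq.heappush(heap, (following[0], age, following[1]))
    return merged
```

The heap never holds more than one entry per input run, so memory use depends on the number of runs being merged, not on their size. A real engine would stream `merged` into new SSTable files instead of building a list.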
When the same key appears in multiple SSTables, the engine picks the version from the newest SSTable and discards the older ones. The merged output is streamed into new SSTable files. In practice, real systems limit the number of SSTables that can participate in a single compaction run to keep resource consumption under control.

Updating the catalog after compaction requires care. The engine must not delete the old SSTables before the new ones are safely written to disk. The safe sequence is: write new SSTables, fsync, write a new catalog pointing to the new files, fsync, then delete the old SSTables. A crash at any point leaves the engine in a recoverable state: either the old files are still referenced by the old catalog, or the new files are referenced by the new catalog.

Compaction is not free. It consumes I/O and CPU in the background, competing with foreground reads and writes. Every byte of data gets rewritten multiple times across its lifetime, adding to write amplification. Tuning when compaction triggers (and how aggressively it runs) is one of the main knobs in LSM tree performance.

We might expect deletion to be straightforward: find the key, remove it. In an LSM tree, it is anything but straightforward.

SSTables are immutable. We cannot reach into an existing SSTable and remove an entry. So when a key is deleted, the engine writes a special marker to the memtable called a tombstone, an entry that says “this key is deleted”. It eventually gets flushed to an SSTable like any other write.

During reads, the engine respects tombstones. If a tombstone for a key is found before a value for that key, scanning newest to oldest, the key is treated as deleted, and a not-found error is returned. The tombstone shadows any older value.

The tricky part is knowing when it is safe to discard a tombstone during compaction.
Consider this situation: a tombstone for a key exists in a newer SSTable, and an old value for that key exists in an older SSTable that hasn’t been compacted yet. If we drop the tombstone during compaction without also removing the old value, the old value becomes visible again. Deleted data reappears. This is called data resurrection, and it is a correctness bug.

NOTE: Correctness here means the engine returns what was actually written, not a stale or deleted value. This is different from consistency in the distributed systems sense, which describes the guarantees clients have about which version of data they see across replicas.

The rule is strict: a tombstone can only be dropped when the engine can guarantee that no older value for that key exists anywhere below it on disk. In practice, this means the compaction must include the oldest SSTables that could still hold a shadowed value.

This is one of those details that seems minor until we get it wrong. A storage engine that resurrects deleted data is not a storage engine we can trust. Getting this right requires knowing exactly where older values can hide, which brings us to how SSTables are organized on disk.

Basic compaction, merging all SSTables into one flat pool, works but doesn’t scale. As the dataset grows, a flat pool of SSTables means reads still have to check many files. Leveling is the structural answer.

In a leveled LSM tree, SSTables are organized into levels: L0, L1, L2, and so on. Each level has different rules.

L0 is the landing zone. When the memtable flushes, the resulting SSTable lands in L0. L0 files can have overlapping key ranges: two L0 files might both contain entries for the same key. This is acceptable because L0 files are small and short-lived.

L1 and deeper levels are different. Each level maintains non-overlapping key ranges across all its files. A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in L1, we don’t scan all files.
We use the key ranges to jump directly to the one file that could contain it.

When L0 accumulates enough files, a compaction runs to merge L0 into L1. This merge enforces the non-overlapping invariant: L0 files (which may overlap) get merged with the relevant L1 files (which define the ranges), producing new files with clean, non-overlapping ranges. Similarly, when L1 grows too large, a compaction merges part of L1 into L2.

Each deeper level is typically larger by a fixed ratio, for example, 10x. L1 might hold 10 MB, L2 100 MB, L3 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels.

The benefit is controlled read amplification. To look up a key, we check the memtable, scan all L0 files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist.

When compaction falls behind and L0 accumulates too many files, the engine may trigger a write stall: new writes are paused until compaction catches up and L0 is drained. This is one of the more painful operational issues in LSM-based systems.

Leveled compaction is also not the only strategy. Tiered compaction, used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload.

Bloom Filters

Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist. For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this.
A Bloom filter is a probabilistic data structure that can answer one question: is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low.

Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there.

How does it work? A Bloom filter is a bitset. When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable.

The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read. Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically.

Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk.

Concurrency

Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background. This is where things get subtle.

The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones.
If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash.

One common solution is a versioned catalog. A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count.

Background workers (the flush worker and the compaction worker) never modify an existing catalog. Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely.

An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove.

This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap, which in many implementations is a single atomic pointer update.

The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted.
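The pin-and-release idea can be sketched in a few lines of Python. This is an illustration under loud assumptions, not how any particular engine does it: `CatalogVersion`, `Engine`, `acquire`, `release`, and `install` are invented names, SSTables are plain strings, the `deleted` list stands in for removing files from disk, and a small mutex guards the reference counts instead of the atomic pointer swap mentioned above.

```python
import threading


class CatalogVersion:
    """Immutable view of which SSTables are live, plus a pin count."""

    def __init__(self, sstables):
        self.sstables = tuple(sstables)   # never modified once published
        self.refs = 0


class Engine:
    def __init__(self, sstables=()):
        self._lock = threading.Lock()
        self._current = CatalogVersion(sstables)
        self._versions = [self._current]  # versions not yet cleaned up
        self.deleted = []                 # stands in for files removed from disk

    def acquire(self):
        """Pin the latest version for the duration of a request."""
        with self._lock:
            version = self._current
            version.refs += 1
            return version

    def release(self, version):
        """Unpin a version; clean it up if nobody needs it anymore."""
        with self._lock:
            version.refs -= 1
            self._cleanup_if_unused(version)

    def install(self, sstables):
        """Called after a flush or compaction: publish a new version."""
        with self._lock:
            old = self._current
            self._current = CatalogVersion(sstables)
            self._versions.append(self._current)
            self._cleanup_if_unused(old)

    def _cleanup_if_unused(self, version):
        if version is self._current or version.refs > 0:
            return
        self._versions.remove(version)
        # A file is only removed if no remaining version still lists it.
        still_needed = {t for v in self._versions for t in v.sstables}
        self.deleted.extend(t for t in version.sstables if t not in still_needed)
```

A reader that pinned the version listing `a` and `b` keeps seeing both files even after a compaction installs a version listing only `c`. The two old files are recorded as deleted only once that reader calls `release`.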
Summary

LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads.
The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing.
The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery.
SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe.
A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state.
Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all L0 files.
Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O.
Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk.
Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level.
Tiered compaction is an alternative strategy that trades less write amplification for more read amplification.
Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario.
A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state.

CRDTs Explained
Availability Models Explained
The PACELC Theorem Explained

The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper.
Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary.
Build Your Own Key-Value Storage Engine IO devices and latency Fundamental Insights To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place . When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection. LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially . Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward. The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around. The in-memory buffer is called the memtable . The sorted files on disk are called SSTables . We’ll look at each in detail. The Memtable Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store . When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput. A hashtable works for lookups but not for in-order iteration. 
Sorting a hashtable takes at flush time. A better choice is an ordered data structure. The most common in practice is a skip list ; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too. Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order. The memtable doesn’t grow forever . At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL. The Write-Ahead Log There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability . The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk . Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client. This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes. The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet. One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. 
To guarantee durability, the engine must call after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though. adds latency to every write. Production systems often use instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call once for the group. The WAL introduces write amplification : the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability. SSTables As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable . An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: Immutability makes SSTables safe to read concurrently without locking. Sorted order makes lookups inside a file efficient. is the landing zone . When the memtable flushes, the resulting SSTable lands in L0. files can have overlapping key ranges: two L0 files might both contain entries for key . This is acceptable because L0 files are small and short-lived. and deeper levels are different. Each level maintains non-overlapping key ranges across all its files . A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in , we don’t scan all files. We use the key ranges to jump directly to the one file that could contain it. Each deeper level is typically larger by a fixed ratio, for example, 10x. might hold 10 MB, 100 MB, 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels. 
The benefit is controlled read amplification . To look up a key, we check the memtable, scan all files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist. When compaction falls behind and accumulates too many files, the engine may trigger a write stall : new writes are paused until compaction catches up and is drained. This is one of the more painful operational issues in LSM-based systems. Leveled compaction is also not the only strategy. Tiered compaction , used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload. Bloom Filters Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist . For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this. A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low. Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there. How does it work? A Bloom filter is a bitset. 
When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable. The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read . Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically. Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk. Concurrency Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background . This is where things get subtle. The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash. One common solution is a versioned catalog . A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count. Background workers (the flush worker and the compaction worker) never modify an existing catalog . Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. 
From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely. An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove. This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap , which in many implementations is a single atomic pointer update. The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Summary LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads. The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing. The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery. SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe. A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state. 
Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all files. Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O. Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk. Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level. Tiered compaction is an alternative strategy that trades less write amplification for more read amplification. Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario. A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state. CRDTs Explained Availability Models Explained The PACELC Theorem Explained The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper. Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary . Build Your Own Key-Value Storage Engine IO devices and latency
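To make the Bloom filter mechanics from this post concrete, here is a minimal Python sketch of the add and lookup paths. It is an illustration under assumptions, not how any particular engine implements it: the class name `BloomFilter`, the default sizes (1,024 bits, 3 hash functions), and the salted SHA-256 hash family are all choices made for this example. Production engines typically use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bitset plus k hash functions (illustrative only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # every bit starts at 0

    def _indexes(self, key: bytes):
        # Derive k bit positions by salting one hash with the function number.
        # (Assumption: SHA-256 is used here only because it ships with Python.)
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        # Set the bit at each index produced by the hash functions.
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: bytes) -> bool:
        # Any 0 bit means "definitely absent"; all 1s means "maybe present".
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indexes(key)
        )


sstable_filter = BloomFilter()
sstable_filter.add(b"user:42")
present = sstable_filter.might_contain(b"user:42")  # True: no false negatives
```

The asymmetry is the point: `might_contain` can return a false positive (costing one unnecessary SSTable read), but it never returns a false negative, so skipping an SSTable on `False` is always safe.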


The Coder Cafe 2 weeks ago

AI for Production

☕ Welcome to The Coder Cafe! These days, most posts about AI for production circle the same ideas: automated remediation, anomaly detection, alerting triage, etc. These are interesting starting points, but they share a common assumption: that AI’s job is to replace what SREs do. In this post, I want to explore the idea of having AI as a cognitive partner, something that extends what a single engineer can hold in their head at once. Get cozy, grab a coffee, and let’s begin! At Google, I’m an SRE on the Google Distributed Cloud team, where the infrastructure stack spans Kubernetes, Borg, distributed storage, virtualization, networking, and more. Over the past months, I’ve been experimenting with ways AI can help not only by automating work away, but also by reducing the cognitive overhead that makes production work quite overwhelming sometimes. Here are three directions that changed how I thought about the problem. Situation Awareness In my team, we have hundreds of dashboards. Kubernetes clusters, Borg jobs, storage metrics, VM utilization, network metrics, etc.
Each one tells part of the story. When something went wrong, and I wanted to understand the current state of the system, I needed to spend a significant amount of time opening tabs and cross-referencing panels to get a complete picture. This is a fundamentally human bottleneck. Each dashboard was designed to answer a specific question. The question “What is the current situation?” doesn’t map to any single dashboard, and navigating all of them to reconstruct an answer takes time we often don’t have. Interestingly, this is where AI can change the equation. Instead of navigating dashboards, imagine describing your system to an AI agent with access to your observability stack and simply asking: “What’s going on?” The agent queries across your telemetry data, picks out what stands out, and hands you back a coherent narrative, something you can actually act on. For example: “This specific cluster has had an issue for the past 2 hours with all the containers using distributed storage running on that specific node.” This shifts your role from navigator (opening dashboards one by one) to interpreter (acting on a synthesized summary). And that shift matters: every minute you spend navigating is a minute you're not spending on the actual problem. Telemetry Archaeology A few months ago, I was investigating a storage incident on a cluster. The failure itself was clear: a disk issue that surfaced as elevated latency and eventually a service degradation. What wasn’t clear was why it happened when it did. I used Gemini CLI to navigate the metrics data around the event window. What it surfaced surprised me: the root cause signals had been present in the telemetry hours before the incident triggered any alert. These were subtle correlations across metrics that individually looked like noise: disk read latency creeping slightly upward, I/O wait ticking up on specific nodes, a minor memory pressure pattern. Together, they pointed directly at the failure that was coming.
A human reviewing those dashboards in real time would almost certainly have missed it. Each individual signal was within an acceptable range. The pattern only became visible when we looked at all of them together, across time. This is what I’d call telemetry archaeology: using AI to go back through your metrics data and surface the correlations an alerting system wasn’t designed to catch. It’s worth being precise about what makes this different from anomaly detection. Anomaly detection tells you when something looks wrong. Telemetry archaeology is about finding the patterns that appear before anything looks wrong at all, relationships that no one thought to encode into an alert, because no one knew they existed until the incident happened. The practical implication is significant. If these correlations exist in your past incidents, they likely exist in future ones. An AI agent that continuously monitors for these multi-signal patterns could surface a warning (“This looks like the early stages of what happened last time”) long before your system starts showing symptoms. Incident Co-Pilot Active incidents can be cognitively brutal. You can be debugging a live system, managing communication with stakeholders, coordinating with other engineers, and trying to remember what you checked 20 minutes ago, all at the same time. A common consequence is that the engineer with the deepest system knowledge gets pulled out of deep focus to write status updates, summarize what’s been tried, and maintain a running timeline. This work is necessary, but it’s expensive. Every context switch makes it harder to hold the full mental model of the incident in your head. And once that model fragments, rebuilding it takes time you don’t have.
NOTE : This is actually one of the reasons Google developed the IMAG process, with clear role separation: The Incident Commander (IC) coordinates the overall response, the Communications Lead (CL) handles stakeholder updates, and the Operations Lead (OL) focuses on mitigating the issue. The explicit goal is to prevent any single person from being pulled in too many directions at once. AI can absorb most of this overhead . Think of it as a second brain that’s been in the room the whole time: it tracks what hypotheses have been tested, which ones were ruled out and why, what changed in the system during the incident window, and what hasn’t been explored yet. When a new engineer joins the investigation, instead of spending ten minutes getting them up to speed, you ask the AI for a summary. AI’s role here is handling the administrative layer of the incident: the parts that pull you out of flow, so you can stay in the problem instead of constantly being yanked out of it. I’ve been using AI this way during my own shifts. Even without a purpose-built tool, maintaining a running log with AI (e.g., what we’ve tried, what we know, what’s next) noticeably changes how an incident feels. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Summary The common “AI for production” narrative focuses on automation and replacement; cognitive augmentation is the underexplored angle. Situation awareness: AI can synthesize across hundreds of dashboards to answer “ What’s the current situation? ” in seconds, shifting your role from navigator to interpreter. Telemetry archaeology: AI can surface hidden correlations across metrics that individually look like noise, revealing root cause signals that were present hours before any alert fired. 
Incident co-pilot: AI can absorb the administrative layer of an active incident (status updates, running timeline, hypothesis tracking), keeping the engineer in deep focus instead of constant context switching. None of this requires replacing the engineer. The value is in extending what one person can hold in their head under pressure. Reliability Resilient, Fault-tolerant, Robust, or Reliable? Lurking Variables Google Site Reliability Engineering: Incident Management Guide The future of software engineering is SRE

DevOps

AI


The Coder Cafe 3 weeks ago

Cache Use Cases Explained

☕ Welcome to The Coder Cafe! Today, we discuss cache use cases. When we think about caching, it’s pretty frequent to focus on where it happens; for example, client-side, server-side, or in a CDN. Yet, there’s a more important question that should be answered first: What’s the use case? In this post, we will break down two common cache use cases: reducing latency and improving capacity. And we will see why the line between the two is blurrier than it seems. Get cozy, grab a coffee, and let’s begin! A Cache for Latency Latency is the time between when a request is sent and when a response is received. A cache for latency exists to reduce the average latency of a service . The classic access pattern looks like this 1 : We check the cache first. On a cache hit, we return the data directly without touching the backend. On a miss, we go to the backend, return the result, and store it in the cache for future requests. Why does this reduce latency? The cache keeps data in memory, which is significantly faster to read from than a remote database that may involve network round-trips, disk I/O, and query execution. On a hit, all of that work is skipped. In Soft vs. Hard Dependency , we introduced two kinds of dependencies: A soft dependency is a non-critical dependency for the service to operate properly. A hard dependency is a critical dependency for the service to operate properly. A cache for latency is a soft dependency . If the cache becomes unavailable, requests fall through to the backend. The system keeps working, just at a higher latency. Keep this in mind, because it’s the key difference we’ll come back to. A cache for capacity exists to serve higher throughput than the backend can handle on its own. The access pattern is identical to the latency case: cache first, then backend on a miss. So what actually makes these two different? The difference is not in the code; it’s in what the backend can absorb. 
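The access pattern shared by both use cases (cache first, backend on a miss) can be sketched in a few lines of Python. This is a minimal illustration, not a production cache: the class name `CacheAside`, the in-process dictionary standing in for the cache, and the `slow_backend` function are all assumptions made for the example, and there is no eviction or expiry.

```python
from typing import Callable, Dict


class CacheAside:
    """Cache-first lookup with backend fallback (names are illustrative)."""

    def __init__(self, load_from_backend: Callable[[str], str]):
        self._load = load_from_backend
        self._cache: Dict[str, str] = {}  # stand-in for an in-memory cache
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._cache:        # cache hit: the backend is not touched
            self.hits += 1
            return self._cache[key]
        self.misses += 1              # cache miss: fall through to the backend
        value = self._load(key)
        self._cache[key] = value      # store the result for future requests
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


backend_calls = []


def slow_backend(key: str) -> str:
    backend_calls.append(key)         # record every request reaching the backend
    return f"value-for-{key}"


cache = CacheAside(slow_backend)
cache.get("user:42")                  # miss: goes to the backend, then stored
cache.get("user:42")                  # hit: served from memory
```

Nothing in this code says whether the cache is there for latency or for capacity. The same `get` method serves both; only the backend's headroom decides which one it is.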
In a capacity scenario, the backend would be overwhelmed if it received all the traffic directly. The cache absorbs a large portion of the requests, keeping the backend load manageable. This changes the nature of the dependency . If the cache goes down, the backend is suddenly hit with all the traffic it was previously shielded from. Whether the system survives depends on the backend’s own capacity. If the backend can scale fast enough, the cache is still a soft dependency: there will be a rough period, but the system recovers. If the backend can’t cope with the load, the cache becomes a hard dependency . Without it, the system fails . Here’s a question worth asking: if the access pattern for both types is identical, how do we know which one we have? In most cases, caches are introduced to reduce latency. But here’s what can happen over time: Our system is stable. Cache hit rates are high, backend load is low. Traffic grows. The backend load stays low because the cache is absorbing most of it. Nothing breaks. No alerts fire. Six months pass. Nothing has changed, no code, no configuration, no architecture decision. And yet the cache is no longer reducing latency. It’s keeping the backend alive. The cache didn’t change. The code didn’t change. The system grew around the cache, and the cache quietly became load-bearing . The same risk appears when a cache goes cold. For example: A migration to a new cache instance A data format change that requires purging existing entries A cache restart after maintenance Any of these can produce a large wave of cache misses in a short window. If we were running a latency cache, we would see higher latency for a while. If we were running a capacity cache, we would see a traffic spike that the backend can’t absorb. The unsettling part is that the code is identical in both cases. The difference only becomes visible at failure time . The root problem is that teams often don’t know which type of cache they’re running . 
They built it for latency, and that’s still how they think about it, even as the system outgrows that assumption. A few approaches help here: Periodically ask: could the backend handle the current traffic if the cache were completely removed ? Load testing without the cache, or estimating backend capacity against current traffic levels, gives you a concrete answer. Treat cache hit rate as a meaningful operational signal , not just a performance metric. A sustained drop in hit rate means the backend is absorbing more traffic than usual. If that trend continues, it’s an early warning that you may be drifting toward a capacity problem. When migrating a cache or invalidating a large portion of its data, warm the new cache before routing live traffic to it. This prevents a cold-start burst from hitting the backend all at once. Finally, once we recognize that a cache is operating as a capacity cache , we should treat it accordingly. It’s no longer optional infrastructure and it deserves proper alerting and a clear plan for what happens if it goes down. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. A cache for latency serves data from memory to reduce average response time. It is a soft dependency: if unavailable, the system degrades in latency but continues to work. A cache for capacity absorbs traffic that the backend couldn’t handle on its own. It can be a soft or a hard dependency, depending on whether the backend can absorb the load without it. Both types share the same access pattern, which makes them easy to confuse. A latency cache can silently become a capacity cache as traffic grows, without any code change. When a capacity cache goes cold or fails, the backend can be overwhelmed. Hit rate monitoring, periodic load testing, and cache warming are practical ways to manage this risk. 
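The question "could the backend handle the current traffic if the cache were completely removed?" comes down to simple arithmetic once we have a hit rate and a backend capacity estimate. Here is a small sketch of that check. The function names and all the numbers are hypothetical, chosen only for illustration; a real capacity figure would come from load testing.

```python
def backend_rps(total_rps: float, hit_rate: float) -> float:
    """Requests per second that reach the backend for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return total_rps * (1.0 - hit_rate)


def cache_is_load_bearing(total_rps: float, backend_capacity_rps: float) -> bool:
    """True if the backend could not serve all traffic with the cache removed."""
    return total_rps > backend_capacity_rps


# Hypothetical numbers: 10,000 requests/s, 95% hit rate, backend good for 2,000 requests/s.
with_cache = backend_rps(10_000, 0.95)       # about 500 requests/s reach the backend
without_cache = backend_rps(10_000, 0.0)     # all 10,000 requests/s reach the backend
load_bearing = cache_is_load_bearing(10_000, backend_capacity_rps=2_000)
```

With these numbers the backend looks comfortable day to day, yet the cache is load-bearing: losing it would send five times more traffic than the backend can absorb. Re-running the same check as traffic grows is one way to notice the drift from latency cache to capacity cache.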
Availability Models Safety and Liveness The PACELC Theorem The Three Types of Cache Cache stampede Even though variations exist.

Backend


The Coder Cafe 1 months ago

How Linux 7.0 Broke PostgreSQL

☕ Welcome to The Coder Cafe! On April 3, 2026, Salvatore Dipietro, an engineer at AWS, posted a patch to the Linux kernel mailing list. The reason: on a 96-vCPU Graviton4 machine running Linux 7.0, PostgreSQL throughput had dropped to roughly half of what it produced on Linux 6.x. In this post, we will trace what changed in Linux 7.0, how PostgreSQL manages memory, and what role memory pages play in making the problem appear (or disappear). Get cozy, grab a coffee, and let’s begin!

The Problem

Salvatore Dipietro ran pgbench (PostgreSQL’s standard benchmarking tool) on a Graviton4 processor with 96 vCPUs. The workload was a benchmark doing simple updates at scale factor 8,470 (i.e., roughly an 847-million-row table), simulating 1,024 clients and 96 threads: a serious, high-parallelism load designed to stress the system. The results were striking. Linux 7.0 delivered roughly half the throughput of Linux 6.x on the same hardware and workload:

Linux 6.x: 98,565 transactions per second
Linux 7.0: 50,751 transactions per second

To find where the time was going, Dipietro ran , a Linux profiling tool that samples what the CPU is actually doing. The result was unambiguous: 55% of the machine’s CPU time was spent inside a single function: . The culprit was traced back to a change in how Linux 7.0 schedules processes. Let’s start there. When multiple threads run on a machine, the OS needs to share the CPU between them. That’s the scheduler’s job. But the scheduler also decides something subtler: when to interrupt a running thread and hand the CPU to another. That decision is called preemption, and the answer varies depending on how the kernel is configured. Before Linux 7.0, there were three options: : The kernel almost never interrupts a running thread. A thread runs until it voluntarily gives up the CPU: when it makes a syscall, blocks on I/O, or explicitly sleeps.
This was the traditional server default with fewer context switches, higher throughput, and predictable behavior under load. : The kernel can interrupt a running thread at almost any safe point, even if it is in the middle of doing useful work. This means a thread never has to wait for the current one to finish its slice before getting CPU time, which reduces response time but increases context-switch overhead. Historically, the desktop default, where responsiveness matters more than raw throughput. : Introduced in Linux 6.12 as a compromise between the two. The scheduler can interrupt threads, but tries to wait for natural boundaries rather than cutting in aggressively. The intent is to approximate ‘s throughput behavior while still allowing preemption when needed. In Linux 7.0, was removed as an option on modern CPU architectures, leaving only and . Indeed, was designed to be a drop-in replacement on throughput workloads, and for the vast majority of server software, it is. But PostgreSQL hit a specific case where the difference is catastrophic, and to understand why, we need to look at how PostgreSQL manages memory. PostgreSQL, like most databases, doesn’t store data as rows in a flat file. Instead, it uses a fixed-size abstraction called a data page (8 KB by default) as its basic unit of storage. Everything on disk (e.g., table rows, B-tree index nodes, metadata) is stored in these pages. A table with millions of rows is ultimately a large sequence of data pages on disk. Reading from disk is slow. So PostgreSQL maintains a shared buffer pool , a large region of shared memory that caches recently read data pages . The more of the working set that fits in the buffer pool, the less disk I/O is needed. When a client connects to PostgreSQL, the server spawns a dedicated process to handle that connection, called a backend . 
Every backend that needs a data page not already in the buffer pool has to first read it from disk, then find a buffer to store it in: either one that is already free, or one currently holding another page that can be evicted. The job of finding that buffer falls to a single crucial function called . To coordinate access to the buffer pool across hundreds of concurrent backends, uses a spinlock. A spinlock is a locking mechanism built on a simple idea: instead of going to sleep while waiting for a lock to become available, a process just keeps checking in a tight loop (it spins ): Why would we ever want that? For very short critical sections, the overhead of putting a thread to sleep and waking it back up can be more expensive than just “spinning“ , meaning actively waiting. If we know the lock holder will be done in nanoseconds, spinning is faster than sleeping. The key assumption behind spinlocks is the following: the thread holding the lock will release it very soon. Nobody is going to preempt that thread in the middle of a 20-nanosecond critical section. The holder will finish and release the lock before anyone has time to notice. uses a single global spinlock to protect the critical section where it selects a buffer. On a 96-vCPU machine with 1,024 clients all hammering the database, every backend competes for the same lock, and any time it takes longer than expected to release, all of them burn CPU spinning. But why did the Linux 7.0 preemption change make it so much worse? The answer lies in how memory works at the hardware level. Every process in Linux, including PostgreSQL, works with virtual memory addresses. For example, the address in one process is a completely different memory from the same address in another process. The hardware translates virtual addresses to physical addresses using a data structure called the page table , maintained by the kernel in memory. 
A page table is a multi-level tree, so a single address translation requires several sequential memory reads to walk it. Doing that for every memory access would be impossibly slow. Instead, CPUs have a small hardware cache for recent translations called the Translation Lookaside Buffer (TLB): When a process accesses an address it has accessed recently, the TLB already has the translation, and the memory access proceeds quickly. When a process accesses an address it hasn’t seen before, it gets a TLB miss : the CPU has to walk the page table, find the physical address, and store the translation in the TLB. That takes time. There is one more concept to introduce. When PostgreSQL starts, it allocates the shared buffer pool as a large virtual memory region. But allocating virtual memory and having physical memory ready to use are two different things. Indeed, Linux uses a principle called lazy allocation : the allocation is noted, but the actual physical pages are only mapped on first access. The first time any code touches a previously-unmapped virtual address, a minor page fault occurs: the kernel allocates a physical page and stores the mapping. That takes microseconds, orders of magnitude slower than a regular read or write where the page is already mapped. When a process accesses memory for the first time, the kernel doesn’t map it byte by byte. Instead, it maps memory in fixed-size chunks called memory pages via the page table. NOTE : We already used the word “page” to characterize data pages, meaning how PostgreSQL organizes data on disk into fixed-size 8 KB blocks. This is a different concept than a Linux page, which is the unit the kernel uses to manage physical memory. By default, a Linux memory page is 4 KB. PostgreSQL's shared buffer pool, like all memory on Linux, is backed by Linux memory pages under the hood. 
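Lazy allocation and first-touch faults can be illustrated with a toy model. The sketch below is not how the kernel works internally; it only mimics the bookkeeping described above: a region is reserved, nothing is mapped, and the first access to each memory page triggers a "fault" that maps it. The class name `LazyRegion`, the 1 MiB region size, and the 512-byte access stride are assumptions made for the example.

```python
class LazyRegion:
    """Toy model of a lazily mapped virtual memory region (not real kernel code)."""

    def __init__(self, size_bytes: int, page_size: int = 4096):
        self.size_bytes = size_bytes
        self.page_size = page_size
        self.mapped = {}        # virtual page number -> fake physical frame number
        self.minor_faults = 0

    def touch(self, offset: int) -> bool:
        """Access one byte; return True if the access caused a first-touch fault."""
        if not 0 <= offset < self.size_bytes:
            raise IndexError("offset outside the region")
        page_number = offset // self.page_size
        if page_number in self.mapped:
            return False                             # already mapped: fast path
        self.mapped[page_number] = len(self.mapped)  # "allocate" a physical frame
        self.minor_faults += 1
        return True


region = LazyRegion(size_bytes=1024 * 1024)          # hypothetical 1 MiB region
for offset in range(0, region.size_bytes, 512):      # first pass over the region
    region.touch(offset)
first_pass_faults = region.minor_faults              # 256: one per 4 KiB page

for offset in range(0, region.size_bytes, 512):      # second pass: all pages mapped
    region.touch(offset)
second_pass_faults = region.minor_faults - first_pass_faults  # 0
```

Two things stand out in this model: the number of faults depends on the number of pages, not on the number of accesses, and the cost is paid by whichever code happens to touch a page first.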
In Dipietro’s benchmark, the shared buffer pool was configured to 120 GB via the parameter, which at 4 KB per Linux memory page means roughly 31 million memory pages. That means roughly 31 million potential first-touch page faults. Now let’s consider what happens inside . Each backend acquires the spinlock to find a free slot in the buffer pool. To do so, it reads or writes shared memory. If that region of shared memory hasn’t been touched yet, accessing it triggers a minor page fault, meaning that the kernel has to allocate a physical memory page and store the mapping. During a long benchmark with a 120 GB shared buffer pool, new regions keep entering the working set throughout the run, so these faults happen constantly, not just at startup. And when a fault occurs while a backend is holding the spinlock, the consequences are severe. Indeed, we discussed that the key assumption behind spinlocks is that the lock will be released very soon. In this case, the assumption breaks: the holder is stuck inside the kernel fault handler while it stores a physical memory page mapping, and every other backend on the machine is spinning, burning CPU, waiting for a lock that won't be released until the faulting process resumes. The impact of a fault taken while the lock is held depends on the preemption model. Let’s consider the following example. Backend A acquires the lock but triggers a page fault. Meanwhile, backends B, C, and D arrive and try to acquire the lock. Since they can’t, they spin, burning CPU in a tight loop while waiting for backend A to release the lock. With (before Linux 7) : Once backend A enters the fault handler, the kernel handles the fault. Since avoids voluntary rescheduling points, backend A is unlikely to be scheduled away before the fault resolves and the lock is released. The spinners wait a bit longer than expected, but the damage is limited.
With PREEMPT_LAZY (Linux 7 and beyond): The scheduler may decide to preempt backend A while it’s still inside the fault handler, scheduling another process in its place. Backend A won’t resume until the scheduler hands control back to it, which can take some time, even after the fault is fully handled.

The spinlock hold time goes from “duration of the fault” to “duration of the fault + time waiting for the scheduler.” And that extra wait, let’s call it T, is not just T of wasted CPU; instead, it is multiplied by every backend currently spinning. In the previous example, backends B, C, and D each burn T extra cycles, making the total waste 3 × T. On a 96-vCPU machine with hundreds of backends, that multiplier is devastating. That's how the benchmark ended up with 56% of the CPU burning in the spinlock. That extra time waiting for the scheduler was the root cause of the issue.

Fortunately, there is an option to overcome this issue in PostgreSQL. The main variable we discussed was shared_buffers, 120 GB in the benchmark, meaning roughly 31 million memory pages. But there is another variable we can adjust: the size of a memory page. As we said, it defaults to 4 KB, but the kernel supports larger pages called huge pages. On x86_64 and ARM64, the supported sizes are 2 MB and 1 GB:

4 KB pages: ~31,000,000 potential page faults
2 MB huge pages: ~61,440 potential page faults
1 GB huge pages: ~120 potential page faults

Increasing the size of a memory page not only reduces the number of potential page faults but also reduces TLB pressure. Indeed, far fewer entries need to cover the same memory, so the working set fits comfortably in the TLB, meaning far fewer TLB misses and page table walks on the hot path. Overall, backends stop triggering faults while holding the lock. The lock holder finishes quickly. The other backends wait microseconds instead of milliseconds. The regression disappears.

NOTE: Setting huge pages in PostgreSQL is controlled by the huge_pages configuration parameter, which accepts three values: off, on, and try (the default).
With try, PostgreSQL uses huge pages if available and silently falls back to 4 KB pages otherwise. Use on instead so PostgreSQL fails to start rather than running misconfigured without you noticing. The size of the huge pages themselves is a Linux configuration.

However, setting huge pages is not without tradeoffs. Huge pages are pre-allocated and reserved upfront, meaning that memory is no longer available to the rest of the system even if PostgreSQL isn’t using it all. There is also a memory waste concern: a huge page is allocated as a whole, so if only a fraction of it is used, the rest is wasted. For most production PostgreSQL deployments with large shared_buffers, these tradeoffs are probably worth it, but they are good to know about.

Peter Zijlstra, the Intel kernel engineer who authored the preemption change, proposed a fix: PostgreSQL should adopt Restartable Sequences (rseq), a Linux kernel facility that lets userspace code detect whether it was preempted or migrated during a critical section and restart it if so. PostgreSQL's spinlock paths would use rseq to detect preemption and retry, avoiding the scenario where a preempted lock holder stalls all waiting backends.

The PostgreSQL community’s response was not enthusiastic. Using a kernel facility specifically to recover performance that PostgreSQL had for free before Linux 7.0 is a tough sell. It also sits uncomfortably next to the kernel’s long-standing principle of not breaking userspace: if software worked correctly before a kernel upgrade, it should work correctly after.

Linux 7.0 removed PREEMPT_NONE on modern CPU architectures, leaving only PREEMPT_FULL and PREEMPT_LAZY. On most distributions, the default shifted to PREEMPT_LAZY.
An AWS engineer benchmarked PostgreSQL on a 96-vCPU Graviton4 and found throughput cut in half on Linux 7.0, with 55% of CPU burning inside a single spinlock. The root cause is minor page faults occurring while a backend holds the spinlock. With 4 KB memory pages backing a 120 GB shared_buffers, there are up to 31 million potential first-touch faults throughout a benchmark run. Under PREEMPT_NONE, the faulting process resumed quickly and released the lock. Under PREEMPT_LAZY, the scheduler may preempt it mid-fault, extending the hold time and causing every waiting backend to keep spinning. Enabling huge pages (2 MB or 1 GB) reduces the number of potential faults by orders of magnitude and reduces TLB pressure, making the regression disappear.

The Problem

Salvatore Dipietro ran pgbench (PostgreSQL’s standard benchmarking tool) on a Graviton4 processor with 96 vCPUs. The workload was a benchmark doing simple updates at scale factor 8,470 (i.e., roughly an 847-million-row table), simulating 1,024 clients and 96 threads. A serious, high-parallelism load designed to stress the system. The results were striking. Linux 7.0 delivered roughly half the throughput of Linux 6.x on the same hardware and workload:

Linux 6.x: 98,565 transactions per second
Linux 7.0: 50,751 transactions per second

PREEMPT_NONE: The kernel almost never interrupts a running thread. A thread runs until it voluntarily gives up the CPU: when it makes a syscall, blocks on I/O, or explicitly sleeps. This was the traditional server default, with fewer context switches, higher throughput, and predictable behavior under load.
PREEMPT_FULL: The kernel can interrupt a running thread at almost any safe point, even if it is in the middle of doing useful work. This means a thread never has to wait for the current one to finish its slice before getting CPU time, which reduces response time but increases context-switch overhead. Historically, this was the desktop default, where responsiveness matters more than raw throughput.

PREEMPT_LAZY: Introduced in Linux 6.12 as a compromise between the two. The scheduler can interrupt threads, but tries to wait for natural boundaries rather than cutting in aggressively. The intent is to approximate PREEMPT_NONE’s throughput behavior while still allowing preemption when needed.

Linux

0 views

The Coder Cafe 1 months ago

The Reading Room is Open

We’re launching something new: The Reading Room, a book club right here in The Coder Cafe community. We’re kicking things off with one of my all-time favorite technical books: Designing Data-Intensive Applications, since the second edition just got released.

If you’re interested, here’s how it works:

One chapter every two weeks (no pressure, no guilt). You can find the full schedule here.
Discussion happens in the #ddia-v2 channel on Discord.
O’Reilly is kindly sponsoring the reading group! 🎉 3 participants will be randomly selected at the start to receive a free digital copy of the book.
Depending on engagement, we may also organize a live session after each half of the book to discuss together.
A shared reading experience with other engineers who care about the same stuff as you.

Next steps:

To join, add a 👍 to this message in the Discord. Not in the server yet? Join here.
To have a chance to win one of the 3 free copies, fill in this form (O’Reilly requires an email address to send the free digital copy). The random draw will happen on May 1st.
We will start reading the first chapter on May 4th.

See you in The Reading Room.

Books

0 views

The Coder Cafe 1 months ago

Systems Thinking Explained

☕ Welcome to The Coder Cafe! In a previous post, I briefly touched on systems thinking after reading Learning Systems Thinking. My honest take: it was an interesting introduction, but I wasn’t fully convinced. The concepts felt abstract, the examples too sparse. Then I read Thinking in Systems by Donella Meadows. It might be one of the best books I’ve read in my career (and it’s not even a computer science book). This post is my own introduction to the core concepts, grounded in a real example from my experience. Get cozy, grab a coffee, and let’s begin!

Introduction

Have you ever fixed an incident, only to see it come back two weeks later? Or made a change that improved one metric while quietly degrading another? Or spent months firefighting without ever feeling like things were actually getting better? These aren’t signs of bad engineering. They’re signs of reacting to events without understanding the structures that produce them.

Understanding those structures requires a different kind of thinking, and that’s exactly what systems thinking is: the ability to shift from reacting to events through responsive patterns of behaviors to generating improved systemic structures. This post is an introduction to systems thinking, covering the core concepts through a real example from my experience at Google.

First, let’s define what a system is. In essence, a system is:

A set of elements
Interconnected
To achieve something

Distributed systems are an obvious example. For instance, a 3-node, single-leader database is composed of:

3 nodes (elements)
Connections from the leader to the replicas (interconnections)
With the goal of storing data reliably over time

Interestingly, this is why distributed systems can surprise even their own designers: add enough nodes, replication lag, and competing writes, and the system starts behaving in ways no single component would predict.
To reason about how systems change over time, we need two important concepts:

A stock is an accumulation of material or information that has built up in a system over time. For example: the number of machines available in a cluster, the size of a message queue, the amount of technical debt in a codebase.
A flow is what changes a stock: material or information entering or leaving it. For example: machines being added or removed from service, messages being enqueued and consumed, or requests being received and processed.

The key thing to keep in mind: stocks take time to change because flows take time to flow. You can’t instantly restore machine availability or drain a queue with a single action. This has real consequences for how systems behave under pressure. We will come back to it.

One of the most important concepts in systems thinking is the feedback loop. A feedback loop is what the system does automatically because its own result feeds back into it. Said differently: If A causes B, then B influences A.

Let’s take a concrete example. Suppose you live in a house with a central thermostat set at 20°C. It turns the heating on when the temperature drops to 19°C, and off when it reaches 21°C. The feedback loop works like this:

A: Temperature change
B: Thermostat turns heating on or off

The thermostat turning on or off (B) is caused by the temperature change (A). But the temperature change (A) is in turn influenced by the thermostat (B). Each effect feeds back into its own cause. This is a feedback loop.

There are two kinds of feedback loops.

A balancing feedback loop resists change: It pushes the system back toward a goal or limit. Think of it as a stabilizer: when something moves away from the target, the loop acts to bring it back. The thermostat is a perfect example. As the temperature drifts away from 20°C, the thermostat reacts, and the system returns to equilibrium.

A reinforcing feedback loop amplifies change: More leads to more, less leads to less.
An action produces a result that drives more of the same action, generating growth or decline at an accelerating rate. The YouTube algorithm is a clear illustration: the more a video is viewed, the more the algorithm surfaces it; the more it’s surfaced, the more views it gets.

More formally, we can have 4 cases of feedback loops:

Balancing ceiling: If an increase in A causes B, then B pushes A back down.
Balancing floor: If a decrease in A causes B, then B pushes A back up.
Reinforcing growth: If an increase in A causes B, then B increases A further.
Reinforcing collapse: If a decrease in A causes B, then B decreases A further.

The more feedback loops a system contains, the more complex and surprising its behavior becomes, especially when those loops interact.

An often overlooked but critical property of feedback loops is the delay between an action and its effects. Delays are pervasive in systems and strong determinants of behavior. When the gap between action and effect is long, two things happen:

Foresight becomes essential: Acting only when a problem becomes obvious means missing the window to address it early.
Oscillations become likely: We overreact because the system hasn’t had time to respond, then overreact again in the other direction.

Think of an autoscaler that takes 3 minutes to provision new instances. By the time the new capacity is ready, the traffic spike has already peaked. The window to act had opened before the problem was even visible on the dashboard. This is why foresight matters: when there is a significant delay between action and effect, reacting to what you see now means always acting too late.

And the consequences compound. The autoscaler, still responding to the old signal, overshoots. Then it sees too much capacity and scales down, right before the next spike arrives. One example, two problems: a system that needed foresight got a reaction, and then oscillated because of it. The delay didn’t change the goal. It made the system work against itself.

System boundaries are artificial.
They help us frame a problem, but in reality, everything is interconnected. The boundaries we draw determine what we see and, therefore, what we miss.

Consider a microservices architecture in which each team owns a service. Every team has solid SLOs, careful on-call rotations, and clean dashboards. And yet end-to-end latency keeps creeping up, and users are complaining. Each team looks at its own service and sees green. The problem is that the boundary is wrong; no one is looking at the system as a whole.

This is one of the most common traps in engineering: optimizing within a boundary while the real issue lives outside it. Before changing a system, it is worth asking: Am I looking at the right boundary?

When something goes wrong in a system, what do we actually see? Usually just the surface: an incident, a spike, an outage. The iceberg model gives us a way to think beneath it. The model has four levels:

Events are what’s visible: the incident alert, the latency spike on the dashboard. This is where most of our attention goes, and where reactive thinking lives.
Patterns and trends are what you find when you zoom out. Has this happened before? At what frequency? Under what circumstances? Patterns reveal that what felt like a one-off event is actually part of a larger rhythm.
Structure is the underlying system design: the feedback loops, the incentives, the processes that produce the patterns. You can’t fix a pattern without understanding the structure that generates it.
Mental models are the beliefs and assumptions that shaped the structure in the first place. They’re the hardest to see and the hardest to change.

Most incident response lives at the event level. Systems thinking asks us to go deeper. As an SRE, this model resonates: we’re trained not just to react to incidents but to understand the why: the patterns, the structures, and eventually the assumptions that caused them.
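Before moving on to a real system, the thermostat loop and the effect of delays described earlier can be combined in a small simulation. This is only a sketch under made-up assumptions: the room loses 0.5°C per step, the heater adds 1.0°C per step, and the thermostat reads a sensor that lags by `delay` steps. The function name `simulate_thermostat` and all of these rates are hypothetical.

```python
def simulate_thermostat(delay: int, steps: int = 60) -> list[float]:
    """Simulate a room temperature (the stock) under a thermostat.

    The thermostat turns heating on at or below 19°C and off at or above
    21°C, but it only sees the temperature from `delay` steps ago.
    All rates are illustrative assumptions, not real-world values.
    """
    if delay < 0:
        raise ValueError("delay must be >= 0")
    history = [20.0]  # start at the target temperature
    heating = False
    for _ in range(steps):
        # Balancing loop: the controller reacts to a (possibly stale) reading.
        observed = history[max(0, len(history) - 1 - delay)]
        if observed <= 19.0:
            heating = True
        elif observed >= 21.0:
            heating = False
        inflow = 1.0 if heating else 0.0  # heat added by the heater
        outflow = 0.5                     # heat lost to the outside
        history.append(history[-1] + inflow - outflow)
    return history


if __name__ == "__main__":
    for delay in (0, 3):
        temps = simulate_thermostat(delay)
        print(f"delay={delay}: min={min(temps)}, max={max(temps)}")
```

Without delay, the temperature stays inside the 19°C to 21°C band. With a lagging sensor, the same rule and the same goal produce a wider swing in both directions, which is the oscillation effect discussed in the delays section.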
Let me now bring all of these concepts together through a concrete example from my previous role at Google, where I worked on the systems powering Google’s ML infrastructure. I was heavily focused on a system called the Safe Removal Service [1] (SRS). This service had a simple API and one core responsibility: to say yes or no when another system requested permission to disrupt a given entity. Indeed, most disruptive services at Google, the ones that reboot machines, drain jobs, or take clusters offline, were designed to ask this service before acting.

In our context, the key constraint was preserving capacity, meaning ML TPUs and GPUs. For example, within a given cluster, at least 90% of TPUs must remain available at all times. So if 95% were currently available, SRS could approve disruptions, as long as availability didn’t drop below 90%.

NOTE: The threshold values and other details have been altered for confidentiality reasons.

The API was deliberately simple:

“Can I reboot this machine?” → Yes/No
“Can I drain this job?” → Yes/No
“Can I take down this cluster?” → Yes/No

SRS implemented several balancing feedback loops. For example, when available capacity dropped toward 90%, the service would start refusing disruptive requests, pushing availability back up. This was the primary loop: a governor that kept the system in a safe zone. There was also an implicit reinforcing loop on the positive side: by allowing maintenance to proceed when capacity was healthy, the service enabled machines to be upgraded, patched, and kept in good shape, which in turn kept capacity high.

So far, so good. But here’s where it gets interesting. The balancing loop protected current capacity. What it didn’t account for was what happened when capacity was already constrained. When available capacity hovered near 90%, SRS would block most maintenance requests. Machines couldn’t be patched. Hardware with known error trends couldn’t be swapped. Security upgrades were deferred.
Maintenance debt accumulated, silently, invisibly. This created a first hidden reinforcing loop: Less capacity → Deferred maintenance → More failures → Even less capacity The balancing loop was actively feeding the very problem it was trying to prevent. A second reinforcing loop emerged from human behavior: Low capacity → More incidents → Bypass mechanisms invoked → Riskier actions taken → Capacity lower still When the system was under stress, operators would sometimes override SRS to unblock critical work. Each bypass, reasonable in isolation, eroded the safety margins that the balancing loop was designed to protect. There’s a principle from Thinking in Systems that describes this precisely: System behavior is particularly sensitive to the goals of feedback loops . If the goals—the indicators of satisfaction of the rules—are defined inaccurately or incompletely, the system may obediently work to produce a result that is not really intended or wanted. Specify indicators and goals that reflect the real welfare of the system . Be especially careful not to confuse effort with result or you will end up with a system that is producing effort, not result. SRS was measuring the right-looking metric: current capacity. But the current capacity was not the same as the real health . A cluster at 92% availability, accumulating maintenance debt and hardware errors, was far more fragile than a cluster at 91% that was fully patched and stable. The balancing loop couldn’t tell the difference. The deeper fix wasn’t just tuning the threshold. It was making the controller health-aware, not just capacity-aware . Rather than gating only on “ % available right now ,” the system needed to incorporate slow indicators: maintenance backlog growth rate, share of fleet on known-bad firmware versions, hardware error trendlines, override and bypass rates. By the time the reinforcing loops made their effects visible, the stock (cluster health) had already been degrading for weeks. 
The delay between cause and effect made the problem invisible until it was expensive to fix. This example was not about a flawed design. It was about a structure that, taken as a whole, was quietly working against itself.

A system is a set of elements interconnected to achieve a goal.
Stocks are accumulations that change over time through flows; stocks take time to change.
A feedback loop occurs when an effect feeds back into its own cause.
Balancing feedback loops resist change and push toward equilibrium; reinforcing feedback loops amplify change.
Delays between action and effect can cause oscillations and make problems invisible until too late.
System boundaries are artificial; the boundary we draw determines what we see and miss.
The iceberg model: events are visible, but patterns, structure, and mental models lie beneath.
System goals must reflect real welfare, not just what’s measurable; inaccurate goals lead to unwanted behaviors.
A well-designed balancing loop can mask hidden reinforcing dynamics. The most dangerous moment is when a system appears to be working.

Working on Complex Systems
Probabilistic Increment
Thinking In Systems
Learning Systems Thinking
Leverage Points: Places to Intervene in a System

[1] I already mentioned that service in a previous post. You can find more information in this whitepaper: VM Live Migration At Scale.

DevOps

0 views

The Coder Cafe 1 months ago

How an SSD Works

☕ Welcome to The Coder Cafe! Today, we explore quantum physics. Not the abstract kind, but the kind that runs inside the device you are reading this on. Indeed, every time you save a file to an SSD, electrons exploit quantum physics to cross a physical barrier they classically have no business crossing. I’m not a physicist, but I’ve been in love with quantum physics for years, and over the last few months I've gone deep into these concepts. Get cozy, grab a coffee, and let’s begin!

An Introduction to Matter

To start, what is matter? Matter is made up of molecules, and molecules are assemblages of atoms, the building blocks of matter. For example, water is an H₂O molecule: 2 hydrogen atoms and 1 oxygen atom. An atom is itself composed of a nucleus and electrons, which carry a negative charge and orbit around it. The nucleus contains two types of particles:

Protons, which carry a positive charge and naturally repel each other.
Neutrons, which carry no electric charge and act as a kind of “glue,” helping to keep the nucleus stable.

The attraction between electrons (−) and protons (+) keeps the whole thing in a stable state. On the other hand, with too few or too many neutrons relative to the protons, the nucleus becomes unstable. It will eventually decay by emitting energy. This is the principle of radioactivity. Carbon-14, for example, is slightly unstable. It decays slowly and predictably. This predictability allows it to be used as a clock to date ancient elements.

One might think that when touching a solid object, like a table, for instance, what gives the table its solidity is that it is “filled” with matter, preventing our finger from passing through. Yet, if we took the nucleus and enlarged it into a marble and placed that marble on a football pitch, the electrons would be orbiting at the level of the stands.
An atom is therefore almost entirely empty. Solid matter is almost nothing, and what gives this impression of solidity are forces between atoms called electromagnetic forces.

The Fundamental Forces in the Universe

The universe is governed by 4 and only 4 known fundamental forces:

Gravity: Attracts everything with mass toward everything else with mass.
The strong nuclear force: Glues protons and neutrons together inside the nucleus.
The weak nuclear force: Responsible for certain radioactive decays. It is what allows a neutron to transform into a proton (or vice versa).
The electromagnetic force.

If we focus on this last one, it is the one that attracts opposite charges and repels identical charges. Unlike the two nuclear forces, which only act inside the nucleus, the electromagnetic force has an infinite range. That is why it is the one that governs interactions between atoms at our scale. It is therefore the electromagnetic force that creates the illusion of solidity. When we touch a table, it is the electrons in our hand and those in the table that repel each other. We never truly touch anything.

Let’s now talk about light. So, what is light? It is an electromagnetic wave, a disturbance of the electric and magnetic fields that propagates through space. Light is a spectrum. Indeed, so-called visible light, the light our eyes can perceive, is only a tiny portion of what exists. The full spectrum is called the electromagnetic spectrum:

Radio wave → Microwave → Infrared → Visible light → UV → X-rays → Gamma rays

When a radio picks up radio waves, it is therefore picking up light, invisible due to its frequency. Indeed, what varies across the electromagnetic spectrum is the frequency of the wave, and therefore its energy. But light hides a surprise: it is also a particle.
NOTE: A particle can be summarized as follows: an indivisible packet of energy.

We know light is a particle thanks to Einstein’s 1905 work on the photoelectric effect (for which he received his only Nobel Prize, rather than for relativity). When a light bulb emits light, it emits specific particles called photons. When we vary the intensity of that light bulb, one might assume it is the energy of the photon that varies, but that is not the case. The energy of each photon is fixed by its frequency. The higher the frequency, the more energetic each photon is. That is why, for example, UV rays burn the skin. What makes a light bulb emit more light is an increase in electric voltage, which produces more photons. It is the quantity of photons that makes a light bulb shine more or less.

In flight, the photon behaves like a wave: it propagates, it oscillates, and it can interfere with other photons. But when it comes into contact with matter, it behaves like a particle: it interacts in one single hit, in one single place. When a photon collides with matter, it can either be:

Absorbed: The photon ceases to exist. Its energy is transferred to an atom, which moves to a higher energy level. This is what an eye does: it absorbs the photon and converts it into an electrical signal.
Reflected: Technically, this is not a true reflection because it is not the same photon that leaves. The atom absorbs the photon and then re-emits a photon of the same energy in a different direction.

NOTE: Whether a photon is absorbed or reflected depends on the energy levels of the electrons in the atoms of the surface. If the photon’s frequency matches an available energy level, the atom absorbs it. Otherwise, the photon is re-emitted. That is why glass is transparent, why the retina absorbs light, and why a mirror reflects almost everything.

We have seen that light is a wave. But how do we know this?
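Before answering, here is a quick sketch that makes the photon-energy point above concrete. It relies on the Planck relation E = h·f, the standard formula behind “the energy of each photon is fixed by its frequency”, which the text above does not spell out. The example frequencies and the idealized single-frequency source are illustrative assumptions, not measurements.

```python
# Sketch: energy per photon depends only on frequency (E = h * f);
# brightness depends on how many photons are emitted per second.

PLANCK_H = 6.62607015e-34        # Planck constant in J*s (exact SI value)
ELECTRON_VOLT = 1.602176634e-19  # one electronvolt in joules (exact SI value)


def photon_energy_ev(frequency_hz: float) -> float:
    """Energy of a single photon, in electronvolts."""
    return PLANCK_H * frequency_hz / ELECTRON_VOLT


def photons_per_second(power_watts: float, frequency_hz: float) -> float:
    """Photon count per second for an idealized source emitting one frequency."""
    return power_watts / (PLANCK_H * frequency_hz)


# Illustrative (assumed) frequencies along the electromagnetic spectrum.
EXAMPLE_FREQUENCIES_HZ = {
    "FM radio": 1.0e8,
    "microwave": 2.45e9,
    "green light": 5.6e14,
    "ultraviolet": 1.0e15,
}

if __name__ == "__main__":
    for name, frequency in EXAMPLE_FREQUENCIES_HZ.items():
        print(f"{name:>12}: {photon_energy_ev(frequency):.3e} eV per photon")

    # Doubling the power doubles the number of photons, not their energy.
    for power in (1.0, 2.0):
        count = photons_per_second(power, EXAMPLE_FREQUENCIES_HZ["green light"])
        print(f"{power:.0f} W of green light: {count:.3e} photons per second")
```

Running it shows the two separate knobs: moving up the spectrum raises the energy of each photon, while raising the power only raises the photon count.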
This is where Young’s double-slit experiment comes in, and it is this very experiment that will lay the foundations of quantum physics. Young’s experiment, carried out for the first time in 1801, is as follows (in its modern form):

A laser projects photons (light)
A wall with two small slits, A and B
A screen behind to detect where the photons land

If light were a “packet” of something, that is, if it behaved purely as a particle, firing it through two slits would simply produce two bright bands on the screen, one for each slit.

Yet, the result of Young’s double-slit experiment is different: instead of two bands, light produces multiple alternating stripes on the screen, proof that it behaves as a wave, interfering with itself after passing through both slits simultaneously. We obtain what is called an interference pattern. The wave passes through both slits simultaneously, splits into two, and these two waves meet on the other side. When two waves meet, they add up or cancel out depending on their respective phase:

Two crests meeting → they add up → bright zone
A crest meeting a trough → they cancel out → dark zone

The result is an alternating pattern of bright and dark bands on the screen: that is an interference pattern.

In the 20th century, researchers then had an idea: apply Young’s experiment no longer by projecting photons (light) but electrons (matter). The experiment is therefore similar, but instead of a laser, an electron gun is used, and the screen measures where the matter lands. Obviously, with this experiment, we are going to get two bands of matter, right? Well, still no! An interference pattern is observed as well. This result was not a complete surprise to everyone.
In 1924, physicist Louis de Broglie had already theoretically proposed that matter, like light, could have a wave-like nature. But this time, it is not a wave like light; it is a probability wave. This is one of the greatest discoveries in quantum physics: at the atomic level, the particle has no defined position. The position of a particle is described by a function called the wave function, ψ, whose squared magnitude gives the probability of finding that particle at a given location at a given time.

A small clarification on this concept of undefined position, because this is the moment where our rational brain can start to “let go.” Let’s take a coin for a coin toss. We throw it in the air and hide the result. We are in a state of uncertainty, but this uncertainty is called epistemic. We do not know the result (heads or tails) because we have not looked yet, yet that result already exists. For a particle in the quantum world, the uncertainty is called ontological. It is not that we lack information about the position of the particle; it is that this position simply does not exist yet. This is what is called quantum superposition: an unmeasured particle exists in multiple states simultaneously.

However, measurement changes everything. When we measure the position of a particle, we will find it in one of the possible positions described by the wave function. We then say that the wave function “collapses” because it restricts the possibilities into a single real state: a spread of possibilities becomes a single sharp spike at one location.

As an analogy, it is a bit like Minecraft. A default Minecraft map is 60 million x 60 million blocks.
For the initial loading, the server does not generate the entire map. It only generates the world around the observer, i.e., the player. However, when the player moves, they force the server to generate the world’s continuation. Where this analogy reaches its limits is that the generation of the Minecraft world, even if it is random, is still deterministic because each world has its own seed. The quantum world, on the other hand, appears to be purely random, meaning without hidden information.

Let’s return to Young’s experiment. What would happen if, when a particle passes through a slit, we placed a detector there to observe which slit the particle goes through? We recall that a wave passes through both slits at once. This is the moment where the brain completely lets go: observing the particle changes the result of the experiment. Indeed, observing that particle “forces” it to have a defined position, and it then behaves like a classical marble. The result, therefore, gives us two bands of matter.

To summarize what we have seen so far: an unobserved particle exists as a probability wave, in multiple positions simultaneously. As soon as we measure it, this wave collapses, and the particle ends up at a precise location. But then, why do we never see this in everyday life? The answer is decoherence.

Decoherence

Quantum superposition is only possible as long as a particle remains isolated from its environment. As soon as it interacts with anything, another atom, a photon, an electric field, that interaction constitutes a measurement in the quantum sense. The wave function collapses, and the particle ends up in a precise state. An isolated electron in a vacuum can remain in superposition.
But a macroscopic object like a table is made up of billions upon billions of atoms that permanently interact with the surrounding air, light photons, and electromagnetic fields. These interactions occur billions of times per second. The superposition collapses almost instantaneously, before we can even observe it. That is why quantum effects are typically only observable at the atomic scale. And that is also why a single electron in a transistor behaves very differently from an object we can hold in our hand.

The Key is Information

OK, so the original Young’s experiment with light produces an interference pattern because light is a wave. The variation with electrons (or indeed subsequently other elements such as atoms) also produces an interference pattern, which proves that matter is a wave, but this time a probability wave. When we measure which slit the particle goes through, we change the result of the experiment because we force the particle to “choose” its position.

But incidentally, how does this measurement work in the experiment? It works thanks to photons. Indeed, when the electron passes through one of the slits, we project a photon which will interact with the electron and be re-emitted in a direction that allows us to deduce which slit the electron went through.

Researchers wanted to know what would happen if they performed the exact same experiment, measuring which slit the particle went through, but this time, instead of reading the information encoded in the orientation of the photon, they destroyed that information. And here, another surprise: if we destroy the information, we return to an interference pattern. It was as if, since we were not using that information, there was nothing forcing the particle to choose which slit to go through, and so it could remain in the form of a probability wave.
This new experiment, therefore, demonstrates something fundamental in quantum physics: technically, it is not the act of measuring that influences the experiment, but whether or not this information exists somewhere in the universe. If this information is destroyed, the interference pattern returns. The key is therefore information.

NOTE: How does the destruction of this information work? One might think it would simply be a matter of having the photon absorbed by an absorbing surface before reading it, but this does not work, and we are left with two bands. Indeed, by doing so, the information theoretically exists because the absorbing surface could have determined the position of the particle through the orientation of the photon. The destruction works with another incredible principle of quantum physics that I will not detail in this article: entanglement. The photon is sent onto a special crystal, which splits it into two entangled twin photons. One of the twins is then destroyed, making the information unrecoverable because to read the information, one absolutely needs to read both twins. To simplify, the two twins are not copies; they form a single system whose properties are not individually defined.

The Tunnel Effect

We are slowly getting closer to SSDs. But before that, there is one last quantum concept we need to talk about: the tunnel effect. We said that an unobserved electron does not exist like a marble at a precise location. It exists as a probability wave spread out in space. This wave function gives a probability of finding the electron at each point in space.

Now let’s imagine a physical barrier. We send an electron toward this barrier. Classically, if the electron does not have enough energy to pass over it, it is blocked. Full stop. Yet quantum mechanically, the wave function of the electron does not stop abruptly at the barrier. Because it is a wave, it propagates and gradually decays through the barrier. It does not fall to zero.
On the other side, there therefore remains a non-zero probability of finding the electron. This is the tunnel effect: a real chance for the electron to end up on the other side, without having had the classical energy needed to cross. This probability is not fixed. It depends directly on the thickness of the barrier: the thinner the barrier, the more the wave function survives on the other side, and the higher the tunneling probability. At our scale, the barriers are far too thick for this effect to be observable. But at the scale of a few nanometers, the probability exists.

How SSDs Use Quantum Physics

In an SSD, we want to store data. SSDs work with bits, but it is precisely in the management of these bits that the principles of quantum physics come into play. In an SSD, each bit is encoded in cells called floating gates: small zones isolated on all sides by an insulating layer. This box can contain electrons or not:

Box with electrons: Bit = 0
Box without electrons: Bit = 1

If we need:

To write, we need to make electrons enter this isolated box. What we do is apply an electric voltage that deforms the wave function of the electrons and increases their probability of ending up on the other side. The electrons, therefore, cross the barrier via the tunnel effect.
To erase, we apply a reverse voltage, which also impacts the wave function, and the electrons cross in the other direction.
To read, it is a classical, non-quantum measurement: we measure the electric current passing through the transistor. Electrons present → weak current → 0. No electrons → strong current → 1.

We saw, however, that the wave function gives a probability, not a certainty. If we apply a voltage to write or erase, we therefore only have a probability that the electron will cross the barrier. How can an SSD be reliable then?
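To build some intuition before answering, here is a rough sketch of two ideas from this section. The first function uses the textbook thick-barrier approximation for a rectangular barrier, T ≈ exp(−2κd) with κ = √(2m(V − E))/ħ, to show how sharply the tunneling probability depends on thickness. The 3 eV energy gap, the free-electron mass, and the absence of any applied voltage are simplifying assumptions chosen for illustration; a real cell applies a voltage that reshapes the barrier, which this sketch ignores. The second function treats every attempt as independent with the same probability (a binomial model, also an assumption) and reports how much the number of crossings fluctuates relative to its average.

```python
import math

HBAR = 1.054571817e-34            # reduced Planck constant, J*s
ELECTRON_MASS = 9.1093837015e-31  # free-electron mass in kg (assumed; not an oxide value)
ELECTRON_VOLT = 1.602176634e-19   # one electronvolt in joules


def tunneling_probability(thickness_m: float, gap_ev: float = 3.0) -> float:
    """Approximate chance of crossing a rectangular barrier of given thickness.

    gap_ev is the barrier height minus the electron energy (V - E).
    The 3.0 eV default is an illustrative assumption, not a device parameter.
    """
    kappa = math.sqrt(2 * ELECTRON_MASS * gap_ev * ELECTRON_VOLT) / HBAR
    return math.exp(-2 * kappa * thickness_m)


def relative_spread(attempts: int, probability: float) -> float:
    """Standard deviation of the number of crossings divided by its mean,
    assuming independent attempts with identical probability (binomial)."""
    return math.sqrt((1 - probability) / (attempts * probability))


if __name__ == "__main__":
    # Thinner barrier -> exponentially higher tunneling probability.
    for nanometers in (1, 2, 4, 8):
        chance = tunneling_probability(nanometers * 1e-9)
        print(f"{nanometers} nm barrier: probability ~ {chance:.2e}")

    # More attempts -> the outcome becomes far more predictable.
    for attempts in (1_000, 1_000_000):
        print(f"{attempts} attempts: spread ~ {relative_spread(attempts, 0.5):.1e}")
```

In this simplified model, doubling the thickness squares an already tiny probability, while the relative fluctuation shrinks as the number of attempts grows.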
An individual electron is unpredictable, but we never send just one electron. We send millions simultaneously. Statistically, enough of them cross the barrier to charge the floating gate reliably. And after each write, the controller immediately re-reads the cell to verify. If not enough electrons have crossed, it tries again. That is why SSDs embed error correction mechanisms, ECC (Error Correcting Code), precisely because the process is probabilistic by nature.

When a cell exceeds a certain error threshold over time, it is finally marked as defective and taken out of service. The data it held is moved to a healthy cell. That is why SSDs always have an over-provisioning capacity: a reserve of cells invisible to the user, planned from the manufacturing stage to replace defective cells over time. And that is also why an SSD does not fail all at once; it degrades progressively, cell by cell, until the reserve is exhausted.

And this is where quantum physics imposes its limits. The more transistors shrink, the thinner the insulating barriers become, and the more the tunnel effect becomes uncontrollable: electrons escape spontaneously, errors increase, and cells age faster. Moore’s Law, which predicts a doubling of transistor density roughly every two years, is today running up against these fundamental physical limits. This is not an engineering problem: it is quantum physics that sets the boundary.

Matter is made up of atoms, themselves composed of a nucleus (protons and neutrons) and electrons. An atom is almost entirely empty: what we perceive as “solid” is an illusion created by the electromagnetic forces between atoms.
Light is both an electromagnetic wave and a particle called a photon. In flight, it behaves like a wave, but it is emitted and absorbed like a particle, in one single hit, in one single place.
Young’s double-slit experiment proves that light is a wave: it produces an interference pattern, impossible to obtain with classical particles.
Matter behaves in the same way. But unlike light, its wave is not physical: it is a probability wave that describes the possible positions of a particle. This is quantum superposition: an unmeasured particle exists in multiple states simultaneously.
It is not the act of measuring that collapses the superposition: it is the existence of the information somewhere in the universe. If the information is destroyed, the superposition is restored.
Decoherence explains why we never see superposition at our scale: any macroscopic object permanently interacts with its environment, which instantaneously collapses its wave function.
The tunnel effect is a direct consequence of the wave-like nature of particles: the wave function of an electron does not stop abruptly at a physical barrier. There exists a non-zero probability of finding it on the other side, without having had the classical energy to cross.
SSDs exploit the tunnel effect to write and erase data: an electric voltage deforms the wave function of electrons and increases their probability of crossing the insulating barrier of a floating gate. Reliability rests on the large number of electrons sent and on ECC.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

Instruction Pipelining
Simultaneous Multithreading
Linux Soft vs. Hard Lockup
Something Deeply Hidden
We Have No Idea
Quantum Country
The Double-Slit Experiment - Veritasium

Hardware Science


The Coder Cafe 2 months ago

Working on Complex Systems

☕ Welcome to The Coder Cafe! Today, I’m sharing the talk I gave at the Monster SCALE Summit 2026 on working on complex systems. Get cozy, grab a coffee, and let’s begin!

Introduction

If you’ve been a subscriber since mid-2025, you were already here when I published the post that performed best on my newsletter: Working on Complex Systems. I really loved writing it, and I always had in mind to revisit it at some point. So, when someone at ScyllaDB reached out to invite me to speak at Monster SCALE Summit, I saw the perfect opportunity to turn it into a talk. The video isn’t a 1:1 mapping of the original content. I expanded it with more examples and new ideas. In it, I define what complex systems are, then discuss their common characteristics, and finally explore patterns for navigating them. Hope you will enjoy it!

Video

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

Resources

More From the Distributed Systems Category

Latency and User Experience
Probabilistic Increment
Bloom Filters

Monster SCALE Summit - 2026 Tech Talks


The Coder Cafe 2 months ago

3 Bullets and a Call to Action

☕ Welcome to The Coder Cafe! Today, we discuss an efficient communication method presented in the Debugging Teams book called 3 bullets and a call to action. I’ve been using it extensively over the past months, and I can confirm its efficiency. Get cozy, grab a coffee, and let’s begin!

At Google, I recently switched to a new domain: Google Distributed Cloud Connected [1]. Here, all the teams are very busy, and finding an efficient way to communicate over email or chat can be challenging, especially when asking someone to do something. Recently, I came across a simple technique: three bullets and one call to action. The idea is the following:

- Add three bullet points explaining the key context
- Follow with one clear call to action

Let’s look at a concrete example. Suppose you receive the following email:

I recently wrote a design doc on how to save storage in the context of X, where I describe the current problem and the approach we could take to address it. In the document, I go through the main trade-offs involved and explain why the proposal focuses on solution Y in particular. I also included several open questions related to the deployment strategy and some areas where feedback would be especially helpful. It would be great if you could take a look at the document and leave comments by Friday.

Quite a mouthful. It requires a non-trivial amount of brain time to understand both the context and what the person is actually asking for. Now let’s apply the three bullets and a call to action strategy:

- I recently wrote a design doc on how to save storage in the context of X.
- It highlights the main trade-offs and focuses on the solution Y.
- I’ve added open questions around the deployment strategy.

Could you please have a look and leave comments by Friday?

Much better, right? The call to action is clear, and the context is structured around short and easy-to-scan sentences.

Why does it work? When communicating via email or chat, people prefer short and memorable messages that do not require too much cognitive effort to process. Bullet points help break information into smaller chunks, which makes the message easier to scan quickly. Ideally, the bullet points and the call to action should be as short as possible.

Another aspect is that 3 is often a magic number in communication. With 2 items, you often get a contrast. With 3 items, you start to get a small structure or rhythm that is easier for the mind to process. That is one of the reasons why the rule of three appears so often in writing, storytelling, and presentations, where it helps make ideas more engaging and convincing.

Remember: to improve your chances of getting an answer to your request, use 3 short bullets and an efficient call to action.

10 Rules I Learned About Technical Writing
The XY Problem
Don’t Forget About Your Mental Health

Rule of three — Thinking Insights
Rule of three (writing) — Wikipedia
Only 2, sorry about that.

[1] That partially explains why I wasn’t so active with The Coder Cafe these days. It will get better, I promise.


The Coder Cafe 2 months ago

Build Your Own Key-Value Storage Engine—Week 8

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. The conference starts today and lasts two days. Tomorrow, I’m giving a talk called Working on Complex Systems. I’d be glad to see you there 🙂.

Agenda
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable
Week 8: Concurrency

Over this series, you built a working LSM tree: you flush to persist the memtable to disk and compact to reclaim space. Yet, you’ve been single-threaded so far. This week, we lift that constraint: flush and compaction will run in the background while you keep serving requests.

There are many ways to add concurrency. The approach here is to introduce a versioned, ref-counted catalog that lets readers take a stable snapshot while background flush/compaction proceeds. A catalog holds references to:

- The current memtable.
- The current WAL.
- The current MANIFEST.

Each request pins one catalog version for the duration of the operation. When a flush or compaction completes, the system creates a new catalog version. Old resources (e.g., obsolete SSTables) are not deleted immediately. Instead, each catalog tracks a refcount of in-flight requests. Once an old catalog’s refcount drops to zero and a newer catalog exists, you can safely garbage-collect the resources that appear in the old version but not in the new one.

For example, with two catalog versions (red = older version’s elements, blue = newer version’s elements):

[Figure: two catalog versions]

Once we can guarantee that catalog v1 is no longer referenced, we can delete the old MANIFEST, SST-2 and SST-3.
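The pin/release/garbage-collect cycle described above can be sketched as follows. This is a minimal illustration, not the series' reference solution: the names `Catalog`, `CatalogManager`, `acquire`, `release`, and `publish` are assumptions, and so is the `sstables` field (in the series, the SSTable list lives in the MANIFEST). The sketch is also slightly more conservative than the text: it only deletes files that no surviving catalog version references.

```python
import threading
from dataclasses import dataclass


@dataclass(eq=False)  # identity comparison: two catalogs are never "equal"
class Catalog:
    version: int
    memtable: dict
    wal_path: str
    manifest_path: str
    sstables: frozenset = frozenset()  # assumption: normally read from the MANIFEST
    refcount: int = 0                  # number of in-flight requests pinning this version

    def files(self) -> set:
        return {self.wal_path, self.manifest_path, *self.sstables}


class CatalogManager:
    def __init__(self, first: Catalog, delete_file=print):
        self._lock = threading.Lock()
        self._versions = [first]         # oldest first, latest last
        self._delete_file = delete_file  # hypothetical hook; a real engine would unlink the path

    def acquire(self) -> Catalog:
        """Pin the latest catalog for the duration of one request."""
        with self._lock:
            latest = self._versions[-1]
            latest.refcount += 1
            return latest

    def release(self, catalog: Catalog) -> None:
        """Unpin a catalog and collect whatever became unreachable."""
        with self._lock:
            catalog.refcount -= 1
            self._collect()

    def publish(self, memtable, wal_path, manifest_path, sstables) -> Catalog:
        """Install a new catalog version after a flush or a compaction."""
        with self._lock:
            new = Catalog(self._versions[-1].version + 1, memtable,
                          wal_path, manifest_path, frozenset(sstables))
            self._versions.append(new)
            self._collect()
            return new

    def _collect(self) -> None:
        latest = self._versions[-1]
        kept = [c for c in self._versions if c is latest or c.refcount > 0]
        live, dead = set(), set()
        for c in kept:
            live |= c.files()
        for c in self._versions:
            if c is not latest and c.refcount == 0:
                dead |= c.files()
        for path in sorted(dead - live):  # only files no surviving version needs
            self._delete_file(path)
        self._versions = kept             # dropped catalogs also release their memtable
```

A reader that acquired v1 before a compaction keeps seeing SST-2 and SST-3 until it calls `release`; only then are the files that v2 no longer lists handed to `delete_file`.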
Another example: a flush produced a new memtable and WAL file:

[Figure: catalog versions before and after a flush]

In this case, once catalog v1 has no remaining references, we can free the old memtable and delete the old WAL file.

Your Tasks

Catalog

Add a data structure that tracks:

- Memtable reference.
- MANIFEST path.
- Version (monotonic).
- Refcount of active readers.

Implement a manager that keeps catalog versions in memory, with three operations:

- Acquire: Pick the latest catalog. Increment its refcount.
- Release: Decrement the refcount of the catalog. If the refcount is zero and there’s a newer catalog version: remove the current catalog, and remove the elements present in the current catalog but not in the latest version (files, WAL, etc.).
- New version: Create a new catalog based on the provided data. Assign a unique, monotonic version.

At startup:

- Read from the authoritative MANIFEST (latest MANIFEST file).
- Treat any files not listed in MANIFEST as orphans and delete them.
- Read all WAL files you still have on disk, in order, to rebuild the in-memory state.
- Create the current catalog version from the reconstructed state.
- Start the background worker.

In a nutshell, flush and compaction will move to the background. You’ll use internal queues plus worker pools to ensure no overlapping work on the same resources: at most one flush running at a time, and at most one compaction running at a time.

Compaction:

- Keep the same trigger: Every 10,000 update requests.
- Behavior: Do not run compaction in the request path. On compaction trigger: Post a notification to an internal queue and return.
- Worker: A single background thread listens on the queue and runs the actual work. Similar compaction process, except: Do not overwrite the existing WAL file. Instead, create a new file. Create a new catalog that references the new WAL.

Flush:

- Keep the same trigger: When the memtable contains 2,000 entries.
- Behavior: Do not run flush in the request path. On flush trigger: Allocate a new memtable and create a new WAL file for subsequent writes. Post a notification to an internal queue. Return immediately to the caller.
- Worker: A single background thread listens on the queue and runs the actual work. Similar flush process, except: Do not overwrite the existing MANIFEST file. Instead, create a new file. Create a new catalog referencing the new MANIFEST.

For each request:

- Acquire a catalog from the manager.
- Do the operation using paths/refs from that catalog.
- Release the catalog.

Concurrent requests make deterministic assertions harder. For example, suppose the validation file contains the following requests that can run in parallel: What should you assert for : , , or ?

To make validation deterministic, you will handle barriers: all requests before a barrier must finish before starting the next block. You will also relax checks: a GET is valid if it returns any value written for a key before the last barrier.

A similar example with barriers: The first two requests run in parallel. The first barrier waits for both to complete. The first GET should accept either or . The second request should accept only .

The new validation file is a sequence of blocks separated by barrier instructions:

- All the lines between two barriers form a block.
- On a barrier instruction, wait for all in-flight requests in the current block to finish before starting the next block.
- / lines are issued in parallel within their block.
- lines are also issued in parallel within their block.
- means the response must be any one of the list values.

Download and run your client against a new file: concurrency.txt.

When the memtable reaches 80% of capacity:

- Pre-allocate the next memtable in memory.
- Pre-create/rotate to the next WAL on disk.

That's it for the whole series. You implemented a fully functional LSM tree:

- Started with a memtable (hashtable) and a flush that writes immutable SSTables to disk.
- Added a WAL for durability.
- Handled deletes and compaction to reclaim space.
- Introduced leveling and key-range partitioning to speed up reads.
- Switched to block-based SSTables with indexing.
- Added Bloom filters and replaced the memtable with a radix trie for faster lookups.
- Finally, introduced concurrency: a simple, single-threaded foreground path with flush and compaction running in the background.

I hope you had fun building it. Thank you for following the series, and special thanks to our partner, ScyllaDB!

To get more information on how things work in production databases, you can read how RocksDB keeps track of live SST files. The catalog structure is inspired by RocksDB.

Conflict resolution is one aspect we’re missing in the series (maybe as a follow-up?). A versioned catalog is enough for reads, but what about conflicting writes? Suppose two clients, Alice and Bob, update the same key around the same time. A simple policy to resolve conflicts is latest wins. The database can serialize operations for the same key to ensure the latest request wins. In this example, the database ends up with the value of the later request as the latest state.

This approach works with one node. But what about databases composed of multiple nodes? Say the two requests go to two different nodes at roughly the same time. With multiple nodes, the database must resolve conflicts consistently. There are two common ways:

- Coordination via a leader (consensus): Route both writes to the same leader node, which resolves the conflict and determines the end state.
- Reconcile with comparable timestamps: Attach a timestamp to each write and store it with the key. By timestamp, we don’t mean relying on wall-clock time but a logical clock, so that “later” is well-defined across nodes.

If we go with the second approach and start storing a timestamp with each write, we also unlock something production systems use: consistent snapshots. A read can include a timestamp, and the database returns the last version at or before that time, hence providing a consistent view of the data, even while flush/compaction runs in the background.
This pattern has a name: Multi-Version Concurrency Control (MVCC). It involves keeping multiple versions per key instead of only the last one, reading using a chosen point in time, and deleting old versions once they are no longer needed. See how ScyllaDB handles timestamp conflict resolution for more information.
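To make the timestamp-based approach concrete, here is a small sketch of writes tagged with a logical clock and reads served at a chosen point in time. Everything in it is an assumption made for illustration: the post does not say which logical clock to use, so the sketch picks a Lamport-style counter with the node id as a tie-breaker, and `LamportClock` and `VersionedStore` are hypothetical names. Old versions are never pruned here.

```python
from collections import defaultdict


class LamportClock:
    """Logical clock: a counter plus a node id so that timestamps never tie."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> tuple:
        """Timestamp for a new local write."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp: tuple) -> None:
        """Catch up after seeing a timestamp produced by another node."""
        self.counter = max(self.counter, timestamp[0])


class VersionedStore:
    """Keeps every version of a key, each tagged with its logical timestamp."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [(timestamp, value), ...]

    def apply(self, key, timestamp, value) -> None:
        # Arrival order does not matter: reads pick the winner by timestamp.
        self._versions[key].append((timestamp, value))

    def get(self, key, at=None):
        """Latest value overall, or the latest value at or before `at`."""
        candidates = [(ts, v) for ts, v in self._versions.get(key, [])
                      if at is None or ts <= at]
        if not candidates:
            return None
        return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    alice, bob = LamportClock(node_id=1), LamportClock(node_id=2)
    t_alice, t_bob = alice.tick(), bob.tick()

    node_a, node_b = VersionedStore(), VersionedStore()
    # The two nodes receive the writes in opposite orders...
    node_a.apply("k", t_alice, "from-alice")
    node_a.apply("k", t_bob, "from-bob")
    node_b.apply("k", t_bob, "from-bob")
    node_b.apply("k", t_alice, "from-alice")
    # ...yet they agree on the winner, and can still serve an older snapshot.
    print(node_a.get("k"), node_b.get("k"), node_a.get("k", at=t_alice))
```

Because both nodes compare the same `(counter, node_id)` pairs, they converge on the same value without coordinating, and a read with `at=` sees a stable view regardless of later writes.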


The Coder Cafe 3 months ago

Build Your Own Key-Value Storage Engine—Week 7

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable

Over the last few weeks, you refined your LSM tree to introduce leveling. In case of a key miss, the process requires the following steps:

- Lookup from the memtable.
- Lookup from all the L0 SSTables.
- Lookup from one L1 SSTable.
- Lookup from one L2 SSTable.

Last week, you optimized the lookups by introducing block-based SSTables and indexing, but a lookup is still not a “free” operation. Worst case, it requires fetching two pages (one for the index block and one for the data block) to find out that a key is missing in an SSTable. This week, you will optimize searches by introducing a “tiny” level of caching per SSTable. If you’re an avid reader of The Coder Cafe [1], we already discussed a great candidate for such a cache:

- One that doesn’t consume too much memory to make sure we don’t increase space amplification drastically.
- One that is fast enough so that a lookup doesn’t introduce too much overhead, especially if we have to check a cache before making any lookup in an SSTable.

You will implement a cache using Bloom filters: a space-efficient, probabilistic data structure to check for set membership. A Bloom filter can return two possible answers:

- The element is definitely not in the set (no false negatives).
- The element may be in the set (false positives are possible).

In addition to optimizing SSTable lookups, you will also optimize your memtable. In week 2, you implemented a memtable using a hashtable. Let’s get some perspective to understand the problems of using a hashtable:

- A memtable buffers writes. As it’s the main entry point for writes, a write has to be fast. → OK: a hashtable has O(1) average inserts, plus time linear in the length of the key for hashing.
- For reads, doing a key lookup has to be fast. → OK: O(1) average lookups, plus the time to hash the key.
- Doing range scanning operations (week 5, optional work), such as: “Give me the list of keys between bar and foo” → A hashtable, because it’s not an ordered data structure, is terrible: you end up touching everything, so O(n), with n the number of elements in the hashtable.
- Flush to L0 → A hashtable isn’t ordered, so it requires sorting all the keys (O(n log n), with n the number of elements) to produce the SSTables.

Because of these negative points, could we find a better data structure? Yes! This week, you will switch the memtable to a radix trie (see Further Notes for a discussion on alternative data structures).

A trie is a tree-shaped data structure usually used to store strings efficiently. The common example to illustrate a trie is to store a dictionary. For example, suppose you want to store two words, one of four letters and one of five. Even though the second word starts with the same four letters as the first, you need to store a total of 4 + 5 = 9 letters. Tries optimize the storage required by sharing prefixes. Each node stores one letter. Here’s an example of a trie storing these two words in addition to the word foo (marked nodes represent the end of a word):

[Figure: trie storing the three words]

As you can see, we didn’t duplicate the first four letters to store the longer word. In this very example, instead of storing 9 letters for the two words, we stored only five letters.

Yet, you’re not going to implement a “basic” trie for your memtable; instead, you will implement a compressed trie called a radix trie (also known as a Patricia trie). Back to the previous example, storing one node (one square) has an overhead. It usually means at least one extra field to store the next element, usually a pointer.
In the previous example, we needed 11 nodes in total, but what if we could compress the number of nodes required? The idea is to combine nodes with a single child:

[Figure: the same trie with single-child nodes merged]

This new trie stores the exact same information, except it requires 6 nodes instead of 11. That’s what radix tries are about.

To summarize the benefits of switching a memtable from a hashtable to a radix trie:

- Ordered by design: Tries keep keys in order and make prefix/range lookups natural, which helps for range scans and for streaming a sorted flush.
- No rebalancing/rehashing pauses: The shape doesn’t depend on insertion order, and operations don’t need rebalancing; you avoid periodic rehash work.
- Prefix compression: A radix trie can cut duplicated key bytes in the memtable, reducing in-memory space.

Let’s size the Bloom filter. You will target:

- p (false-positive rate) = 1%
- n (max elements per SSTable) = 1,953
- k (hash functions) = 5

Using the formula from the Bloom Filters post, we get ≈ 19,230 bits, i.e., 2,404 B. We will round up to 2,496 B (39 × 64 B), so the bitset is a whole number of cache lines.

NOTE: Using k = 7 would shave only ~2–3% space for ~40% more hash work, so k = 5 is a good trade-off.

To distribute elements across the bitvector, you will use xxHash64 with two different constant seeds to get two base hashes, then derive k indices by double hashing.

The required changes to introduce Bloom filters:

- Startup: For each SSTable in the MANIFEST, cache its related Bloom filter in memory. Since each Bloom filter requires only a small amount of space, this optimization has a minimal memory footprint. For example, caching 1,000 Bloom filters of the type you designed requires less than 2.5 MB of memory.
- SSTable creation: For each new SSTable you write, initialize an empty bitvector of 2,496 B. Build the Bloom filter in memory as you emit the keys (including tombstones): compute the k indices based on the key and, for each index, set the bit at that position. When the SSTable is done, persist the Bloom filter as a sidecar file next to it. Update the cache containing the Bloom filters.
- Compaction: Delete from memory the Bloom filters corresponding to deleted SSTables.
- Lookup: Before reading an SSTable, compute the k indices based on the key. If all the bits at those indices are set, the key may be present; therefore, proceed with your normal lookup in the SSTable. Otherwise, skip this SSTable.

Now, let’s replace your hashtable with a trie. Each node holds:

- A compressed edge fragment.
- A map keyed by the next character after the fragment to a child node.
- A state, an enum with three possible values: prefix (the node is just a prefix, no full key ends here), value (a full key exists at this node), and tombstone (this key was explicitly deleted).
- If the state is value, the corresponding value.

The root is a sentinel node with an empty fragment.

Put:

- Walk from the root, matching the longest common prefix against each edge fragment.
- If there is a partial match in the middle of an edge, split once: create a parent with the common part and two children, the old suffix and the new suffix.
- Descend via the next child (next unmatched character).
- At the terminal node: set the state to value and store the value.

Get:

- Walk edges by longest-prefix match. If an edge doesn’t match, return not found.
- At the terminal node: if a full key exists there, return its value. If the node is only a prefix or a tombstone, return not found.

Delete:

- Walk as in put. If the path doesn’t fully exist, create the missing suffix nodes so that a terminal node exists.
- At the terminal node: set the state to tombstone (you may have to clear the value).

Flush process, using an in-order traversal:

- Value: Emit the key and its value.
- Tombstone: Emit tombstone.
- Prefix: Emit nothing.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Use per-SSTable random seeds for the Bloom hash functions. Persist them in the Bloom filter files.

In Bloom Filters, you introduced blocked Bloom filters, a variant that optimizes spatial locality by:

- Dividing the Bloom filter into contiguous blocks, each the size of a cache line.
- Restricting each query to a single block to ensure all bit lookups stay within the same cache line.

Switch to blocked Bloom filters and see the impacts on latency and throughput.

If you implemented the range scan operation from week 5 (optional work), wire it to your memtable radix trie.

That’s it for this week! You optimized lookups with per-SSTable Bloom filters and switched the memtable to a radix trie, an ordered data structure. Since the beginning of the series, everything you built has been single-threaded, and flush/compaction remains stop-the-world. In two weeks, you will finally tackle the final boss of LSM trees: concurrency.

If you want to dive more into tries, Trie Memtables in Cassandra is a paper that explains why Cassandra moved from a skip list + B-tree memtable to a trie, and what it changed for topics such as GC and CPU locality.

A popular variant of radix trie is the Adaptive Radix Tree (ART): it dynamically resizes node types based on the number of children to stay compact and cache-friendly, while supporting fast in-memory lookups, inserts, and deletes. This paper (or this summary) explores the topic in depth.

You should also be aware that tries aren’t the only option for memtables, as other data structures exist. For example, RocksDB relies on a skip list. See this resource for more information.

About Bloom filters, some engines keep a Bloom filter not only per SSTable but per data-block range as well. This was the case for RocksDB’s older block-based filter format (source). RocksDB later shifted toward partitioned index/filters, which partition the index and full-file filter into smaller blocks with a top-level directory for on-demand loading. The official doc delves into the new approach.
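For readers who want something concrete for the radix trie memtable, here is a minimal sketch of the put, get, delete, and in-order traversal steps listed in the tasks. It is one possible shape under assumptions: the names `State`, `Node`, and `RadixTrie` are made up for this example, keys are plain strings, and nothing here is tuned for memory or speed.

```python
from enum import Enum


class State(Enum):
    PREFIX = 0     # only a shared prefix, no full key ends here
    VALUE = 1      # a full key exists at this node
    TOMBSTONE = 2  # the key was explicitly deleted


class Node:
    def __init__(self, fragment=""):
        self.fragment = fragment  # compressed edge fragment
        self.children = {}        # first character of a child's fragment -> child
        self.state = State.PREFIX
        self.value = None


def _common_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixTrie:
    def __init__(self):
        self.root = Node()  # sentinel with an empty fragment

    def _terminal(self, key: str) -> Node:
        """Return the node where `key` ends, creating or splitting edges as needed."""
        node, rest = self.root, key
        while rest:
            child = node.children.get(rest[0])
            if child is None:              # no edge: hang the whole suffix here
                child = Node(rest)
                node.children[rest[0]] = child
                return child
            common = _common_prefix_len(child.fragment, rest)
            if common < len(child.fragment):
                # Partial match inside an edge: split it once.
                parent = Node(child.fragment[:common])
                child.fragment = child.fragment[common:]
                parent.children[child.fragment[0]] = child
                node.children[rest[0]] = parent
                child = parent
            node, rest = child, rest[common:]
        return node

    def put(self, key: str, value) -> None:
        node = self._terminal(key)
        node.state, node.value = State.VALUE, value

    def delete(self, key: str) -> None:
        node = self._terminal(key)  # a tombstone must exist even for unseen keys
        node.state, node.value = State.TOMBSTONE, None

    def get(self, key: str):
        node, rest = self.root, key
        while rest:
            child = node.children.get(rest[0])
            if child is None or not rest.startswith(child.fragment):
                return None
            node, rest = child, rest[len(child.fragment):]
        return node.value if node.state is State.VALUE else None

    def items(self):
        """Yield (key, state, value) in lexicographic key order, skipping pure prefixes."""
        def walk(node, prefix):
            key = prefix + node.fragment
            if node.state is not State.PREFIX:
                yield key, node.state, node.value
            for ch in sorted(node.children):
                yield from walk(node.children[ch], key)
        yield from walk(self.root, "")
```

A flush can iterate `items()` directly: value entries become key-value records, tombstones are emitted as tombstones, and prefix-only nodes never show up, so no sorting step is needed.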
[1] I’m sure you are.


The Coder Cafe 4 months ago

Build Your Own Key-Value Storage Engine—Week 6

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda

- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations
- Week 3: Durability with Write-Ahead Logging
- Week 4: Deletes, Tombstones, and Compaction
- Week 5: Leveling and Key-Range Partitioning
- Week 6: Block-Based SSTables and Indexing

In week 2, you used JSON as the SSTable format. That works for document databases, but the overhead of this serialization format makes it a poor choice for your storage engine:

- Best case: You stream the file and linearly scan entries until you find the key, but a miss means scanning the entire file.
- Worst case: You read the whole file and parse everything, then search for the key.

This week, you will switch to block-based SSTables. Data will be chunked into fixed-size blocks designed to fit within a single disk page. The main benefits:

- Efficient I/O: Each lookup can fetch a complete block with a single page read.
- Predictable latency: Since every block maps to exactly one page, each read involves a fixed, bounded amount of I/O, improving latency consistency.
- Smaller on disk: Binary encoding typically compresses better than JSON.
- Integrity: Per-block checksums detect corruption at the granularity of a single block, without re-reading the whole file.
- Caching: Hot SSTable blocks can be kept in an in-memory block cache to reduce I/O and decompression overhead.

Alongside the data blocks, you will maintain a small index that stores the first key of each block, from which the block’s offset can be derived. This allows lookups to jump directly to the relevant block without scanning all of them.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server.

Fixed 64-byte keys and values: This removes much of the logic otherwise needed to keep blocks at a fixed size, making the implementation easier to write and reason about. Because of the week 1 assumption (keys are lowercase ASCII strings), each character is one byte, which also makes the implementation easier.

A block-based SSTable will be composed of:

- One index block (first 4 KB page)
- Multiple data blocks (each 4 KB)

Each block has a fixed size of 4 KB. Aligning blocks to 4 KB means a disk read can fetch a block in one page. If blocks are not aligned, a read may span two pages. At a glance, the file is the index block followed by the data blocks, written contiguously.

The layout of an index block (4 KB):

- A block count (1 B): the number of data blocks in the SSTable.
- A set of key entries (64 B each), each being the first key of the corresponding data block. Entries are sorted by key and used to decide which block to fetch during a lookup.
- Padding up to 4 KB.

To make the index fit into a single 4 KB page, it must contain at most 63 entries: 1 + 63 × 64 = 4,033 bytes fit in 4,096 bytes, whereas 64 entries would need 4,097 bytes. Note that this is a binary layout, not a text one.

NOTE: If you’re not familiar with the concept of padding: it’s filling unused bytes (here with 0x00) so fields and blocks have fixed sizes.

The block count has a value between 0 and 63. If you encoded 63 as text, you would need two bytes (the characters “6” = 0x36 and “3” = 0x33). Instead, you can store it as a binary integer so it fits in one byte: 0x3F.

For example, an SSTable with three data blocks has three index entries: the first key of block 0, the first key of block 1, and the first key of block 2.

You don’t need to store per-block offsets. Because the index is stored on a 4 KB page and every data block is exactly 4 KB and written contiguously, offsets can be calculated from the block number (numbering starts at 0):

- Block 0 starts at offset 4096.
- Block 1 starts at offset 8192.
- Block 2 starts at offset 12288.

Now, let’s focus on data blocks. In addition to the key-value entries, reserve 8 bytes at the start of the block to store a CRC computed over the rest of the block (the entry count, the padding, and all entries); this lets you verify data integrity on read.

The layout of a data block (4 KB per block):

- Header (128 B):
  - CRC (8 B): A checksum computed over bytes [8..4096). You can choose any standard variant (e.g., CRC-64/ECMA-182).
  - Entry count (1 B): The number of entries in this block (0..31).
  - Padding (119 B).
- Entries area (31 × 128 B = 3,968 B); each entry is:
  - Key (64 B, right-padded).
  - Value (64 B, right-padded).

The last data block may contain fewer than 31 entries, but always pad with zeros to reach exactly 4 KB. This guarantees one-page reads and prevents errors across read modes (e.g., with mmap). As with the index block, this is a binary layout.

Note that because the index block holds at most 63 key entries, an SSTable can have at most 63 data blocks. With 31 entries per block, that caps an SSTable at 63 × 31 = 1,953 entries.

A tombstone is represented by a value of 64 bytes all set to 0x00. Due to this sentinel, the all-zero value is reserved and cannot be used as an application value from this week onward.

Searching for a value doesn’t change (memtable → L0 → L1, etc.). What changes is how you read one SSTable (remember: from L1, you only need to read one SSTable per level because of non-overlapping key ranges). The process to read from an SSTable:

1. Binary search the index to find the largest first key that is ≤ the searched key; its position in the index is the block number. If there is none (i.e., the searched key sorts before the first index key), return a miss for this SSTable.
2. Compute the block offset: 4096 × (block number + 1).
3. Fetch the corresponding 4 KB block.
4. Verify the CRC before using the block: compute CRC64 over bytes [8..4096) and compare it with the 8-byte CRC stored at offset 0..7. If it doesn’t match, fail the read for this SSTable.
5. Binary search the entries in the block for the key.
6. Return the corresponding value or a miss.

Last week, you split at 2,000 entries during the compaction process. This week, because a single SSTable is limited to 1,953 entries, change the split threshold to 1,953.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Drop the 64-byte constraint: store a length-prefixed key and value per entry (short header with key length and value length). Keep entries sorted and include the lengths in your checksum.

Tombstones are currently represented by a sentinel value (a 64-byte all-zero value), which prevents storing an actual empty value. Instead, avoid reserving any value for deletes: add an explicit entry type per record (value or tombstone).

Now that the format is binary, compression becomes more effective and saves more space. As an optional task, compress each data block independently so lookups still touch only one block:

- Record each block’s offset and compressed size in the index.
- Read just those bytes, decompress, and search.

This packs more logical blocks into each cached page, raising cache hit rates, reducing pages touched during scans, and smoothing read latency.

That’s it for this week! You implemented block-based SSTables and indexing, gaining benefits like more efficient I/O and reduced write amplification. In two weeks, you will focus on improving read performance by adding a layer that can tell whether an SSTable is worth parsing, and say goodbye to your hashtable-based memtable, replacing it with a more efficient data structure.

For a production-grade implementation of block-based SSTables, see RocksDB’s block-based SSTable format. It details block layout, per-block compression, and how the index stores offsets and sizes.
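To make this week’s on-disk format more concrete, here is a minimal Python sketch of the index block, the data block, and the six-step read path described earlier in this post. It is only an illustration under stated assumptions, not the series’ reference solution: the function names (`encode_sstable`, `lookup`, and so on) are hypothetical, the CRC is stored big-endian (the post doesn’t specify a byte order), and the whole SSTable is held in memory as a `bytes` object instead of being read page by page.

```python
import bisect
import struct

PAGE = 4096                 # size of the index block and of every data block
KEY_LEN = VAL_LEN = 64      # fixed-size keys and values
MAX_BLOCKS = 63             # index entries that fit in one page
MAX_ENTRIES = 31            # entries per data block
HEADER = 128                # CRC (8) + entry count (1) + padding (119)
POLY = 0x42F0E1EBA9EA3693   # CRC-64/ECMA-182 polynomial
MASK = 0xFFFFFFFFFFFFFFFF


def crc64(data: bytes) -> int:
    """Bitwise CRC-64/ECMA-182: MSB-first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            crc = ((crc << 1) ^ POLY) & MASK if crc & (1 << 63) else (crc << 1) & MASK
    return crc


def pad(raw: bytes, size: int) -> bytes:
    """Right-pad with 0x00 up to a fixed size."""
    return raw.ljust(size, b"\x00")


def encode_data_block(entries) -> bytes:
    """entries: sorted (key, value) byte pairs, at most 31 of them."""
    body = bytes([len(entries)]) + bytes(119)          # entry count + padding
    for key, value in entries:
        body += pad(key, KEY_LEN) + pad(value, VAL_LEN)
    body = pad(body, PAGE - 8)                         # bytes [8..4096)
    return struct.pack(">Q", crc64(body)) + body       # big-endian CRC is an assumption


def encode_sstable(entries) -> bytes:
    """One index block followed by contiguous data blocks."""
    chunks = [entries[i:i + MAX_ENTRIES] for i in range(0, len(entries), MAX_ENTRIES)]
    assert len(chunks) <= MAX_BLOCKS, "an SSTable holds at most 63 x 31 = 1,953 entries"
    index = bytes([len(chunks)]) + b"".join(pad(c[0][0], KEY_LEN) for c in chunks)
    return pad(index, PAGE) + b"".join(encode_data_block(c) for c in chunks)


def lookup(sstable: bytes, key: bytes):
    """Return the 64-byte value (all zeros means tombstone), or None on a miss."""
    target = pad(key, KEY_LEN)
    count = sstable[0]
    first_keys = [sstable[1 + i * KEY_LEN:1 + (i + 1) * KEY_LEN] for i in range(count)]
    block_no = bisect.bisect_right(first_keys, target) - 1   # largest first key <= key
    if block_no < 0:
        return None                                          # key sorts before the first block
    offset = PAGE * (block_no + 1)                           # the index occupies page 0
    block = sstable[offset:offset + PAGE]
    (stored_crc,) = struct.unpack(">Q", block[:8])
    if stored_crc != crc64(block[8:]):
        raise ValueError("checksum mismatch: fail the read for this SSTable")
    n = block[8]
    keys = [block[HEADER + j * 128:HEADER + j * 128 + KEY_LEN] for j in range(n)]
    j = bisect.bisect_left(keys, target)
    if j == n or keys[j] != target:
        return None
    start = HEADER + j * 128 + KEY_LEN
    return block[start:start + VAL_LEN]
```

Comparing the zero-padded keys directly preserves lexicographic order here because 0x00 sorts before every lowercase ASCII letter. A real implementation would read only the index page and the one selected data page from disk.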
ScyllaDB’s SSTables v3 docs are another useful reference. ScyllaDB maintains a small in-memory summary of sampled keys to narrow the search, then uses the on-disk index to locate the exact block. This provides a nice contrast to our single-page index and illustrates how to scale when SSTables grow large. For a deeper look at how things work in practice in terms of directory structure, you can explore the ScyllaDB SSTables directory structure, which shows how metadata and data are organized on disk.

Regarding CRC read failures, we mentioned that a checksum mismatch should simply cause the read to fail for that SSTable. In real systems, databases rely on replication to handle corruption. When multiple replicas exist, a system can recover by using data from an intact replica if one becomes corrupted or unavailable. Upon detecting a checksum mismatch, the system discards the corrupt replica and rebuilds it from a healthy one. This approach only works as long as a valid replica exists, which is why frequent checksum verification is critical: it ensures corruption is caught and repaired as early as possible, before it propagates.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.


The Coder Cafe 4 months ago

Build Your Own Key-Value Storage Engine—Week 5

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. I’ll also give a talk there, so feel free to join!

Agenda

- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations
- Week 3: Durability with Write-Ahead Logging
- Week 4: Deletes, Tombstones, and Compaction
- Week 5: Leveling and Key-Range Partitioning

Last week, you implemented deletion and compaction, making sure the LSM tree wouldn’t grow indefinitely. Still, there’s a weak spot: in the worst-case scenario (e.g., on a key miss), a single read has to scan all SSTables. To address this, you will implement leveling, a core idea in LSM trees.

Instead of a single flat list of SSTables, leveling stores data across multiple levels: L0, L1, L2, etc.

- L0 gets compacted to L1 and makes space for future memtable flushes.
- L1 gets compacted to L2 and makes space for L0 compaction.
- L2 gets compacted to L3 and makes space for L1 compaction.
- And so on: each level gets compacted into the next one, which makes space for the compaction of the level above it.

This process is called level compaction.

Something important to understand: L0 is slightly different from all the other levels. L0 is created during memtable flushes. If a key already exists at L0 and also in the memtable, the next flush can write that key again to a new L0 file. In other words, L0 can have overlapping keys.

For all the other levels (L1 to LN), that’s not the case. They are created by compaction, which removes duplicates and produces non-overlapping key ranges. In this week’s simplified design, an L0 to L1 compaction takes all SSTables from L0 and L1, performs a k-way merge, then rewrites L1 fully. As a result, each key appears at most once per level from L1 downward.

What’s the consequence of non-overlapping keys? You can improve lookups using a simple range-to-file mapping, where each key range is stored in exactly one SSTable: keys within a first range live in one SSTable, keys within the next range live in another, and so on.
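The range-to-file mapping can be sketched in a few lines of Python. This is only an illustration: the key ranges and file names below are hypothetical, and the half-open ranges (lower bound included, upper bound excluded) are an assumption for the sketch. Because the ranges don’t overlap, a binary search over the lower bounds selects at most one candidate file.

```python
import bisect

# Hypothetical L1 layout: (lower bound included, upper bound excluded, file name).
L1_RANGES = [
    ("a", "h", "sst-1"),
    ("h", "p", "sst-2"),
    ("p", "{", "sst-3"),  # "{" is the ASCII character right after "z"
]
LOWER_BOUNDS = [low for low, _, _ in L1_RANGES]


def find_sstable(key):
    """Return the only L1 file that may contain key, or None if no range covers it."""
    i = bisect.bisect_right(LOWER_BOUNDS, key) - 1  # last range whose lower bound <= key
    if i < 0:
        return None
    _, high, name = L1_RANGES[i]
    return name if key < high else None
```

With this layout, `find_sstable("foo")` selects `sst-1`, and a key equal to a boundary such as `"h"` belongs to the range that starts there (`sst-2`).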
With non-overlapping key ranges, a read checks only one SSTable per level from L1 to LN. L0 is the exception due to overlaps, so a read may still need to scan all L0 SSTables.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server.

Limit the number of levels to two:

- L0, which may contain overlapping keys.
- L1, no overlapping keys.

Create a folder for each level (one for L0 and one for L1). Keep one global MANIFEST file at the root.

You will create a MANIFEST layout for both L0 and L1:

- L0 remains a simple list of SSTables.
- L1 allows key-range partitioning.

For example, L1 can be composed of three SSTables, each owning one key range: keys between the range’s lower bound (included) and its upper bound (excluded) live in that SSTable.

The main goal of the compaction process is to compact both L0 and L1. At the end, you should merge all the data from L0 and L1 into L1. L0 will be left empty. When L0 reaches five full SSTable files (2,000 entries each), run an L0 → L1 compaction:

1. Open iterators on all L0 and L1 SSTables.
2. Apply the k-way merge algorithm:
   - Comparator: the primary criterion is the key. To break ties (equal keys), prefer L0 over L1; at L0, prefer the newest SSTable.
   - Version order: any record from L0 is newer than records from L1. Within L0, newer files win (same as week 4).
   - Keep at most one record per key (newest wins).
   - Tombstones: because L1 is the bottom level, drop a tombstone if no older value for that key remains in the merge result.
3. Create new L1 SSTables with at most 2,000 entries. When naming new L1 files, make sure the names are unique, i.e., that they don’t collide with files already present in L1.
4. Publish atomically: fsync each new file, then fsync the L1 directory. Update the MANIFEST atomically, fsync the MANIFEST file, then fsync the root directory (the directory containing the MANIFEST file and the L0 and L1 folders).
5. Clean up: delete obsolete L1 files, then fsync the L1 directory. Delete all files in L0, then fsync the L0 directory.

The flush logic is unchanged from previous weeks. The only difference is that flush writes to L0 and updates the L0 section of the MANIFEST file.

The read path becomes:

1. Check the memtable.
2. If not found, scan all L0 files newest to oldest using the L0 section of the MANIFEST.
3. If not found at L0, use the L1 section of the MANIFEST to choose the one shard that contains the key’s range, then read only that L1 file.
4. Return the value if found; otherwise, report the key as not found.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Introducing leveling has a fundamental impact on deletions. With a single level, compaction sees all versions of every key at once, so a tombstone can be dropped as soon as it has “killed” every older record for that key. Yet, the rule we mentioned last week holds true: a tombstone can be evicted only after all data it shadows no longer exist on disk. With multiple levels, compaction must propagate tombstones downward. It’s only at the bottommost level that tombstones can be dropped, because only there can you prove they no longer shadow any other records.

As an optional task, make the number of levels configurable (L0, L1, …, LN):

- Define a size ratio so each level has a target size larger than the previous one.
- Keep one directory per level.
- Keep a single global MANIFEST.
- When a level reaches its max number of SSTables (derived from the size ratio), compact that level into the next.
- Only drop tombstones at the bottommost level LN. At any intermediate level, propagate the tombstone downward during compaction.

As another optional task, implement a range scan: return all keys between a start key (included) and an end key (excluded). Use put-delete-scan.txt to validate that your changes are correct. It introduces a new keyword for scans: each scan line gives the start key (included), the end key (excluded), and the expected keys, which are always sorted.

NOTE: If the scan route conflicts with the existing single-key route, rename the single-key route.

That’s it for this week! Your LSM tree is taking shape. You implemented leveling, a key LSM design idea, and refined compaction so reads are tighter and storage stays under control. In two weeks, you will revisit the week 2 choice of JSON for SSTables. You will switch to block-based SSTables to reduce parsing and I/O overhead and add indexing within each SSTable.

We mentioned that, because of key overlaps, a read may still need to scan all L0 SSTables (e.g., on a key miss). This is the main reason why L0 is typically kept small. In general, each level is larger than the one above it by a fixed size ratio (e.g., 10×). Some databases even use less static mechanisms. For instance, RocksDB relies on Dynamic Leveled Compaction, where the size of each level is automatically adjusted based on the size of the oldest (last) level, eliminating the need to define each level’s size statically.

Regarding compaction, you should know that in real-world databases, it isn’t done in batch mode across all data. Let’s understand why. Suppose you have four levels and a layout like this for one key:

- The key exists at L3.
- The key doesn’t exist at L2.
- The key is updated at L1.
- A tombstone is placed at L0.

You can’t compact L0 with L1/L2/L3 in one shot; that would mean checking every SSTable against every level. What happens in reality is that compaction is a promotion process. In our example, the tombstone at L0 is promoted to L1. Implementations ensure that it either (a) is compacted together with the L1 SSTable it shadows, or (b) waits until that L1 data is promoted to L2. The same rule repeats level by level, until the tombstone reaches L3 and finally removes the shadowed value.

Meanwhile, it’s essential to understand that compaction is crucial in LSM trees. Let’s take some perspective to understand the reason. An LSM tree buffers writes in a memtable and flushes to L0. Compaction merges SSTables across levels to control read amplification. If compaction falls behind, L0 files accumulate, flushes slow down (or stall at file-count thresholds), write latency climbs, and in the worst case, you can observe write pauses. This is not because the memtable is “locked,” but because the engine can’t safely create more L0 files until compaction catches up.
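As an illustration of the merge rules used in this week’s L0 → L1 compaction, here is a minimal Python sketch of the k-way merge. It is a simplified model, not the series’ reference solution: SSTables are plain in-memory sorted lists of `(key, value)` pairs, `None` stands in for a tombstone, and the `compact` function name and the `max_entries` parameter are hypothetical. File naming, fsync, and the manifest update are left out.

```python
import heapq

TOMBSTONE = None  # hypothetical in-memory marker for a deleted key


def compact(l0_tables, l1_tables, max_entries=2000):
    """Merge all L0 and L1 tables into new L1 tables.

    l0_tables must be ordered newest first. Each table is a list of
    (key, value) pairs sorted by key, with unique keys inside a table.
    """
    # A lower rank means a newer source: every L0 table outranks every L1 table.
    sources = list(l0_tables) + list(l1_tables)
    streams = [
        [(key, rank, value) for key, value in table]
        for rank, table in enumerate(sources)
    ]

    merged = []
    last_key = object()  # sentinel that never equals a real key
    # heapq.merge yields records sorted by (key, rank). Since (key, rank) pairs
    # are unique, the comparison never reaches the value field.
    for key, _rank, value in heapq.merge(*streams):
        if key == last_key:
            continue  # an older version of a key that was already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
        # Otherwise the newest record is a tombstone: L1 is the bottom level
        # and the whole level is rewritten, so nothing older survives and the
        # tombstone itself can be dropped.

    # Split the merged run into new L1 tables of at most max_entries entries.
    return [merged[i:i + max_entries] for i in range(0, len(merged), max_entries)]
```

Because the output is a single sorted run cut into consecutive chunks, the resulting L1 tables have non-overlapping key ranges by construction.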
The cost of compaction falling behind is one of the reasons why the RUM conjecture we introduced last week is important:

- If you compact too eagerly, you burn a lot of disk I/O and lose the LSM’s write advantage.
- If you compact too lazily, you incur a penalty on your read path.
- If you compact everything all the time, you incur a space-amplification penalty during compaction roughly equal to the working set size.

Because compaction is so important, most key-value stores support parallel compactions across levels (except L0 → L1, which isn’t parallelized due to overlapping key ranges in L0).

You should also be aware that ongoing research keeps improving compaction. For example, the SILK: Preventing Latency Spikes in LSM Key-Value Stores paper analyzes why LSM systems can exhibit high tail latency. The main reason is that limited I/O bandwidth causes interference between client writes, flushes, and compactions. The key takeaway is that not all internal operations are equal. The paper explores solutions such as:

- Bandwidth awareness: Monitor client I/O and allocate the leftover bandwidth to internal work dynamically instead of relying on static configuration.
- Prioritization: Give priority to operations near the top of the tree (flushes and L0 → L1 compaction). Slowdowns there create backpressure that impacts tail latency more than work at deeper levels.

Last but not least, what you implemented this week is called level compaction. Other strategies like tiered compaction exist, which merge SSTables based on their size and count rather than fixed levels. You can explore this great resource from Mark Callaghan, which dives deeper into the design trade-offs and performance characteristics of different compaction strategies in LSM trees.


The Coder Cafe 4 months ago

The Cold Start Problem

☕ Welcome to The Coder Cafe! I sucked at product management 1 . Early in my career, I was only passionate about how a product works under the hood, not about what the product actually does. Over time, I began to change and open up. Today, I want to share a concept from a book recommended on X by that I really loved: The Cold Start Problem . Get cozy, grab a coffee, and let’s begin! Network Effects Let’s consider a dating app. If there are only three people who installed the app in New-York, anyone new will probably try it for a few seconds and uninstall it. But if a big part of the city is on it, then someone single will probably stick around. Said differently, the more people use the product, the more valuable it becomes. There’s a term for that type of product, and it’s called the network effect: when a product gets more valuable as more people use it . Delving into the network effect, it’s not a single force but actually a trio of forces: The acquisition effect : How a product can use its own network to attract new people. The engagement effect : The more people join, the more useful and sticky it becomes. The economic effect : A larger network reduces costs, improves monetization, and strengthens the business model. Now comes the real challenge. If our product relies on network effects, how do we launch it? Do we start from zero, or wait until enough people are on board? It’s a chicken-and-egg problem, and that’s the cold start problem. One common mistake to solve the cold start problem is the big bang launch: releasing to everyone before any community exists. Google+ is the perfect example. Launched in 2011, it was Google’s attempt at a social network with a Facebook-style feed. The problem was that when people joined, they found empty timelines and left. Google later admitted that 90% of user sessions lasted less than five seconds. At one point, Google even tied YouTube comments to Google+, requiring an account just to comment. 
The platform eventually reached more than 500 million users, but the issue was never sign-ups. The real problem was that a newcomer’s first session didn’t feel like walking into a lively room. It was forced growth instead of real networks. Google+ didn’t fail because Google couldn’t build a social network. It lost because it never created a place where a new user could land in a live network. In short, Google+ failed to solve the cold start problem. A solution to the cold start problem was applied by Tinder, and it involves focusing on the concept of atomic networks. Back then, dating apps were not very popular. Yet, Tinder was ambitious and wanted to succeed in that market. Instead of launching worldwide, they did the complete opposite. They organized a party at a college that required installing the application to attend. The next day, most students there had Tinder installed. This college was an atomic network: the smallest self-sustaining cluster of users where network effects actually work. Soon after, they repeated the same process in another college in the same city, with the same result: within days, most students there had joined Tinder. What’s powerful about atomic networks is simple: if we can build one network, we can build two. If we can build two, we can build thousands. Tinder repeated this strategy college by college, then city by city, eventually growing into entire countries. The takeaway is that when we start a product with network effects, the first step is to build a single, tiny network that’s self-sustaining on its own. We just want to get started. If we can create one stable, engaged network, then we can build a second one next to it. From there, we can replicate the process and eventually connect them into one large network that spans the whole market. Network effects happen when a product gets more valuable as more people use it. How do we launch such a product with zero users? That’s the cold start problem. 
Big bang launches fail when newcomers find empty networks. One effective approach is to build atomic networks: the smallest self-sustaining clusters where network effects work. If we can build one atomic network, we can repeat it and scale across a market. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. Don’t Forget About Your Mental Health The XY Problem Lateral Thinking The Cold Start Problem Project Strobe: Protecting your data, improving our third-party APIs, and sunsetting consumer Google+ Google+: Communities and photos ❤️ If you enjoyed the post, please consider giving it a like. 💬 The book is really great and covers many more aspects. I definitely recommend it. Are you into product management? What resources would you recommend? Right, Afroditi?

Business

1 views

The Coder Cafe 5 months ago

Build Your Own Key-Value Storage Engine—Week 4

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction

Over the past few weeks, you built an LSM tree with three main components: a memtable, SSTables, and a WAL that records the same operations you keep in the memtable. To prevent on-disk data from growing forever, you will implement compaction, a critical process in LSM trees. Compaction periodically merges SSTables to reclaim space and keep read performance predictable. For example, if the same key exists in every SSTable on disk, compaction drops the duplicates and keeps only the newest record.

In addition, you will implement a DELETE endpoint. Handling deletes in an LSM tree isn’t straightforward at all: SSTables are immutable. To preserve the append-only nature of LSM trees, deletions are written as tombstones: markers indicating a key was logically deleted. You write the tombstone to the WAL, keep it in the memtable, and propagate it during flush. How should compaction work in the presence of tombstones?
Suppose you have the following SSTables on disk: the key 1234 exists in SST-1 (1234 = foo), doesn’t exist in SST-2, exists in SST-3 (1234 = bar), and is deleted at SST-4 (a tombstone). [Figure: four vertically stacked boxes labeled “SSTable 1” to “SSTable 4”, containing “1234 = foo”, “Key 1234 doesn’t exist”, “1234 = bar”, and a tombstone for 1234.] If the key doesn’t
exist in the memtable, the current state for this key is deleted.

Now, imagine that during compaction, you merge SST-3 and SST-4 into a new SSTable, SST-5. As the key is marked as deleted in the newest SSTable, you may decide to drop the tombstone, as it hides the key in SST-3. A lookup for the key would now do:
- Key doesn’t exist in the memtable → Continue.
- Key doesn’t exist in SST-5 → Continue.
- Key doesn’t exist in SST-2 → Continue.
- Key exists in SST-1 → Return foo (instead of not found).

The fundamental rule is the following: during compaction, a tombstone can be evicted only after all data it shadows no longer exists on disk. Otherwise, dropping a tombstone too early can make an old value reappear. This is known as data resurrection: a key that “comes back to life” after a deletion.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server.

Flush and compaction should be single-threaded and stop-the-world operations: do not serve client requests until the operations complete.

To handle a delete request:
- Append a tombstone record to the WAL file.
- Update the memtable: do not remove the key directly; mark it deleted with a tombstone.
- Acknowledge the request.

During flush, carry tombstones into the new SSTable using a new field that marks a record as deleted. The keys must remain sorted.

The goals of the compaction process for this week are the following:
- For each key, keep only the newest record.
- Drop records hidden by newer versions. This is where merging happens: the newest record wins, and older versions are evicted.
- Drop tombstones when no older value remains.

The compaction trigger is: every 10,000 update requests (PUT and DELETE, not GET), compact all SSTables.

Algorithm (k-way merge using a min-heap on key):
- Open an iterator for each SSTable file known by the MANIFEST.
- Push each iterator’s current record into a min-heap with the following comparator. Primary: the key. Tie-break (equal keys): newest SSTable first based on MANIFEST order (to make sure an old value doesn’t win).
- While the heap is not empty: pop the smallest key k (this first pop is the newest version of k due to the tie-break). Drain all other heap entries whose key is k and discard them (older values). For the record you picked: if it’s a tombstone, emit nothing for k; otherwise, emit the value for k. Advance the iterator of the record you picked and the iterators you drained for k, and push their next records into the heap.
- Stream emitted records (sorted) into new SSTables. Remember: the max entries in an SSTable should remain 2,000.
- fsync each new SSTable file, then its parent directory.
- Update the MANIFEST atomically (see week 3).
- Remove the old SSTable files.

To handle a read request:
- Check the memtable: if the key is marked as deleted, return not found. Else, return the value.
- Scan SSTables from newest to oldest, given the MANIFEST order (same as before). For the first record with the requested key: if it is a tombstone, return not found. Else, return the value.
- If the key isn’t found, return not found.

When replaying the WAL, make sure to take tombstone records into account.

Update your client to handle delete lines by sending the corresponding delete request to the server. Download and run your client against a new file containing the requests: put-delete.txt. NOTE: Refer to week 1 if you need to generate your own file with the number of lines you want.

That’s it for this week! Your storage engine now supports deletes and a compaction mechanism that prevents unbounded growth. The Coder Cafe will take a break for two weeks. On January 7th, you will continue exploring LSM trees and cover leveling. In your current implementation, a miss still scans all SSTables; therefore, you will also add key range partitioning to limit the number of SSTables that need to be checked during a lookup. See you next year!

The compaction trigger you used was simple: every 10,000 PUT or DELETE requests. In real systems, compaction is usually driven by factors such as too many SSTable files, space pressure, or high read amplification. Also, many systems add safeguards to keep compaction controlled and resource-efficient.
For example, a common one is bounded fan-in (merging only a small, fixed number of SSTables per batch), so the engine never opens every file at once. Others track each SSTable’s first and last key to select only overlapping candidates, hence avoiding unrelated files.

Taking a step back, it’s interesting to note that the core LSM idea (append-only writes with regular compaction) shows up in many systems, even outside pure LSM trees. For example:
- Lucene: Immutable segments are created and later merged in the background, an LSM-like pattern, even though it isn’t an LSM tree per se.
- Memcached Extstore: Flushes values to free RAM, but keeps the hashtable, keys, and storage pointers in memory. It later compacts the data.
- Kafka: Rewrites segments to keep the latest value per key and drop older versions, which is conceptually similar to SSTable compaction.

Also, we briefly introduced the concept of key resurrection in the introduction. You should be aware that this is a common challenge with LSM trees. In real-world conditions, crashes, slow WAL truncation, and complex compaction can allow an old value to be replayed during recovery after its tombstone has been removed, leading to key resurrection. Here are two great references that delve more into this kind of problem:
- Preventing Data Resurrection with Repair Based Tombstone Garbage Collection
- Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla

Another excellent reference is Acheron: Persisting Tombstones in LSM Engines. It shows how standard LSM compaction can leave tombstones stuck for long periods, so “deleted” data may still linger in lower levels and complicate compliance with regulations such as GDPR and CCPA. The paper introduces delete-aware techniques that prioritize pushing tombstones down the tree to make deletions persist more predictably.

Lastly, you can explore the RUM conjecture. Structurally, it’s similar to the CAP theorem: “three things, pick two”.
In short, you can make a database excel at two of: reads, updates (insert/change/delete), and memory/space, but not all three at once. Make any two really good and the third gets worse; that’s an unavoidable trade-off. This helps explain why, for example, LSM trees optimized for fast updates and good space efficiency pay a cost in read performance due to read amplification. That trade-off shows up in the design of the compaction process you implemented this week: you trade space and significant I/O for simplicity by compacting everything in one shot. This is fine for the example, but with 500GB of SSTables, you may need roughly another 500GB of free space during the merge in the worst case.
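The k-way merge described earlier in this post can be sketched in Python. This is a minimal in-memory sketch under stated assumptions, not the series’ reference solution: SSTables are modeled as sorted lists rather than files, the input list is assumed to follow MANIFEST order from oldest to newest, and the names Record, deleted, and compact are hypothetical.

```python
import heapq
from typing import NamedTuple, Optional


class Record(NamedTuple):
    key: str
    value: Optional[str]   # None when the record is a tombstone
    deleted: bool = False  # hypothetical tombstone flag


def compact(sstables: list[list[Record]], max_entries: int = 2000) -> list[list[Record]]:
    """Merge all SSTables (oldest first, each sorted by key) into new ones."""
    iters = [iter(table) for table in sstables]
    heap: list = []

    def push_next(idx: int) -> None:
        rec = next(iters[idx], None)
        if rec is not None:
            # -idx breaks ties on equal keys in favor of the newest SSTable.
            heapq.heappush(heap, (rec.key, -idx, rec))

    for idx in range(len(iters)):
        push_next(idx)

    output: list[list[Record]] = []
    current: list[Record] = []
    while heap:
        key, neg_idx, newest = heapq.heappop(heap)
        push_next(-neg_idx)
        # Drain and discard older versions of the same key.
        while heap and heap[0][0] == key:
            _, older_neg_idx, _ = heapq.heappop(heap)
            push_next(-older_neg_idx)
        if newest.deleted:
            continue  # every SSTable is merged, so nothing older can resurface
        current.append(newest)
        if len(current) == max_entries:
            output.append(current)
            current = []
    if current:
        output.append(current)
    return output
```

Dropping the tombstone is only safe here because every SSTable takes part in the merge. A partial compaction would have to keep the tombstone until no older SSTable could still hold the key.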

0 views

The Coder Cafe 5 months ago

Build Your Own Key-Value Storage Engine—Week 3

Agenda
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging

Last week, you built the first version of an LSM: an in-memory memtable for recent writes, immutable SSTables on disk, and a MANIFEST file listing the SSTable files. However, if the database crashes, data in the memtable would be lost. This week, you will focus on durability by introducing Write-Ahead Logging (WAL).

A WAL is an append-only file on disk that records the same operations you keep in memory. How it works:
- On write, record the operation in the WAL and the memtable.
- On restart, read the WAL from start to end and apply each record to the memtable.

Introducing a WAL is not free, though. Writes are slower because each write also goes to the WAL. It also increases write amplification, the ratio of data written to data requested by a client.

Another important aspect of durability is when to synchronize a file’s state with the storage device. When you write to a file, it may appear as saved, but the bytes may sit in memory caches rather than on the physical disk. These caches are managed by the OS’s filesystem, an abstraction over the disk. If the machine crashes before the data is flushed, you can lose data. To force the data to stable storage, you need to call a sync primitive. The simple, portable choice is to call fsync, a system call that flushes a file’s buffered data and required metadata to disk.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server.

For the WAL data format, you won’t use JSON like the SSTables, but NDJSON (Newline-Delimited JSON). It is a true append-only format with one JSON object per line.
To handle a write request:
- Append a record to the WAL file, opened in append mode. The record carries the operation type along with the provided key and value.
- Update the memtable with the same logic as before: if the key exists, update the value. Otherwise, create a new entry.
- Acknowledge the HTTP request.

On startup:
- Create an empty WAL file if it doesn’t exist.
- Replay the WAL from start to end. For each valid line, apply it to the memtable.

Keep the same flush trigger (2,000 entries) and the same logic (stop-the-world operation) as last week:
- Write the new SSTable: flush the memtable as a new immutable JSON SSTable file with keys sorted (same as before). fsync the SSTable file. fsync the parent directory of the SSTable to make the new filename persistent.
- Update the MANIFEST atomically: read the current MANIFEST lines into memory and append the new SSTable filename. Open a temporary file. Write the entire list to it from the start. Rename the temporary file to the MANIFEST. fsync the parent directory of the MANIFEST.
- Reset the WAL: truncate the WAL to zero length. fsync the WAL file.

If the server is unavailable, the client should not fail. Retry indefinitely with a short delay (or exponential backoff).

To assess durability:
- Run the client against the same input file (put.txt).
- Stop and restart your database randomly during the run.
- Your client should confirm that no acknowledged writes were lost after recovery.

Add a per-record checksum to each WAL record. On startup, verify records and stop at the first invalid/truncated one, discarding the tail. For reference, ScyllaDB checksums segments using CRC32; see its commitlog segment file format for inspiration.

Regarding the flush process, if the database crashes after step 1 (write the new SSTable) and before step 2 (update the MANIFEST atomically), you may end up with a dangling SSTable file on disk. Add a startup routine to delete any file that exists on disk but is not listed in the MANIFEST. This keeps the data directory aligned with the MANIFEST after a crash.

That’s it for this week! Your storage engine is now durable.
On restart, data that was in the memtable is recovered from the WAL. This is made possible by fsync and the atomic update of the MANIFEST. Deletion is not handled yet. In the worst case, a miss can read all SSTables, which quickly becomes highly inefficient. In two weeks, you will add a DELETE endpoint and learn how SSTables are compacted so the engine can reclaim space and keep reads efficient.

In your implementation, you used fsync as a simple “make it durable now” button. In practice, there is finer control over both what you sync and when you sync.
- What: fdatasync (or opening the file with O_DSYNC) persists the data without pushing unrelated metadata, which is usually what you want for WAL appends. You can go further with O_DIRECT to bypass the page cache and sync only the data you wrote, but that comes with extra complexity.
- When: While calling a sync primitive after every request is offered by systems that promise durability, it is often not the default. Many databases use group commit, which batches several writes into one sync call to amortize the cost while still providing strong guarantees. For additional information, see A write-ahead log is not a universal part of durability.

For example, RocksDB provides options for tuning WAL behavior to meet the needs of different applications:
- Synchronous WAL writes (what you implemented this week).
- Group commit.
- No WAL writes at all.

If you want, you can also explore group commit in your implementation and its impact on durability and latency/throughput, since this series will not cover it later.

Also, you should know that since a WAL adds I/O to the write path, storage engines use a few practical tricks to keep it fast and predictable. A common one is to preallocate fixed-size WAL segments at startup to:
- Avoid the penalty of dynamic allocation.
- Prevent write fragmentation.
- Align buffers for O_DIRECT (an open(2) flag for direct I/O that bypasses the OS page cache).
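The WAL write path and the checksum-verified replay described in this post can be sketched in Python. This is a minimal sketch under stated assumptions rather than the series’ reference solution: the record layout (the field names op, key, value, crc, and payload) and the function names append_record and replay are hypothetical, and the file is reopened on every append for brevity.

```python
import json
import os
import zlib


def append_record(wal_path: str, key: str, value: str) -> None:
    """Append one NDJSON record and force it to stable storage."""
    payload = json.dumps({"op": "put", "key": key, "value": value}, sort_keys=True)
    # Hypothetical layout: a CRC32 of the payload stored next to the payload.
    line = json.dumps({"crc": zlib.crc32(payload.encode("utf-8")), "payload": payload})
    with open(wal_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # push the OS cache to the device


def replay(wal_path: str) -> dict:
    """Rebuild the memtable, stopping at the first invalid or truncated record."""
    memtable: dict = {}
    if not os.path.exists(wal_path):
        return memtable
    with open(wal_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
                payload = record["payload"]
                if zlib.crc32(payload.encode("utf-8")) != record["crc"]:
                    break  # corrupted record: discard the tail
                op = json.loads(payload)
            except (ValueError, KeyError, TypeError, AttributeError):
                break  # truncated or malformed line: discard the tail
            memtable[op["key"]] = op["value"]
    return memtable
```

Only after append_record returns should the engine update the memtable and acknowledge the client. A real engine would keep the file open and might batch several records per fsync (group commit).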

0 views

The Coder Cafe 6 months ago

Linus Torvalds vs. Ambiguous Abstractions

🎄 If you’re planning to do Advent of Code this year, join The Coder Cafe leaderboard: . I’ll find a few prizes for the winner(s). If you’re new to Advent of Code, I wrote a short introduction last year, and I also wrote a blog post called I Completed All 8 Advents of Code in One Go: Here Are the Lessons I Learned if you’re interested. I’ve also created a custom channel in the Discord channel. Join the Discord ☕ Welcome to The Coder Cafe! Today, we discuss a recent comment from Linus Torvalds about the use of a helper function. Get cozy, grab a coffee, and let’s begin! In August 2025, there was (yet another) drama involving Linus Torvalds replying on a pull request: No. This is garbage and it came in too late. I asked for early pull requests because I’m traveling, and if you can’t follow that rule, at least make the pull requests good. This adds various garbage that isn’t RISC-V specific to generic header files. And by “garbage” I really mean it. This is stuff that nobody should ever send me, never mind late in a merge window. Like this crazy and pointless make_u32_from_two_u16() “helper”. That thing makes the world actively a worse place to live. It’s useless garbage that makes any user incomprehensible, and actively WORSE than not using that stupid “helper”. If you write the code out as “(a << 16) + b”, you know what it does and which is the high word. Maybe you need to add a cast to make sure that ‘b’ doesn’t have high bits that pollutes the end result, so maybe it’s not going to be exactly pretty, but it’s not going to be wrong and incomprehensible either. In contrast, if you write make_u32_from_two_u16(a,b) you have not a f^%$ing clue what the word order is . IOW, you just made things WORSE, and you added that “helper” to a generic non-RISC-V file where people are apparently supposed to use it to make other code worse too. So no. Things like this need to get bent. 
It does not go into generic header files, and it damn well does not happen late in the merge window.

Let’s not discuss the rudeness of this comment (it’s atrocious). Instead, let’s focus on the content itself. A popular newsletter wrote a post about it: the main point Linus makes here is that good code optimizes for reducing cognitive load. […] Humans have limited working memory capacity - let’s say the human brain can only store 4-7 “chunks” at at time. Each abstraction or helper function costs a chunk slot. Each abstractions costs more tokens.

I share the view that good code optimizes for reducing cognitive load [1], but I don’t understand Linus’s comment in exactly the same way. Yes, Linus is virulent about the helper function, but in my opinion, his main argument isn’t simply that an abstraction costs a “chunk slot” as mentioned; it’s rather that this isn’t the right abstraction.

The code added in the pull request is a macro that builds a 32-bit integer by putting one 16-bit value in the high half and the other in the low half. The main problem with this macro isn’t necessarily that it exists. It’s that its intent (meaning what it tries to accomplish) could have been clearer. Indeed, the helper’s name doesn’t tell which word is high and which one is low, and that’s exactly what Linus is calling out with “you have not a f^%$ing clue what the word order is”. Because we can’t get the intent from the name (make_u32_from_two_u16), we have to open the macro to understand the order. That’s precisely why it costs a “chunk slot”: not because the abstraction exists, but because it’s an ambiguous one.

If we wanted to keep using a macro, a better approach, in my opinion [2], would be to encode the word order in the name itself (most significant word first, then least significant word). In this case, the word order is carried by the macro name, which makes it a clearer abstraction.
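The naming idea can be sketched in Python. The pull request itself used a C macro, so this is only an illustration of the principle; the function name u32_from_msw_lsw and its parameter names are hypothetical, not the names used in the kernel or in the post.

```python
def u32_from_msw_lsw(msw: int, lsw: int) -> int:
    """Build a 32-bit value: msw fills the high 16 bits, lsw the low 16 bits."""
    # Masking both halves keeps stray high bits (or a negative input)
    # from polluting the other half of the result.
    return ((msw & 0xFFFF) << 16) | (lsw & 0xFFFF)


# The call site states the word order without opening the helper.
register_value = u32_from_msw_lsw(0x1234, 0xABCD)
```

The masking addresses the point Linus raises about the low word having high bits that pollute the end result: here, a negative or oversized low word cannot leak into the high half.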
Reading the call site doesn’t require opening the macro to understand the word order:

Such an abstraction doesn’t cost a “chunk slot” in terms of cognitive load. Its intent is clear from the name, so we don’t need to load an extra piece of information into our working memory to understand it.

In summary, if we want to optimize for cognitive load, there’s not necessarily an issue with using helper functions. But if we do, we should make the abstraction as explicit as possible, and that starts with a clear function name that conveys what it tries to accomplish.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

Resources

More From the Programming Category

- Readability
- Cognitive Load
- Nested Code
- Re: [GIT PULL] RISC-V Patches for the 6.17 Merge Window, Part 1 - Linus Torvalds // The discussion.
- GitHub // The code proposed in the pull request
- Linus and the two youts // Interestingly, the macro was plain wrong when the second word was negative. The full explanation is here.

[1] At least most of the time. Sometimes we must optimize for performance at the expense of cognitive load.
[2] Mr Torvalds, if you see this and you disagree, please do not insult me.

Open Source

0 views

The Coder Cafe 6 months ago

Build Your Own Key-Value Storage Engine—Week 2

Agenda

- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations

Before delving into this week’s tasks, it’s important to understand what you will implement. This week, you will implement a basic log-structured merge-tree (LSM tree).

At its core, an LSM tree is a data structure that prioritizes write efficiency by trading off some read complexity. It buffers writes in memory and uses append-only files on disk, then rewrites data during compaction. It consists of two main components:

- A mutable in-memory data structure called a memtable, used to store recent writes.
- A set of immutable SSTables (Sorted String Tables) stored on disk.

Periodically, the current memtable is snapshotted, its entries are sorted by key, and a new immutable SSTable file is written.

In addition, a MANIFEST file keeps an append-only list of SSTable filenames. It tells the engine which SSTable files exist and in which order to read them, newest to oldest.

Why LSM trees shine for write-heavy workloads:

- Fast writes with sequential I/O: New updates are buffered in memory (memtable) and later written sequentially to disk during a flush (SSTable), which is faster than the random I/O patterns common with B-trees, for example.
- Decouples writes from read optimization: Writes complete against the memtable, while compaction work runs later (you will tackle that in a future week).
- Space and long-term efficiency: Compaction processes remove dead data and merge many small files into larger sorted files, which keeps space usage in check and sustains read performance over time.

For the memtable, you will start with a hashtable. In a future week, you will learn why a hashtable is not the most efficient data structure for an LSM tree, but it is a simple starting point.
For the SSTables, you will use JSON as the data format. Get comfortable with a JSON parser if you are not already.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel).

This week’s implementation is single-threaded. You will revisit that assumption later.

Implement a hashtable to store requests (create or update). You can probably reuse a lot of code from Week 1.

When your memtable contains 2,000 entries:

- Flush the memtable as a new immutable JSON SSTable file with keys sorted.
- The SSTable file is a JSON array of objects, each with two fields, and . Keys are unique within a file. For example, if your memtable contains the following entries: You need to create the following SSTable:
- Use a counter for the filename prefix, for example , , .
- After writing the new SSTable, append its filename to the MANIFEST (append only), then clear the memtable:

For now, the flush is a stop-the-world operation. While the file is being written, do not serve reads or writes. You will revisit that later.

- Create an empty file if it doesn’t exist.
- Derive the next SSTable ID from the MANIFEST so you don't reuse the same filename.

- Check the memtable:
  - If found, return the corresponding value.
  - If not found, read the MANIFEST to list SSTable filenames:
    - Scan SSTables from newest to oldest (for example , then , then ). Use a simple linear scan inside each file for now. Stop at the first hit and return the corresponding value.
    - If still not found, return .

There are no changes to the client you built in Week 1. Run it against the same file (put.txt) to validate that your changes are correct.

Keep a small LRU cache of known-absent keys (negative cache) between the memtable and SSTables. This avoids repeated disk scans for hot misses: after the first miss, subsequent lookups are O(1). Implementation details are up to you.
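The flush, startup, and lookup steps above, together with the optional negative cache, can be sketched as follows. This is a minimal illustration rather than a reference solution: the class name `LSMStore`, the JSON field names `key` and `value`, the `sst-<n>.json` filenames, and the cache size are all assumptions made for this example, and the next SSTable ID is derived by simply counting MANIFEST lines.

```python
import json
import os
from collections import OrderedDict


class LSMStore:
    """Single-threaded sketch: hashtable memtable + JSON SSTables + MANIFEST."""

    def __init__(self, data_dir, flush_threshold=2000, miss_cache_size=1024):
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold
        self.miss_cache_size = miss_cache_size  # assumed size, tune as needed
        self.memtable = {}
        self.misses = OrderedDict()  # LRU of known-absent keys
        os.makedirs(data_dir, exist_ok=True)
        self.manifest_path = os.path.join(data_dir, "MANIFEST")
        # Startup: create an empty MANIFEST if it doesn't exist.
        if not os.path.exists(self.manifest_path):
            open(self.manifest_path, "w").close()
        with open(self.manifest_path) as f:
            self.sstables = [line.strip() for line in f if line.strip()]
        # Derive the next SSTable ID so filenames are never reused.
        self.next_id = len(self.sstables) + 1

    def put(self, key, value):
        self.memtable[key] = value
        self.misses.pop(key, None)  # the key is no longer absent
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Stop-the-world flush: write entries sorted by key.
        name = f"sst-{self.next_id}.json"  # hypothetical naming scheme
        entries = [{"key": k, "value": self.memtable[k]}
                   for k in sorted(self.memtable)]
        with open(os.path.join(self.data_dir, name), "w") as f:
            json.dump(entries, f)
        # Append the new filename to the MANIFEST, then clear the memtable.
        with open(self.manifest_path, "a") as f:
            f.write(name + "\n")
        self.sstables.append(name)
        self.next_id += 1
        self.memtable.clear()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        if key in self.misses:  # hot miss: skip the disk scan
            self.misses.move_to_end(key)
            return None
        # Scan SSTables from newest to oldest, linear scan inside each file.
        for name in reversed(self.sstables):
            with open(os.path.join(self.data_dir, name)) as f:
                for entry in json.load(f):
                    if entry["key"] == key:
                        return entry["value"]
        self.misses[key] = True
        if len(self.misses) > self.miss_cache_size:
            self.misses.popitem(last=False)  # evict least recently used
        return None
```

Note that `put` removes the key from the negative cache; without that step, a key written after a miss would keep being reported as absent.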
Instead of parsing the MANIFEST file for each request, you can cache its content in memory.

That’s it for this week! You have built the first version of an LSM tree: a memtable in memory, SSTable files written by regular flushes, and a MANIFEST that lists those SSTables.

For now, durability isn’t guaranteed. Data already flushed to SSTables will be read after a restart, but anything still in the memtable during a crash is lost. In two weeks, you will make sure that any request acknowledged to a client remains in your storage engine, even after a restart.

The flush trigger you used was pretty simple: once the memtable contains 2,000 entries. In real systems, flushes can be triggered by various factors, for example:

- Some databases flush when the memtable reaches a target size in bytes, ensuring predictable memory usage.
- A flush can also occur after a period of time has passed. This occurs because the database eventually needs to release commit log segments. For tables with very low write activity, this can sometimes lead to data resurrection scenarios. Here’s an old issue from the ScyllaDB codebase that illustrates this behavior.

Regarding the model, this series assumes a simple key–value one: every PUT stores the whole value, so a GET just finds the newest entry and returns it. If you need a richer model (e.g., rows with many fields or collections), writes are often partial (patches) rather than full replacements. Therefore, reads must reconstruct the result by scanning newest to oldest and merging changes until all required fields are found or a full-write record is encountered.

Last but not least, in this series, you implicitly rely on client-side ordering: the validation client issues requests sequentially. Production KV databases typically attach a sequence number or a logical timestamp to each write to handle out-of-order arrivals, merging, and reconciling results.
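To illustrate the last two points, here is a small sketch of a read over partial writes ordered by sequence number. Everything in it is hypothetical (the `Record` type, the `full` flag, and the `read_row` function are names invented for this example); it only shows the idea of scanning versions newest to oldest and stopping at a full-write record or once every requested field is resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    seq: int            # monotonic sequence number assigned at write time
    fields: dict        # fields written by this operation
    full: bool = False  # True if this write replaces the whole row


def read_row(records, wanted):
    """Rebuild the requested fields of one key from its versions."""
    result = {}
    # Order by sequence number, not by arrival order: newest first.
    for rec in sorted(records, key=lambda r: r.seq, reverse=True):
        for name, value in rec.fields.items():
            # The newest version of a field wins; older ones are ignored.
            if name in wanted and name not in result:
                result[name] = value
        # Nothing older can matter after a full write, and there is no
        # need to keep scanning once every wanted field is resolved.
        if rec.full or all(name in result for name in wanted):
            break
    return result


# Versions listed in arrival order, which differs from sequence order.
versions = [
    Record(seq=1, fields={"name": "Ada", "city": "London", "age": 36}, full=True),
    Record(seq=3, fields={"city": "Paris"}),
    Record(seq=2, fields={"age": 37}),
    Record(seq=0, fields={"name": "Old", "email": "old@example.com"}, full=True),
]
row = read_row(versions, {"name", "city", "age", "email"})
print(row)
```

In this run, the patch with `seq=3` supplies `city`, the patch with `seq=2` supplies `age`, and the full write with `seq=1` supplies `name` and ends the scan, so the older `email` value is not resurrected.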
Pure wall-clock timestamps are convenient but brittle; see Kyle Kingsbury’s notes on clock pitfalls for a deeper dive.

- The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper.
- Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary.

0 views

The Coder Cafe 6 months ago

Nothing Beats Kindness

☕ Welcome to The Coder Cafe! Today, November 13, is World Kindness Day. For this special occasion, we discuss how kindness matters at work. Get cozy, grab a coffee, and let’s begin!

We’re in 2022. It’s Saturday evening, and I’m about to go to bed. I’m on-call that night. I haven’t been paged, but just to make sure everything is OK, I log in and check Slack. An incident is going on, and a colleague is already on it. I DM him: “Why didn’t you contact me?” He replies: “It’s late and I thought you might be sleeping. I was awake, so I looked to see if there’s something I could do.”

My first reaction was: I’m on-call. I’m paid for it. I’ll take care of it. Go to bed.

But here’s the thing: on a Saturday evening, he chose to help because he thought I might be sleeping, even though I was the one on-call, the one paid to handle it. That was a pure act of kindness. No points. No credit. Just care. And after that? Honestly, I would have done anything for that person.

Why Kindness Wins

At work, we work with people long before we work with code. There’s always a little distance between us: roles and power dynamics, deadlines and pressure, different cultures, communication styles, sometimes different time zones. Kindness is the fastest bridge across that distance.

Kindness is about being generous, considerate, and having concern for others without expecting praise or reward in return. It’s a voluntary act that creates psychological safety among team members. When people feel safe, they surface risks earlier, ask the “naive” questions, and move faster together.

Kind people make work better day in, day out. Kindness boosts trust, speeds decisions, reduces stress, and quietly raises the bar for everyone.

Let’s look at a few places where kindness matters in our daily jobs:

- Code review: When we’re assigned a review, we’re not there to rate someone’s code. We’re there to merge the best possible change together. Be respectful and stay factual. Favor questions over pronouncements: “What scenarios does this handle? I’m worried about X; would Y cover Z?” Point out what’s good, suggest concrete fixes, and link to standards or examples. If there’s confusion, offer help.
- Meetings: Make space so everyone can be heard. Don’t interrupt. Invite quieter people in: “Ben, anything you would add?” Being more vocal doesn’t make someone more right.
- Mentoring: People make mistakes. Don’t jump to blame or perform expertise. The goal is to protect in public and correct in private. Give clear, kind feedback, focus on the next step, and share your own past mistakes to lower the temperature.
- Random thank you: When you receive help or just enjoy working with someone, say thank you. Recognition matters, and doing it publicly multiplies the effect. For example, at Google, there’s a program called gThanks that lets you thank someone publicly so others can see it too.
- Make time to listen: Being kind also means making time to listen. I remember going through a difficult period of my life, and a former manager just took time to talk, without judging. That mattered more than any advice.
- Self-compassion: Kindness also applies to yourself. Give yourself the same understanding you would give a teammate. Take breaks, ask for help, forgive your own mistakes, and learn from them.

Being kind is a bridge to people, and even in a professional context, as Aesop wrote, no act of kindness, no matter how small, is ever wasted.

- Don’t Forget About Your Mental Health
- Keeping a Mistake Journal
- The XY Problem
- Why Kindness at Work Pays Off
- Random Acts of Kindness Foundation