Build Your Own Key-Value Storage Engine—Week 7
Agenda

Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable

Over the last few weeks, you refined your LSM tree to introduce leveling. In case of a key miss, the process requires the following steps:

- Lookup from the memtable.
- Lookup from all the L0 SSTables.
- Lookup from one L1 SSTable.
- Lookup from one L2 SSTable.

Last week, you optimized the lookups by introducing block-based SSTables and indexing, but a lookup is still not a "free" operation. In the worst case, it requires fetching two pages (one for the index block and one for the data block) just to find out that a key is missing from an SSTable.

This week, you will optimize searches by introducing a "tiny" level of caching per SSTable. If you're an avid reader of The Coder Cafe [1], we already discussed a great candidate for such a cache:

- One that doesn't consume too much memory, to make sure we don't increase space amplification drastically.
- One that is fast enough that a lookup doesn't introduce too much overhead, especially since we have to check the cache before making any lookup in an SSTable.

You will implement this cache using Bloom filters: a space-efficient, probabilistic data structure to check for set membership. A Bloom filter can return two possible answers:

- The element is definitely not in the set (no false negatives).
- The element may be in the set (false positives are possible).

In addition to optimizing SSTable lookups, you will also optimize your memtable. In week 2, you implemented a memtable using a hashtable.
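Before digging into the memtable, here is a minimal Bloom filter sketch in Python to make the two possible answers concrete. The hashing scheme below (seeded BLAKE2 digests) is a simple stand-in for illustration only, not the scheme this post specifies later:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at 0

    def _indexes(self, key: bytes):
        # Stand-in hashing: one seeded 64-bit digest per hash function.
        for i in range(self.k):
            h = hashlib.blake2b(key, digest_size=8, salt=bytes([i] * 16))
            yield int.from_bytes(h.digest(), "little") % self.m

    def add(self, key: bytes) -> None:
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def may_contain(self, key: bytes) -> bool:
        # False -> definitely absent; True -> maybe present (false positive possible).
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))
```

Note the asymmetry: `may_contain` returning `False` is a guarantee, while `True` only means "go do the real lookup."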
Let's get some perspective to understand the problems of using a hashtable:

- A memtable buffers writes. As it's the main entry point for writes, a write has to be fast. → OK: a hashtable has O(1) average inserts, plus O(k) (k: the length of the key) for hashing.
- For reads, doing a key lookup has to be fast. → OK: O(1) average lookups, plus O(k) to hash.
- Doing range scanning operations (week 5, optional work), such as "give me the list of keys between bar and foo". → A hashtable, because it's not an ordered data structure, is terrible here: you end up touching everything, so O(n) with n the number of elements in the hashtable.
- Flush to L0. → A hashtable isn't ordered, so it requires sorting all the keys (O(n log n), with n the number of elements) to produce the SSTables.

Because of these negative points, could we find a better data structure? Yes! This week, you will switch the memtable to a radix trie (see Further Notes for a discussion on alternative data structures).

A trie is a tree-shaped data structure usually used to store strings efficiently. The common example to illustrate a trie is to store a dictionary. For example, suppose you want to store a four-letter word and a five-letter word that starts with the same four letters. Stored separately, you need a total of 4 + 5 = 9 letters. Tries optimize the storage required by sharing prefixes. Each node stores one letter. Here's an example of a trie storing these two words in addition to the word foo (the marked nodes represent the end of a word):

As you can see, we didn't duplicate the first four letters of the first word to store the second. In this very example, instead of storing 9 letters for the two words, we stored only five letters. Yet, you're not going to implement a "basic" trie for your memtable; instead, you will implement a compressed trie called a radix trie (also known as a PATRICIA [2] trie).

Back to the previous example, storing one node (one square) has an overhead. It usually means at least one extra field to store the next element, usually a pointer.
In the previous example, we needed 11 nodes in total, but what if we could compress the number of nodes required? The idea is to combine nodes with a single child. This new trie stores the exact same information, except it requires 6 nodes instead of 11. That's what radix tries are about.

To summarize the benefits of switching a memtable from a hashtable to a radix trie:

- Ordered by design: Tries keep keys in order and make prefix/range lookups natural, which helps for range scans and for streaming a sorted flush.
- No rebalancing/rehashing pauses: The shape doesn't depend on insertion order, and operations don't need rebalancing; you avoid periodic rehash work.
- Prefix compression: A radix trie can cut duplicated key bytes in the memtable, reducing in-memory space.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord

Let's size the Bloom filter. You will target:

- p (false-positive rate) = 1%
- n (max elements per SSTable) = 1,953
- k (hash functions) = 5

Using the formula from the Bloom Filters post:

m = -k × n / ln(1 - p^(1/k))

We get m ≈ 19,230 bits, i.e., 2,404 B. We will round up to 2,496 B (39 × 64 B), so the bitset is a whole number of cache lines.

NOTE: Using k = 7 would shave only ~2–3% space for ~40% more hash work, so k = 5 is a good trade-off.

To distribute elements across the bitvector, you will use the following approach: xxHash64 with two different constant seeds to get two base hashes h1 and h2, then derive the k bit indices by double hashing (pseudo-code): g_i = (h1 + i × h2) mod m, for i = 0..k-1.

The required changes to introduce Bloom filters:

Startup: For each SSTable in the MANIFEST, cache its related Bloom filter in memory. Since each Bloom filter requires only a small amount of space, this optimization has a minimal memory footprint. For example, caching 1,000 Bloom filters of the type you designed requires less than 2.5 MB of memory.

SSTable creation: For each new SSTable you write, initialize an empty bitvector of 2,496 B.
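The sizing and index derivation can be sketched as follows. Since xxHash64 is an external dependency, this sketch substitutes a seeded 64-bit hash built from hashlib, and the two seed constants are arbitrary placeholders, not values from the post:

```python
import hashlib
import math

BITS = 2_496 * 8          # 2,496 B bitset = 19,968 bits (39 cache lines)
K = 5                     # number of hash functions
SEED1, SEED2 = 0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F  # arbitrary constants

def hash64(key: bytes, seed: int) -> int:
    # Stand-in for xxHash64(key, seed): any seeded 64-bit hash works here.
    h = hashlib.blake2b(key, digest_size=8, key=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")

def bloom_indices(key: bytes, m: int = BITS, k: int = K) -> list[int]:
    # Double hashing: g_i = (h1 + i*h2) mod m for i = 0..k-1.
    h1, h2 = hash64(key, SEED1), hash64(key, SEED2)
    return [(h1 + i * h2) % m for i in range(k)]

def required_bits(n: int, p: float, k: int) -> int:
    # Sizing formula: m = ceil(-k*n / ln(1 - p^(1/k))).
    return math.ceil(-k * n / math.log(1 - p ** (1 / k)))
```

Plugging in p = 1%, n = 1,953, and k = 5, `required_bits` lands around the 19,230 bits quoted above, before rounding up to whole cache lines.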
Build the Bloom filter in memory as you emit the keys (including tombstones):

- Compute the k bit indices g_0..g_(k-1) based on the key.
- For each g_i, set the bit at position g_i.

When the SSTable is done, persist the Bloom filter as a sidecar file next to it, and update the MANIFEST file. Update the cache containing the Bloom filters.

Compaction: Delete from memory the Bloom filters corresponding to deleted SSTables.

Lookup: Before reading an SSTable:

- Compute the k bit indices based on the key.
- If all k bits are set: the key may be present; therefore, proceed with your normal lookup in the SSTable.
- Otherwise: skip this SSTable.

Now, let's replace your hashtable with a trie. Each node contains the following fields (names here are illustrative):

- prefix: Compressed edge fragment.
- children: A map keyed by the next character after prefix to a child node.
- state: An enum with the different possible values:
  - NONE: The node is just a prefix; no full key ends here.
  - VALUE: A full key exists at this node.
  - TOMBSTONE: This key was explicitly deleted.
- value: If state is VALUE, the corresponding value.

The root is a sentinel node with an empty prefix.

put: Walk from the root, matching the longest common prefix against each edge. If there is a partial match in the middle of an edge, split once: create a parent with the common part and two children, the old suffix and the new suffix. Descend via the next child (next unmatched character). At the terminal node: set state = VALUE and store the value.

get: Walk edges by longest-prefix match. If an edge doesn't match, return not found. At the terminal node: if state is VALUE, return the value; if state is NONE or TOMBSTONE, return not found.

delete: Walk as in put. If the path doesn't fully exist, create the missing suffix nodes so that a terminal node exists. At the terminal node: set state = TOMBSTONE (you may have to clear the value).

Flush process: In-order traversal:

- VALUE: Emit the key/value pair.
- TOMBSTONE: Emit a tombstone.
- NONE: Emit nothing.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Use per-SSTable random seeds for the Bloom hash functions. Persist them in the Bloom filter files.

In Bloom Filters, we introduced blocked Bloom filters, a variant that optimizes spatial locality by:

- Dividing the Bloom filter into contiguous blocks, each the size of a cache line.
- Restricting each query to a single block to ensure all bit lookups stay within the same cache line.

Switch to blocked Bloom filters and see the impact on latency and throughput.

If you implemented the range scan operation from week 5 (optional work), wire it to your memtable radix trie.

That's it for this week! You optimized lookups with per-SSTable Bloom filters and switched the memtable to a radix trie, an ordered data structure. Since the beginning of the series, everything you built has been single-threaded, and flush/compaction remains stop-the-world. In two weeks, you will finally tackle the final boss of LSM trees: concurrency.

If you want to dive more into tries, Trie Memtables in Cassandra is a paper that explains why Cassandra moved from a skip list + B-tree memtable to a trie, and what it changed for topics such as GC and CPU locality. A popular variant of the radix trie is the Adaptive Radix Tree (ART): it dynamically resizes node types based on the number of children to stay compact and cache-friendly, while supporting fast in-memory lookups, inserts, and deletes. This paper (or this summary) explores the topic in depth. You should also be aware that tries aren't the only option for memtables, as other data structures exist. For example, RocksDB relies on a skip list. See this resource for more information.

About Bloom filters, some engines keep a Bloom filter not only per SSTable but per data-block range as well. This was the case for RocksDB's older block-based filter format (source). RocksDB later shifted toward partitioned indexes/filters, which partition the index and full-file filter into smaller blocks with a top-level directory for on-demand loading. The official doc delves into the new approach.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.
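As a closing sketch, here is one possible shape for the radix-trie memtable described earlier (put/get/delete with tombstones and an in-order flush). All names are illustrative assumptions, and the edge-splitting logic is simplified to keep the sketch short:

```python
from enum import Enum

class State(Enum):
    NONE = 0       # node is only a prefix
    VALUE = 1      # a full key ends here
    TOMBSTONE = 2  # key was explicitly deleted

class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix    # compressed edge fragment
        self.children = {}      # next char after prefix -> child Node
        self.state = State.NONE
        self.value = None

class RadixMemtable:
    def __init__(self):
        self.root = Node()  # sentinel node with an empty prefix

    def _descend(self, key):
        """Walk/create nodes for key, splitting edges once if needed."""
        node, rest = self.root, key
        while True:
            p, common = node.prefix, 0
            while common < len(p) and common < len(rest) and p[common] == rest[common]:
                common += 1
            if common < len(p):
                # Partial match mid-edge: split into common part + old suffix.
                child = Node(p[common:])
                child.children, child.state, child.value = node.children, node.state, node.value
                node.prefix, node.children = p[:common], {p[common]: child}
                node.state, node.value = State.NONE, None
            rest = rest[common:]
            if not rest:
                return node
            if rest[0] not in node.children:
                node.children[rest[0]] = Node(rest)  # missing suffix as one edge
            node = node.children[rest[0]]

    def put(self, key, value):
        n = self._descend(key)
        n.state, n.value = State.VALUE, value

    def delete(self, key):
        # Creates the path if needed, so the tombstone survives until flush.
        n = self._descend(key)
        n.state, n.value = State.TOMBSTONE, None

    def get(self, key):
        node, rest = self.root, key
        while True:
            if not rest.startswith(node.prefix):
                return None  # edge mismatch: not found
            rest = rest[len(node.prefix):]
            if not rest:
                return node.value if node.state is State.VALUE else None
            node = node.children.get(rest[0])
            if node is None:
                return None

    def flush(self):
        """In-order traversal: (key, value) for VALUE, (key, None) for TOMBSTONE."""
        out = []
        def walk(node, acc):
            acc += node.prefix
            if node.state is State.VALUE:
                out.append((acc, node.value))
            elif node.state is State.TOMBSTONE:
                out.append((acc, None))
            for ch in sorted(node.children):
                walk(node.children[ch], acc)
        walk(self.root, "")
        return out
```

Because the traversal visits children in character order, `flush` streams keys already sorted, which is exactly what the L0 SSTable writer needs.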