Posts in Database (20 found)
The Coder Cafe 3 days ago

Build Your Own Key-Value Storage Engine—Week 7

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable

Over the last few weeks, you refined your LSM tree to introduce leveling. In case of a key miss, the process requires the following steps:
1. Lookup in the memtable.
2. Lookup in all the L0 SSTables.
3. Lookup in one L1 SSTable.
4. Lookup in one L2 SSTable.

Last week, you optimized lookups by introducing block-based SSTables and indexing, but a lookup is still not a “free” operation. Worst case, it requires fetching two pages (one for the index block and one for the data block) just to find out that a key is missing from an SSTable. This week, you will optimize searches by introducing a “tiny” level of caching per SSTable. If you’re an avid reader of The Coder Cafe, we already discussed a great candidate for such a cache:
- One that doesn’t consume too much memory, to make sure we don’t increase space amplification drastically.
- One that is fast enough that a lookup doesn’t introduce too much overhead, especially since we have to check the cache before making any lookup in an SSTable.

You will implement this cache using Bloom filters: a space-efficient, probabilistic data structure to check for set membership. A Bloom filter can return two possible answers:
- The element is definitely not in the set (no false negatives).
- The element may be in the set (false positives are possible).

In addition to optimizing SSTable lookups, you will also optimize your memtable. In week 2, you implemented a memtable using a hashtable.
Let’s get some perspective to understand the problems of using a hashtable:
A memtable buffers writes. As it’s the main entry point for writes, a write has to be fast. → OK: a hashtable has O(1) average inserts, plus O(k) (k: the length of the key) for hashing.
For reads, a key lookup has to be fast. → OK: O(1) average lookups, plus O(k) to hash.
Range scanning operations (week 5, optional work), such as: “Give me the list of keys between bar and foo.” → A hashtable, because it’s not an ordered data structure, is terrible here: you end up touching everything, so O(n) with n the number of elements in the hashtable.
Flush to L0. → A hashtable isn’t ordered, so it requires sorting all the keys (O(n log n), with n the number of elements) to produce the SSTables.

Because of these negative points, could we find a better data structure? Yes! This week, you will switch the memtable to a radix trie (see Further Notes for a discussion of alternative data structures). A trie is a tree-shaped data structure usually used to store strings efficiently. The common example to illustrate a trie is storing a dictionary. For example, suppose you want to store two words that start with the same four letters, one of 4 letters and one of 5: you need to store a total of 4 + 5 = 9 letters. Tries optimize the storage required by sharing prefixes. Each node stores one letter. Here’s an example of a trie storing these two words in addition to the word foo (marked nodes represent the end of a word). As you can see, we didn’t duplicate the first four letters of the first word to store the second. In this very example, instead of storing 9 letters for the two words, we stored only five. Yet, you’re not going to implement a “basic” trie for your memtable; instead, you will implement a compressed trie called a radix trie (also known as a PATRICIA trie). Back to the previous example: storing one node (one square) has an overhead. It usually means at least one extra field to store the next element, usually a pointer.
In the previous example, we needed 11 nodes in total, but what if we could compress the number of nodes required? The idea is to combine nodes that have a single child. This new trie stores the exact same information, except it requires 6 nodes instead of 11. That’s what radix tries are about.

To summarize the benefits of switching a memtable from a hashtable to a radix trie:
Ordered by design: Tries keep keys in order and make prefix/range lookups natural, which helps both range scans and streaming a sorted flush.
No rebalancing/rehashing pauses: The shape doesn’t depend on insertion order, and operations don’t need rebalancing; you avoid periodic rehash work.
Prefix compression: A radix trie can cut duplicated key bytes in the memtable, reducing in-memory space.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord

Let’s size the Bloom filter. You will target:
p (false-positive rate) = 1%
n (max elements per SSTable) = 1,953
k (hash functions) = 5

Using the formula from the Bloom Filters post, m = −k·n / ln(1 − p^(1/k)), we get m ≈ 19,230 bits, i.e., 2,404 B. We will round up to 2,496 B (39 × 64 B), so the bitset is a whole number of cache lines. NOTE: Using k = 7 would shave only ~2–3% space for ~40% more hash work, so k = 5 is a good trade-off.

To distribute elements across the bitvector, you will use the following approach: compute two base hashes with xxHash64 under two different constant seeds, then derive the k bit indices by double hashing, i.e., index_i = (h1 + i × h2) mod m for i = 0 … k−1.

The required changes to introduce Bloom filters:

Startup: For each SSTable in the MANIFEST, cache its related Bloom filter in memory. Since each Bloom filter requires only a small amount of space, this optimization has a minimal memory footprint. For example, caching 1,000 Bloom filters of the type you designed requires less than 2.5 MB of memory.

SSTable creation: For each new SSTable you write, initialize an empty bitvector of 2,496 B.
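The sizing and double-hashing scheme described above can be sketched in Python. This is illustrative only: it derives its two 64-bit base hashes from SHA-256 to stay dependency-free, where the post uses xxHash64 with two constant seeds (the seed constants below are arbitrary placeholders).

```python
import hashlib

# Sizing from the post: 2,496 B bitset (39 cache lines), k = 5 hash functions.
M_BITS = 2_496 * 8   # 19,968 bits
K = 5

def _hash64(key: bytes, seed: int) -> int:
    # Stand-in for xxHash64(key, seed): a 64-bit value derived from SHA-256.
    digest = hashlib.sha256(seed.to_bytes(8, "little") + key).digest()
    return int.from_bytes(digest[:8], "little")

class BloomFilter:
    def __init__(self):
        self.bits = bytearray(M_BITS // 8)

    def _indices(self, key: bytes):
        # Two base hashes, then k indices by double hashing:
        # index_i = (h1 + i * h2) mod m
        h1 = _hash64(key, seed=0x9E3779B9)
        h2 = _hash64(key, seed=0x85EBCA6B) | 1  # force odd so probes spread
        return [(h1 + i * h2) % M_BITS for i in range(K)]

    def add(self, key: bytes) -> None:
        for idx in self._indices(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def may_contain(self, key: bytes) -> bool:
        # True = maybe present (false positives possible);
        # False = definitely absent (no false negatives).
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(key))
```

With these parameters, every inserted key always answers "maybe present," while a missing key is rejected with roughly the target 1% false-positive rate at full capacity.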
Build the Bloom filter in memory as you emit the keys (including tombstones): compute the k bit indices based on the key, and for each index, set the corresponding bit. When the SSTable is done, persist the filter in a sidecar file next to it (e.g., a Bloom file alongside the SSTable file) and update the MANIFEST file. Update the in-memory cache containing the Bloom filters.

Compaction: Delete from memory the Bloom filters corresponding to deleted SSTables.

Lookup: Before reading an SSTable, compute the k bit indices based on the key. If all the corresponding bits are set, the key may be present; therefore, proceed with your normal lookup in the SSTable. Otherwise, skip this SSTable.

Now, let’s replace your hashtable with a trie. Each node holds the following fields (the names here are illustrative):
prefix: Compressed edge fragment.
children: A map keyed by the next character after prefix to a child node.
state: An enum with the different possible values:
  Internal: The node is just a prefix; no full key ends here.
  Value: A full key exists at this node.
  Tombstone: This key was explicitly deleted.
value: If state is Value, the corresponding value.

The root is a sentinel node with an empty prefix.

Put: Walk from the root, matching the longest common prefix against the key. On a partial match in the middle of an edge, split once: create a parent with the common part and two children, one for the old suffix and one for the new suffix. Otherwise, descend via the next child (next unmatched character). At the terminal node, set state = Value and store the value.

Get: Walk edges by longest-prefix match. If an edge doesn’t match, return not found. At the terminal node: if state is Value, return the value; if state is Internal or Tombstone, return not found.

Delete: Walk as in Put. If the path doesn’t fully exist, create the missing suffix nodes with state = Internal so that a terminal node exists. At the terminal node, set state = Tombstone (you may have to clear the value).

Flush process, via an in-order traversal:
Value: Emit the key/value pair.
Tombstone: Emit a tombstone.
Internal: Emit nothing.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Going further: use per-SSTable random seeds for the Bloom hash functions, and persist them in the Bloom filter files. In Bloom Filters, you introduced blocked Bloom filters, a variant that optimizes spatial locality by:
Dividing the Bloom filter into contiguous blocks, each the size of a cache line.
Restricting each query to a single block to ensure all bit lookups stay within the same cache line.

Switch to blocked Bloom filters and see the impact on latency and throughput. If you implemented the range-scan operation from week 5 (optional work), wire it to your memtable radix trie.

That’s it for this week! You optimized lookups with per-SSTable Bloom filters and switched the memtable to a radix trie, an ordered data structure. Since the beginning of the series, everything you built has been single-threaded, and flush/compaction remains stop-the-world. In two weeks, you will finally tackle the final boss of LSM trees: concurrency.

Further notes: If you want to dive deeper into tries, Trie Memtables in Cassandra is a paper that explains why Cassandra moved from a skip list + B-tree memtable to a trie, and what it changed for topics such as GC and CPU locality. A popular variant of the radix trie is the Adaptive Radix Tree (ART): it dynamically resizes node types based on the number of children to stay compact and cache-friendly, while supporting fast in-memory lookups, inserts, and deletes. This paper (or this summary) explores the topic in depth. You should also be aware that tries aren’t the only option for memtables, as other data structures exist. For example, RocksDB relies on a skip list. See this resource for more information. About Bloom filters: some engines keep a Bloom filter not only per SSTable but per data-block range as well. This was the case for RocksDB’s older block-based filter format (source). RocksDB later shifted toward partitioned indexes/filters, which split the index and full-file filter into smaller blocks with a top-level directory for on-demand loading. The official doc delves into the new approach.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.
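The radix-trie memtable described in this week's post can be sketched in a few dozen lines of Python. This is a minimal illustration under the post's node model (prefix, children, state, value; the names are illustrative), not a reference implementation:

```python
from enum import Enum, auto

class State(Enum):
    INTERNAL = auto()   # just a prefix; no full key ends here
    VALUE = auto()      # a full key exists at this node
    TOMBSTONE = auto()  # this key was explicitly deleted

class Node:
    def __init__(self, prefix=""):
        self.prefix = prefix        # compressed edge fragment
        self.children = {}          # next character -> child Node
        self.state = State.INTERNAL
        self.value = None

class RadixTrie:
    def __init__(self):
        self.root = Node()  # sentinel node with an empty prefix

    def _descend(self, key, create):
        node, i = self.root, 0
        while True:
            rest = key[i:]
            if not rest:
                return node
            child = node.children.get(rest[0])
            if child is None:
                if not create:
                    return None
                child = Node(rest)  # whole remaining suffix as one edge
                node.children[rest[0]] = child
                return child
            # longest common prefix between rest and the child's edge
            p, j = child.prefix, 0
            while j < len(p) and j < len(rest) and p[j] == rest[j]:
                j += 1
            if j < len(p):
                if not create:
                    return None  # key diverges or ends mid-edge
                # split the edge once: new parent keeps the common part
                mid = Node(p[:j])
                node.children[rest[0]] = mid
                child.prefix = p[j:]
                mid.children[child.prefix[0]] = child
                node, i = mid, i + j
            else:
                node, i = child, i + j

    def put(self, key, value):
        n = self._descend(key, create=True)
        n.state, n.value = State.VALUE, value

    def get(self, key):
        n = self._descend(key, create=False)
        return n.value if n is not None and n.state is State.VALUE else None

    def delete(self, key):
        # Tombstones are inserted even for keys the memtable never saw.
        n = self._descend(key, create=True)
        n.state, n.value = State.TOMBSTONE, None

    def scan(self):
        # In-order traversal used by the flush: yields keys sorted.
        def walk(node, acc):
            full = acc + node.prefix
            if node.state is not State.INTERNAL:
                yield full, node.state, node.value
            for c in sorted(node.children):
                yield from walk(node.children[c], full)
        yield from walk(self.root, "")
```

Because `scan()` emits keys in lexicographic order, the flush can stream straight into a sorted SSTable with no extra sort step, which is exactly the advantage over the hashtable memtable.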


Flexible I/O for Database Management Systems with xNVMe

Flexible I/O for Database Management Systems with xNVMe. Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün. CIDR’26

This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The “x” prefix serves a similar role to the “X” in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like “Mantle for NVMe”: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available (e.g., io_uring and libaio on Linux). The point of this paper is that xNVMe is easy to drop into an existing application. The paper describes an implementation of the DuckDB filesystem interface that uses xNVMe. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes how the integration supports shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance for the xNVMe-backed filesystem vs the baseline DuckDB filesystem. Fig. 5 shows one benchmark where it yields a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

Dangling Pointers: I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung’s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee and an API dominated by one vendor.

Justin Duke 2 weeks ago

One status field per model

Any sufficiently old application starts to succumb to a pernicious form of technical debt known in street parlance as shitty data modeling . Sometimes this manifests as the god object: a single model that represents the user, and the settings, and the DNS configuration, and twelve other things. Sometimes this comes in the form of a table (or multiple tables) where the initial set of data modeling concerns in the early goings of the project don't quite match the reality discovered along the way, and a series of subtle mismatches collide with each other in the same way that subtle mismatches between tectonic plates do. Data models, unlike other areas of tech debt, are correctly scary to refactor. Even in Django — an application framework with really robust, mature migration tooling — reshaping data in production is non-trivial. The weight associated with even relatively simple schema changes can be so overwhelming as to forever dissuade a would-be re-architect from making things right. Therefore, it is that much more important to spend the extra mental energy early on to make sure, whenever possible, your data model is a roughly correct one, and to course correct early when it isn't. There are many ways to do this, and the goal of describing a virtuous data model in its entirety is too large and broad a problem for this measly little essay. Instead, I want to share a heuristic that I have found particularly useful — one which is summed up, as many of my blog posts are, in the title. Every data model must have at most one status field. If you're thinking about making a change such that a model has more than one status field, you have the wrong data model. Let me illustrate via self-flagellation and talk about Buttondown's own problematic model: the newsletter object. The object has three status fields within its lush, expansive confines: a status field (the normal one), plus one for the sending domain and one for the hosting domain. This is incorrect.
We should have created standalone models for the sending domain and hosting domain, each with a simple status field of its own, and drawn foreign keys from the newsletter onto those. We did not do this, because at the time it felt like overkill. And so. You pay the price — not in any one specific bug, but in weirdness, in the difficulty of reasoning about the code. Is there a meaningful difference between one combination of the three statuses and another for an active newsletter? What queries should return which combinations? The confusion compounds. Again, I know this sounds trivial. But every good data model has syntactic sugar around the state machine, and every good state machine has a unary representation of its state. See also: enums.
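The "one status field per model" heuristic can be sketched in plain Python. The model and enum names below are hypothetical illustrations, not Buttondown's actual schema; in Django the dataclasses would be models and the references real foreign keys.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# One status enum per model: each state machine gets exactly one home.
class NewsletterStatus(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    ARCHIVED = "archived"

class DomainStatus(Enum):
    PENDING = "pending"
    VERIFIED = "verified"
    FAILED = "failed"

@dataclass
class SendingDomain:
    name: str
    status: DomainStatus = DomainStatus.PENDING

@dataclass
class HostingDomain:
    name: str
    status: DomainStatus = DomainStatus.PENDING

@dataclass
class Newsletter:
    # Exactly one status field. The domain state machines live on their
    # own models, referenced here like foreign keys would be in Django.
    status: NewsletterStatus = NewsletterStatus.DRAFT
    sending_domain: Optional[SendingDomain] = None
    hosting_domain: Optional[HostingDomain] = None
```

Each model now answers one question ("what state am I in?") with one field, so queries like "active newsletters with an unverified sending domain" become joins rather than guesswork about status-field combinations.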

Phil Eaton 3 weeks ago

Paths of MySQL, vector search edition

This is an external post of mine. Click here if you are not redirected.

Justin Duke 4 weeks ago

Brief notes on migrating to Postgres-backed jobs

It seems premature to talk about a migration that is only halfway done, even if it's the hard half that's done — but I think there's something useful in documenting the why and how of a transition while you're still in the thick of it, before the revisionist history of completion sets in. Early last year, we built out a system for running background jobs directly against Postgres within Django. This very quickly got abstracted out into a generic task runner — shout out to Brandur and many other people who have been beating this drum for a while. And as far as I can tell, this concept of shifting away from Redis and other less-durable caches for job infrastructure is regaining steam on the Rails side of the ecosystem, too. The reason we did it was mostly for ergonomics around graceful batch processing. It is significantly easier to write a poller in Django for stuff backed by the ORM than it is to try and extend RQ or any of the other task runner options that are Redis-friendly. Django gives you migrations, querysets, admin visibility, transactional guarantees — all for free, all without another moving part. And as we started using it and it proved stable, we slowly moved more and more things over to it. At the time of this writing, around half of our jobs by quantity — which represent around two-thirds by overall volume — have been migrated over from RQ onto this system. This is slightly ironic given that we also last year released django-rq-cron , a library that, if I have my druthers, we will no longer need. Fewer moving parts is the watchword. We're removing spindles from the system and getting closer and closer to a simple, portable, and legible stack of infrastructure.
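The ergonomic core of the approach above, polling a plain database table for jobs, is small enough to sketch. This is an illustration, not Buttondown's code: it uses sqlite3 so it runs anywhere, whereas a Postgres worker would typically claim rows with SELECT ... FOR UPDATE SKIP LOCKED so concurrent workers never grab the same job.

```python
import sqlite3

# A tiny database-backed job queue: jobs are just rows with a status column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued'  -- queued | running | done
    )
""")

def enqueue(name: str) -> None:
    conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    conn.commit()

def claim_next():
    # Claim the oldest queued job by flipping it to 'running'.
    # (Postgres: SELECT ... FOR UPDATE SKIP LOCKED inside a transaction.)
    cur = conn.execute(
        "SELECT id, name FROM jobs WHERE status = 'queued' "
        "ORDER BY id LIMIT 1")
    row = cur.fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row

def complete(job_id: int) -> None:
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()
```

Because the queue is just a table, you get migrations, admin visibility, and transactional guarantees from the same database that holds the rest of your data, which is exactly the "fewer moving parts" argument.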


Premium: The Hater's Guide to Oracle

You can’t avoid Oracle. No, really, you can’t. Oracle is everywhere. It sells ERP software – enterprise resource planning, which is a rat king of different services for giant companies for financial services, procurement (IE: sourcing and organizing the goods your company needs to run), compliance, project management, and human resources. It sells database software, and even owns the programming language Java as part of its acquisition of Sun Microsystems back in 2010 .  Its customers are fucking everyone: hospitals ( such as England’s National Health Service ), large corporations (like Microsoft), health insurance companies, Walmart, and multiple different governments. Even if you have never even heard of Oracle before, it’s almost entirely certain that your personal data is sitting in an Oracle-designed system somewhere.  Once you let Oracle into your house, it never leaves. Canceling contracts is difficult, to the point that one Redditor notes that some clients agreed to spend a minimum amount of money on services without realizing, meaning that you can’t remove services you don’t need even during the renewal of a contract . One user from three years ago told the story of adding two users to their contract for Oracle’s Netsuite Starter Edition ( around $1000 a month in today’s pricing ), only for an Oracle account manager to call a day later to demand they upgrade to the more expensive package ($2500 per month) for every user.   In a thread from a year ago , another user asked for help renegotiating their contract for Netsuite, adding that “[their] company is no where near the state needed to begin an implementation” and “would use a third party partner to implement” software that they had been sold by Oracle. 
One user responded by saying that Oracle would play hardball and “may even use [the] threat of attorneys.” In fact, there are entire websites about negotiations with Oracle, with Palisade Compliance saying that “Oracle likes a frenetic pace where contracts are reviewed and dialogues happen under the constant pressure of Oracle’s quarter closes,” describing negotiations with them as “often rushed, filled with tension, and littered with threats from aggressive sales and Oracle auditing personnel.” This is something you can only do when you’ve made it so incredibly difficult to change providers. What’re you gonna do? Have your entire database not work? Pay up. Oracle also likes to do “audits” of big customers where it makes sure that every single part of your organization that uses Oracle software is paying for it, and isn’t using it in a way its contract doesn’t allow. For example, Oracle sued healthcare IT company Perry Johnson & Associates in 2020 because the company that built PJ&A’s database systems used Oracle’s database software. The case was settled. This is all to say that Oracle is a big company that sells lots of stuff, and increases the pressure around its quarterly earnings as a means of boosting revenues. If you have a company with computers that might be running Java or Oracle’s software — even if somebody else installed it for you! — you’ll be paying Oracle, one way or another. They even tried to sue Google for using the open source version of Java to build its Android operating system (though they lost). Oracle is a huge, inevitable pain in the ass, and, for the most part, an incredibly profitable one. Every time a new customer signs on at Oracle, they pledge themselves to the Graveyard Smash and permanent fealty to Larry Ellison’s database empire.
As a result, founder Larry Ellison has become one of the richest people in the world — the fifth-richest as of writing this sentence — owning 40% of Oracle’s stock and, per Martin Peers of The Information, set to earn about $2.3 billion in dividends in the next year. Oracle has also done well to stay out of bullshit hype-cycles. While it quickly spun up vague blockchain and metaverse offerings, its capex stayed relatively flat at around $1 billion to $2.1 billion a fiscal year (which runs from June 1 to May 31), until it burst to $4.511 billion in FY2022 (which began on June 1, 2021, for reference), $8.695 billion in FY2023, $6.86 billion in FY2024, and then increased a teeny little bit to $21.25 billion in FY2025 as it stocked up on AI GPUs and started selling compute. You may be wondering if that helped at all, and it doesn’t appear to have. Oracle’s net income has stayed in the $2 billion to $3 billion range for over a decade, other than a $2.7 billion spike last quarter from the sale of its shares in Ampere. You see, things have gotten weird at Oracle, in part because of the weirdness of the Ellisons themselves, and their cozy relationship with the Trump Administration (and Trump himself). Ellison’s massive wealth backed son David Ellison’s acquisition of Paramount, which put conservative Bari Weiss at the helm of CBS in an attempt to placate and empower the right wing; David is currently trying to buy Warner Brothers Discovery (though it appears Netflix may have won), all in pursuit of kissing up to a regime steeped in brutality and bigotry that killed two people in Minnesota. Oracle will serve as the trusted security partner, responsible for auditing and ensuring compliance with National Security Terms, according to a memo. The company already provides cloud services for TikTok and manages user data in the U.S. Notably, Oracle previously made a bid for TikTok back in 2020.
I know that you’re likely a little scared that an ultra right-wing billionaire has bought another major social network. I know you think of Oracle as a massive and inevitable cloud storage platform owned by a man who looks like H.R. Giger drew Jerry Stiller. I know you’re likely worried about a replay of the Elon Musk Twitter fiasco, where every week it seemed like things would collapse but it never seemed to happen, and then Musk bought an election. What if I told you that things were very different, and far more existentially perilous for Oracle? You see, Oracle is arguably one of the single most evil and successful companies in the world, and it got there by being an aggressive vendor of database and ERP software, one that, like a tick with a law degree, cannot be removed without some degree of bloodshed. Perhaps not the highest-margin business in the world, but you know, it worked. Oracle has stuck to the things it’s known for for years and years and done just fine… …until AI, that is. Let’s see what AI has done for Oracle’s gross margi- OH MY GOD! The scourge of AI GPUs has taken Oracle’s gross margin from around 79% in 2021 to 68.54% in 2025, with CNBC reporting that FactSet-polled analysts saw it falling to 49% by 2030, which I think is actually being a little optimistic. Oracle was very early to high-performance computing, becoming the first cloud in the world to have general availability of NVIDIA’s A100 GPUs back in September 2020, and in June 2023 (at the beginning of Oracle’s FY2024), Ellison declared that Oracle would spend “billions” on NVIDIA GPUs, naming AI firm Cohere as one of its customers.
In May 2024, Musk and Ellison discussed a massive cloud compute contract — a multi-year, $10 billion deal that fell apart in July 2024 when Musk got impatient , a blow that was softened by Microsoft’s deal to buy compute capacity for OpenAI , for chips to be rented out of a data center in Abilene Texas that, about six months later, OpenAI would claim was part of a “$500 billion Stargate initiative” announcement between Oracle, SoftBank and OpenAI that was so rushed that Ellison had to borrow a coat to stay warm on the White House lawn, per The Information . “Stargate” is commonly misunderstood as a Trump program, or something that has raised $500 billion, when what it actually is is Oracle raising debt to build data centers for OpenAI. Instead of staying in its lane as a dystopian datacenter mobster, Oracle entered into negative-to-extremely-low margin realm of GPU rentals, raising $58 billion in debt and signing $248 billion in data center leases to service a 5-year-long $300 billion contract with OpenAI that it doesn’t have the capacity for and OpenAI doesn’t have the money to pay for . Oh, and TikTok? The billion-user social network that Oracle sort-of-just bought? There’s one little problem with it: per The Information , ByteDance investors estimate TikTok lost several billion dollars last year on revenues of roughly $20 billion, attributed to its high growth costs and, per The Information, “higher operational and labor costs in overseas markets compared to China.” Now, I know what you’re gonna say: Ellison bought TikTok as a propaganda tool, much like Musk bought Twitter. “The plan isn’t for it to be profitable,” you say. “It’s all about control” you say, and I say, in response, that you should know exactly how fucked Oracle is. In its last quarter, Oracle had negative $13 billion in cash flow , and between 2022 and late 2025 quintupled its PP&E (from $12.8 billion to $67.85 billion), primarily through the acquisition of GPUs for AI compute. 
Its remaining performance obligations are $523 billion , with $300 billion of that coming from OpenAI in a deal that starts, according to the Wall Street Journal, “ in 2027 ,” with data centers that are so behind in construction that the best Oracle could muster is saying that 96,000 B200 GPUs had been “delivered” to the Stargate Abilene data center in December 2025 for a data center of 450,000 GPUs that has to be fully operational by the end of 2026 without fail.  And what’re the margins on those GPUs? Negative 100% .  Oracle, a business borne of soulless capitalist brutality, has tied itself existentially to not just the success of AI , but the specific, incredible, impossible success of OpenAI , which will have to muster up $30 billion in less than a year to start paying for it, and another $270 billion or more to pay for the rest… at a time when Oracle doesn’t have the capacity and has taken on brutal debt to build it. For Oracle to survive , OpenAI must find a way to pay it four times the annual revenue of Microsoft Azure ($75 billion) , and because OpenAI burns billions of dollars, it’s going to have to raise all of that money at a time of historically low liquidity for venture capital .  Did I mention that Oracle took on $56 billion of debt to build data centers specifically for OpenAI? Or that the banks who invested in these deals don’t seem to be able to sell off the debt ? Let me put it really simply: We are setting up for a very funny and chaotic situation where Oracle simply runs out of money, and in the process blows up Larry Ellison’s fortune. However much influence Ellison might have with the administration, Oracle has burdened itself with debt and $248 billion in data center lease obligations — costs that are inevitable, and are already crushing the life out of the company (and the stock).  The only way out is if OpenAI becomes literally the most-successful cash-generating company of all time within the next two years, and that’s being generous. 
This is not a joke. This is not an understatement. Sam Altman holds Larry Ellison’s future in his clammy little hands, and there isn’t really anything anybody can do about it other than hope for the best, because Oracle already took on all that debt and capex. Forget about politics, forget about the fear in your heart that the darkness always wins, and join me in The Hater’s Guide To Oracle, or My Name’s Larry Ellison, and Welcome To Jackass. Larry Ellison’s wealth is almost entirely tied up in Oracle stock. Oracle’s stock is tied to the company “Oracle,” which is currently destroying its margins and annihilating its available cash to buy GPUs to serve a customer that cannot afford to pay it. Oracle has taken on ruinous debt that can only be paid if this customer, which cannot afford it and needs to raise money from an already-depleted venture capital pool, actually pays it. Oracle’s stock has already been punished for these debts, and that’s before OpenAI fails to pay for its contract. Oracle now owns part of one of its largest cloud customers, TikTok, which loses billions of dollars a year, and the US entity says, per Bloomberg, that it will “retrain, test and update the content recommendation algorithm on US user data,” guaranteeing that it’ll fuck up whatever makes it useful, reducing its efficacy for advertisers. Larry Ellison’s entire financial future is based on whether OpenAI lives or dies. If it dies, there isn’t another entity in the universe that can actually afford (or has interest in) the scale of the compute Oracle is building.

Dangling Pointers 1 month ago

UPP: Universal Predicate Pushdown to Smart Storage

UPP: Universal Predicate Pushdown to Smart Storage Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park ISCA'25 Working on hardware acceleration requires a healthy dose of honesty. If you try hard enough, you can find clever ways to accelerate a given application. Once that point is reached, it is helpful to step back and ask yourself: “are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?” This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation. I wonder if these ideas would also work well on other chips. This paper rests on two assumptions: Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries. Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not). A preprocessing step generates a 256-bit row vector (RV) associated with each row. These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row). When a query is run, the relevant filters from the query are converted into a set of 256-bit query vectors (QVs) and simple instructions which perform logical operations between the row vectors and query vectors. The result of those instructions is a single bit per row which determines if the row can be safely removed. Numerical expressions (e.g., ) are supported for monotone functions. During the preprocessing step, the lower and upper bounds of each column are computed. This space is divided into a fixed number of buckets. For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1. Hashing is used to handle the case where there are more buckets than bits allocated for a given column.
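The bucketing scheme above can be sketched as follows; the function names, the equal-width bucket mapping, and the per-column bit budget are illustrative assumptions, not details from the paper’s artifact (hashing for the bucket-overflow case is also omitted):

```python
# Sketch of UPP-style row-vector (RV) and query-vector (QV) filtering.
# Assumptions: equal-width buckets per column, one bit per bucket.

def bucket(value, lo, hi, nbits):
    # Map a value in [lo, hi] to one of `nbits` equal-width buckets.
    return min(int((value - lo) / (hi - lo) * nbits), nbits - 1)

def build_row_vector(row, col_bounds, bits_per_col):
    # Preprocessing: set one bucket bit per column value.
    rv = 0
    for col, value in enumerate(row):
        lo, hi = col_bounds[col]
        rv |= 1 << (col * bits_per_col + bucket(value, lo, hi, bits_per_col))
    return rv

def build_query_vector(col, lo, hi, col_bounds, bits_per_col):
    # Query time: set every bucket bit the predicate lo <= x <= hi touches.
    cmin, cmax = col_bounds[col]
    qv = 0
    for b in range(bucket(lo, cmin, cmax, bits_per_col),
                   bucket(hi, cmin, cmax, bits_per_col) + 1):
        qv |= 1 << (col * bits_per_col + b)
    return qv

def may_match(rv, qv):
    # Bitwise AND: an all-zero result proves the row can be removed.
    return (rv & qv) != 0
```

Note the false positives this produces: a row whose value lands in a partially covered bucket survives the filter even if it does not satisfy the predicate, which is exactly why the second, exact filtering stage is required.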
When a query is executed, the software can determine the set of buckets which the query references. For example, say the filter expression is: and the buckets for are: The query vector which selects rows which should not be filtered is (LSB first): To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector. If all bits of the result are zero, then the row can be removed. To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set. This means that multiple bits in a row vector can be set (one bit per word). Only tokens which appear frequently in the dataset are hashed, the rest are ignored. A query expression like is decomposed into three tokens, and each token is hashed, and the hash values determine which bits in the query vector are set. Rows are accepted if they have all 3 bits set. Note that this is best-effort filtering. For example, if a row contains the string in the column, that row will not be removed. Table 3 shows FPGA resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps. These numbers seem pretty modest (i.e., good) to me. The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible. Source: https://dl.acm.org/doi/10.1145/3695053.3731005 Fig. 6 shows TPC-H performance results. Each pair of bars represents a particular query run on a baseline system, and on a system with filtering pushed down to a Smart SSD. Q21 doesn’t see a speedup because it is bound by join and aggregation, not filtering. Source: https://dl.acm.org/doi/10.1145/3695053.3731005 Dangling Pointers I wonder how much of this is overfitting to TPC-H. 
If you look at the TPC-H spec, a lot of string columns are generated by randomly sampling from a very small set of possible tokens. It would be great if the industry had a “held out test set” which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.
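As a companion sketch of the token-hash string filter described in the summary above; the hash function, the 32-bit budget, and the frequent-token set are illustrative assumptions:

```python
# Sketch of best-effort string filtering via hashed tokens.
BITS = 32  # assumed per-column bit budget for string columns

def token_bit(token: str) -> int:
    # Toy polynomial hash; the paper's actual hash is not specified here.
    h = 0
    for ch in token:
        h = (h * 131 + ord(ch)) & 0xFFFFFFFF
    return h % BITS

def row_bits(text: str, frequent: set) -> int:
    # Preprocessing: one bit per frequent token in the string.
    bits = 0
    for tok in text.split():
        if tok in frequent:          # infrequent tokens are ignored
            bits |= 1 << token_bit(tok)
    return bits

def may_contain(row: int, query_tokens: list, frequent: set) -> bool:
    q = 0
    for tok in query_tokens:
        if tok in frequent:
            q |= 1 << token_bit(tok)
    # The row survives only if every queried (frequent) token's bit is set.
    return (row & q) == q
```

Because infrequent tokens contribute no bits, a query on them always passes the filter, again relying on the exact downstream stage to remove false positives.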

Binary Igor 1 month ago

Data Consistency: transactions, delays and long-running processes

Data Consistency is simply a need to keep data consistent within a particular boundary and time. There are two main scopes ... Local data is immediately consistent - there are no delays, it appears either in its final form or not at all. In the global scope however, things look totally different.

The Coder Cafe 1 month ago

Build Your Own Key-Value Storage Engine—Week 6

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Week 5: Leveling and Key-Range Partitioning Week 6: Block-Based SSTables and Indexing In week 2, you used JSON as the SSTable format. That works for document databases, but the overhead of this serialization format doesn’t make it the best choice for your storage engine: Best case: You stream the file and linearly scan entries until you find the key, but a miss means scanning the entire file. Worst case: You read the whole file and parse everything, then search for the key. This week, you will switch to block-based SSTables. Data will be chunked into fixed-size blocks designed to fit within a single disk page. The main benefits: Efficient I/O: Each lookup can fetch a complete block with a single page read. Predictable latency: Since every block maps to exactly one page, each read involves a fixed, bounded amount of I/O, improving latency consistency. Smaller on disk: Binary encoding typically compresses better than JSON. Integrity: Per-block checksums detect corruption without requiring a re-read of the file. Caching: Hot SSTable blocks are cached in a memory-based block cache to reduce I/O and decompression overhead. Alongside the data blocks, you will maintain a small index that stores the first key of each block and its corresponding offset, allowing lookups to jump directly to the relevant block without scanning all of them. 
💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord Fixed 64-byte keys and values: This removes a lot of the logic needed to keep blocks fixed-size, making the implementation easier to write and reason about. Because of the week 1 assumption (keys are lowercase ASCII strings), each character is one byte, which also makes the implementation easier. A block-based SSTable will be composed of: One index block (first 4 KB page) Multiple data blocks (each 4 KB) Each block has a fixed size of 4 KB. Aligning blocks to 4 KB means a disk read can fetch a block in one page. If blocks are not aligned, a read may span two pages. Here’s the file layout at a glance: The layout of an index block (4 KB): : The number of data blocks in the SSTable. A set of key entries (64 B), each being the first key of the corresponding data block. Entries are sorted by key and used to decide which block to fetch during a lookup. To make the index fit into a single 4 KB page, it must contain at most 63 entries. Here’s the layout (note this is a binary layout; newlines are used only for the representation): NOTE: If you’re not familiar with the concept of padding: it’s filling unused bytes (here with 0x00) so fields and blocks have fixed sizes. has a value between 0 and 63. If you encoded 63 as text, you would need two bytes ( = and = ). Instead, you can store it as a binary integer so it fits in one byte: . Same layout, with explicit offsets: An example of an SSTable with three data blocks, hence three entries. Remember: this is binary; newlines are for readability only: This index block indicates: Block 0 starts with the key . Block 1 starts with the key . Block 2 starts with the key . You don’t need to store per-block offsets.
Because the index is stored on a 4 KB page and every data block is exactly 4 KB and written contiguously, offsets can be calculated this way ( starts at 0): Block 0 starts at offset 4096. Block 1 starts at offset 8192. Block 2 starts at offset 12288. Now, let’s focus on data blocks. In addition to the key-value entries, reserve 8 bytes in the block at the start to store a CRC computed over + all entries; this lets you verify data integrity on read. The layout of a data block (4 KB per block): Header (128 B): (8 B): A checksum computed over bytes [8..4096). You can choose any standard variant (e.g., CRC-64/ECMA-182). (1 B): the number of entries in this block (0..31). Padding (119 B). Entries area (31 x 128 B = 3968 B), each entry is: (64 B, right-padded). (64 B, right-padded). The last data block may contain fewer than 31 entries ( ), but always pad with zeros to reach exactly 4 KB. This guarantees one-page reads and prevents errors across read modes (e.g., with mmap ). The layout of a data block (again, newlines are used only for the representation): Same layout, with explicit offsets: An example of a block composed of three key-value pairs: Note that because the index block holds at most 63 key entries, an SSTable can have at most 63 data blocks. With 31 entries per block, that caps an SSTable at 63 × 31 = 1,953 entries. A tombstone is represented by a value of 64 bytes all set to 0x00. Due to this sentinel, the all-zero value is reserved and cannot be used as an application value from this week onward. Searching for a value doesn’t change (memtable → L0 → L1, etc.). What changes is how you read one SSTable (remember: from L1, you only need to read one SSTable per level because of non-overlapping key ranges). The process to read from an SSTable: Binary search the index in to find the largest ≤ key and get . If not found (e.g., first index key is and your key is ), return a miss for this SSTable. Compute the block offset: . Fetch the corresponding 4 KB block. 
Verify CRC before using the block: Compute CRC64 over bytes [8..4096). Compare with the 8-byte CRC stored at offset 0..7. If it doesn’t match, fail the read for this SSTable. Binary search the entries in for the key. Return the corresponding value or a miss. Last week, you split at 2,000 entries during the compaction process. This week, because a single SSTable is limited to 1,953 entries, change the split threshold to 1,953. There are no changes to the client. Run it against the same file ( put-delete.txt ) to validate that your changes are correct. Drop the 64-byte constraint: store a length-prefixed key and value per entry (short header with key length and value length). Keep entries sorted and include the lengths in your checksum. Tombstones are currently represented by a sentinel value (a 64-byte all-zero value), which prevents storing an actual empty value. Instead, avoid reserving any value for deletes: add an explicit entry type per record (value or tombstone). Now that the format is binary, compression becomes more effective and saves more space. As an optional task, compress each data block independently so lookups still touch only one block: Record each block’s offset and compressed size in the index. Read just those bytes, decompress, and search. This packs more logical blocks into each cached page, raising cache hit rates, reducing pages touched during scans, and smoothing read latency. That’s it for this week! You implemented block-based SSTables and indexing, gaining benefits like more efficient I/O and reduced write amplification. In two weeks, you will focus on improving read performance by adding a layer that can tell whether an SSTable is worth parsing, and say goodbye to your hashtable-based memtable, replacing it with a more efficient data structure. For a production-grade implementation of block-based SSTables, see RocksDB’s block-based SSTable format . It details block layout, per-block compression, and how the index stores offsets and sizes. 
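The lookup procedure described above can be sketched end to end. Note the assumptions: zlib.crc32 stands in for the CRC-64 variant suggested in the post (Python’s standard library has no CRC-64), and the parsing follows the offsets described earlier, i.e., a 1-byte block count, 64-byte keys, and a 128-byte data-block header:

```python
import struct
import zlib
from bisect import bisect_right

PAGE = 4096          # one block per disk page
KEY = VAL = 64       # fixed-size keys and values
HDR = 128            # data-block header: 8 B CRC + 1 B count + padding
ENTRY = KEY + VAL    # 128 B per key-value entry

def read_key(sstable: bytes, key: bytes):
    """Look up `key` in one block-based SSTable; None means a miss."""
    padded = key.ljust(KEY, b"\x00")
    # Index block (page 0): block count, then the first key of each block.
    nblocks = sstable[0]
    first_keys = [sstable[1 + i * KEY : 1 + (i + 1) * KEY]
                  for i in range(nblocks)]
    # Largest first key <= search key; below the first index key is a miss.
    i = bisect_right(first_keys, padded) - 1
    if i < 0:
        return None
    # No per-block offsets needed: block i lives at page i + 1.
    block = sstable[(i + 1) * PAGE : (i + 2) * PAGE]
    # Verify the checksum over bytes [8..4096) before trusting the block.
    (stored,) = struct.unpack_from("<Q", block, 0)
    if zlib.crc32(block[8:]) != stored:
        raise IOError("checksum mismatch: fail the read for this SSTable")
    count = block[8]
    for e in range(count):  # entries are sorted; a binary search also works
        off = HDR + e * ENTRY
        if block[off : off + KEY] == padded:
            return block[off + KEY : off + ENTRY].rstrip(b"\x00")
    return None
```

A value of 64 zero bytes comes back as an empty result, matching the tombstone sentinel: since the all-zero value is reserved, a caller can treat it as a delete marker.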
You can also check out ScyllaDB’s SSTables v3 docs. ScyllaDB maintains a small in-memory summary of sampled keys to narrow the search, then uses the on-disk index to locate the exact block. This provides a nice contrast to our single-page index and illustrates how to scale when SSTables grow large. For a deeper look at how things work in practice in terms of directory structure, you can explore the ScyllaDB SSTables directory structure, which shows how metadata and data are organized on disk. Regarding CRC read failures, we mentioned that a checksum mismatch should simply cause the read to fail for that SSTable. In real systems, databases rely on replication to handle corruption. When multiple replicas exist, a system can recover by using data from an intact replica if one becomes corrupted or unavailable. Upon detecting a checksum mismatch, the system discards the corrupt replica and rebuilds it from a healthy one. This approach only works as long as a valid replica exists, which is why frequent checksum verification is critical: it ensures corruption is caught and repaired as early as possible, before it propagates. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button.

Justin Duke 1 month ago

Migrating to PlanetScale

First off, huge amount of credit to Mati for migrating our database to PlanetScale. I highly recommend reading the blog post. He does a good job talking about the boring stuff, which is to say, talking about the interesting stuff when reflecting on how this project went. Three other things come to mind, from my side of the fence: 1. This is the largest infrastructural project that Buttondown has done which I have not been a part of — a fact that is both surreal and, frankly, very cool. 2. Mati hinted at this in the blog post, but outside of the obvious quantitative improvements brought by PlanetScale, the insights dashboard is worth its weight in gold. Being able to very easily see problematic queries and fix them has drastically changed our database posture, perhaps even more than the change in hardware itself. I find myself staring at the insights dashboard and thinking about what other parts of the infrastructure — CI tests, outbound SMTP, etc. — we need this for as well. 3. PlanetScale folks recommended we start using sqlcommenter to annotate our queries with the API routes that generated them. This was a good suggestion, but that package has the same malnourishment problem that so many Google-born OSS projects do: you end up pulling in a lot of dependencies, including some very old ones, to execute what is essentially 50 lines of code. Rather than do that, I asked Mr. Claude to vend the relevant snippets into a middleware that we can plop into our codebase. It is below:
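The core of the sqlcommenter convention is tiny: serialize sorted key-value pairs as a URL-encoded SQL comment appended to the query. A hypothetical, framework-agnostic sketch of that core (not Buttondown’s actual middleware, which isn’t reproduced here):

```python
from urllib.parse import quote

def add_sqlcomment(sql: str, **tags: str) -> str:
    """Append a sqlcommenter-style comment, e.g.
    SELECT 1 /*route='%2Fapi%2Femails'*/"""
    if not tags or "/*" in sql:
        return sql  # nothing to add, or already annotated
    body = ",".join(f"{quote(k, safe='')}='{quote(v, safe='')}'"
                    for k, v in sorted(tags.items()))
    return f"{sql.rstrip().rstrip(';')} /*{body}*/"
```

In a Django app, a function like this could be wired in with connection.execute_wrapper so every query in a request gets tagged with its resolved route.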

The Coder Cafe 1 month ago

Build Your Own Key-Value Storage Engine—Week 5

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. I’ll also give a talk there, so feel free to join! Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Week 5: Leveling and Key-Range Partitioning Last week, you implemented deletion and compaction, making sure the LSM tree wouldn’t grow indefinitely. Still, there’s a weak spot: in the worst-case scenario (e.g., on a key miss), a single read has to scan all SSTables. To address this, you will implement leveling, a core idea in LSM trees. Instead of a single flat list of SSTables, leveling stores data across multiple levels: , , , etc. gets compacted to and makes space for future memtable flushes. gets compacted to and makes space for compaction. gets compacted to and makes space for compaction. gets compacted to and makes space for compaction. This process is called level compaction. Something important to understand: is slightly different from all the other levels. is created during memtable flushes. If a key already exists at and also in the memtable, the next flush can write that key again to a new file. In other words, can have overlapping keys. For all the other levels ( to ), that’s not the case. They are created by compaction, which removes duplicates and produces non-overlapping key ranges. In this week’s simplified design, an to compaction takes all SSTables from and , performs a k-way merge, then rewrites fully. As a result, each key appears at most once per level from downward. What’s the consequence of non-overlapping keys? You can improve lookups using a simple range-to-file mapping, for example: Keys from to are stored in this SSTable. Keys from to are stored in this SSTable. 
With this setup, a read checks only one SSTable per level from to . is the exception due to overlaps, so a read may still need to scan all SSTables. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord Limit the number of levels to two: , which may contain overlapping keys. , no overlapping keys. Create a folder for each level: , and . Keep one global file at the root. You will create a layout for both and : remains a simple list of SSTables. allows key-range partitioning. For example: This indicates: is composed of three SSTables: Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . The main goal of the compaction process is to compact both and . At the end, you should merge all the data from and into . will be left empty. When reaches five full SSTable files (2,000 entries each), run an → compaction: Open iterators on all and SSTables. Apply the k-way merge algorithm: Comparator: Primary: . Tie-break (equal ): Prefer over . At , prefer the newest SSTable. Version order: any record from is newer than records from . Within , newer files win (same as week 4). Keep at most one record per key (newest wins). Tombstones: because is the bottom level, drop a tombstone if no older value for that key remains in the merge result. Create new L1 SSTables with at most 2,000 entries. When naming new L1 files, make sure they are unique. For example, if contains and , the first SSTable file created should be . Publish atomically: each new file the directory. Update the atomically. the file. the root directory (the directory containing the file and and folders). Delete obsolete L1 files, then . Delete all files in , then . The logic is unchanged from previous weeks. The only difference is that flush writes to and updates the file in the section. Check the memtable. 
If not found, scan all files newest to oldest using section of the . If not found at : Use the section of the to choose the one shard that contains the key’s range, then read only that L1 file. Return the value if found; otherwise, return . There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct. Introducing leveling has a fundamental impact on deletions. With a single level, compaction sees all versions of every key at once, so a tombstone can be dropped as soon as it has “killed” every older record for that key. Yet, the rule we mentioned last week holds true: a tombstone can be evicted only after all the data it shadows no longer exists on disk. With multiple levels, compaction must propagate tombstones downward. It’s only at the bottommost level that tombstones can be dropped, because only there can you prove they no longer shadow any other records. As an optional task, make the number of levels configurable: , , …, : Define a size ratio so each level has a target size larger than the previous one. Keep one directory per level: , , …, . Keep a single global . When a level reaches its max number of SSTables (derived from the size ratio), compact that level into the next. Only drop tombstones at the bottommost level . At any intermediate level with , propagate the tombstone downward during compaction. Implement : Return all keys between (included) and (excluded). Use put-delete-scan.txt to validate that your changes are correct. It introduces the keyword. For example: This line means: between (included) and (excluded), the keys are , , (the output will always be sorted) NOTE: If this route conflicts with , rename the single-key route to . That’s it for this week! Your LSM tree is taking shape. You implemented leveling, a key LSM design idea, and refined compaction so reads are tighter and storage stays under control. In two weeks, you will revisit the week 2 choice of JSON for SSTables.
You will switch to block-based SSTables to reduce parsing and I/O overhead and add indexing within each SSTable. We mentioned that, because of key overlaps, a read may still need to scan all SSTables (e.g., key miss). This is the main reason why is typically kept small. In general, each level is larger than the one above it by a fixed size ratio (e.g., 10×). Some databases even use less static mechanisms. For instance, RocksDB relies on Dynamic Leveled Compaction , where the size of each level is automatically adjusted based on the size of the oldest (last) level, eliminating the need to define each level’s size statically. Regarding compaction, you should know that in real-world databases, it isn’t done in batch mode across all data. Let’s understand why. Suppose you have four levels and a layout like this for one key: The key exists at L3. The key doesn’t exist at L2. The key is updated at L1. A tombstone is placed at L0. You can’t compact L0 with L1/L2/L3 in one shot; that would mean checking every SSTable against every level. What happens in reality is that compaction is a promotion process. In our example, the tombstone at L0 is promoted to L1. Implementations ensure that it either (a) is compacted together with the L1 SSTable it shadows, or (b) waits until that L1 data is promoted to L2. The same rule repeats level by level, until the tombstone reaches L3 and finally removes the shadowed value. Meanwhile, it’s essential to understand that compaction is crucial in LSM trees. Let’s take some perspective to understand the reason. An LSM tree buffers writes in a memtable and flushes to L0. Compaction merges SSTables across levels to control read amplification. If compaction falls behind, L0 files accumulate, flushes slow down (or stall at file-count thresholds), write latency climbs, and in the worst case, you can observe write pauses. Not because the memtable is “locked,” but because the engine can’t safely create more L0 files until compaction catches up. 
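The L0 → L1 k-way merge described in this post can be sketched with a heap. Here each SSTable is assumed to be a sorted list of (key, value) pairs, and a lower source index means a newer source (a simplification of the week’s file-naming and version rules):

```python
import heapq

TOMBSTONE = None  # stand-in for the week-4 tombstone marker

def compact(sources):
    """Merge sorted (key, value) runs; sources[0] is the newest.
    The newest record per key wins; because L1 is the bottom level,
    surviving tombstones are simply dropped."""
    heap = []
    iters = [iter(s) for s in sources]
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            # Ties on key break on source index: lower index = newer wins.
            heapq.heappush(heap, (first[0], i, first[1]))
    out, last_key = [], object()
    while heap:
        key, i, value = heapq.heappop(heap)
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt[1]))
        if key == last_key:
            continue                 # older duplicate: discard
        last_key = key
        if value is not TOMBSTONE:
            out.append((key, value))
    return out
```

Splitting the merged output into new 2,000-entry (or, after this week, 1,953-entry) SSTable files is then just a matter of chunking `out`.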
This is one of the reasons why the RUM conjecture we introduced last week is important. If you compact too eagerly, you burn a lot of disk I/O and lose the LSM’s write advantage. If you compact too lazily, you incur a penalty on your read path. If you compact everything all the time, you incur a space-amplification penalty during compaction roughly equal to the working set size. Because compaction is so important, most key-value stores support parallel compactions across levels (except → , which isn’t parallelized due to overlapping key ranges in L0). You should also be aware that ongoing research keeps improving compaction. For example, the SILK: Preventing Latency Spikes in LSM Key-Value Stores paper analyzes why LSM systems can exhibit high tail latency. The main reason is that limited I/O bandwidth causes interference between client writes, flushes, and compactions. The key takeaway is that not all internal operations are equal. The paper explores solutions such as Bandwidth awareness: Monitor client I/O and allocate the leftover to internal work dynamically instead of static configuration. Prioritization: Give priority to operations near the top of the tree (flushes and L0 → L1 compaction). Slowdowns there create backpressure that impacts tail latency more than work at deeper levels. Last but not least, what you implemented this week is called level compaction. Other strategies like tiered compaction exist, which merge SSTables based on their size and count rather than fixed levels. You can explore this great resource from Mark Callaghan, which dives deeper into the design trade-offs and performance characteristics of different compaction strategies in LSM trees.
❤️ If you enjoyed this post, please hit the like button. Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Week 5: Leveling and Key-Range Partitioning Last week, you implemented deletion and compaction, making sure the LSM tree wouldn’t grow indefinitely. Still, there’s a weak spot: in the worst-case scenario (e.g., on a key miss), a single read has to scan all SSTables. To address this, you will implement leveling, a core idea in LSM trees. Instead of a single flat list of SSTables, leveling stores data across multiple levels: , , , etc. gets compacted to and makes space for future memtable flushes. gets compacted to and makes space for compaction. gets compacted to and makes space for compaction. gets compacted to and makes space for compaction. This process is called level compaction. Something important to understand: is slightly different from all the other levels. is created during memtable flushes. If a key already exists at and also in the memtable, the next flush can write that key again to a new file. In other words, can have overlapping keys. For all the other levels ( to ), that’s not the case. They are created by compaction, which removes duplicates and produces non-overlapping key ranges. In this week’s simplified design, an to compaction takes all SSTables from and , performs a k-way merge, then rewrites fully. As a result, each key appears at most once per level from downward. What’s the consequence of non-overlapping keys? You can improve lookups using a simple range-to-file mapping, for example: Keys from to are stored in this SSTable. Keys from to are stored in this SSTable. Limit the number of levels to two: , which may contain overlapping keys. , no overlapping keys. Create a folder for each level: , and . Keep one global file at the root. remains a simple list of SSTables. allows key-range partitioning. 
is composed of three SSTables: . : Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . Open iterators on all and SSTables. Apply the k-way merge algorithm: Comparator: Primary: . Tie-break (equal ): Prefer over . At , prefer the newest SSTable. Version order: any record from is newer than records from . Within , newer files win (same as week 4). Keep at most one record per key (newest wins). Tombstones: because is the bottom level, drop a tombstone if no older value for that key remains in the merge result. Create new L1 SSTables with at most 2,000 entries. When naming new L1 files, make sure they are unique. For example, if contains and , the first SSTable file created should be . Publish atomically: each new file the directory. Update the atomically. the file. the root directory (the directory containing the file and and folders). Clean up: Delete obsolete L1 files, then . Delete all files in , then . Check the memtable. If not found, scan all files newest to oldest using section of the . If not found at : Use the section of the to choose the one shard that contains the key’s range, then read only that L1 file. Return the value if found; otherwise, return . Define a size ratio so each level has a target size larger than the previous one. Keep one directory per level: , , …, . Keep a single global . When a level reaches its max number of SSTables (derived from the size ratio), compact that level into the next. Only drop tombstones at the bottommost level . At any intermediate level with , propagate the tombstone downward during compaction. Return all keys between (included) and (excluded). Use put-delete-scan.txt to validate that your changes are correct. It introduces the keyword. For example: This line means: between (included) and (excluded), the keys are , , (the output will always be sorted) The key exists at L3. The key doesn’t exist at L2. 
The key is updated at L1. A tombstone is placed at L0. This is one of the reasons why the RUM conjecture we introduced last week is important. If you compact too eagerly, you burn a lot of disk I/O and lose the LSM’s write advantage. If you compact too lazily, you incur a penalty on your read path. If you compact everything all the time, you incur a space-amplification penalty during compaction roughly equal to the working set size. Bandwidth awareness: Monitor client I/O and allocate the leftover to internal work dynamically instead of relying on a static configuration. Prioritization: Give priority to operations near the top of the tree (flushes and L0 → L1 compaction). Slowdowns there create backpressure that impacts tail latency more than work at deeper levels.
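The k-way merge at the heart of level compaction can be sketched as follows. This is a minimal in-memory sketch: SSTables are sorted lists of `(key, value)` pairs ordered newest-first, and a tombstone is modeled as `None` — the real engine streams records from files instead:

```python
import heapq

TOMBSTONE = None  # sketch: a tombstone is modeled as a None value

def compact(sstables):
    """K-way merge. `sstables` is a list of SSTables, newest first; each
    SSTable is a list of (key, value) pairs sorted by key. The newest
    version of each key wins; tombstones are dropped (bottom level)."""
    iters = [iter(t) for t in sstables]
    heap = []  # entries: (key, age, value); age 0 = newest, breaks ties

    def push(age):
        rec = next(iters[age], None)
        if rec is not None:
            heapq.heappush(heap, (rec[0], age, rec[1]))

    for age in range(len(iters)):
        push(age)

    merged = []
    while heap:
        key, age, value = heapq.heappop(heap)  # newest version of `key`
        push(age)  # advance the winner's iterator
        # Drain older versions of the same key and advance their iterators.
        while heap and heap[0][0] == key:
            _, older_age, _ = heapq.heappop(heap)
            push(older_age)
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

# Newest SSTable deletes "b" and updates "a"; older values are evicted.
new = [("a", "2"), ("b", TOMBSTONE)]
old = [("a", "1"), ("b", "1"), ("c", "1")]
print(compact([new, old]))  # [('a', '2'), ('c', '1')]
```

The `age` component in each heap tuple implements the tie-break: when two iterators are positioned on the same key, the entry from the newer SSTable sorts first and wins.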

0 views
Evan Schwartz 2 months ago

Scour Year End Update 2025

I thought about sending out a personalized "Scour Wrapped"... until I got the 7th Wrapped from some random service. So instead, I'll just say Happy New Year and thanks for your support in 2025! 🥂 These were the new features added since the last update in October. Scour now identifies articles that are paywalled and indicates them with a yellow dollar sign next to the domain. In your settings , you can opt to hide paywalled content. If you do, you can also exempt specific domains where you have a subscription so you will see their content even if it is behind the paywall. Thank you to Johnny and Allen for requesting this feature! For anyone interested in the technical details, I wrote a blog post about a neat SQL trick I came across while building this: Short-Circuiting Correlated Subqueries in SQLite . You can also now block content from specific websites. The option to block a domain can be found by clicking the "..." button below each post. You can see and manage your excluded domains in your settings . Thanks to Vahe for this suggestion! If you subscribe to specific feeds (as opposed to scouring all of them), Scour will now recommend other sources for you to follow right in your personalized feed. These recommendations are based on Scour looking for content that matches your interests that you aren't currently getting. You can find more recommendations on your Feeds page . Each feed also now displays its three most recent posts below its description to make it easier to know what you'll get if you subscribe. You can click on the feed's title to see all of the posts from that feed. Thanks to Tiago for this suggestion! By default, clicking on a link to a post will bring you to the original website where it was published. However, if you prefer to read it on Scour, you can read the Preview, which can be found in the "..." menu under each post. Thanks to Linh for this suggestion! 
The filter menu for your feed (accessible via the button next to where it says Your Top Finds) should be clearer and more mobile-friendly. You can filter by time range and toggle between seeing posts from feeds you’ve subscribed to or see posts from everyone’s feeds. Thanks Stefan for the feedback on this! A number of people have told me that they are confused about how the love/like/dislike reactions are used on Scour. I'll work on making this clearer in the future but in the meantime, there's now a section in the FAQs about this. The answer is: Loves and likes are saved to your Likes page, so you can use them to bookmark interesting content. Unlike most content aggregators, Scour does not use reactions to change what shows up in your feed. Instead, reactions are used to generate Interest Recommendations for you. Scour only shows content related to topics you've explicitly chosen. You can also subscribe to other users' Likes as feeds. Everyone's reactions contribute to the Popular Posts page. Here were some of my favorite posts I found on Scour in November and December: Thanks to everyone who wrote about Scour on their blog or website in 2025! This included: If you write about Scour in the future, or if you already did and I didn't include you, please let me know! Thank you to everyone who provided feedback on Scour this year! Specifically, thank you to Aaron, Alberto, Alex K, Alex W, Allen, Andrew D, Andrew M, Andy M, Andy P, Cairin, Cole, Daniel, Elyem, Hary, Imperfect, Jadi, Jeppe, Jesse, Johnny, Jon, Karit, Kilpatrj, Linh, Proudmuslim-dev, Ryan, Sarah, Stefan, Tiago, Tomáš, Tyler, and Vahe. And thank you to all of the anonymous feedback givers as well! Because you made it to the end of the post, here's a little preview of an upcoming feature for you. Let's say you want to only see posts from small websites, like individuals' blogs. You can now try filtering your feed by how many posts each website or feed publishes per month. 
For example, you can use these links to see only posts from quieter domains or quieter feeds . Or, you can try this one to only see articles from larger websites . Let me know what you think! UI for controlling these filters is coming soon! Happy New Year and happy Scouring! - Evan Scour scoured 9,940,460 posts from 15,608 feeds 1,013 new users signed up (welcome!!) 12,620 interests were added, with 6,688 of those from recommendations 26,702 posts were read, 3,023 were liked, and 383 were loved 55 suggestions on the feedback board were completed Paper AI Tigers Build / Buy / Bot More databases should be single-threaded Disks Lie: Building a WAL that actually survives Minsuk Kang: Scour and minifeed are 100X better than Instagram and X (January) Winther: Blog Discovery (June) Daniel Prindii: My Read it later and discoverability systems in 2025 (July) PPC Land: Developer revives RSS with AI while Google targets syndication infrastructure (August) Tomáš Burkert: RSS feeds discovery strategies (October) Alex White: Discovering the Indie Web (November) Matt Maldre: Search engine for blogs (November) Andrew Doran: Tools for discovering the IndieWeb (December)

0 views
Grumpy Gamer 2 months ago

Sqlite Comments

When I started using Hugo for static site generation I lost the ability to have comments, and we all know how supportive the Internet can be, so why wouldn’t you have comments? I wrote a few PHP scripts that I added on to Hugo and I had comments again. I decided to store the comments as flat files so I didn’t complicate things by needing the bloated MySQL. I wanted to keep it as simple and fast as possible. When a comment is added, my PHP script creates a directory (if needed) for the post and saves the comment out as a .json file named with the current time to make sorting easy. When the blog page is displayed, these files (already sorted thanks to the filename) are loaded and displayed. And it all worked well until it didn’t. Flat files are simple, but they can be hard to search or maintain if they need cleaning up or dealing with after a spam attack. I figured I’d use command-line tools to do all of that, but it’s a lot more cumbersome than I first thought. I missed having them in a SQL database. I didn’t want to install MySQL again, but my site doesn’t get a lot of commenting traffic so I could use Sqlite instead. The downside is Sqlite write-locks the database while a write is happening. In my case it’s a fraction of a second and wouldn’t be an issue. The second problem I had was that the version of Ubuntu my server was using is 5 years old and some of the packages I wanted weren’t available for it. I tried to update Ubuntu and for reasons I don’t fully understand I couldn’t. So I spun up a new server. Since grumpygamer.com is a static site I only had to install Apache and I was off and running. Fun times. But the comment flat files still bugged me and I thought I’d use this as an opportunity to convert over to Sqlite. PHP/Apache comes with Sqlite already installed, so that’s easy. A long weekend later and I rewrote the code to save comments and everything is back and working. Given that a webserver and PHP already needed to be installed, it isn’t a big deal to use Sqlite. 
If you’re not comfortable with SQL, it might be harder, but I like SQL.
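For anyone curious, the flat-file-to-Sqlite move looks roughly like this. A minimal sketch in Python rather than PHP, with a made-up schema — the post doesn't show the real table layout:

```python
import sqlite3
import time

# Assumed schema for illustration; the real site would use a file on disk
# instead of an in-memory database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE IF NOT EXISTS comments (
    post_slug TEXT NOT NULL,
    posted_at INTEGER NOT NULL,  -- unix time, replaces timestamp filenames
    author TEXT,
    body TEXT
)""")

def add_comment(slug, author, body):
    # A single INSERT briefly write-locks the database, which is fine
    # for a low-traffic comment section.
    with db:
        db.execute("INSERT INTO comments VALUES (?, ?, ?, ?)",
                   (slug, int(time.time()), author, body))

def comments_for(slug):
    # ORDER BY replaces the old "sort by filename" trick, and cleanup
    # after a spam attack becomes a single DELETE ... WHERE query.
    return db.execute("SELECT author, body FROM comments "
                      "WHERE post_slug = ? ORDER BY posted_at",
                      (slug,)).fetchall()

add_comment("sqlite-comments", "Ron", "First!")
print(comments_for("sqlite-comments"))  # [('Ron', 'First!')]
```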

0 views
matklad 2 months ago

Static Allocation For Compilers

TigerBeetle famously uses “static allocation” . Infamously, the use of the term is idiosyncratic: what is meant is not arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory a TigerBeetle process uses is not hard-coded into the ELF binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without allocating. I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea? Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle. The only inputs and outputs of the system are messages. Each message is finite in size (1MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory! This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as the inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem. The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrarily many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available. 
Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off! Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust , but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes! The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail at runtime when it runs out of space, and it includes a dynamic block allocator to avoid that as long as possible by re-using no-longer-relevant sectors. So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of the compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite. While limiting the total size of the code-base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrarily large, but the filter program itself must execute in O(1) memory. 
With this setup, it is natural to use indexes rather than pointers for “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (compiler sees a new file), but also temporally (compiler sees a new version of an old file). Are there any practical benefits here? I don’t know! But it seems worth playing around with! I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify a compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers would lead to simpler code the same way it does for databases?
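The "allocate everything at startup, then drop work on the floor when resources run out" pattern can be sketched in a few lines. This is illustrative pseudocode for the idea, not TigerBeetle's actual implementation (which is in Zig); the class and parameter names are made up:

```python
# All buffer counts and sizes come from runtime arguments, not the binary:
# allocation happens once at startup, then buffers are only ever reused.
class MessagePool:
    def __init__(self, count, message_size):
        # The one and only allocation point, at startup.
        self.free = [bytearray(message_size) for _ in range(count)]

    def acquire(self):
        # No buffer available -> caller drops the message; over an
        # unreliable network, the sender will retry anyway.
        return self.free.pop() if self.free else None

    def release(self, buf):
        self.free.append(buf)

pool = MessagePool(count=2, message_size=1024)
a, b = pool.acquire(), pool.acquire()
print(pool.acquire())       # None -> message dropped, no new allocation
pool.release(a)
print(pool.acquire() is a)  # True -> buffers are reused, never reallocated
```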

0 views
The Coder Cafe 2 months ago

Build Your Own Key-Value Storage Engine—Week 4

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Over the past few weeks, you built an LSM tree and three main components: a memtable, SSTables, and a WAL that records the same operations you keep in the memtable. To prevent on-disk data from growing forever, you will implement compaction, a critical process in LSM trees. Compaction periodically merges SSTables to reclaim space and keep read performance predictable. For example, if a key exists in every SSTable on disk, compaction drops the duplicates and keeps only the newest record. In addition, you will implement a delete endpoint. Handling deletes in an LSM tree isn’t straightforward at all: SSTables are immutable. To preserve the append-only nature of LSM trees, deletions are written as tombstones: markers indicating a key was logically deleted. You write it to the WAL, keep it in the memtable, and propagate it during flush. How should compaction work in the presence of tombstones? 
Suppose you have the following SSTables on disk: the key 1234 exists in SSTable 1 (1234 = foo), doesn’t exist in SSTable 2, exists in SSTable 3 (1234 = bar), and is deleted at SSTable 4: [Diagram: four vertically stacked boxes labeled “SSTable 1” through “SSTable 4”, containing “1234 = foo,” “Key 1234 doesn’t exist,” “1234 = bar,” and a tombstone for 1234.] If the key doesn’t 
exist in the memtable, the current state for key 1234 is deleted. Now, imagine that during compaction, you merge SSTable 3 and SSTable 4 into a new SSTable, SST-5. As the key is marked as deleted in the newest SSTable, you may decide to drop the tombstone, as it hides the key in SSTable 3: Now, a lookup for 1234 would do: Key doesn’t exist in the memtable → Continue. Key doesn’t exist in SST-5 → Continue. Key doesn’t exist in SST-2 → Continue. Key exists in SST-1 → Return foo (instead of a not-found result). The fundamental rule is the following: during compaction, a tombstone can be evicted only after all the data it shadows no longer exists on disk. Otherwise, dropping a tombstone too early can make an old value reappear. This is known as data resurrection: a key that “comes back to life” after a deletion. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord Flush and compaction should be single-threaded and stop-the-world operations: do not serve client requests until the operations complete. Append a tombstone record to the WAL file. Update the memtable: Do not remove the key directly; mark it deleted with a tombstone. Acknowledge the request. During flush, carry tombstones into the new SSTable using a new field. For example: The keys must remain sorted. The goals of the compaction process for this week are the following: For each key, keep only the newest record. Drop records hidden by newer versions. This is where merging happens: the newest record wins, and older versions are evicted. Drop tombstones when no older value remains. The compaction trigger is: every 10,000 update requests (PUT and DELETE, not GET), compact all SSTables. Algorithm (k-way merge using a min-heap on key): Open an iterator for each SSTable file known by the MANIFEST. Push each iterator’s current record into a min-heap with the following comparator: Primary: key. Tie-break (equal key): Newest SSTable first based on MANIFEST order (to make sure an old value doesn’t win). 
While the heap is not empty: Pop the smallest key (this first pop is the newest version of the key due to the tie-break). Drain all other heap entries with the same key and discard them (older values). For the record you picked: If it’s a tombstone, emit nothing for that key. Otherwise, emit the value. Advance only the iterators you drained for that key and push their next records into the heap. Stream emitted records (sorted) into new SSTables. Remember: the max entries in an SSTable should remain 2,000. fsync each new SSTable file, then fsync its parent directory. Update the MANIFEST atomically (see week 3). Remove the old SSTable files. Check the memtable: If the key is marked as deleted, return a not-found result. Else, return the value. Scan SSTables from newest to oldest, given the MANIFEST order (same as before). For the first record with the requested key: If it’s a tombstone, return a not-found result. Else, return the value. If the key isn’t found, return a not-found result. When replaying the WAL, make sure to take into account tombstone values. Update your client to handle delete lines → Send a request to the delete endpoint. Download and run your client against a new file containing requests: put-delete.txt . NOTE : Refer to week 1 if you need to generate your own file with the number of lines you want. That’s it for this week! Your storage engine now supports deletes and a compaction mechanism that prevents unbounded growth. The Coder Cafe will take a break for two weeks. On January 7th, you will continue exploring LSM trees and cover leveling. In your current implementation, a miss still scans all SSTables; therefore, you will also add key-range partitioning to limit the number of SSTables that need to be checked during a lookup. See you next year! The compaction trigger you used was simple: every 10,000 PUT or DELETE requests. In real systems, compaction is usually driven by factors such as too many SSTable files, space pressure, or high read amplification. Also, many systems add safeguards to keep compaction controlled and resource-efficient. 
For example, a common one is bounded fan-in (merging only a small, fixed number of SSTables per batch), so the engine never opens every file at once. Others track each SSTable’s first and last key to select only overlapping candidates, hence avoiding unrelated files. Taking a step back, it’s interesting to note that the core LSM idea—append-only writes with regular compaction—shows up in many systems, even outside pure LSM trees. For example: Lucene : Immutable segments are created and later merged in the background, an LSM-like pattern, even though it isn’t an LSM tree per se. Memcached Extstore : Flushes values to free RAM, but keeps the hashtable, keys, and storage pointers in memory. It later compacts the data. Kafka : Rewrites segments to keep the latest value per key and drop older versions, which is conceptually similar to SSTable compaction. Also, we briefly introduced the concept of data resurrection in the introduction. You should be aware that this is a common challenge with LSM trees. In real-world conditions, crashes, slow WAL truncation, and complex compaction can allow an old value to be replayed during recovery after its tombstone has been removed, leading to key resurrection. Here are two great references that delve more into this kind of problem: Preventing Data Resurrection with Repair Based Tombstone Garbage Collection Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla Another excellent reference is Acheron: Persisting Tombstones in LSM Engines . It shows how standard LSM compaction can leave tombstones stuck for long periods, so “deleted” data may still linger in lower levels and complicate compliance with regulations such as GDPR/CCPA. The paper introduces delete-aware techniques that prioritize pushing tombstones down the tree to make deletions persist more predictably. Lastly, you can explore the RUM conjecture . Structurally, it’s similar to the CAP theorem: “three things, pick two”. 
In short, you can make a database excel at two of: reads, updates (insert/change/delete), and memory/space, but not all three at once. Make any two really good and the third gets worse; that’s an unavoidable trade-off. This helps explain why, for example, LSM trees optimized for fast updates and good space efficiency pay a cost in read performance due to read amplification. That trade-off shows up in the design of the compaction process you implemented this week: you trade space and significant I/O for simplicity by compacting everything in one shot. This is fine for the example, but with 500GB of SSTables, you may need roughly another 500GB of free space during the merge in the worst case. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button. 
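The tombstone-aware read path described in this post can be sketched as follows. A minimal in-memory sketch: dicts stand in for the memtable and the SSTables, and `None` models a tombstone — the real engine reads SSTable files in MANIFEST order:

```python
TOMBSTONE = None        # sketch: a tombstone is modeled as a None value
NOT_FOUND = object()    # sentinel for "key doesn't exist"

def get(key, memtable, sstables):
    """Check the memtable, then SSTables newest to oldest (MANIFEST
    order). The first record found decides the outcome: a tombstone
    means the key is deleted, even if older values still exist below."""
    if key in memtable:
        value = memtable[key]
        return NOT_FOUND if value is TOMBSTONE else value
    for table in sstables:  # newest first
        if key in table:
            value = table[key]
            return NOT_FOUND if value is TOMBSTONE else value
    return NOT_FOUND

memtable = {}
sstables = [{"1234": TOMBSTONE},  # newest: the key is deleted here
            {"1234": "foo"}]      # oldest: stale value must not resurrect
print(get("1234", memtable, sstables) is NOT_FOUND)  # True
```

Stopping at the first record found is what makes tombstones work: the scan never reaches the stale `foo` in the older SSTable.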

0 views
Evan Schwartz 2 months ago

Short-Circuiting Correlated Subqueries in SQLite

I recently added domain exclusion lists and paywalled content filtering to Scour . This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below). Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time. The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring. Scour uses brute force search rather than a vector database, which works well enough for now because of three factors: A simplified version of the query looks something like: The query plan shows that this makes good use of indexes: To add user-specified domain blocklists, I created an exclusion table and added a filter clause to the main ranking query. The domain exclusion table's primary key makes the lookup efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery : A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway. To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query. We can change our domain exclusion filter to first check whether the user has any excluded domains: Since the OR short-circuits, if the first check shows the user has no excluded domains, SQLite never evaluates the correlated subquery at all. 
The first clause does not reference any column in the outer query, so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when the user has no exclusions, lets us short-circuit and skip the more expensive correlated subquery that checks each item's domain against the exclusion list. Here is the query plan for this updated query. Note how the second subquery is evaluated only once, whereas the third one is correlated. The latter is the per-row check, but it can be skipped thanks to the second subquery. To test the performance of each of these queries, I used a simple bash script to invoke the binary 100 times for each query on my laptop. Starting up the process each time adds overhead, but we're comparing relative differences. At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content. This test represents most users, who have not configured any excluded domains: This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users. This test uses an example user that has excluded content from 2 domains: In this case, we do need to check each row against the domain filter. But this shows that the short-circuit still adds no overhead on top of the query. When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However, if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead. This is extra important because the slow-down from each additional subquery compounds. 
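To make the trick concrete, here is a self-contained sketch of the short-circuit pattern. The schema and table names (`items`, `excluded_domains`) are made up for illustration — they are not Scour's actual tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT, title TEXT);
CREATE TABLE excluded_domains (user_id INTEGER, domain TEXT,
                               PRIMARY KEY (user_id, domain));
INSERT INTO items VALUES (1, 'a.com', 'A'), (2, 'b.com', 'B');
INSERT INTO excluded_domains VALUES (42, 'b.com');
""")

QUERY = """
SELECT id FROM items
WHERE
  -- Uncorrelated check: references no column of `items`, so it is
  -- evaluated once. If the user has no exclusions it is TRUE and the
  -- OR short-circuits, skipping the per-row correlated check entirely.
  NOT EXISTS (SELECT 1 FROM excluded_domains WHERE user_id = :uid)
  OR NOT EXISTS (SELECT 1 FROM excluded_domains
                 WHERE user_id = :uid AND domain = items.domain)
ORDER BY id
"""

# User 42 excludes b.com -> only item 1 survives the filter.
print([r[0] for r in db.execute(QUERY, {"uid": 42})])  # [1]
# User 7 excludes nothing -> short-circuit, every item is returned.
print([r[0] for r in db.execute(QUERY, {"uid": 7})])   # [1, 2]
```

Running `EXPLAIN QUERY PLAN` on a query shaped like this is a good way to confirm that SQLite treats the first subquery as a run-once check rather than a per-row one.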
In this blog post, I described and benchmarked a single additional filter. But this is only one of multiple subquery filters. Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt in to hiding paywalled content. For those users, we check if each item is paywalled. If so, we check if it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach to first check if the feature is enabled for the user; only then does SQLite need to check each row. Concretely, the paywalled content filter subquery looks like: In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it. There are multiple ways to exclude rows from an SQL query. Here are the results from the same benchmark I ran above, but with two other ways of checking for whether an item comes from an excluded domain. One version of the query uses a plain subquery: The other variation joins against the exclusion table and then checks for a missing match: And here are the full benchmarks: For users without excluded domains, we can see that the query using the short-circuit wins and adds no overhead. For users who do have excluded domains, the join is faster than the subquery version. However, the join raises the exact problem this whole blog post is designed to address. Since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the subquery using the short-circuit. Discuss on Hacker News, Lobsters, r/programming, r/sqlite. Scour uses SQLite, so the data is colocated with the application code. It uses binary-quantized vector embeddings with Hamming Distance comparisons, which only take ~5 nanoseconds each.
We care most about recent posts so we can significantly narrow the search set by publish date.
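Reconstructed from the description in the post (the actual subquery isn't shown above, and every table and column name below is a guess), the paywall filter plausibly has the same short-circuit shape:

```python
import sqlite3

# Hypothetical schema: an opt-out table and a per-user allowlist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, domain TEXT, paywalled INTEGER);
    CREATE TABLE paywall_optouts (user_id INTEGER PRIMARY KEY);
    CREATE TABLE paywall_allowed (
        user_id INTEGER, domain TEXT, PRIMARY KEY (user_id, domain)
    );
    INSERT INTO items VALUES
        (1, 'free.example', 0), (2, 'news.example', 1), (3, 'ft.example', 1);
    INSERT INTO paywall_optouts VALUES (42);
    INSERT INTO paywall_allowed VALUES (42, 'ft.example');
""")

# Uncorrelated first check: users who haven't opted out of paywalled
# content skip the per-row paywall and allowlist checks entirely.
QUERY = """
    SELECT id FROM items
    WHERE NOT EXISTS (SELECT 1 FROM paywall_optouts WHERE user_id = :uid)
       OR paywalled = 0
       OR domain IN (SELECT domain FROM paywall_allowed WHERE user_id = :uid)
"""

print(sorted(r[0] for r in conn.execute(QUERY, {"uid": 7})))   # [1, 2, 3]
print(sorted(r[0] for r in conn.execute(QUERY, {"uid": 42})))  # [1, 3]
```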

Marc Brooker 2 months ago

What Does a Database for SSDs Look Like?

Maybe not what you think. Over on X, Ben Dicken asked : What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and 00s, the era of spinning disks. A local NVMe SSD has ~1000x improvement in both throughput and latency. Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain? How might we tackle this question quantitatively for the modern transaction-oriented database? Approach One: The Five Minute Rule Perhaps my single favorite systems paper, The 5 Minute Rule… by Jim Gray and Franco Putzolu gives us a very simple way to answer one of the most important questions in systems: how big should caches be? The five minute rule is that, back in 1986, if you expected to read a page again within five minutes you should keep it in RAM. If not, you should keep it on disk. The basic logic is that you look at the page that’s least likely to be re-used. If it’s cheaper to keep it around until its next expected re-use, then you should keep more. If it’s cheaper to reload it from storage than keep it around, then you should keep less 1 . Let’s update the numbers for 2025, assuming that pages are around 32kB 2 (this becomes important later). The EC2 instance delivers about 1.8 million read iops of this size, at a price of around $0.004576 per second, or \(10^{-9}\) dollars per transfer (assuming we’re allocating about 40% of the instance price to storage). About one dollar per billion reads. It also has enough RAM for about 50 million pages of this size, costing around \(3 \times 10^{-11}\) dollars to store a page for one second. So, on this instance type, we should size our RAM cache to store pages for about 30 seconds.
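Spelled out, the arithmetic behind that 30-second figure (using the post's own cost assumptions, not measurements):

```python
read_cost = 1e-9   # dollars per 32kB read from local SSD (~$1 per billion reads)
ram_cost  = 3e-11  # dollars to hold one 32kB page in RAM for one second

# Keep a page cached as long as holding it until its next expected use
# is cheaper than re-reading it from storage.
breakeven_seconds = read_cost / ram_cost
print(f"{breakeven_seconds:.0f} seconds")  # 33 seconds, i.e. "about 30 seconds"
```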
Not too different from Gray and Putzolu’s result 40 years ago! That’s answer number one: the database should have a cache sized so that the hot set contains pages expected to be accessed in the next 30 seconds, for optimal cost. For optimal latency, however, the cache may want to be considerably bigger. Approach Two: The Throughput/IOPS Breakeven Point The next question is what size accesses we want to send to our storage devices to take best advantage of their performance. In the days of spinning media, the answer to this was surprisingly big: a 100MB/s disk could generally do around 100 seeks a second, so if your transfers were less than around 1MB you were walking away from throughput. Give or take a factor of 2. What does it look like for modern SSDs? SSDs are much faster on both throughput and iops. They’re less sensitive than spinning drives to workload patterns, but read/write ratios and the fullness of the drives still matter. Absent benchmarking on the actual hardware with the real workload, my rule of thumb is that SSDs are throughput limited for transfers bigger than 32kB, and iops limited for transfers smaller than 32kB. Making transfers bigger than 32kB doesn’t help throughput much, reduces IOPS, and probably makes the cache less effective because of false sharing and related effects. This is especially important for workloads with poor spatial locality . So that’s answer number two: we want our transfers to disk not to be much smaller than 32kB on average, or we’re walking away from throughput. Approach Three: Durability and Replication Building reads on local SSDs is great: tons of throughput, tons of iops. Writes on local SSDs, on the other hand, have the distinct problem of only being durable on the local box, which is unacceptable for most workloads. 
Modern hardware is very reliable, but thinking through the business risks of losing data on failover isn’t very fun at all, so let’s assume that our modern database is going to replicate off-box, making at least one more synchronous copy. Ideally in a different availability zone (AZ). The instance we were using for our comparison earlier has 100Gb/s (or around 12GB/s) of network bandwidth. That puts a cap on how much write throughput we can have for a single-leader database. Cross-AZ latency in EC2 varies from a couple hundred microseconds to a millisecond or two, which puts a minimum on our commit latency. That gives us answer number three: we want to incur cross-AZ latency only at commit time, and not during writes. Which is where we run into one of my favorite topics: isolation. The I in ACID . A modern database design will avoid read-time coordination using multiversioning, but to offer stronger isolation it will need to coordinate either on each write or at commit time. It can do that the way, say, Aurora Postgres does: by having a single leader at a time running in a single AZ. This means great latency for clients in that zone, and higher latency for clients in different AZs. Given that most applications are hosted in multiple AZs, this can add up for latency-sensitive applications that make a lot of round trips to the database. The alternative approach is the one Aurora DSQL takes, doing the cross-AZ round trip only at commit time, saving round-trips. Here’s me talking about the shape of that trade-off at re:Invent this year: There’s no clear answer here, because there are real trade-offs between the two approaches. But do make sure to ask your database vendor whether those impressive latency benchmarks are running where your application actually runs. In the spirit of the original question, though, the incredible bandwidth and latency availability in modern datacenter networks is as transformative as SSDs in database designs. Or should be.
While we’re incurring the latency cost of synchronous replication, we may as well get strongly consistent scale-out reads for free. In DSQL, we do this using high-quality hardware clocks that you can use too . Another nice win from modern hardware. There are other approaches too. That’s answer number four for me: The modern database uses high-quality clocks and knowledge of actual application architectures to optimize for real-world performance (like latency in multiple availability zones or regions) without compromising on strong consistency. Approach Four: What about that WAL? Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails). That’s answer number five: the modern database commits transactions to a distributed log, which provides multi-machine multi-AZ durability, and might provide other services like atomicity. Recovery is a replay from the distributed log, on any one of a number of peer replicas. What About Data Structures? Choosing among B-Trees, LSM-trees, their many variants, and other data structures involves trade-offs that have a lot to do with access patterns and workload patterns. Picking a winner would be a whole series of blog posts, so I’m going to chicken out and say it’s complicated . If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain?
I’d keep the relational model, atomicity, isolation (but would probably pick as a default), strong consistency, SQL, interactive transactions, and the other core design decisions of relational databases. But I’d move durability, read and write scale, and high availability into being distributed rather than single system concerns. I think that helps with performance and cost, while making these properties easier to achieve. I’d mostly toss out local durability and recovery, and all the huge history of optimizations and data structures around that 3 , in favor of getting better properties in the distributed setting. I’d pay more attention to internal strong isolation (in the security sense) between clients and workloads. I’d size caches for a working set of between 30 seconds and 5 minutes of accesses. I’d optimize for read transfers around that 32kB sweet spot from local SSD, and the around 8kB sweet spot for networks. Probably more stuff too, but this is long enough as-is. Other topics worth covering include avoiding copies on IO, co-design with virtualization (e.g. see our Aurora Serverless paper ), trade-offs of batching, how the relative performance of different isolation levels changes, what promises to give clients, encryption and authorization of data at rest and in motion, dealing with very hot single items, new workloads like vector, verifiable replication journals, handing off changes to analytics systems, access control, multi-tenancy, forking and merging, and even locales. The reasoning is slightly smarter, thinking about the marginal page and marginal cost of memory, but this simplification works for our purposes here. The marginal cost of memory is particularly interesting in a provisioned system, because it varies between zero (you’ve paid for it already) and huge (you need a bigger instance size). 
One of the really nice things about serverless (like DSQL) and dynamic scaling (like Aurora Serverless) is that it makes the marginal cost constant, greatly simplifying the task of reasoning about cache size. Yes, I know that pages are typically 4kB or 2MB, but bear with me here. Sorry ARIES .

matduggan.com 2 months ago

SQLite for a REST API Database?

When I wrote the backend for my Firefox time-wasting extension ( here ), I assumed I was going to be setting up Postgres. My setup is boilerplate and pretty boring, with everything running in Docker Compose for personal projects and then persistence happening in volumes. However when I was working with it locally, I obviously used SQLite since that's always the local option that I use. It's very easy to set up, nice to back up and move around, and in general a pleasure to work with. As I was setting up the launch, I realized I really didn't want to set up a database. There's nothing wrong with having a Postgres container running, but I'd like to skip it if it's possible. So my limited understanding of SQLite before I started this was "you can have one writer and many readers". I had vaguely heard of SQLite "WAL" but my understanding of WAL is more in the context of shipping WAL between database servers. You have one primary, many readers, you ship WAL from the primary to the readers and then you can promote a reader to the primary position once it has caught up on WAL. My first attempt at setting up SQLite for a REST API died immediately in exactly this way. So by default SQLite: This seems to be caused by SQLite having a rollback journal and using strict locking. Which makes perfect sense for the use-case that SQLite is typically used for, but I want to abuse that setup for something it is not typically used for. So after doing some Googling I ended up with these as the sort of "best recommended" options. I'm 95% sure I copy/pasted the entire block. What is this configuration doing? However my results from load testing sucked. Now this is under heavy load (simulating 1000 active users making a lot of requests at the same time, which is more than I've seen), but still this is pretty bad. The cause of it was, of course, my fault. My "blacklist" is mostly just sites that publish a ton of dead links.
However I had the setup wrong and was making a database query per website to see if it matched the blacklist. Stupid mistake. Once I fixed that: Great! Or at least "good enough from an unstable home internet connection with some artificial packet loss randomly inserted". So should you use SQLite as the backend database for a FastAPI setup? Well it depends on how many users you are planning on having. Right now I can handle between 1000 and 2000 requests per second if they're mostly reads, which is exponentially more than I will need for years of running the service. If at some point in the future that no longer works, it's thankfully very easy to migrate off of SQLite onto something else. So yeah overall I'm pretty happy with it as a design. Only one writer at a time. Writers block readers during transactions. journal_mode = WAL switches SQLite from the rollback journal to Write-Ahead Logging (WAL). Default behavior is Write -> Copy original data to journal -> Modify database -> Delete journal. WAL mode is Write -> Append changes to WAL file -> Periodically checkpoint to main DB. The synchronous pragma gives you 4 options to toggle for how often SQLite syncs to disk. OFF means SQLite lets the OS handle it. NORMAL means the SQLite engine still syncs, but less often than FULL. WAL mode is typically safe from corruption with NORMAL. FULL uses the xSync method of the VFS (don't feel bad, I've never heard of it before either: https://sqlite.org/vfs.html ) to ensure everything is written to disk before moving forward. EXTRA: I'm not 100% sure what this exactly does but it sounds extra. "EXTRA synchronous is like FULL with the addition that the directory containing a rollback journal is synced after that journal is unlinked to commit a transaction in DELETE mode. EXTRA provides additional durability if the commit is followed closely by a power loss. Without EXTRA, depending on the underlying filesystem, it is possible that a single transaction that commits right before a power loss might get rolled back upon reboot.
The database will not go corrupt. But the last transaction might go missing, thus violating durability, if EXTRA is not set." busy_timeout = 60000 means please wait up to 60 seconds for a lock. cache_size threw me for a loop. Why is it a negative number? If you set it to a positive number, you mean pages. SQLite page size is 4kB by default, so 2000 = 8MB. A negative number means KB, which is easier to reason about than pages. I don't really know what a "good" cache_size is here. 64MB feels right given the kind of data I'm throwing around and how small it is, but this is guess work. temp_store = MEMORY means write temporary data to memory, not disk. Makes sense for speed.
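Put together, the commonly recommended block looks roughly like this (my reconstruction of the standard advice, applied via Python's sqlite3; it may not match the post's exact copy/paste):

```python
import os
import sqlite3
import tempfile

# WAL mode requires an on-disk database, so use a throwaway file here.
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)
conn.executescript("""
    PRAGMA journal_mode = WAL;    -- append to a WAL file instead of a rollback journal
    PRAGMA synchronous = NORMAL;  -- fewer fsyncs; WAL mode stays corruption-safe
    PRAGMA busy_timeout = 60000;  -- wait up to 60s for a lock instead of erroring
    PRAGMA cache_size = -64000;   -- negative means KB, so a 64MB page cache
    PRAGMA temp_store = MEMORY;   -- keep temp tables and indexes in RAM
""")
print(conn.execute("PRAGMA journal_mode").fetchone()[0])  # wal
```

Note that journal_mode is persistent (stored in the database file), while the other pragmas are per-connection and need to be re-applied on each connect.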

Dangling Pointers 2 months ago

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei SOSP'25 I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example. Fig. 1 illustrates the design space for key-value stores: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread per stage architecture. A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples: As we’ve seen in other networking papers , a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC where they are then consumed by software. Key-value stores frequently have hot tuples, and there are advantages to caching these (example here ). 
It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both. The key disadvantage to a TPS architecture is load balancing. One stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself. Fig. 3 shows the pipeline proposed in this paper: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment. The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue: if there are N threads in the pool, then thread i reads the requests whose sequence number is congruent to i modulo N. Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads.
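The epoch trick can be sketched in a few lines; this is my toy reading of the scheme (the paper's implementation is native code, and the names here are illustrative):

```python
class EpochHotSet:
    """Lock-free-style reads: the hot set is immutable within an epoch."""

    def __init__(self):
        self._hot = frozenset()

    def is_hot(self, key) -> bool:
        # A single attribute read: no lock taken on the lookup path.
        return key in self._hot

    def advance_epoch(self, next_hot_keys):
        # Rebinding one reference publishes the whole new epoch at once;
        # readers see either the old set or the new one, never a mix.
        self._hot = frozenset(next_hot_keys)

cache = EpochHotSet()
cache.advance_epoch({"user:1", "user:7"})
print(cache.is_hot("user:1"), cache.is_hot("user:2"))  # True False
```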
A background thread gathers statistics to determine the set of items to be cached in the next epoch and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue ). Typically, the LLC is treated like a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses “Intel(R) Resource Director Technology” to control which ways of the LLC are assigned to each layer. The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing get operations, data values are copied directly into the outgoing network buffer. The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy (each CR thread rotates its messages across the MR threads). Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead. Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations), uTPS-T is this work: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 Dangling Pointers The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches.
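One way to picture the CR-MR queue layout (a simplified single-threaded model with hypothetical names, not the paper's lock-free implementation):

```python
from collections import deque
from itertools import cycle

class PairedQueues:
    """One queue per (CR thread, MR thread) pair, round-robin enqueue."""

    def __init__(self, n_cr: int, n_mr: int):
        self.queues = {(c, m): deque() for c in range(n_cr) for m in range(n_mr)}
        self._next_mr = {c: cycle(range(n_mr)) for c in range(n_cr)}

    def enqueue(self, cr_id: int, request) -> int:
        mr_id = next(self._next_mr[cr_id])  # round-robin choice of receiver
        self.queues[(cr_id, mr_id)].append(request)
        return mr_id

    def dequeue(self, mr_id: int):
        # A receiver must scan the queue from every possible CR sender.
        for (c, m), q in self.queues.items():
            if m == mr_id and q:
                return q.popleft()
        return None

qs = PairedQueues(n_cr=2, n_mr=2)
qs.enqueue(0, "get k1")  # lands on MR thread 0
qs.enqueue(0, "get k2")  # round-robin: lands on MR thread 1
print(qs.dequeue(0), qs.dequeue(1))  # get k1 get k2
```

Per-pair queues mean each queue has a single producer and single consumer, which is what makes a lock-free implementation straightforward; the cost is the receiver-side scan shown in dequeue.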
For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.
