Posts in Database (20 found)
Evan Schwartz 2 weeks ago

Scour Year End Update 2025

I thought about sending out a personalized "Scour Wrapped"... until I got the 7th Wrapped from some random service. So instead, I'll just say Happy New Year and thanks for your support in 2025! 🥂 These were the new features added since the last update in October. Scour now identifies articles that are paywalled and indicates them with a yellow dollar sign next to the domain. In your settings , you can opt to hide paywalled content. If you do, you can also exempt specific domains where you have a subscription so you will see their content even if it is behind the paywall. Thank you to Johnny and Allen for requesting this feature! For anyone interested in the technical details, I wrote a blog post about a neat SQL trick I came across while building this: Short-Circuiting Correlated Subqueries in SQLite . You can also now block content from specific websites. The option to block a domain can be found by clicking the "..." button below each post. You can see and manage your excluded domains in your settings . Thanks to Vahe for this suggestion! If you subscribe to specific feeds (as opposed to scouring all of them), Scour will now recommend other sources for you to follow right in your personalized feed. These recommendations are based on Scour looking for content that matches your interests that you aren't currently getting. You can find more recommendations on your Feeds page . Each feed also now displays its three most recent posts below its description to make it easier to know what you'll get if you subscribe. You can click on the feed's title to see all of the posts from that feed. Thanks to Tiago for this suggestion! By default, clicking on a link to a post will bring you to the original website where it was published. However, if you prefer to read it on Scour, you can read the Preview, which can be found in the "..." menu under each post. Thanks to Linh for this suggestion! The filter menu for your feed (accessible via the button next to where it says Your Top Finds) should be clearer and more mobile-friendly. You can filter by time range and toggle between seeing posts from feeds you’ve subscribed to or see posts from everyone’s feeds. Thanks Stefan for the feedback on this! A number of people have told me that they are confused about how the love/like/dislike reactions are used on Scour. I'll work on making this clearer in the future but in the meantime, there's now a section in the FAQs about this. The answer is: Loves and likes are saved to your Likes page, so you can use them to bookmark interesting content. Unlike most content aggregators, Scour does not use reactions to change what shows up in your feed. Instead, reactions are used to generate Interest Recommendations for you. Scour only shows content related to topics you've explicitly chosen. You can also subscribe to other users' Likes as feeds. Everyone's reactions contribute to the Popular Posts page. Here were some of my favorite posts I found on Scour in November and December: Thanks to everyone who wrote about Scour on their blog or website in 2025! This included: If you write about Scour in the future, or if you already did and I didn't include you, please let me know! Thank you to everyone who provided feedback on Scour this year! Specifically, thank you to Aaron, Alberto, Alex K, Alex W, Allen, Andrew D, Andrew M, Andy M, Andy P, Cairin, Cole, Daniel, Elyem, Hary, Imperfect, Jadi, Jeppe, Jesse, Johnny, Jon, Karit, Kilpatrj, Linh, Proudmuslim-dev, Ryan, Sarah, Stefan, Tiago, Tomáš, Tyler, and Vahe. 
And thank you to all of the anonymous feedback givers as well! Because you made it to the end of the post, here's a little preview of an upcoming feature for you. Let's say you want to only see posts from small websites, like individuals' blogs. You can now try filtering your feed by how many posts each website or feed publishes per month. For example, you can use these links to see only posts from quieter domains or quieter feeds . Or, you can try this one to only see articles from larger websites . Let me know what you think! UI for controlling these filters is coming soon! Happy New Year and happy Scouring! - Evan Scour scoured 9,940,460 posts from 15,608 feeds 1,013 new users signed up (welcome!!) 12,620 interests were added, with 6,688 of those from recommendations 26,702 posts were read, 3,023 were liked, and 383 were loved 55 suggestions on the feedback board were completed Paper AI Tigers Build / Buy / Bot More databases should be single-threaded Disks Lie: Building a WAL that actually survives Minsuk Kang: Scour and minifeed are 100X better than Instagram and X (January) Winther: Blog Discovery (June) Daniel Prindii: My Read it later and discoverability systems in 2025 (July) PPC Land: Developer revives RSS with AI while Google targets syndication infrastructure (August) Tomáš Burkert: RSS feeds discovery strategies (October) Alex White: Discovering the Indie Web (November) Matt Maldre: Search engine for blogs (November) Andrew Doran: Tools for discovering the IndieWeb (December)

0 views
Grumpy Gamer 2 weeks ago

Sqlite Comments

When I started using Hugu for static site generation I lost the ability to have comments and we all know now supportive the Internet can be, so why wouldn’t you have comments? I wrote a few php scripts that I added on to Hugo and I had comments again. I decided to store the comments as flat files so I didn’t complicate things by needing the bloated MySQL. I wanted to keep it as simple and fast as possible. When a comment is added, my PHP script created a directory (if needed) for the post and saves the comment out as a .json file with name as the current time to make sorting easy. When the blog page was displayed, these files (already sorted thanks to the filename) were loaded and displayed. And it all worked well until it didn’t. Flat files are simple. but they can be hard to search or maintain if they need cleaning up or dealt with after a spam attack. I figured I use commandline tools to do all of that, but it’s a lot more cumbersome than I first thought. I missed have them in a sql database. I didn’t want to install MySQL again, but my site doesn’t get a lot of commenting traffic so I could use Sqlite instead. The downside is Sqlite write-locks the database while a write is happening. In my case it’s a fraction of a second and wouldn’t be a issue. The second problem I had was the version of Ubuntu my server was using is 5 years old and some of the packages I wanted wouldn’t available for it. I tried to update Ubuntu and for reasons I don’t fully understand I couldn’t. So I spun up a new server. Since grumpygamer.com is a statics site I only had to install Apache and I was off and running. Fun times. But the comment flat files still bugged me and I thought I’d use this as an opportunity to convert over to Sqlite. PHP/Apache comes with Sqilte already installed, so that’s easy. A long weekend and I rewrote the code to save comments and everything is back and working. Given that a webserver and PHP already needed to be installed, it isn’t a big deal to use Sqlite. If you’re not comfortable with SQL, it might be harder but I like SQL.

0 views
matklad 3 weeks ago

Static Allocation For Compilers

TigerBeetle famously uses “static allocation” . Infamously, the use of the term is idiosyncratic: what is meant is not arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory TigerBeetle process uses is not hard-coded into the Elf binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without . I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea? Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle. The only inputs and outputs of the system are messages. Each message is finite in size (1MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory! This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem. The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrary many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available. Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off! Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust , but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes! The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail with at runtime, and it includes a dynamic block allocator to avoid that as long as possible by re-using no longer relevant sectors. So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite. 
While limiting the total size of the code-base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrary large, but the filter program itself must execute in O(1) memory. With this setup, it is natural to use indexes rather than pointers for “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (compiler sees a new file), but also temporally (compiler sees a new version of an old file). Is there any practical benefits here? I don’t know! But seems worth playing around with! I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers would lead to simpler code the same way it does for databases?

0 views
The Coder Cafe 4 weeks ago

Build Your Own Key-Value Storage Engine—Week 4

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Over the past few weeks, you built an LSM tree and three main components: a memtable, SSTables, and a WAL that records the same operations you keep in the memtable. To prevent on-disk data from growing forever, you will implement compaction, a critical process in LSM trees. Compaction periodically merges SSTables to reclaim space and keep read performance predictable. For example, if key exists in every SSTable on disk: Compaction drops duplicates and keeps only the newest record: . In addition, you will implement a endpoint. Handling deletes in an LSM tree isn’t straightforward at all: SSTables are immutable. To preserve the append-only nature of LSM trees, deletions are written as tombstones: markers indicating a key was logically deleted. You write it to the WAL, keep it in the memtable, and propagate it during flush. How should compaction work in the presence of tombstones? Suppose you have the following SSTables on disk: the key exists in , doesn’t exist in , exists in , and is deleted at : .”","title":null,"type":"image/png","href":null,"belowTheFold":true,"topImage":false,"internalRedirect":"https://read.thecoder.cafe/i/174613473?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png","isProcessing":false,"align":null,"offset":false}" class="sizing-normal" alt="Diagram with four vertically stacked boxes labeled “SSTable 1,” “SSTable 2,” “SSTable 3,” and “SSTable 4”; the first box contains the text “1234 = foo,” the second box contains “Key 1234 doesn’t exist,” the third box contains “1234 = bar,” and the fourth box contains “1234 = .”" title="Diagram with four vertically stacked boxes labeled “SSTable 1,” “SSTable 2,” “SSTable 3,” and “SSTable 4”; the first box contains the text “1234 = foo,” the second box contains “Key 1234 doesn’t exist,” the third box contains “1234 = bar,” and the fourth box contains “1234 = .”" srcset="https://substackcdn.com/image/fetch/$s_!rqB7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 424w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 848w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 1272w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 1456w" sizes="100vw" loading="lazy"> If the key doesn’t exist in the memtable, the current state for is deleted. Now, imagine that during compaction, you merge and . 
As the key is marked as deleted in the newest SSTable, you may decide to drop the tombstone, as it hides the key in : Now, would do: Key doesn’t exist in the memtable → Continue. Key doesn’t exist in SST-5 → Continue. Key doesn’t exist in SST-2 → Continue. Key exists in SST-1 → Return (instead of ). The fundamental rule is the following: during compaction, a tombstone can be evicted only after all data it shadows no longer exist on disk. Otherwise, dropping a tombstone too early can make an old value reappear. This is known as data resurrection: a key that “comes back to life” after a deletion. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord Flush and compaction should be single-threaded and stop-the-world operations: do not serve client requests until the operations complete. Append a tombstone to the WAL file, with : Update the memtable: Do not remove the key directly; mark it deleted with a tombstone. Acknowledge the request. During flush, carry tombstones into the new SSTable using a new field. For example: The keys must remain sorted. The goals of the compaction process for this week are the following: For each key, keep only the newest record. Drop records hidden by newer versions. This is where merging happens: the newest record wins, and older versions are evicted. Drop tombstones when no older value remains. The compaction trigger is: every 10,000 update requests ( and , not ), compact all SSTables. Algorithm ( k-way merge using a min-heap on key): Open an iterator for each SSTable file known by the MANIFEST. Push each iterator’s current record into a min-heap with the following comparator: Primary: . Tie-break (equal ): Newest SSTable first based on MANIFEST order (to make sure an old value doesn’t win). While the heap is not empty: Pop the smallest key (this first pop is the newest version of due to the tie-break). Drain all other heap entries whose key is and discard them (older values). For the record you picked: If it’s a tombstone, emit nothing for . Otherwise, emit the value for . Advance only the iterators you drained for and push their next records into the heap. Stream emitted records (sorted) into new SSTables. Remember: the max entries in an SSTable should remain 2,000. each new SSTable file, then its parent directory. Update the MANIFEST atomically (see week 3). Remove the old SSTable files. Check the memtable: If the key is marked as deleted, return . Else, return the value. Scan SSTables from newest to oldest, given the MANIFEST order (same as before). For the first record with the requested key: If , return . Else, return the value. If the key isn’t found, return . When replaying the WAL, make sure to take into account tombstone values ( ). Update your client to handle lines → Send a request to . Download and run your client against a new file containing requests: put-delete.txt . NOTE : Refer to week 1 if you need to generate your own file with the number of lines you want. That’s it for this week! Your storage engine now supports deletes and a compaction mechanism that prevents unbounded growth. The Coder Cafe will take a break for two weeks. On January 7th, you will continue exploring LSM trees and cover leveling. In your current implementation, a miss still scans all SSTables; therefore, you will also add key range partitioning to limit the number of SSTables that need to be checked during a lookup. See you next year! 
The compaction trigger you used was simple: every 10,000 PUT or DELETE requests. In real systems, compaction is usually driven by factors such as too many SSTable files, space pressure, or high read amplification. Also, many systems add safeguards to keep compaction controlled and resource-efficient. For example, a common one is bounded fan-in (merging only a small, fixed number of SSTables per batch), so the engine never opens every file at once. Others track each SSTable’s first and last key to select only overlapping candidates, hence avoiding unrelated files. Taking a step back, it’s interesting to note that the core LSM idea—append-only writes with regular compaction—shows up in many systems, even outside pure LSM trees. For example: Lucene : Immutable segments are created and later merged in the background, an LSM-like pattern, even though it isn’t an LSM tree per se. Memcached Extstore : Flushes values to free RAM, but keeps the hashtable, keys, and storage pointers in memory. It later compacts the data. Kafka : Rewrites segments to keep the latest value per key and drop older versions, which is conceptually similar to SSTable compaction. Also, we briefly introduced the concept of key resurrection in the introduction. You should be aware that this is a common challenge with LSM trees. In real-world conditions, crashes, slow WAL truncation, and complex compaction can allow an old value to be replayed during recovery after its tombstone has been removed, leading to key resurrection. Here are two great references that delve more into this kind of problem: Preventing Data Resurrection with Repair Based Tombstone Garbage Collection Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla Another excellent reference is Acheron: Persisting Tombstones in LSM Engines . It shows how standard LSM compaction can leave tombstones stuck for long periods, so “deleted" data may still linger in lower levels and complicate compliance requirements such as GDPR/CCPA compliance. The paper introduces delete-aware techniques that prioritize pushing tombstones down the tree to make deletions persist more predictably. Lastly, you can explore the RUM conjecture . Structurally, it’s similar to the CAP theorem : “ three things, pick two” . In short, you can make a database excel at two of: reads, updates (insert/change/delete), and memory/space, but not all three at once. Make any two really good and the third gets worse; that’s an unavoidable trade-off. This helps explain why, for example, LSM trees optimized for fast updates and good space efficiency pay a cost in read performance due to read amplification. That trade-off shows up in the design of the compaction process you implemented this week: you trade space and significant I/O for simplicity by compacting everything in one shot. This is fine for the example, but with 500GB of SSTables, you may need roughly another 500GB of free space during the merge in the worst case. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button. 
Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Over the past few weeks, you built an LSM tree and three main components: a memtable, SSTables, and a WAL that records the same operations you keep in the memtable. To prevent on-disk data from growing forever, you will implement compaction, a critical process in LSM trees. Compaction periodically merges SSTables to reclaim space and keep read performance predictable. For example, if key exists in every SSTable on disk: Compaction drops duplicates and keeps only the newest record: . In addition, you will implement a endpoint. Handling deletes in an LSM tree isn’t straightforward at all: SSTables are immutable. To preserve the append-only nature of LSM trees, deletions are written as tombstones: markers indicating a key was logically deleted. You write it to the WAL, keep it in the memtable, and propagate it during flush. How should compaction work in the presence of tombstones? Suppose you have the following SSTables on disk: the key exists in , doesn’t exist in , exists in , and is deleted at : .”","title":null,"type":"image/png","href":null,"belowTheFold":true,"topImage":false,"internalRedirect":"https://read.thecoder.cafe/i/174613473?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png","isProcessing":false,"align":null,"offset":false}" class="sizing-normal" alt="Diagram with four vertically stacked boxes labeled “SSTable 1,” “SSTable 2,” “SSTable 3,” and “SSTable 4”; the first box contains the text “1234 = foo,” the second box contains “Key 1234 doesn’t exist,” the third box contains “1234 = bar,” and the fourth box contains “1234 = .”" title="Diagram with four vertically stacked boxes labeled “SSTable 1,” “SSTable 2,” “SSTable 3,” and “SSTable 4”; the first box contains the text “1234 = foo,” the second box contains “Key 1234 doesn’t exist,” the third box contains “1234 = bar,” and the fourth box contains “1234 = .”" srcset="https://substackcdn.com/image/fetch/$s_!rqB7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 424w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 848w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 1272w, https://substackcdn.com/image/fetch/$s_!rqB7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96eea5b-0fbf-4f4b-8471-05b0235c0f59_640x880.png 1456w" sizes="100vw" loading="lazy"> If the key doesn’t exist in the memtable, the current state for is deleted. Now, imagine that during compaction, you merge and . As the key is marked as deleted in the newest SSTable, you may decide to drop the tombstone, as it hides the key in : Now, would do: Key doesn’t exist in the memtable → Continue. Key doesn’t exist in SST-5 → Continue. Key doesn’t exist in SST-2 → Continue. Key exists in SST-1 → Return (instead of ). 
Flush and compaction should be single-threaded and stop-the-world operations: do not serve client requests until the operations complete. Append a tombstone to the WAL file, with : Update the memtable: Do not remove the key directly; mark it deleted with a tombstone. Acknowledge the request. For each key, keep only the newest record. Drop records hidden by newer versions. This is where merging happens: the newest record wins, and older versions are evicted. Drop tombstones when no older value remains. Open an iterator for each SSTable file known by the MANIFEST. Push each iterator’s current record into a min-heap with the following comparator: Primary: . Tie-break (equal ): Newest SSTable first based on MANIFEST order (to make sure an old value doesn’t win). While the heap is not empty: Pop the smallest key (this first pop is the newest version of due to the tie-break). Drain all other heap entries whose key is and discard them (older values). For the record you picked: If it’s a tombstone, emit nothing for . Otherwise, emit the value for . Advance only the iterators you drained for and push their next records into the heap. Stream emitted records (sorted) into new SSTables. Remember: the max entries in an SSTable should remain 2,000. each new SSTable file, then its parent directory. Update the MANIFEST atomically (see week 3). Remove the old SSTable files. Check the memtable: If the key is marked as deleted, return . Else, return the value. Scan SSTables from newest to oldest, given the MANIFEST order (same as before). For the first record with the requested key: If , return . Else, return the value. If the key isn’t found, return . When replaying the WAL, make sure to take into account tombstone values ( ). Update your client to handle lines → Send a request to . Download and run your client against a new file containing requests: put-delete.txt . NOTE : Refer to week 1 if you need to generate your own file with the number of lines you want. Lucene : Immutable segments are created and later merged in the background, an LSM-like pattern, even though it isn’t an LSM tree per se. Memcached Extstore : Flushes values to free RAM, but keeps the hashtable, keys, and storage pointers in memory. It later compacts the data. Kafka : Rewrites segments to keep the latest value per key and drop older versions, which is conceptually similar to SSTable compaction. Preventing Data Resurrection with Repair Based Tombstone Garbage Collection Repair Time Requirements to Prevent Data Resurrection in Cassandra & Scylla

0 views
Evan Schwartz 4 weeks ago

Short-Circuiting Correlated Subqueries in SQLite

I recently added domain exclusion lists and paywalled content filtering to Scour . This blog post describes a small but useful SQL(ite) query optimization I came across between the first and final drafts of these features: using an uncorrelated scalar subquery to skip a correlated subquery (if you don't know what that means, I'll explain it below). Scour searches noisy sources for content related to users' interests. At the time of writing, it ingests between 1 and 3 million pieces of content from over 15,000 sources each month. For better and for worse, Scour does ranking on the fly, so the performance of the ranking database query directly translates to page load time. The main SQL query Scour uses for ranking applies a number of filters and streams the item embeddings through the application code for scoring. Scour uses brute force search rather than a vector database, which works well enough for now because of three factors: A simplified version of the query looks something like: The query plan shows that this makes good use of indexes: To add user-specified domain blocklists, I created the table and added this filter clause to the main ranking query: The domain exclusion table uses as a primary key, so the lookup is efficient. However, this lookup is done for every row returned from the first part of the query. This is a correlated subquery : A problem with the way we just added this feature is that most users don't exclude any domains, but we've added a check that is run for every row anyway. To speed up the queries for users who aren't using the feature, we could first check the user's settings and then dynamically build the query. But we don't have to, because we can accomplish the same effect within one static query. We can change our domain exclusion filter to first check whether the user has any excluded domains: Since the short-circuits, if the first returns (when the user has no excluded domains), SQLite never evaluates the correlated subquery at all. The first clause does not reference any column in , so SQLite can evaluate it once and reuse the boolean result for all of the rows. This "uncorrelated scalar subquery" is extremely cheap to evaluate and, when it returns , lets us short-circuit and skip the more expensive correlated subquery that checks each item's domain against the exclusion list. Here is the query plan for this updated query. Note how the second subquery says , whereas the third one is a . The latter is the per-row check, but it can be skipped by the second subquery. To test the performance of each of these queries, I replaced the with and used a simple bash script to invoke the binary 100 times for each query on my laptop. Starting up the process each time adds overhead, but we're comparing relative differences. At the time of this benchmark, the last week had 235,975 items, 144,229 of which were in English. The two example users I tested this for below only look for English content. This test represents most users, who have not configured any excluded domains: This shows that the short-circuit query adds practically no overhead for users without excluded domains, whereas the correlated subquery alone makes queries 17% slower for these users. This test uses an example user that has excluded content from 2 domains: In this case, we do need to check each row against the domain filter. But this shows that the short-circuit still adds no overhead on top of the query. 
When using SQL subqueries to filter down result sets, it's worth thinking about whether each subquery is really needed for most users or most queries. If the check is needed most of the time, this approach won't help. However if the per-row check isn't always needed, using an uncorrelated scalar subquery to short-circuit a condition can dramatically speed up the average case with practically zero overhead. This is extra important because the slow-down from each additional subquery compounds. In this blog post, I described and benchmarked a single additional filter. But this is only one of multiple subquery filters. Earlier, I also mentioned that users had asked for a way to filter out paywalled content. This works similarly to filtering out content from excluded domains. Some users opt-in to hiding paywalled content. For those users, we check if each item is paywalled. If so, we check if it comes from a site the user has specifically allowed paywalled content from (because they have a subscription). I used the same uncorrelated subquery approach to first check if the feature is enabled for the user and, only then, does SQLite need to check each row. Concretely, the paywalled content filter subquery looks like: In short, a trivial uncorrelated scalar subquery can help us short-circuit and avoid a more expensive per-row check when we don't need it. There are multiple ways to exclude rows from an SQL query. Here are the results from the same benchmark I ran above, but with two other ways of checking for whether an item comes from an excluded domain. The version of the query uses the subquery: The variation joins with and then checks for : And here are the full benchmarks: For users without excluded domains, we can see that the query using the short-circuit wins and adds no overhead. For users who do have excluded domains, the is faster than the version. However, this version raises the exact problem this whole blog post is designed to address. Since joins happen no matter what, we cannot use the short-circuit to avoid the overhead for users without excluded domains. At least for now, this is why I've gone with the subquery using the short-circuit. Discuss on Hacker News , Lobsters , r/programming , r/sqlite . Scour uses SQLite, so the data is colocated with the application code. It uses binary-quantized vector embeddings with Hamming Distance comparisons, which only take ~5 nanoseconds each . We care most about recent posts so we can significantly narrow the search set by publish date.

0 views
Marc Brooker 1 months ago

What Does a Database for SSDs Look Like?

Maybe not what you think. Over on X, Ben Dicken asked : What does a relational database designed specifically for local SSDs look like? Postgres, MySQL, SQLite and many others were invented in the 90s and 00s, the era of spinning disks. A local NVMe SSD has ~1000x improvement in both throughput and latency. Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain? How might we tackle this question quantitatively for the modern transaction-orientated database? Approach One: The Five Minute Rule Perhaps my single favorite systems paper, The 5 Minute Rule… by Jim Gray and Franco Putzolu gives us a very simple way to answer one of the most important questions in systems: how big should caches be? The five minute rule is that, back in 1986, if you expected to read a page again within five minutes you should keep in in RAM. If not, you should keep it on disk. The basic logic is that you look at the page that’s least likely to be re-used. If it’s cheaper to keep around until it’s next expected re-use, then you should keep more. If it’s cheaper to reload from storage than keep around, then you should keep less 1 . Let’s update the numbers for 2025, assuming that pages are around 32kB 2 (this becomes important later). The EC2 delivers about 1.8 million read iops of this size, at a price of around $0.004576 per second, or \(10^{-9}\) dollars per transfer (assuming we’re allocating about 40% of the instance price to storage). About one dollar per billion reads. It also has enough RAM for about 50 million pages of this size, costing around \(3 \times 10^{-11}\) dollars to storage a page for one second. So, on this instance type, we should size our RAM cache to store pages for about 30 seconds. Not too different from Gray and Putzolu’s result 40 years ago! That’s answer number one: the database should have a cache sized so that the hot set contains pages expected to be accessed in the next 30 seconds, for optimal cost. For optimal latency, however, the cache may want to be considerably bigger. Approach Two: The Throughput/IOPS Breakeven Point The next question is what size accesses we want to send to our storage devices to take best advantage of their performance. In the days of spinning media, the answer to this was surprisingly big: a 100MB/s disk could generally do around 100 seeks a second, so if your transfers were less than around 1MB you were walking away from throughput. Give or take a factor of 2. What does it look like for modern SSDs? SSDs are much faster on both throughput and iops. They’re less sensitive than spinning drives to workload patterns, but read/write ratios and the fullness of the drives still matter. Absent benchmarking on the actual hardware with the real workload, my rule of thumb is that SSDs are throughput limited for transfers bigger than 32kB, and iops limited for transfers smaller than 32kB. Making transfers bigger than 32kB doesn’t help throughput much, reduces IOPS, and probably makes the cache less effective because of false sharing and related effects. This is especially important for workloads with poor spatial locality . So that’s answer number two: we want our transfers to disk not to be much smaller than 32kB on average, or we’re walking away from throughput. 
Approach Three: Durability and Replication Building reads on local SSDs is great: tons of throughput, tons of iops. Writes on local SSDs, on the other hand, have the distinct problem of only being durable on the local box, which is unacceptable for most workloads. Modern hardware is very reliable, but thinking through the business risks of losing data on failover isn’t very fun at all, so let’s assume that our modern database is going to replicate off-box, making at least one more synchronous copy. Ideally in a different availability zone (AZ). That we were using for our comparison earlier has 100Gb/s (or around 12GB/s) of network bandwidth. That puts a cap on how much write throughput we can have for a single-leader database. Cross-AZ latency in EC2 varies from a couple hundred microseconds to a millisecond or two, which puts a minimum on our commit latency. That gives us answer number three: we want to incur cross-AZ latency only at commit time, and not during writes. Which is where we run into one of my favorite topics: isolation. The I in ACID . A modern database design will avoid read-time coordination using multiversioning, but to offer isolation stronger than will need to coordinate either on each write or at commit time. It can do that like, say, Aurora Postgres does, having a single leader at a time running in a single AZ. This means great latency for clients in that zone, and higher latency for clients in different AZs. Given that most applications are hosted in multiple AZs, this can add up for latency-sensitive applications which makes a lot of round trips to the database. The alternative approach is the one Aurora DSQL takes, doing the cross-AZ round trip only at time, saving round-trips. Here’s me talking about the shape of that trade-off at re:Invent this year: There’s no clear answer here, because there are real trade-offs between the two approaches. But do make sure to ask your database vendor whether those impressive latency benchmarks are running where you application actually runs. In the spirit of the original question, though, the incredible bandwidth and latency availability in modern datacenter networks is as transformative as SSDs in database designs. Or should be. While we’re incurring the latency cost of synchronous replication, we may as well get strongly consistent scale-out reads for free. In DSQL, we do this using high-quality hardware clocks that you can use too . Another nice win from modern hardware. There are other approaches too. That’s answer number four for me: The modern database uses high-quality clocks and knowledge of actual application architectures to optimize for real-world performance (like latency in multiple availability zones or regions) without compromising on strong consistency. Approach Four: What about that WAL? Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random. WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails). 
That’s answer number five: the modern database commits transactions to a distributed log, which provides multi-machine multi-AZ durability, and might provide other services like atomicity. Recovery is a replay from the distributed log, on any one of a number of peer replicas. What About Data Structures? B-Trees versus LSM-trees vs B-Tree variants versus LSM variants versus other data structures are trade-offs that have a lot to do with access patterns and workload patterns. Picking a winner would be a whole series of blog posts, so I’m going to chicken out and say its complicated . If we had to throw these databases away and begin from scratch in 2025, what would change and what would remain? I’d keep the relational model, atomicity, isolation (but would probably pick as a default), strong consistency, SQL, interactive transactions, and the other core design decisions of relational databases. But I’d move durability, read and write scale, and high availability into being distributed rather than single system concerns. I think that helps with performance and cost, while making these properties easier to achieve. I’d mostly toss out local durability and recovery, and all the huge history of optimizations and data structures around that 3 , in favor of getting better properties in the distributed setting. I’d pay more attention to internal strong isolation (in the security sense) between clients and workloads. I’d size caches for a working set of between 30 seconds and 5 minutes of accesses. I’d optimize for read transfers around that 32kB sweet spot from local SSD, and the around 8kB sweet spot for networks. Probably more stuff too, but this is long enough as-is. Other topics worth covering include avoiding copies on IO, co-design with virtualization (e.g. see our Aurora Serverless paper ), trade-offs of batching, how the relative performance of different isolation levels changes, what promises to give clients, encryption and authorization of data at rest and in motion, dealing with very hot single items, new workloads like vector, verifiable replication journals, handing off changes to analytics systems, access control, multi-tenancy, forking and merging, and even locales. The reasoning is slightly smarter, thinking about the marginal page and marginal cost of memory, but this simplification works for our purposes here. The marginal cost of memory is particularly interesting in a provisioned system, because it varies between zero (you’ve paid for it already) and huge (you need a bigger instance size). One of the really nice things about serverless (like DSQL) and dynamic scaling (like Aurora Serverless) is that it makes the marginal cost constant, greatly simplifying the task of reasoning about cache size. Yes, I know that pages are typically 4kB or 2MB, but bear with me here. Sorry ARIES .

0 views
matduggan.com 1 months ago

SQLite for a REST API Database?

When I wrote the backend for my Firefox time-wasting extension ( here ), I assumed I was going to be setting up Postgres. My setup is boilerplate and pretty boring, with everything running in Docker Compose for personal projects and then persistence happening in volumes. However when I was working with it locally, I obviously used SQLite since that's always the local option that I use. It's very easy to work with, nice to back up and move around and in general is a pleasure to work with. As I was setting up the launch, I realized I really didn't want to set up a database. There's nothing wrong with having a Postgres container running, but I'd like to skip it if its possible. So my limited understanding of SQLite before I started this was "you can have one writer and many readers". I had vaguely heard of SQLite "WAL" but my understanding of WAL is more in the context of shipping WAL between database servers. You have one primary, many readers, you ship WAL to from the primary to the readers and then you can promote a reader to the primary position once it has caught up on WAL. My first attempt at setting up SQLite for a REST API died immediately in exactly this way. So by default SQLite: This seems to be caused by SQLite having a rollback journal and using strict locking. Which makes perfect sense for the use-case that SQLite is typically used for, but I want to abuse that setup for something it is not typically used for. So after doing some Googling I ended up with these as the sort of "best recommended" options. I'm 95% sure I copy/pasted the entire block. What is this configuration doing. However my results from load testing sucked. Now this is under heavy load (simulating 1000 active users making a lot of requests at the same time, which is more than I've seen), but still this is pretty bad. The cause of it was, of course, my fault. My "blacklist" is mostly just sites that publish a ton of dead links. However I had the setup wrong and was making a database query per website to see if it matched the black list. Stupid mistake. Once I fixed that. Great! Or at least "good enough from an unstable home internet connection with some artificial packet loss randomly inserted". So should you use SQLite as the backend database for a FastAPI setup? Well it depends on how many users you are planning on having. Right now I can handle between 1000 and 2000 requests per second if they're mostly reads, which is exponentially more than I will need for years of running the service. If at some point in the future that no longer works, it's thankfully very easy to migrate off of SQLite onto something else. So yeah overall I'm pretty happy with it as a design. Only one writer at a time Writers block readers during transactions Switches SQLite from rollback journal to Write-Ahead Logging (WAL) Default behavior is Write -> Copy original data to journal -> Modify database -> Delete journal. WAL mode is Write -> Append changes to WAL file -> Periodically checkpoint to main DB So here you have 4 options to toggle for how often SQLite syncs to disk. OFF is SQlite lets the OS handle it. NORMAL is the SQLite engine still syncs, but less often than FULL. WAL mode is safe from corruption with NORMAL typically. FULL uses the Xsync method of the VFS (don't feel bad I've never heard of it before either: https://sqlite.org/vfs.html ) to ensure everything is written to disk before moving forward. EXTRA: I'm not 100% sure what this exactly does but it sounds extra. 
"EXTRA synchronous is like FULL with the addition that the directory containing a rollback journal is synced after that journal is unlinked to commit a transaction in DELETE mode. EXTRA provides additional durability if the commit is followed closely by a power loss. Without EXTRA, depending on the underlying filesystem, it is possible that a single transaction that commits right before a power loss might get rolled back upon reboot. The database will not go corrupt. But the last transaction might go missing, thus violating durability, if EXTRA is not set." = please wait up to 60 seconds. this one threw me for a loop. Why is it a negative number? If you set it to a positive number, you mean pages. SQLite page size is 4kb by default, so 2000 = 8MB. A negative number means KB which is easier to reason about than pages. I don't really know what a "good" cache_size is here. 64MB feels right given the kind of data I'm throwing around and how small it is, but this is guess work. = write to memory, not disk. Makes sense for speed.

0 views
Dangling Pointers 1 months ago

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei SOSP'25 I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example. Fig. 1 illustrates the design space for key-value stores: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread per stage architecture. A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples: As we’ve seen in other networking papers , a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC where they are then consumed by software. Key-value stores frequently have hot tuples, and there are advantages to caching these (example here ). It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both. The key disadvantage to a TPS architecture is load balancing. One stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself. Fig. 3 shows the pipeline proposed in this paper: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implement the cache-resident layer, and a different set of threads (pinned to other CPU cores) implement the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment. The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue. If there are threads in the pool, then thread reads all requests with: . 
Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads. A background thread gathers statistics to determine the set of items to be cached in the next epoch and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue ). Typically, the LLC is treated like a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses “Intel(R) Resource Director Technology” to control which ways of the LLC are assigned to each layer. The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing operations, data values are copied directly into the outgoing network buffer. The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy (message from CR thread is sent to MR thread: ). Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead. Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations), uTPS-T is this work: Source: https://dl.acm.org/doi/10.1145/3731569.3764794 Dangling Pointers The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches. For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this. Source: https://dl.acm.org/doi/10.1145/3731569.3764794 One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread per stage architecture. Pipelining Advantages A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. 
Here are two examples: As we’ve seen in other networking papers , a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC where they are then consumed by software. Key-value stores frequently have hot tuples, and there are advantages to caching these (example here ).

0 views
Pinaraf's website 1 months ago

JIT, episode III: warp speed ahead

In our first JIT episode , we discussed how we could, using copy-patch, easily create a JIT compiler for PostgreSQL, with a slight improvement in performance compared to the PostgreSQL interpreter. In our second episode , I talked about the performance wall and how hard it was to have a real leap in performance compared to the interpreter. But it ended with a positive outlook, a nice performance jump that I was preparing at that moment… The interpreter will run each opcode for every record it has to process. Everything it has to do for each record that could be done only once is better done once, obviously. And this is where a JIT can beat it. The JIT compiler can choose optimizations that would require checks at each opcode for the interpreter, and thus self-defeating for the interpreter. For instance, I mentioned creating inlined opcodes for common function calls like int4eq : replacing the indirect call to int4eq with a comparison of the function pointer and then an inlined version would indeed be silly, since the comparison is going to waste a lot of time already. So, what can’t the interpreter do? It sure can’t easily remove indirect calls, but this is a 1% performance gain, 2% at most. You won’t get to the headlines with that, right? Well, when in doubt, look at the past… A decade ago, I worked at a small company where I heard the weirdest thing ever regarding system performance: “our application is slower when built in 64 bits mode because the bigger pointer size makes it slower”. I didn’t buy this, spent two days digging into the code, and found that it was the opposite: 64 bits brought such a performance improvement that the entire system collapsed on a mutex that held a core structure in the application… Removing the mutex made the application fly in both 32 and 64 bits, with 64 bits beating 32 bits obviously. But why is 64 bits faster? We are talking database here, so let’s have a look at a table, shall we? http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0870F10_K17_Matisse7_InstLatX64.txt (I know uBlock doesn’t like this domain, but this text document there is good, I promise) On my CPU, loading a 64 bits value in a register requires twice the time it takes to load a 32 bits value. So sure 64 bits must be slower than 32 bits! Except the switch from 32 to 64 bits also fixed one of the biggest issue with x86: the lack of registers. x86 never improved from its 16 bits roots and had 8 general purpose registers, little compared to PowerPC (32), Sparc (31), ARM (15). When AMD introduced 64 bits in the x86 world, they doubled the number of registers, from a ridiculous 8 to an acceptable 16. And from this came a huge performance boost. Memory = slow Registers = fast Ok, more seriously, I will not start writing about this. Even if it is getting old and outdated, the “What Every Programmer Should Know About Memory” paper of Ulrich Drepper is still a great read if you’re interested into that topic. The only thing that matters for us is that, even with a lot of cache, writing to memory is slower than writing to a register. If I look at some measurements for my Zen2 CPU, a comparison between two registers takes less than a cycle (0.33c it seems), but if data has to be loaded from L1 cache you can add 4 cycles, 12 cycles from L2 and 38 from L3. Way, way slower. 12 to 115 times slower. Registers are used automatically by your compiler. 
When you write a function, it will automatically figure out what variable to move to a register, when, and if you don’t have enough registers for your entire function, spill registers on the stack as needed. If you are interested into this, there are many fun register allocation algorithms and many wikipedia pages covering this topic. Let’s see one of the most basic opcode, EEOP_SCAN_VAR, taking a value from a scan slot in order to use it later. This is indeed a memory write. Could the interpreter get rid of this? Well, I think it could, but it would be a major undertaking. If we had a variable, stored in a register by the compiler, we could store there, sure, and a next step could fetch from that place, but what if the next step needs another value instead… Then we would have to spill the value back in memory, but checking for this at each step is going to kill performance. It may be possible to rewrite the entire interpreter to match a register based VM, but I can not be sure it would be worth it. And this is the path to beating the interpreter. We can check many things before running the opcodes, trace memory accesses and use registers as much as possible. The great benefit of copy-patch is that you (almost) don’t write assembly code. Porting it to arm64 required me to learn about ARM64 specific relocations, how to encode immediate values in some ARM opcodes, but nothing more. But a big downside is that you don’t write assembly code And, well, if you don’t write the assembly code, you don’t control register allocation. But there is a simple way around this, let’s speak a bit about calling conventions. When function A is called, how do you pass parameters to it? If you learned some x86 assembly at school, you will answer “on the stack” and won a free ticket for an assembly refreshment course When AMD64 was introduced, the SysV Call Convention took over and completely changed the way functions are called. The first six integer or pointer parameters are passed through general purpose registers, and six floating point parameters are passed through FP registers. Each opcode is defined as a function with three parameters (matching the function signature expected by PostgreSQL). While respecting the SysV calling convention, it leaves us three registers that the compiler will keep across the opcode calls, and will spill automatically if needed. An alternative would have been to use the preserve_none calling convention, but for the first version I did not need it (and I still have many calls to PostgreSQL functions that will use the SysV calling convention) But three registers means… two values only. Sadly we transitioned from 32 to 64 bits, not to 65 bits… 65 bits would have given one bit to represent NULL/NOT NULL values, 0 would not have been NULL, 1 + NULL would be NULL! But we will not rewrite history here, and we are going to use one register as a set of null flags, one bit per value register (so we are wasting 62 bits here). Our opcode functions are thus going to have three new parameters, char nullFlags, intptr_t reg0, intptr_t reg1. Jumping to the next opcode will require passing these values around. Great, we keep registers around, now what about using these? As a reminder, here are the opcodes we are dealing with for our previous “SELECT * FROM demo WHERE a = 42”. This code doesn’t use our registers. I rewrote every opcode implementation to use the registers instead of writing in memory. In this version, all memory accesses have been replaced with register accesses instead, hurray! 
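To make the register-threading and null-flags scheme above concrete, here is a toy model written in Python purely for brevity. It is not copyjit's real code (the actual stencils are C compiled to machine code, and the "registers" are real CPU registers kept alive by the SysV calling convention); the opcode names, the row format, and the bit layout are my own illustrative assumptions. Each "opcode" receives a null-flags bitmask plus two value registers and passes them along to the next step, with one bit per register marking SQL NULL.

```python
# Toy model: two "value registers" and a null-flags bitmask threaded
# through a chain of opcode functions, instead of reading/writing slots
# in memory. Purely illustrative, not the real copyjit implementation.

REG0_NULL = 1 << 0  # bit 0: reg0 holds SQL NULL
REG1_NULL = 1 << 1  # bit 1: reg1 holds SQL NULL

def op_scan_var(null_flags, reg0, reg1, row):
    # Load column "a" of the current row into reg0 instead of a memory
    # slot; record its NULL-ness in the flags register.
    value = row.get("a")
    if value is None:
        return null_flags | REG0_NULL, reg0, reg1
    return null_flags & ~REG0_NULL, value, reg1

def op_const(null_flags, reg0, reg1, const=42):
    # Materialize the constant 42 into reg1.
    return null_flags & ~REG1_NULL, reg0, const

def op_int4eq(null_flags, reg0, reg1):
    # Strict comparison: any NULL input yields NULL (None here).
    if null_flags & (REG0_NULL | REG1_NULL):
        return None
    return reg0 == reg1

def qual_matches(row):
    flags, r0, r1 = 0, 0, 0
    flags, r0, r1 = op_scan_var(flags, r0, r1, row)
    flags, r0, r1 = op_const(flags, r0, r1)
    return op_int4eq(flags, r0, r1)

print(qual_matches({"a": 42}))    # True
print(qual_matches({"a": 7}))     # False
print(qual_matches({"a": None}))  # None (SQL NULL)
```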
But this will work only for a simple query like this one. Once we start having more variables to store, we will need a spilling mechanism, a way to swap registers… Another issue appears when you call, for instance, a non-inlined function. The EEOP_FUNCEXPR is defined as: Parameters are fed through the fcinfo_data structure. The other opcodes are writing directly into this structure in the usual interpreter execution. It means that we must check all memory accesses from the opcodes and make sure that any expected memory access from an opcode implementation will not end up in a memory location we didn’t write to. I started with a small experiment, a “variabilizer”, that would look at each opcode and figure out through each memory access (read/write) all the variables used in a run, their lifetimes… It can even detect constants stored in memory (a memory that is only read from). I then refactored a lot of the compiler code in the past weeks. I started by moving the specialized opcodes definition and dispatch to the stencil library only, removing any special case I had in the compiler part. This required #defining a way for the C code in stencils.c to generate more C code in the built-stencils.h file through the stencil-builder.py script. Fun, but complicated and hairy stuff. After that, I started rewriting the stencil signature and several opcodes to use registers instead, and wrote a “contract” for each opcode, defining what was expected in each register, what will be written in each register, and what is going to be read/written in memory. With all these changes, here is what the FUNCEXPR_STRICT opcode optimized for int4eq looks like. More metadata than actual code… But if that’s what it takes to get a good performance boost, then here we go. After ingesting that, the compiler can fill in the registers with the proper values when needed. Another big issue that I’m not covering here is that doing this requires some minimal control flow analysis. For my simple benchmark, this is not a problem, and the code is getting ready for a wider range of queries, but I did not want to cover this and preferred focusing on the registers work… Well… This is the optimization I mentioned in the previous article. So, on our stupid benchmark, doing 10 times a simple SELECT * FROM demo WHERE a = 42 on a 10 million rows table… As you can see, this is exactly what we expected: less instructions, sure, but this is not what gave us the performance boost here. What changed is the number of cycles, because the same instruction now uses a register instead of using a memory access, thus several cycles saved for the same instruction. The LLVM JIT can achieve about the same run time here, but it takes some time to generate the bitcode (less than 1ms), then several ms to analyze it, optimize it, and finally translate it to machine code. And this makes LLVM JIT slower here than copyjit, while copyjit still has some room for improvements (I’ve yet to look at tuple deforming) See you in the next one, I think we already know what the topic will be… Well, after I finish porting every opcode to these new metadata, test more stuff, and likely figure out some more optimizations on the way… PS: as said previously, help welcome, code FOSS as usual, on github , and I would gladly accept any sponsoring, mission, anything that could give me more time to work on this…

0 views
Dangling Pointers 1 month ago

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann SIGMOD'14 The giant upon whose shoulders this paper rests is Volcano . Parallelism in Volcano is achieved through a proper separation of concerns. Volcano contains many database operators, most of which are blissfully unaware of parallelism. A handful of operators in a query plan exist only to enable parallelism (for example, an operator could implement pipeline parallelism, or partition data between threads). Generally speaking, an elegant separation of concerns is good for performance. However, the thesis of morsel-driven parallelism is that this is not true. Deeply integrating the notion of parallel execution into each operator is the way to go for OLAP. The system described in this paper is named HyPer. Fig. 2 below illustrates how HyPer decomposes a single query into three pipelines: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 The leftmost pipeline scans the input relation and applies a filter to it. The middle pipeline scans and filters the input relation . The rightmost pipeline scans and filters , joins the result with , and finally joins the result of the first join with . A system like Volcano would be tempted to scan and in parallel. Not so with HyPer: the pipelines which make up a query plan are executed serially. Each relation (both inputs, and temporary data) are divided into morsels. A morsel is a group of ~100,000 tuples. Each morsel resides on a single NUMA node (indicated by colors in the figures). Fig. 3 illustrates how HyPer uses morsel-level parallelism to implement the left pipeline (scan and filter T): Source: https://dl.acm.org/doi/10.1145/2588555.2610507 The pipeline that processes T operates in two phases. In the first phase, a pool of threads (each pinned to a core) collectively process all morsels in T. When a thread comes up for air, it grabs another morsel of input. Threads are assigned to a NUMA node, and threads prefer to process morsels assigned to the same NUMA node. If no morsels of matching color are available, then a thread will reluctantly process a morsel from another NUMA node. During the first phase, each thread writes filtered tuples into a thread-local storage area. Once all input tuples have been processed, a hash table is created (conveniently, the hash table can be sized well because the number of tuples that must be stored is known). This hash table is global (i.e., not NUMA aware). In the second phase, tuples are inserted into the hash table. The HyPer hash table is designed to allow lock-free insertions, along with a fast path for the common case where a probe operation yields no hits. The hash table uses chaining, with very special pointers used to point to the next element in the chain. The lower 48 bits of each pointer in the hash table contains the memory address that is being pointed at, the upper 16 bits are a tag . Think of the tag as a 16-bit Bloom filter describing the set of elements in the sub-list that the pointer points to. When the hash table is probed, a hash of the join key from the probe tuple is used both to determine which chain to search in, and to stop the search early if no possible unexamined list element could contain the join key. Because both the pointer and the tag are packed into 64 bits, a compare-and-swap instruction can be used to insert elements into the hash table without using an OS lock. 
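To see why the 16-bit tag helps, here is a small Python model of the pointer layout and the early-exit check. It only illustrates the encoding idea; the real HyPer table stores actual 48-bit addresses and installs the head word with a native 64-bit compare-and-swap, neither of which Python exposes, so the function names and the hash-to-tag mapping here are assumptions for illustration.

```python
# Model of a 64-bit head word = 16-bit tag | 48-bit address, where the tag
# acts as a tiny Bloom filter over the join keys stored in the bucket chain.

ADDR_MASK = (1 << 48) - 1

def tag_bit(key_hash: int) -> int:
    # Map a key hash to one of the 16 tag bits.
    return 1 << ((key_hash >> 48) & 0xF)

def pack(addr: int, tag: int) -> int:
    return (tag << 48) | (addr & ADDR_MASK)

def unpack(word: int):
    return word & ADDR_MASK, word >> 48

def insert(head_word: int, new_addr: int, key_hash: int) -> int:
    # Build the new head word: point at the new entry and fold the new
    # key's tag bit into the existing tag. In HyPer this word would be
    # installed with a single compare-and-swap, no OS lock needed.
    _, old_tag = unpack(head_word)
    return pack(new_addr, old_tag | tag_bit(key_hash))

def may_contain(head_word: int, probe_hash: int) -> bool:
    # Probe fast path: if the probe key's tag bit is not set, no entry in
    # this chain can match, so the probe skips walking the chain entirely.
    _, tag = unpack(head_word)
    return bool(tag & tag_bit(probe_hash))

head = pack(0, 0)                               # empty bucket
head = insert(head, 0x7F00AA10, hash("alice"))
print(may_contain(head, hash("alice")))         # True
print(may_contain(head, hash("mallory")))       # very likely False
```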
If the hash table is large enough, then most executions of the compare-and-swap instruction should succeed. Fig. 7 illustrates the hash table data structure and the insertion algorithm: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 It is a bit odd that the design in this paper goes to such great lengths to avoid cross-socket (NUMA) reads from main memory, and yet the hash table is not NUMA aware. I think that the 16-bit tags are key here. If the set of head pointers for all buckets in the hash table is small enough to fit into an L2 cache, then this data can be efficiently replicated into all L2 caches. As long as the hash table hit rate is low enough, the number of cross-socket memory accesses during probe operations will be low. Fig. 11 shows throughput for all 22 TPC-H queries, for 4 different configurations: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 It is interesting how NUMA-awareness matters a lot for some queries, but not all. Fig. 10 shows the authors’ NUMA model: Source: https://dl.acm.org/doi/10.1145/2588555.2610507 What is interesting here is that a similar setup exists within each socket. Say a socket contains 32 cores, and 4 memory controllers. Those cores and memory controllers will be laid out in a grid, with a network connecting them. I wonder if there is performance to be gained by paying attention to the intra-socket layout (e.g., cores on the left side of the chip should only access memory controllers on the left side of the chip).
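Going back to the morsel dispatch in the first phase, the scheduling policy is simple enough to model in a few lines. This is a simplified, single-threaded sketch of the dispatch decision only; the queue layout and two-node setup are assumptions of mine, and real HyPer pins worker threads to cores and runs this concurrently.

```python
from collections import deque

# One queue of morsels per NUMA node; a "morsel" here is just a label for
# a batch of roughly 100,000 tuples.
morsel_queues = {0: deque(["T[0:100k]", "T[100k:200k]"]),
                 1: deque(["T[200k:300k]"])}

def next_morsel(worker_node: int):
    # Prefer a morsel that lives on the worker's own NUMA node...
    if morsel_queues[worker_node]:
        return morsel_queues[worker_node].popleft()
    # ...otherwise reluctantly steal from another node rather than idle.
    for node, queue in morsel_queues.items():
        if node != worker_node and queue:
            return queue.popleft()
    return None  # this pipeline phase is done

while (m := next_morsel(worker_node=0)) is not None:
    print("worker pinned to node 0 processes", m)
```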

0 views
The Coder Cafe 1 month ago

Build Your Own Key-Value Storage Engine—Week 3

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Last week, you built the first version of an LSM: an in-memory memtable for recent writes, immutable SSTables on disk, and a MANIFEST file listing the SSTable files. However, if the database crashes, data in the memtable would be lost. This week, you will focus on durability by introducing Write-Ahead Logging (WAL). A WAL is an append-only file on disk that records the same operations you keep in memory. How it works: On write, record it in the WAL and the memtable. On restart, you read the WAL from start to end and apply each record to the memtable. Introducing a WAL is not free, though. Writes are slower because each write also goes to the WAL. It also increases write amplification, the ratio of data written to data requested by a client. Another important aspect of durability is when to synchronize a file’s state with the storage device. When you write to a file, it may appear as saved, but the bytes may sit in memory caches rather than on the physical disk. These caches are managed by the OS’s filesystem, an abstraction over the disk. If the machine crashes before the data is flushed, you can lose data. To force the data to stable storage, you need to call a sync primitive. The simple, portable choice is to call fsync , a system call that flushes a file’s buffered data and required metadata to disk. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord For the WAL data format, you won’t use JSON like the SSTables, but NDJSON (Newline-Delimited JSON). It is a true append-only format with one JSON object per line. Append a record to the WAL file , opened with . Set the field to , and the and fields to the provided key and value. For example, writing : Update the memtable with the same logic as before: If the key exists, update the value. Otherwise, create a new entry. Acknowledge the HTTP request. Create an empty file if it doesn’t exist. Replay the WAL from start to end. For each valid line, apply it to the memtable. Keep the same flush trigger (2,000 entries) and the same logic (stop-the-world operation) as last week: Write the new SSTable: Flush the memtable as a new immutable JSON SSTable file with keys sorted (same as before). fsync the SSTable file. the parent directory of the SSTable to make the new filename persistent. Update the MANIFEST atomically: Read the current MANIFEST lines into memory and append the new SSTable filename. Open with . Write the entire list to from the start. Rename → . the parent directory of the MANIFEST. Reset the WAL: Truncate the WAL to zero length. the WAL file. If the server is unavailable, do not fail. Retry indefinitely with a short delay (or exponential backoff). To assess durability: Run the client against the same input file ( put.txt ). Stop and restart your database randomly during the run. Your client should confirm that no acknowledged writes were lost after recovery. Add a per-record checksum to each WAL record. On startup, verify records and stop at the first invalid/truncated one, discarding the tail. For reference, ScyllaDB checksums segments using CRC32; see its commitlog segment file format for inspiration. 
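Here is a minimal sketch of the WAL flow described above: NDJSON records, an fsync after each append, and a replay that stops at the first corrupt or truncated line. The file name, the field names, and the per-record CRC32 are illustrative assumptions, not a prescribed format for the exercise.

```python
import json
import os
import zlib

WAL_PATH = "wal.ndjson"  # assumed file name

def wal_append(key: str, value: str) -> None:
    record = {"op": "put", "key": key, "value": value}
    payload = json.dumps(record, sort_keys=True)
    record["crc"] = zlib.crc32(payload.encode())
    with open(WAL_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the append to stable storage

def wal_replay() -> dict:
    memtable = {}
    if not os.path.exists(WAL_PATH):
        return memtable
    with open(WAL_PATH, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
                crc = record.pop("crc")
                payload = json.dumps(record, sort_keys=True)
                if crc != zlib.crc32(payload.encode()):
                    break  # corrupt record: discard the tail
            except (ValueError, KeyError):
                break      # truncated or garbled line: stop here
            memtable[record["key"]] = record["value"]
    return memtable

wal_append("user:1", "alice")
print(wal_replay())  # {'user:1': 'alice'}
```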
Regarding the flush process, if the database crashes after step 1 (write the new SSTable) and before step 2 (update the MANIFEST atomically), you may end up with a dangling SSTable file on disk. Add a startup routine to delete any file that exists on disk but is not listed in the MANIFEST. This keeps the data directory aligned with the MANIFEST after a crash. That’s it for this week! Your storage engine is now durable. On restart, data that was in the memtable is recovered from the WAL. This is made possible by and the atomic update of the MANIFEST. Deletion is not handled yet. In the worst case, a miss can read all SSTables, which quickly becomes highly inefficient. In two weeks, you will add a endpoint and learn how SSTables are compacted so the engine can reclaim space and keep reads efficient. In your implementation, you used as a simple “make it durable now“ button. In practice, offers finer control both over what you sync and when you sync. What: (or opening the file with ) persists the data without pushing unrelated metadata, which is usually what you want for WAL appends. You can go further with to bypass the page cache and sync only the data you wrote, but that comes with extra complexity. When: While calling a sync primitive after every request is offered by systems that promise durability, it is often not the default. Many databases use group commit, which batches several writes into one call to amortize the cost while still providing strong guarantees. For additional information, see A write-ahead log is not a universal part of durability by . For example, RocksDB provides options for tuning WAL behavior to meet the needs of different applications: Synchronous WAL writes (what you implemented this week) Group commit. No WAL writes at all. If you want, you can also explore group commit in your implementation and its impact on durability and latency/throughput, since this series will not cover it later. Also, you should know that since a WAL adds I/O to the write path, storage engines use a few practical tricks to keep it fast and predictable. A common one is to preallocate fixed-size WAL segments at startup to: Avoid the penalty of dynamic allocation. Prevent write fragmentation. Align buffers for (an open (2) flag for direct I/O that bypasses the OS page cache). Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button.
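Returning to the note above about dangling SSTables, here is a sketch of that startup cleanup. The directory layout, the one-filename-per-line MANIFEST, and the list of files to exempt are assumptions of mine rather than part of the exercise spec.

```python
import os

DATA_DIR = "data"                  # assumed layout
MANIFEST = "MANIFEST"
KEEP = {MANIFEST, "wal.ndjson"}    # non-SSTable files we never delete

def cleanup_dangling_sstables() -> None:
    # Files referenced by the MANIFEST are the only SSTables that "exist"
    # as far as the engine is concerned; anything else is a leftover from
    # a crash between "write SSTable" and "update MANIFEST".
    if not os.path.isdir(DATA_DIR):
        return
    manifest_path = os.path.join(DATA_DIR, MANIFEST)
    listed = set()
    if os.path.exists(manifest_path):
        with open(manifest_path, encoding="utf-8") as f:
            listed = {line.strip() for line in f if line.strip()}
    for name in os.listdir(DATA_DIR):
        if name in KEEP or name in listed:
            continue
        os.remove(os.path.join(DATA_DIR, name))  # dangling SSTable

cleanup_dangling_sstables()
```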

0 views
Robin Moffatt 1 month ago

Using Graph Analysis with Neo4j to Spot Astroturfing on Reddit

Reddit is one of the longer-standing platforms on the internet, bringing together folk to discuss, rant, grumble, and troll others on all sorts of topics, from Kafka to data engineering to nerding out over really bright torches to grumbling about the state of the country —and a whole lot more. As a social network it’s a prime candidate for using graph analysis to examine how people interact—and in today’s post, hunt down some sneaky shills ;-) I’ve loaded data for several subs into Neo4j, a graph database. Whilst RDBMS is great for digging into specific users or posts, aggregate queries, and so on, graph excels at complex pattern matching and recursive relationships. It’s a case of best tool for the job; you can do recursive SQL instead of graph, it’s just a lot more complicated. Plus the graphical tools I’ll show below are designed to be used with Neo4j or other property graph databases. In Neo4j the nodes (or vertices ) are user, subreddit, comment, and post. The edges (or relationships ) are how these interact. For example: a user [node] authored [edge] a post [node] a user [node] posted in [edge] a subreddit [node] These relationships can be analysed independently, or combined: Let’s familiarise ourselves with graph visualisations and queries. In RDBMS we use SQL to describe the data that we want to return in a query. Neo4j uses Cypher , which looks a bit like SQL but describes graph relationships. Here’s a query to show the user nodes : Neo4j includes a visualisation tool, which shows the returned nodes: We can add predicates, such as matching on a particular node property ( , in this example): You can also look at the raw data: If we zoom in a bit to the previous query results we’ll see that it’s also showing the edges that have been defined indicating a relationship ( ) between some of the nodes: Let’s build on the above predicate query to find my username ( ) and any users that I’ve interacted with: I’m going to head over to a different tool for visualising the data since the built-in capabilities in the free version of Neo4j are too limited for where we’re going with it. Data Explorer for Neo4j is a really nice tool from yWorks . It connects directly to Neo4j and can either use Cypher queries to pull in data, or directly search nodes. The first reason I like using it is the flexibility it gives for laying out the data. Here is the same set of data as above, but shown in different ways: One of the cool things that graph analysis does for us is visualise patterns that are not obvious through regular relational analysis. One of these is a form of astroturfing. Since the LLMs (GPT, Claude, etc) are trained on data that includes Reddit, it’s not uncommon now to see companies trying to play the game (just like they did with keyword-stuffing with white text on white background for Google in the old days) and 'seed' Reddit with positive content about their product. For example, genuine user A asks " what’s the best tool for embedding this nail into a piece of wood ". Genuine user B suggests " well, a hammer, DUUUHHH " (this is Reddit, after all). The Astroturfer comes along and says " What a great question! I’ve been really happy with ACME Corp’s Screwdriver! If you hold it by the blade you’ll find the handle makes a perfect tool for hitting nails. " Astroturfing also includes "asked and answered" (although not usually from the same account; that would be too obvious): Astroturfer A: "Hey guys! I’m building a house and looking for recommendations for the best value toolkit out there. 
Thanks!" Astroturfer B: "Gosh, well I really love my ACME Corp’s Toolbelt 2000, it is really good, and I’ve been very happy with it. Such good value too!" One of the cornerstones of Reddit is the account handle—whilst you can choose to identify yourself (as I do - ), you can also stay anonymous and be known to the world as something like . This means that what one might do on LinkedIn (click on the person’s name, figure out their company affiliation) often isn’t an option. This is where graph analysis comes in, because it’s great at both identifying and visualising patterns in behaviour that are not so easy to spot otherwise. Poking around one of the subreddits using betweenness analysis I spotted this set of three users highlighted: The accounts picked up here are key to the particular activity on the sub; but that in itself isn’t surprising. You often get key members of a community who post the bulk of the content. But digging into these particular accounts, I saw a significant pattern. The three users are shown as orange boxes; posts are blue and comments are green: It’s a nice little network of one user posting with another commenting—how helpful! To share the work, they each take turns writing new posts and replying to others. Each post generally has one and only one comment, usually from one of the others in the group. You can compare this to a sub in which there is much more organic interaction. is a good example of this: Most users tend to just post replies, some only contribute new posts, and so on. Definitely not the nicely-balanced to-and-fro on the unnamed sub above ;)
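As a rough illustration of the kind of Cypher you might run to surface that reciprocal "you post, I comment" pattern, here is a query wrapped in the official Neo4j Python driver. The label and relationship names (User, Post, AUTHORED, COMMENTED_ON) and the connection details are assumptions on my part, since the post doesn't spell out its exact schema; adjust them to match your own graph.

```python
from neo4j import GraphDatabase

# Find pairs of accounts that repeatedly answer each other's posts:
# a authors a post that b comments on, and b authors a post that a comments on.
QUERY = """
MATCH (a:User)-[:AUTHORED]->(p1:Post)<-[:COMMENTED_ON]-(b:User),
      (b)-[:AUTHORED]->(p2:Post)<-[:COMMENTED_ON]-(a)
WHERE a <> b
RETURN a.name AS user_a, b.name AS user_b,
       count(DISTINCT p1) + count(DISTINCT p2) AS interactions
ORDER BY interactions DESC
LIMIT 20
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY):
        print(record["user_a"], "<->", record["user_b"], record["interactions"])
driver.close()
```

A high interaction count on its own doesn't prove astroturfing, of course; it just shortlists account pairs worth inspecting by hand, as in the orange/blue/green graph above.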

0 views
Simon Willison 1 month ago

sqlite-utils 4.0a1 has several (minor) backwards incompatible changes

I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018. sqlite-utils is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for working with them in the terminal. Almost every feature provided by the package is available via both of those surfaces. This is hopefully the last alpha before a 4.0 stable release. I use semantic versioning for this library, so the 4.0 version number indicates that there are backward incompatible changes that may affect code written against the 3.x line. These changes are mostly very minor: I don't want to break any existing code if I can avoid it. I made it all the way to version 3.38 before I had to ship a major release and I'm sad I couldn't push that even further! Here are the annotated release notes for 4.0a1. This change is for type hint enthusiasts. The Python library used to encourage accessing both SQL tables and SQL views through the syntactic sugar - but tables and views have different interfaces since there's no way to handle a on a SQLite view. If you want clean type hints for your code you can now use the and methods instead. A new feature, not a breaking change. I realized that supporting a stream of lists or tuples as an option for populating large tables would be a neat optimization over always dealing with dictionaries, each of which duplicated the column names. I had the idea for this one while walking the dog and built the first prototype by prompting Claude Code for web on my phone. Here's the prompt I used and the prototype report it created, which included a benchmark estimating how much of a performance boost could be had for different sizes of tables. I was horrified to discover a while ago that I'd been creating SQLite columns called FLOAT but the correct type to use was REAL! This change fixes that. Previously the fix was to ask for tables to be created in strict mode. As part of this I also figured out recipes for using as a development environment for the package, which are now baked into the Justfile. This one is best explained in the issue. Another change which I would have made earlier but, since it introduces a minor behavior change to an existing feature, I reserved it for the 4.0 release. Back in 2018 when I started this project I was new to working in-depth with SQLite and incorrectly concluded that the correct way to create tables and columns named after reserved words was like this: That turned out to be a non-standard SQL syntax which the SQLite documentation describes like this: A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility. Unfortunately I baked it into the library early on and it's been polluting the world with weirdly escaped table and column names ever since! I've finally fixed that, with the help of Claude Code which took on the mind-numbing task of updating hundreds of existing tests that asserted against the generated schemas. The above example table schema now looks like this: This may seem like a pretty small change but I expect it to cause a fair amount of downstream pain purely in terms of updating tests that work against tables created by sqlite-utils! I made this change first in LLM and decided to bring it to sqlite-utils for consistency between the two tools. One last minor ugliness that I waited for a major version bump to fix.
Update : Now that the embargo has lifted I can reveal that a substantial amount of the work on this release was performed using a preview version of Anthropic's new Claude Opus 4.5 model . Here's the Claude Code transcript for the work to implement the ability to use an iterator over lists instead of dictionaries for bulk insert and upsert operations. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . Breaking change : The method now only works with tables. To access a SQL view use instead. ( #657 ) The and methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See Inserting data from a list or tuple iterator for details. ( #672 ) Breaking change : The default floating point column type has been changed from to , which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. ( #645 ) Now uses in place of for packaging. ( #675 ) Tables in the Python API now do a much better job of remembering the primary key and other schema details from when they were first created. ( #655 ) Breaking change : The and mechanisms no longer skip values that evaluate to . Previously the option was needed, this has been removed. ( #542 ) Breaking change : Tables created by this library now wrap table and column names in in the schema. Previously they would use . ( #677 ) The CLI argument now accepts a path to a Python file in addition to accepting a string full of Python code. It can also now be specified multiple times. ( #659 ) Breaking change: Type detection is now the default behavior for the and CLI commands when importing CSV or TSV data. Previously all columns were treated as unless the flag was passed. Use the new flag to restore the old behavior. The environment variable has been removed. ( #679 )
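For anyone curious what the new list/tuple form looks like in practice, here is a small example based on my reading of the release notes. This targets the 4.0a1 alpha, and I'm assuming insert_all is one of the methods the (link-stripped) note refers to, so double-check the documentation before relying on it.

```python
import sqlite_utils

db = sqlite_utils.Database(":memory:")

# New in 4.0a1 (per the release notes above): pass an iterator of
# lists/tuples whose first item names the columns, instead of repeating
# the column names in a dictionary for every row.
rows = iter([
    ("id", "name", "weight"),   # header row: column names
    (1, "Cleo", 51.1),
    (2, "Pancakes", 33.5),
])
db["dogs"].insert_all(rows, pk="id")

print(db["dogs"].schema)      # note the "double quoted" identifiers in 4.0
print(list(db["dogs"].rows))
```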

0 views
matklad 1 month ago

TigerBeetle Blog

Continuing the tradition, I’ve also been blogging somewhat regularly on TigerBeetle’s blog, so you might want to check those articles out or even subscribe (my favorite RSS reader is RSSSSR): Today’s post is a video version of Notes on Paxos! https://tigerbeetle.com/blog/ https://tigerbeetle.com/blog/atom.xml

0 views
Jack Vanlightly 1 month ago

Have your Iceberg Cubed, Not Sorted: Meet Qbeast, the OTree Spatial Index

In today’s post I want to walk through a fascinating indexing technique for data lakehouses which flips the role of the index in open table formats like Apache Iceberg and Delta Lake. We are going to turn the tables on two key points: Indexes are primarily for reads . Indexes are usually framed as read optimizations paid for by write overhead: they make read queries fast, but inserts and updates slower. That isn’t the full story as indexes also support writes such as with faster updates and deletes in primary key tables but the dominant mental model is that indexing serves reads while writes pay the bill. OTFs don’t use tree-based indexes . Open-table format indexes are data-skipping indexes scoped to data files or even blocks within data files. They are a loose collection of column statistics and Bloom filters. Qbeast , a start-up with a presence here in Barcelona where I live, is reimagining indexes for open table formats, showing that neither assumption has to be true. Let’s dive in. A few weeks ago I wrote Beyond Indexes: How Open Table Formats Optimize Query Performance which describes how the open table formats don’t use monolithic tree-based indexes as RDBMS’s do, instead they optimize performance via effective pruning which in turn is boosted by data layout that matches the most important queries. The open-table formats give us two logical levers for optimizing layout: Partitioning Together, these form what is often called clustering : the way a table physically organizes data for efficient scanning by clustering similar data together.  Partitioning is the first major clustering lever in Iceberg and Delta tables. It divides a table into logical groups based on one or more columns so that rows with the same partition key values are stored together. This creates data locality, allowing the engine to quickly identify which partitions match a query filter (e.g., WHERE EventDate = '2025-10-01') and skip the rest. That process, called partition pruning, avoids scanning irrelevant data and greatly speeds up queries.  Within partitions, we can sort the data using a sort order . We can use one or more columns (including transforms of columns) as the sort order, which determines the order of rows in data files, and even across data files after compaction work (within a given partition). The Iceberg spec allows you to specify multiple columns as a lexicographical sort order and Delta goes further by supporting Z-order. However, Spark can also compact Iceberg using Z-order (it’s just not in the spec). Let’s take an example of rows with the following x and y indexed columns: where x has the domain a-d and y has the domain 1-4 , producing 16 (x,y) pairs, such as (a, 1), (a, 2)...(d, 4). When you sort a dataset lexicographically by multiple columns, the data system arranges the rows first by x, and then by y within each x group. That works fine if most queries filter heavily on the first column, but it doesn’t take into account how the data relates across both dimensions. Two records that are close together in (x, y) space might end up far apart on file if their x values differ slightly. Fig 1. Lexicographical order of two dimensions, which follows the “sort by x then by y” order. Z-ordering improves multidimensional sorting by weaving the bits of all indexed columns together into a single scalar value. Sorting by this scalar value produces a Z-shaped curve which fills the dimensional space (hence Z-order being what is known as a space-filling curve). 
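As a concrete, if simplified, example of that bit weaving, here is how two small integer column values could be interleaved into a single Z-order (Morton) key. Real engines first map column values onto fixed-width unsigned integers, which this sketch skips.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (Morton encoding)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

# Neighbours in (x, y) space get nearby keys, unlike "sort by x then y".
for point in [(2, 2), (2, 3), (3, 2), (3, 3), (10, 2)]:
    print(point, "->", z_order_key(*point))
```

Sorting rows by this single key is what produces the Z-shaped curve through the two-dimensional space.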
The result is an ordering where items that are close in N-dimensional space remain close in the 1-D key space as well. As a result, it reduces I/O for multi-column range filters and is ideal when queries commonly span multiple dimensions rather than a single dominant one. If you always query on a leading column, then lexicographical sort order is likely better. Fig 2. Z-order uses bit mixing to produce a single scalar sort key, which determines the order, which resembles a z-shaped space-filling curve. But there are some problems with this clustering strategy based on partitioning + sorting strategy: Partition granularity . The partition key must be chosen carefully: too many partitions lead to many small files, which can hurt performance instead of helping it. Imbalanced partitions . Your data may be skewed, leading to imbalanced partition sizes. Some might be very small, while others might be very large, which is inefficient and can lead to uneven performance. Changing distributions . The shape of your data may change over time, making your chosen partitioning strategy less effective over time. Drift . Your tables are constantly drifting away from the optimum clustering layout as new data arrives. Compaction is constantly working to cluster recent data. Global clustering is expensive, so clustering is usually performed on subsets of the data. What if we could use a data layout strategy that was flexible and adaptive (solving pain points 1, 2, 3) and didn’t constantly drift as new data arrived (solving pain point 4)? Enter the Qbeast and the OTree multidimensional indexing approach which came out of research of the Barcelona Supercomputing Center . Qbeast has been on my radar because one of the founders is Flavio Junqueira, a distributed systems researcher behind both Apache ZooKeeper and Apache BookKeeper (both of which have played large roles in my career). The OTree brings to open table formats a global tree index that defines the table’s structure and layout. In some ways, the OTree could be thought of as a distant relative of the clustered index in the RDBMS world as they both define the table layout. However, the OTree is a lightweight structure that does not try to organize individual rows. The OTree index approaches table layout as an adaptive spatial structure. Instead of dividing data according to fixed partition keys or grouped according sort orders, it organizes the dataset into hypercubes that subdivide automatically as the data distribution demands. Each (hyper)cube represents a region in multi-dimensional space defined by the indexed columns. Fig 3. A table indexed on three columns leads to a 3-dimensional (normalized) space (more on that later). In this figure, the original cube has subdivided into 8 subcubes. A cube divides along all indexed dimensions simultaneously, creating 2ᵈ  smaller cubes, where 𝑑 is the number of dimensions (i.e., the number of indexed columns). So for example: With 2 indexed columns, each division produces 4 subcubes (2×2 grid) With 3 indexed columns, each division produces 8 subcubes (2×2×2) With 4 indexed columns, each division produces 16 subcubes (2×2×2×2) Fig 4. A cube subdivides into 8 subcubes (in a 3-dimensional space) corresponding to three indexes columns. Using 3-dimensional space is taxing on the mind and diagrams, so I’ll use examples based on two indexed columns which leads to an easier to visualize 2-dimensional space. The number of dimensions corresponds to the number of indexed columns. 
If we index our products table by price and rating , then we have a two-dimensional space. Qbeast maps each row to a point in a multidimensional space by normalizing the values of the indexed columns into the 0,1 range, preserving their relative order so that nearby data in each dimension remains close together in space. Fig 5. Two dimensional space with normalized domains For example, if we index columns price and rating, a row with (price=100, rating=4.2) might map to coordinates (0.10, 0.84) in the 0,1 space (of each dimension), while another with (price=120, rating=4.3) becomes (0.12, 0.86). Because both rows are close in their normalized coordinates, they occupy nearby positions in the multidimensional space, thereby preserving the natural proximity of their original values. This is really important because the spatial locality should reflect the value locality within the data domain, else range scans won’t be very useful. This is precisely what the Z-order mapping function tries to do as well (by bit mixing). The difference is that a space-filling curve (like Z-order or Hilbert) takes multi-dimensional coordinates (x, y, z) and projects them onto a one-dimensional ordering, whereas Qbeast preserves the ordering per dimension. A cube is one subdivision of the multidimensional space. At first, all data falls into a single cube representing the full range of values (0-1 of each dimension). As new data arrives and the cube reaches a predetermined size, it generates subcubes, each covering a more specific region of the domain. This cube division continues, producing finer and finer cubes. The result is a layout that mirrors the actual distribution of the data. Skewed data that clusters around a tight set of values is located in dense regions of space, located in finer and finer cubes, while sparse regions remain coarse. Fig 6. Cubes adaptively subdivide recursively based on multidimensional spatial density. In figure 6 above, we get the following set of splits: Root cube is created The root cube divides in half by both dimensions, creating four subcubes (0, 1, 2, 3). Subcube 3 fills up and divides into subcubes (30, 31, 32, 33) Subcube 30 fills up and divides into subcubes (300, 301, 302, 303) Now it’s time to map this spatial representation to the tree. Because of how the cubes subdivide into two halves along each dimension, the cube id (such as 301) encodes its position and normalized domain bounds (along each dimension). This multidimensional space, divided up adaptively into multiple levels of subcubes, is represented by a tree. Fig 7. The OTree representation of the cubes. We can visualize the progress of a single root cube to the final set of cubes as follows. Fig 8. The OTree over time. Next let’s look at how this translates to Apache Iceberg and its data files. Up to this point we’ve been talking about cubes, multidimensional space, and trees in abstract terms. But let’s ground ourselves and see how all this maps onto an Iceberg table or Delta table. The OTree governs layout, but Iceberg/Delta remains the source of truth about the canonical set of data files and their metadata. Writers (such as a Spark job for ingest) consult the OTree but readers (such as a Spark analytics job) only read Iceberg/Delta metadata. This separation allows the index to be invisible to all engines (Spark, Flink, Trino etc), requiring no special integration. 
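To make the cube addressing concrete, here is a sketch of how a normalized point could be mapped to a cube id at a given depth. The domain bounds (price in 0–1000, rating in 0–5) are chosen to reproduce the article's example coordinates, and the digit convention (0–3 per level in two dimensions) follows the figures; the exact encoding Qbeast uses internally is an implementation detail I'm only approximating.

```python
def normalize(value: float, lo: float, hi: float) -> float:
    """Map a column value into the 0-1 domain, preserving order."""
    return (value - lo) / (hi - lo)

def cube_id(point, depth: int) -> str:
    """Walk `depth` levels of 2-D subdivision; each level halves both
    dimensions and appends one digit (0-3) identifying the subcube."""
    x, y = point
    digits = []
    for _ in range(depth):
        half_x, half_y = int(x >= 0.5), int(y >= 0.5)
        digits.append(str(2 * half_x + half_y))
        # Re-normalize into the chosen subcube for the next level.
        x = x * 2 - half_x
        y = y * 2 - half_y
    return "".join(digits)

# price in [0, 1000], rating in [0, 5]
price, rating = 100.0, 4.2
point = (normalize(price, 0, 1000), normalize(rating, 0, 5))
print(point)              # roughly (0.1, 0.84)
print(cube_id(point, 1))  # coarse depth-1 cube
print(cube_id(point, 3))  # finer depth-3 cube; its id extends the depth-1 prefix
```

Because the deeper id is always a prefix extension of the shallower one, a cube's position and its normalized bounds can be recovered from the id alone, which is why the tree metadata stays so small.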
Each node of the OTree corresponds to a cube, which in turn contains one or more blocks , where each block points to a data file (such as a Parquet file). Fig 9. Each node of the OTree contains one or more blocks, where each block is a data file (such as Parquet). In this example, the root cube reached capacity with three files and split along 2 dimensions. Notice that the data does not exist only in leaf nodes, but all nodes of the tree. The deeper into the tree you go, the narrower value range across the dimensions each node represents. Any given data point may exist in any node from the root, down to the lowest leaf that covers the data point. Fig 10. The data point maps onto the nodes: root, 3, 30 and 302. A query whose filter predicates cover this point may end up reading each of these files (it depends on the column stats). As I said in my previous blog post on OTF performance, the Iceberg column statistics reflect the layout of the data. We want narrow column stats for effective pruning, which means producing a data layout with data locality. The OTree provides that method of obtaining data locality according to one or more indexed columns (the dimensions of our multidimensional space). But readers carry on using the standard column statistics and bloom filters as usual. So, the OTree index governs the table’s layout but it doesn’t replace Iceberg or Delta’s metadata or data files. The two systems coexist: The OTree index describes how the data should be organized: which regions exist, their spatial boundaries, and which data points fall into each. Iceberg/Delta’s metadata remains the authoritative catalog of what files exist and their stats. In Iceberg, the OTree index is stored as a Puffin file which is referenced in the Iceberg metadata (so the OTree is committed as part of the Iceberg commit). Each commit may result in a new version of the OTree. Fig 11. A very simplified representation of four Iceberg commits which add one Parquet file per commit. The root cube splits in the 3rd snapshot, writing to one subcube, and another subcube in snapshot 4. In DeltaLake, the OTree metadata is included within tag metadata of each operation in the Delta Log (as depicted below). Fig 12. A very simplified representation of the Delta log with four add_files operations. Each added file is mapped to a cube id (where the tree structure is encoded into the cube ids). So although the OTree introduces a tree-shaped, spatial index, the underlying Iceberg/Delta table remains standard (additional fields are added to metadata which does not break existing engines). Query engines simply ignore the OTree when they perform reads. Writers (optionally) and table maintenance jobs (obligatory) do need to know about the OTree, as we want the layout to be governed by an adaptive index rather than static partitioning logic. Ideally writers will use the OTree index so that the index covers the whole dataset (ensuring locality is maintained from the very first moment data is written to the table). However, that requires that the writer, such as Apache Spark, to use the Qbeast module when performing writes. Table maintenance jobs must use the module, in order to apply the spatial layout to the Iceberg data files. Although the OTree governs the layout of the entire table, the OTree itself is just lightweight metadata that describes the parent-child relationships (encoded in the cube ids), and for each cube: the element count and the min/max weights of each cube. 
I won’t go into the details of weights, but it is an additional feature designed to enhance data distribution across the nodes. The normalized dimension bounds of each cube are established by the position of the cube in the tree, so there is no need to store that. Because of this, even a table with billions of rows can be represented by an OTree containing just a few thousand small metadata entries, typically amounting to a few megabytes in total. The tree is therefore cheap to store, fast to read, and easy to keep in memory, while still providing a view of the data’s spatial layout. It’s helpful to see all of this on a spectrum. On the left, the classic B-tree clustered index: a strict, key-ordered global tree index that dictates exactly where every row lives. While great for selective OLTP workloads, it is far too rigid and expensive when the dataset grows and the queries become broad (reading millions of rows). On the right, we have Iceberg/Delta’s approach: lightweight metadata describing the canonical set of files (without ordering), with a declared clustering strategy (partitioning and optional sort order) which the table is constantly drifting from, requiring maintenance to bound that drift. In the middle sits the OTree: it is a global tree index, but without the fine-grained rigidity of the B-tree. Instead of ordering individual rows, it divides the data space into coarse, adaptive regions that subdivide and merge as the distribution demands. This keeps it incredibly light while still determining where data should live. Dense data is located in narrow cubes and sparse data in wide cubes. The layout is self-correcting as data distribution changes, avoiding imbalanced partitions. It’s fun to see the inversion of the role of the index: using it to shape the table as it is written keeps the layout close to optimal, making the existing read-time optimizations of Iceberg and Delta more effective. The OTree is there behind the scenes, and query engines that read from the tables have no idea that it exists. There is a lot more to Qbeast than what I’ve covered here; there are additional mechanisms for ensuring even data distribution and making sampling efficient via random weights, but that’s too detailed for this post. The takeaway for me, I suppose, is that there are always more innovative ways of doing things, and we’re still early in the open table format / lakehouse game. There are plenty more innovations to come at all levels, from file formats and data organization to query engines.

0 views
The Coder Cafe 1 month ago

Build Your Own Key-Value Storage Engine—Week 2

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Before delving into this week’s tasks, it’s important to understand what you will implement. This week, you will implement a basic log-structured merge-tree (LSM tree). At its core, an LSM tree is a data structure that prioritizes write efficiency by trading off some read complexity. It buffers writes in memory and uses append-only files on disk, then rewrites data during compaction. It consists of two main components: A mutable in-memory data structure called a memtable, used to store recent writes. A set of immutable SSTables (Sorted String Table) stored on disk. Regularly, the current memtable is snapshotted, its entries are sorted by key, and a new immutable SSTable file is written. In addition, a MANIFEST file is an append-only list of SSTable filenames. It tells the engine which SSTable files exist and in which order to read them, newest to oldest. Why LSM trees shine for write-heavy workloads: Fast writes with sequential I/O: New updates are buffered in memory (memtable) and later written sequentially to disk during a flush (SSTable), which is faster than the random I/O patterns common with B-trees, for example. Decouples writes from read optimization: Writes complete against the memtable, while compaction work runs later (you will tackle that in a future week). Space and long-term efficiency: Compaction processes remove dead data and merge many small files into larger sorted files, which keeps space usage in check and sustains read performance over time. For the memtable, you will start with a hashtable. In a future week, you will learn why a hashtable is not the most efficient data structure for an LSM tree, but it is a simple starting point. For the SSTables, you will use JSON as the data format. Get comfortable with a JSON parser if you are not already. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord This week’s implementation is single-threaded. You will revisit that assumption later. Implement a hashtable to store requests (create or update). You can probably reuse a lot of code from Week 1. When your memtable contains 2,000 entries: Flush the memtable as a new immutable JSON SSTable file with keys sorted. The SSTable file is a JSON array of objects, each with two fields, and . Keys are unique within a file. For example, if your memtable contains the following entries: You need to create the following SSTable: Use a counter for the filename prefix, for example , , . After writing the new SSTable, append its filename to the MANIFEST (append only), then clear the memtable: For now, the flush is a stop-the-world operation. While the file is being written, do not serve reads or writes. You will revisit that later. Create an empty file if it doesn’t exist. Derive the next SSTable ID from the MANIFEST so you don't reuse the same filename. Check the memtable: If found, return the corresponding value. If not found, read the MANIFEST to list SSTable filenames: Scan SSTables from newest to oldest (for example , then , then ). Use a simple linear scan inside each file for now. Stop at the first hit and return the corresponding value. If still not found, return . 
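Here is one possible shape of the flush step described above: snapshot the memtable as a key-sorted JSON SSTable, append the filename to the MANIFEST, then clear the memtable. File naming, field names, and the placement of the 2,000-entry check are illustrative choices of mine, not the only way to satisfy the exercise.

```python
import json

FLUSH_THRESHOLD = 2_000
memtable: dict = {}
sstable_counter = 0

def put(key: str, value: str) -> None:
    global sstable_counter
    memtable[key] = value                      # create or update
    if len(memtable) >= FLUSH_THRESHOLD:
        sstable_counter += 1
        flush(f"{sstable_counter:06d}.sstable.json")

def flush(filename: str) -> None:
    # Snapshot the memtable as a JSON array of {"key", "value"} objects,
    # sorted by key (stop-the-world for now, as in the exercise).
    entries = [{"key": k, "value": memtable[k]} for k in sorted(memtable)]
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    # Record the new SSTable (newest last), then clear the memtable.
    with open("MANIFEST", "a", encoding="utf-8") as f:
        f.write(filename + "\n")
    memtable.clear()
```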
There are no changes to the client you built in week 1. Run it against the same file ( put.txt ) to validate that your changes are correct. Keep a small LRU cache of known-absent keys (negative cache) between the memtable and SSTables. This avoids repeated disk scans for hot misses: after the first miss, subsequent lookups are O(1). Implementation details are up to you. Instead of parsing the MANIFEST file for each request, you can cache the content in-memory. That’s it for this week! You have built the first version of an LSM tree: a memtable in memory, SSTable files written by regular flushes, and a MANIFEST that lists those SSTables. For now, durability isn’t guaranteed. Data already flushed to SSTables will be read after a restart, but anything still in the memtable during a crash is lost. In two weeks, you will make sure that any request acknowledged to a client remains in your storage engine, even after a restart. The flush trigger you used was pretty simple: once the memtable contains 2,000 entries. In real systems, flushes can be triggered by various factors, for example: Some databases flush when the memtable reaches a target size in bytes, ensuring predictable memory usage. A flush can also occur after a period of time has passed. This occurs because the database eventually needs to release commit log segments. For tables with very low write activity, this can sometimes lead to data resurrection scenarios. Here’s an old issue from the ScyllaDB codebase that illustrates this behavior. Regarding the model, this series assumes a simple key–value one: every PUT stores the whole value, so a GET just finds the newest entry and returns it. If you need a richer model (e.g., rows with many fields or collections), writes are often partial (patches) rather than full replacements. Therefore, reads must reconstruct the result by scanning newest to oldest and merging changes until all required fields are found or a full-write record is encountered. Last but not least, in this series, you implicitly rely on client-side ordering: the validation client issues requests sequentially. Production KV databases typically attach a sequence number or a logical timestamp to each write to handle out-of-order arrivals, merging, and reconciling results. Pure wall-clock timestamps are convenient but brittle; see Kyle Kingsbury’s notes on clock pitfalls for a deeper dive. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper. Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary . ❤️ If you enjoyed this post, please hit the like button. Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations A mutable in-memory data structure called a memtable, used to store recent writes. A set of immutable SSTables (Sorted String Table) stored on disk. Fast writes with sequential I/O: New updates are buffered in memory (memtable) and later written sequentially to disk during a flush (SSTable), which is faster than the random I/O patterns common with B-trees, for example. Decouples writes from read optimization: Writes complete against the memtable, while compaction work runs later (you will tackle that in a future week). 

0 views
Dangling Pointers 2 months ago

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge
Arjun Kashyap, Yuke Li, and Xiaoyi Lu
HPDC'25

This paper ends with a counterintuitive result: DPUs aren't amazingly better than traditional CPUs at implementing key-value stores. You would think they would have a shot, given that they have specialized networking accelerators, and key-value stores don't require a lot of general-purpose computation. It all comes down to access to off-chip memory.

Table 1 and Fig. 2 contain profiling results which motivate the designs presented by this paper:

Source: https://dl.acm.org/doi/10.1145/3731545.3731571

When a key-value store is run on a traditional CPU, most of the time is spent in packet processing rather than CRUD operations on (key, value) tuples. DPUs are chips that are specialized for packet processing, so one would think a DPU would be especially helpful in this situation. The designs in this paper are evaluated on NVIDIA BlueField 2 and BlueField 3 DPUs. These DPUs contain a mix of fixed-function hardware and ARM cores.

The fundamental idea proposed in this paper is to make the CPU's job easier by having the DPU perform some of the awkward packet processing work and pass CRUD requests to the CPU in a convenient format. The DPU parses incoming packets, extracts the relevant fields, and writes the minimal amount of information (operation to perform, key, value) to queues in host memory (a toy sketch of such a handoff appears at the end of this post). Fig. 8(d) shows the amount of cruft the DPU is able to remove from each CRUD packet.

Source: https://dl.acm.org/doi/10.1145/3731545.3731571

The design described in this paper is optimized in all of the ways you would expect: requests are batched, and appropriate pipelining is used to avoid idle time. Cross-core synchronization (across both DPU and host CPU cores) is minimized by using many queues. Each ARM core owns a unique set of queues, and each host CPU core is assigned to one or more queues.

As Fig. 11 shows, the design described so far (DPU-KV-lat and DPU-KV-sav) doesn't offer significant speedups. There are latency improvements, but throughput suffers. These designs are bound by the DPU. Evidence for this comes from the performance of DPU-KV-dual and DPU-KV-shrd. DPU-KV-dual allows idle host CPU cores to process raw packets. DPU-KV-shrd uses sharding: the set of keys is partitioned, with some keys handled by the CPU and some keys handled by the DPU.

Source: https://dl.acm.org/doi/10.1145/3731545.3731571

Dangling Pointers

The moral of the story is that there is no special advantage conferred on an ARM core inside a SmartNIC over an Intel core inside the host CPU. It is interesting to compare this work to a key-value store implemented directly in an RMT pipeline. It would be interesting to drill down into the profiling numbers which motivated this paper and understand how much memory-level parallelism a traditional CPU core can utilize. At the microarchitectural level, this problem has to be memory bound; it would be interesting to see whether it is bound by memory latency or bandwidth.
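To make the handoff idea above concrete (the DPU parses the packet and deposits only the operation, key, and value into per-core queues in host memory), here is a toy C sketch. The descriptor layout, ring size, field names, and memory-ordering details are illustrative assumptions; they are not the paper's actual data structures.

```c
#include <stdatomic.h>
#include <stdint.h>

#define KV_RING_SLOTS 1024

enum kv_op { KV_GET, KV_PUT, KV_DEL };

/* Compact, pre-parsed request descriptor: no network headers, just the
 * fields the key-value CRUD logic actually needs. */
typedef struct {
    uint8_t  op;          /* enum kv_op */
    uint16_t key_len;
    uint32_t val_len;
    uint64_t payload_off; /* offset of key+value bytes in a shared buffer */
} kv_req_t;

/* One single-producer/single-consumer ring per (DPU core, host core) pair,
 * so cores never contend on the same indices. */
typedef struct {
    _Atomic uint32_t head; /* advanced by the DPU core that owns the ring */
    _Atomic uint32_t tail; /* advanced by the host CPU core assigned to it */
    kv_req_t slots[KV_RING_SLOTS];
} kv_ring_t;

/* DPU-side producer (conceptually; in reality this runs on the DPU's ARM
 * cores and writes into host memory). Returns 0 if the ring is full. */
static int kv_ring_push(kv_ring_t *r, const kv_req_t *req) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == KV_RING_SLOTS) return 0;
    r->slots[head % KV_RING_SLOTS] = *req;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Host-side consumer: drain pre-parsed requests without ever touching the
 * raw packets. Returns 0 if the ring is empty. */
static int kv_ring_pop(kv_ring_t *r, kv_req_t *out) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head) return 0;
    *out = r->slots[tail % KV_RING_SLOTS];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

Giving each core pair its own ring is the point the review makes about minimizing cross-core synchronization: with one producer and one consumer per ring, no locks are needed at all.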

0 views
<antirez> 2 months ago

Scaling HNSWs

I'm taking a few weeks of pause from my HNSW developments (now working on some other data structure, news soon). At this point, the new type I added to Redis is stable and complete enough, so it's the perfect moment to reason about what I learned about HNSWs and turn it into a blog post. The kind of brain dump that was so common in the pre-AI era, and has now become, maybe, a bit more rare. Well, after almost one year of thinking about and implementing HNSWs and vector similarity stuff, it is time for some writing.

However, this is not going to be an intro to HNSWs: too many exist already. This is the "extra mile" instead. If you know HNSWs, I want to share with you my more "advanced" findings, especially in the context of making them fast enough to allow for a "Redis" experience: you know, Redis is designed for low latency and high performance, and HNSWs are kinda resistant to that, so there were challenges in exposing HNSWs as an abstract data structure. This blog post will be split into several sections. Think of them as pages of the same book, different chapters of the same experience.

Oh and, by the way, I already wrote and subsequently lost this blog post :D [long, sad story about MacOS and bad habits – I hadn't lost something like that since the 90s, during blackouts], so here most of the problem will be to recall what I wrote a few days ago and, while I'm at it, to better rephrase what I didn't like very much.

## A few words about the state of HNSW

Before digging into the HNSW internals and optimizations, I want to say a few things about HNSWs. The original paper introducing HNSWs is a great piece of computer science literature, and HNSWs are amazing data structures, but: I don't believe they are the last word for searching, in a greedy way, for nearby vectors according to a distance function. The paper gives the feeling it lacks some "pieces", almost as if the researchers, given six months more, would have had a lot more to explore and say. For instance, I extended the algorithm myself in order to support removal of entries, actual removals, not just tombstone deletions where the element is marked as gone and collected later: deleting items is totally missing from the paper. Similarly, there are, right now, efforts to check whether the "H" in HNSW is really needed, and if instead a flat data structure with just one layer would perform more or less the same (I hope I'll cover more about this in the future: my feeling is that the truth is in the middle, and that it makes sense to modify the level selection function to just have levels greater than a given threshold). All this to say that, if you are into data structures research, I believe a great area is to imagine evolutions and refinements of HNSWs, without getting trapped within the idea that the evolutions are only in the sense of: let's do it, but for disk (see Microsoft's efforts), or the like.

Ok, enough with the premise, let's get to the actual low level stuff :)

## Scaling memory

Redis is an in-memory system, and both HNSWs and vectors have the unfortunate quality of being very space-hungry. There are three reasons for this:

1. HNSWs have a lot of pointers, like 16, 32 or more pointers (this is a tunable parameter of HNSWs) to neighbor nodes.
2. HNSWs have many levels, being a skiplist-alike data structure. This exacerbates the first problem.
3. HNSW's satellite data is a vector of floating point numbers, so, in the vanilla case, 4 bytes per component, and normally you have 300-3000 components; this is the usual range.

So, what are the lessons learned here? There are folks that compress pointers, since it is very likely that many pointers (8 bytes on 64-bit systems) will have the highest four bytes all the same. This is smart; I didn't implement it yet, because in Redis I need to go fast, and this is a tradeoff between space and time: maybe it is worth it, maybe not. I'll dig more. However, if you do the math, the fact that there are many layers is not *so* terrible as it looks. On average, the multiple layers per node make the situation worse by just ~1.3x (if the probability of level increase is 0.25 in the level selection function), since many nodes will be just at layer 0. But still, 1.3 is more than 1, and if that "H" in HNSW really is not *so* useful… [Spoiler: what I found is that the seek time, if you have everything at layer 0, is greater; the main loop of the greedy search will start from less optimal places and will eventually reach the right cluster, but will take more computation time. However, these are just early results.]

So here the *real* low hanging fruit is: vector quantization. What I found is that if you use 8 bit quantization, what you get is an almost 4x speedup, a 4x reduction of your vectors (but not a 4x reduction of the whole node: the pointers are still there, and they take a lot of space), and a recall that is virtually the same in real world use cases. This is the reason why Redis Vector Sets use 8 bit quantization by default. You can specify, via VADD options, that you want full precision vectors or binary quantized vectors, where we just take the sign, but I'm skeptical about both full size vectors and binary quantized vectors. Before talking about them, let's see what kind of quantization I used for 8 bits.

What I do is to compute the maximum absolute value of the components of each vector (so quantization is per-vector), then I use signed 8 bit values to represent the quants from -127 to 127. This is not as good as storing both the min and the max value, but it is faster when computing the cosine similarity, since I can do this:

/* Each vector is quantized from [-max_abs, +max_abs] to [-127, 127]
 * where range = 2*max_abs. */
const float scale_product = (range_a/127) * (range_b/127);

Then I multiply things together in the integer domain, accumulating into an integer dot product, conceptually like this (in the actual code the main loop is unrolled and uses multiple accumulators, to keep modern CPUs busier):

for (; i < dim; i++) dot0 += quants_a[i] * quants_b[i];

And finally we can return back to the floating point distance with:

float dotf = dot0 * scale_product;

Check vectors_distance_q8() for more information, but I believe you got the idea: it is very simple to go from the integer quants domain back to the unquantized dot product with trivial operations.

So, 8 bit quantization is a great deal, and full precision was a *needed* feature, because there will be people doing things with vectors generated in a way where each small amount makes a difference (no, with learned vectors this is not the case…). But why binary quantization? Because I wanted users to have a simple way to not waste space when their *original* information is already binary. Imagine you have a set of users and they have yes/no properties, and you want to find similar users, items, whatever. Well: this is where binary quantization should be used. It's just, again, an option of the VADD command.
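Since binary quantization came up above, here is a small sketch of the sign-only idea: pack one bit per component and compare vectors with XOR plus popcount. For vectors whose components are mapped to ±1, the dot product equals dim − 2 × Hamming distance, so this is enough to rank by similarity. The names and layout are illustrative assumptions, not code taken from the Vector Sets implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack the signs of a float vector into 64-bit words: bit = 1 when the
 * component is negative, 0 otherwise. 'words' must hold (dim + 63) / 64
 * elements and start zeroed. */
void binary_quantize(const float *v, size_t dim, uint64_t *words) {
    for (size_t i = 0; i < dim; i++) {
        if (v[i] < 0) words[i / 64] |= (uint64_t)1 << (i % 64);
    }
}

/* Hamming distance between two sign-quantized vectors
 * (__builtin_popcountll is a GCC/Clang builtin). */
size_t binary_hamming(const uint64_t *a, const uint64_t *b, size_t dim) {
    size_t nwords = (dim + 63) / 64, diff = 0;
    for (size_t i = 0; i < nwords; i++)
        diff += (size_t)__builtin_popcountll(a[i] ^ b[i]);
    return diff;
}

/* Dot product of the implied ±1 vectors: equal sign bits contribute +1,
 * different sign bits contribute -1. */
long binary_dot(const uint64_t *a, const uint64_t *b, size_t dim) {
    return (long)dim - 2 * (long)binary_hamming(a, b, dim);
}
```

Relative to 4-byte floats, sign-only storage shrinks the vector payload by 32x, which is exactly why it only makes sense when the original information is already (close to) binary.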
## Scaling speed: threading and locality

Oh, you know, I have to tell you something about myself: I'm not a fan of threaded systems when it is possible to do a lot with a single core, and then use multiple cores in a shared-nothing architecture. But HNSWs are different. They are *slow*, and they are accessed almost always in read-only ways, at least in most use cases. For this reason, my Vector Sets implementation is fully threaded. Not just reads, even writes are partially threaded, and you may wonder how this is possible without it resulting in a mess, especially in a system like Redis, where keys can be accessed in different ways by the background saving process, the clients, and so forth.

Well, to start, let's focus on reads. What happens is that as long as nobody is writing to the data structure, we can spawn threads that do the greedy collection of near vectors and return the results to the blocked client. However, my implementation of HNSWs was written from scratch, I mean, from the empty C file opened with vim; it has 0% of shared code with the two implementations most other systems use, so there are a few "novelties". One of these different things is that, in order to avoid re-visiting already visited nodes, I use an integer stored in each node called the "epoch", instead of marking already visited nodes in another data structure (like a hash table), which I believe is quite slow. The epoch is local to the node, and the global data structure increments its epoch for each search. So, in the context of each search, we are sure that a node whose stored epoch is lower than the current search epoch has not been visited yet. But with threads, there are multiple searches occurring at the same time! And, yep, what I needed was an array of epochs:

typedef struct hnswNode {
    uint32_t level;    /* Node's maximum level */
    … many other stuff …
    uint64_t visited_epoch[HNSW_MAX_THREADS];
}

That's what you can read in hnsw.h. This is, again, a space-time tradeoff, and again time won against space.

So, how was it possible to have threaded writes? The trick is that in HNSW inserts, a lot of time is spent looking for neighbor candidates. So writes are split into a reading half and a commit half; only the second needs a write lock, and there are a few tricks to make sure that the candidates we accumulated during the first part are discarded if the HNSW changed in the meantime and some nodes may no longer be valid.

There is, however, another problem. What about the user deleting the key while background threads are working on the value? For this scenario, we have a function that waits for background operations to return before actually reclaiming the object. With these tricks, it is easy to get 50k ops/sec on real world vector workloads, and these are numbers I got from redis-benchmark itself, with all the overhead involved. The raw numbers of the flat HNSW library itself are much higher.

## Scaling memory: reclaiming it properly

Before talking about how to scale HNSWs to big use cases with multiple instances involved, and why Redis Vector Sets expose the actual data structure to the user (I believe programmers are smart and don't need babysitting, but it's not *just* that), I want to go back and talk again about memory, because there is an interesting story to tell about this specific aspect.

Most HNSW implementations are not able to reclaim memory directly when you delete a node from the graph. I believe there are two main reasons for that:
1. People misunderstand the original HNSW paper in a specific way: they believe links can be non-reciprocal among neighbors. And there is a specific reason why they think so.
2. The paper does not say anything about deletion of nodes and how to fix the graph after nodes go away and we get missing links in the "web" of connections.

The first problem is (I believe) a combination of lack of clarity in the paper and the fact that, while implementing HNSWs, people face a specific problem: when inserting a new node, and good neighbors are searched for among existing nodes, often the candidates already have the maximum number of outgoing links. What to do in this case? The issue is often resolved by linking unidirectionally from the new node we are inserting to the candidates that are already "full" of outgoing links. However, when you need to delete a node, you can no longer resolve all its incoming links, so you can't really reclaim memory. You mark it as deleted with a flag, and later sometimes there is some rebuilding of the graph to "garbage collect" stale nodes; sometimes memory is just leaked.

So, to start, my implementation in Redis does things differently by forcing links to be bidirectional: if A links to B, B links to A. But how to do so, given that A may be full? Well, this gets into complicated territory, but what happens is that heuristics are used to drop links from existing nodes whose other neighbors are already well connected, when our new node is a better candidate even for the target node; and if this is not true, there are other ways to force a new node to have at least a minimal number of links, always trying to satisfy the small-world property of the graph. This way, when Redis deletes a node from a Vector Set, it always has a way to remove all the pointers to it.

However, what to do with the remaining nodes that are now missing a link? What I do is to create a distance matrix among them, in order to try to link the old node's neighbors among themselves, trying to minimize the average distance. Basically, for each pair of nodes i,j in our matrix, we calculate how good their connection is (how similar their vectors are) and how badly linking them affects the *remaining* possible pairs (since there could be elements left without good pairs if we link two specific nodes). After we build this matrix of scores, we then proceed with a greedy pairing step. This works so well that you can build a large HNSW with millions of elements, later delete 95% of all your elements, and the remaining graph still has good recall, no isolated nodes, and so forth. That is what I mean when I say that there is space in HNSWs for new papers to continue the work.

## Scaling HNSWs to multiple processes

When I started to work on Redis Vector Sets, there was already a vector similarity implementation in Redis-land, specifically as an index type of RediSearch, and this is how most people think of HNSWs: as a form of indexing of existing data. Yet I wanted to provide Redis with a new HNSW implementation exposed in a completely different way. Guess how? As a data structure, of course. And this tells a story about how Redis-shaped my head is after so many years, or maybe it was Redis-shaped since the start, and it is Redis that is shaped after my head, since I immediately envisioned how to design a Redis data structure that exposed HNSWs to the users, directly, and I was puzzled that the work with vectors in Redis was not performed exactly like that.
At the same time, when I handed my design document to my colleagues at Redis, I can't say that they immediately "saw" it as an obvious thing. My reasoning was: vectors are like scores in Redis Sorted Sets, except they are not scalar scores with a total order. Yet you can VADD and VREM elements, and then you can call VSIM instead of ZRANGE in order to get *similar* elements. This made sense not just as an API; I thought of HNSWs as strongly composable, and not linked to a specific use case (not specific to text embeddings, or image embeddings, or even *learned* embeddings necessarily). You do:

VADD my_vector_set VALUES [… components …] my_element_string

So whatever is in your components, Redis doesn't care: when you call VSIM it will report similar elements. But this also means that, if you have vectors about the same use case split across different instances / keys, you can issue the same VSIM query (with the same query vector) against all the instances, add the WITHSCORES option (which returns the cosine distance), and merge the results client-side, and you have magically scaled your hundreds of millions of vectors across multiple instances, splitting your dataset N ways. [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough.]

Another very notable thing about HNSWs exposed in this raw way is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance. Multiple instances can absorb the (slow, but still fast by HNSW standards) writes at the same time, parallelizing an otherwise very slow process.

This way of exposing HNSWs also scales down in a very significant way: sometimes you want an HNSW for each user / item / product / whatever you are working with. This is very hard to model if you have an index on top of something, but it is trivial if your HNSWs are data structures. You can just have a Vector Set key for each of your items, with just a handful of elements. And of course, like with any other Redis key, you can set an expiration time on the key, so that it will be removed automatically later. All this can be condensed into a rule that I believe should be more present in our industry: many programmers are smart, and if instead of creating a magic system they have no access to, you show them the data structure and the tradeoffs, they can build more things, and model their use cases in specific ways. And your system will be simpler, too.

## Scaling loading times

If I don't use threading, my HNSW library can add word2vec vectors (300 components each) into an HNSW at 5,000 elements/second on a single thread, and can query the resulting HNSW at 90k queries per second. As you can see, there is a large gap. This means that loading back an HNSW with many millions of elements from a Redis dump file into memory would take a lot of time. And this time would impact replication as well. Not great. But this is true only if we move elements from disk to memory in the most trivial way, that is, storing "element,vector" on disk and then trying to rebuild the HNSW in memory. There is another lesson to learn here. When you use HNSWs, you need to serialize the nodes and the neighbors as they are, so you can rebuild everything in memory by just allocating stuff and turning neighbor IDs into pointers. This resulted in a 100x speedup. But do you really believe the story ends here? Hehe.
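A minimal sketch of the "serialize nodes and neighbors as they are" idea: store neighbor links as integer node IDs on disk, then, after loading all nodes back into one array, convert IDs to pointers in a single pass instead of re-inserting every element through the HNSW build path. Field names and the on-disk layout are illustrative assumptions, not the actual RDB format.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_NEIGHBORS 32

/* In-memory node: neighbor links are pointers. */
typedef struct node {
    uint32_t     num_neighbors;
    struct node *neighbors[MAX_NEIGHBORS];
    /* ... vector, level, associated value, etc. ... */
} node_t;

/* On-disk record: the same links, but stored as node IDs (array indices). */
typedef struct {
    uint32_t num_neighbors;
    uint32_t neighbor_ids[MAX_NEIGHBORS];
} node_record_t;

/* Rebuild the graph from loaded records: allocate the node array once,
 * then turn every neighbor ID back into a pointer. No distance computation
 * and no greedy insertion, which is what makes reloading fast. */
node_t *rebuild_graph(const node_record_t *records, uint32_t count) {
    node_t *nodes = calloc(count, sizeof(node_t));
    if (!nodes) return NULL;
    for (uint32_t i = 0; i < count; i++) {
        if (records[i].num_neighbors > MAX_NEIGHBORS) { /* corrupt input */
            free(nodes);
            return NULL;
        }
        nodes[i].num_neighbors = records[i].num_neighbors;
        for (uint32_t j = 0; j < records[i].num_neighbors; j++) {
            uint32_t id = records[i].neighbor_ids[j];
            if (id >= count) {                          /* corrupt input */
                free(nodes);
                return NULL;
            }
            nodes[i].neighbors[j] = &nodes[id];
        }
    }
    return nodes;
}
```

Note how even this toy version has to range-check every ID it reads, which leads straight into the next point about loading untrusted, possibly corrupted data.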
Recently Redis has gained stronger security features and avoids doing bad things even when the RDB file is corrupted by an attacker. So what I needed to do was to make sure the HNSW is valid after loading, regardless of errors and corruption in the serialized data structure. This involved many tricks, but I want to take the freedom to just dump one comment I wrote here, as I believe the reciprocal check is particularly cool:

/* Second pass: fix pointers of all the neighbors links.
 * As we scan and fix the links, we also compute the accumulator
 * register "reciprocal", that is used in order to guarantee that all
 * the links are reciprocal.
 *
 * This is how it works, we hash (using a strong hash function) the
 * following key for each link that we see from A to B (or vice versa):
 *
 * hash(salt || A || B || link-level)
 *
 * We always sort A and B, so the same link from A to B and from B to A
 * will hash the same. Then we xor the result into the 128 bit accumulator.
 * If each link has its own backlink, the accumulator is guaranteed to
 * be zero at the end.
 *
 * Collisions are extremely unlikely to happen, and an external attacker
 * can't easily control the hash function output, since the salt is
 * unknown, and also there would be to control the pointers.
 *
 * This algorithm is O(1) for each node so it is basically free for
 * us, as we scan the list of nodes, and runs on constant and very
 * small memory. */

## Scaling use cases: JSON filters

I remember the day when the first working implementation of Vector Sets felt complete. Everything worked as expected, and it was the starting point for the refinements and the extra features. However, in the past weeks and months I received internal feedback that most use cases need some form of mixed search: you want vectors near a given query vector (like the movies most similar to something), but also with some kind of filtering (only released between 2000 and 2010). My feeling is that you need to query by different parameters less often than product people believe, and that most of the time you can obtain this more efficiently by adding, in this specific case, each year to a different Vector Set key (this is another instance of the composability of HNSWs expressed as data structures versus a kind of index). However, I was thinking about the main loop of the HNSW greedy search, which is something like this:

// Simplified HNSW greedy search algorithm. Don't trust it too much.
while(candidates.len() > 0) {
    c = candidates.pop_nearest(query);
    worst_distance = results.get_worst_dist(query);
    if (distance(query,c) > worst_distance) break;
    foreach (neighbor from c) {
        if (neighbor.already_visited()) continue;
        neighbor.mark_as_visited();
        if (results.has_space() OR neighbor.distance(query) < worst_distance) {
            candidates.add(neighbor);
            results.add(neighbor);
        }
    }
}
return results;

So I started to play with the idea of adding a JSON set of metadata to each node. What if, once I have things like {"year": 1999}, that were enough to filter on while I perform the greedy search? Sure, the search needs to be bounded, but there is a key insight here: I want, to start, elements that are *near* to the query vector, so I don't really need to explore the whole graph if the condition on the JSON attributes is not satisfied by many nodes. I'll let the user specify the effort, and anyway very far away results that match the filter are useless.
So that's yet another way my HNSW implementation differs: it supports filtering by expressions similar to the ones you could write inside an "if" statement of a programming language. And the elements in your Vector Set can be associated with JSON blobs expressing their properties. Then you can do things like:

VSIM movies VALUES … your vector components here… FILTER '.year >= 1980 and .year < 1990'

## A few words on memory usage

HNSWs' fatal issue, in theory, is that they are normally served from memory. Actually, you can implement HNSWs on disk, even if there are better data structures from the point of view of disk access latencies. However, in the specific case of Redis and Vector Sets, the idea is to provide something that is very fast and easy to work with: the flexibility of in-memory data structures helps with that. So the question boils down to: is the memory usage really so bad?

Loading the 3 million word2vec entries into Redis with the default int8 quantization takes 3GB of RAM, 1KB for each entry. Many use cases have just a few tens of millions of entries, or a lot less. And what you get back from HNSWs, if well implemented and in memory, is very good performance, which is crucial for a data structure and a workload that is in itself slow by definition. On my MacBook I get 48k ops per second with redis-benchmark and VSIM against this key (holding the word2vec dataset). My feeling is that the memory usage of in-memory HNSWs is very acceptable for many use cases. And even in the use cases where you want the bulk of your vectors on disk, even at the price of slower performance, your hot set should likely be served from RAM. This is one of the reasons why I believe that being active in HNSW research is a good idea: I don't think they will be replaced anytime soon for most use cases. It seems more likely that we will continue to have different data structures that are ideal for RAM or for disk, depending on the use case and data size. Moreover, what I saw recently, even just scanning the Hacker News front page, is people with a few million items fighting with systems that are slower or more complicated than needed. HNSWs, carefully exposed in the right way, can avoid all that.

## Conclusions

I like HNSWs, and working on and implementing them was a real pleasure. I believe vectors are a great fit for Redis, even in an AI-less world (for instance, a few months ago I used them in order to fingerprint Hacker News users, replicating an old work published on HN in the past). HNSWs are simply too cool and powerful for a number of use cases, and with AI and learned embeddings, all this escalates to a myriad of potential use cases. However, like most features in Redis, I expect that a lot of time will pass before people realize they are useful and powerful and figure out how to use them (no, it's not just a matter of RAG). This also happened with Streams: finally there is mass adoption, after so many years.

If instead you are more interested in HNSWs and the implementation I wrote, I believe the code is quite accessible, and heavily commented:

https://github.com/redis/redis/blob/unstable/modules/vector-sets/hnsw.c

If you want to learn more about Redis Vector Sets, please feel free to read the README file I wrote myself. There is also the official Redis documentation, but I suggest you start from here:

https://github.com/redis/redis/tree/unstable/modules/vector-sets

Thanks for reading such a long blog post! And have a nice day.

References.
This is the paper about the "H" in HNSW and how useful it is -> https://arxiv.org/abs/2412.01940

0 views