Posts in Database (20 found)

Introduction to SQLAlchemy 2 In Practice

In 2023 I wrote "SQLAlchemy 2 In Practice", a book in which I offer an in-depth look at SQLAlchemy version 2, still the current version today. SQLAlchemy is, for those who don't know, the most popular database library and Object-Relational Mapper (ORM) for Python. I have a tradition of publishing my books on this blog to read for free, but this is one that I never managed to bring here, and starting today I'm going to work on correcting that. This article includes the Preface of the book. If you are interested, keep an eye on this blog over the next few weeks, as I will be publishing the eight chapters of the book in order. If you can't wait for the installments, you can buy the book in electronic or paper format today, and I will be eternally thankful, as you will be directly supporting my work.

Robin Moffatt 3 days ago

Claude Code isn't going to replace data engineers (yet)

Ten years late (but hopefully not a dollar short), I recently figured out what all the fuss about dbt is about. No, it's not (at least, not yet). In fact, used incorrectly, it'll do a worse job than you. But used right, it's a kick-ass tool that any data engineer should be adding to their toolbox today*. In this article I'll show you why.

The Coder Cafe 3 days ago

Build Your Own Key-Value Storage Engine—Week 8

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. The conference starts today and lasts two days. Tomorrow, I’m giving a talk called Working on Complex Systems. I’d be glad to see you there 🙂.

Agenda:
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable
Week 8: Concurrency

Over this series, you built a working LSM tree: you flush to persist the memtable to disk and compact to reclaim space. Yet, you’ve been single-threaded so far. This week, we lift that constraint: flush and compaction will run in the background while you keep serving requests.

There are many ways to add concurrency. The approach here is to introduce a versioned, ref-counted catalog that lets readers take a stable snapshot while background flush/compaction proceeds. A catalog holds references to:
The current memtable.
The current WAL.
The current MANIFEST.

Each request pins one catalog version for the duration of the operation. When a flush or compaction completes, the system creates a new catalog version. Old resources (e.g., obsolete SSTables) are not deleted immediately. Instead, each catalog tracks a refcount of in-flight requests. Once an old catalog’s refcount drops to zero and a newer catalog exists, you can safely garbage-collect the resources that appear in the old version but not in the new one. For example, with two catalog versions (red = the older version’s elements, blue = the newer version’s elements): once we can guarantee that catalog v1 is no longer referenced, we can delete the old MANIFEST, SST-2 and SST-3.
Another example: a flush produced a new memtable and WAL file. In this case, once catalog v1 has no remaining references, we can free the old memtable and delete the old WAL file.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord

Add a data structure that tracks:
Memtable reference.
MANIFEST path.
Version (monotonic).
Refcount of active readers.

Implement a manager that keeps catalog versions in memory:
Pick the latest catalog. Increment its refcount.
Decrement the refcount of the catalog. If the refcount is zero and there’s a newer catalog version: remove the current catalog, and remove elements present in the current catalog but not in the latest version (files, WAL, etc.).
Create a new catalog based on the provided data. Assign it a unique, monotonic version.

At startup:
Read from the authoritative MANIFEST (latest MANIFEST file).
Treat any files not listed in the MANIFEST as orphans and delete them.
Read all WAL files you still have on disk, in order, to rebuild the in-memory state.
Create the current catalog version from the reconstructed state.
Start the background worker.

In a nutshell, flush and compaction will move to the background. You’ll use internal queues plus worker pools to ensure no overlapping work on the same resources: at most one flush running at a time, and at most one compaction running at a time.

Compaction:
Keep the same trigger: every 10,000 update requests.
Do not run compaction in the request path. On compaction trigger: post a notification to an internal queue and return.
A single background thread listens on the queue and runs the actual work. Similar compaction process, except: do not overwrite the existing WAL file. Instead, create a new file. Create a new catalog that references the new WAL.

Flush:
Keep the same trigger: when the memtable contains 2,000 entries.
Do not run flush in the request path.
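The versioned, ref-counted catalog described above can be sketched as follows. This is a minimal sketch under my own assumptions: the class and method names (`Catalog`, `CatalogManager`, `acquire`/`release`/`publish`) are illustrative, not the series' API, and garbage collection is reduced to dropping old in-memory versions (real file deletion would happen where the comment indicates).

```python
import threading
from dataclasses import dataclass, field

@dataclass
class Catalog:
    version: int
    memtable: dict             # current memtable reference
    wal_path: str              # current WAL file
    manifest_path: str         # current MANIFEST file
    sstables: set = field(default_factory=set)
    refcount: int = 0          # in-flight requests pinning this version

class CatalogManager:
    def __init__(self, initial: Catalog):
        self._lock = threading.Lock()
        self._catalogs = [initial]          # oldest -> newest

    def acquire(self) -> Catalog:
        # Pick the latest catalog and increment its refcount.
        with self._lock:
            latest = self._catalogs[-1]
            latest.refcount += 1
            return latest

    def release(self, catalog: Catalog) -> None:
        # Decrement the refcount; garbage-collect unreferenced old versions.
        with self._lock:
            catalog.refcount -= 1
            self._gc()

    def publish(self, **changes) -> Catalog:
        # Create a new catalog based on the provided data, with a
        # unique, monotonic version number.
        with self._lock:
            prev = self._catalogs[-1]
            new = Catalog(
                version=prev.version + 1,
                memtable=changes.get("memtable", prev.memtable),
                wal_path=changes.get("wal_path", prev.wal_path),
                manifest_path=changes.get("manifest_path", prev.manifest_path),
                sstables=changes.get("sstables", set(prev.sstables)),
            )
            self._catalogs.append(new)
            self._gc()
            return new

    def _gc(self) -> None:
        # Drop old catalogs with no readers; files referenced only by a
        # dropped version (and not by the latest) could be deleted here.
        while len(self._catalogs) > 1 and self._catalogs[0].refcount == 0:
            self._catalogs.pop(0)
```

A request would `acquire()`, operate on the pinned paths/refs, then `release()`; a background flush or compaction would `publish()` the new state.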
On flush trigger: allocate a new memtable and create a new WAL file for subsequent writes. Post a notification to an internal queue. Return immediately to the caller. A single background thread listens on the queue and runs the actual work. Similar flush process, except: do not overwrite the existing MANIFEST file. Instead, create a new file. Create a new catalog referencing the new MANIFEST.

For each request:
Acquire a catalog from the manager.
Do the operation using paths/refs from that catalog.
Release the catalog.

Concurrent requests make deterministic assertions harder. For example, if the validation file contains requests that can run in parallel, what should a GET assert? To make validation deterministic, you will handle barriers: all requests before a barrier must finish before starting the next block. You will also relax checks: a GET is valid if it returns any value written for a key before the last barrier.

A similar example with barriers: the first two requests run in parallel. The first barrier waits for both to complete. The first GET should accept either of the two values written before the barrier; the second GET should accept only one value.

The new validation file is a sequence of blocks separated by barrier instructions:
All the lines between two barriers form a block.
On a barrier instruction, wait for all in-flight requests in the current block to finish before starting the next block.
Write lines are issued in parallel within their block.
Read lines are also issued in parallel within their block.
A list of expected values means the response must be any one of the listed values.

Download and run your client against a new file: concurrency.txt.

When the memtable reaches 80% of capacity:
Pre-allocate the next memtable in memory.
Pre-create/rotate to the next WAL on disk.

That's it for the whole series. You implemented a fully functional LSM tree:
Started with a memtable (hashtable) and a flush that writes immutable SSTables to disk.
Added a WAL for durability.
Handled deletes and compaction to reclaim space.
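The barrier semantics above can be sketched as a small validation runner. Everything here is an assumption for illustration: the request-tuple format, the `validate` helper, and the dict-backed store are mine, not the series' exact file format. Requests within a block run in parallel; the end of each block acts as the barrier; a GET is accepted if it returns any value settled before the last barrier (or any value in its own accepted list).

```python
from concurrent.futures import ThreadPoolExecutor

def validate(store, blocks):
    """blocks: list of blocks; each block is a list of
    ("PUT", key, value) or ("GET", key, accepted_values) tuples."""
    settled = {}  # key -> set of values written before the last barrier
    for block in blocks:
        pending = {}  # writes made inside the current block

        def run(req):
            op, key, arg = req
            if op == "PUT":
                store.put(key, arg)
                pending.setdefault(key, set()).add(arg)
            else:  # GET: accept any settled value, or any explicitly listed one
                got = store.get(key)
                ok = got in settled.get(key, set()) or got in arg
                assert ok, f"GET {key} returned unexpected {got!r}"

        # The barrier: leaving the `with` block joins all in-flight requests
        # before the next block starts.
        with ThreadPoolExecutor() as pool:
            list(pool.map(run, block))

        for key, vals in pending.items():
            settled.setdefault(key, set()).update(vals)
```

With two parallel PUTs of 1 and 2 to the same key in block one, a GET in block two is accepted whether it observes 1 or 2.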
Introduced leveling and key-range partitioning to speed up reads.
Switched to block-based SSTables with indexing.
Added Bloom filters and replaced the memtable with a radix trie for faster lookups.
Finally, introduced concurrency: a simple, single-threaded foreground path with flush and compaction running in the background.

I hope you had fun building it. Thank you for following the series, and special thanks to our partner, ScyllaDB!

To get more information on how things work in production databases, you can read how RocksDB keeps track of live SST files, which inspired the catalog structure. Conflict resolution is one aspect we’re missing in the series (maybe as a follow-up?). A versioned catalog is enough for reads, but what about conflicting writes? Suppose two clients, Alice and Bob, update the same key around the same time. A simple policy to resolve conflicts is latest wins: the database can serialize operations for the same key to ensure the latest request wins, ending up with the later of the two writes as the latest state. This approach works with one node. But what about databases composed of multiple nodes? Say the two requests go to two different nodes at roughly the same time. With multiple nodes, the database must resolve conflicts consistently. There are two common ways:

Coordination via a leader (consensus): Route both writes to the same leader node, which resolves the conflict and determines the end state.

Reconcile with comparable timestamps: Attach a timestamp to each write and store it with the key. By timestamp, we don’t mean relying on wall-clock time but a logical clock, so that “later” is well-defined across nodes.

If we go with the second approach and start storing timestamped data, we also unlock something production systems use: consistent snapshots. A read can include a timestamp, and the database returns the last version at or before that time, providing a consistent view of the data even while flush/compaction runs in the background.
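The latest-wins policy with logical timestamps, plus reads "as of" a timestamp, can be sketched as follows. This is a minimal illustration, not the series' implementation: the names (`VersionedStore`, `put`, `get`, `as_of`) are mine, and the logical clock is a simple Lamport-style counter.

```python
import bisect

class VersionedStore:
    def __init__(self):
        self.versions = {}   # key -> sorted list of (ts, value)
        self.clock = 0       # logical clock, not wall-clock time

    def put(self, key, value, ts=None):
        # Each write carries a timestamp; ordering is well-defined
        # across writers because the clock is logical.
        if ts is None:
            self.clock += 1
            ts = self.clock
        else:
            self.clock = max(self.clock, ts)
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def get(self, key, as_of=None):
        # Return the latest version at or before `as_of`
        # (default: the latest version overall -> latest wins).
        vs = self.versions.get(key, [])
        if not vs:
            return None
        if as_of is None:
            return vs[-1][1]
        ts_list = [ts for ts, _ in vs]
        i = bisect.bisect_right(ts_list, as_of)
        return vs[i - 1][1] if i else None
```

If Alice writes at logical time 5 and Bob at time 3, a plain read returns Alice's value (latest wins), while a read "as of" time 4 still consistently sees Bob's.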
This pattern has a name: Multi-Version Concurrency Control (MVCC). It involves keeping multiple versions per key instead of only the last one, reading using a chosen point in time, and deleting old versions once they are no longer needed. See how ScyllaDB handles timestamp conflict resolution for more information.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

❤️ If you enjoyed this post, please hit the like button.

<antirez> 1 week ago

Redis patterns for coding

Here LLMs and coding agents can find: 1. Exhaustive documentation about Redis commands and data types. 2. Commonly used patterns. 3. Configuration hints. 4. Algorithms that can be built using Redis commands. https://redis.antirez.com/ Some humans claim this documentation is actually useful for actual people as well :) I'm posting this to make sure search engines will index it.

Binary Igor 2 weeks ago

JSON Documents Performance, Storage and Search: MongoDB vs PostgreSQL

Does MongoDB still have an edge as a document-oriented database for JSON in particular? Or is Postgres better? Or at least good enough to stick with, since it is a more universal database, offering a richer feature set and wider applicability?

The Coder Cafe 2 weeks ago

Build Your Own Key-Value Storage Engine—Week 7

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda:
Week 0: Introduction
Week 1: In-Memory Store
Week 2: LSM Tree Foundations
Week 3: Durability with Write-Ahead Logging
Week 4: Deletes, Tombstones, and Compaction
Week 5: Leveling and Key-Range Partitioning
Week 6: Block-Based SSTables and Indexing
Week 7: Bloom Filters and Trie Memtable

Over the last few weeks, you refined your LSM tree to introduce leveling. In case of a key miss, the process requires the following steps:
Lookup from the memtable.
Lookup from all the L0 SSTables.
Lookup from one L1 SSTable.
Lookup from one L2 SSTable.

Last week, you optimized the lookups by introducing block-based SSTables and indexing, but a lookup is still not a “free” operation. Worst case, it requires fetching two pages (one for the index block and one for the data block) to find out that a key is missing in an SSTable. This week, you will optimize searches by introducing a “tiny” level of caching per SSTable. If you’re an avid reader of The Coder Cafe, we already discussed a great candidate for such a cache:
One that doesn’t consume too much memory, to make sure we don’t increase space amplification drastically.
One that is fast enough so that a lookup doesn’t introduce too much overhead, especially if we have to check a cache before making any lookup in an SSTable.

You will implement a cache using Bloom filters: a space-efficient, probabilistic data structure to check for set membership. A Bloom filter can return two possible answers:
The element is definitely not in the set (no false negatives).
The element may be in the set (false positives are possible).

In addition to optimizing SSTable lookups, you will also optimize your memtable. In week 2, you implemented a memtable using a hashtable.
Let’s get some perspective to understand the problems of using a hashtable:
A memtable buffers writes. As it’s the main entry point for writes, a write has to be fast. → OK: a hashtable has O(1) average inserts, plus O(k) (k: the length of the key) for hashing.
For reads, doing a key lookup has to be fast. → OK: O(1) average lookups, plus O(k) to hash.
Doing range scanning operations (week 5, optional work), such as: “Give me the list of keys between bar and foo” → A hashtable, because it’s not an ordered data structure, is terrible: you end up touching everything, so O(n), with n the number of elements in the hashtable.
Flush to L0 → A hashtable isn’t ordered, so it requires sorting all the keys (O(n log n), with n the number of elements) to produce the SSTables.

Because of these negative points, could we find a better data structure? Yes! This week, you will switch the memtable to a radix trie (see Further Notes for a discussion on alternative data structures). A trie is a tree-shaped data structure usually used to store strings efficiently. The common example to illustrate a trie is to store a dictionary. For example, suppose you want to store two words, one of four letters and one of five, sharing the same first four letters. Despite the shared prefix, you need to store a total of 4 + 5 = 9 letters. Tries optimize the storage required by sharing prefixes. Each node stores one letter. Here’s an example of a trie storing these two words in addition to the word foo (marked nodes represent the end of a word): as you can see, we didn’t duplicate the first four letters of the longer word. In this example, instead of storing 9 letters for the two words, we stored only five letters. Yet, you’re not going to implement a “basic” trie for your memtable; instead, you will implement a compressed trie called a radix trie (also known as a PATRICIA trie). Back to the previous example, storing one node (one square) has an overhead. It usually means at least one extra field to store the next element, usually a pointer.
In the previous example, we needed 11 nodes in total, but what if we could compress the number of nodes required? The idea is to combine nodes with a single child: this new trie stores the exact same information, except it requires 6 nodes instead of 11. That’s what radix tries are about.

To summarize the benefits of switching a memtable from a hashtable to a radix trie:
Ordered by design: Tries keep keys in order and make prefix/range lookups natural, which helps for range scans and for streaming a sorted flush.
No rebalancing/rehashing pauses: The shape doesn’t depend on insertion order, and operations don’t need rebalancing; you avoid periodic rehash work.
Prefix compression: A radix trie can cut duplicated key bytes in the memtable, reducing in-memory space.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server: Join the Discord

Let’s size the Bloom filter. You will target:
p (false-positive rate) = 1%
n (max elements per SSTable) = 1,953
k (hash functions) = 5

Using the formula from the Bloom Filters post, we get m ≈ 19,230 bits, i.e., 2,404 B. We will round up to 2,496 B (39 × 64 B), so the bitset is a whole number of cache lines. NOTE: Using k = 7 would shave only ~2–3% space for ~40% more hash work, so k = 5 is a good trade-off.

To distribute elements across the bitvector, you will use the following approach: use xxHash64 with two different constant seeds to get two base hashes, then derive the k indices by double hashing.

The required changes to introduce Bloom filters:

Startup: For each SSTable in the MANIFEST, cache its related Bloom filter in memory. Since each Bloom filter requires only a small amount of space, this optimization has a minimal memory footprint. For example, caching 1,000 Bloom filters of the type you designed requires less than 2.5 MB of memory.

SSTable creation: For each new SSTable you write, initialize an empty bitvector of 2,496 B.
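The double-hashing derivation can be sketched as follows. The series specifies xxHash64 with two constant seeds; since xxhash is a third-party package, this sketch substitutes keyed blake2b from the standard library as a stand-in hash. The function names, the seed values, and the odd-step trick are my assumptions; m and k match the sizing above.

```python
import hashlib

M_BITS = 2496 * 8   # 2,496 B, a whole number of cache lines
K = 5

def _base_hash(key: bytes, seed: bytes) -> int:
    # Stand-in for a seeded 64-bit hash such as xxHash64.
    return int.from_bytes(
        hashlib.blake2b(key, key=seed, digest_size=8).digest(), "little")

def bloom_indices(key: bytes, m: int = M_BITS, k: int = K):
    # Double hashing: idx_i = (h1 + i * h2) mod m for i in 0..k-1.
    h1 = _base_hash(key, b"seed-1")
    h2 = _base_hash(key, b"seed-2") | 1   # odd step avoids degenerate cycles
    return [(h1 + i * h2) % m for i in range(k)]

def bloom_add(bits: bytearray, key: bytes) -> None:
    for idx in bloom_indices(key):
        bits[idx // 8] |= 1 << (idx % 8)

def bloom_may_contain(bits: bytearray, key: bytes) -> bool:
    # True: key *may* be present; False: key is definitely absent.
    return all(bits[idx // 8] & (1 << (idx % 8)) for idx in bloom_indices(key))
```

A fresh 2,496-byte `bytearray` is the empty bitvector; added keys always report "may be present", and a filter with no bits set always reports "definitely absent".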
Build the Bloom filter in memory as you emit the keys (including tombstones): compute the k indices based on the key and, for each index, set the bit at that position. When the SSTable is done, persist a sidecar file next to it and update the cache containing the Bloom filters.

Compaction: Delete from memory the Bloom filters corresponding to deleted SSTables.

Lookup: Before reading an SSTable, compute the k indices based on the key. If all k bits are set, the key may be present, therefore proceed with your normal lookup in the SSTable. Otherwise, skip this SSTable.

Now, let’s replace your hashtable with a trie. Each node tracks:
A compressed edge fragment.
A map keyed by the next character after the fragment to a child node.
A state enum with the different possible values: the node is just a prefix (no full key ends here), a full key exists at this node, or this key was explicitly deleted (tombstone).
If a full key exists at the node, the corresponding value.

The root is a sentinel node with an empty fragment.

Put: Walk from the root, matching the longest common prefix against the key. If there is a partial match in the middle of an edge, split once: create a parent with the common part and two children, the old suffix and the new suffix. Descend via the next child (next unmatched character). At the terminal node, set the state and value.

Get: Walk edges by longest-prefix match. If an edge doesn’t match, return not found. At the terminal node: if a full key exists, return its value; if the node is a prefix or a tombstone, return not found.

Delete: Walk as in put. If the path doesn’t fully exist, create the missing suffix nodes so that a terminal node exists. At the terminal node, set the tombstone state (you may have to clear the value).

Flush process, via in-order traversal: a full key emits its value, a tombstone emits a tombstone, a prefix emits nothing.

There are no changes to the client. Run it against the same file (put-delete.txt) to validate that your changes are correct.

Use per-SSTable random seeds for the Bloom hash functions. Persist them in the Bloom filter files. In Bloom Filters, you introduced blocked Bloom filters, a variant that optimizes spatial locality by:
Dividing the Bloom filter into contiguous blocks, each the size of a cache line.
Restricting each query to a single block to ensure all bit lookups stay within the same cache line. Switch to blocked Bloom filters and see the impact on latency and throughput. If you implemented the range scan operation from week 5 (optional work), wire it to your memtable radix trie.

That’s it for this week! You optimized lookups with per-SSTable Bloom filters and switched the memtable to a radix trie, an ordered data structure. Since the beginning of the series, everything you built has been single-threaded, and flush/compaction remains stop-the-world. In two weeks, you will finally tackle the final boss of LSM trees: concurrency.

If you want to dive more into tries, Trie Memtables in Cassandra is a paper that explains why Cassandra moved from a skip list + B-tree memtable to a trie, and what it changed for topics such as GC and CPU locality. A popular variant of the radix trie is the Adaptive Radix Tree (ART): it dynamically resizes node types based on the number of children to stay compact and cache-friendly, while supporting fast in-memory lookups, inserts, and deletes. This paper (or this summary) explores the topic in depth. You should also be aware that tries aren’t the only option for memtables, as other data structures exist. For example, RocksDB relies on a skip list. See this resource for more information.

About Bloom filters, some engines keep a Bloom filter not only per SSTable but per data-block range as well. This was the case for RocksDB’s older block-based filter format (source). RocksDB later shifted toward partitioned index/filters, which partition the index and full-file filter into smaller blocks with a top-level directory for on-demand loading. The official doc delves into the new approach.

Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.
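The radix-trie memtable operations described earlier (put/get/delete with tombstones, plus an in-order flush traversal) can be sketched as follows. The field and state names (`fragment`, `children`, `PREFIX`/`VALUE`/`TOMBSTONE`) are illustrative assumptions, not the series' exact identifiers.

```python
PREFIX, VALUE, TOMBSTONE = "prefix", "value", "tombstone"

class Node:
    def __init__(self, fragment=""):
        self.fragment = fragment   # compressed edge fragment
        self.children = {}         # next character -> child Node
        self.state = PREFIX
        self.value = None

class RadixTrie:
    def __init__(self):
        self.root = Node()         # sentinel node with an empty fragment

    def _descend(self, key, create):
        # Walk by longest-prefix match; optionally create/split nodes.
        node, i = self.root, 0
        while i < len(key):
            child = node.children.get(key[i])
            if child is None:
                if not create:
                    return None
                child = Node(key[i:])            # new suffix node
                node.children[key[i]] = child
                return child
            frag, common = child.fragment, 0
            while (common < len(frag) and i + common < len(key)
                   and frag[common] == key[i + common]):
                common += 1
            if common < len(frag):               # partial match: split once
                if not create:
                    return None
                mid = Node(frag[:common])        # parent with the common part
                mid.children[frag[common]] = child
                child.fragment = frag[common:]   # old suffix
                node.children[key[i]] = mid
                child = mid
            i += common
            node = child
        return node

    def put(self, key, value):
        n = self._descend(key, create=True)
        n.state, n.value = VALUE, value

    def get(self, key):
        n = self._descend(key, create=False)
        return n.value if n and n.state == VALUE else None

    def delete(self, key):
        n = self._descend(key, create=True)      # tombstone, even if absent
        n.state, n.value = TOMBSTONE, None

    def items(self):
        # In-order traversal for flush: yields (key, state, value) sorted.
        def walk(node, prefix):
            full = prefix + node.fragment
            if node.state != PREFIX:
                yield full, node.state, node.value
            for c in sorted(node.children):
                yield from walk(node.children[c], full)
        yield from walk(self.root, "")
```

During a flush, the traversal emits values for `VALUE` entries, tombstones for `TOMBSTONE` entries, and nothing for pure prefixes, already in sorted key order.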
❤️ If you enjoyed this post, please hit the like button.


Flexible I/O for Database Management Systems with xNVMe

Flexible I/O for Database Management Systems with xNVMe. Emil Houlborg, Simon A. F. Lund, Marcel Weisgut, Tilmann Rabl, Javier González, Vivek Shah, Pınar Tözün. CIDR’26.

This paper describes xNVMe, a storage library (developed by Samsung), and demonstrates how it can be integrated into DuckDB. Section 2 contains the hard sell for xNVMe. The “x” prefix serves a similar role to the “X” in DirectX. It is fast, while also being portable across operating systems and storage devices. The C API will feel like home for folks who have experience with low-level graphics APIs (no shaders on the disk yet, sorry). There are APIs to open a handle to a device, allocate buffers, and submit NVMe commands (synchronously or asynchronously). Listing 3 has an example, which feels like “Mantle for NVMe”: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

The API works on Linux, FreeBSD, Windows, and macOS. Some operating systems have multiple backends available. The point of this paper is that it is easy to drop into an existing application. The paper describes an implementation of the DuckDB filesystem interface that uses xNVMe. It creates dedicated queues for each DuckDB worker thread to avoid synchronization (similar tricks are used by applications calling graphics APIs in parallel). The paper also describes support for shiny new NVMe features like Flexible Data Placement (FDP). This allows DuckDB to pass hints to the SSD to colocate buffers with similar lifetimes (which improves garbage collection performance). Most of the results in the paper show comparable performance vs. the baseline DuckDB filesystem. Fig. 5 shows one benchmark with a significant improvement: Source: https://www.cidrdb.org/cidr2026/papers/p6-houlborg.pdf

Dangling Pointers: I think the long-term success of xNVMe will depend on governance. Potential members of the ecosystem could be scared off by Samsung’s potential conflict of interest (i.e., will Samsung privilege Samsung SSDs in some way?)
There is a delicate balancing act between an API driven by a sluggish bureaucratic committee, and an API which is dominated by one vendor.

0 views
Justin Duke 1 months ago

One status field per model

Any sufficiently old application starts to succumb to a pernicious form of technical debt known in street parlance as shitty data modeling . Sometimes this manifests as the god object: a single model that represents the user, and the settings, and the DNS configuration, and twelve other things. Sometimes this comes in the form of a table (or multiple tables) where the initial set of data modeling concerns in the early going of the project doesn't quite match the reality discovered along the way, and a series of subtle mismatches collide with each other in the same way that subtle mismatches between tectonic plates do.

Data models, unlike other areas of tech debt, are correctly scary to refactor. Even in Django — an application framework with really robust, mature migration tooling — reshaping data in production is non-trivial. The weight associated with even relatively simple schema changes can be so overwhelming as to forever dissuade a would-be re-architect from making things right. Therefore, it is that much more important to spend the extra mental energy early on to make sure, whenever possible, your data model is a roughly correct one, and to course-correct early when it isn't.

There are many ways to do this, and describing a virtuous data model in its entirety is too large and broad a problem for this measly little essay. Instead, I want to share a heuristic that I have found particularly useful — one which is summed up, as many of my blog posts are, in the title. Every data model must have at most one status field. If you're thinking about making a change such that a model has more than one status field, you have the wrong data model.

Let me illustrate via self-flagellation and talk about Buttondown's own problematic model: the object. The object has three status fields within its lush, expansive confines: This is incorrect.
We should have created standalone models for the sending domain and hosting domain, each with a simple status field of its own, and drawn foreign keys from the onto those. We did not do this, because at the time it felt like overkill. And so. You pay the price — not in any one specific bug, but in weirdness , in the difficulty of reasoning about the code. Is there a meaningful difference between an status and a of for an active newsletter, versus an status and a of ? What queries should return which combinations? The confusion compounds.

Again, I know this sounds trivial. But every good data model has syntactic sugar around the state machine, and every good state machine has a unary representation of its state.

1 See also: enums .
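In Django, the refactor described above would mean two new models with ForeignKeys. Here is a plain-Python sketch of the shape — the names (Newsletter, SendingDomain, and so on) are illustrative, not Buttondown's actual schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Each model carries exactly one status field; the newsletter points at
# standalone domain models instead of accumulating their statuses itself.

class NewsletterStatus(Enum):
    ACTIVE = "active"
    PAUSED = "paused"

class DomainStatus(Enum):
    PENDING = "pending"
    VERIFIED = "verified"
    FAILED = "failed"

@dataclass
class SendingDomain:
    name: str
    status: DomainStatus  # this model's single status field

@dataclass
class HostingDomain:
    name: str
    status: DomainStatus

@dataclass
class Newsletter:
    status: NewsletterStatus  # the newsletter's single status field
    sending_domain: Optional[SendingDomain] = None  # "foreign keys" onto
    hosting_domain: Optional[HostingDomain] = None  # the standalone models

newsletter = Newsletter(
    status=NewsletterStatus.ACTIVE,
    sending_domain=SendingDomain("mail.example.com", DomainStatus.VERIFIED),
)
```

The payoff is that "is the newsletter active?" and "is the sending domain verified?" become two independent questions on two independent state machines, rather than combinations to reason about on one model.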

0 views
Phil Eaton 1 months ago

Paths of MySQL, vector search edition

This is an external post of mine. Click here if you are not redirected.

0 views
Justin Duke 1 months ago

Brief notes on migrating to Postgres-backed jobs

It seems premature to talk about a migration that is only halfway done, even if it's the hard half that's done — but I think there's something useful in documenting the why and how of a transition while you're still in the thick of it, before the revisionist history of completion sets in. Early last year, we built out a system for running background jobs directly against Postgres within Django. This very quickly got abstracted out into a generic task runner — shout out to Brandur and many other people who have been beating this drum for a while. And as far as I can tell, this concept of shifting away from Redis and other less-durable caches for job infrastructure is regaining steam on the Rails side of the ecosystem, too. The reason we did it was mostly for ergonomics around graceful batch processing. It is significantly easier to write a poller in Django for stuff backed by the ORM than it is to try and extend RQ or any of the other task runner options that are Redis-friendly. Django gives you migrations, querysets, admin visibility, transactional guarantees — all for free, all without another moving part. And as we started using it and it proved stable, we slowly moved more and more things over to it. At the time of this writing, around half of our jobs by quantity — which represent around two-thirds by overall volume — have been migrated over from RQ onto this system. This is slightly ironic given that we also last year released django-rq-cron , a library that, if I have my druthers, we will no longer need. Fewer moving parts is the watchword. We're removing spindles from the system and getting closer and closer to a simple, portable, and legible stack of infrastructure.
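The core of a database-backed job runner is a small claim loop. Here is a hedged sketch of the idea — not Buttondown's actual code — using SQLite so the snippet runs anywhere; a real Postgres poller would claim with SELECT ... FOR UPDATE SKIP LOCKED so concurrent workers never block on each other's rows:

```python
import sqlite3

# A job table living in the same database as everything else, claimed by a
# simple poller. Table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id INTEGER PRIMARY KEY,"
    " payload TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'queued')"
)
conn.executemany(
    "INSERT INTO jobs (payload) VALUES (?)",
    [("send-digest",), ("sync-subscribers",)],
)

def claim_one(conn):
    """Atomically claim the oldest queued job; return (id, payload) or None."""
    with conn:  # one transaction per claim
        row = conn.execute(
            "SELECT id, payload FROM jobs"
            " WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
        )
        return row

job = claim_one(conn)  # -> (1, 'send-digest')
```

Because the claim is just a transaction against the same database as the application data, enqueueing a job inside a business transaction is atomic for free — the property that Redis-backed runners have to approximate.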

1 views

Premium: The Hater's Guide to Oracle

You can’t avoid Oracle. No, really, you can’t. Oracle is everywhere. It sells ERP software – enterprise resource planning, which is a rat king of different services for giant companies for financial services, procurement (IE: sourcing and organizing the goods your company needs to run), compliance, project management, and human resources. It sells database software, and even owns the programming language Java as part of its acquisition of Sun Microsystems back in 2010 .  Its customers are fucking everyone: hospitals ( such as England’s National Health Service ), large corporations (like Microsoft), health insurance companies, Walmart, and multiple different governments. Even if you have never even heard of Oracle before, it’s almost entirely certain that your personal data is sitting in an Oracle-designed system somewhere.  Once you let Oracle into your house, it never leaves. Canceling contracts is difficult, to the point that one Redditor notes that some clients agreed to spend a minimum amount of money on services without realizing, meaning that you can’t remove services you don’t need even during the renewal of a contract . One user from three years ago told the story of adding two users to their contract for Oracle’s Netsuite Starter Edition ( around $1000 a month in today’s pricing ), only for an Oracle account manager to call a day later to demand they upgrade to the more expensive package ($2500 per month) for every user.   In a thread from a year ago , another user asked for help renegotiating their contract for Netsuite, adding that “[their] company is no where near the state needed to begin an implementation” and “would use a third party partner to implement” software that they had been sold by Oracle. 
One user responded by saying that Oracle would play hardball and “may even use [the] threat of attorneys.”  In fact, there are entire websites about negotiations with Oracle, with Palisade Compliance saying that “Oracle likes a frenetic pace where contracts are reviewed and dialogues happen under the constant pressure of Oracle’s quarter closes,” describing negotiations with them as “often rushed, filled with tension, and littered with threats from aggressive sales and Oracle auditing personnel.” This is something you can only do when you’ve made it so incredibly difficult to change providers. What’re you gonna do? Have your entire database not work? Pay up.

Oracle also likes to do “audits” of big customers where it makes sure that every single part of your organization that uses Oracle software is paying for it, and isn’t using it in a way their contract doesn’t allow. For example, Oracle sued healthcare IT company Perry Johnson & Associates in 2020 because the company that built PJ&A’s database systems used Oracle’s database software. The case was settled.

This is all to say that Oracle is a big company that sells lots of stuff, and increases the pressure around its quarterly earnings as a means of boosting revenues. If you have a company with computers that might be running Java or Oracle’s software — even if somebody else installed it for you! — you’ll be paying Oracle, one way or another. They even tried to sue Google for using the open source version of Java to build its Android operating system (though they lost).

Oracle is a huge, inevitable pain in the ass, and, for the most part, an incredibly profitable one . Every time a new customer signs on at Oracle, they pledge themselves to the Graveyard Smash and permanent fealty to Larry Ellison’s database empire.
As a result, founder Larry Ellison has become one of the richest people in the world — the fifth-richest as of writing this sentence — owning 40% of Oracle’s stock and, per Martin Peers of The Information, set to earn about $2.3 billion in dividends in the next year.

Oracle has also done well to stay out of bullshit hype-cycles. While it quickly spun up vague blockchain and metaverse offerings, its capex stayed relatively flat at around $1 billion to $2.1 billion a fiscal year (which runs from June 1 to May 31), until it burst to $4.511 billion in FY2022 (which began on June 1, 2021, for reference), $8.695 billion in FY2023, $6.86 billion in FY2024, and then increased a teeny little bit to $21.25 billion in FY2025 as it stocked up on AI GPUs and started selling compute. You may be wondering if that helped at all, and it doesn’t appear to have. Oracle’s net income has stayed in the $2 billion to $3 billion range for over a decade , other than a $2.7 billion spike last quarter from the sale of its shares in Ampere .

You see, things have gotten weird at Oracle, in part because of the weirdness of the Ellisons themselves, and their cozy relationship with the Trump Administration ( and Trump himself ). Ellison’s massive wealth backed son David Ellison’s acquisition of Paramount , putting conservative Bari Weiss at the helm of CBS in an attempt to placate and empower the right wing, and David is currently trying to buy Warner Brothers Discovery ( though it appears Netflix may have won ), all in pursuit of kissing up to a regime steeped in brutality and bigotry that killed two people in Minnesota.

Oracle will serve as the trusted security partner, responsible for auditing and ensuring compliance with National Security Terms, according to a memo. The company already provides cloud services for TikTok and manages user data in the U.S. Notably, Oracle previously made a bid for TikTok back in 2020.
I know that you’re likely a little scared that an ultra right-wing billionaire has bought another major social network. I know what you think of Oracle, a massive and inevitable cloud storage platform owned by a man who looks like H.R. Giger drew Jerry Stiller. I know you’re likely worried about a replay of the Elon Musk Twitter fiasco, where every week it seemed like things would collapse but it never seemed to happen, and then Musk bought an election.

What if I told you that things were very different, and far more existentially perilous for Oracle? You see, Oracle is arguably one of the single most evil and successful companies in the world, and it got there by being an aggressive vendor of database and ERP software, one that, like a tick with a law degree, cannot be removed without some degree of bloodshed. Perhaps not the highest-margin business in the world, but you know, it worked. Oracle has stuck to the things it’s known for for years and years and done just fine…

…until AI, that is. Let’s see what AI has done for Oracle’s gross margi- OH MY GOD !

The scourge of AI GPUs has taken Oracle’s gross margin from around 79% in 2021 to 68.54% in 2025, with CNBC reporting that FactSet-polled analysts saw it falling to 49% by 2030 , which I think is actually being a little optimistic.

Oracle was very early to high-performance computing, becoming the first cloud in the world to have general availability of NVIDIA’s A100 GPUs back in September 2020 , and in June 2023 (at the beginning of Oracle’s FY2024), Ellison declared that Oracle would spend “billions” on NVIDIA GPUs, naming AI firm Cohere as one of its customers.
In May 2024, Musk and Ellison discussed a massive cloud compute contract — a multi-year, $10 billion deal that fell apart in July 2024 when Musk got impatient , a blow that was softened by Microsoft’s deal to buy compute capacity for OpenAI , for chips to be rented out of a data center in Abilene, Texas that, about six months later, OpenAI would claim was part of a “$500 billion Stargate initiative” announcement between Oracle, SoftBank and OpenAI that was so rushed that Ellison had to borrow a coat to stay warm on the White House lawn, per The Information . “Stargate” is commonly misunderstood as a Trump program, or something that has raised $500 billion, when what it actually is is Oracle raising debt to build data centers for OpenAI.

Instead of staying in its lane as a dystopian datacenter mobster, Oracle entered the negative-to-extremely-low-margin realm of GPU rentals, raising $58 billion in debt and signing $248 billion in data center leases to service a 5-year-long $300 billion contract with OpenAI that it doesn’t have the capacity for and OpenAI doesn’t have the money to pay for .

Oh, and TikTok? The billion-user social network that Oracle sort-of-just bought? There’s one little problem with it: per The Information , ByteDance investors estimate TikTok lost several billion dollars last year on revenues of roughly $20 billion, attributed to its high growth costs and “higher operational and labor costs in overseas markets compared to China.”

Now, I know what you’re gonna say: Ellison bought TikTok as a propaganda tool, much like Musk bought Twitter. “The plan isn’t for it to be profitable,” you say. “It’s all about control” you say, and I say, in response, that you should know exactly how fucked Oracle is. In its last quarter, Oracle had negative $13 billion in cash flow , and between 2022 and late 2025 quintupled its PP&E (from $12.8 billion to $67.85 billion), primarily through the acquisition of GPUs for AI compute.
Its remaining performance obligations are $523 billion , with $300 billion of that coming from OpenAI in a deal that starts, according to the Wall Street Journal, “ in 2027 ,” with data centers that are so behind in construction that the best Oracle could muster is saying that 96,000 B200 GPUs had been “delivered” to the Stargate Abilene data center in December 2025 for a data center of 450,000 GPUs that has to be fully operational by the end of 2026 without fail.  And what’re the margins on those GPUs? Negative 100% .  Oracle, a business borne of soulless capitalist brutality, has tied itself existentially to not just the success of AI , but the specific, incredible, impossible success of OpenAI , which will have to muster up $30 billion in less than a year to start paying for it, and another $270 billion or more to pay for the rest… at a time when Oracle doesn’t have the capacity and has taken on brutal debt to build it. For Oracle to survive , OpenAI must find a way to pay it four times the annual revenue of Microsoft Azure ($75 billion) , and because OpenAI burns billions of dollars, it’s going to have to raise all of that money at a time of historically low liquidity for venture capital .  Did I mention that Oracle took on $56 billion of debt to build data centers specifically for OpenAI? Or that the banks who invested in these deals don’t seem to be able to sell off the debt ? Let me put it really simply: We are setting up for a very funny and chaotic situation where Oracle simply runs out of money, and in the process blows up Larry Ellison’s fortune. However much influence Ellison might have with the administration, Oracle has burdened itself with debt and $248 billion in data center lease obligations — costs that are inevitable, and are already crushing the life out of the company (and the stock).  The only way out is if OpenAI becomes literally the most-successful cash-generating company of all time within the next two years, and that’s being generous. 
This is not a joke. This is not an understatement. Sam Altman holds Larry Ellison’s future in his clammy little hands, and there isn’t really anything anybody can do about it other than hope for the best, because Oracle already took on all that debt and capex. Forget about politics, forget about the fear in your heart that the darkness always wins, and join me in The Hater’s Guide To Oracle, or My Name’s Larry Ellison, and Welcome To Jackass. Larry Ellison’s wealth is almost entirely tied up in Oracle stock. Oracle’s stock is tied to the company “Oracle,” which is currently destroying its margins and annihilating its available cash to buy GPUs to serve a customer that cannot afford to pay it. Oracle has taken on ruinous debt that can only be paid if this customer, which cannot afford it and needs to raise money from an already-depleted venture capital pool, actually pays it. Oracle’s stock has already been punished for these debts , and that’s before OpenAI fails to pay for its contract. Oracle now owns part of one of its largest cloud customers, TikTok, which loses billions of dollars a year, and the US entity says, per Bloomberg , that it will “retrain, test and update the content recommendation algorithm on US user data,” guaranteeing that it’ll fuck up whatever makes it useful, reducing its efficacy for advertisers. Larry Ellison’s entire financial future is based on whether OpenAI lives or dies. If it dies, there isn’t another entity in the universe that can actually afford (or has interest in) the scale of the compute Oracle is building.

0 views
Dangling Pointers 1 months ago

UPP: Universal Predicate Pushdown to Smart Storage

UPP: Universal Predicate Pushdown to Smart Storage. Ipoom Jeong, Jinghan Huang, Chuxuan Hu, Dohyun Park, Jaeyoung Kang, Nam Sung Kim, and Yongjoo Park. ISCA'25.

Working on hardware acceleration requires a healthy dose of honesty. If you try hard enough, you can find clever ways to accelerate a given application. Once that point is reached, it is helpful to step back and ask yourself: “are these ideas generally applicable to many hardware architectures, or only to the one I am targeting?” This paper describes techniques for high-performance filtering of OLAP data, and an FPGA implementation. I wonder if these ideas would also work well on other chips.

This paper rests on two assumptions: Inputs are relatively static, which means that the cost of preprocessing can be amortized over many queries. Best-effort filtering is OK, because the system has another level of filtering to catch any false positives (rows which should have been removed, but were not).

A preprocessing step generates a 256-bit row vector (RV) associated with each row. These bits are partitioned among all columns (e.g., if there are 8 columns in the relation, then each column is represented with 32 bits per row). When a query is run, the relevant filters from the query are converted into a set of 256-bit query vectors (QVs) and simple instructions which perform logical operations between the row vectors and query vectors. The result of those instructions is a single bit per row which determines if the row can be safely removed. Numerical expressions are supported for monotone functions.

During the preprocessing step, the lower and upper bounds of each column are computed. This space is divided into a fixed number of buckets. For each value in a column, the associated bucket index is computed, and the associated bit in the row vector is set to 1. Hashing is used to handle the case where there are more buckets than bits allocated for a given column.
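The bucketing scheme can be sketched as a toy model: one numeric column, 8 buckets. (The real design packs 256 bits per row shared across all columns; the values below are made-up illustrations, not from the paper.)

```python
NUM_BUCKETS = 8

def make_row_vector(value, lo, hi):
    """Preprocessing: set the bit of the bucket containing `value`."""
    span = hi - lo + 1
    bucket = min(int((value - lo) * NUM_BUCKETS / span), NUM_BUCKETS - 1)
    return 1 << bucket

def make_query_vector(pred_lo, pred_hi, lo, hi):
    """Query time: set every bucket bit that [pred_lo, pred_hi] overlaps."""
    span = hi - lo + 1
    qv = 0
    for b in range(NUM_BUCKETS):
        b_lo = lo + b * span / NUM_BUCKETS
        b_hi = lo + (b + 1) * span / NUM_BUCKETS
        if pred_hi >= b_lo and pred_lo < b_hi:
            qv |= 1 << b
    return qv

lo, hi = 0, 79          # per-column bounds found during preprocessing
rows = [5, 42, 77]
rvs = [make_row_vector(v, lo, hi) for v in rows]
qv = make_query_vector(30, 50, lo, hi)  # filter: 30 <= x <= 50

# Bitwise AND of row vector and query vector: nonzero means the row may
# match and is kept; zero means it can be safely removed. Best-effort: a
# value like 55 shares bucket 5 with 50, so it survives the filter and
# must be caught by the exact filter upstream.
kept = [v for v, rv in zip(rows, rvs) if rv & qv]
```

Only 42 survives here; 5 and 77 fall in buckets the query vector never touches, so they are removed without ever evaluating the predicate exactly.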
When a query is executed, the software can determine the set of buckets which the query references. For example, say the filter expression is: and the buckets for are: The query vector which selects rows which should not be filtered is (LSB first): To determine if a row should be filtered, compute the bitwise AND of the row vector and the query vector. If all bits of the result are zero, then the row can be removed.

To convert a string into a row vector, the paper proposes tokenizing the string, and then hashing each token (i.e., word) to determine which bit in the row vector to set. This means that multiple bits in a row vector can be set (one bit per word). Only tokens which appear frequently in the dataset are hashed; the rest are ignored. A query expression like is decomposed into three tokens, each token is hashed, and the hash values determine which bits in the query vector are set. Rows are accepted if they have all 3 bits set. Note that this is best-effort filtering. For example, if a row contains the string in the column, that row will not be removed.

Table 3 shows FPGA resource usage numbers for an FPGA accelerator which executes queries by performing bitwise operations on row and query vectors, and compacting tables based on the generated bitmaps. These numbers seem pretty modest (i.e., good) to me. The authors argue that this implementation is small enough that it could be incorporated into a Smart SSD, allowing queries to be pushed down as far as possible.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Fig. 6 shows TPC-H performance results. Each pair of bars represents a particular query run on a baseline system, and on a system with filtering pushed down to a Smart SSD. Q21 doesn’t see a speedup because it is bound by join and aggregation, not filtering.

Source: https://dl.acm.org/doi/10.1145/3695053.3731005

Dangling Pointers

I wonder how much of this is overfitting to TPC-H.
If you look at the TPC-H spec , a lot of string columns are generated by randomly sampling from a very small set of possible tokens. It would be great if the industry had a “held out test set” which could be used to evaluate OLAP performance on real-world yet hidden datasets which researchers could not directly see.

0 views
Binary Igor 1 months ago

Data Consistency: transactions, delays and long-running processes

Data Consistency is simply a need to keep data consistent within a particular boundary and time. There are two main scopes ... Local data is immediately consistent - there are no delays, it appears either in its final form or not at all. In the global scope however, things look totally different.

0 views
The Coder Cafe 1 months ago

Build Your Own Key-Value Storage Engine—Week 6

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda: Week 0: Introduction. Week 1: In-Memory Store. Week 2: LSM Tree Foundations. Week 3: Durability with Write-Ahead Logging. Week 4: Deletes, Tombstones, and Compaction. Week 5: Leveling and Key-Range Partitioning. Week 6: Block-Based SSTables and Indexing.

In week 2, you used JSON as the SSTable format. That works for document databases, but the overhead of this serialization format doesn’t make it the best choice for your storage engine. Best case: you stream the file and linearly scan entries until you find the key, but a miss means scanning the entire file. Worst case: you read the whole file and parse everything, then search for the key.

This week, you will switch to block-based SSTables. Data will be chunked into fixed-size blocks designed to fit within a single disk page. The main benefits:

Efficient I/O: Each lookup can fetch a complete block with a single page read.

Predictable latency: Since every block maps to exactly one page, each read involves a fixed, bounded amount of I/O, improving latency consistency.

Smaller on disk: Binary encoding typically compresses better than JSON.

Integrity: Per-block checksums detect corruption without requiring a re-read of the file.

Caching: Hot SSTable blocks are cached in a memory-based block cache to reduce I/O and decompression overhead.

Alongside the data blocks, you will maintain a small index that stores the first key of each block and its corresponding offset, allowing lookups to jump directly to the relevant block without scanning all of them.
💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord

Fixed 64-byte keys and values: this avoids a lot of the logic needed to keep blocks fixed-size, making the implementation easier to write and reason about. Because of the week 1 assumption (keys are lowercase ASCII strings), each character is one byte, which also makes the implementation easier.

A block-based SSTable will be composed of one index block (the first 4 KB page) and multiple data blocks (each 4 KB). Each block has a fixed size of 4 KB. Aligning blocks to 4 KB means a disk read can fetch a block in one page; if blocks are not aligned, a read may span two pages. Here’s the file layout at a glance:

The layout of an index block (4 KB): the number of data blocks in the SSTable, followed by a set of key entries (64 B each), each being the first key of the corresponding data block. Entries are sorted by key and used to decide which block to fetch during a lookup. To make the index fit into a single 4 KB page, it must contain at most 63 entries. Here’s the layout (note this is a binary layout; newlines are used only for the representation):

NOTE : If you’re not familiar with the concept of padding: it’s filling unused bytes (here with 0x00) so fields and blocks have fixed sizes.

The block count has a value between 0 and 63. If you encoded 63 as text, you would need two bytes ('6' = 0x36 and '3' = 0x33). Instead, you can store it as a binary integer so it fits in one byte: 0x3F.

Same layout, with explicit offsets:

An example of an SSTable with three data blocks, hence three entries. Remember: this is binary; newlines are for readability only:

This index block indicates: Block 0 starts with the key . Block 1 starts with the key . Block 2 starts with the key . You don’t need to store per-block offsets.
Because the index is stored on a 4 KB page and every data block is exactly 4 KB and written contiguously, offsets can be calculated this way ( starts at 0): Block 0 starts at offset 4096. Block 1 starts at offset 8192. Block 2 starts at offset 12288.

Now, let’s focus on data blocks. In addition to the key-value entries, reserve 8 bytes at the start of the block to store a CRC computed over + all entries; this lets you verify data integrity on read.

The layout of a data block (4 KB per block): Header (128 B): (8 B): a checksum computed over bytes [8..4096). You can choose any standard variant (e.g., CRC-64/ECMA-182). (1 B): the number of entries in this block (0..31). Padding (119 B). Entries area (31 x 128 B = 3968 B), each entry being: (64 B, right-padded). (64 B, right-padded).

The last data block may contain fewer than 31 entries ( ), but always pad with zeros to reach exactly 4 KB. This guarantees one-page reads and prevents errors across read modes (e.g., with mmap ).

The layout of a data block (again, newlines are used only for the representation):

Same layout, with explicit offsets:

An example of a block composed of three key-value pairs:

Note that because the index block holds at most 63 key entries, an SSTable can have at most 63 data blocks. With 31 entries per block, that caps an SSTable at 63 × 31 = 1,953 entries.

A tombstone is represented by a value of 64 bytes all set to 0x00. Due to this sentinel, the all-zero value is reserved and cannot be used as an application value from this week onward.

Searching for a value doesn’t change (memtable → L0 → L1, etc.). What changes is how you read one SSTable (remember: from L1, you only need to read one SSTable per level because of non-overlapping key ranges). The process to read from an SSTable: Binary search the index in to find the largest ≤ key and get . If not found (e.g., the first index key is and your key is ), return a miss for this SSTable. Compute the block offset: . Fetch the corresponding 4 KB block.
Verify the CRC before using the block: compute CRC64 over bytes [8..4096) and compare it with the 8-byte CRC stored at offset 0..7. If it doesn’t match, fail the read for this SSTable. Then binary search the entries in for the key, and return the corresponding value or a miss.

Last week, you split at 2,000 entries during the compaction process. This week, because a single SSTable is limited to 1,953 entries, change the split threshold to 1,953.

There are no changes to the client. Run it against the same file ( put-delete.txt ) to validate that your changes are correct.

Drop the 64-byte constraint: store a length-prefixed key and value per entry (a short header with key length and value length). Keep entries sorted and include the lengths in your checksum. Tombstones are currently represented by a sentinel value (a 64-byte all-zero value), which prevents storing an actual empty value. Instead, avoid reserving any value for deletes: add an explicit entry type per record (value or tombstone).

Now that the format is binary, compression becomes more effective and saves more space. As an optional task, compress each data block independently so lookups still touch only one block: record each block’s offset and compressed size in the index, then read just those bytes, decompress, and search. This packs more logical blocks into each cached page, raising cache hit rates, reducing pages touched during scans, and smoothing read latency.

That’s it for this week! You implemented block-based SSTables and indexing, gaining benefits like more efficient I/O and reduced write amplification. In two weeks, you will focus on improving read performance by adding a layer that can tell whether an SSTable is worth parsing, and say goodbye to your hashtable-based memtable, replacing it with a more efficient data structure.

For a production-grade implementation of block-based SSTables, see RocksDB’s block-based SSTable format . It details block layout, per-block compression, and how the index stores offsets and sizes.
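The block format and lookup path can be sketched end-to-end. This is a hedged illustration under two simplifications, not the course's reference code: it uses zlib.crc32 (a 32-bit CRC stored in the same 8-byte header slot; the post suggests CRC-64/ECMA-182), and it scans the entries of a block linearly where the post calls for a binary search:

```python
import bisect
import struct
import zlib

PAGE = 4096
KEY = VAL = 64
MAX_ENTRIES = 31  # entries per 4 KB data block

def pad(b, n):
    return b.ljust(n, b"\x00")

def pack_data_block(entries):
    """entries: up to 31 sorted (key, value) byte-string pairs."""
    # Bytes [8..4096): 1-byte entry count, 119 bytes padding, then entries.
    body = pad(bytes([len(entries)]), 120)
    for k, v in entries:
        body += pad(k, KEY) + pad(v, VAL)
    body = pad(body, PAGE - 8)
    return struct.pack("<Q", zlib.crc32(body)) + body  # 8-byte CRC header

def pack_index_block(first_keys):
    body = bytes([len(first_keys)])  # number of data blocks (0..63), 1 byte
    for k in first_keys:
        body += pad(k, KEY)
    return pad(body, PAGE)

def build_sstable(entries):
    blocks = [entries[i:i + MAX_ENTRIES]
              for i in range(0, len(entries), MAX_ENTRIES)]
    index = pack_index_block([blk[0][0] for blk in blocks])
    return index + b"".join(pack_data_block(blk) for blk in blocks)

def get(sstable, key):
    n = sstable[0]
    firsts = [sstable[1 + i * KEY:1 + (i + 1) * KEY].rstrip(b"\x00")
              for i in range(n)]
    i = bisect.bisect_right(firsts, key) - 1  # largest first_key <= key
    if i < 0:
        return None  # key sorts before the first block
    offset = PAGE * (i + 1)  # index is page 0, so block i starts at 4096*(i+1)
    block = sstable[offset:offset + PAGE]
    (stored_crc,) = struct.unpack("<Q", block[:8])
    if stored_crc != zlib.crc32(block[8:]):
        return None  # checksum mismatch: fail the read for this SSTable
    count = block[8]
    for e in range(count):
        base = 128 + e * 128  # entries start after the 128 B header
        if block[base:base + KEY].rstrip(b"\x00") == key:
            return block[base + KEY:base + 2 * KEY].rstrip(b"\x00")
    return None

entries = [(f"key{i:02d}".encode(), f"val{i:02d}".encode()) for i in range(40)]
table = build_sstable(entries)  # 40 entries -> index page + 2 data blocks
```

Every read touches exactly two pages (index plus one data block), which is the latency guarantee the 4 KB alignment buys you.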
You can also check out ScyllaDB’s SSTables v3 docs . ScyllaDB maintains a small in-memory summary of sampled keys to narrow the search, then uses the on-disk index to locate the exact block. This provides a nice contrast to our single-page index and illustrates how to scale when SSTables grow large. For a deeper look at how things work in practice in terms of directory structure, you can explore the ScyllaDB SSTables directory structure , which shows how metadata and data are organized on disk.

Regarding CRC read failures, we mentioned that a checksum mismatch should simply cause the read to fail for that SSTable. In real systems, databases rely on replication to handle corruption. When multiple replicas exist, a system can recover by using data from an intact replica if one becomes corrupted or unavailable. Upon detecting a checksum mismatch, the system discards the corrupt replica and rebuilds it from a healthy one. This approach only works as long as a valid replica exists, which is why frequent checksum verification is critical: it ensures corruption is caught and repaired as early as possible, before it propagates.
That works for document databases, but the overhead of this serialization format doesn’t make it the best choice for your storage engine: Best case: You stream the file and linearly scan entries until you find the key, but a miss means scanning the entire file. Worst case: You read the whole file and parse everything, then search for the key. Efficient I/O: Each lookup can fetch a complete block with a single page read. Predictable latency: Since every block maps to exactly one page, each read involves a fixed, bounded amount of I/O, improving latency consistency. Smaller on disk: Binary encoding typically compresses better than JSON. Integrity: Per-block checksums detect corruption without requiring a re-read of the file. Caching: Hot SSTable blocks are cached in a memory-based block cache to reduce I/O and decompression overhead. Fixed 64-byte keys and values: This alleviates a lot of logic to keep fixed-size blocks, making the implementation easier to write and reason about. Because of the week 1 assumption (keys are lowercase ASCII strings), each character is one byte, which also makes the implementation easier. One index block (first 4 KB page) Multiple data blocks (each 4 KB) : The number of data blocks in the SSTable. A set of key entries (64 B), each being the first key of the corresponding data block. Entries are sorted by key and used to decide which block to fetch during a lookup. Block 0 starts with the key . Block 1 starts with the key . Block 2 starts with the key . Block 0 starts at offset 4096. Block 1 starts at offset 8192. Block 2 starts at offset 12288. Header (128 B): (8 B): A checksum computed over bytes [8..4096). You can choose any standard variant (e.g., CRC-64/ECMA-182). (1 B): the number of entries in this block (0..31). Padding (119 B). Entries area (31 x 128 B = 3968 B), each entry is: (64 B, right-padded). (64 B, right-padded). Binary search the index in to find the largest ≤ key and get . 
If not found (e.g., first index key is and your key is ), return a miss for this SSTable. Compute the block offset: 4096 × (1 + block index). Fetch the corresponding 4 KB block. Verify the CRC before using the block: Compute the CRC64 over bytes [8..4096). Compare with the 8-byte CRC stored at offset 0..7. If it doesn’t match, fail the read for this SSTable. Binary search the entries in the block for the key. Return the corresponding value or a miss. For compressed blocks, record each block’s offset and compressed size in the index. Read just those bytes, decompress, and search.
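The lookup path above is short enough to sketch end to end. This is an illustrative sketch, not the series’ reference code: the exact index-page layout (here, a 2-byte block count followed by the 64 B first keys) is an assumption, and `zlib.crc32` padded to 8 bytes stands in for a real 64-bit variant such as CRC-64/ECMA-182.

```python
# Sketch of the block-based SSTable described above: 4 KiB pages,
# 128 B block header (8 B checksum + 1 B entry count + padding),
# 64 B right-padded keys and values. Index-page layout is assumed.
import bisect, struct, zlib

PAGE, KEY, HDR = 4096, 64, 128
ENTRY = 2 * KEY                      # key + value, both right-padded
MAX_ENTRIES = (PAGE - HDR) // ENTRY  # 31 entries per data block

def _pad(s):
    return s.encode("ascii").ljust(KEY, b"\0")

def _crc(block):
    # Stand-in checksum over bytes [8..4096), stored in 8 bytes.
    return struct.pack("<Q", zlib.crc32(block[8:]))

def write_sstable(path, items):
    """items: sorted (key, value) pairs. Writes index page + data blocks."""
    chunks = [items[i:i + MAX_ENTRIES] for i in range(0, len(items), MAX_ENTRIES)]
    index = struct.pack("<H", len(chunks)) + b"".join(_pad(c[0][0]) for c in chunks)
    pages = [index.ljust(PAGE, b"\0")]
    for chunk in chunks:
        body = struct.pack("<B", len(chunk)).ljust(HDR - 8, b"\0")
        body += b"".join(_pad(k) + _pad(v) for k, v in chunk)
        body = body.ljust(PAGE - 8, b"\0")
        pages.append(_crc(b"\0" * 8 + body) + body)
    with open(path, "wb") as f:
        f.write(b"".join(pages))

def lookup(path, key):
    """Binary-search the index, fetch one block, verify CRC, search it."""
    raw = _pad(key)
    with open(path, "rb") as f:
        index = f.read(PAGE)
        n, = struct.unpack_from("<H", index, 0)
        firsts = [index[2 + i * KEY: 2 + (i + 1) * KEY] for i in range(n)]
        i = bisect.bisect_right(firsts, raw) - 1
        if i < 0:                    # key sorts before the first block
            return None
        f.seek(PAGE * (1 + i))       # block offset: 4096 * (1 + block index)
        block = f.read(PAGE)
    if block[:8] != _crc(block):
        raise IOError("checksum mismatch: fail the read for this SSTable")
    count = block[8]
    entries = [block[HDR + j * ENTRY: HDR + j * ENTRY + KEY] for j in range(count)]
    j = bisect.bisect_left(entries, raw)
    if j < count and entries[j] == raw:
        off = HDR + j * ENTRY + KEY
        return block[off: off + KEY].rstrip(b"\0").decode("ascii")
    return None
```

A miss costs one index-page read plus at most one block read, matching the “fixed, bounded amount of I/O” property described above.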

0 views
Justin Duke 1 months ago

Migrating to PlanetScale

First off, huge amount of credit to Mati for migrating our database to PlanetScale. I highly recommend reading the blog post . He does a good job talking about the boring stuff, which is to say, talking about the interesting stuff when reflecting on how this project went. Three other things come to mind, from my side of the fence:

1. This is the largest infrastructural project that Buttondown has done which I have not been a part of — a fact that is both surreal and, frankly, very cool.

2. Mati hinted at this in the blog post, but outside of the obvious quantitative improvements brought by PlanetScale, the insights dashboard is worth its weight in gold. Being able to very easily see problematic queries and fix them has drastically changed our database posture, perhaps even more than the change in hardware itself. I find myself staring at the insights dashboard and thinking about what other parts of the infrastructure — CI tests, outbound SMTP, etc. — we need this for as well.

3. PlanetScale folks recommended we start using sqlcommenter to annotate our queries with the API routes that generated them. This was a good suggestion, but that package has the same malnourishment problem that so many Google-born OSS projects do: you end up pulling in a lot of dependencies, including some very old ones, to execute what is essentially 50 lines of code. Rather than do that, I asked Mr. Claude to vend the relevant snippets into a middleware that we can plop into our codebase. It is below:
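The original snippet is not reproduced in this excerpt. As a rough illustration of the idea only (this is not Buttondown’s actual middleware; names and details here are invented), a sqlcommenter-style wrapper for Django might look like:

```python
# Illustrative sketch, NOT Buttondown's code: annotate every SQL
# statement issued during a request with the route that produced it,
# using Django's database instrumentation API (execute_wrapper).
from urllib.parse import quote

def make_commenter(route: str):
    def annotate(execute, sql, params, many, context):
        # Trailing /*...*/ comments survive into slow-query logs and
        # insight dashboards, attributing each statement to its route.
        return execute(f"{sql} /*route='{quote(route)}'*/", params, many, context)
    return annotate

class QueryCommentMiddleware:
    """Django middleware: tag SQL issued while handling a request."""
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        from django.db import connection  # imported lazily in this sketch
        with connection.execute_wrapper(make_commenter(request.path)):
            return self.get_response(request)
```

This captures the ~50-lines-of-code essence without pulling in the sqlcommenter dependency tree; a production version would also escape the comment delimiter inside the route string.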

0 views
The Coder Cafe 1 months ago

Build Your Own Key-Value Storage Engine—Week 5

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database. I’ll also give a talk there, so feel free to join! Agenda Week 0: Introduction Week 1: In-Memory Store Week 2: LSM Tree Foundations Week 3: Durability with Write-Ahead Logging Week 4: Deletes, Tombstones, and Compaction Week 5: Leveling and Key-Range Partitioning Last week, you implemented deletion and compaction, making sure the LSM tree wouldn’t grow indefinitely. Still, there’s a weak spot: in the worst-case scenario (e.g., on a key miss), a single read has to scan all SSTables. To address this, you will implement leveling, a core idea in LSM trees. Instead of a single flat list of SSTables, leveling stores data across multiple levels: L0, L1, L2, etc. L0 gets compacted to L1 and makes space for future memtable flushes. L1 gets compacted to L2 and makes space for L0 compaction. L2 gets compacted to L3 and makes space for L1 compaction. L3 gets compacted to L4 and makes space for L2 compaction. This process is called level compaction. Something important to understand: L0 is slightly different from all the other levels. L0 is created during memtable flushes. If a key already exists at L0 and also in the memtable, the next flush can write that key again to a new file. In other words, L0 can have overlapping keys. For all the other levels (L1 to Ln), that’s not the case. They are created by compaction, which removes duplicates and produces non-overlapping key ranges. In this week’s simplified design, an L0 to L1 compaction takes all SSTables from L0 and L1, performs a k-way merge, then rewrites L1 fully. As a result, each key appears at most once per level from L1 downward. What’s the consequence of non-overlapping keys? You can improve lookups using a simple range-to-file mapping, for example: Keys from to are stored in this SSTable. Keys from to are stored in this SSTable.
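The range-to-file mapping above is small enough to sketch directly. This is an illustrative sketch (shard boundaries and file names below are invented for the example):

```python
# Sketch of range-to-file lookup for a non-overlapping level: each shard
# covers a half-open key range [start, end), so one binary search over
# the sorted start keys picks the single SSTable that could hold the key.
import bisect

def shard_for(key, shards):
    """shards: sorted list of (start_key, end_key, filename) with
    non-overlapping [start, end) ranges. Returns a filename or None."""
    starts = [s for s, _, _ in shards]
    i = bisect.bisect_right(starts, key) - 1
    if i < 0:
        return None                      # key sorts before every shard
    start, end, name = shards[i]
    return name if key < end else None   # gap between shards -> miss
```

This is why, with leveling, a lookup touches at most one SSTable per level below L0.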
With this setup, a read checks only one SSTable per level from L1 to Ln. L0 is the exception due to overlaps, so a read may still need to scan all L0 SSTables. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord Limit the number of levels to two: L0, which may contain overlapping keys. L1, no overlapping keys. Create a folder for each level: one for L0 and one for L1. Keep one global MANIFEST file at the root. You will create a MANIFEST layout for both L0 and L1: L0 remains a simple list of SSTables. L1 allows key-range partitioning. For example: This indicates: L1 is composed of three SSTables: Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . Keys between (included) and (excluded) live in . The main goal of the compaction process is to compact both L0 and L1. At the end, you should merge all the data from L0 and L1 into L1. L0 will be left empty. When L0 reaches five full SSTable files (2,000 entries each), run an L0 → L1 compaction: Open iterators on all L0 and L1 SSTables. Apply the k-way merge algorithm: Comparator: Primary: key. Tie-break (equal key): Prefer L0 over L1. At L0, prefer the newest SSTable. Version order: any record from L0 is newer than records from L1. Within L0, newer files win (same as week 4). Keep at most one record per key (newest wins). Tombstones: because L1 is the bottom level, drop a tombstone if no older value for that key remains in the merge result. Create new L1 SSTables with at most 2,000 entries. When naming new L1 files, make sure they are unique. For example, if L1 contains and , the first SSTable file created should be . Publish atomically: fsync each new file and the L1 directory. Update the MANIFEST atomically. fsync the MANIFEST file. fsync the root directory (the directory containing the MANIFEST file and the L0 and L1 folders). Delete obsolete L1 files, then fsync. Delete all files in L0, then fsync. The logic is unchanged from previous weeks. The only difference is that flush writes to L0 and updates the MANIFEST file in the L0 section. Check the memtable.
If not found, scan all L0 files newest to oldest using the L0 section of the MANIFEST. If not found at L0: Use the L1 section of the MANIFEST to choose the one shard that contains the key’s range, then read only that L1 file. Return the value if found; otherwise, return a miss. There are no changes to the client. Run it against the same file ( put-delete.txt ) to validate that your changes are correct. Introducing leveling has a fundamental impact on deletions. With a single level, compaction sees all versions of every key at once, so a tombstone can be dropped as soon as it has “killed“ every older record for that key. Yet, the rule we mentioned last week holds true: a tombstone can be evicted only after all the data it shadows no longer exists on disk. With multiple levels, compaction must propagate tombstones downward. It’s only at the bottommost level that tombstones can be dropped, because only there can you prove they no longer shadow any other records. As an optional task, make the number of levels configurable: L0, L1, …, Ln: Define a size ratio so each level has a target size larger than the previous one. Keep one directory per level: one for L0, one for L1, and so on up to Ln. Keep a single global MANIFEST. When a level reaches its max number of SSTables (derived from the size ratio), compact that level into the next. Only drop tombstones at the bottommost level Ln. At any intermediate level Li with i < n, propagate the tombstone downward during compaction. Implement scan: Return all keys between (included) and (excluded). Use put-delete-scan.txt to validate that your changes are correct. It introduces the scan keyword. For example: This line means: between (included) and (excluded), the keys are , , (the output will always be sorted) NOTE : If this route conflicts with , rename the single-key route to . That’s it for this week! Your LSM tree is taking shape. You implemented leveling, a key LSM design idea, and refined compaction so reads are tighter and storage stays under control. In two weeks, you will revisit the week 2 choice of JSON for SSTables.
You will switch to block-based SSTables to reduce parsing and I/O overhead and add indexing within each SSTable. We mentioned that, because of key overlaps, a read may still need to scan all L0 SSTables (e.g., on a key miss). This is the main reason why L0 is typically kept small. In general, each level is larger than the one above it by a fixed size ratio (e.g., 10×). Some databases even use less static mechanisms. For instance, RocksDB relies on Dynamic Leveled Compaction , where the size of each level is automatically adjusted based on the size of the oldest (last) level, eliminating the need to define each level’s size statically. Regarding compaction, you should know that in real-world databases, it isn’t done in batch mode across all data. Let’s understand why. Suppose you have four levels and a layout like this for one key: The key exists at L3. The key doesn’t exist at L2. The key is updated at L1. A tombstone is placed at L0. You can’t compact L0 with L1/L2/L3 in one shot; that would mean checking every SSTable against every level. What happens in reality is that compaction is a promotion process. In our example, the tombstone at L0 is promoted to L1. Implementations ensure that it either (a) is compacted together with the L1 SSTable it shadows, or (b) waits until that L1 data is promoted to L2. The same rule repeats level by level, until the tombstone reaches L3 and finally removes the shadowed value. Meanwhile, it’s essential to understand that compaction is crucial in LSM trees. Let’s take some perspective to understand the reason. An LSM tree buffers writes in a memtable and flushes to L0. Compaction merges SSTables across levels to control read amplification. If compaction falls behind, L0 files accumulate, flushes slow down (or stall at file-count thresholds), write latency climbs, and in the worst case, you can observe write pauses. Not because the memtable is “locked,” but because the engine can’t safely create more L0 files until compaction catches up.
This is one of the reasons why the RUM conjecture we introduced last week is important. If you compact too eagerly, you burn a lot of disk I/O and lose the LSM’s write advantage. If you compact too lazily, you incur a penalty on your read path. If you compact everything all the time, you incur a space-amplification penalty during compaction roughly equal to the working set size. Because compaction is so important, most key-value stores support parallel compactions across levels (except L0 → L1, which isn’t parallelized due to overlapping key ranges in L0). You should also be aware that ongoing research keeps improving compaction. For example, the SILK: Preventing Latency Spikes in LSM Key-Value Stores paper analyzes why LSM systems can exhibit high tail latency. The main reason is that limited I/O bandwidth causes interference between client writes, flushes, and compactions. The key takeaway is that not all internal operations are equal. The paper explores solutions such as: Bandwidth awareness: Monitor client I/O and allocate the leftover to internal work dynamically instead of static configuration. Prioritization: Give priority to operations near the top of the tree (flushes and L0 → L1 compaction). Slowdowns there create backpressure that impacts tail latency more than work at deeper levels. Last but not least, what you implemented this week is called level compaction. Other strategies like tiered compaction exist, which merge SSTables based on their size and count rather than fixed levels. You can explore this great resource from Mark Callaghan, which dives deeper into the design trade-offs and performance characteristics of different compaction strategies in LSM trees. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.
❤️ If you enjoyed this post, please hit the like button.
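The k-way merge rules described in this week’s compaction section (newest record wins, L0 beats L1, tombstones dropped at the bottom level) can be sketched with the standard library’s heap-based merge. This is an illustrative sketch, not the course’s reference solution; the record format and tombstone marker are assumptions.

```python
# Sketch of the L0 -> L1 k-way merge: each source is a sorted iterator
# of (key, value_or_TOMBSTONE) records. Sources are passed newest-first
# (L0 files newest to oldest, then L1 shards), so for equal keys the
# first record seen is the winner.
import heapq

TOMBSTONE = None  # stand-in for the on-disk tombstone marker

def _tag(src, rank):
    # Tag each record with its source rank so equal keys sort newest-first.
    for key, value in src:
        yield (key, rank, value)

def compact(sources):
    """Merge sorted sources (newest-first) into one deduplicated stream.
    Tombstones are dropped because the destination is the bottom level."""
    merged = heapq.merge(*[_tag(s, r) for r, s in enumerate(sources)])
    out, last_key = [], object()
    for key, _rank, value in merged:
        if key == last_key:
            continue                   # older version: newest already decided
        last_key = key
        if value is not TOMBSTONE:     # bottom level: nothing left to shadow
            out.append((key, value))
    return out
```

With more than two levels, the tombstone check would instead propagate the marker downward rather than dropping it, as discussed below.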

0 views
Evan Schwartz 2 months ago

Scour Year End Update 2025

I thought about sending out a personalized "Scour Wrapped"... until I got the 7th Wrapped from some random service. So instead, I'll just say Happy New Year and thanks for your support in 2025! 🥂 These were the new features added since the last update in October. Scour now identifies articles that are paywalled and indicates them with a yellow dollar sign next to the domain. In your settings , you can opt to hide paywalled content. If you do, you can also exempt specific domains where you have a subscription so you will see their content even if it is behind the paywall. Thank you to Johnny and Allen for requesting this feature! For anyone interested in the technical details, I wrote a blog post about a neat SQL trick I came across while building this: Short-Circuiting Correlated Subqueries in SQLite . You can also now block content from specific websites. The option to block a domain can be found by clicking the "..." button below each post. You can see and manage your excluded domains in your settings . Thanks to Vahe for this suggestion! If you subscribe to specific feeds (as opposed to scouring all of them), Scour will now recommend other sources for you to follow right in your personalized feed. These recommendations are based on Scour looking for content that matches your interests that you aren't currently getting. You can find more recommendations on your Feeds page . Each feed also now displays its three most recent posts below its description to make it easier to know what you'll get if you subscribe. You can click on the feed's title to see all of the posts from that feed. Thanks to Tiago for this suggestion! By default, clicking on a link to a post will bring you to the original website where it was published. However, if you prefer to read it on Scour, you can read the Preview, which can be found in the "..." menu under each post. Thanks to Linh for this suggestion! 
The filter menu for your feed (accessible via the button next to where it says Your Top Finds) should be clearer and more mobile-friendly. You can filter by time range and toggle between seeing posts from feeds you’ve subscribed to or see posts from everyone’s feeds. Thanks Stefan for the feedback on this! A number of people have told me that they are confused about how the love/like/dislike reactions are used on Scour. I'll work on making this clearer in the future but in the meantime, there's now a section in the FAQs about this. The answer is: Loves and likes are saved to your Likes page, so you can use them to bookmark interesting content. Unlike most content aggregators, Scour does not use reactions to change what shows up in your feed. Instead, reactions are used to generate Interest Recommendations for you. Scour only shows content related to topics you've explicitly chosen. You can also subscribe to other users' Likes as feeds. Everyone's reactions contribute to the Popular Posts page. Here were some of my favorite posts I found on Scour in November and December: Thanks to everyone who wrote about Scour on their blog or website in 2025! This included: If you write about Scour in the future, or if you already did and I didn't include you, please let me know! Thank you to everyone who provided feedback on Scour this year! Specifically, thank you to Aaron, Alberto, Alex K, Alex W, Allen, Andrew D, Andrew M, Andy M, Andy P, Cairin, Cole, Daniel, Elyem, Hary, Imperfect, Jadi, Jeppe, Jesse, Johnny, Jon, Karit, Kilpatrj, Linh, Proudmuslim-dev, Ryan, Sarah, Stefan, Tiago, Tomáš, Tyler, and Vahe. And thank you to all of the anonymous feedback givers as well! Because you made it to the end of the post, here's a little preview of an upcoming feature for you. Let's say you want to only see posts from small websites, like individuals' blogs. You can now try filtering your feed by how many posts each website or feed publishes per month. 
For example, you can use these links to see only posts from quieter domains or quieter feeds . Or, you can try this one to only see articles from larger websites . Let me know what you think! UI for controlling these filters is coming soon! Happy New Year and happy Scouring! - Evan Scour scoured 9,940,460 posts from 15,608 feeds 1,013 new users signed up (welcome!!) 12,620 interests were added, with 6,688 of those from recommendations 26,702 posts were read, 3,023 were liked, and 383 were loved 55 suggestions on the feedback board were completed Paper AI Tigers Build / Buy / Bot More databases should be single-threaded Disks Lie: Building a WAL that actually survives Minsuk Kang: Scour and minifeed are 100X better than Instagram and X (January) Winther: Blog Discovery (June) Daniel Prindii: My Read it later and discoverability systems in 2025 (July) PPC Land: Developer revives RSS with AI while Google targets syndication infrastructure (August) Tomáš Burkert: RSS feeds discovery strategies (October) Alex White: Discovering the Indie Web (November) Matt Maldre: Search engine for blogs (November) Andrew Doran: Tools for discovering the IndieWeb (December)

0 views
Grumpy Gamer 2 months ago

Sqlite Comments

When I started using Hugo for static site generation I lost the ability to have comments, and we all know how supportive the Internet can be, so why wouldn’t you have comments? I wrote a few PHP scripts that I added on to Hugo and I had comments again. I decided to store the comments as flat files so I didn’t complicate things by needing the bloated MySQL. I wanted to keep it as simple and fast as possible. When a comment is added, my PHP script creates a directory (if needed) for the post and saves the comment out as a .json file, named with the current time to make sorting easy. When the blog page is displayed, these files (already sorted thanks to the filename) are loaded and displayed. And it all worked well until it didn’t. Flat files are simple, but they can be hard to search or maintain if they need cleaning up or dealing with after a spam attack. I figured I’d use command-line tools to do all of that, but it’s a lot more cumbersome than I first thought. I missed having them in a SQL database. I didn’t want to install MySQL again, but my site doesn’t get a lot of commenting traffic so I could use SQLite instead. The downside is SQLite write-locks the database while a write is happening. In my case it’s a fraction of a second and wouldn’t be an issue. The second problem I had was that the version of Ubuntu my server was using is 5 years old and some of the packages I wanted weren’t available for it. I tried to update Ubuntu and for reasons I don’t fully understand I couldn’t. So I spun up a new server. Since grumpygamer.com is a static site I only had to install Apache and I was off and running. Fun times. But the comment flat files still bugged me and I thought I’d use this as an opportunity to convert over to SQLite. PHP/Apache comes with SQLite already installed, so that’s easy. A long weekend later, I had rewritten the code to save comments, and everything is back and working. Given that a webserver and PHP already needed to be installed, it isn’t a big deal to use SQLite.
If you’re not comfortable with SQL, it might be harder, but I like SQL.
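The setup described above (one directory per post, timestamp-named .json files) maps naturally onto a single table. As a rough sketch of the schema and a flat-file import (the actual implementation is PHP; the table and field names here are guesses, not the site’s real code):

```python
# Rough sketch of the migration described above: a SQLite comments
# table plus an import of the old timestamp-named .json flat files.
# Schema and JSON field names are illustrative guesses.
import json, sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    id      INTEGER PRIMARY KEY,
    post    TEXT NOT NULL,      -- post slug (the old directory name)
    created INTEGER NOT NULL,   -- unix time (the old file name)
    author  TEXT,
    body    TEXT
);
CREATE INDEX IF NOT EXISTS comments_by_post ON comments (post, created);
"""

def import_flat_files(db, root):
    """Walk root/<post-slug>/<timestamp>.json and insert each comment."""
    n = 0
    for f in sorted(Path(root).glob("*/*.json")):
        data = json.loads(f.read_text())
        db.execute(
            "INSERT INTO comments (post, created, author, body) VALUES (?, ?, ?, ?)",
            (f.parent.name, int(f.stem), data.get("author"), data.get("body")),
        )
        n += 1
    db.commit()
    return n

def comments_for(db, post):
    """Sorted by timestamp, like the old filename-sorted directory listing."""
    return db.execute(
        "SELECT created, author, body FROM comments WHERE post = ? ORDER BY created",
        (post,),
    ).fetchall()
```

Once the comments are in a table, the cleanup jobs that were cumbersome with command-line tools become one-line DELETE statements.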

0 views
matklad 2 months ago

Static Allocation For Compilers

TigerBeetle famously uses “static allocation” . Infamously, the use of the term is idiosyncratic: what is meant is not arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory a TigerBeetle process uses is not hard-coded into the ELF binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without . I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea? Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle. The only inputs and outputs of the system are messages. Each message is finite in size (1 MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory! This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as the inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem. The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrarily many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available.
Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off! Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust , but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes! The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail with at runtime, and it includes a dynamic block allocator to avoid that as long as possible by re-using no longer relevant sectors. So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of the compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite. While limiting the total size of the code-base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrarily large, but the filter program itself must execute in O(1) memory.
With this setup, it is natural to use indexes rather than pointers for “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (compiler sees a new file), but also temporally (compiler sees a new version of an old file). Are there any practical benefits here? I don’t know! But it seems worth playing around with! I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify a compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers leads to simpler code the same way it does for databases?
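The shape being described (an index-addressed output arena fed by bounded chunks) can be made concrete with a toy sketch. This is not TigerBeetle’s or any real compiler’s code; everything below is invented for illustration:

```python
# Toy illustration of the idea above: immutable results accumulate in an
# append-only "output arena" addressed by integer indexes (not pointers),
# while each chunk of input is processed under a fixed size budget.
MAX_CHUNK = 4 * 1024 * 1024  # the per-file limit suggested above (4 MiB)

class OutputArena:
    """Finished results live here; handles are indexes, so the whole
    arena can be serialized to disk as-is and reloaded between runs."""
    def __init__(self):
        self._items = []

    def push(self, item) -> int:
        self._items.append(item)
        return len(self._items) - 1   # handle = index, never a pointer

    def get(self, handle: int):
        return self._items[handle]

def compile_chunk(arena, source: bytes):
    """Process one finite chunk: reject oversized input instead of
    growing scratch memory, append results, return their handles.
    'Compilation' is a stand-in: one output record per source line."""
    if len(source) > MAX_CHUNK:
        raise ValueError("chunk exceeds the fixed input budget")
    return [arena.push(line) for line in source.splitlines()]
```

The point of the sketch is the separation: the arena grows O(N) with total output, while each `compile_chunk` call only ever sees a bounded input.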

0 views