
Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS

Rearchitecting the Thread Model of In-Memory Key-Value Stores with μTPS
Youmin Chen, Jiwu Shu, Yanyan Shen, Linpeng Huang, and Hong Mei
SOSP'25

I love this paper, because it grinds one of my axes: efficient pipeline parallelism on general-purpose CPUs. In many hardware designs, pipeline parallelism is the dominant form of parallelism, whereas data parallelism takes the cake on CPUs and GPUs. It has always seemed to me that there are applications where pipeline parallelism should be great on multi-core CPUs, and here is an example. Fig. 1 illustrates the design space for key-value stores:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

One axis is preemptive vs non-preemptive (cooperative) multi-threading. Preemptive multithreading involves context switches, which are cheap relative to disk reads but expensive relative to DRAM reads. The other axis is how to assign work to threads. Thread per request (TPR) creates a new thread for each request. This approach has been subsumed by thread per queue (TPQ), which uses a static number of threads, each of which dequeues requests from a dedicated queue and executes all of the work for a single request to completion. Finally, there is thread per stage (TPS), which divides the steps necessary to complete a request into multiple pipeline stages, and then divides the pipeline stages among a set of threads. The work discussed here uses a non-preemptive, thread-per-stage architecture.

Pipelining Advantages

A pipelined implementation seems more complicated than an imperative run-to-completion design, so why do it? The key reason is to take advantage of the CPU cache. Here are two examples: As we've seen in other networking papers, a well-designed system can leverage DDIO to allow the NIC to write network packets into the LLC, where they are then consumed by software. Key-value stores frequently have hot tuples, and there are advantages to caching these (example here).
It is hard to effectively cache data in a TPR/TPQ model, because each request runs the entire key-value store request code path. For example, a CPU core may have enough cache capacity to hold network buffers or hot tuples, but not both.

The key disadvantage of a TPS architecture is load balancing: one stage could become the bottleneck, leaving CPU cores idle. The authors propose dynamic reconfiguration of the pipeline based on workload changes. Another challenge with pipelining is implementing efficient communication between cores, because data associated with each request flows down the pipeline with the request itself. Fig. 3 shows the pipeline proposed in this paper:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

The NIC writes request packets into the network buffer (stored in the LLC). The cache-resident layer reads data from this buffer and handles requests involving commonly used keys by accessing the hot index and hot data caches (also in the LLC). The memory-resident layer handles cold keys and values, which are stored in DRAM. One set of threads (pinned to CPU cores) implements the cache-resident layer, and a different set of threads (pinned to other CPU cores) implements the memory-resident layer. An auto-tuner continually monitors the system and adjusts the number of threads assigned to each layer. Section 3.5 describes the synchronization required to implement this adjustment.

The NIC writes request packets into a single queue. The cache-resident threads cooperatively read requests from this queue: with N threads in the pool, each thread claims a deterministic subset of the requests. Next, threads check to see if the key associated with a request is hot (and thus cached in the LLC). Time is divided into epochs. During a given epoch, the set of cached items does not change. This enables fast lookups without costly synchronization between threads.
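A rough model of the epoch-based hot cache (Python for illustration only; the class and method names are my own, not the paper's):

```python
import threading

class HotCache:
    """Epoch-based hot-key cache sketch: readers consult the current
    immutable generation without locks; a background thread builds the
    next generation and swaps it in to start a new epoch."""

    def __init__(self):
        self._current = {}                    # never mutated during an epoch
        self._switch_lock = threading.Lock()  # serializes epoch switches only

    def get(self, key):
        gen = self._current   # a single atomic reference read
        return gen.get(key)   # gen is immutable, so no locking is needed

    def switch_epoch(self, new_hot_items):
        # Only the background tuner calls this; installing the new
        # generation is one reference assignment, so readers never block.
        with self._switch_lock:
            self._current = dict(new_hot_items)

cache = HotCache()
cache.switch_epoch({"user:42": b"alice"})
print(cache.get("user:42"))  # b'alice'
print(cache.get("user:7"))   # None -> forwarded to the memory-resident layer
```

Misses fall through to the next pipeline stage, which matches the paper's split between the cache-resident and memory-resident layers.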
A background thread gathers statistics to determine the set of items to be cached in the next epoch and has the ability to atomically switch to the next epoch when the time comes. The number of hot keys is kept small enough that it is highly likely that hot keys will be stored in the LLC. Requests that miss in the cache-resident layer are passed on to the memory-resident layer for further processing (via the CR-MR queue).

Typically, the LLC is treated as a global resource (shared by all cores). But this particular use case requires that most of the LLC be dedicated to the cache-resident layer. This is accomplished with the help of the PQOS utility from Intel, which uses "Intel(R) Resource Director Technology" to control which ways of the LLC are assigned to each layer.

The memory-resident layer operates on batches of requests. Because the requests are not hot, it is highly likely that each request will require DRAM accesses for index lookups (keys) and data lookups (values). Software prefetching is used to hide DRAM latency during index lookups. When servicing operations, data values are copied directly into the outgoing network buffer.

The CR-MR queue is used to communicate between the two layers. Each (CR thread, MR thread) pair has a dedicated lock-free queue. Enqueue operations use a round-robin policy, each CR thread spreading its messages across the MR threads. Dequeue operations must potentially scan queues corresponding to all possible senders. Multiple requests can be stored per message, to amortize control overhead.

Fig. 7 has throughput results for synthetic workloads (A, B, and C have different ratios of put/get operations); uTPS-T is this work:

Source: https://dl.acm.org/doi/10.1145/3731569.3764794

Dangling Pointers

The pipelining here is coarse-grained, and the design is only optimized for the LLC. I wonder if a more fine-grained pipeline would allow hot data to be stored in L2 caches.
For example, the set of hot keys could be sharded among N cores, with each core holding a different shard in its L2 cache. It seems redundant that this design requires software to determine the set of hot keys, when the hardware cache circuitry already has support to do something like this.
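The CR-MR communication scheme (one queue per sender/receiver pair, round-robin enqueue, receivers scanning all senders) can be modelled like this (a Python sketch; names are mine, and plain deques stand in for the real lock-free queues):

```python
from collections import deque

class CrMrQueues:
    """One dedicated queue per (CR sender, MR receiver) pair, so each
    queue has exactly one producer and one consumer."""

    def __init__(self, n_cr, n_mr):
        self.n_mr = n_mr
        # queues[cr][mr] is written only by CR thread `cr`,
        # read only by MR thread `mr`.
        self.queues = [[deque() for _ in range(n_mr)] for _ in range(n_cr)]
        self.next_mr = [0] * n_cr   # per-sender round-robin cursor

    def enqueue(self, cr, batch):
        # Round-robin policy: spread request batches across MR threads.
        mr = self.next_mr[cr]
        self.queues[cr][mr].append(batch)
        self.next_mr[cr] = (mr + 1) % self.n_mr

    def dequeue(self, mr):
        # A receiver must scan the queues of all possible senders.
        for cr_queues in self.queues:
            if cr_queues[mr]:
                return cr_queues[mr].popleft()
        return None

q = CrMrQueues(n_cr=2, n_mr=2)
q.enqueue(0, ["get k1", "get k2"])   # goes to MR thread 0
q.enqueue(0, ["put k3"])             # round-robin: goes to MR thread 1
print(q.dequeue(1))                  # ['put k3']
```

Batching several requests per message, as the paper does, amortizes the per-message control overhead.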


JIT, episode III: warp speed ahead

In our first JIT episode, we discussed how we could, using copy-patch, easily create a JIT compiler for PostgreSQL, with a slight improvement in performance compared to the PostgreSQL interpreter. In our second episode, I talked about the performance wall and how hard it was to achieve a real leap in performance compared to the interpreter. But it ended with a positive outlook, a nice performance jump that I was preparing at that moment…

The interpreter will run each opcode for every record it has to process. Everything that could be done only once, instead of once per record, is better done once, obviously. And this is where a JIT can beat it. The JIT compiler can choose optimizations that would require checks at each opcode in the interpreter, and would thus be self-defeating there. For instance, I mentioned creating inlined opcodes for common function calls like int4eq: in the interpreter, replacing the indirect call to int4eq with a comparison of the function pointer followed by an inlined version would indeed be silly, since the comparison is going to waste a lot of time already.

So, what can't the interpreter do? It sure can't easily remove indirect calls, but this is a 1% performance gain, 2% at most. You won't get to the headlines with that, right? Well, when in doubt, look at the past…

A decade ago, I worked at a small company where I heard the weirdest thing ever regarding system performance: "our application is slower when built in 64-bit mode because the bigger pointer size makes it slower". I didn't buy this, spent two days digging into the code, and found that it was the opposite: 64 bits brought such a performance improvement that the entire system collapsed on a mutex that guarded a core structure in the application… Removing the mutex made the application fly in both 32 and 64 bits, with 64 bits beating 32 bits, obviously. But why is 64 bits faster? We are talking databases here, so let's have a look at a table, shall we?
http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0870F10_K17_Matisse7_InstLatX64.txt (I know uBlock doesn't like this domain, but this text document there is good, I promise)

On my CPU, loading a 64-bit value into a register requires twice the time it takes to load a 32-bit value. So surely 64 bits must be slower than 32 bits! Except the switch from 32 to 64 bits also fixed one of the biggest issues with x86: the lack of registers. x86 never grew beyond its 16-bit roots and had 8 general-purpose registers, few compared to PowerPC (32), SPARC (31), or ARM (15). When AMD introduced 64 bits to the x86 world, they doubled the number of registers, from a ridiculous 8 to an acceptable 16. And from this came a huge performance boost.

Memory = slow
Registers = fast

Ok, more seriously, I will not start writing about this. Even if it is getting old and outdated, the "What Every Programmer Should Know About Memory" paper by Ulrich Drepper is still a great read if you're interested in the topic. The only thing that matters for us is that, even with a lot of cache, writing to memory is slower than writing to a register. If I look at some measurements for my Zen2 CPU, a comparison between two registers takes less than a cycle (0.33c, it seems), but if data has to be loaded from the L1 cache you can add 4 cycles, 12 cycles from L2 and 38 from L3. Way, way slower: 12 to 115 times slower.

Registers are used automatically by your compiler. When you write a function, it will automatically figure out which variables to move to registers, and when; if you don't have enough registers for your entire function, it will spill registers to the stack as needed. If you are interested in this, there are many fun register allocation algorithms and many Wikipedia pages covering the topic.

Let's look at one of the most basic opcodes, EEOP_SCAN_VAR, which takes a value from a scan slot in order to use it later. This is indeed a memory write. Could the interpreter get rid of this?
Well, I think it could, but it would be a major undertaking. If we had a variable, stored in a register by the compiler, we could store the value there, sure, and a next step could fetch it from that place. But what if the next step needs another value instead? Then we would have to spill the value back to memory, and checking for this at each step is going to kill performance. It may be possible to rewrite the entire interpreter to match a register-based VM, but I cannot be sure it would be worth it. And this is the path to beating the interpreter: we can check many things before running the opcodes, trace memory accesses, and use registers as much as possible.

The great benefit of copy-patch is that you (almost) don't write assembly code. Porting it to arm64 required me to learn about ARM64-specific relocations and how to encode immediate values in some ARM opcodes, but nothing more. But a big downside is that you don't write assembly code. And, well, if you don't write the assembly code, you don't control register allocation. There is a simple way around this, though; let's talk a bit about calling conventions.

When function A is called, how do you pass parameters to it? If you learned some x86 assembly at school, you will answer "on the stack" and win a free ticket to an assembly refresher course. When AMD64 was introduced, the SysV calling convention took over and completely changed the way functions are called: the first six integer or pointer parameters are passed through general-purpose registers, and the first floating-point parameters are passed through FP registers.

Each opcode is defined as a function with three parameters (matching the function signature expected by PostgreSQL). While respecting the SysV calling convention, this leaves us three registers that the compiler will keep alive across the opcode calls, and will spill automatically if needed.
An alternative would have been to use the preserve_none calling convention, but for this first version I did not need it (and I still have many calls to PostgreSQL functions that will use the SysV calling convention).

But three registers means… two values only. Sadly, we transitioned from 32 to 64 bits, not to 65 bits… 65 bits would have given us one bit to represent NULL/NOT NULL values: 0 would not have been NULL, and 1 + NULL would be NULL! But we will not rewrite history here, so we are going to use one register as a set of null flags, one bit per value register (so we are wasting 62 bits here). Our opcode functions are thus going to have three new parameters: char nullFlags, intptr_t reg0, intptr_t reg1. Jumping to the next opcode will require passing these values around.

Great, we keep registers around; now what about using them? As a reminder, here are the opcodes we are dealing with for our previous "SELECT * FROM demo WHERE a = 42". This code doesn't use our registers. I rewrote every opcode implementation to use the registers instead of writing to memory. In this version, all memory accesses have been replaced with register accesses, hurray! But this will work only for a simple query like this one. Once we start having more variables to store, we will need a spilling mechanism, a way to swap registers…

Another issue appears when you call, for instance, a non-inlined function. The EEOP_FUNCEXPR opcode is defined as: Parameters are fed through the fcinfo_data structure. The other opcodes write directly into this structure during normal interpreter execution. It means that we must check all memory accesses from the opcodes and make sure that any expected memory access from an opcode implementation will not end up in a memory location we didn't write to.
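To make the null-flags-plus-two-registers scheme concrete, here is a toy model (Python purely as notation; the real implementation passes these values in hardware registers via C function parameters, and the opcode logic below is my simplified sketch, not PostgreSQL's):

```python
# Values travel between opcodes as explicit parameters, standing in for
# the registers kept live across opcode calls. Bit i of null_flags says
# whether "register" i currently holds SQL NULL.

def op_scan_var(value, is_null, null_flags, reg0, reg1):
    # EEOP_SCAN_VAR-like: load the scanned column into reg0,
    # recording its nullness in bit 0 of the null-flags register.
    null_flags = (null_flags & ~0b01) | (0b01 if is_null else 0)
    return null_flags, value, reg1

def op_const(value, null_flags, reg0, reg1):
    # Load a (never-NULL) constant into reg1.
    return null_flags & ~0b10, reg0, value

def op_int4eq_strict(null_flags, reg0, reg1):
    # Strict function: any NULL input makes the result NULL (None here).
    if null_flags & 0b11:
        return None
    return reg0 == reg1

# WHERE a = 42, with a = 42 in the current row:
nf, r0, r1 = op_scan_var(42, False, 0, 0, 0)
nf, r0, r1 = op_const(42, nf, r0, r1)
print(op_int4eq_strict(nf, r0, r1))  # True

# Same query, but a IS NULL in the current row:
nf, r0, r1 = op_scan_var(0, True, 0, 0, 0)
nf, r0, r1 = op_const(42, nf, r0, r1)
print(op_int4eq_strict(nf, r0, r1))  # None
```

The point of the exercise: nothing here touches a shared memory structure between opcodes; everything a later opcode needs arrives through the parameter list.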
I started with a small experiment, a "variabilizer", that would look at each opcode and figure out, through each memory access (read/write), all the variables used in a run, their lifetimes… It can even detect constants stored in memory (memory that is only read from).

I then refactored a lot of the compiler code over the past weeks. I started by moving the specialized opcode definitions and dispatch to the stencil library only, removing any special cases I had in the compiler part. This required #defining a way for the C code in stencils.c to generate more C code in the built-stencils.h file through the stencil-builder.py script. Fun, but complicated and hairy stuff. After that, I started rewriting the stencil signature and several opcodes to use registers instead, and wrote a "contract" for each opcode, defining what was expected in each register, what will be written in each register, and what is going to be read/written in memory. With all these changes, here is what the FUNCEXPR_STRICT opcode optimized for int4eq looks like. More metadata than actual code… But if that's what it takes to get a good performance boost, then here we go. After ingesting that, the compiler can fill in the registers with the proper values when needed.

Another big issue that I'm not covering here is that doing this requires some minimal control flow analysis. For my simple benchmark, this is not a problem, and the code is getting ready for a wider range of queries, but I did not want to cover this and preferred focusing on the registers work…

Well… This is the optimization I mentioned in the previous article. So, on our stupid benchmark, running a simple SELECT * FROM demo WHERE a = 42 ten times on a 10-million-row table… As you can see, this is exactly what we expected: fewer instructions, sure, but that is not what gave us the performance boost here.
What changed is the number of cycles: the same instruction now uses a register instead of a memory access, so several cycles are saved per instruction. The LLVM JIT can achieve about the same run time here, but it takes some time to generate the bitcode (less than 1 ms), then several ms to analyze it, optimize it, and finally translate it to machine code. And this makes the LLVM JIT slower here than copyjit, while copyjit still has some room for improvement (I've yet to look at tuple deforming).

See you in the next one; I think we already know what the topic will be… Well, after I finish porting every opcode to these new metadata, test more stuff, and likely figure out some more optimizations along the way…

PS: as said previously, help is welcome, the code is FOSS as usual, on github, and I would gladly accept any sponsoring, mission, anything that could give me more time to work on this…


Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age
Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann
SIGMOD'14

The giant upon whose shoulders this paper rests is Volcano. Parallelism in Volcano is achieved through a proper separation of concerns. Volcano contains many database operators, most of which are blissfully unaware of parallelism. A handful of operators in a query plan exist only to enable parallelism (for example, an operator could implement pipeline parallelism, or partition data between threads). Generally speaking, an elegant separation of concerns is good for performance. However, the thesis of morsel-driven parallelism is that this is not true: deeply integrating the notion of parallel execution into each operator is the way to go for OLAP. The system described in this paper is named HyPer. Fig. 2 below illustrates how HyPer decomposes a single query into three pipelines:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

The leftmost pipeline scans the input relation and applies a filter to it. The middle pipeline scans and filters the input relation . The rightmost pipeline scans and filters , joins the result with , and finally joins the result of the first join with . A system like Volcano would be tempted to scan and in parallel. Not so with HyPer: the pipelines which make up a query plan are executed serially.

Each relation (both inputs and temporary data) is divided into morsels. A morsel is a group of ~100,000 tuples. Each morsel resides on a single NUMA node (indicated by colors in the figures). Fig. 3 illustrates how HyPer uses morsel-level parallelism to implement the left pipeline (scan and filter T):

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

The pipeline that processes T operates in two phases. In the first phase, a pool of threads (each pinned to a core) collectively processes all morsels in T. When a thread comes up for air, it grabs another morsel of input.
Threads are assigned to a NUMA node, and threads prefer to process morsels assigned to the same NUMA node. If no morsels of matching color are available, then a thread will reluctantly process a morsel from another NUMA node. During the first phase, each thread writes filtered tuples into a thread-local storage area. Once all input tuples have been processed, a hash table is created (conveniently, the hash table can be sized well because the number of tuples that must be stored is known). This hash table is global (i.e., not NUMA aware). In the second phase, tuples are inserted into the hash table.

The HyPer hash table is designed to allow lock-free insertions, along with a fast path for the common case where a probe operation yields no hits. The hash table uses chaining, with very special pointers used to point to the next element in the chain. The lower 48 bits of each pointer in the hash table contain the memory address that is being pointed at; the upper 16 bits are a tag. Think of the tag as a 16-bit Bloom filter describing the set of elements in the sub-list that the pointer points to. When the hash table is probed, a hash of the join key from the probe tuple is used both to determine which chain to search, and to stop the search early if no unexamined list element could possibly contain the join key. Because the pointer and the tag are packed together into 64 bits, a compare-and-swap instruction can be used to insert elements into the hash table without using an OS lock. If the hash table is large enough, then most executions of the compare-and-swap instruction should succeed. Fig. 7 illustrates the hash table data structure and the insertion algorithm:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

It is a bit odd that the design in this paper goes to such great lengths to avoid cross-socket (NUMA) reads from main memory, and yet the hash table is not NUMA aware. I think that the 16-bit tags are key here.
If the set of head pointers for all buckets in the hash table is small enough to fit into an L2 cache, then this data can be efficiently replicated into all L2 caches. As long as the hash table hit rate is low enough, the number of cross-socket memory accesses during probe operations will be low.

Fig. 11 shows throughput for all 22 TPC-H queries, for 4 different configurations:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

It is interesting how NUMA-awareness matters a lot for some queries, but not all. Fig. 10 shows the authors' NUMA model:

Source: https://dl.acm.org/doi/10.1145/2588555.2610507

What is interesting here is that a similar setup exists within each socket. Say a socket contains 32 cores and 4 memory controllers. Those cores and memory controllers will be laid out in a grid, with a network connecting them. I wonder if there is performance to be gained by paying attention to the intra-socket layout (e.g., cores on the left side of the chip should only access memory controllers on the left side of the chip).
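The tagged-pointer trick is easy to model. In this Python sketch (names are mine, and the paper's exact hash layout may differ), the top 16 bits of a 64-bit word hold a tiny Bloom filter over the chain and the low 48 bits hold the pointer:

```python
MASK48 = (1 << 48) - 1

def tag_bit(key_hash):
    # Use 4 bits of the hash to pick one of the 16 tag positions.
    return 1 << ((key_hash >> 48) & 0xF)

def pack(ptr, tag):
    # Upper 16 bits: Bloom-filter tag. Lower 48 bits: pointer.
    return (tag << 48) | (ptr & MASK48)

def insert_head(head_word, new_ptr, key_hash):
    # The new head points at the old chain; its tag is the old tag plus
    # this key's bit. In HyPer the whole word is swapped in with one
    # 64-bit compare-and-swap, which is what makes the insert lock-free.
    new_tag = (head_word >> 48) | tag_bit(key_hash)
    return pack(new_ptr, new_tag)

def may_contain(head_word, key_hash):
    # Early-exit probe: if the key's tag bit is clear, no element in
    # this chain can match, so we skip walking it entirely.
    return ((head_word >> 48) & tag_bit(key_hash)) != 0

h1 = hash("key-1") & ((1 << 64) - 1)
head = insert_head(0, 0x7F00_0000_1000, h1)
print(may_contain(head, h1))   # True
print(hex(head & MASK48))      # 0x7f0000001000
```

A probe that misses the tag never dereferences the chain, which is exactly the fast no-hit path the paper describes.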

The Coder Cafe · 1 week ago

Build Your Own Key-Value Storage Engine—Week 3

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It's hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda:
- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations
- Week 3: Durability with Write-Ahead Logging

Last week, you built the first version of an LSM: an in-memory memtable for recent writes, immutable SSTables on disk, and a MANIFEST file listing the SSTable files. However, if the database crashes, data in the memtable would be lost. This week, you will focus on durability by introducing Write-Ahead Logging (WAL).

A WAL is an append-only file on disk that records the same operations you keep in memory. How it works: on write, record the operation in the WAL and the memtable; on restart, read the WAL from start to end and apply each record to the memtable. Introducing a WAL is not free, though. Writes are slower because each write also goes to the WAL. It also increases write amplification, the ratio of data written to data requested by a client.

Another important aspect of durability is when to synchronize a file's state with the storage device. When you write to a file, it may appear as saved, but the bytes may sit in memory caches rather than on the physical disk. These caches are managed by the OS's filesystem, an abstraction over the disk. If the machine crashes before the data is flushed, you can lose data. To force the data to stable storage, you need to call a sync primitive. The simple, portable choice is to call fsync, a system call that flushes a file's buffered data and required metadata to disk.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server.

For the WAL data format, you won't use JSON like the SSTables, but NDJSON (Newline-Delimited JSON). It is a true append-only format with one JSON object per line.
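The write and recovery paths described above can be sketched like this (Python as illustration; the series does not prescribe a language, and the record field names below are my own choice):

```python
import json, os, tempfile

def wal_append(path, key, value):
    # One NDJSON record per write, forced to stable storage with
    # fsync before the write is acknowledged.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def wal_replay(path):
    # Crash recovery: rebuild the memtable by re-applying every
    # record from start to end.
    memtable = {}
    if not os.path.exists(path):
        return memtable
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "put":
                memtable[rec["key"]] = rec["value"]
    return memtable

wal = os.path.join(tempfile.mkdtemp(), "wal.ndjson")
wal_append(wal, "k1", "v1")
wal_append(wal, "k1", "v2")   # the later write wins on replay
wal_append(wal, "k2", "v3")
print(wal_replay(wal))        # {'k1': 'v2', 'k2': 'v3'}
```

Note the write amplification in action: every acknowledged put costs a WAL append plus a sync, on top of the memtable update.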
Write path:
1. Append a record to the WAL file, opened in append mode: set the operation field, and the key and value fields, from the request.
2. Update the memtable with the same logic as before: if the key exists, update the value; otherwise, create a new entry.
3. Acknowledge the HTTP request.

Recovery path:
1. Create an empty WAL file if it doesn't exist.
2. Replay the WAL from start to end. For each valid line, apply it to the memtable.

Keep the same flush trigger (2,000 entries) and the same logic (stop-the-world operation) as last week:
1. Write the new SSTable: flush the memtable as a new immutable JSON SSTable file with keys sorted (same as before); fsync the SSTable file; fsync the parent directory of the SSTable to make the new filename persistent.
2. Update the MANIFEST atomically: read the current MANIFEST lines into memory and append the new SSTable filename; write the entire list to a temporary file; rename it over the MANIFEST; fsync the parent directory of the MANIFEST.
3. Reset the WAL: truncate the WAL to zero length; fsync the WAL file.

If the server is unavailable, do not fail. Retry indefinitely with a short delay (or exponential backoff). To assess durability: run the client against the same input file (put.txt), stopping and restarting your database randomly during the run. Your client should confirm that no acknowledged writes were lost after recovery.

Add a per-record checksum to each WAL record. On startup, verify records and stop at the first invalid/truncated one, discarding the tail. For reference, ScyllaDB checksums segments using CRC32; see its commitlog segment file format for inspiration.

Regarding the flush process, if the database crashes after step 1 (write the new SSTable) and before step 2 (update the MANIFEST atomically), you may end up with a dangling SSTable file on disk. Add a startup routine to delete any file that exists on disk but is not listed in the MANIFEST. This keeps the data directory aligned with the MANIFEST after a crash.

That's it for this week! Your storage engine is now durable.
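The per-record checksum idea can be sketched like so (Python as illustration; the hex-CRC32-prefix framing is my own, not a prescribed format, and ScyllaDB's commitlog layout differs):

```python
import json, zlib

def encode_record(rec):
    # Prefix each NDJSON line with the CRC32 of its payload.
    payload = json.dumps(rec)
    return f"{zlib.crc32(payload.encode()):08x} {payload}\n"

def replay_checked(lines):
    # Apply records until the first corrupt or truncated line,
    # then discard the tail from that point on.
    records = []
    for line in lines:
        try:
            crc_hex, payload = line.rstrip("\n").split(" ", 1)
            if int(crc_hex, 16) != zlib.crc32(payload.encode()):
                break
            records.append(json.loads(payload))
        except ValueError:  # missing separator, bad hex, or bad JSON
            break
    return records

good = encode_record({"op": "put", "key": "a", "value": "1"})
corrupt = good.replace('"put"', '"pUt"', 1)  # one flipped byte, stale CRC
print(replay_checked([good, corrupt, good]))  # only the first record survives
```

Stopping at the first bad record is the safe policy: anything after a torn write may be garbage, and those writes were never acknowledged anyway.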
On restart, data that was in the memtable is recovered from the WAL. This is made possible by fsync and the atomic update of the MANIFEST. Deletion is not handled yet, and in the worst case a miss can read all SSTables, which quickly becomes highly inefficient. In two weeks, you will add a delete endpoint and learn how SSTables are compacted so the engine can reclaim space and keep reads efficient.

In your implementation, you used fsync as a simple "make it durable now" button. In practice, the OS offers finer control both over what you sync and when you sync.

What: a data-only sync (or opening the file with the corresponding flag) persists the data without pushing unrelated metadata, which is usually what you want for WAL appends. You can go even further and sync only the data you wrote while bypassing the page cache, but that comes with extra complexity.

When: while calling a sync primitive after every request is offered by systems that promise durability, it is often not the default. Many databases use group commit, which batches several writes into one sync call to amortize the cost while still providing strong guarantees. For additional information, see A write-ahead log is not a universal part of durability. For example, RocksDB provides options for tuning WAL behavior to meet the needs of different applications: synchronous WAL writes (what you implemented this week), group commit, or no WAL writes at all. If you want, you can also explore group commit in your implementation and its impact on durability and latency/throughput, since this series will not cover it later.

Also, you should know that, since a WAL adds I/O to the write path, storage engines use a few practical tricks to keep it fast and predictable. A common one is to preallocate fixed-size WAL segments at startup to:
- Avoid the penalty of dynamic allocation.
- Prevent write fragmentation.
- Align buffers for direct I/O (enabled by an open(2) flag that bypasses the OS page cache).

Missing direction in your tech career?
At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time.

Robin Moffatt 1 week ago

Using Graph Analysis with Neo4j to Spot Astroturfing on Reddit

Reddit is one of the longer-standing platforms on the internet, bringing together folk to discuss, rant, grumble, and troll others on all sorts of topics, from Kafka to data engineering to nerding out over really bright torches to grumbling about the state of the country—and a whole lot more. As a social network it’s a prime candidate for using graph analysis to examine how people interact—and in today’s post, hunt down some sneaky shills ;-) I’ve loaded data for several subs into Neo4j, a graph database. Whilst an RDBMS is great for digging into specific users or posts, aggregate queries, and so on, graph excels at complex pattern matching and recursive relationships. It’s a case of the best tool for the job; you can write recursive SQL instead of using a graph, it’s just a lot more complicated. Plus the graphical tools I’ll show below are designed to be used with Neo4j or other property graph databases. In Neo4j the nodes (or vertices) are user, subreddit, comment, and post. The edges (or relationships) are how these interact. For example:

- a user [node] authored [edge] a post [node]
- a user [node] posted in [edge] a subreddit [node]

These relationships can be analysed independently, or combined. Let’s familiarise ourselves with graph visualisations and queries. In an RDBMS we use SQL to describe the data that we want to return in a query. Neo4j uses Cypher, which looks a bit like SQL but describes graph relationships.
Here’s a query to show the user nodes: Neo4j includes a visualisation tool, which shows the returned nodes: We can add predicates, such as matching on a particular node property: You can also look at the raw data: If we zoom in a bit to the previous query results we’ll see that it’s also showing the edges that have been defined, indicating a relationship between some of the nodes: Let’s build on the above predicate query to find my username and any users that I’ve interacted with: I’m going to head over to a different tool for visualising the data since the built-in capabilities in the free version of Neo4j are too limited for where we’re going with it. Data Explorer for Neo4j is a really nice tool from yWorks. It connects directly to Neo4j and can either use Cypher queries to pull in data, or directly search nodes. The first reason I like using it is the flexibility it gives for laying out the data. Here is the same set of data as above, but shown in different ways: One of the cool things that graph analysis does for us is visualise patterns that are not obvious through regular relational analysis. One of these is a form of astroturfing. Since the LLMs (GPT, Claude, etc) are trained on data that includes Reddit, it’s not uncommon now to see companies trying to play the game (just like they did with keyword-stuffing with white text on a white background for Google in the old days) and 'seed' Reddit with positive content about their product. For example, genuine user A asks "what’s the best tool for embedding this nail into a piece of wood". Genuine user B suggests "well, a hammer, DUUUHHH" (this is Reddit, after all). The Astroturfer comes along and says "What a great question! I’ve been really happy with ACME Corp’s Screwdriver! If you hold it by the blade you’ll find the handle makes a perfect tool for hitting nails.
" Astroturfing also includes "asked and answered" (although not usually from the same account; that would be too obvious):

Astroturfer A: "Hey guys! I’m building a house and looking for recommendations for the best value toolkit out there. Thanks!"

Astroturfer B: "Gosh, well I really love my ACME Corp’s Toolbelt 2000, it is really good, and I’ve been very happy with it. Such good value too!"

One of the cornerstones of Reddit is the account handle—whilst you can choose to identify yourself (as I do), you can also stay anonymous behind an arbitrary handle. This means that what one might do on LinkedIn (click on the person’s name, figure out their company affiliation) often isn’t an option. This is where graph analysis comes in, because it’s great at both identifying and visualising patterns in behaviour that are not so easy to spot otherwise. Poking around one of the subreddits using betweenness analysis I spotted this set of three users highlighted: The accounts picked up here are key to the particular activity on the sub; but that in itself isn’t surprising. You often get key members of a community who post the bulk of the content. But, digging into these particular accounts I saw this significant pattern. The three users are shown as orange boxes; posts are blue and comments are green: It’s a nice little network of one user posting with another commenting—how helpful! To share the work they each take turns writing new posts and replying to others. Each post generally has one and only one comment, usually from one of the others in the group. You can compare this to a sub in which there is much more organic interaction: Most users tend to just post replies, some only contribute new posts, and so on.
Definitely not the nicely-balanced to-and-fro on the unnamed sub above ;)
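Graph databases make this kind of pattern matching declarative, but the underlying idea can be illustrated with a toy heuristic in Python: count how often accounts comment on each other’s posts and flag suspiciously reciprocal pairs. This is only an illustrative sketch, not the Cypher or betweenness analysis used above:

```python
from collections import defaultdict

def reciprocal_pairs(posts, comments, min_each=2):
    """Flag pairs of accounts that repeatedly comment on each other's posts.

    posts:    {post_id: author}
    comments: list of (commenter, post_id)
    Returns pairs where each side has commented on the other's posts at
    least `min_each` times -- a toy heuristic for the to-and-fro pattern.
    """
    counts = defaultdict(int)  # (commenter, author) -> number of comments
    for commenter, post_id in comments:
        author = posts[post_id]
        if commenter != author:
            counts[(commenter, author)] += 1
    flagged = set()
    for (a, b), n in counts.items():
        if n >= min_each and counts.get((b, a), 0) >= min_each:
            flagged.add(frozenset((a, b)))
    return flagged
```

A real analysis would also weigh how exclusive the interactions are (organic accounts talk to many people; shill rings mostly talk to each other), which is where centrality measures like betweenness come in.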

Simon Willison 2 weeks ago

sqlite-utils 4.0a1 has several (minor) backwards incompatible changes

I released a new alpha version of sqlite-utils last night - the 128th release of that package since I started building it back in 2018. sqlite-utils is two things in one package: a Python library for conveniently creating and manipulating SQLite databases and a CLI tool for working with them in the terminal. Almost every feature provided by the package is available via both of those surfaces. This is hopefully the last alpha before a 4.0 stable release. I use semantic versioning for this library, so the 4.0 version number indicates that there are backward incompatible changes that may affect code written against the 3.x line. These changes are mostly very minor: I don't want to break any existing code if I can avoid it. I made it all the way to version 3.38 before I had to ship a major release and I'm sad I couldn't push that even further! Here are the annotated release notes for 4.0a1. This change is for type hint enthusiasts. The Python library used to encourage accessing both SQL tables and SQL views through the `db["name"]` syntactic sugar - but tables and views have different interfaces, since there's no way to handle an insert on a SQLite view. If you want clean type hints for your code you can now use the `db.table()` and `db.view()` methods instead. A new feature, not a breaking change. I realized that supporting a stream of lists or tuples as an option for populating large tables would be a neat optimization over always dealing with dictionaries, each of which duplicated the column names. I had the idea for this one while walking the dog and built the first prototype by prompting Claude Code for web on my phone. Here's the prompt I used and the prototype report it created, which included a benchmark estimating how much of a performance boost could be had for different sizes of tables. I was horrified to discover a while ago that I'd been creating SQLite columns called FLOAT but the correct type to use was REAL! This change fixes that. Previously the fix was to ask for tables to be created in strict mode.
As part of this I also figured out recipes for using as a development environment for the package, which are now baked into the Justfile. This one is best explained in the issue. Another change which I would have made earlier but, since it introduces a minor behavior change to an existing feature, I reserved it for the 4.0 release. Back in 2018 when I started this project I was new to working in-depth with SQLite and incorrectly concluded that the correct way to create tables and columns named after reserved words was like this: That turned out to be a non-standard SQL syntax which the SQLite documentation describes like this: A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility. Unfortunately I baked it into the library early on and it's been polluting the world with weirdly escaped table and column names ever since! I've finally fixed that, with the help of Claude Code, which took on the mind-numbing task of updating hundreds of existing tests that asserted against the generated schemas. The above example table schema now looks like this: This may seem like a pretty small change but I expect it to cause a fair amount of downstream pain purely in terms of updating tests that work against tables created by sqlite-utils! I made this change first in LLM and decided to bring it to sqlite-utils for consistency between the two tools. One last minor ugliness that I waited for a major version bump to fix. Update: Now that the embargo has lifted I can reveal that a substantial amount of the work on this release was performed using a preview version of Anthropic's new Claude Opus 4.5 model. Here's the Claude Code transcript for the work to implement the ability to use an iterator over lists instead of dictionaries for bulk insert and upsert operations.
- Breaking change: `db[name]` now only works with tables. To access a SQL view use `db.view(name)` instead. (#657)
- The `insert_all()` and `upsert_all()` methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See Inserting data from a list or tuple iterator for details. (#672)
- Breaking change: The default floating point column type has been changed from `FLOAT` to `REAL`, which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. (#645)
- Now uses in place of for packaging. (#675)
- Tables in the Python API now do a much better job of remembering the primary key and other schema details from when they were first created. (#655)
- Breaking change: The and mechanisms no longer skip values that evaluate to . Previously the option was needed; this has been removed. (#542)
- Breaking change: Tables created by this library now wrap table and column names in `"double quotes"` in the schema. Previously they would use `[square brackets]`. (#677)
- The `--functions` CLI argument now accepts a path to a Python file in addition to accepting a string full of Python code. It can also now be specified multiple times. (#659)
- Breaking change: Type detection is now the default behavior for the `insert` and `upsert` CLI commands when importing CSV or TSV data. Previously all columns were treated as `TEXT` unless the `--detect-types` flag was passed. Use the new `--no-detect-types` flag to restore the old behavior. The `SQLITE_UTILS_DETECT_TYPES` environment variable has been removed. (#679)
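To see why the list/tuple iterator support saves work, here is the idea sketched with the standard library’s sqlite3 module: the first item carries the column names once, instead of repeating them in every dictionary. This is an illustrative sketch, not sqlite-utils’ actual implementation (it also uses the standard `"double quote"` identifier escaping discussed above):

```python
import sqlite3

def insert_rows(conn, table, rows):
    """Bulk-insert an iterator whose first item is the column-name header.

    Mimics the idea behind sqlite-utils 4.0's list/tuple iterator support:
    column names appear once, instead of being repeated in every dict.
    Illustrative sketch only, not sqlite-utils' implementation.
    """
    rows = iter(rows)
    columns = next(rows)  # first item: list/tuple of column names
    placeholders = ", ".join("?" for _ in columns)
    quoted = ", ".join(f'"{c}"' for c in columns)  # standard SQL quoting
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})', rows
    )

conn = sqlite3.connect(":memory:")
insert_rows(conn, "chickens", [("id", "name"), (1, "Blue"), (2, "Snowy")])
```

Because `rows` stays an iterator, the data can stream through without ever materializing a list of per-row dictionaries in memory.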

matklad 2 weeks ago

TigerBeetle Blog

Continuing the tradition, I’ve also been blogging somewhat regularly on TigerBeetle’s blog, so you might want to check those articles out or even subscribe (my favorite RSS reader is RSSSSR): Today’s post is a video version of Notes on Paxos! https://tigerbeetle.com/blog/ https://tigerbeetle.com/blog/atom.xml

Jack Vanlightly 3 weeks ago

Have your Iceberg Cubed, Not Sorted: Meet Qbeast, the OTree Spatial Index

In today’s post I want to walk through a fascinating indexing technique for data lakehouses which flips the role of the index in open table formats like Apache Iceberg and Delta Lake. We are going to turn the tables on two key points:

- Indexes are primarily for reads. Indexes are usually framed as read optimizations paid for by write overhead: they make read queries fast, but inserts and updates slower. That isn’t the full story, as indexes also support writes (for example, faster updates and deletes in primary-key tables), but the dominant mental model is that indexing serves reads while writes pay the bill.
- OTFs don’t use tree-based indexes. Open-table format indexes are data-skipping indexes scoped to data files, or even blocks within data files. They are a loose collection of column statistics and Bloom filters.

Qbeast, a start-up with a presence here in Barcelona where I live, is reimagining indexes for open table formats, showing that neither assumption has to be true. Let’s dive in. A few weeks ago I wrote Beyond Indexes: How Open Table Formats Optimize Query Performance, which describes how the open table formats don’t use monolithic tree-based indexes as RDBMSs do; instead, they optimize performance via effective pruning, which in turn is boosted by a data layout that matches the most important queries. The open-table formats give us two logical levers for optimizing layout:

- Partitioning
- Sort ordering

Together, these form what is often called clustering: the way a table physically organizes data for efficient scanning by clustering similar data together. Partitioning is the first major clustering lever in Iceberg and Delta tables. It divides a table into logical groups based on one or more columns so that rows with the same partition key values are stored together. This creates data locality, allowing the engine to quickly identify which partitions match a query filter (e.g., WHERE EventDate = '2025-10-01') and skip the rest.
That process, called partition pruning, avoids scanning irrelevant data and greatly speeds up queries. Within partitions, we can sort the data using a sort order. We can use one or more columns (including transforms of columns) as the sort order, which determines the order of rows in data files, and even across data files after compaction work (within a given partition). The Iceberg spec allows you to specify multiple columns as a lexicographical sort order, and Delta goes further by supporting Z-order. However, Spark can also compact Iceberg using Z-order (it’s just not in the spec). Let’s take an example of rows with two indexed columns, x and y, where x has the domain a-d and y has the domain 1-4, producing 16 (x, y) pairs, such as (a, 1), (a, 2) ... (d, 4). When you sort a dataset lexicographically by multiple columns, the data system arranges the rows first by x, and then by y within each x group. That works fine if most queries filter heavily on the first column, but it doesn’t take into account how the data relates across both dimensions. Two records that are close together in (x, y) space might end up far apart in the file if their x values differ slightly. Fig 1. Lexicographical order of two dimensions, which follows the “sort by x then by y” order. Z-ordering improves multidimensional sorting by weaving the bits of all indexed columns together into a single scalar value. Sorting by this scalar value produces a Z-shaped curve which fills the dimensional space (hence Z-order being what is known as a space-filling curve). The result is an ordering where items that are close in N-dimensional space remain close in the 1-D key space as well. As a result, it reduces I/O for multi-column range filters and is ideal when queries commonly span multiple dimensions rather than a single dominant one. If you always query on a leading column, then lexicographical sort order is likely better. Fig 2.
Z-order uses bit mixing to produce a single scalar sort key, which determines an order that resembles a Z-shaped space-filling curve. But there are some problems with this partitioning + sorting clustering strategy:

- Partition granularity. The partition key must be chosen carefully: too many partitions lead to many small files, which can hurt performance instead of helping it.
- Imbalanced partitions. Your data may be skewed, leading to imbalanced partition sizes. Some might be very small, while others might be very large, which is inefficient and can lead to uneven performance.
- Changing distributions. The shape of your data may change over time, making your chosen partitioning strategy less effective over time.
- Drift. Your tables are constantly drifting away from the optimum clustering layout as new data arrives. Compaction is constantly working to cluster recent data. Global clustering is expensive, so clustering is usually performed on subsets of the data.

What if we could use a data layout strategy that was flexible and adaptive (solving pain points 1, 2, 3) and didn’t constantly drift as new data arrived (solving pain point 4)? Enter Qbeast and the OTree multidimensional indexing approach, which came out of research at the Barcelona Supercomputing Center. Qbeast has been on my radar because one of the founders is Flavio Junqueira, a distributed systems researcher behind both Apache ZooKeeper and Apache BookKeeper (both of which have played large roles in my career). The OTree brings to open table formats a global tree index that defines the table’s structure and layout. In some ways, the OTree could be thought of as a distant relative of the clustered index in the RDBMS world, as they both define the table layout. However, the OTree is a lightweight structure that does not try to organize individual rows. The OTree index approaches table layout as an adaptive spatial structure.
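The bit mixing behind Z-order, described a few paragraphs above, can be sketched in a few lines. This is a generic Morton-encoding helper for two integer coordinates, purely illustrative and unrelated to Qbeast’s own code:

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two non-negative integer coordinates.

    Sorting rows by the returned scalar yields the Z-shaped space-filling
    curve described above; this sketch handles two dimensions only.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        key |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return key
```

Sorting a 2x2 grid of points by this key produces the order (0,0), (1,0), (0,1), (1,1): exactly the Z shape, so neighbors in (x, y) space tend to stay neighbors in the 1-D key space.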
Instead of dividing data according to fixed partition keys or grouping it according to sort orders, it organizes the dataset into hypercubes that subdivide automatically as the data distribution demands. Each (hyper)cube represents a region in multi-dimensional space defined by the indexed columns. Fig 3. A table indexed on three columns leads to a 3-dimensional (normalized) space (more on that later). In this figure, the original cube has subdivided into 8 subcubes. A cube divides along all indexed dimensions simultaneously, creating 2ᵈ smaller cubes, where 𝑑 is the number of dimensions (i.e., the number of indexed columns). So for example:

- With 2 indexed columns, each division produces 4 subcubes (2×2 grid)
- With 3 indexed columns, each division produces 8 subcubes (2×2×2)
- With 4 indexed columns, each division produces 16 subcubes (2×2×2×2)

Fig 4. A cube subdivides into 8 subcubes (in a 3-dimensional space) corresponding to three indexed columns. Using 3-dimensional space is taxing on the mind and diagrams, so I’ll use examples based on two indexed columns, which leads to an easier-to-visualize 2-dimensional space. The number of dimensions corresponds to the number of indexed columns. If we index our products table by price and rating, then we have a two-dimensional space. Qbeast maps each row to a point in a multidimensional space by normalizing the values of the indexed columns into the [0, 1] range, preserving their relative order so that nearby data in each dimension remains close together in space. Fig 5. Two-dimensional space with normalized domains. For example, if we index columns price and rating, a row with (price=100, rating=4.2) might map to coordinates (0.10, 0.84) in the [0, 1] space of each dimension, while another with (price=120, rating=4.3) becomes (0.12, 0.86). Because both rows are close in their normalized coordinates, they occupy nearby positions in the multidimensional space, thereby preserving the natural proximity of their original values.
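The normalization step just described can be sketched as follows. The domain bounds (price 0-1000, rating 0-5) are assumptions chosen to reproduce the example coordinates above; Qbeast derives its transformations from the actual data:

```python
def normalize(value, lo, hi):
    """Map a raw column value into [0, 1], preserving order."""
    return (value - lo) / (hi - lo)

def to_point(row, domains):
    """Map a row to normalized coordinates, one per indexed column.

    `domains` maps each indexed column to assumed (lo, hi) bounds --
    an illustration, not how Qbeast actually derives its transformations.
    """
    return tuple(
        normalize(row[col], lo, hi) for col, (lo, hi) in domains.items()
    )

# Assuming price spans 0-1000 and rating 0-5:
domains = {"price": (0, 1000), "rating": (0, 5)}
point = to_point({"price": 100, "rating": 4.2}, domains)  # approx (0.10, 0.84)
```

Because the mapping is monotonic per column, two rows with nearby raw values land at nearby normalized coordinates, which is what makes range scans over the cubes useful.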
This is really important because the spatial locality should reflect the value locality within the data domain, else range scans won’t be very useful. This is precisely what the Z-order mapping function tries to do as well (by bit mixing). The difference is that a space-filling curve (like Z-order or Hilbert) takes multi-dimensional coordinates (x, y, z) and projects them onto a one-dimensional ordering, whereas Qbeast preserves the ordering per dimension. A cube is one subdivision of the multidimensional space. At first, all data falls into a single cube representing the full range of values ([0, 1] in each dimension). As new data arrives and the cube reaches a predetermined size, it generates subcubes, each covering a more specific region of the domain. This cube division continues, producing finer and finer cubes. The result is a layout that mirrors the actual distribution of the data. Skewed data that clusters around a tight set of values falls in dense regions of space, captured by finer and finer cubes, while sparse regions remain coarse. Fig 6. Cubes adaptively subdivide recursively based on multidimensional spatial density. In figure 6 above, we get the following set of splits:

- Root cube is created
- The root cube divides in half by both dimensions, creating four subcubes (0, 1, 2, 3)
- Subcube 3 fills up and divides into subcubes (30, 31, 32, 33)
- Subcube 30 fills up and divides into subcubes (300, 301, 302, 303)

Now it’s time to map this spatial representation to the tree. Because of how the cubes subdivide into two halves along each dimension, the cube id (such as 301) encodes its position and normalized domain bounds (along each dimension). This multidimensional space, divided up adaptively into multiple levels of subcubes, is represented by a tree. Fig 7. The OTree representation of the cubes. We can visualize the progress of a single root cube to the final set of cubes as follows. Fig 8. The OTree over time.
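A sketch of how a cube id like 3, 30, or 302 can encode a path of bisections through the normalized space. The subcube digit ordering here is an assumed convention for illustration; the real encoding may differ:

```python
def cube_id(point, depth):
    """Return the id (a digit string) of the depth-level cube containing
    a normalized point.

    Each level bisects every dimension, appending one digit in base 2^d.
    The digit convention (low half = 0 bit, high half = 1 bit, dimension i
    contributing bit i) is an assumption for illustration.
    """
    coords = list(point)
    digits = []
    for _ in range(depth):
        digit = 0
        for i, c in enumerate(coords):
            high = c >= 0.5
            digit |= int(high) << i
            # Zoom into the chosen half, rescaling it back to [0, 1)
            coords[i] = c * 2 - 1 if high else c * 2
        digits.append(str(digit))
    return "".join(digits)
```

The id alone determines the cube’s bounds in every dimension, which is why the OTree metadata never needs to store them explicitly.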
Next let’s look at how this translates to Apache Iceberg and its data files. Up to this point we’ve been talking about cubes, multidimensional space, and trees in abstract terms. But let’s ground ourselves and see how all this maps onto an Iceberg table or Delta table. The OTree governs layout, but Iceberg/Delta remains the source of truth about the canonical set of data files and their metadata. Writers (such as a Spark job for ingest) consult the OTree but readers (such as a Spark analytics job) only read Iceberg/Delta metadata. This separation allows the index to be invisible to all engines (Spark, Flink, Trino etc), requiring no special integration. Each node of the OTree corresponds to a cube, which in turn contains one or more blocks , where each block points to a data file (such as a Parquet file). Fig 9. Each node of the OTree contains one or more blocks, where each block is a data file (such as Parquet). In this example, the root cube reached capacity with three files and split along 2 dimensions. Notice that the data does not exist only in leaf nodes, but all nodes of the tree. The deeper into the tree you go, the narrower value range across the dimensions each node represents. Any given data point may exist in any node from the root, down to the lowest leaf that covers the data point. Fig 10. The data point maps onto the nodes: root, 3, 30 and 302. A query whose filter predicates cover this point may end up reading each of these files (it depends on the column stats). As I said in my previous blog post on OTF performance, the Iceberg column statistics reflect the layout of the data. We want narrow column stats for effective pruning, which means producing a data layout with data locality. The OTree provides that method of obtaining data locality according to one or more indexed columns (the dimensions of our multidimensional space). But readers carry on using the standard column statistics and bloom filters as usual. 
So, the OTree index governs the table’s layout, but it doesn’t replace Iceberg or Delta’s metadata or data files. The two systems coexist:

- The OTree index describes how the data should be organized: which regions exist, their spatial boundaries, and which data points fall into each.
- Iceberg/Delta’s metadata remains the authoritative catalog of what files exist and their stats.

In Iceberg, the OTree index is stored as a Puffin file which is referenced in the Iceberg metadata (so the OTree is committed as part of the Iceberg commit). Each commit may result in a new version of the OTree. Fig 11. A very simplified representation of four Iceberg commits which add one Parquet file per commit. The root cube splits in the 3rd snapshot, writing to one subcube, and another subcube in snapshot 4. In Delta Lake, the OTree metadata is included within the tag metadata of each operation in the Delta log (as depicted below). Fig 12. A very simplified representation of the Delta log with four add_files operations. Each added file is mapped to a cube id (where the tree structure is encoded into the cube ids). So although the OTree introduces a tree-shaped, spatial index, the underlying Iceberg/Delta table remains standard (additional fields are added to metadata, which does not break existing engines). Query engines simply ignore the OTree when they perform reads. Writers (optionally) and table maintenance jobs (necessarily) do need to know about the OTree, as we want the layout to be governed by an adaptive index rather than static partitioning logic. Ideally, writers will use the OTree index so that the index covers the whole dataset (ensuring locality is maintained from the very first moment data is written to the table). However, that requires the writer, such as Apache Spark, to use the Qbeast module when performing writes. Table maintenance jobs must use the module in order to apply the spatial layout to the Iceberg data files.
Although the OTree governs the layout of the entire table, the OTree itself is just lightweight metadata that describes the parent-child relationships (encoded in the cube ids) and, for each cube, the element count and the min/max weights. I won’t go into the detail of weights, but they are an additional feature designed to enhance data distribution across the nodes. The normalized dimension bounds of each cube are established by the position of the cube in the tree, so there is no need to store them. Because of this, even a table with billions of rows can be represented by an OTree containing just a few thousand small metadata entries, typically amounting to a few megabytes in total. The tree is therefore cheap to store, fast to read, and easy to keep in memory, while still providing a view of the data’s spatial layout. It’s helpful to see all of this on a spectrum. On the left, the classic B-tree clustered index: a strict, key-ordered global tree index that dictates exactly where every row lives. While great for selective OLTP workloads, it is far too rigid and expensive when the dataset grows and the queries become broad (reading millions of rows). On the right, we have Iceberg/Delta’s approach: lightweight metadata describing the canonical set of files (without ordering), with a declared clustering strategy (partitioning and optional sort order) from which the table is constantly drifting, requiring maintenance to bound that drift. In the middle sits the OTree: a global tree index, but without the fine-grained rigidity of the B-tree. Instead of ordering individual rows, it divides the data space into coarse, adaptive regions that subdivide and merge as the distribution demands. This keeps it incredibly light while still determining where data should live. Dense data is located in narrow cubes and sparse data in wide cubes. The layout is self-correcting as data distribution changes, avoiding imbalanced partitions.
It’s fun to see the inversion of the role of the index: using it to shape the table as it is written, so that the layout remains close to optimal, making the existing read-time optimizations of Iceberg and Delta more effective. The OTree is there behind the scenes, and query engines that read from the tables have no idea that it exists. There is a lot more to Qbeast than what I’ve covered here; there are additional mechanisms for ensuring even data distribution and making sampling efficient via random weights, but that’s too detailed for this post. The takeaway for me, I suppose, is that there are always more innovative ways of doing things, and we’re still early in the open table format / lakehouse game. There are plenty more innovations to come at all levels, from file formats and data organization to query engines.

The Coder Cafe 3 weeks ago

Build Your Own Key-Value Storage Engine—Week 2

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It’s hosted by ScyllaDB, the monstrously fast and scalable database.

Agenda:
- Week 0: Introduction
- Week 1: In-Memory Store
- Week 2: LSM Tree Foundations

Before delving into this week’s tasks, it’s important to understand what you will implement. This week, you will implement a basic log-structured merge-tree (LSM tree). At its core, an LSM tree is a data structure that prioritizes write efficiency by trading off some read complexity. It buffers writes in memory and uses append-only files on disk, then rewrites data during compaction. It consists of two main components:

- A mutable in-memory data structure called a memtable, used to store recent writes.
- A set of immutable SSTables (Sorted String Tables) stored on disk.

Regularly, the current memtable is snapshotted, its entries are sorted by key, and a new immutable SSTable file is written. In addition, a MANIFEST file is an append-only list of SSTable filenames. It tells the engine which SSTable files exist and in which order to read them, newest to oldest. Why LSM trees shine for write-heavy workloads:

- Fast writes with sequential I/O: New updates are buffered in memory (memtable) and later written sequentially to disk during a flush (SSTable), which is faster than the random I/O patterns common with B-trees, for example.
- Decouples writes from read optimization: Writes complete against the memtable, while compaction work runs later (you will tackle that in a future week).
- Space and long-term efficiency: Compaction processes remove dead data and merge many small files into larger sorted files, which keeps space usage in check and sustains read performance over time.

For the memtable, you will start with a hashtable. In a future week, you will learn why a hashtable is not the most efficient data structure for an LSM tree, but it is a simple starting point.
For the SSTables, you will use JSON as the data format. Get comfortable with a JSON parser if you are not already.

💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel).

This week’s implementation is single-threaded. You will revisit that assumption later.

Write path:
- Implement a hashtable to store requests (create or update). You can probably reuse a lot of code from Week 1.
- When your memtable contains 2,000 entries, flush the memtable as a new immutable JSON SSTable file with keys sorted. The SSTable file is a JSON array of objects, each with two fields, and . Keys are unique within a file. For example, if your memtable contains the following entries: You need to create the following SSTable:
- Use a counter for the filename prefix, for example , , .
- After writing the new SSTable, append its filename to the MANIFEST (append only), then clear the memtable.
- For now, the flush is a stop-the-world operation. While the file is being written, do not serve reads or writes. You will revisit that later.
- Create an empty MANIFEST file if it doesn’t exist. Derive the next SSTable ID from the MANIFEST so you don't reuse the same filename.

Read path:
- Check the memtable: if found, return the corresponding value.
- If not found, read the MANIFEST to list SSTable filenames, then scan SSTables from newest to oldest (for example , then , then ). Use a simple linear scan inside each file for now. Stop at the first hit and return the corresponding value.
- If still not found, return .

There are no changes to the client you built in week 1. Run it against the same file ( put.txt ) to validate that your changes are correct.

Bonus: keep a small LRU cache of known-absent keys (a negative cache) between the memtable and SSTables. This avoids repeated disk scans for hot misses: after the first miss, subsequent lookups are O(1). Implementation details are up to you.
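As a hedged sketch of the write and read paths described above, here is a minimal Python engine under this week’s assumptions (JSON SSTables, an append-only MANIFEST, a flush threshold); the class name, file naming, and field names are mine:

```python
# Sketch of this week's engine: memtable + JSON SSTables + MANIFEST.
# All names (Engine, "key"/"value" fields, zero-padded filenames) are
# illustrative choices, not prescribed by the post.
import json
import os
import tempfile

class Engine:
    def __init__(self, data_dir, threshold=2000):
        self.dir = data_dir
        self.threshold = threshold
        self.memtable = {}
        self.manifest = os.path.join(data_dir, "MANIFEST")
        open(self.manifest, "a").close()  # create if missing

    def _sstables(self):
        with open(self.manifest) as f:
            return [line.strip() for line in f if line.strip()]

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._flush()

    def _flush(self):
        # Derive the next ID from the MANIFEST so restarts don't reuse names.
        name = f"{len(self._sstables()):06d}.json"
        rows = [{"key": k, "value": v} for k, v in sorted(self.memtable.items())]
        with open(os.path.join(self.dir, name), "w") as f:
            json.dump(rows, f)
        with open(self.manifest, "a") as f:
            f.write(name + "\n")  # append only
        self.memtable.clear()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for name in reversed(self._sstables()):  # newest to oldest
            with open(os.path.join(self.dir, name)) as f:
                for row in json.load(f):  # simple linear scan
                    if row["key"] == key:
                        return row["value"]
        return None

d = tempfile.mkdtemp()
eng = Engine(d, threshold=2)
eng.put("b", "2")
eng.put("a", "1")        # second put triggers a stop-the-world flush
print(eng.get("a"))      # -> 1, now served from the SSTable on disk
```

Note how the read path mirrors the post exactly: memtable first, then SSTables newest to oldest, stopping at the first hit.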
Instead of parsing the MANIFEST file for each request, you can cache its content in memory.

That’s it for this week! You have built the first version of an LSM tree: a memtable in memory, SSTable files written by regular flushes, and a MANIFEST that lists those SSTables.

For now, durability isn’t guaranteed. Data already flushed to SSTables will be read after a restart, but anything still in the memtable during a crash is lost. In two weeks, you will make sure that any request acknowledged to a client remains in your storage engine, even after a restart.

The flush trigger you used was pretty simple: flush once the memtable contains 2,000 entries. In real systems, flushes can be triggered by various factors, for example:
- Some databases flush when the memtable reaches a target size in bytes, ensuring predictable memory usage.
- A flush can also occur after a period of time has passed. This happens because the database eventually needs to release commit log segments. For tables with very low write activity, this can sometimes lead to data resurrection scenarios. Here’s an old issue from the ScyllaDB codebase that illustrates this behavior.

Regarding the data model, this series assumes a simple key–value one: every PUT stores the whole value, so a GET just finds the newest entry and returns it. If you need a richer model (e.g., rows with many fields or collections), writes are often partial (patches) rather than full replacements. Therefore, reads must reconstruct the result by scanning newest to oldest and merging changes until all required fields are found or a full-write record is encountered.

Last but not least, in this series you implicitly rely on client-side ordering: the validation client issues requests sequentially. Production KV databases typically attach a sequence number or a logical timestamp to each write to handle out-of-order arrivals, merging, and reconciling results.
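The sequence-number idea in the last paragraph can be illustrated with a tiny Python sketch (the `merge` helper is hypothetical, not taken from any specific database):

```python
# Sketch: tag each write with a monotonically increasing sequence number
# so that out-of-order arrivals reconcile deterministically — the write
# with the highest sequence number wins for each key.
def merge(entries):
    """entries: iterable of (seq, key, value); highest seq wins per key."""
    winners = {}
    for seq, key, value in entries:
        if key not in winners or seq > winners[key][0]:
            winners[key] = (seq, value)
    return {k: v for k, (_, v) in winners.items()}

# Writes may arrive out of order (e.g., across files or replicas):
log = [(2, "a", "new"), (1, "a", "old"), (3, "b", "x")]
print(merge(log))  # {'a': 'new', 'b': 'x'}
```

With client-side sequential ordering (this series), "newest file, newest entry" is a valid proxy for the sequence number; the explicit tag only becomes necessary once writes can race.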
Pure wall-clock timestamps are convenient but brittle; see Kyle Kingsbury’s notes on clock pitfalls for a deeper dive.

References:
- The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper.
- Log Structured Merge Tree - ScyllaDB // LSM tree definition from the ScyllaDB technical glossary.


DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge

DPU-KV: On the Benefits of DPU Offloading for In-Memory Key-Value Stores at the Edge Arjun Kashyap, Yuke Li, and Xiaoyi Lu HPDC'25 This paper ends with a counterintuitive result: DPUs aren’t amazingly better than traditional CPUs at implementing key-value stores. You would think they would have a shot, given that they have specialized networking accelerators, and key-value stores don’t require a lot of general-purpose computation. It all comes down to access to off-chip memory. Table 1 and Fig. 2 contain profiling results which motivate the designs presented by this paper: Source: https://dl.acm.org/doi/10.1145/3731545.3731571 When a key-value store is run on a traditional CPU, most of the time is spent in packet processing rather than CRUD operations on (key, value) tuples. DPUs are chips that are specialized for packet processing, so one would think a DPU would be especially helpful in this situation. The designs in this paper are evaluated on NVIDIA BlueField 2 and BlueField 3 DPUs. These DPUs contain a mix of fixed-function hardware and ARM cores. The fundamental idea proposed in this paper is to make the CPU’s job easier by having the DPU perform some of the awkward packet processing work and pass CRUD requests to the CPU in a convenient format. The DPU parses incoming packets, extracts the relevant fields, and writes the minimal amount of information (operation to perform, key, value) to queues in host memory. Fig. 8(d) shows the amount of cruft the DPU is able to remove from each CRUD packet. Source: https://dl.acm.org/doi/10.1145/3731545.3731571 The design described in this paper is optimized in all of the ways you would expect: requests are batched, and appropriate pipelining is used to avoid idle time. Cross-core synchronization (across both DPU and host CPU cores) is minimized by the presence of many queues. Each ARM core owns a unique set of queues, and each host CPU core is assigned to one or more queues. As Fig.
11 shows, the design described so far ( DPU-KV-lat and DPU-KV-sav ) doesn’t offer significant speedups. There are latency improvements, but throughput suffers. These designs are bound by the DPU. Evidence for this comes from the performance of DPU-KV-dual and DPU-KV-shrd . DPU-KV-dual allows idle host CPU cores to process raw packets. DPU-KV-shrd uses sharding: the set of keys is partitioned, with some keys handled by the CPU and some keys handled by the DPU. Source: https://dl.acm.org/doi/10.1145/3731545.3731571 Dangling Pointers The moral of the story is that there is no special advantage conferred on an ARM core inside a SmartNIC over an Intel core inside the host CPU. It is interesting to compare this work to a key-value store implemented directly in an RMT pipeline . It would be interesting to drill down into the profiling numbers which motivated this paper and understand how much memory-level parallelism a traditional CPU core can utilize. At the microarchitectural level, this problem has to be memory bound; it would be interesting to see whether it is bound by memory latency or bandwidth.
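To make the DPU-to-host handoff concrete, here is a purely illustrative Python sketch of the kind of compact (operation, key, value) record a DPU might write into a host-memory queue after stripping the protocol framing; the byte layout is my invention, not the paper’s:

```python
# Hypothetical wire format for the "minimal information" records the
# summary describes: just an opcode, the key, and the value, with all
# packet-processing cruft already removed by the DPU.
import struct

OP_GET, OP_PUT = 0, 1
HEADER = "<BHH"  # u8 op, u16 key length, u16 value length (little-endian)

def encode(op, key, value=b""):
    return struct.pack(HEADER, op, len(key), len(value)) + key + value

def decode(buf):
    op, klen, vlen = struct.unpack_from(HEADER, buf)
    off = struct.calcsize(HEADER)
    return op, buf[off:off + klen], buf[off + klen:off + klen + vlen]

msg = encode(OP_PUT, b"user:42", b"alice")
print(len(msg), decode(msg))  # small record: (1, b'user:42', b'alice')
```

The point of such a format is that the host CPU core servicing the queue only does CRUD work, never packet parsing.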

<antirez> 4 weeks ago

Scaling HNSWs

I’m taking a few weeks of pause from my HNSW developments (now working on some other data structure, news soon). At this point, the new type I added to Redis is stable and complete enough, so it’s the perfect moment to reason about what I learned about HNSWs and turn it into a blog post. That kind of brain dump that was so common in the pre-AI era, and has now become, maybe, a bit more rare. Well, after almost one year of thinking about and implementing HNSWs and vector similarity stuff, it is time for some writing. However, this is not going to be an intro to HNSWs: too many are present already. This is the “extra mile” instead. If you know HNSWs, I want to share with you my more “advanced” findings, especially in the context of making them fast enough to allow for a “Redis” experience: you know, Redis is designed for low latency and high performance, and HNSWs are kinda resistant to that, so there were challenges in exposing HNSWs as an abstract data structure. This blog post will be split into several sections. Think of them as pages of the same book, different chapters of the same experience. Oh and, by the way, I already wrote and subsequently lost this blog post :D [long, sad story about MacOS and bad habits – I hadn’t lost something like that since the 90s, during blackouts], so here most of the problem will be to recall what I wrote a few days ago and, while I’m at it, to better rephrase what I didn’t like very much.

## A few words about the state of HNSW

Before digging into the HNSW internals and optimizations, I want to say a few things about HNSWs. The original paper introducing HNSWs is a great piece of computer science literature, and HNSWs are amazing data structures, but: I don’t believe they are the last word for searching, in a greedy way, for nearby vectors according to a distance function. The paper gives the feeling that it lacks some “pieces”, almost as if the researchers, given six months more, would have had a lot more to explore and say.
For instance, I extended the paper myself in order to support removal of entries: actual removals, not just tombstone deletions where the element is marked as gone and collected later. Deleting items is totally missing from the paper. Similarly, there are, right now, efforts to really check if the “H” in HNSW is really needed, and if instead a flat data structure with just one layer would perform more or less the same (I hope I’ll cover more about this in the future: my feeling is that the truth is in the middle, and that it makes sense to modify the level selection function to just have levels greater than a given threshold). All this to say that, if you are into data structures research, I believe that a great area is to imagine evolutions and refinements of HNSWs, without getting trapped within the idea that the evolutions are only in the sense of: let’s do it, but for disk (see Microsoft efforts), or the like. Ok, enough with the premise, let’s go to the actual low level stuff :)

## Scaling memory

Redis is an in-memory system, and both HNSWs and vectors have the unfortunate quality of being very space-hungry. There are three reasons for this:

1. HNSW nodes have a lot of pointers, like 16, 32 or more (this is a tunable parameter of HNSWs) to neighbor nodes.
2. HNSWs have many levels, being a skiplist-alike data structure. This exacerbates the first problem.
3. HNSW’s satellite data is a vector of floating point numbers, so, in the vanilla case, 4 bytes per component, and you normally have 300–3000 components; this is the usual range.

So, what are the lessons learned here? There are folks that compress pointers, since it is very likely that many pointers (8 bytes on 64 bit systems) will have the highest four bytes all the same. This is smart; I didn’t implement it yet, because in Redis I need to go fast, and this is a tradeoff between space and time: but maybe it is worth it, maybe not. I’ll dig more.
However, if you do the math, the fact that there are many layers is not *so* terrible as it looks. On average, the multiple layers per node make the situation worse by just ~1.3x (if the probability of level increase is 0.25 in the level selection function), since many nodes will be just at layer 0. But still 1.3 is more than 1, and if that “H” in HNSW really is not *so* useful… [Spoiler: what I found is that the seek time if you have everything at layer 0 is greater; the main loop for the greedy search will start from less optimal places and will eventually reach the right cluster, but will take more computation time. However, these are just early results.] So here the *real* low hanging fruit is: vector quantization. What I found is that if you use 8 bit quantization, what you get is an almost 4x speedup, a 4x reduction of your vectors (but not a 4x reduction of the whole node: the pointers are still there, and they take a lot of space), and a recall that is virtually the same in real world use cases. This is the reason why Redis Vector Sets use 8 bit quantization by default. You can specify, via VADD options, that you want full precision vectors or binary quantized vectors, where we just take the sign, but I’m skeptical about using both full size vectors and binary quantized vectors. Before talking about them, let’s see what kind of quantization I used for 8 bit. What I do is to compute the maximum absolute value of the components of each vector (so quantization is per-vector), then I use signed 8 bit values to represent the quant from -127 to 127. This is not as good as storing both min and max value, but it is faster when computing cosine similarity, since I can do this:

/* Each vector is quantized from [-max_abs, +max_abs] to [-127, 127]
 * where range = 2*max_abs. */
const float scale_product = (range_a/127) * (range_b/127);

Then I multiply things together in the integer domain (actually in the code the main loop is unrolled and uses multiple accumulators, to keep modern CPUs busier):

for (; i < dim; i++) dot0 += (int) x[i] * (int) y[i]; /* simplified */

And finally we can return back to the floating point distance with:

float dotf = dot0 * scale_product;

Check vectors_distance_q8() for more information, but I believe you got the idea: it is very simple to go from the integer quants domain to the unquantized dot product with trivial operations. So, 8 bit quantization is a great deal, and full precision was a *needed* feature, because there will be people doing things with vectors generated in a way where each small amount makes a difference (no, with learned vectors this is not the case…) but, why binary quantization? Because I wanted users to have a simple way to not waste space when their *original* information is already binary. Imagine you have a set of users and they have yes/no properties, and you want to find similar users, items, whatever. Well: this is where binary quantization should be used. It’s just, again, an option of the VADD command.

## Scaling speed: threading and locality

Oh, you know, I have to tell you something about myself: I’m not a fan of threaded systems when it is possible to do a lot with a single core, and then use multiple cores in a shared-nothing architecture. But HNSWs are different. They are *slow*, and they are accessed almost always in read-only ways, at least in most use cases. For this reason, my Vector Sets implementation is fully threaded. Not just reads: even writes are partially threaded, and you may wonder how this is possible without it resulting in a mess, especially in a system like Redis, where keys can be accessed in different ways by the background saving process, the clients, and so forth. Well, to start, let’s focus on reads.
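Here is a Python sketch of the same per-vector int8 scheme (rescaling with max_abs directly; the function names are mine, and the real C code is unrolled and uses multiple accumulators):

```python
# Sketch of per-vector 8-bit quantization: map each vector's components
# from [-max_abs, +max_abs] to signed [-127, 127], do the dot product in
# the integer domain, and rescale once at the end.
def quantize_q8(v):
    max_abs = max(abs(x) for x in v) or 1.0
    quants = [round(x * 127 / max_abs) for x in v]
    return quants, max_abs

def dot_q8(qa, max_abs_a, qb, max_abs_b):
    scale_product = (max_abs_a / 127) * (max_abs_b / 127)
    dot0 = sum(a * b for a, b in zip(qa, qb))  # integer domain
    return dot0 * scale_product

a = [0.5, -1.0, 0.25]
b = [1.0, 1.0, -2.0]
qa, ma = quantize_q8(a)
qb, mb = quantize_q8(b)
exact = sum(x * y for x, y in zip(a, b))        # -1.0
print(exact, dot_q8(qa, ma, qb, mb))            # approximation is close
```

A single float multiply recovers an unquantized dot product from the integer accumulator, which is why this quantization barely costs anything at query time.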
What happens is that as long as nobody is writing in the data structure, we can spawn threads that do the greedy collection of near vectors and return the results to the blocked client. However, my implementation of HNSWs was written from scratch, I mean, from the empty C file opened with vim; it has 0% of shared code with the two implementations most other systems use, so there are a few “novelties”. One of such different things is that in order to avoid re-visiting already visited nodes, I use an integer stored in each node that is called “epoch”, instead of using another data structure (like a hash table) to mark nodes already visited. That extra data structure is quite slow, I believe. The epoch instead is local to the node, and the global data structure increments the epoch for each search. So in the context of each search, we are sure that nodes whose stored epoch is older than the current one have not been visited yet. But with threads, there are multiple searches occurring at the same time! And, yep, what I needed was an array of epochs:

typedef struct hnswNode {
    uint32_t level;     /* Node's maximum level */
    /* … many other stuff … */
    uint64_t visited_epoch[HNSW_MAX_THREADS];
} hnswNode;

That’s what you can read in hnsw.h. This is, again, a space-time tradeoff, and again time won against space. So, how was it possible to have threaded writes? The trick is that in HNSW inserts, a lot of time is spent looking for neighbor candidates. So writes are split into a reading half and a commit half; only the second needs a write lock, and there are a few tricks to make sure that the candidates we accumulated during the first part are discarded if the HNSW changed in the meantime and some nodes may no longer be valid. There is, however, another problem. What about the user deleting the key while background threads are working on the value? For this scenario, we have a function that waits for background operations to return before actually reclaiming the object.
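A minimal Python sketch of the per-thread epoch trick (class and method names are illustrative, not from hnsw.c):

```python
# Instead of a per-search "visited" set, each node stores the epoch of
# the last search that touched it — one slot per thread, so concurrent
# searches on different threads don't interfere.
class Node:
    def __init__(self, nthreads):
        self.visited_epoch = [0] * nthreads

class Graph:
    def __init__(self, nthreads):
        self.current_epoch = [0] * nthreads

    def begin_search(self, tid):
        self.current_epoch[tid] += 1  # new search -> new epoch for this thread

    def visit(self, node, tid):
        """Return True only the first time this search sees the node."""
        if node.visited_epoch[tid] == self.current_epoch[tid]:
            return False
        node.visited_epoch[tid] = self.current_epoch[tid]
        return True

g = Graph(nthreads=2)
n = Node(nthreads=2)
g.begin_search(0)
print(g.visit(n, 0), g.visit(n, 0))  # True False
g.begin_search(1)                    # another thread's search is independent
print(g.visit(n, 1))                 # True
```

The win is that starting a new search is a single integer increment; nothing needs to be cleared or allocated per search.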
With these tricks, it is easy to get 50k ops/sec on real world vector workloads, and these are numbers I got from redis-benchmark itself, with all the overhead involved. The raw numbers of the flat HNSW library itself are much higher.

## Scaling memory: reclaiming it properly

Before talking about how to scale HNSWs into big use cases with multiple instances involved, and why Redis Vector Sets expose the actual data structure in the face of the user (I believe programmers are smart and don’t need babysitting, but it’s not *just* that), I want to go back and talk again about memory, because there is an interesting story to tell about this specific aspect. Most HNSW implementations are not able to reclaim memory directly when you delete a node from the graph. I believe there are two main reasons for that:

1. People misunderstand the original HNSW paper in a specific way: they believe links can be NOT reciprocal among neighbors. And there is a specific reason why they think so.
2. The paper does not say anything about deletion of nodes and how to fix the graph after nodes go away and we get missing links in the “web” of connections.

The first problem is a combination (I believe) of lack of clarity in the paper and the fact that, while implementing HNSWs, people face a specific problem: when inserting a new node, and good neighbors are searched among existing nodes, often the candidates already have the maximum number of outgoing links. What to do, in this case? The issue is often resolved by linking unidirectionally from the new node we are inserting to the candidates that are already “full” of outgoing links. However, when you need to delete a node, you can no longer resolve all its incoming links, so you can’t really reclaim memory. You mark it as deleted with a flag, and later sometimes there is some rebuilding of the graph to “garbage collect” stale nodes; sometimes memory is just leaked.
So, to start, my implementation in Redis does things differently, by forcing links to be bidirectional. If A links to B, B links to A. But how to do so, given that A may be full? Well, this gets into complicated territory, but what happens is that heuristics are used to drop links from existing nodes when those nodes have other neighbors that are well connected and our node is a better candidate even for the target node; if this is not true, there are other ways to force a new node to have at least a minimal number of links, always trying to satisfy the small world property of the graph. This way, when Redis deletes a node from a Vector Set, it always has a way to remove all the pointers to it. However, what to do with the remaining nodes that now are missing a link? What I do is to create a distance matrix among them, in order to try to link the old node’s neighbors among themselves, trying to minimize the average distance. Basically, for each pair of nodes i,j in our matrix, we calculate how good their connection is (how similar their vectors are) and how badly linking them affects the *remaining* possible pairs (since there could be elements left without good pairs, if we link two specific nodes). After we build this matrix of scores, we then proceed with a greedy pairing step. This works so well that you can build a large HNSW with millions of elements, later delete 95% of all your elements, and the remaining graph still has good recall, no isolated nodes, and so forth. That is what I mean when I say that there is space in HNSWs for new papers to continue the work.

## Scaling HNSWs to multiple processes

When I started to work on Redis Vector Sets, there was already a vector similarity implementation in Redis-land, specifically as an index type of RediSearch, and this is how most people think of HNSWs: as a form of indexing of existing data. Yet I wanted to provide Redis with a new HNSW implementation exposed in a completely different way. Guess how?
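The greedy pairing step can be sketched in Python; this is a simplified stand-in for the score matrix described above, using plain pairwise distance and my own naming:

```python
# After deleting a node, its orphaned neighbors need new links among
# themselves. Simplified repair: sort all candidate pairs by distance
# and greedily take the closest pairs whose endpoints are still free.
import math

def greedy_pairs(nodes):
    """nodes: dict name -> vector. Returns a list of (name, name) pairs."""
    pairs = [(math.dist(nodes[i], nodes[j]), i, j)
             for i in nodes for j in nodes if i < j]
    pairs.sort()  # closest pairs first
    used, out = set(), []
    for _, i, j in pairs:
        if i not in used and j not in used:
            used.update((i, j))
            out.append((i, j))
    return out

orphans = {"a": [0, 0], "b": [0, 1], "c": [5, 5], "d": [5, 6]}
print(greedy_pairs(orphans))  # [('a', 'b'), ('c', 'd')]
```

The real implementation also scores how badly each pairing penalizes the remaining candidates; this sketch only captures the greedy part.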
As a data structure, of course. And this tells a story about how Redis-shaped my head is after so many years (or maybe it was Redis-shaped since the start, and it is Redis that is shaped after my head), since I immediately envisioned how to design a Redis data structure that exposed HNSWs to the users, directly, and I was puzzled that the work with vectors in Redis was not performed exactly like that. At the same time, when I handed my design document to my colleagues at Redis, I can’t say that they immediately “saw” it as an obvious thing. My reasoning was: vectors are like scores in Redis Sorted Sets, except they are not scalar scores where you have a total order. Yet you can VADD and VREM elements, and then you can call VSIM instead of ZRANGE in order to get *similar* elements. This made sense not just as an API; I thought of HNSWs as strongly composable, and not linked to a specific use case (not specific to text embeddings, or image embeddings, or even *learned* embeddings necessarily). You do:

VADD my_vector_set VALUES [… components …] my_element_string

So whatever is in your components, Redis doesn’t care; when you call VSIM it will report similar elements. But this also means that, if you have different vectors about the same use case split across different instances / keys, you can run VSIM with the same query vector against all the instances, add the WITHSCORES option (that returns the cosine distance), and merge the results client-side: you have magically scaled your hundreds of millions of vectors across multiple instances, splitting your dataset N ways. [One interesting thing about such a use case is that you can query the N instances in parallel using multiplexing, if your client library is smart enough.] Another very notable thing about HNSWs exposed in this raw way is that you can finally scale writes very easily. Just hash your element modulo N, and target the resulting Redis key/instance.
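A client-side sketch of this scatter/gather pattern, with the Redis calls mocked out as plain Python dicts (only the sharding and merge logic are the point; `vadd`/`vsim_all` are my stand-ins, not client-library APIs):

```python
# Writes: hash the element modulo N and target that shard.
# Reads: fan VSIM out to every shard WITHSCORES, merge client-side.
import heapq
from zlib import crc32

N = 3
shards = [dict() for _ in range(N)]  # stand-ins for N Redis keys/instances

def vadd(element, vector):
    shards[crc32(element.encode()) % N][element] = vector

def vsim_one(shard, query):
    # stand-in for "VSIM key VALUES ... WITHSCORES" on one instance
    score = lambda v: sum(a * b for a, b in zip(query, v))
    return [(score(v), e) for e, v in shard.items()]

def vsim_all(query, k):
    merged = [hit for shard in shards for hit in vsim_one(shard, query)]
    return heapq.nlargest(k, merged)  # best k across all shards

vadd("x", [1, 0])
vadd("y", [0, 1])
vadd("z", [0.9, 0.1])
print(vsim_all([1, 0], k=2))  # the two best matches across every shard
```

In a real deployment the per-shard queries would go out in parallel over multiplexed connections, as the post notes; only the final `nlargest` merge runs on the client.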
Multiple instances can absorb the (slow, but still fast for HNSW standards) writes at the same time, parallelizing an otherwise very slow process. This way of exposing HNSWs also scales down in a very significant way: sometimes you want an HNSW for each user / item / product / whatever you are working with. This is very hard to model if you have an index on top of something, but it is trivial if your HNSWs are data structures. You can just have a Vector Set key for each of your items, with just a handful of elements. And of course, like with any other Redis key, you can set an expiration time on the key, so that it will be removed automatically later. All this can be condensed into a rule that I believe should be more present in our industry: many programmers are smart, and if, instead of creating a magic system they have no access to, you show them the data structure and the tradeoffs, they can build more things, and model their use cases in specific ways. And your system will be simpler, too.

## Scaling loading times

Without threading, my HNSW library can insert word2vec vectors (300 components each) at 5,000 elements/second on a single thread, and can query the resulting HNSW at 90k queries per second. As you can see, there is a large gap. This means that loading back an HNSW with many millions of elements from a Redis dump file into memory would take a lot of time. And this time would impact replication as well. Not great. But this is true only if we add elements from the disk to memory in the most trivial way, that is, storing “element,vector” on disk and then trying to rebuild the HNSW in memory. There is another lesson to learn here. When you use HNSWs, you need to serialize the nodes and the neighbors as they are, so you can rebuild everything in memory by just allocating stuff and turning neighbor IDs into pointers. This resulted in a 100x speedup. But do you really believe the story ends here? Hehe.
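The ID-to-pointer rebuild can be sketched like this in Python (an illustration of the idea, not the actual RDB format; all names are mine):

```python
# Serialize each node with neighbor *indices* instead of references;
# on load, allocate every node first, then swap indices back to
# references — no re-insertion into the graph is needed.
class Node:
    def __init__(self, vec):
        self.vec, self.neighbors = vec, []

def serialize(nodes):
    index = {id(n): i for i, n in enumerate(nodes)}
    return [{"vec": n.vec,
             "neighbors": [index[id(m)] for m in n.neighbors]}
            for n in nodes]

def load(records):
    nodes = [Node(r["vec"]) for r in records]   # 1) allocate everything
    for n, r in zip(nodes, records):            # 2) turn IDs into refs
        n.neighbors = [nodes[i] for i in r["neighbors"]]
    return nodes

a, b = Node([1.0]), Node([2.0])
a.neighbors, b.neighbors = [b], [a]
restored = load(serialize([a, b]))
print(restored[0].neighbors[0] is restored[1])  # True
```

Loading becomes two linear passes over the nodes instead of millions of graph insertions, which is where a speedup of the order the post reports can come from.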
Recently Redis gained stronger security features and avoids doing bad things even when the RDB file is corrupted by an attacker. So what I needed to do was to make sure the HNSW is valid after loading, regardless of the errors and corruption in the serialized data structure. This involved many tricks, but I want to take the freedom to just dump one comment I wrote here, as I believe the reciprocal check is particularly cool:

/* Second pass: fix pointers of all the neighbors links.
 * As we scan and fix the links, we also compute the accumulator
 * register "reciprocal", that is used in order to guarantee that all
 * the links are reciprocal.
 *
 * This is how it works: we hash (using a strong hash function) the
 * following key for each link that we see from A to B (or vice versa):
 *
 *   hash(salt || A || B || link-level)
 *
 * We always sort A and B, so the same link from A to B and from B to A
 * will hash the same. Then we xor the result into the 128 bit accumulator.
 * If each link has its own backlink, the accumulator is guaranteed to
 * be zero at the end.
 *
 * Collisions are extremely unlikely to happen, and an external attacker
 * can't easily control the hash function output, since the salt is
 * unknown, and they would also have to control the pointers.
 *
 * This algorithm is O(1) for each node so it is basically free for
 * us, as we scan the list of nodes, and runs in constant and very
 * small memory. */

## Scaling use cases: JSON filters

I remember the day when the first working implementation of Vector Sets felt complete. Everything worked as expected and it was the starting point for the refinements and the extra features. However, in the past weeks and months I internally received the feedback that most use cases need some form of mixed search: you want vectors near a given query vector (like the movies most similar to something) but also with some kind of filtering (only released between 2000 and 2010).
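The reciprocity check from that comment can be sketched in Python (hash choice and naming are mine; the real code works over pointers during the load pass):

```python
# XOR-accumulator check for reciprocal links: hash each link with its
# endpoints sorted, so A->B and B->A hash identically and cancel out.
# If every link has a backlink, the accumulator ends at exactly zero.
import hashlib
import os

SALT = os.urandom(16)  # unknown to an attacker, as in the comment

def link_hash(a, b, level):
    lo, hi = sorted((a, b))
    data = SALT + f"{lo}|{hi}|{level}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big")

def links_reciprocal(links):
    """links: iterable of (src, dst, level), one entry per direction."""
    acc = 0
    for src, dst, level in links:
        acc ^= link_hash(src, dst, level)  # 128-bit XOR accumulator
    return acc == 0

good = [("A", "B", 0), ("B", "A", 0)]
bad = [("A", "B", 0), ("A", "C", 0), ("C", "A", 0)]  # A->B has no backlink
print(links_reciprocal(good), links_reciprocal(bad))  # True False
```

The check is a single hash and XOR per link while the nodes are being scanned anyway, which is why the comment can call it basically free.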
My feeling is that you need to query by different parameters less often than product people believe, and that most of the time you can obtain this more efficiently by adding, in this specific case, each year to a different vector set key (this is another instance of the composability of HNSWs exposed as data structures versus a kind of index). However, I was thinking about the main loop of the HNSW greedy search, which is something like this:

// Simplified HNSW greedy search algorithm. Don’t trust it too much.
while (candidates.len() > 0) {
    c = candidates.pop_nearest(query);
    worst_distance = results.get_worst_dist(query);
    if (distance(query, c) > worst_distance) break;
    foreach (neighbor from c) {
        if (neighbor.already_visited()) continue;
        neighbor.mark_as_visited();
        if (results.has_space() OR neighbor.distance(query) < worst_distance) {
            candidates.add(neighbor);
            results.add(neighbor);
        }
    }
}
return results;

So I started to play with the idea of adding a JSON set of metadata to each node. What if, once I have things like {“year”: 1999}, this was enough to filter while I perform the greedy search? Sure, the search needed to be bounded, but there is a key insight here: I want, to start, elements that are *near* the query vector, so I don’t really need to explore the whole graph if the condition on the JSON attributes is not satisfied by many nodes. I’ll let the user specify the effort, and anyway very-far-away results that match the filter are useless. So that’s yet another way my HNSW differs: it supports filtering by expressions similar to the ones you could write inside an “if” statement of a programming language. And your elements in the Vector Set can be associated with JSON blobs expressing their properties. Then you can do things like:

VSIM movies VALUES … your vector components here… FILTER '.year >= 1980 and .year …'

## A few words on memory usage

HNSW’s fatal issue is — in theory — that they are normally served from memory.
Actually, you can implement HNSWs on disk, even if there are better data structures from the point of view of disk access latencies. However, in the specific case of Redis and Vector Sets, the idea is to provide something that is very fast and easy to work with: the flexibility of in-memory data structures helps with that. So the question boils down to: is the memory usage really so bad? Loading the 3 million Word2Vec entries into Redis with the default int8 quantization takes 3GB of RAM, 1KB for each entry. Many use cases have just a few tens of millions of entries, or a lot less. And what you get back from HNSWs, if well implemented and in memory, is very good performance, which is crucial in a data structure and in a workload that is in itself slow by definition. On my MacBook I get 48k ops per second with redis-benchmark and VSIM against this key (holding the word2vec dataset). My feeling is that the memory usage of in-memory HNSWs is very acceptable for many use cases. And even in the use cases where you want the bulk of your vectors on disk, even if there is slower performance to pay for, your hot set should likely be served from RAM. This is one of the reasons why I believe that being active in HNSW research is a good idea: I don’t think they will be replaced anytime soon for most use cases. It seems more likely that we will continue to have different data structures that are ideal for RAM and for disk, depending on the use cases and data size. Moreover, what I saw recently, even just scanning the Hacker News front page, is people with a few million items fighting with systems that are slower or more complicated than needed. HNSWs, carefully exposed in the right way, can avoid all that.

## Conclusions

I like HNSWs, and working on and implementing them was a real pleasure.
I believe vectors are a great fit for Redis, even in an AI-less world (for instance, a few months ago I used them to fingerprint Hacker News users, replicating an older analysis published on HN). HNSWs are simply too cool and powerful for a number of use cases, and with AI and learned embeddings, all this escalates to a myriad of potential use cases. However, like most features in Redis, I expect that a lot of time will pass before people realize they are useful and powerful, and how to use them (no, it's not just a matter of RAG). The same happened with Streams: after so many years, there is finally mass adoption. If instead you are more interested in HNSW and the implementation I wrote, I believe the code is quite accessible and heavily commented: https://github.com/redis/redis/blob/unstable/modules/vector-sets/hnsw.c If you want to learn more about Redis Vector Sets, please feel free to read the README file I wrote myself. There is also the official Redis documentation, but I suggest you start from here: https://github.com/redis/redis/tree/unstable/modules/vector-sets Thanks for reading such a long blog post! And have a nice day. References: this is the paper about the "H" in HNSW and how useful it is -> https://arxiv.org/abs/2412.01940

0 views
Shayon Mukherjee 1 month ago

A hypothetical search engine on S3 with Tantivy and warm cache on NVMe

I've been curious about how far you can push object storage as a foundation for database-like systems. In previous posts, I explored moving JSON data from PostgreSQL to Parquet on S3 and building MVCC-style tables with constant-time deletes using S3's conditional writes. These experiments showed that decoupling storage from compute unlocks interesting trade-offs: lower costs and simpler operations in exchange for higher cold query latency. Search engines traditionally don't fit this model.

0 views
Karboosx 1 month ago

Building a Simple Search Engine That Actually Works

You don't need Elasticsearch for most projects. I built a simple search engine from scratch that tokenizes everything, stores it in your existing database, and scores results by relevance. Dead simple to understand and maintain.
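In that spirit, a toy version of the idea fits in a few lines: an inverted index (a plain dict standing in for the database table) with a simple TF-IDF-style relevance score. Class and method names here are hypothetical, not the post's actual code.

```python
import math
import re
from collections import Counter, defaultdict

class TinySearch:
    """Minimal inverted index: tokenize everything, store term
    frequencies per document, score results by relevance."""
    def __init__(self):
        self.index = defaultdict(dict)   # term -> {doc_id: term_freq}
        self.docs = 0

    def tokenize(self, text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        self.docs += 1
        for term, tf in Counter(self.tokenize(text)).items():
            self.index[term][doc_id] = tf

    def search(self, query):
        scores = Counter()
        for term in self.tokenize(query):
            postings = self.index.get(term, {})
            if not postings:
                continue
            # Rarer terms contribute more to the score (IDF).
            idf = math.log(1 + self.docs / len(postings))
            for doc_id, tf in postings.items():
                scores[doc_id] += tf * idf
        return [doc for doc, _ in scores.most_common()]
```

In a real project the `index` dict would live in a couple of tables in your existing database, but the scoring logic stays this small.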

25 views
Simon Willison 1 month ago

Video + notes on upgrading a Datasette plugin for the latest 1.0 alpha, with help from uv and OpenAI Codex CLI

I'm upgrading various plugins for compatibility with the new Datasette 1.0a20 alpha release and I decided to record a video of the process. This post accompanies that video with detailed additional notes. I picked a very simple plugin to illustrate the upgrade process (possibly too simple). datasette-checkbox adds just one feature to Datasette: if you are viewing a table with boolean columns (detected as integer columns with names like or or ) and your current user has permission to update rows in that table, it adds an inline checkbox UI that looks like this: I built the first version with the help of Claude back in August 2024 - details in this issue comment. Most of the implementation is JavaScript that makes calls to Datasette 1.0's JSON write API. The Python code just checks that the user has the necessary permissions before including the extra JavaScript. The first step in upgrading any plugin is to run its tests against the latest Datasette version. Thankfully uv makes it easy to run code in scratch virtual environments that include the different code versions you want to test against. I have a test utility called (for "test against development Datasette") which I use for that purpose. I can run it in any plugin directory like this: And it will run the existing plugin tests against whatever version of Datasette I have checked out in my directory. You can see the full implementation of (and its friend described below) in this TIL - the basic version looks like this: I started by running in the directory, and got my first failure... but it wasn't due to permissions, it was because the for the plugin was pinned to a specific mismatched version of Datasette: I fixed this problem by swapping to and running the tests again... and they passed! Which was a problem because I was expecting permission-related failures. It turns out when I first wrote the plugin I was lazy with the tests - they weren't actually confirming that the table page loaded without errors.
I needed to actually run the code myself to see the expected bug. First I created myself a demo database using sqlite-utils create-table: Then I ran it with Datasette against the plugin's code like so: Sure enough, visiting produced a 500 error about the missing method. The next step was to update the test to also trigger this error: And now fails as expected. At this point I could have manually fixed the plugin itself - which would likely have been faster given the small size of the fix - but instead I demonstrated a bash one-liner I've been using to apply these kinds of changes automatically: runs OpenAI Codex in non-interactive mode - it will loop until it has finished the prompt you give it. I tell it to consult the subset of the Datasette upgrade documentation that talks about Datasette permissions and then get the command to pass its tests. This is an example of what I call designing agentic loops - I gave Codex the tools it needed ( ) and a clear goal and let it get to work on my behalf. The remainder of the video covers finishing up the work - testing the fix manually, committing my work using: Then shipping a 0.1a4 release to PyPI using the pattern described in this TIL. Finally, I demonstrated that the shipped plugin worked in a fresh environment using like this: Executing this command installs and runs a fresh Datasette instance with a fresh copy of the new alpha plugin ( ). It's a neat way of confirming that freshly released software works as expected. This video was shot in a single take using Descript, with no rehearsal and perilously little preparation in advance. I recorded through my AirPods and applied the "Studio Sound" filter to clean up the audio. I pasted in a closing slide from my previous video and exported it locally at 1080p, then uploaded it to YouTube.
Something I learned from the Software Carpentry instructor training course is that making mistakes in front of an audience is actively helpful - it helps them see a realistic version of how software development works and they can learn from watching you recover. I see this as a great excuse for not editing out all of my mistakes! I'm trying to build new habits around video content that let me produce useful videos while minimizing the amount of time I spend on production. I plan to iterate more on the format as I get more comfortable with the process. I'm hoping I can find the right balance between production time and value to viewers. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options .

0 views
Jack Vanlightly 1 month ago

How Would You Like Your Iceberg Sir? Stream or Batch Ordered?

Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently; both can be built on top of Iceberg, but their differences create a tug-of-war over the Iceberg table itself. In this post I am going to use two real-world systems, Apache Fluss (streaming tabular storage) and Confluent Tableflow (Kafka-to-Iceberg), as a case study of these tensions between stream and batch analytics. Apache Fluss uses zero-copy tiering to Iceberg. Recent data is stored on Fluss servers (using the Kafka replication protocol for high availability and durability) but is then moved to Iceberg for long-term storage. This results in one copy of the data. Confluent Kora and Tableflow use internal topic tiering and Iceberg materialization, copying Kafka topic data to Iceberg, such that we have two copies (one in Kora, one in Iceberg). This post will explain why both have chosen different approaches and why both are totally sane, defensible decisions. First we should understand the concepts of stream-order and batch-order. A streaming Flink job typically assumes its sources come with stream-order. For example, a simple SELECT * Flink query assumes the source is (loosely) temporally ordered, as if it were a live stream. It might be historical data, such as starting at the earliest offset of a Kafka topic, but it is still loaded in temporal order. Windows and temporal joins also depend on the source being stream-ordered to some degree, to avoid needing large/infinite window sizes which blow up the state. A Spark batch job typically hopes that the data layout of the Iceberg table is batch-ordered (say, partitioned and sorted by business values like region, customer, etc.), thus allowing it to efficiently prune data files that are not relevant, and to minimize costly shuffles. If Flink is just reading a Kafka topic from start to end, it's nothing special.
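A toy model shows why layout matters for pruning. Each "file" below records min/max statistics for a column, the way Parquet footers and Iceberg manifests do, and a point lookup can skip any file whose range excludes the value. The data, file size, and function names are all made up for illustration:

```python
def make_files(rows, sort_key, rows_per_file=2):
    """Split rows into 'files' and record min/max stats for the
    customer column, mimicking Iceberg file-level statistics."""
    rows = sorted(rows, key=sort_key)
    chunks = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return [{"rows": f,
             "min": min(r["customer"] for r in f),
             "max": max(r["customer"] for r in f)} for f in chunks]

def files_scanned(files, customer):
    """A reader skips any file whose [min, max] range excludes the value."""
    return sum(1 for f in files if f["min"] <= customer <= f["max"])

rows = [{"offset": i, "customer": c}
        for i, c in enumerate([10, 1, 20, 2, 30, 3, 40, 4])]
batch_order  = make_files(rows, sort_key=lambda r: r["customer"])  # value-sorted
stream_order = make_files(rows, sort_key=lambda r: r["offset"])    # arrival order
```

With a value-sorted (batch-order) layout, a lookup for one customer touches a single file; in arrival (stream) order the same value is smeared across every file's min/max range, so nothing can be pruned.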
But we can also get fancy by reading from two data sources: one historical and one real-time. The idea is that we can unify historical data from Iceberg (or another table format) and real-time data from some kind of event stream. We call reading from the historical source bootstrapping. Streaming bootstrap refers to running a continuous query that reads historical data first and then seamlessly switches to live streaming input. To switch from the historical to the real-time source, we need to make that switch at a given offset. The notion of a "last tiered offset" is a correctness boundary that ensures that the bootstrap and the live stream blend seamlessly without duplication or gaps. This offset can be mapped to an Iceberg snapshot. Fig 1. Bootstrap a streaming Flink job from historical then switch to real-time. However, if the historical Iceberg data is laid out with a batch-order (partitioned and sorted by business values like region, customer etc) then the bootstrap portion of a SELECT * will appear completely out-of-order relative to stream-order. This breaks the expectations of the user, who wants to see data in the order it arrived (i.e., stream-order), not a seemingly random one. We could sort the data first from batch-order back to stream-order in the Flink source before it reaches the Flink operator level, but this can get really inefficient. Fig 2. Sort batch-ordered historical data in the Flink source task. If the table has been partitioned by region and sorted by customer, but we want to sort it by the time it arrived (such as by timestamp or Kafka offset), this will require a huge amount of work and data shuffling (in a large table). The result is not only a very expensive bootstrap, but also a very slow one (after all, we expect fast results with a streaming query). So we hit a wall: Flink wants data ordered temporally for efficient streaming bootstrap.
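The handoff itself is simple to sketch: replay the historical source up to the last tiered offset, then continue from the live stream strictly after it, so the two sources blend without gaps or duplicates. The record shape and names below are illustrative:

```python
def bootstrap_then_stream(historical, live, last_tiered_offset):
    """Streaming bootstrap sketch: read historical records up to the
    last tiered offset, then switch to the live stream from the next
    offset onward, avoiding both duplicates and gaps."""
    for record in historical:
        if record["offset"] <= last_tiered_offset:
            yield record
    for record in live:
        if record["offset"] > last_tiered_offset:
            yield record

# Offsets 3 and 4 exist in both sources: tiered to Iceberg, still in the log.
historical = [{"offset": i, "src": "iceberg"} for i in range(5)]
live       = [{"offset": i, "src": "stream"} for i in range(3, 8)]
merged = list(bootstrap_then_stream(historical, live, last_tiered_offset=4))
```

The correctness boundary is the offset comparison: everything at or below the last tiered offset comes from the historical source, everything above it from the live one.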
Batch workloads want data ordered by value (e.g., columns) for effective pruning and scan efficiency. These two data layouts are orthogonal. Temporal order preserves ingest locality; value order preserves query locality. You can't have both in a single physical layout. Fluss is a streaming tabular storage layer built for real-time analytics which can serve as the real-time data layer for lakehouse architectures. I recently did a comprehensive deep dive into Apache Fluss, going right into the internals, if you are interested. Apache Fluss takes a clear stance. It's designed as a streaming storage layer for data lakehouses, so it optimizes Iceberg for streaming bootstrap efficiency. It does this by maintaining stream-order in the Iceberg table. Fig 3. Fluss stores real-time and historical data in stream-order. Internally, Fluss uses its own offset (akin to the Kafka offset) as the Iceberg sort order. This ensures that when Flink reads from Iceberg, it sees a temporally ordered sequence. The Flink source can literally stream data from Iceberg without a costly data shuffle. Let's take a look at a Fluss log table. A log table can define:

- Optional partitioning keys (based on one or more columns). Without them, a table is one large partition.
- The number of buckets per partition. The bucket is the smallest logical subdivision of a Fluss partition.
- An optional bucketing key for hash-bucketing. Otherwise, rows are added to random buckets, or round-robin.

The partitioning and buckets are both converted to an Iceberg partition spec. Fig 4. An example of the Iceberg partition spec and sort order. Within each of these Iceberg partitions, the sort order is the Fluss offset. For example, we could partition by a date field, then spread the data randomly across the buckets within each partition. Fig 5. The partitions of an Iceberg table visualized. Inside Flink, the source will generate one "split" per table bucket, routing them by bucket id to split readers.
Due to the offset sort order, each Parquet file should contain contiguous blocks of offsets after compaction. Therefore each split reader naturally reads Iceberg data in offset order until it switches to the Fluss servers for real-time data (also in offset order). Fig 6. Flink source bootstraps from Iceberg visualized Once the lake splits have been read, the readers start reading from the Fluss servers for real-time data. This is great for Flink streaming bootstrap (it is just scanning the data files as a cheap sequential scan). Primary key tables are similar but have additional limitations on the partitioning and bucketing keys (as they must be subsets of the primary key). A primary key, such as device_id , is not a good partition column as it’s too fine grained, leading us to use an unpartitioned table. Fig 7. Unpartitioned primary key table with 6 buckets. If we want Iceberg partitioning, we’ll need to add another column (such as a date) to the primary key and then use the date column for the partitioning key (and device_id as a bucket key for hash-bucketing) . This makes the device_id non-unique though. In short, Fluss is a streaming storage abstraction for tabular data in lakehouses and stores both real-time and historical data in stream-order. This layout is designed for streaming Flink jobs. But if you have a Spark job trying to query that same Iceberg table, pruning is almost useless as it does not use a batch-optimized layout. Fluss may well decide to support Iceberg custom partitioning and sorting (batch-order) in the future, but it will then face the same challenges of supporting streaming bootstrap from batch-ordered Iceberg. Confluent’s Tableflow (the Kafka-to-Iceberg materialization layer) took the opposite approach. It stores two copies of the data: one stream-ordered and one optionally batch-ordered. Kafka/Kora internally tiers log segments to object storage, which is a historical data source in stream-order (good for streaming bootstrap). 
Iceberg is a copy, which allows for either stream-order or batch-order; it's up to the customer. Custom partitioning and sort order is not yet available at the time of writing, but it's coming. Fig 8. Tableflow continuously materializes a copy of a Kafka topic as an Iceberg table. I already wrote about why I think zero-copy Iceberg tiering is a bad fit for Kafka specifically. Much of that also applies to Kora, which is why Tableflow is a separate distributed component from Kora brokers. So if we're going to materialize a copy of the data for analytics, we have the freedom to allow customers to optimize their tables for their use case, which is often batch-based analytics. Fig 9. Copy 1 (original): Kora maintains stream-ordered live and historical Kafka data. Copy 2 (derived): Tableflow continuously materializes Kafka topics as Iceberg tables. If the Iceberg table is also stored in stream-order then Flink could do an Iceberg streaming bootstrap and then switch to Kafka. This is not available right now in Confluent, but it could be built. There are also improvements that could be made to historical data stored by Kora/Kafka, such as using a columnar format for log segments (something that Fluss does today). Either way, the materialization design provides the flexibility to execute a streaming bootstrap using a stream-order historical data source, allowing the customer to optimize the Iceberg table according to their needs. Batch jobs want value locality (data clustered by common predicates), aka batch-order. Streaming jobs want temporal locality (data ordered by ingestion), aka stream-order. With a single Iceberg table, once you commit to one, the other becomes inefficient. Given this constraint, we can understand the two different approaches: Fluss chose stream-order in its Iceberg tables to support stream analytics constraints and avoid a second copy of the data.
That's a valid design decision: after all, Fluss is a streaming tabular storage layer for real-time analytics that fronts the lakehouse. But it does mean giving up the ability to use Iceberg's layout levers of partitioning and sorting to tune batch query performance. Confluent chose stream-order in Kora and one optionally batch-ordered Iceberg copy (via Tableflow materialization), letting the customer decide the optimum Iceberg layout. That's also a valid design decision: Confluent wants to connect systems of all kinds, be they real-time or not. Flexibility to handle diverse systems and diverse customer requirements wins out. But it does require a second copy of the data (causing higher storage costs). As the saying goes, the opposite of a good idea can be a good idea. It all depends on what you are building and what you want to prioritize. The only losing move is pretending you can have both (stream-optimized and batch-optimized workloads) in one Iceberg table without a cost. Once you factor in the compute cost of using one format for both workloads, the storage savings disappear. If you really need both, build two physical views and keep them in sync. Some related blog posts relevant to this one:

- Beyond Indexes: How Open Table Formats Optimize Query Performance
- Why I'm not a fan of zero-copy Apache Kafka-Apache Iceberg
- Understanding Apache Fluss

0 views
The Coder Cafe 1 month ago

Build Your Own Key-Value Storage Engine—Week 1

Curious how leading engineers tackle extreme scale challenges with data-intensive applications? Join Monster Scale Summit (free + virtual). It's hosted by ScyllaDB, the monstrously fast and scalable database. Agenda Week 0: Introduction Week 1: In-Memory Store Welcome to week 1 of Build Your Own Key-Value Storage Engine! Let's start by making sure what you're about to build in this series makes complete sense: what's a storage engine? A storage engine is the part of a database that actually stores, indexes, and retrieves data, whether on disk or in memory. Think of the database as the restaurant, and the storage engine as the kitchen that decides how food is prepared and stored. Some databases let you choose the storage engine. For example, MySQL uses InnoDB by default (based on B+-trees). Through plugins, you can switch to RocksDB, which is based on LSM trees. This week, you will build an in-memory storage engine and the first version of the validation client that you will reuse throughout the series. 💬 If you want to share your progress, discuss solutions, or collaborate with other coders, join the community Discord server ( channel): Join the Discord Keys are lowercase ASCII strings. Values are ASCII strings. NOTE: Assumptions persist for the rest of the series unless explicitly discarded. The request body contains the value. If the key exists, update its value and return success. If the key doesn't exist, create it and return success. Keep all data in memory. If the key exists, return 200 OK with the value in the body. If the key does not exist, return 404 Not Found. Implement a client to validate your server: Read the testing scenario from this file: put.txt. Run an HTTP request for each line: → Send a to with body. → Send a to. Confirm that is returned. If not, something is wrong with your implementation. → Send a GET to. Confirm that is returned. If not, something is wrong with your implementation.
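Stripped of the HTTP layer, the week-1 engine is little more than a dict. A sketch, with the 200/404 status codes returned directly as values as my simplification of the server described above:

```python
class InMemoryStore:
    """Week 1 storage engine: an in-memory key-value map with
    create-or-update PUT semantics and a GET that distinguishes
    missing keys (404) from present ones (200)."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        # Both the create and the update paths return success.
        self.data[key] = value
        return 200

    def get(self, key):
        if key in self.data:
            return 200, self.data[key]
        return 404, None
```

The validation client then only needs to check that a PUT followed by a GET of the same key returns the value, and that a GET of an unknown key reports not-found.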
Each request must be executed sequentially, one line at a time; otherwise, out-of-order responses may fail the client’s assertions. If you want to generate an input file with a different number of lines, you can use this Go generator : is the format to generate. is the number of lines. At this stage, you need a -type file, so for example, if you need one million lines: Add basic metrics for latency: Record start and end time for each request. Keep a small histogram of latencies in milliseconds. At the end, print , , and . This work is optional as there is no latency target in this series. However, it can be an interesting point of comparison across weeks to see how your changes affect latency. That’s it for this week! You have built a simple storage engine that keeps everything in memory. In two weeks, we will level up. You will delve into a data structure widely used in key-value databases: LSM trees. Missing direction in your tech career? At The Coder Cafe, we serve timeless concepts with your coffee to help you master the fundamentals. Written by a Google SWE and trusted by thousands of readers, we support your growth as an engineer, one coffee at a time. ❤️ If you enjoyed this post, please hit the like button.
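For the optional latency metrics, a sorted list of samples is enough at this scale. Which summary statistics the post prints was lost in extraction; p50/p99/max below are a guess, and the whole class is a sketch, not the series' reference solution:

```python
import bisect

class LatencyTracker:
    """Records per-request latencies in milliseconds and reports
    simple percentiles from the sorted sample list."""
    def __init__(self):
        self.samples = []               # kept sorted via insort

    def record(self, start_s, end_s):
        bisect.insort(self.samples, (end_s - start_s) * 1000.0)

    def percentile(self, p):
        if not self.samples:
            return None
        idx = min(len(self.samples) - 1, int(p / 100.0 * len(self.samples)))
        return self.samples[idx]

    def summary(self):
        return {"p50": self.percentile(50),
                "p99": self.percentile(99),
                "max": self.samples[-1] if self.samples else None}
```

Wrap each HTTP call in the client with a start/end timestamp and feed the pair to `record`; print `summary()` once the input file is exhausted.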

1 view
Simon Willison 1 month ago

A new SQL-powered permissions system in Datasette 1.0a20

Datasette 1.0a20 is out with the biggest breaking API change on the road to 1.0, improving how Datasette's permissions system works by migrating permission logic to SQL running in SQLite. This release involved 163 commits, with 10,660 additions and 1,825 deletions, most of which was written with the help of Claude Code. Datasette's permissions system exists to answer the following question: Is this actor allowed to perform this action, optionally against this particular resource? An actor is usually a user, but might also be an automation operating via the Datasette API. An action is a thing they need to do - things like view-table, execute-sql, insert-row. A resource is the subject of the action - the database you are executing SQL against, the table you want to insert a row into. Datasette's default configuration is public but read-only: anyone can view databases and tables or execute read-only SQL queries but no-one can modify data. Datasette plugins can enable all sorts of additional ways to interact with databases, many of which need to be protected by a form of authentication. Datasette 1.0 also includes a write API, which needs configuration for who can insert, update, and delete rows or create new tables. Actors can be authenticated in a number of different ways provided by plugins using the actor_from_request() plugin hook. datasette-auth-passwords, datasette-auth-github, and datasette-auth-existing-cookies are examples of authentication plugins. The previous implementation included a design flaw common to permissions systems of this nature: each permission check involved a function call which would delegate to one or more plugins and return a True/False result. This works well for single checks, but has a significant problem: what if you need to show the user a list of things they can access, for example the tables they can view? I want Datasette to be able to handle potentially thousands of tables - tables in SQLite are cheap!
I don't want to have to run 1,000+ permission checks just to show the user a list of tables. Since Datasette is built on top of SQLite we already have a powerful mechanism to help solve this problem. SQLite is really good at filtering large numbers of records. The biggest change in the new release is that I've replaced the previous plugin hook - which let a plugin determine if an actor could perform an action against a resource - with a new permission_resources_sql(actor, action) plugin hook. Instead of returning a True/False result, this new hook returns a SQL query producing rules that help determine which resources the current actor can execute the specified action against. Here's an example, lifted from the documentation: This hook grants the actor with ID "alice" permission to view the "sales" table in the "accounting" database. The returned SQL should always produce four columns: a parent, a child, an allow flag (1 or 0), and a reason string for debugging. When you ask Datasette to list the resources an actor can access for a specific action, it will combine the SQL returned by all installed plugins into a single query that joins against the internal catalog tables and efficiently lists all the resources the actor can access. This query can then be limited or paginated to avoid loading too many results at once. Datasette has several additional requirements that make the permissions system more complicated. Datasette permissions can optionally act against a two-level hierarchy. You can grant a user the ability to insert-row against a specific table, or every table in a specific database, or every table in every database in that Datasette instance. Some actions can apply at the table level, others the database level and others only make sense globally - enabling a new feature that isn't tied to tables or databases, for example.
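The mechanics can be sketched with plain sqlite3: each plugin contributes a SELECT of (parent, child, allow, reason) rows, the queries are combined with UNION ALL, and one query over the catalog answers "which tables can this actor view?" in a single pass. The schema, rule data, and flat deny-beats-allow precedence below are simplifications of mine, not Datasette's actual internals:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog_tables (db TEXT, name TEXT)")
conn.executemany("INSERT INTO catalog_tables VALUES (?, ?)",
                 [("accounting", "sales"), ("accounting", "payroll"),
                  ("public", "blog")])

# One SELECT per plugin. NULL parent/child acts as a wildcard;
# allow = 0 is a veto for that specific resource.
plugin_rules = [
    "SELECT NULL AS parent, NULL AS child, 1 AS allow, 'default allow' AS reason",
    "SELECT 'accounting', 'payroll', 0, 'payroll is restricted'",
]
rules_sql = " UNION ALL ".join(plugin_rules)

visible = conn.execute(f"""
    SELECT t.db, t.name FROM catalog_tables t
    WHERE EXISTS (SELECT 1 FROM ({rules_sql}) r
                  WHERE (r.parent IS NULL OR r.parent = t.db)
                    AND (r.child IS NULL OR r.child = t.name)
                    AND r.allow = 1)
      AND NOT EXISTS (SELECT 1 FROM ({rules_sql}) r
                      WHERE r.parent = t.db AND r.child = t.name
                        AND r.allow = 0)
    ORDER BY 1, 2
""").fetchall()
```

The listing is one filtered scan instead of one permission check per table, and it can be paginated like any other query. (The real resolution is hierarchy-aware, so this flat-precedence sketch deliberately undersells it.)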
Datasette currently has ten default actions but plugins that add additional features can register new actions to better participate in the permissions system. Datasette's permission system has a mechanism to veto permission checks - a plugin can return a deny for a specific permission check which will override any allows. This needs to be hierarchy-aware - a deny at the database level can be outvoted by an allow at the table level. Finally, Datasette includes a mechanism for applying additional restrictions to a request. This was introduced for Datasette's API - it allows a user to create an API token that can act on their behalf but is only allowed to perform a subset of their capabilities - just reading from two specific tables, for example. Restrictions are described in more detail in the documentation. That's a lot of different moving parts for the new implementation to cover. Since permissions are critical to the security of a Datasette deployment it's vital that they are as easy to understand and debug as possible. The new alpha adds several new debugging tools, including this page that shows the full list of resources matching a specific action for the current user: And this page listing the rules that apply to that question - since different plugins may return different rules which get combined together: This screenshot illustrates two of Datasette's built-in rules: there is a default allow for read-only operations such as view-table (which can be overridden by plugins) and another rule that says the root user can do anything (provided Datasette was started with the option.) Those rules are defined in the datasette/default_permissions.py Python module. There's one question that the new system cannot answer: provide a full list of actors who can perform this action against this resource. It's not possible to provide this globally for Datasette because Datasette doesn't have a way to track what "actors" exist in the system.
SSO plugins such as datasette-auth-github mean a new authenticated GitHub user might show up at any time, with the ability to perform actions despite the Datasette system never having encountered that particular username before. API tokens and actor restrictions come into play here as well. A user might create a signed API token that can perform a subset of actions on their behalf - the existence of that token can't be predicted by the permissions system. This is a notable omission, but it's also quite common in other systems. AWS cannot provide a list of all actors who have permission to access a specific S3 bucket, for example - presumably for similar reasons. Datasette's plugin ecosystem is the reason I'm paying so much attention to ensuring Datasette 1.0 has a stable API. I don't want plugin authors to need to chase breaking changes once that 1.0 release is out. The Datasette upgrade guide includes detailed notes on upgrades that are needed between the 0.x and 1.0 alpha releases. I've added an extensive section about the permissions changes to that document. I've also been experimenting with dumping those instructions directly into coding agent tools - Claude Code and Codex CLI - to have them upgrade existing plugins for me. This has been working extremely well. I've even had Claude Code update those notes itself with things it learned during an upgrade process! This is greatly helped by the fact that every single Datasette plugin has an automated test suite that demonstrates the core functionality works as expected. Coding agents can use those tests to verify that their changes have had the desired effect. I've also been leaning heavily on to help with the upgrade process. I wrote myself two new helper scripts - and - to help test the new plugins. The and implementations can be found in this TIL.
Some of my plugin upgrades have become a one-liner to the command, which runs OpenAI Codex CLI with a prompt without entering interactive mode. There are still a bunch more to go - there's a list in this tracking issue - but I expect to have the plugins I maintain all upgraded pretty quickly now that I have a solid process in place.

This change to Datasette core is by far the most ambitious piece of work I've ever attempted using a coding agent. Last year I agreed with the prevailing opinion that LLM assistance was much more useful for greenfield coding tasks than for working on existing codebases. The amount you could usefully get done was greatly limited by the need to fit the entire codebase into the model's context window. Coding agents have entirely changed that calculation. Claude Code and Codex CLI still have relatively limited context windows - albeit larger than last year's - but their ability to search through the codebase, read extra files on demand and "reason" about the code they are working with has made them vastly more capable. I no longer see codebase size as a limiting factor for how useful they can be.

I've also spent enough time with Claude Sonnet 4.5 to build a weird level of trust in it. I can usually predict exactly what changes it will make for a given prompt. If I tell it "extract this code into a separate function" or "update every instance of this pattern" I know it's likely to get it right. For something like permission code I still review everything it does, often by watching it as it works, since it displays diffs in the UI. I also pay extremely close attention to the tests it writes. Datasette 1.0a19 already had 1,439 tests, many of which exercised the existing permission system. 1.0a20 increases that to 1,583 tests. I feel very good about that, especially since most of the existing tests continued to pass without modification.

I built several different proof-of-concept implementations of SQL permissions before settling on the final design.
My research/sqlite-permissions-poc project was the one that finally convinced me of a viable approach. That one started as a free-ranging conversation with Claude, at the end of which I told it to generate a specification, which I then fed into GPT-5 to implement. You can see that specification at the end of the README. I later fed the POC itself into Claude Code and had it implement the first version of the new Datasette system based on that previous experiment. This is admittedly a very weird way of working, but it helped me finally break through on a problem that I'd been struggling with for months.

Now that the new alpha is out my focus is upgrading the existing plugin ecosystem to use it, and supporting other plugin authors who are doing the same. The new permissions system unlocks some key improvements to Datasette Cloud concerning finely-grained permissions for larger teams, so I'll be integrating the new alpha there this week. This is the single biggest backwards-incompatible change required before Datasette 1.0. I plan to apply the lessons I learned from this project to the other, less intimidating changes. I'm hoping this can result in a final 1.0 release before the end of the year!

= "test against datasette dev" - it runs a plugin's existing test suite against the current development version of Datasette checked out on my machine.
It passes extra options through to so I can run or as needed. = "run against datasette dev" - it runs the latest dev command with the plugin installed.

When working on anything relating to plugins it's vital to have at least a few real plugins that you upgrade in lock-step with the core changes. The and shortcuts were invaluable for productively working on those plugins while I made changes to core.

Coding agents make experiments much cheaper. I threw away so much code on the way to the final implementation, which was psychologically easier because the cost to create that code in the first place was so low.

Tests, tests, tests. This project would have been impossible without that existing test suite. The additional tests we built along the way give me confidence that the new system is as robust as I need it to be.

Claude writes good commit messages now! I finally gave in and let it write these - previously I'd been determined to write them myself. It's a big time saver to be able to say "write a tasteful commit message for these changes". Claude is also great at breaking up changes into smaller commits, and it can productively rewrite history to make it easier to follow - especially useful if you're still working in a branch.

A really great way to review Claude's changes is with the GitHub PR interface. You can attach comments to individual lines of code and then later prompt Claude to apply them. This is a very quick way to make little nitpick changes - rename this function, refactor this repeated code, add types here and so on.

The code I write with LLMs is higher quality code. I usually find myself making constant trade-offs while coding: this function would be neater if I extracted this helper, it would be nice to have inline documentation here, changing this would be good but would break a dozen tests... For each of those I have to decide if the additional time is worth the benefit.
Claude can apply changes so much faster than me that these calculations have changed - almost any improvement is worth applying, no matter how trivial, because the time cost is so low.

Internal tools are cheap now. The new debugging interfaces were mostly written by Claude and are significantly nicer to use and look at than the hacky versions I would have knocked out myself, if I had even taken the extra time to build them.

That trick with a Markdown file full of upgrade instructions works astonishingly well - it's the same basic idea as Claude Skills. I maintain over 100 Datasette plugins now and I expect I'll be automating all sorts of minor upgrades in the future using this technique.

Dangling Pointers, 1 month ago

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance

High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance Maximilian Kuschewski, Jana Giceva, Thomas Neumann, and Viktor Leis SIGMOD'25

In database vernacular, spilling is the process of writing intermediate data to disk in order to evaluate a query with a finite amount of main memory. It goes without saying that database folks are control freaks, er, performance conscious, and don't want to rely on generic OS paging mechanisms to handle working sets which are larger than main memory. This tidbit about Snowflake is fascinating: Only 5% of analytical queries in Snowflake's 2018 workload trace spill data, but those 5% contribute 45% of the overall CPU time and 29% of the total execution time [77].

One fundamental problem in this area is that it is very hard to predict up-front exactly how much working memory will be needed to efficiently execute a query. Databases have to estimate based on statistics of the relations used in a query, or assume no spilling will be needed and then gracefully fall back to spilling if necessary. The two operators this paper deals with are joins and aggregations. Both involve key columns, and the cardinality of the key columns is critical in determining whether spilling is necessary.

One obvious spilling mechanism is to use a partitioning approach for joins and aggregations. I've described partitioned joins in summaries of papers such as Efficiently Processing Joins and Grouped Aggregations on GPUs. The reason partitioning nicely solves the problem is that the working set requirements for both steps of a partitioned join are modest. The partitioning step only requires a small amount of memory per partition (e.g., accumulate 64KiB per partition before appending partitioned tuples to on-disk storage). The join step only needs enough working memory to join a single partition. Section 4.1 of the paper claims that partitioning slows down TPC-H queries by 2-5x.
My spider sense is tingling, but let's take this as an axiom for now. Here is the premise of the paper: partitioning is better for queries that must spill, but worse for queries that can be completed without spilling. What is an efficient and simple design, given that a database cannot perfectly predict up-front whether it will need to spill?

Prior work along similar lines introduced the hybrid hash join. A hybrid hash join partitions the left (build) input of the join and dynamically decides what percentage of build partitions must be spilled. A hash table is built containing all non-spilled build partitions. Next, the right (probe) input to the join is processed. For each probe tuple, the database determines if the associated partition was spilled. If the partition was spilled, then the probe tuple is spilled. If not, then the probe tuple is processed immediately via a lookup in the hash table. Finally, all spilled partitions are processed. The downside of this approach is that it always partitions the build side, even when that is unnecessary.

This paper proposes a join implementation that only pays the cost of partitioning when spilling is required. It is a two-phase process. In the materialization phase, the build table is scanned (and pushed-down filters are applied). The resulting tuples are stored in a list of pages. At first, the system optimistically assumes that no spilling is necessary and appends each tuple to the current page. If a memory limit is reached, then partitioning is enabled. Each tuple processed after that point is partitioned, and per-partition lists of pages are allocated. If a further memory limit is reached, then some partitions are spilled to disk.

Next, the build and probe phase executes in a manner similar to hybrid hash join. However, there is a fast path for the case where no spilling occurred.
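The graceful degradation in the materialization phase can be sketched as a tiny in-memory simulation. This is a simplification under stated assumptions - the class name, the tuple-count limit standing in for a byte-based memory limit, and the fixed partition count are all illustrative, and the second limit (spilling partitions to disk) is omitted:

```python
# Simplified sketch of the paper's materialization phase: append tuples
# to a single list while below a memory limit (optimistic fast path),
# then switch to partitioned lists for every tuple after the limit.
class Materializer:
    def __init__(self, partition_limit, num_partitions=4):
        self.partition_limit = partition_limit  # tuples before partitioning starts
        self.num_partitions = num_partitions
        self.unpartitioned = []                 # optimistic fast path
        self.partitions = None                  # allocated lazily on demand

    def add(self, key, payload):
        if self.partitions is None:
            if len(self.unpartitioned) < self.partition_limit:
                self.unpartitioned.append((key, payload))
                return
            # Limit reached: enable partitioning for all later tuples.
            self.partitions = [[] for _ in range(self.num_partitions)]
        self.partitions[hash(key) % self.num_partitions].append((key, payload))

m = Materializer(partition_limit=2)
for k in range(5):
    m.add(k, "row%d" % k)
assert len(m.unpartitioned) == 2              # early tuples stay unpartitioned
assert sum(len(p) for p in m.partitions) == 3  # later tuples were partitioned
```

If the limit is never hit, everything stays in the single `unpartitioned` list - the no-spill fast path that the build and probe phase can exploit.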
In this case, the tuples produced by the materialization phase are inserted into one large hash table, and then the probe tuples are streamed, with one hash table lookup per tuple. If partitioning (but no spilling) occurred, then the hash table inserts will have high locality (assuming a chained hash table). If spilling did occur, then the build and probe phase operates like a hybrid hash join, spilling probe tuples if and only if the associated build partition was spilled. The paper isn't clear on what happens to non-partitioned build tuples once the system decides to start spilling build partitions. My assumption is that in this case, probe tuples must probe both the spilled build partitions and the build tuples that were processed before partitioning was enabled.

The implementation of aggregation described in this paper follows a similar two-phase approach. The materialization phase performs pre-aggregation: the system starts by aggregating into an in-memory hash table. If the hash table grows too large, then partitioning kicks in. Tuples from in-memory hash tables are evicted into per-partition page lists, and subsequent tuples are stored directly in these per-partition page lists. These per-partition pages can be spilled if further memory limits are reached. The subsequent phase then performs any necessary final aggregation. If no eviction occurred, then this phase has no work to do.

The paper describes an adaptive compression algorithm that improves spilling performance. Some interesting numbers from the paper: the number of CPU cycles per input byte for executing an in-memory TPC-H query ranges from 3.3 to 60.3, while the number of CPU cycles required to write and read a byte to/from SSD is 11.1. The adaptive nature of this scheme is driven by the fact that queries are diverse, and compressors have knobs which trade off speed for compression ratio, as illustrated by Fig. 3 (source: https://dl.acm.org/doi/10.1145/3698813). When spilling occurs, the system dynamically balances CPU usage and IO bandwidth by adjusting these knobs. Fig. 5 shows in-memory throughput while Fig. 6 shows throughput when spilling occurs (source: https://dl.acm.org/doi/10.1145/3698813).

Dangling Pointers

The design of the unified hash join hinges on the fact that partitioning is a bad idea for the in-memory case. This is in contrast to papers describing in-memory partitioned join implementations on other types of chips, like SPID-Join and Efficiently Processing Joins and Grouped Aggregations on GPUs. I imagine there is a lot of literature about this topic that I haven't read. Leave a comment with your experience.
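Those cycles-per-byte numbers invite a back-of-envelope calculation for when compressing spilled data pays off. The compression costs and ratios below are hypothetical knob settings, not measurements from the paper - only the 11.1 cycles/byte I/O figure comes from it:

```python
# Spilling a byte costs ~11.1 CPU cycles (write plus read back from SSD),
# while queries themselves cost only 3.3-60.3 cycles per input byte, so
# raw spill I/O can dominate cheap queries. Compressing spilled data at
# `compress_cycles` per byte with ratio `ratio` wins whenever compression
# work plus the reduced I/O beats raw I/O. Knob values are hypothetical.
IO_CYCLES = 11.1  # paper's cost to write and read back one byte on SSD

def spill_cycles_per_byte(ratio=1.0, compress_cycles=0.0):
    return compress_cycles + IO_CYCLES / ratio

raw = spill_cycles_per_byte()                                  # no compression
fast = spill_cycles_per_byte(ratio=2.0, compress_cycles=1.5)   # cheap codec
heavy = spill_cycles_per_byte(ratio=4.0, compress_cycles=9.0)  # expensive codec

# A cheap codec beats raw I/O; a heavy one can cost more CPU than it saves.
assert fast < raw < heavy
```

This is exactly the tradeoff the adaptive scheme navigates: which knob setting wins depends on how CPU-hungry the query already is and how compressible its intermediate data turns out to be.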
