Posts in Database (20 found)

SQLAlchemy 2 In Practice - Solutions to the Exercises

To conclude with my SQLAlchemy 2 in Practice series, this article contains the solutions to all the exercises. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you!

0 views
Simon Willison 1 weeks ago

Datasette Agent

We just announced the first release of Datasette Agent , a new extensible AI assistant for Datasette. I've been working on my LLM Python library for just over three years now, and Datasette Agent represents the moment that LLM and Datasette finally come together. I'm really excited about it! Datasette Agent provides a conversational interface for asking questions of the data you have stored in Datasette. Add the datasette-agent-charts plugin and it can generate charts of your data as well. The announcement post (on the new Datasette project blog) includes this demo video : I recorded the video against the new agent.datasette.io live demo instance, which runs Datasette Agent against example databases including the classic global-power-plants by WRI , and a copy of the Datasette backup of my blog. The live demo runs on Gemini 3.1 Flash-Lite - it's cheap, fast and has no trouble writing SQLite queries. A question I asked in the demo was: when did Simon most recently see a pelican? Which ran this SQL query : And replied: The most recent sighting of a pelican by Simon was recorded on May 20, 2026 . The observation included a California Brown Pelican, along with a Common Loon, Canada Goose, Striped Shore Crab, and a California Sea Lion. Here's that sighting on my blog , and the Markdown export of the full conversation transcript. My favorite feature of Datasette Agent is that, like the rest of Datasette, it's extensible using plugins. We've shipped three plugins so far: Building plugins is really fun . I have a bunch more prototypes that aren't quite alpha-quality yet. Claude Code and OpenAI Codex are both proving excellent at writing plugins - just point them at a checkout of the datasette-agent repo for reference and tell them what you want to build! I've also been having fun running the new plugin against local models. Here's a one-liner to run the plugin against gemma-4-26b-a4b in LM Studio on a Mac: Datasette Agent needs reliable tool calls and the ability for a model to produce SQL queries that run against SQLite. The open weight models released in the past six months are increasingly able to handle that. Datasette Agent opens up so many opportunities for the LLM and Datasette ecosystem in general. It's already informed the major LLM 0.32a0 refactor which I'm nearly ready to roll into a stable release, maybe with some additional "LLM agent" abstractions extracte from Datasette Agent itself. I've been exploring my own take on the Claude Artifacts, which is shaping up nicely as a plugin. I'm excited to use Datasette Agent to build my own Claw - a personal AI assistant built around data imported from different parts of my digital life, which is a neat excuse to revisit my older Dogsheep family of tools. We'll also be rolling out Datasette Agent for users of Datasette Cloud . Join our #datasette-agent Discord channel if you'd like to talk about the project. You are only seeing the long-form articles from my blog. Subscribe to /atom/everything/ to get all of my posts, or take a look at my other subscription options . datasette-agent-charts , shown in the video, adds charts to Datasette Agent, powered by Observable Plot . datasette-agent-openai-imagegen adds an image generation tool to Datasette Agent using ChatGPT Images 2.0 . datasette-agent-sprites provides tools for executing code in a Fly Sprites persistent sandbox.

0 views
The Coder Cafe 1 weeks ago

LSM Trees Explained

☕ Welcome to The Coder Cafe! Some people reached out after I published the Build Your Own Key-Value Storage Engine series to say they hadn’t gone through all eight posts, but they were curious about the core ideas. So I distilled everything into a single post. No implementation, no exercises, just the core concepts behind LSM trees. Get cozy, grab a coffee, and let’s begin! Fundamental Insights To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place . When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection. LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially . Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward. The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around. The in-memory buffer is called the memtable . The sorted files on disk are called SSTables . We’ll look at each in detail. Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store . When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput. A hashtable works for lookups but not for in-order iteration. Sorting a hashtable takes at flush time. A better choice is an ordered data structure. The most common in practice is a skip list ; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too. Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order. The memtable doesn’t grow forever . At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL. There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability . The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk . Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client. This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes. The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet. One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. To guarantee durability, the engine must call after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though. adds latency to every write. Production systems often use instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call once for the group. The WAL introduces write amplification : the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability. As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable . An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: Immutability makes SSTables safe to read concurrently without locking. Sorted order makes lookups inside a file efficient. In a simple implementation, an SSTable is just a JSON array of key-value pairs, sorted by key: Production systems use a binary block-based format instead. The SSTable is divided into fixed-size blocks, typically 4 KB, though the exact size varies by implementation. Data blocks hold the actual key-value entries. The SSTable also contains an index block storing the first key of each data block, which makes it possible to binary search for the right block without reading the entire file. In most implementations, the index block is written at the end of the file, since block boundaries are only known after all data blocks have been streamed out. To look up a key, we read the index block, binary search it to find the right data block, fetch that single block from disk, verify its integrity with a checksum, and then binary search within the block. When the index block is not cached, this means most lookups read two disk pages: the index block and one data block. In practice, index blocks are typically kept in memory, so most lookups require only one disk read . Each data block also carries a checksum computed over the block’s bytes. Before using the data, the engine verifies the checksum. If they don’t match, the block is corrupted, and the read fails safely rather than returning garbage. As SSTables accumulate, the engine maintains a catalog file (often called a MANIFEST in systems like RocksDB), which is an append-only log listing all existing SSTables in order of creation. This catalog is the engine’s source of truth for what files exist on disk. On startup, the engine reads it to know which files are live, and replays the WAL to restore the memtable. After a successful flush, the old WAL can be discarded. The data is now safely in an SSTable. Production systems also compress data blocks , typically with a fast algorithm like Snappy, LZ4, or zstd. Compression reduces disk footprint and I/O at the cost of CPU, and it interacts with block sizing: a compressed block may be smaller than a disk page, so implementations often track both logical and physical block sizes. LSM trees are optimized for writes. Reads are where the trade-off shows . To look up a key, the engine searches in order of recency : first the memtable, then SSTables from newest to oldest. The first match wins. This ordering matters because the same key can appear multiple times across different SSTables. Each write to a key produces a new entry rather than updating the existing one. The newest version is the correct one. The problem becomes clear as SSTables accumulate. A key that was written once and never updated might still require the engine to search through dozens of SSTables before finding it, or confirming it doesn’t exist. Each SSTable search is a disk read. This is called read amplification : a single logical read triggers multiple physical reads. For a key that doesn’t exist at all, the engine must check every SSTable before returning a not-found error. That’s the worst case for read amplification, and it gets worse the more SSTables there are. This is a fundamental tension in LSM trees, and it reflects a deeper principle known as the RUM conjecture: a storage engine can excel at two of reads, updates, and memory efficiency, but not all three at once . LSM trees make a deliberate choice: optimize for updates, accept read amplification as the cost. The sorted structure also enables efficient range scans. To retrieve all keys between and , the engine scans the memtable in order, then merges sorted streams from the relevant SSTables. The answer to accumulating SSTables is compaction . Compaction is a background process that takes multiple SSTables, merges them into fewer, cleaner ones , and discards the originals. The result is fewer files to search through, which directly reduces read amplification. It also reclaims disk space consumed by redundant entries: if the same key appears in three different SSTables, compaction keeps only the newest version and discards the rest. One common algorithm is a k-way merge . The engine opens iterators over all SSTables being compacted, each positioned at the first entry. It uses a min-heap to always pull the smallest key across all iterators. When the same key appears in multiple SSTables, the engine picks the version from the newest SSTable and discards the older ones. The merged output is streamed into new SSTable files. In practice, real systems limit the number of SSTables that can participate in a single compaction run to keep resource consumption under control. Updating the catalog after compaction requires care . The engine must not delete the old SSTables before the new ones are safely written to disk. The safe sequence is: write new SSTables, fsync, write a new catalog pointing to the new files, fsync, then delete the old SSTables. A crash at any point leaves the engine in a recoverable state: either the old files are still referenced by the old catalog, or the new files are referenced by the new catalog. Compaction is not free . It consumes I/O and CPU in the background, competing with foreground reads and writes. Every byte of data gets rewritten multiple times across its lifetime, adding to write amplification. Tuning when compaction triggers (and how aggressively it runs) is one of the main knobs in LSM tree performance. We might expect deletion to be straightforward: find the key, remove it. In an LSM tree, it is anything but straightforward . SSTables are immutable. We cannot reach into an existing SSTable and remove an entry. So when a key is deleted, the engine writes a special marker to the memtable called a tombstone , an entry that says “ this key is deleted ”. It eventually gets flushed to an SSTable like any other write. During reads, the engine respects tombstones. If a tombstone for a key is found before a value for that key, scanning newest to oldest, the key is treated as deleted, and a not-found error is returned. The tombstone shadows any older value. The tricky part is knowing when it is safe to discard a tombstone during compaction. Consider this situation: a tombstone for key exists in a newer SSTable, and an old value for exists in an older SSTable that hasn’t been compacted yet. If we drop the tombstone during compaction without also removing the old value, the old value becomes visible again. Deleted data reappears. This is called data resurrection , and it is a correctness bug. NOTE : Correctness here means the engine returns what was actually written, not a stale or deleted value. This is different from consistency in the distributed systems sense, which describes the guarantees clients have about which version of data they see across replicas. The rule is strict: a tombstone can only be dropped when the engine can guarantee that no older value for that key exists anywhere below it on disk . In practice, this means the compaction must include the oldest SSTables that could still hold a shadowed value. This is one of those details that seems minor until we get it wrong. A storage engine that resurrects deleted data is not a storage engine we can trust. Getting this right requires knowing exactly where older values can hide, which brings us to how SSTables are organized on disk. Basic compaction, merging all SSTables into one flat pool, works but doesn’t scale. As the dataset grows, a flat pool of SSTables means reads still have to check many files. Leveling is the structural answer . In a leveled LSM tree, SSTables are organized into levels: , , , and so on. Each level has different rules: is the landing zone . When the memtable flushes, the resulting SSTable lands in L0. files can have overlapping key ranges: two L0 files might both contain entries for key . This is acceptable because L0 files are small and short-lived. and deeper levels are different. Each level maintains non-overlapping key ranges across all its files . A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in , we don’t scan all files. We use the key ranges to jump directly to the one file that could contain it. When accumulates enough files, a compaction runs to merge into . This merge enforces the non-overlapping invariant: files (which may overlap) get merged with the relevant L1 files (which define the ranges), producing new files with clean, non-overlapping ranges. Similarly, when grows too large, a compaction merges part of into . Each deeper level is typically larger by a fixed ratio, for example, 10x. might hold 10 MB, 100 MB, 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels. The benefit is controlled read amplification . To look up a key, we check the memtable, scan all files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist. When compaction falls behind and accumulates too many files, the engine may trigger a write stall : new writes are paused until compaction catches up and is drained. This is one of the more painful operational issues in LSM-based systems. Leveled compaction is also not the only strategy. Tiered compaction , used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload. Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist . For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this. A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low. Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there. How does it work? A Bloom filter is a bitset. When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable. The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read . Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically. Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk. Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background . This is where things get subtle. The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash. One common solution is a versioned catalog . A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count. Background workers (the flush worker and the compaction worker) never modify an existing catalog . Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely. An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove. This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap , which in many implementations is a single atomic pointer update. The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads. The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing. The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery. SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe. A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state. Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all files. Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O. Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk. Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level. Tiered compaction is an alternative strategy that trades less write amplification for more read amplification. Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario. A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state. CRDTs Explained Availability Models Explained The PACELC Theorem Explained The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper. Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary . Build Your Own Key-Value Storage Engine IO devices and latency Fundamental Insights To understand LSM trees, we first need to understand why writes are hard. A B-tree-based database updates data in place . When we write a key, the engine finds the right page on disk and modifies it. This is a random write: the disk head has to seek to an arbitrary location before writing. On spinning disks, that seek takes time. But even on SSDs, random writes cause problems: they wear out cells unevenly and trigger expensive internal garbage collection. LSM trees take a completely different approach. Instead of writing data where it ultimately belongs, they write data sequentially . Writes are recorded in memory and appended to a log file for durability. When the in-memory buffer fills up, its contents are streamed to a new file in one sequential pass. Sequential writes are dramatically faster than random writes because there is no seeking involved. The disk just keeps writing forward. The price of this design is complexity. Data doesn’t live in one place. It accumulates across multiple files over time, and those files need to be periodically merged and reorganized in the background to stay manageable. That background work is what every piece of an LSM tree is built around. The in-memory buffer is called the memtable . The sorted files on disk are called SSTables . We’ll look at each in detail. The Memtable Every write in an LSM tree starts in memory, in a structure called the memtable. The memtable is a mutable, in-memory store . When a write request arrives, the engine records the key-value pair in the memtable and appends it to a sequential log file on disk (called the write-ahead log, or WAL, which we’ll cover in the next section). The WAL write is a sequential append, so it is fast. There is no random I/O, no page lookup, no in-place modification. This is why LSM trees can sustain very high write throughput. A hashtable works for lookups but not for in-order iteration. Sorting a hashtable takes at flush time. A better choice is an ordered data structure. The most common in practice is a skip list ; for example, LevelDB and RocksDB both use one as their default. A radix trie is another elegant option: it keeps keys in lexicographic order naturally, so iterating in order is just a depth-first traversal, and flushing becomes a simple stream with no sorting step needed. A balanced BST works too. Production implementations typically attach a monotonic sequence number to each entry, so the engine can always determine which version of a key is the most recent, regardless of arrival order. The memtable doesn’t grow forever . At some point, it gets flushed to disk, and a new empty memtable takes its place. What triggers that flush depends on the implementation: it can be a size limit (a number of entries or a memory threshold), elapsed time, or memory pressure, for example. That flush produces a sorted file on disk called an SSTable, which we’ll look at after the WAL. The Write-Ahead Log There is a problem with keeping writes in memory: if the process crashes, everything in the memtable is gone. Any write the client received an acknowledgment for is now lost. That breaks a core database guarantee: durability . The solution is a Write-Ahead Log, or WAL. Before writing to the memtable, the engine appends the operation to the WAL, an append-only file on disk . Only after the WAL entry is safely persisted does the engine update the memtable and acknowledge the client. This ordering is what the “write-ahead” in the name refers to: the log is always written before the in-memory state changes. The WAL is not the final home for data; it’s a safety net. If the engine crashes and restarts, it replays the WAL from the beginning to reconstruct the memtable, recovering any writes that hadn’t been flushed to disk yet. One subtlety: writing to a file is not the same as persisting it. Operating systems buffer writes in memory before flushing to disk. To guarantee durability, the engine must call after each WAL entry, forcing the OS to flush its buffers to physical storage. This is not free, though. adds latency to every write. Production systems often use instead, which persists the data without flushing unnecessary file metadata, keeping WAL appends faster. Many also use a technique called group commit to amortize this cost further: instead of syncing after every write, they batch multiple WAL entries and call once for the group. The WAL introduces write amplification : the ratio of data written to disk versus data actually requested by a client. Every byte we write to the database gets written to disk twice: once to the WAL immediately, and once to an SSTable when the memtable is eventually flushed. That cost buys us durability. SSTables As we said, when the memtable fills up, it gets written to disk as a Sorted String Table, or SSTable . An SSTable is an immutable, sorted file. Immutable means it is never modified after creation. Sorted means keys are stored in lexicographic order. Both properties matter: Immutability makes SSTables safe to read concurrently without locking. Sorted order makes lookups inside a file efficient. is the landing zone . When the memtable flushes, the resulting SSTable lands in L0. files can have overlapping key ranges: two L0 files might both contain entries for key . This is acceptable because L0 files are small and short-lived. and deeper levels are different. Each level maintains non-overlapping key ranges across all its files . A given key can exist in at most one file per level. This is the critical property that makes reads efficient: to look up a key in , we don’t scan all files. We use the key ranges to jump directly to the one file that could contain it. Each deeper level is typically larger by a fixed ratio, for example, 10x. might hold 10 MB, 100 MB, 1 GB, and so on. Most data ends up in the deepest level. Most compaction work happens between levels. The benefit is controlled read amplification . To look up a key, we check the memtable, scan all files, then do one binary search per deeper level. The number of deeper levels grows logarithmically with data size. For a dataset with a few levels, that’s a small, bounded number of disk reads, regardless of how many total SSTables exist. When compaction falls behind and accumulates too many files, the engine may trigger a write stall : new writes are paused until compaction catches up and is drained. This is one of the more painful operational issues in LSM-based systems. Leveled compaction is also not the only strategy. Tiered compaction , used by Cassandra, for example, takes a different approach: instead of enforcing non-overlapping ranges per level, it groups SSTables of similar size and merges them when a tier grows too large. Tiered compaction generates less write amplification but more read amplification. The right choice depends on the workload. Bloom Filters Leveling helps with reads, but there is still one painful case: looking up a key that doesn’t exist . For a missing key, the engine checks the memtable (not there), checks each L0 file (not there), then checks one file per deeper level (not there). Each check is a disk read. Even with leveling, this adds up. Bloom filters solve this. A Bloom filter is a probabilistic data structure that can answer one question: Is this key definitely not in this SSTable? It has no false negatives: if the key is in the SSTable, the filter will say so. It can have false positives (occasionally it says a key might be present when it isn’t), but in practice, the false positive rate is tunable and kept very low. Many implementations attach a Bloom filter to each SSTable, built at creation time from all the keys it contains. The filters are small, a few kilobytes per SSTable, so they can be loaded into memory at startup and kept there. How does it work? A Bloom filter is a bitset. When a key is added, several hash functions are applied to it, each producing an index into the bitset. The bit at each index is set to 1. To check if a key is in the filter, the same hash functions are applied. If any of the resulting bits is 0, the key is definitely not in the SSTable. No disk read needed. If all bits are 1, the key might be there, and the engine proceeds to read the SSTable. The practical impact is significant. For a key that doesn’t exist (the worst case), the engine skips almost every SSTable without a single disk read . Only the rare false positive triggers an unnecessary disk read. Read amplification for missing keys drops dramatically. Some engines take this further and attach Bloom filters not just per SSTable but per data block within an SSTable, enabling even more precise filtering before fetching a block from disk. Concurrency Everything described so far assumes a single thread. In reality, a storage engine needs to handle concurrent reads and writes, while flush and compaction run in the background . This is where things get subtle. The core problem: a flush operation replaces the current memtable with a new one and registers a new SSTable in the catalog. A compaction operation removes old SSTables and registers new ones. If a read is in the middle of searching an SSTable that gets deleted by a concurrent compaction, that’s a crash. One common solution is a versioned catalog . A catalog is a snapshot of the engine’s state at a point in time: a reference to the current memtable, the current WAL path, and the current catalog file. Every incoming request acquires the latest catalog version, pins it by incrementing a reference count, performs its work, then releases it by decrementing the reference count. Background workers (the flush worker and the compaction worker) never modify an existing catalog . Instead, when a flush or compaction completes, they create a new catalog version pointing to the updated memtable and SSTable set. From that moment, new requests acquire the new catalog. Old requests that pinned the previous catalog continue reading from it safely. An old catalog version is only cleaned up (its SSTables deleted, its WAL file discarded) when its reference count drops to zero. No reader is using it anymore, so it is safe to remove. This approach keeps foreground reads and writes lock-free in the hot path. Background operations never block requests, and requests never block background operations. They operate on independent catalog versions and only synchronize at the moment of catalog swap , which in many implementations is a single atomic pointer update. The versioned catalog is also what makes crash recovery clean. On startup, the engine reads the latest catalog file on disk, which always reflects a consistent state: either from before the last flush/compaction, or after. Any SSTables on disk not referenced by the catalog are orphans from an incomplete operation and can be safely deleted. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Summary LSM trees optimize for write throughput by turning random disk writes into sequential ones, at the cost of more complex reads. The memtable absorbs writes in memory; an ordered structure like a skip list, balanced BST, or radix trie keeps keys sorted for efficient flushing. The WAL provides durability: every write is logged to disk before the memtable is updated, enabling crash recovery. SSTables are immutable, sorted files produced by flushing the memtable; a binary block format with checksums makes point lookups efficient and reads safe. A catalog file tracks which SSTables are live and is updated atomically to ensure the engine always has a consistent view of disk state. Read amplification is the fundamental trade-off: finding a key may require searching multiple SSTables, one per level, plus all files. Compaction merges SSTables, eliminates redundant entries, and reclaims space, at the cost of write amplification and background I/O. Tombstones handle deletions in an immutable structure; they can only be discarded when no older value they shadow still exists on disk. Leveling organizes SSTables into levels with non-overlapping key ranges, bounding read amplification to one file lookup per level. Tiered compaction is an alternative strategy that trades less write amplification for more read amplification. Bloom filters allow the engine to skip SSTable reads for missing keys with near certainty, eliminating the worst-case read scenario. A versioned catalog is one common approach to enabling lock-free concurrent reads and background operations by letting each request pin a consistent snapshot of engine state. CRDTs Explained Availability Models Explained The PACELC Theorem Explained The Log-Structured Merge-Tree (LSM-Tree) // The original LSM tree whitepaper. Log Structured Merge Tree - ScyllaDB // LSM tree definition from ScyllaDB technical glossary . Build Your Own Key-Value Storage Engine IO devices and latency

0 views
Karboosx 2 weeks ago

How to actually handle database transactions (and why your ORM fails at it)

Wrapping database calls in a simple try-catch block is just asking for trouble. Here is a practical look at how to handle transactions properly, avoid the dreaded ORM lost update, and keep your data actually safe.

0 views
Aran Wilkinson 2 weeks ago

How I lost a database and learned to actually use AI

I ran AI-generated SQL without reading it properly and lost a database. The experience changed how I work with AI tools, replacing freeform chat sessions with a structured process built around PRDs, small tasks, and frequent commits.

0 views

SQLAlchemy 2 In Practice - Chapter 7: Asynchronous SQLAlchemy

This is the seventh chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! Starting with release 1.4, SQLAlchemy includes support for asynchronous programming with the package, for both the Core and ORM modules. This is an exciting improvement that brings the power of SQLAlchemy to modern applications such as those written with the FastAPI web framework.

0 views
Anton Zhiyanov 3 weeks ago

Solod v0.1: Go ergonomics, practical stdlib, native C interop

Solod ( So ) is a system-level language with Go syntax and zero runtime. It's designed for two main audiences: The initial version (let's call it v0) was focused on picking a subset of Go and translating it to C. The next logical step was to port Go's standard library and make it easier to interop with C. That's what the v0.1 release I'm presenting today is all about. Standard library • SQLite bindings • Persistent map • Store and retrieve • Command-line interface • Performance • Wrapping up Solod v0.1 ships with the following stdlib packages ported from Go: And a couple of its own packages: Stdlib documentation In the following sections, I'll demonstrate some of the v0.1 features using a simple example: a persistent key-value store backed by SQLite. Since So doesn't provide yet, we'll call SQLite directly through its C API. To do this, let's import the necessary headers with the directive and generate extern declarations using the sobind tool: The directive is required for constants ( ) and types ( ). As for functions ( ), we can just declare them without a body — the transpiler will treat them as extern declarations even without . With the SQLite API in place, let's implement a key-value type that wraps the database connection: Add a constructor that connects to an SQLite database and creates a table to store the items: As you can see, this So code looks a lot like regular Go code. However, there are some key differences: First, let's implement the method: No surprises here, just a bunch of SQLite API calls. The method is more interesting: The pointer returned by is managed by SQLite. It becomes invalid after calling (which does before returning). Because of this, we need to allocate a copy of the returned value, using in this case. So's approach to memory allocation is similar to Zig's — all heap allocations must be done explicitly by providing a specific instance of the interface. The caller, of course, must free the allocated string: Here, is a specific allocator that uses libc's and . Alternatively, we could use or any other implementation of the interface: With the type in place, let's create a simple CLI using the package: Then add command routing: Again, no surprises here — the package works just as it does in Go. Solod isn't trying to outperform hand-tuned C. Still, performance matters: the code is benchmarked and optimized to run reasonably fast. Since So compiles to plain C and then to native code with full optimizations, the results are sometimes better than Go's. Here are some highlights from the benchmarks: There're no GC pauses and no Cgo bridge cost when calling C libraries. The tradeoff is that you have to handle memory yourself, but as the SQLite example above shows, So's allocator interface makes that pretty manageable. Solod vs. Go benchmarks Solod is still in its early days, but with the v0.1 release, it's ready for hobby projects. The already-ported parts of the Go standard library make it easy to write command-line tools (check out the , , , and examples ). Plus, with native C interop, you can build just about anything else you need. The next release (v0.2) will likely focus on networking, concurrency, or both — along with more stdlib packages. If you're interested, take a look at So's readme — it has all the information you need to get started. Or try So online without installing anything. Go developers who want low-level control and zero-cost C interop, without having to learn a new language or standard library. C developers who like Go's style. , , and — Abstractions and types for general-purpose I/O. , , , and — Common byte and text operations. and — Generic heap-allocated data structures. and — Generating random data. , , and — Working with the command line and files. — Structured logging. — Measuring and displaying time. — Memory allocation with a pluggable allocator interface. — Low-level C interop helpers. When compiled, the code is first translated to plain C, then compiled into a native binary using GCC or Clang. Unlike Go, there is no runtime (no automatic heap memory allocation, no garbage collection, no goroutine scheduler). There is no overhead when calling C functions, unlike Go's Cgo. The interop syntax is a bit cleaner. For example, Go's ( in the call) automatically decays to C's . Buffered I/O is 3x faster than Go. String and byte operations are up to 2.5x faster. Maps are 1.5x faster for modifications. Integer formatting is 2x faster.

0 views
Phil Eaton 4 weeks ago

Automating Hermitage to see how transactions differ in MySQL and MariaDB

This is an external post of mine. Click here if you are not redirected.

0 views
The Coder Cafe 1 months ago

How Linux 7.0 Broke PostgreSQL

☕ Welcome to The Coder Cafe! On April 3, 2026, Salvatore Dipietro, an engineer at AWS, posted a patch to the Linux kernel mailing list. The reason: on a 96-vCPU Graviton4 machine running Linux 7.0, PostgreSQL throughput had dropped to roughly half of what it produced on Linux 6.x. In this post, we will trace what changed in Linux 7.0, how PostgreSQL manages memory, and what role memory pages play in making the problem appear (or disappear). Get cozy, grab a coffee, and let’s begin! The Problem Salvatore Dipietro ran pgbench (PostgreSQL’s standard benchmarking tool) on a Graviton4 processor with 96 vCPUs. The workload was a benchmark doing simple updates at scale factor 8,470 (i.e., roughly a 847 million row table), simulating 1,024 clients and 96 threads. A serious, high-parallelism load designed to stress the system. The results were striking. Linux 7.0 delivered roughly half the throughput of Linux 6.x on the same hardware and workload: Linux 6.x : 98,565 transactions per second Linux 7.0 : 50,751 transactions per second To find where the time was going, Dipietro ran , a Linux profiling tool that samples what the CPU is actually doing. The result was unambiguous: 55% of the machine’s CPU time was spent inside a single function: . The culprit was traced back to a change in how Linux 7.0 schedules processes. Let’s start there. When multiple threads run on a machine, the OS needs to share the CPU between them. That’s the scheduler’s job. But the scheduler also decides something subtler: when to interrupt a running thread and hand the CPU to another. That decision is called preemption , and the answer varies depending on how the kernel is configured. Before Linux 7.0, there were three options: : The kernel almost never interrupts a running thread. A thread runs until it voluntarily gives up the CPU: when it makes a syscall, blocks on I/O, or explicitly sleeps. This was the traditional server default with fewer context switches, higher throughput, and predictable behavior under load. : The kernel can interrupt a running thread at almost any safe point, even if it is in the middle of doing useful work. This means a thread never has to wait for the current one to finish its slice before getting CPU time, which reduces response time but increases context-switch overhead. Historically, the desktop default, where responsiveness matters more than raw throughput. : Introduced in Linux 6.12 as a compromise between the two. The scheduler can interrupt threads, but tries to wait for natural boundaries rather than cutting in aggressively. The intent is to approximate ‘s throughput behavior while still allowing preemption when needed. In Linux 7.0, was removed as an option on modern CPU architectures, leaving only and . Indeed, was designed to be a drop-in replacement on throughput workloads, and for the vast majority of server software, it is. But PostgreSQL hit a specific case where the difference is catastrophic, and to understand why, we need to look at how PostgreSQL manages memory. PostgreSQL, like most databases, doesn’t store data as rows in a flat file. Instead, it uses a fixed-size abstraction called a data page (8 KB by default) as its basic unit of storage. Everything on disk (e.g., table rows, B-tree index nodes, metadata) is stored in these pages. A table with millions of rows is ultimately a large sequence of data pages on disk. Reading from disk is slow. So PostgreSQL maintains a shared buffer pool , a large region of shared memory that caches recently read data pages . The more of the working set that fits in the buffer pool, the less disk I/O is needed. When a client connects to PostgreSQL, the server spawns a dedicated process to handle that connection, called a backend . Every backend that needs a data page not already in the buffer pool has to first read it from disk, then find a buffer to store it in: either one that is already free, or one currently holding another page that can be evicted. The job of finding that buffer falls to a single crucial function called . To coordinate access to the buffer pool across hundreds of concurrent backends, uses a spinlock. A spinlock is a locking mechanism built on a simple idea: instead of going to sleep while waiting for a lock to become available, a process just keeps checking in a tight loop (it spins ): Why would we ever want that? For very short critical sections, the overhead of putting a thread to sleep and waking it back up can be more expensive than just “spinning“ , meaning actively waiting. If we know the lock holder will be done in nanoseconds, spinning is faster than sleeping. The key assumption behind spinlocks is the following: the thread holding the lock will release it very soon. Nobody is going to preempt that thread in the middle of a 20-nanosecond critical section. The holder will finish and release the lock before anyone has time to notice. uses a single global spinlock to protect the critical section where it selects a buffer. On a 96-vCPU machine with 1,024 clients all hammering the database, every backend competes for the same lock, and any time it takes longer than expected to release, all of them burn CPU spinning. But why did the Linux 7.0 preemption change make it so much worse? The answer lies in how memory works at the hardware level. Every process in Linux, including PostgreSQL, works with virtual memory addresses. For example, the address in one process is a completely different memory from the same address in another process. The hardware translates virtual addresses to physical addresses using a data structure called the page table , maintained by the kernel in memory. A page table is a multi-level tree, so a single address translation requires several sequential memory reads to walk it. Doing that for every memory access would be impossibly slow. Instead, CPUs have a small hardware cache for recent translations called the Translation Lookaside Buffer (TLB): When a process accesses an address it has accessed recently, the TLB already has the translation, and the memory access proceeds quickly. When a process accesses an address it hasn’t seen before, it gets a TLB miss : the CPU has to walk the page table, find the physical address, and store the translation in the TLB. That takes time. There is one more concept to introduce. When PostgreSQL starts, it allocates the shared buffer pool as a large virtual memory region. But allocating virtual memory and having physical memory ready to use are two different things. Indeed, Linux uses a principle called lazy allocation : the allocation is noted, but the actual physical pages are only mapped on first access. The first time any code touches a previously-unmapped virtual address, a minor page fault occurs: the kernel allocates a physical page and stores the mapping. That takes microseconds, orders of magnitude slower than a regular read or write where the page is already mapped. When a process accesses memory for the first time, the kernel doesn’t map it byte by byte. Instead, it maps memory in fixed-size chunks called memory pages via the page table. NOTE : We already used the word “page” to characterize data pages, meaning how PostgreSQL organizes data on disk into fixed-size 8 KB blocks. This is a different concept than a Linux page, which is the unit the kernel uses to manage physical memory. By default, a Linux memory page is 4 KB. PostgreSQL's shared buffer pool, like all memory on Linux, is backed by Linux memory pages under the hood. In Dipietro’s benchmark, the shared buffer pool was configured to 120 GB via the parameter, which at 4 KB per Linux memory page means roughly 31 million memory pages . Therefore, 31 million potential first-touch page faults. Now let’s consider what happens inside . Each backend acquires the spinlock to find a free slot in the buffer pool. To do so, it reads or writes shared memory. If that region of shared memory hasn’t been touched yet, accessing it triggers a minor page fault , meaning that the kernel has to allocate a physical memory page and store the mapping. During a long benchmark with a 120 GB shared buffer pool, new regions keep entering the working set throughout the run, so these faults happen constantly, not just at startup . And when a fault occurs while a backend is holding the spinlock, the consequences are severe. Indeed, we discussed that the key assumption behind spinlocks is that the lock will be released very soon. In that case, the assumption breaks : the holder is stuck inside the kernel fault handler while it stores a physical memory page mapping, and every other backend on the machine is spinning, burning CPU, waiting for a lock that won't be released until the faulting process resumes. The impact of a fault when the lock was acquired depends on the preemption model . Let’s consider the following example. Backend acquires the lock but triggers a page fault. Meanwhile, backends , , and arrive and try to acquire the lock. Since they can’t, they spin, burning CPU on a tight loop while waiting for backend to release the lock. With (before Linux 7) : Once backend enters the fault handler, the kernel handles the fault. Since avoids voluntary rescheduling points, backend is unlikely to be scheduled away before the fault resolves and the lock is released. The spinners wait a bit longer than expected, but the damage is limited. With (Linux 7 and beyond) : The scheduler may decide to preempt backend A while it’s still inside the fault handler, scheduling another process in its place. Backend won’t resume until the scheduler hands control back to it, which can take some time, even after the fault is fully handled: The spinlock hold time goes from “ duration of the fault ” to “ duration of the fault + time waiting for the scheduler .” And that extra wait, let’s call it , is not just of wasted CPU; instead, it is multiplied by every backend currently spinning . In the previous example, backends B, C, and D each burn extra cycles, making the total waste . On a 96-vCPU machine with hundreds of backends, that multiplier is devastating. That's how the benchmark ended up with 56% of the CPU burning in . That extra time waiting for the scheduler was the root cause of the issue. Fortunately, there is an option to overcome this issue in PostgreSQL. The main variable we discussed was , 120 GB in the benchmark, meaning roughly 31 million memory pages. But there is another variable we can adjust: the size of a memory page . As we said, it defaults to 4 KB, but the kernel supports larger pages called huge pages . On x86_64 and ARM64, the supported sizes are 2 MB and 1 GB: 4 KB pages : ~31,000,000 potential page faults 2 MB huge pages : ~61,440 potential page faults 1 GB huge pages : ~120 potential page faults Increasing the size of a memory page reduces the number of potential page faults but also reduces TLB pressure. Indeed, far fewer entries need to cover the same memory, so the working set fits comfortably in the TLB, meaning far fewer TLB misses and page table walks on the hot path. Overall, stops triggering faults while holding the lock. The lock holder finishes quickly. The other backends wait microseconds instead of milliseconds. The regression disappears . NOTE : Setting huge pages in PostgreSQL is controlled by the configuration parameter, which accepts three values: , , and (the default). With , PostgreSQL uses huge pages if available and silently falls back to 4 KB pages otherwise. Use instead so PostgreSQL fails to start rather than running misconfigured without you noticing. The size of the huge pages themselves is a Linux configuration. However, setting huge pages is not without tradeoffs . Huge pages are pre-allocated and reserved upfront, meaning that memory is no longer available to the rest of the system even if PostgreSQL isn’t using it all. There is also a memory waste concern: a huge page is allocated as a whole, so if only a fraction of it is used, the rest is wasted. For most production PostgreSQL deployments with large , these tradeoffs are probably worth it, but they are good to know about. Peter Zijlstra, the Intel kernel engineer who authored the preemption change, proposed a fix: PostgreSQL should adopt Restartable Sequences ( ), a Linux kernel facility that lets userspace code detect whether it was preempted or migrated during a critical section and restart it if so. PostgreSQL's spinlock paths would use to detect preemption and retry, avoiding the scenario where a preempted lock holder stalls all waiting backends. The PostgreSQL community’s response was not enthusiastic. Using a kernel facility specifically to recover performance that PostgreSQL had for free before Linux 7.0 is a tough sell. It also sits uncomfortably next to the kernel’s long-standing principle of not breaking userspace : if software worked correctly before a kernel upgrade, it should work correctly after. AI is getting better every day. Are you? At The Coder Cafe, we serve fundamental concepts to make you an engineer that AI won’t replace. Written by a Google SWE, trusted by thousands of engineers worldwide. Linux 7.0 removed on modern CPU architectures, leaving only and . On most distributions, the default shifted to . An AWS engineer benchmarked PostgreSQL on a 96-vCPU Graviton4 and found throughput cut in half on Linux 7.0, with 55% of CPU burning inside a single spinlock in . The root cause is minor page faults occurring while a backend holds the spinlock. With 4 KB memory pages backing a 120 GB , there are up to 31 million potential first-touch faults throughout a benchmark run. Under , the faulting process resumed quickly and released the lock. Under , the scheduler may preempt it mid-fault, extending the hold time and causing every waiting backend to keep spinning. Enabling huge pages (2 MB or 1 GB) reduces the number of potential faults by orders of magnitude and eliminates TLB pressure, making the regression disappear. Linux Soft vs. Hard Lockup Instruction Pipelining Explained Simultaneous Multithreading Explained [PATCH 0/1] sched: Restore PREEMPT_NONE as default AWS Engineer Reports PostgreSQL Performance Halved By Linux 7.0, But A Fix May Not Be Easy PREEMPT_NONE Is Dead; Your Postgres Probably Doesn’t Care The long road to lazy preemption Buffer Manager Restartable Sequences The Problem Salvatore Dipietro ran pgbench (PostgreSQL’s standard benchmarking tool) on a Graviton4 processor with 96 vCPUs. The workload was a benchmark doing simple updates at scale factor 8,470 (i.e., roughly a 847 million row table), simulating 1,024 clients and 96 threads. A serious, high-parallelism load designed to stress the system. The results were striking. Linux 7.0 delivered roughly half the throughput of Linux 6.x on the same hardware and workload: Linux 6.x : 98,565 transactions per second Linux 7.0 : 50,751 transactions per second : The kernel almost never interrupts a running thread. A thread runs until it voluntarily gives up the CPU: when it makes a syscall, blocks on I/O, or explicitly sleeps. This was the traditional server default with fewer context switches, higher throughput, and predictable behavior under load. : The kernel can interrupt a running thread at almost any safe point, even if it is in the middle of doing useful work. This means a thread never has to wait for the current one to finish its slice before getting CPU time, which reduces response time but increases context-switch overhead. Historically, the desktop default, where responsiveness matters more than raw throughput. : Introduced in Linux 6.12 as a compromise between the two. The scheduler can interrupt threads, but tries to wait for natural boundaries rather than cutting in aggressively. The intent is to approximate ‘s throughput behavior while still allowing preemption when needed. When a process accesses an address it has accessed recently, the TLB already has the translation, and the memory access proceeds quickly. When a process accesses an address it hasn’t seen before, it gets a TLB miss : the CPU has to walk the page table, find the physical address, and store the translation in the TLB. That takes time. With (before Linux 7) : Once backend enters the fault handler, the kernel handles the fault. Since avoids voluntary rescheduling points, backend is unlikely to be scheduled away before the fault resolves and the lock is released. The spinners wait a bit longer than expected, but the damage is limited. With (Linux 7 and beyond) : The scheduler may decide to preempt backend A while it’s still inside the fault handler, scheduling another process in its place. Backend won’t resume until the scheduler hands control back to it, which can take some time, even after the fault is fully handled: The spinlock hold time goes from “ duration of the fault ” to “ duration of the fault + time waiting for the scheduler .” And that extra wait, let’s call it , is not just of wasted CPU; instead, it is multiplied by every backend currently spinning . In the previous example, backends B, C, and D each burn extra cycles, making the total waste . On a 96-vCPU machine with hundreds of backends, that multiplier is devastating. That's how the benchmark ended up with 56% of the CPU burning in . That extra time waiting for the scheduler was the root cause of the issue. Huge Pages to the Rescue Fortunately, there is an option to overcome this issue in PostgreSQL. The main variable we discussed was , 120 GB in the benchmark, meaning roughly 31 million memory pages. But there is another variable we can adjust: the size of a memory page . As we said, it defaults to 4 KB, but the kernel supports larger pages called huge pages . On x86_64 and ARM64, the supported sizes are 2 MB and 1 GB: 4 KB pages : ~31,000,000 potential page faults 2 MB huge pages : ~61,440 potential page faults 1 GB huge pages : ~120 potential page faults Linux 7.0 removed on modern CPU architectures, leaving only and . On most distributions, the default shifted to . An AWS engineer benchmarked PostgreSQL on a 96-vCPU Graviton4 and found throughput cut in half on Linux 7.0, with 55% of CPU burning inside a single spinlock in . The root cause is minor page faults occurring while a backend holds the spinlock. With 4 KB memory pages backing a 120 GB , there are up to 31 million potential first-touch faults throughout a benchmark run. Under , the faulting process resumed quickly and released the lock. Under , the scheduler may preempt it mid-fault, extending the hold time and causing every waiting backend to keep spinning. Enabling huge pages (2 MB or 1 GB) reduces the number of potential faults by orders of magnitude and eliminates TLB pressure, making the regression disappear. Linux Soft vs. Hard Lockup Instruction Pipelining Explained Simultaneous Multithreading Explained [PATCH 0/1] sched: Restore PREEMPT_NONE as default AWS Engineer Reports PostgreSQL Performance Halved By Linux 7.0, But A Fix May Not Be Easy PREEMPT_NONE Is Dead; Your Postgres Probably Doesn’t Care The long road to lazy preemption Buffer Manager Restartable Sequences

0 views
Dangling Pointers 1 months ago

Cicada: Dependably Fast Multi-Core In-Memory Transactions

Cicada: Dependably Fast Multi-Core In-Memory Transactions Hyeontaek Lim, Michael Kaminsky, and David G. Andersen SIGMOD'17 This paper is nine years old, but Gemini tells me it is still the state-of-the-art for single server, in-memory OLTP. Leave a comment if you know of a newer result. Section 3 of the paper describes three design principles for the Cicada database: Optimistic concurrency control Multi-versioning Loosely synchronized clocks Cicada stores multiple versions of each tuple in the database (each write produces a new version). As shown in Fig. 2, each tuple is stored as a linked list of versions: Source: https://dl.acm.org/doi/10.1145/3035918.3064015 Inlining is an important optimization described by the paper. The head node of each linked list can optionally store the data for the most recently generated version, which avoids pointer chasing on the common path. Each list element contains two timestamps. is the write timestamp, is a read timestamp. Each list is sorted such that the first list element is the most-recently-committed version. Cicada transactions do not use locks. Instead, Cicada optimistically assumes that no conflicts will occur and validates that assumption using the timestamps associated with each tuple accessed by a transaction. Per-version timestamps are also used to garbage collect old versions. The authors clearly put a lot of effort into scalable timestamp generation. A naive approach would be a single shared variable that is atomically incremented on each transaction. Contention for the cache line holding that variable would be nasty. Instead, Cicada uses loosely synchronized per-core clocks. Before starting a new transaction, a core uses the instruction to measure how much time has elapsed since the last transaction. This elapsed time is used to increment a local clock variable, which is used to generate the transaction timestamp. The least significant bits of each timestamp contain the ID of the thread which generated the timestamp (to ensure that all timestamps are unique). is not architecturally guaranteed to generate perfectly synchronized results across cores. Cicada compensates for unsynchronized clocks with periodic one-sided synchronization. Periodically, a core will compare the value of its local clock variable with the clock variable of another core. If the remote core has a larger value, then the local core will realize that it has a slow clock and will adjust accordingly to catch up. Cicada does not require perfect synchronization for correctness. If one clock is running slowly, its transactions will simply be more likely to abort. Fig. 3 shows TPC-C results. While Cicada supports durability with redo logging, these results were generated with logging disabled: Source: https://dl.acm.org/doi/10.1145/3035918.3064015 I find the 1 warehouse case to be the most interesting, as it has the most contention. Write contention clearly limits scalability even with all of the careful tuning to Cicada. Fig. 3(a) saturates at about 300K transactions per second for 8 cores. Back-of-the-envelope, that gives each core roughly 50,000 clock cycles to execute each transaction. TPC-C transactions aren’t that meaty. What is the speed of light? Subscribe now Optimistic concurrency control Multi-versioning Loosely synchronized clocks

0 views
Robin Moffatt 1 months ago

Materialized Tables in Apache Flink

Flink added support for what it calls Materialized Tables in 1.20 , released in 2024. You can read about the design and motivations in FLIP-435 . In a nutshell, Materialized Tables provide a way to include the SQL to populate and refresh a table as part of its definition.

0 views
Phil Eaton 1 months ago

Branimir Lambov from IBM on Cassandra

This is an external post of mine. Click here if you are not redirected.

0 views

SQLAlchemy 2 In Practice - Chapter 6: A Page Analytics Solution

This is the sixth chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! The goal of this chapter is to use the concepts you have learned to build a web traffic analytics solution. This will serve as reinforcement of the techniques demonstrated in previous chapters as well as an example of a more complex and realistic database design.

0 views

SQLAlchemy 2 In Practice - Chapter 5 - Advanced Many-To-Many Relationships

This is the fifth chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! You have now learned the design blocks used in relational databases. Sometimes, however, these building blocks have to be "tweaked" a bit to achieve a desired goal. This chapter is dedicated to exploring a very useful variation on the many-to-many relationship.

0 views

SQLAlchemy 2 In Practice - Chapter 4 - Many-To-Many Relationships

This is the fourth chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! Continuing with the topic of relationships, this chapter is dedicated to the many-to-many type, which as its name implies, is used when it is not possible to identify any of the sides as a "one" side.

0 views
Lalit Maganti 1 months ago

Eight years of wanting, three months of building with AI

For eight years, I’ve wanted a high-quality set of devtools for working with SQLite. Given how important SQLite is to the industry 1 , I’ve long been puzzled that no one has invested in building a really good developer experience for it 2 . A couple of weeks ago, after ~250 hours of effort over three months 3 on evenings, weekends, and vacation days, I finally released syntaqlite ( GitHub ), fulfilling this long-held wish. And I believe the main reason this happened was because of AI coding agents 4 . Of course, there’s no shortage of posts claiming that AI one-shot their project or pushing back and declaring that AI is all slop. I’m going to take a very different approach and, instead, systematically break down my experience building syntaqlite with AI, both where it helped and where it was detrimental. I’ll do this while contextualizing the project and my background so you can independently assess how generalizable this experience was. And whenever I make a claim, I’ll try to back it up with evidence from my project journal, coding transcripts, or commit history 5 .

0 views
Armin Ronacher 1 months ago

Absurd In Production

About five months ago I wrote about Absurd , a durable execution system we built for our own use at Earendil, sitting entirely on top of Postgres and Postgres alone. The pitch was simple: you don’t need a separate service , a compiler plugin , or an entire runtime to get durable workflows. You need a SQL file and a thin SDK. Since then we’ve been running it in production, and I figured it’s worth sharing what the experience has been like. The short version: the design held up, the system has been a pleasure to work with, and other people seem to agree. Absurd is a durable execution system that lives entirely inside Postgres. The core is a single SQL file ( absurd.sql ) that defines stored procedures for task management, checkpoint storage, event handling, and claim-based scheduling. On top of that sit thin SDKs (currently TypeScript , Python and an experimental Go one) that make the system ergonomic in your language of choice. The model is straightforward: you register tasks, decompose them into steps, and each step acts as a checkpoint. If anything fails, the task retries from the last completed step. Tasks can sleep, wait for external events, and suspend for days or weeks. All state lives in Postgres. If you want the full introduction, the original blog post covers the fundamentals. What follows here is what we’ve learned since. The project got multiple releases over the last five months. Most of the changes are things you’d expect from a system that people actually started depending on: hardened claim handling, watchdogs that terminate broken workers, deadlock prevention, proper lease management, event race conditions, and all the edge cases that only show up when you’re running real workloads. A few things worth calling out specifically. Decomposed steps. The original design only had , where you pass in a function and get back its checkpointed result. That works well for many cases but not all. Sometimes you need to know whether a step already ran before deciding what to do next. So we added / , which give you a handle you can inspect before committing the result. This turned out to be very useful for modeling intentional failures and conditional logic. This in particular is necessary when working with “before call” and “after call” type hook APIs. Task results. You can now spawn a task, go do other things, and later come back to fetch or await its result. This sounds obvious in hindsight, but the original system was purely fire-and-forget. Having proper result inspection made it possible to use Absurd for things like spawning child tasks from within a parent workflow and waiting for them to finish. This is particularly useful for debugging with agents too. absurdctl . We built this out as a proper CLI tool. You can initialize schemas, run migrations, create queues, spawn tasks, emit events, retry failures from the command line. It’s installable via or as a standalone binary. This has been invaluable for debugging production issues. When something is stuck, being able to just and see exactly where it stopped is a very different experience from digging through logs. Habitat . A small Go application that serves up a web dashboard for monitoring tasks, runs, checkpoints, and events. It connects directly to Postgres and gives you a live view of what’s happening. It’s simple, but it’s the kind of thing that makes the system more enjoyable for humans. Agent integration. Since Absurd was originally built for agent workloads, we added a bundled skill that coding agents can discover and use to debug workflow state via . There’s also a documented pattern for making pi agent turns durable by logging each message as a checkpoint. The thing I’m most pleased about is that the core design didn’t need to change all that much. The fundamental model of tasks, steps, checkpoints, events, and suspending is still exactly what it was initially. We added features around it, but nothing forced us to rethink the basic abstractions. Putting the complexity in SQL and keeping the SDKs thin turned out to be a genuinely good call. The TypeScript SDK is about 1,400 lines. The Python SDK is about 1,900 but most of this comes from the complexity of supporting colored functions. Compare that to Temporal’s Python SDK at around 170,000 lines. It means the SDKs are easy to understand, easy to debug, and easy to port. When something goes wrong, you can read the entire SDK in an afternoon and understand what it does. The checkpoint-based replay model also aged well. Unlike systems that require deterministic replay of your entire workflow function, Absurd just loads the cached step results and skips over completed work. That means your code doesn’t need to be deterministic outside of steps. You can call or in between steps and things still work, because only the step boundaries matter. In practice, this makes it much easier to reason about what’s safe and what isn’t. Pull-based scheduling was the right choice too. Workers pull tasks from Postgres as they have capacity. There’s no coordinator, no push mechanism, no HTTP callbacks. That makes it trivially self-hostable and means you don’t have to think about load management at the infrastructure level. I had some discussions with folks about whether the right abstraction should have been a durable promise . It’s a very appealing idea, but it turns out to be much more complex to implement in practice. It’s however in theory also more powerful. I did make some attempts to see what absurd would look like if it was based on durable promises but so far did not get anywhere with it. It’s however an experiment that I think would be fun to try! The primary use case is still agent workflows. An agent is essentially a loop that calls an LLM, processes tool results, and repeats until it decides it’s done. Each iteration becomes a step, and each step’s result is checkpointed. If the process dies on iteration 7, it restarts and replays iterations 1 through 6 from the store, then continues from 7. But we’ve found it useful for a lot of other things too. All our crons just dispatch distributed workflows with a pre-generated deduplication key from the invocation. We can have two cron processes running and they will only trigger one absurd task invocation. We also use it for background processing that needs to survive deploys. Basically anything where you’d otherwise build your own retry-and-resume logic on top of a queue. Absurd is deliberately minimal, but there are things I’d like to see. There’s no built-in scheduler. If you want cron-like behavior, you run your own scheduler loop and use idempotency keys to deduplicate. That works, and we have a documented pattern for it , but it would be nice to have something more integrated. There’s no push model. Everything is pull. If you need an HTTP endpoint to receive webhooks and wake up tasks, you build that yourself. I think that’s the right default as push systems are harder to operate and easier to overwhelm but there are cases where it would be convenient. In particular there are quite a few agentic systems where it would be super nice to have webhooks natively integrated (wake on incoming POST request). I definitely don’t want to have this in the core, but that sounds like the kind of problem that could be a nice adjacent library that builds on top of absurd. The biggest omission is that it does not support partitioning yet. That’s unfortunate because it makes cleaning up data more expensive than it has to be. In theory supporting partitions would be pretty simple. You could have weekly partitions and then detach and delete them when they expire. The only thing that really stands in the way of that is that Postgres does not have a convenient way of actually doing that. The hard part is not partitioning itself, it’s partition lifecycle management under real workloads. If a worker inserts a row whose lands in a month without a partition, the insert fails and the workflow crashes. So you need a separate maintenance loop that always creates future partitions far enough ahead for sleeps/retries, and does that for every queue. On the delete side, the safe approach is , but getting that to run from doesn’t work because it cannot be run within a transaction, but runs everything in one. I don’t think it’s an unsolvable problem, but it’s one I have not found a good solution for and I would love to get input on . This brings me a bit to a meta point on the whole thing which is what the point of Open Source libraries in the age of agentic engineering is. Durable Execution is now something that plenty of startups sell you. On the other hand it’s also something that an agent would build you and people might not even look for solutions any more. It’s kind of … weird? I don’t think a durable execution library can support a company, I really don’t. On the other hand I think it’s just complex enough of a problem that it could be a good Open Source project void of commercial interests. You do need a bit of an ecosystem around it, particularly for UI and good DX for debugging, and that’s hard to get from a throwaway implementation. I don’t think we have squared this yet, but it’s already much better to use than a few months ago. If you’re using Absurd, thinking about it, or building adjacent ideas, I’d love your feedback. Bug reports, rough edges, design critiques, and contributions are all very welcome—this project has gotten better every time someone poked at it from a different angle.

0 views

SQLAlchemy 2 In Practice - Chapter 3 - One-To-Many Relationships

This is the third chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! In the previous chapter you learned how to execute a variety of queries on the table. Interestingly, some of those queries were designed to obtain product manufacturers and not products, and this required duplicates to be removed by grouping the results.

0 views

SQLAlchemy 2 In Practice - Chapter 2 - Database Tables

This is the second chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you! This chapter provides an overview of the most basic usage of the SQLAlchemy library to create, update and query database tables.

0 views

SQLAlchemy 2 In Practice - Chapter 1 - Database Setup

Welcome! This is the start of a journey which I hope will provide you with many new tricks to improve how you work with relational databases in your Python applications. Given that this is a hands-on book, this first chapter is dedicated to help you set up your system with a database, so that you can run all the examples and exercises. This is the first chapter of my SQLAlchemy 2 in Practice book. If you'd like to support my work, I encourage you to buy this book, either directly from my store or on Amazon . Thank you!

0 views