Posts in Database (20 found)

Optimizing Datalog for the GPU

Optimizing Datalog for the GPU. Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski. ASPLOS'25.

Datalog source code comprises a set of relations and a set of rules. A relation can be explicitly defined with a set of tuples; the running example in the paper defines a graph with a relation listing its edges. A relation can also be implicitly defined with a set of rules; the paper uses a same-generation relation as an example. Rule 1 states that two vertices are part of the same generation if they share a common ancestor and are not actually the same vertex. Rule 2 states that two vertices are part of the same generation if they have ancestors that are themselves part of the same generation. “Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added).

One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the edge relation with itself, using the shared ancestor as the join key and the inequality of the two vertices as a filter.

Semi-naïve evaluation is an algorithm for performing these joins until convergence while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets (conventionally called new, delta, and full): new holds tuples that were discovered in the current iteration, delta holds tuples which were added in the previous iteration, and full holds all tuples that have been found in any iteration. For a join involving two relations A and B, new is computed as the union of the results of 3 joins: delta(A) joined with full(B), full(A) joined with delta(B), and delta(A) joined with delta(B). The key fact for performance is that full is never joined with full. More details on semi-naïve evaluation can be found in these notes.

This paper introduces the hash-indexed sorted array for storing relations while executing semi-naïve evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red). Source: https://dl.acm.org/doi/10.1145/3669940.3707274

The data array holds the actual tuple data, densely packed in row-major order. The sorted index array holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort). The hash table is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys. A join of relations A and B can then be implemented by, for each tuple of A, hashing its join keys, looking up the first matching slot in B's sorted index array, and scanning forward while the join keys match. Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.

Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s HIP runtime and then run on the same Nvidia GPU. Source: https://dl.acm.org/doi/10.1145/3669940.3707274

Dangling Pointers
The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster). I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.
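To make the probe path concrete, here is a minimal Python sketch of a hash-indexed sorted array and the join described above. It is not the paper's CUDA implementation: the class and function names are mine, a plain dict stands in for the open-addressed hash table, and there is no GPU parallelism; it only mirrors the data-array / sorted-index / hash-table layout and the probe order.

```python
# Illustrative sketch only; names and layout are assumptions, not the paper's code.
class HashIndexedSortedArray:
    def __init__(self, tuples, num_join_keys):
        self.k = num_join_keys
        # Data array: tuples densely packed (row-major in the real structure).
        self.data = list(tuples)
        # Sorted index array: one pointer per tuple, sorted lexicographically.
        # Join keys are the leading columns, so they take priority in the sort.
        self.index = sorted(range(len(self.data)), key=lambda i: self.data[i])
        # Hash table: join-key hash -> first slot in the sorted index array
        # holding those join keys (a dict stands in for open addressing).
        self.ht = {}
        for slot, ptr in enumerate(self.index):
            self.ht.setdefault(self.data[ptr][:self.k], slot)

    def probe(self, join_keys):
        """Yield every tuple whose join keys equal `join_keys`."""
        slot = self.ht.get(join_keys)
        if slot is None:
            return
        # Matching pointers are contiguous in the sorted index array, so this
        # walk is the coherent access pattern mentioned above.
        while slot < len(self.index):
            row = self.data[self.index[slot]]
            if row[:self.k] != join_keys:
                break
            yield row
            slot += 1


def join(outer, inner, k):
    """Join each tuple of `outer` against `inner` on the first k columns."""
    out = []
    for a in outer:
        for b in inner.probe(a[:k]):
            out.append(a + b[k:])   # keep the join keys once, append B's payload
    return out


# Tiny usage example: a two-column edge relation joined with itself on column 0.
edges = [(1, 2), (1, 3), (2, 4)]
idx = HashIndexedSortedArray(edges, num_join_keys=1)
print(join(edges, idx, k=1))  # [(1, 2, 2), (1, 2, 3), (1, 3, 2), (1, 3, 3), (2, 4, 4)]
```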

2 views

Mutable atomic deletes with Parquet backed columnar tables on S3

In the previous post, I explored a Parquet on S3 design with tombstones for constant time deletes and a CAS updated manifest for snapshot isolation. This post extends that design. The focus is on in-file delete operations, where we replace a Parquet row group and publish a new footer using S3 Multipart Upload (MPU) and UploadPartCopy, without having to download and rebuild unchanged bytes. We preserve the same system model with immutable data files and a manifest pointer that serializes visibility.
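Here is a rough boto3 sketch of that copy-splice, under assumptions of my own: the bucket and key names are placeholders, the byte offsets are invented, and for simplicity the replaced row group is assumed to be the last one before the footer. The real design also has to CAS the manifest pointer afterwards, which is omitted here.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"                       # placeholder names
SRC, DST = "table/file-v1.parquet", "table/file-v2.parquet"

PREFIX_END = 8 * 1024 * 1024 - 1                # hypothetical: last byte of the unchanged prefix
new_row_group = b"..."                          # rewritten row group, built locally
new_footer = b"..."                             # rewritten footer metadata + magic bytes

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=DST)
parts = []

# Part 1: server-side copy of the unchanged bytes -- nothing is downloaded.
r = s3.upload_part_copy(Bucket=BUCKET, Key=DST, UploadId=mpu["UploadId"], PartNumber=1,
                        CopySource={"Bucket": BUCKET, "Key": SRC},
                        CopySourceRange=f"bytes=0-{PREFIX_END}")
parts.append({"PartNumber": 1, "ETag": r["CopyPartResult"]["ETag"]})

# Part 2: upload only the replacement row group and the new footer.
r = s3.upload_part(Bucket=BUCKET, Key=DST, UploadId=mpu["UploadId"], PartNumber=2,
                   Body=new_row_group + new_footer)
parts.append({"PartNumber": 2, "ETag": r["ETag"]})

s3.complete_multipart_upload(Bucket=BUCKET, Key=DST, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```

Note that non-final copied parts must meet S3's minimum part size, which is why the prefix in this sketch is several MiB.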

0 views

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory

No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory. Hyoungjoo Kim, Yiwei Zhao, Andrew Pavlo, Phillip B. Gibbons. VLDB'25.

This paper describes how processing-in-memory (PIM) hardware can be used to improve OLTP performance. Here is a prior paper summary from me on a similar topic, but that one is focused on OLAP rather than OLTP. UPMEM is a specific PIM product (also used in the prior paper on this blog). A UPMEM DIMM is like a DRAM DIMM, but each DRAM bank is extended with a simple processor which can run user code. That processor has access to a small local memory and the DRAM associated with the bank. This paper calls each processor a PIM module. There is no direct communication between PIM modules. Fig. 2 illustrates the system architecture used by this paper: a traditional CPU is connected to a set of boring old DRAM DIMMs and is also connected to a set of UPMEM DIMMs. Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Four Challenges
The paper identifies the following difficulties associated with using UPMEM to accelerate an OLTP workload: PIM modules can only access their local memory; PIM modules do not have the typical niceties associated with x64 CPUs (high clock frequency, caches, SIMD); there is a non-trivial cost for the CPU to send data to UPMEM DIMMs (similar to the CPU writing data to regular DRAM); and OLTP workloads have tight latency constraints.

The authors arrived at a solution that both provides a good speedup and doesn’t require boiling the ocean. The database code and architecture remain largely unchanged. Much of the data remains in standard DRAM DIMMs, and the database operates on it as it always has. In section 3.2 the authors identify a handful of data structures and operations with near-memory affinity which are offloaded. These data structures are stored in UPMEM DIMMs, and the algorithms which access them are offloaded to the PIM modules. The key feature that these algorithms have in common is pointer chasing. The sweet spots the authors identify involve a small number of parameters sent from the CPU to a PIM module, then the PIM module performing multiple roundtrips to its local DRAM bank, followed by the CPU reading back a small amount of response data. The roundtrips to PIM-local DRAM have lower latency than accesses from a traditional CPU core.

One operation which involves a lot of pointer chasing is B+ tree traversal. Thus, the system described in this paper moves B+ tree indexes into UPMEM DIMMs and uses PIM modules to search for values in an index. Note that the actual tuples that hold row data stay in plain-old DRAM. The tricky part is handling range queries while distributing an index across many banks. The solution described in this paper is to partition the set of keys into 2^R partitions (the lower bits of a key determine the index of the partition which holds that key). Each partition is thus responsible for a contiguous array of keys. For a range query, the lower bits of the lower and upper bounds of the range can be used to determine which partitions must be searched. Each PIM module is responsible for multiple partitions, and a hash function is used to convert a partition index into a PIM module index.
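To make the offload pattern concrete, here is a conceptual sketch in plain Python (not UPMEM code) of what a single PIM module does for an index probe: the CPU sends one key, the module chases pointers through B+ tree nodes held in its local bank, and only a small result (a pointer into CPU-side DRAM) travels back. The node layout and names are my own simplifications, not the paper's format.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Leaf:
    keys: List[int]
    tuple_ptrs: List[int]          # pointers to row data kept in ordinary DRAM

@dataclass
class Inner:
    keys: List[int]                # separator keys
    children: List[Union["Inner", Leaf]]

def pim_index_lookup(root: Union[Inner, Leaf], key: int) -> Optional[int]:
    """What one PIM module does: walk the tree stored in its local bank."""
    node = root
    while isinstance(node, Inner):             # each hop is a low-latency local-bank access
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, ptr in zip(node.keys, node.tuple_ptrs):
        if k == key:
            return ptr                          # small reply sent back to the CPU
    return None

# Usage: two leaves under one inner node; look up key 42.
left  = Leaf(keys=[10, 20], tuple_ptrs=[0x100, 0x200])
right = Leaf(keys=[42, 77], tuple_ptrs=[0x420, 0x770])
root  = Inner(keys=[42], children=[left, right])
assert pim_index_lookup(root, 42) == 0x420
```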
MVCC is a concurrency control method which requires the database to keep around old versions of a given row (to allow older in-flight queries to access them). The set of versions associated with a row is typically stored in a linked list (yet another pointer traversal). Again, the actual tuple contents are stored in regular DRAM, but the list links are stored in UPMEM DIMMs, with the PIM modules traversing the links. Section 4.3 has more information about how old versions are eventually reclaimed with garbage collection.

Fig. 7 has the headline results, comparing a baseline system against the work described by this paper. It is interesting that one system beats the other only for read-only workloads. Source: https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory

Processing-in-memory can help with memory bandwidth and memory latency. It seems like this work is primarily focused on memory latency. I suppose this indicates that OLTP workloads are fundamentally latency-bound, because there is not enough potential concurrency between transactions to hide that latency. Is there no way to structure a database such that OLTP workloads are not bound by memory latency? It would be interesting to see if these tricks could work in a distributed system, where the PIM modules are replaced by separate nodes in the system.
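Returning to the MVCC offload described above: a conceptual Python sketch (again, not UPMEM code) of the version-chain walk a PIM module performs, with an invented node layout and a simplified visibility rule (newest version whose begin timestamp is at or before the reader's snapshot timestamp).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionNode:                 # chain metadata kept in the PIM module's local bank
    begin_ts: int                  # timestamp of the transaction that created this version
    tuple_ptr: int                 # pointer to the payload in CPU-side DRAM
    next: Optional["VersionNode"]  # older version, or None

def pim_visible_version(head: Optional[VersionNode], read_ts: int) -> Optional[int]:
    """Return the DRAM pointer of the newest version with begin_ts <= read_ts."""
    node = head
    while node is not None:        # each hop is a low-latency local-bank access
        if node.begin_ts <= read_ts:
            return node.tuple_ptr  # small result shipped back to the CPU
        node = node.next
    return None                    # no version visible to this transaction

# Usage: chain v3(ts=30) -> v2(ts=20) -> v1(ts=10); a reader at ts=25 sees v2.
v1 = VersionNode(10, 0x1000, None)
v2 = VersionNode(20, 0x2000, v1)
v3 = VersionNode(30, 0x3000, v2)
assert pim_visible_version(v3, read_ts=25) == 0x2000
```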

0 views
Marc Brooker 1 week ago

Locality, and Temporal-Spatial Hypothesis

Good fences make good neighbors? Last week at PGConf NYC, I had the pleasure of hearing Andreas Freund talking about the great work he’s been doing to bring async IO to Postgres 18. One particular result caught my eye: a large difference in performance between forward and reverse scans, seemingly driven by read ahead 1. The short version is that IO layers (like Linux’s) optimize performance by proactively pre-fetching data ahead of the current read point in a file, so it’s already cached when needed. Notably, most of these systems don’t do this backwards. This leads to a big difference in performance between forward scans (where the pages are already in the cache when they’re needed) and backward scans (where the database needs to block on IO to fetch the next page). This led me to thinking more about a particular hypothesis behind many database designs: a temporal-spatial locality hypothesis 2. Before we get there, let’s talk about locality more generally, because it might be the single most important idea in database performance (and computer systems performance generally). Temporal locality is the idea that data accessed recently is likely to be accessed again soon. This idea is what’s behind CPU caches, database buffer pools, and most caches you’ll come across in computer systems. Spatial locality is the idea that when we access data, we’re likely to access nearby data soon. Almost all database systems take advantage of these forms of locality, and would lose significant performance without taking advantage of them. Stacks of books could be written about these ideas. Stacks of books have been written about these ideas. We could talk about cache-oblivious algorithms, or non-polluting read and write instructions, or have an argument about linked lists. Instead, I want to zoom in on a particular idea in databases: the temporal-spatial hypothesis. The hypothesis I mean has a definition something like this: Temporal-spatial locality is the idea that data that was written at approximately the same time will be read at approximately the same time, and therefore should be stored near each other 4.

Is the Temporal-Spatial Hypothesis true? In a streaming system, where keys are written in order by a producer and read in order by subscribers, the temporal-spatial hypothesis is trivially true. Most real-world streaming systems are highly optimized based on this assumption, and choose on-disk formats and APIs that take advantage of it. Time-series, metrics, and observability systems mostly behave the same way. Readers in these systems are normally interested in a window of time, using the same concept of time that writers had. Hash-based databases (like DynamoDB) reject the temporal-spatial hypothesis, or at least don’t optimize for it. When new rows are created, they are assigned a large random key (often by hashing the natural primary key), which are then spread over the storage space. This has huge write-time advantages, especially in mitigating write-time hot spots. They pay the cost at read time: an in-order scan of a table is no more efficient than a random scan. Relational schemas that assign large surrogate keys (such as UUIDs) to items similarly reject the hypothesis 3. Distributed and sharded databases get the same hot-spot avoiding benefits, which can be significant. Single-box databases may get better write performance from reduced lock or latch contention, or worse write performance because of reduced cache effectiveness. Such keys can significantly reduce read performance in both types of systems, again by defeating spatial locality and making effective cache sizes smaller. These schemas may still be a good choice for data modelling reasons. Many such schemas restore the temporal ordering, through a layer of indirection, by indexing on a timestamp column.
The flip side of that is a lot of relational schemas use time-ordered primary keys (SERIAL, AUTO_INCREMENT, etc) even when reads aren’t correlated in time. This leads to increased write-time contention with no particular read-time benefit. In turn, database systems spend significant effort weakening the guarantees around these time-ordered keys. These optimizations include making them not truly ordered (see CACHE in Postgres, for example), and giving them their own special isolation rules (see the rollback and visibility behavior of nextval in Postgres, for example). I don’t have a lot of data to back this up, but my belief is that the temporal-spatial hypothesis is weakly true for OLTP workloads generally, but perhaps only to the extent that recent keys are hotter keys (the temporal hypothesis), rather than a strong correlation between keys written at similar times. However, there are some workloads for which it is a key optimization, and these workloads need to either use data structures optimized for this fact, or very carefully design their schemas with knowledge of their database system’s underlying data structures.

Attempting to be a little crisper. Let me attempt to be a little crisper about the temporal-spatial hypothesis, and how it differs from other common ideas about locality. Let’s say we have keys $K = \{K_0, \ldots, K_i, K_{i+1}, \ldots\}$, and that for each key $K_i$ we have time since write $K^{W}_i$, time since the last access (read or write) $K^{A}_i$, and a probability $P_r(K_i)$ that it’ll be read again soon. Then, let’s say we have the array $\omega$, which is the set $K$ ordered by $K^W$ (e.g. $\omega_i$ is the key written immediately after $\omega_{i-1}$).

The temporal hypothesis is that $P_r(K_i)$ is inversely related to $K^{A}_i$. In other words, the more time that’s passed since a key was last accessed, the less likely it is to be accessed again soon. A corollary to the temporal hypothesis is that $P_r(K_i)$ is more strongly affected by $K^{W}_i$ than $K^{A}_i$ (i.e. recency of creation is a strong signal for heat). The spatial hypothesis is that $P_r(K_i)$ is inversely related to $K^{A}_{i-1}$ or $K^{A}_{i+1}$ (or, more generally, $K^{A}_{i+j}$ for small integers $j$) 5. In other words, a key is more likely to be accessed soon if a nearby key was accessed recently. Our temporal-spatial hypothesis is that $P_r(\omega_i)$ is related to $P_r(\omega_{i-1})$ or $P_r(\omega_{i+1})$ (or, more generally, $P_r(\omega_{i+j})$ for small integers $j$). In other words, keys written nearby in time are likely to be read nearby in time. Notice how, if the natural order of keys matches the write order of keys, and keys are stored in natural order, the spatial and temporal-spatial hypotheses are the same.

At least that was my understanding, but there was a lot going on in the talk, and I’m not a deep Postgres expert. My apologies to Andreas if I’m misrepresenting his results here. By hypothesis, here I mean an assumption we’re making that we’re baking into the system design. Consider the generational hypothesis behind many tracing garbage collector designs, which makes the assumption that objects tend to have short lifetimes.
Or, more generally, that object lifetimes tend to be Pareto-distributed (an idea related to the Lindy Effect). At least in common database storage formats, which keep keys in key order to optimize for worst-case lookups based on key equality. “Stored near each other” here is taking advantage of the ways that database systems already optimize for spatial locality. A more general version could instead say “and therefore be stored to optimize this access pattern”. The story I told in the beginning about the performance of forward and backward scans is simply stating that the system was optimizing for the $K^{A}_{i-1}$ case and not the $K^{A}_{i+1}$ case.

0 views
Shayon Mukherjee 1 week ago

An MVCC-like columnar table on S3 with constant-time deletes

Parquet is excellent for analytical workloads: columnar layout, aggressive compression, predicate pushdown. But deletes require rewriting entire files. Systems like Apache Iceberg and Delta Lake solve this by adding metadata layers that track delete files separately from data files. But what if, for fun, we built something (arguably) simpler? S3 now has conditional writes (If-Match, If-None-Match) that enable atomic operations without external coordination. Let’s explore how we might build a columnar table format on S3 that gets most of Parquet’s benefits while supporting constant-time deletes.
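A sketch of the conditional-write building blocks, assuming a recent boto3 that exposes IfNoneMatch and IfMatch on put_object; the bucket, keys, and manifest layout are invented for illustration rather than taken from the post.

```python
import json
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-table"                       # placeholder bucket

# 1. Write an immutable delete (tombstone) object, failing if it already exists.
s3.put_object(Bucket=BUCKET, Key="tombstones/00042.json",
              Body=json.dumps({"deleted_row_ids": [17, 99]}),
              IfNoneMatch="*")

# 2. Compare-and-swap the manifest: read, modify, and publish only if the ETag
#    we read is still current; retry if another writer got there first.
while True:
    current = s3.get_object(Bucket=BUCKET, Key="manifest.json")
    etag = current["ETag"]
    manifest = json.load(current["Body"])
    manifest.setdefault("tombstones", []).append("tombstones/00042.json")
    try:
        s3.put_object(Bucket=BUCKET, Key="manifest.json",
                      Body=json.dumps(manifest), IfMatch=etag)
        break                                  # CAS succeeded; new snapshot is visible
    except ClientError as e:
        if e.response["Error"]["Code"] != "PreconditionFailed":
            raise                              # some other failure
        # Lost the race: another writer updated the manifest; loop and retry.
```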

0 views
Shayon Mukherjee 1 week ago

Exploring PostgreSQL to Parquet archival for JSON data with S3 range reads

PostgreSQL handles large JSON payloads reasonably well until you start updating or deleting them frequently. Once payloads cross the 8 KB TOAST threshold and churn becomes high, autovacuum can dominate your I/O budget and cause other issues. I have been exploring the idea of moving older JSON data (read: cold data) to Parquet on S3 while keeping recent data hot in PostgreSQL daily partitions, then simply dropping those partitions instead of running expensive DELETE operations and subsequent vacuum cycles.
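A hedged sketch of the archive-and-read-back path using pyarrow and s3fs; the table, column, and bucket names are placeholders, and a real pipeline would handle batching and types more carefully.

```python
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# Export one cold daily partition (placeholder name) to Parquet on S3.
conn = psycopg2.connect("dbname=app")
with conn.cursor() as cur:
    cur.execute("SELECT id, created_at, payload::text FROM events_2024_01_15")
    ids, created, payloads = zip(*cur.fetchall())

table = pa.table({"id": ids, "created_at": created, "payload": payloads})
fs = s3fs.S3FileSystem()
pq.write_table(table, "archive-bucket/events/2024-01-15.parquet",
               filesystem=fs, compression="zstd")
# After verifying the export, the hot partition can simply be dropped:
#   DROP TABLE events_2024_01_15;

# Later: a targeted read pulls only the footer, the needed column chunks, and
# the row groups that match the filter -- a handful of S3 range requests
# rather than the whole file.
cold = pq.read_table("archive-bucket/events/2024-01-15.parquet",
                     filesystem=fs, columns=["id", "payload"],
                     filters=[("id", "=", 12345)])
print(cold.to_pylist())
```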

0 views
Jefferson Heard 1 week ago

The best worst hack that saved our bacon

No-one really likes engineering war stories, but this one's relevant because there's a moral to it. I've talked before about defining technical debt as technical decisions that provide immediate value, but with long-term negative impact if they aren't cleaned up. Sometimes introducing technical debt is necessary and you do it consciously to avoid a disaster. As long as you provide yourself enough room to clean it up, it's just part of the regular course of business when millions of people count on your software to get through their days.

Twelve years of calendar appointments on our platform, and the data model was starting to show some wear and tear. Specifically, though, our occurrence table was created with a plain integer primary key, and we were approaching two billion occurrences on the calendar. Well, specifically, the primary key was rapidly approaching 2,147,483,647 – the magic number that is the maximum value for a signed 32-bit integer. We had actually known about this for some time, and we had done most of the work to fix it already. Our backend code was upgraded to bigints and the actual column itself had a migration set to upgrade it to a big integer. The plan had been in the works for a month and a half, and we almost ran with it. But then, roughly a week before we were going to deploy it (and maybe only a month before the keys ran out), someone, maybe me, I don't recall, noticed that these integer keys were exposed in one of our public APIs.

You can count on one thing in SaaS software. If you provide an integration API to your customers or vendors and it exposes an attribute, that attribute is crucial to someone, somewhere. And in our case the people using the integrations often had to rely on their university's IT department to do the integration itself. Those backlogs are counted in months, and so we couldn't deploy something that would potentially break customer integrations. What to do?

Well, Postgres integer primary keys are signed. So there's this WHOLE other half of the 32-bit word that you're not using if you're just auto-incrementing keys. My simple (read: stupid) solution, which absolutely worked, was to set the sequence on that primary key to -2,147,483,648 and let it continue to auto-increment, taking up the other half of that integer space. It was so dumb that I think we met like three times together with SRE to say things like, "Is it really this simple? Is this really likely to work? Are we really doing something this dumb?" and the conclusion was yes, and that it would buy us up to 3 years of time to migrate, but we would do it within 6-8 months so all IT departments could make alternative arrangements for their API integrations.

The long-term solution was the BigInt, yes, but it was also to expose all keys as opaque handles rather than integers to avoid dictionary attacks and so that we could use any type we needed to on the backend without API users having to account for it. It was also to work through the Customer Success team and make sure no-one counted on the integer-ness (integrality?) of the keys or, better, that no-one was using the occurrence IDs at all. In the end we had a smooth transition because of quick thinking and willingness to apply a baldfaced hack to our production (and staging) database. We had a fixed timeline we all acknowledged where the tech debt had to be addressed, and we'd firmly scoped out the negative consequences of not addressing it.
It wasn't hard, but it meant that no matter who was in charge or what team changes were made, the cleanup would get done in time and correctly. It was the right thing to do. A few customers had been counting on those IDs and we were able to advise their IT departments about how to change their code and to show them what the new API response would look like long before they actually were forced to use it. In the meantime, everything just worked. Do I advise that you use negative primary keys to save room on your database? No. Was it the right choice of technical debt for the time? Absolutely.
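For reference, the hack itself boils down to one ALTER SEQUENCE statement; a minimal sketch with placeholder sequence and database names (the post doesn't give the real ones), assuming psycopg2.

```python
import psycopg2

conn = psycopg2.connect("dbname=calendar")
with conn, conn.cursor() as cur:
    # Lower the sequence's minimum and restart it at the bottom of the signed
    # 32-bit range, so auto-increment continues through the unused negative half.
    cur.execute("""
        ALTER SEQUENCE occurrences_id_seq
            MINVALUE -2147483648
            RESTART WITH -2147483648;
    """)
    # Sanity check: the next nextval() call will return -2147483648.
    cur.execute("SELECT last_value, is_called FROM occurrences_id_seq;")
    print(cur.fetchone())   # (-2147483648, False)
```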

0 views
Evan Schwartz 2 weeks ago

Subtleties of SQLite Indexes

In the last 6 months, Scour has gone from ingesting 330,000 pieces of content per month to over 1.4 million this month. The massive increase in the number of items slowed down the ranking for users' feeds and sent me looking for ways to speed it up again. After spending too many hours trying in vain to squeeze more performance out of my queries and indexes, I dug into how SQLite's query planner uses indexes, learned some of the subtleties that explained why my initial tweaks weren't working, and sped up one of my main queries by ~35%.

Scour is a personalized content feed that finds articles, blog posts, etc. related to users' interests. For better and for worse, Scour does its ranking on the fly whenever users load their feeds page. Initially, this took 100 milliseconds or less, thanks to binary vector embeddings and the fact that it's using SQLite, so there is no network latency to load data. The most important table in Scour's database is the table of content items. It includes an ID, URL, title, language, publish date (stored as a Unix timestamp), and a text quality rating. Scour's main ranking query filters items based on when they were published, whether they are in a language the user understands, and whether they are above a certain quality threshold. The question is: what indexes do we need to speed up this query?

When I first set up Scour's database, I put a bunch of indexes on the table without really thinking about whether they would help. For example, I had separate indexes on the published date, the language, and the quality rating. Useless. It's more important to have one or a small handful of good composite indexes on multiple columns than to have separate indexes on each column. In most cases, the query planner won't bother merging the results from two indexes on the same table. Instead, it will use one of the indexes and then scan all of the rows that match the filter for that index's column. It's worth being careful to only add indexes that will be used by real queries. Having additional indexes on each column won't hurt read performance. However, each index takes up storage space, and more indexes will slow down writes, because all of the indexes need to be updated when new rows are inserted into the table.

If we're going to have an index on multiple columns, which columns should we include and what order should we put them in? The order of conditions in a query doesn't matter, but the order of columns in an index very much does. Columns that come earlier in the index should be more "selective": they should help the database narrow the result set as much as possible. In Scour's case, the most selective column is the publish date, followed by the quality rating, followed by the language. I put an index on those columns in that order and found that SQLite was only using one of them. Running the ranking query and inspecting its query plan showed that it was using the right index but only filtering by the publish date. Puzzling.

My aha moment came while watching Aaron Francis' High Performance SQLite course. He said the main rule for SQLite indexes is: "Left to right, no skipping, stops at the first range." (This is a much clearer statement of the implications of the Where Clause Analysis buried in the Query Optimizer Overview section of the official docs.) This rule means that the query planner will use an index's columns from left to right, won't skip over a column, and will stop using the index at the first column that appears in a range condition. My query includes two range conditions (on the publish date and the quality rating), so the "stops at the first range" rule explains why I was only seeing the query planner use one of those columns.
This rule does, however, suggest that I can create an index that will allow SQLite to filter by the one non-range condition (the language) before using one of the ranges. With the language column first and the publish date after it, the query plan shows that it can use both of those columns. Now we're using two out of the three columns to quickly filter the rows. Can we use all three? (Remember, the query planner won't be able to use multiple range conditions on the same index, so we'll need something else.)

SQLite has a nice feature called Partial Indexes that allows you to define an index that only applies to the subset of rows matching some condition. In Scour's case, we only really want items where the quality score is less than or equal to 90%. The model I'm using to judge quality isn't that great, so I only trust it to filter out items if it's really sure they're low quality. This means I can create a partial index with that condition and then update the query to use the same condition. And it should use our new partial index... right? Wrong. This query is still using the previous index. There's a subtle mistake in the relationship between our index and our query. Can you spot it? Our index was created with one spelling of the quality condition, but our query wrote it slightly differently. The two are mathematically equivalent, but they are not exactly the same. SQLite's query planner requires the conditions to match exactly in order for it to use a partial index. Relatedly, a condition in the query that is stricter than, but not identical to, the index's condition would also not utilize the partial index. If we rewrite our query to use the exact same condition as the index, the query plan now starts with the items that pass the quality cutoff, then uses the index to find the items in the desired language(s), and lastly narrows down the results to the items that were published in the correct time range.

As mentioned in the intro, these changes to the indexes and one of Scour's main ranking queries yielded a ~35% speedup. Enabling the query planner to make better use of the indexes means that SQLite doesn't need to scan as many rows to find the ones that match the query conditions. Concretely, in Scour's case, filtering by language removes about 30% of items for most users, and filtering out low quality content removes a further 50%. Together, these changes reduce the number of rows scanned by around 66%. Sadly, however, a 66% reduction in the number of rows scanned does not directly translate to a 66% speedup in the query. If we're doing more than counting rows, the work to load the data out of the database and process it can be more resource intensive than scanning rows to match conditions. (The optimized queries and indexes still load the same number of rows as before; they just identify the desired rows faster.) Nevertheless, a 35% speedup is a noticeable improvement.

It's worth digging into how your database's query planner uses indexes to help get to the bottom of performance issues. If you're working with SQLite, remember the rule: left to right, no skipping, stops at the first range; and note that a partial index is only used when the query repeats its condition exactly. Thanks to Aaron Francis for putting together the High Performance SQLite course! (I have no personal or financial relationship to him, but I appreciate his course unblocking me and helping me speed up Scour's ranking.) Thank you also to Adam Gluck and Alex Kesling for feedback on this post.
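Here is a self-contained sketch of both subtleties using an invented stand-in schema (items, published_at, language, quality); the column names and the 0.9 cutoff are assumptions, not Scour's real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, published_at INTEGER,
                        language TEXT, quality REAL);

    -- Composite index with the range column first: "stops at the first range"
    -- means the quality and language columns cannot also be used.
    CREATE INDEX idx_range_first ON items (published_at, quality, language);

    -- Equality column first, then one range column.
    CREATE INDEX idx_eq_first ON items (language, published_at);

    -- Partial index: only usable when the query repeats this exact condition.
    CREATE INDEX idx_partial ON items (language, published_at)
        WHERE quality <= 0.9;
""")

def plan(sql, params):
    """Return the detail column of EXPLAIN QUERY PLAN for a query."""
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql, params)]

q1 = """SELECT id FROM items
        WHERE published_at > ? AND language = ? AND quality <= 0.9"""
print(plan(q1, (1700000000, "en")))
# Typically uses idx_partial: language equality, then the published_at range.

q2 = """SELECT id FROM items
        WHERE published_at > ? AND language = ? AND quality < 0.91"""
print(plan(q2, (1700000000, "en")))
# Mathematically similar filter, but not the partial index's exact condition,
# so the partial index is skipped and one of the other indexes is used instead.
```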

5 views

Accelerate Distributed Joins with Predicate Transfer


1 views
Dizzy Zone 2 weeks ago

Redis is fast - I'll cache in Postgres

There are books & many articles online, like this one, arguing for using Postgres for everything. I thought I’d take a look at one use case: using Postgres instead of Redis for caching. I work with APIs quite a bit, so I’d build a super simple HTTP server that responds with data from that cache. I’d start from Redis, as this is something I frequently encounter at work, then switch it out for Postgres using unlogged tables and see if there’s a difference. I’ll run the experiment on my homelab’s k8s cluster. The idea is to run Postgres or Redis on one node, limiting it to 2 CPUs via k8s limits, as well as 8GiB of memory. On another node, I’ll run the web server itself, and then spin up a pod for the benchmark executed via k6 on the third. Both Postgres and Redis are used with the out-of-the-box settings for their container images. I wrote a simple webserver with 2 endpoints, a cache, and a “Session” struct which we’ll store in the cache; the Redis cache is implemented with a Redis client library, and the Postgres cache with a Postgres driver library. I’ll seed Redis and Postgres with 30 million entries each, keeping a record of the inserted UUIDs. From there, I’ll generate a subset of existing UUIDs to use while benchmarking. This allows for simulating both hits and misses. I’ll do a few runs of benchmarks for gets first, then sets, and then a mixed run. Each run will execute for 2 minutes. I’ll look at the number of operations per second and latencies, as well as memory and CPU usage during those times. To simulate a somewhat real scenario where only a subset of keys exist in the cache, the set benchmark will have a 10% chance to update an existing key, whereas the get will have an 80% chance of picking an existing key. The mixed workload will have a 20% chance to execute a set scenario and 80% for the get scenario.

For gets, Redis performed better than Postgres, which did not surprise me at all. The bottleneck was actually the HTTP server. The machine running the HTTP server maxed out on CPU, with Redis running comfortably at ~1280mCPU, short of the 2000mCPU limit imposed. Redis used ~3800MiB of RAM, which stayed flat across the runs. For Postgres, the bottleneck was the CPU on the Postgres side. It consistently maxed out the 2 cores dedicated to it, while also using ~5000MiB of RAM. Redis also did better when it comes to latencies of the HTTP responses.

For sets, once again Redis performed better. The CPU usage stayed roughly the same as in the case of the GET experiment, with the RAM usage growing to ~4300MiB due to the new keys being inserted. The bottleneck stayed on the HTTP server side, with Redis using ~1280mCPU once again. Postgres once again was bottlenecked by the CPU, constantly using 100% of the 2 cores it was limited to. During the course of the run, the memory usage grew to ~5500MiB. During the test, the endpoints with the Redis cache implementation also had better latencies.

The mixed benchmark also returned the predictable result of Redis reigning superior. As has been the story so far, the CPU stayed put at ~1280mCPU, and RAM usage grew a bit due to the new keys being inserted. Postgres maxed out the two cores and reached around 6GiB of memory used. Latencies once again were better when using Redis.

In the benchmark, I’ve used an unlogged table for Postgres, but this has not seemed to help. Or has it? If I rerun the same benchmark with a normal (logged) table, we can look at the numbers.
The unlogged table makes a huge difference for the write benchmark and a somewhat smaller but still significant one for the mixed workload. This is because unlogged tables skip the write-ahead log, making them a lot faster for writes. There’s very little difference in read performance though, and I expect more runs would show the two test cases converging. Redis is faster than Postgres when it comes to caching; there’s no doubt about it. It conveniently comes with a bunch of other useful functionality that one would expect from a cache, such as TTLs. It was also bottlenecked by the hardware, my service, or a combination of both, and could definitely show better numbers. Surely, we should all use Redis for our caching needs then, right? Well, I think I’ll still use Postgres. Almost always, my projects need a database. Not having to add another dependency comes with its own benefits. If I need my keys to expire, I’ll add a column for it, and a cron job to remove those keys from the table. As far as speed goes, 7425 requests per second is still a lot. That’s more than half a billion requests per day. All on hardware that’s 10 years old and using laptop CPUs. Not many projects will reach this scale, and if they do I can just upgrade the Postgres instance or, if need be, spin up a Redis then. Having an interface for your cache so you can easily switch out the underlying store is definitely something I’ll keep doing exactly for this purpose. Thanks for reading!
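The post's own benchmark uses its own web server and client libraries; as a Python sketch of the unlogged-table cache idea (placeholder table and column names, psycopg2 assumed):

```python
import json
import uuid
import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # UNLOGGED skips the write-ahead log: faster writes, no crash durability,
    # which is usually an acceptable trade-off for a cache.
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS cache (
            key        uuid PRIMARY KEY,
            value      jsonb NOT NULL,
            expires_at timestamptz NOT NULL
        );
    """)

def cache_set(key, session, ttl_seconds=3600):
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO cache (key, value, expires_at)
            VALUES (%s, %s, now() + %s * interval '1 second')
            ON CONFLICT (key) DO UPDATE
                SET value = EXCLUDED.value, expires_at = EXCLUDED.expires_at;
        """, (key, json.dumps(session), ttl_seconds))

def cache_get(key):
    with conn, conn.cursor() as cur:
        cur.execute("SELECT value FROM cache WHERE key = %s AND expires_at > now();",
                    (key,))
        row = cur.fetchone()
        return row[0] if row else None     # None means a cache miss

# Usage:
k = str(uuid.uuid4())
cache_set(k, {"user_id": 1, "theme": "dark"}, ttl_seconds=60)
print(cache_get(k))

# Unlike Redis TTLs, expired keys are not removed automatically; a periodic
# job can run: DELETE FROM cache WHERE expires_at < now();
```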

0 views

Scaling IP Lookup to Large Databases using the CRAM Lens


0 views
Phil Eaton 1 month ago

In response to a developer asking about systems

Sometimes I get asked questions that would be more fun to answer in public. All letters are treated as anonymous unless permission is otherwise granted.

Hey [Redacted]! It's great to hear from you. I'm very glad you joined the coffee club and met some good folks. :) You asked how to learn about systems. A great question! I think I need to start first with what I mean when I say systems. My definition of systems is all of the underlying software we developers use but are taught not to think about because they are so solid: our compilers and interpreters, our databases, our operating system, our browser, and so on. We think of them as basically not having bugs, we just count on them to be correct and fast enough so we can build the applications that really matter to users. But 1) some developers do actually have to work on these fundamental blocks (compilers, databases, operating systems, browsers, etc.) and 2) it's not thaaaat hard to get into this development professionally and 3) even if you don't get into it professionally, having a better understanding of these fundamental blocks will make you a better application developer. At least I think so.

To get into systems I think it starts by you just questioning how each layer you build on works. Try building that layer yourself. For example you've probably used a web framework like Rails or Next.js. But you can just go and write that layer yourself too (for education). And you've probably used Postgres or SQLite or DynamoDB. But you can also just go and write that layer yourself (for education). It's this habit of thinking and digging into the next lower layer that will get you into systems. Basically, not being satisfied with the black box. I do not think there are many good books on programming in general, and very very few must-read ones, but one that I recommend to everybody is Designing Data Intensive Applications. I think it's best if you read it with a group of people. (My book club will read it in December when the 2nd edition comes out, you should join.) But this book is specific to data obviously and not interested in the fundamentals of other systems things like compilers or operating systems or browsers or so on.

Also, I see getting into this as a long-term thing. Throughout my whole career (almost 11 years now) I definitely always tried to dig into compilers and interpreters, I wrote and blogged about toy implementations a lot. And then 5 years ago I started digging into databases and saw that there was more career potential there. But it still took 4 years until I got my first job as a developer working on a database (the job I currently have). Things take time to learn and that's ok! You have a long career to look forward to. And if you end up not wanting to dig into this stuff that's totally fine too. I think very few developers actually do. And they still have fine careers.

Anyway, I hope this is at least mildly useful. I hope you join the Software Internals Discord and nycsystems.xyz as well and look forward to seeing you at future coffee clubs! Cheers, Phil

0 views
André Arko 1 month ago

Rails on SQLite: exciting new ways to cause outages

This post was originally given as a talk for Friendly.rb. The slides are also available. Between Litestack and the Rails 8 trifecta of Solid Cable, Solid Cache, and Solid Queue, it’s easier than ever to spin up a Rails app that doesn’t need a database service, or a redis service, or a file storage service. It’s great to simplify things, but even after 20 years of deploying Rails apps I was still caught out by some of the ways things are different. Based on what happened when I built a new side project in Rails on SQLite, we’ll cover what’s different, what’s new, and several ways that you can knock your site offline or even destroy your entire production database. As we go, we’ll also talk about the advantages of using SQLite, and how those differences can help you. So who am I, how did I learn these things, and why should you listen to me? I’m André Arko, better known on the internet as @indirect. A long time ago, I helped create Bundler, and I’ve been the OSS team lead for RubyGems and Bundler for more than a decade at this point. I work at Spinel Cooperative, a collective of Ruby open source maintainers building rv, the Ruby language manager that can install Ruby in one second flat. We offer retainers for unlimited access to core team experts from Bundler, Rails, Hotwire, and more, who can answer your questions and solve your problems.

0 views
Phil Eaton 1 month ago

A simple clustering and replication solution for Postgres

This is an external post of mine.

1 views
Phil Eaton 1 month ago

Analytics query goes 6x faster with EDB Postgres Distributed's new analytics engine

This is an external post of mine.

0 views
Michael Hoffmann 1 month ago

Connecting a MySQL Database in Nuxt with Drizzle ORM

Explore how to seamlessly integrate MySQL databases into your Nuxt applications using the Drizzle ORM. This guide covers setup, configuration, and an example of executing queries, all to enhance your full-stack development workflow with modern, efficient tools.

0 views
Robin Moffatt 1 month ago

Kafka to Iceberg - Exploring the Options

You’ve got data in Apache Kafka. You want to get that data into Apache Iceberg. What’s the best way to do it? Perhaps invariably, the answer is: IT DEPENDS. But fear not: here is a guide to help you navigate your way to choosing the best solution for you 🫵. I’m considering three technologies in this blog post:

0 views