Latest Posts (16 found)
Marc Brooker 1 week ago

Locality, and Temporal-Spatial Hypothesis

Good fences make good neighbors? Last week at PGConf NYC, I had the pleasure of hearing Andreas Freund talk about the great work he’s been doing to bring async IO to Postgres 18. One particular result caught my eye: a large difference in performance between forward and reverse scans, seemingly driven by read ahead 1 . The short version is that IO layers (like Linux’s) optimize performance by proactively pre-fetching data ahead of the current read point in a file, so it’s already cached when needed. Notably, most of these systems don’t do this backwards. This leads to a big difference in performance between forward scans (where the pages are already in the cache when they’re needed) and backward scans (where the database needs to block on IO to fetch the next page). This led me to thinking more about a particular hypothesis behind many database designs: a temporal-spatial locality hypothesis 2 .

Before we get there, let’s talk about locality more generally, because it might be the single most important idea in database performance (and computer systems performance generally). Almost all database systems take advantage of these forms of locality, and would lose significant performance without taking advantage of them. Stacks of books could be written about these ideas. Stacks of books have been written about these ideas. We could talk about cache-oblivious algorithms , or non-polluting read and write instructions , or have an argument about linked lists. Instead, I want to zoom in on a particular idea in databases: the temporal-spatial hypothesis . The hypothesis I mean has a definition something like this: Temporal-spatial locality is the idea that data that was written at approximately the same time will be read at approximately the same time, and therefore should be stored near each other 4 .

Is the Temporal-Spatial Hypothesis true?

In a streaming system, where keys are written in order by a producer and read in order by subscribers, the temporal-spatial hypothesis is trivially true. Most real-world streaming systems are highly optimized based on this assumption, and choose on-disk formats and APIs that take advantage of it. Time-series, metrics, and observability systems mostly behave the same way. Readers in these systems are normally interested in a window of time, using the same concept of time that writers had.

Hash-based databases (like DynamoDB ) reject the temporal-spatial hypothesis . Or at least don’t optimize for it. When new rows are created, they are assigned a large random key (often by hashing the natural primary key), which spreads them over the storage space. This has huge write-time advantages, especially in mitigating write-time hot spots. They pay the cost at read time: an in-order scan of a table is no more efficient than a random scan. Relational schemas that assign large surrogate keys (such as UUIDs) to items similarly reject the hypothesis 3 . Distributed and sharded databases get the same hot-spot-avoiding benefits, which can be significant. Single-box databases may get better write performance from reduced lock or latch contention, or worse write performance because of reduced cache effectiveness. UUID keys can significantly reduce read performance in both types of systems, again by defeating spatial locality and making effective cache sizes smaller. These schemas may still be a good choice for data modelling reasons. Many such schemas restore the temporal ordering, through a layer of indirection, by indexing on a timestamp column.
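To see why this matters when the hypothesis holds, here’s a toy simulation (all parameters invented for illustration, and not from the talk): a reader that works through keys in roughly write order, against a page layout that preserves write order versus one that hashes keys across the key space, with a small LRU page cache in front of both.

```python
# Toy simulation: page-cache hit rate for a reader with temporal-spatial
# locality, against a write-ordered layout vs. a hashed layout.
# All parameters (page size, cache size, jitter window) are made up.
import random
from collections import OrderedDict

N_KEYS = 100_000      # keys, written in order 0..N-1
PAGE_SIZE = 100       # keys per page
CACHE_PAGES = 50      # LRU cache capacity, in pages
N_READS = 200_000
WINDOW = 500          # reads jitter within a window of write time

random.seed(42)

write_order = list(range(N_KEYS))
hashed = write_order[:]
random.shuffle(hashed)
page_of = {
    "write-ordered": {k: i // PAGE_SIZE for i, k in enumerate(write_order)},
    "hashed":        {k: i // PAGE_SIZE for i, k in enumerate(hashed)},
}

def hit_rate(layout):
    cache = OrderedDict()   # LRU of page ids
    hits = 0
    center = 0.0
    for _ in range(N_READS):
        # The reader moves through keys in roughly the order they were
        # written, with some jitter: the temporal-spatial hypothesis.
        center = (center + N_KEYS / N_READS) % N_KEYS
        key = int(center + random.randint(-WINDOW, WINDOW)) % N_KEYS
        page = page_of[layout][key]
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            cache[page] = True
            if len(cache) > CACHE_PAGES:
                cache.popitem(last=False)
    return hits / N_READS

for layout in ("write-ordered", "hashed"):
    print(f"{layout:>13}: hit rate {hit_rate(layout):.2%}")
```

With the write-ordered layout the reader touches a handful of pages at a time and the tiny cache is enough; with the hashed layout the same reads are scattered across the whole key space and the hit rate collapses to roughly cache size over table size.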
The flip side of that is that a lot of relational schemas use time-ordered primary keys (SERIAL, AUTO_INCREMENT, etc.) even when reads aren’t correlated in time. This leads to increased write-time contention with no particular read-time benefit. In turn, database systems spend significant effort weakening the guarantees around these time-ordered keys. These optimizations include making them not truly ordered (see CACHE in Postgres, for example), and giving them their own special isolation rules (see the rollback and visibility behavior of nextval in Postgres, for example).

I don’t have a lot of data to back this up, but my belief is that the temporal-spatial hypothesis is weakly true for OLTP workloads generally, but perhaps only to the extent that recent keys are hotter keys (the temporal hypothesis ), rather than a strong correlation between keys written at similar times. However, there are some workloads for which it is a key optimization, and these workloads need to either use data structures optimized for this fact, or very carefully design their schemas with knowledge of their database system’s underlying data structures.

Attempting to be a little crisper

Let me attempt to be a little crisper about the temporal-spatial hypothesis , and how it differs from other common ideas about locality. Let’s say we have keys $K = \{K_0, \ldots, K_i, K_{i+1}, \ldots\}$, and that for each key $K_i$ we have time since write $K^{W}_i$, time since the last access (read or write) $K^{A}_i$, and a probability $P_r(K_i)$ that it’ll be read again soon. Then, let’s say we have the array $\omega$, which is the set $K$ ordered by $K^W$ (e.g. $\omega_i$ is the key written immediately after $\omega_{i-1}$).

Temporal locality is the idea that data accessed recently is likely to be accessed again soon. This idea is what’s behind CPU caches, database buffer pools, and most caches you’ll come across in computer systems. Spatial locality is the idea that when we access data, we’re likely to access nearby data soon.

The temporal hypothesis is that $P_r(K_i)$ is inversely related to $K^{A}_i$. In other words, the more time that’s passed since a key was last accessed, the less likely it is to be accessed again soon. A corollary to the temporal hypothesis is that $P_r(K_i)$ is more strongly affected by $K^{W}_i$ than $K^{A}_i$ (i.e. recency of creation is a strong signal for heat). The spatial hypothesis is that $P_r(K_i)$ is inversely related to $K^{A}_{i-1}$ or $K^{A}_{i+1}$ (or, more generally, $K^{A}_{i+j}$ for small integers $j$) 5 . In other words, a key is more likely to be accessed soon if a nearby key was accessed recently.

Our temporal-spatial hypothesis is that $P_r(\omega_i)$ is related to $P_r(\omega_{i-1})$ or $P_r(\omega_{i+1})$ (or, more generally, $P_r(\omega_{i+j})$ for small integers $j$). In other words, keys written nearby in time are likely to be read nearby in time. Notice how, if the natural order of keys matches the write order of keys, and keys are stored in natural order, the spatial and temporal-spatial hypotheses are the same.

Footnotes:

1. At least that was my understanding, but there was a lot going on in the talk, and I’m not a deep Postgres expert. My apologies to Andreas if I’m misrepresenting his results here.
2. By hypothesis , here I mean an assumption we’re making that we’re baking into the system design. Consider the generational hypothesis behind many tracing garbage collector designs, which makes the assumption that objects tend to have short lifetimes. Or, more generally, that object lifetimes tend to be Pareto-distributed (an idea related to the Lindy Effect ).
3. At least in common database storage formats, which keep keys in key order to optimize for worst-case lookups based on key equality.
4. “Stored near each other” here is taking advantage of the ways that database systems already optimize for spatial locality. A more general version could instead say “and therefore be stored to optimize this access pattern”.
5. The story I told in the beginning about the performance of forward and backward scans is simply stating that the system was optimizing for the $K^{A}_{i-1}$ case and not the $K^{A}_{i+1}$ case.

Marc Brooker 3 weeks ago

Seven Years of Firecracker

Time flies like an arrow. Fruit flies like a banana. Back at re:Invent 2018, we shared Firecracker with the world. Firecracker is open source software that makes it easy to create and manage small virtual machines. At the time, we talked about Firecracker as one of the key technologies behind AWS Lambda, including how it’d allowed us to make Lambda faster, more efficient, and more secure.

A couple years later, we published Firecracker: Lightweight Virtualization for Serverless Applications (at NSDI’20). Here’s me talking through the paper back then: The paper went into more detail about how we’re using Firecracker in Lambda, how we think about the economics of multitenancy ( more about that here ), and how we chose virtualization over kernel-level isolation (containers) or language-level isolation for Lambda.

Despite these challenges, virtualization provides many compelling benefits. From an isolation perspective, the most compelling benefit is that it moves the security-critical interface from the OS boundary to a boundary supported in hardware and comparatively simpler software. It removes the need to trade off between kernel features and security: the guest kernel can supply its full feature set with no change to the threat model. VMMs are much smaller than general-purpose OS kernels, exposing a small number of well-understood abstractions without compromising on software compatibility or requiring software to be modified.

Firecracker has really taken off, in all three ways we hoped it would. First, we use it in many more places inside AWS, backing the infrastructure we offer to customers across multiple services. Second, folks use the open source version directly, building their own cool products and businesses on it. Third, it was the motivation for a wave of innovation in the VM space. In this post, I wanted to write a bit about two of the ways we’re using Firecracker at AWS that weren’t covered in the paper.

Bedrock AgentCore

Back in July, we announced the preview of Amazon Bedrock AgentCore . AgentCore is built to run AI agents. If you’re not steeped in the world of AI right now, you might be confused by the many definitions of the word agent . I like Simon Willison’s take : An LLM agent runs tools in a loop to achieve a goal. 1 Most production agents today are programs, mostly Python, which use a framework that makes it easy to interact with tools and the underlying AI model. My favorite one of those frameworks is Strands , which does a great job of combining traditional imperative code with prompt-driven model-based interactions. I build a lot of little agents with Strands, most being less than 30 lines of Python (check out the Strands samples for some ideas ).

So where does Firecracker come in? AgentCore Runtime is the compute component of AgentCore. It’s the place in the cloud where the agent code you’ve written runs. When we looked at the agent isolation problem, we realized that Lambda’s per-function model isn’t rich enough for agents, specifically because agents do lots of different kinds of work on behalf of different customers. So we built AgentCore Runtime with session isolation . Each session with the agent is given its own MicroVM, and that MicroVM is terminated when the session is over. Over the course of a session (up to 8 hours), there can be multiple interactions with the user, and many tool and LLM calls. But, when it’s over, the MicroVM is destroyed and all the session context is securely forgotten. This makes interactions between agent sessions explicit (e.g.
via AgentCore Memory or stateful tools), with no interactions at the code level, making it easier to reason about security. Firecracker is great here, because agent sessions vary from milliseconds (single-turn, single-shot agent interactions with small models) to hours (multi-turn interactions, with thousands of tool calls and LLM interactions). Context varies from zero to gigabytes. The flexibility of Firecracker, including the ability to grow and shrink the CPU and memory use of VMs in place, was a key part of being able to build this economically.

Aurora DSQL

We announced Aurora DSQL, our serverless relational database with PostgreSQL compatibility, in December 2024. I’ve written about DSQL’s architecture before , but here wanted to highlight the role of Firecracker. Each active SQL transaction in DSQL runs inside its own Query Processor (QP), including its own copy of PostgreSQL. These QPs are used multiple times (for the same DSQL database), but only handle one transaction at a time. I’ve written before about why this is interesting from a database perspective. Instead of repeating that, let’s dive down to the page level and take a look from the virtualization level.

Let’s say I’m creating a new DSQL QP in a new Firecracker MicroVM for an incoming connection to a database. One way I could do that is to start Firecracker, boot Linux, start PostgreSQL, start the management and observability agents, load all the metadata, and get going. That’s not going to take too long. A couple hundred milliseconds, probably. But we can do much better. With clones . Firecracker supports snapshot and restore , where it writes down all the VM memory, registers, and device state into a file, and then can create a new VM from that file. Cloning is the simple idea that once you have a snapshot you can restore it as many times as you like. So we boot up, start the database, do some customization, and then take a snapshot. When we need a new QP for a given database, we restore the snapshot. That’s orders of magnitude faster.

This significantly reduces creation time, saving the CPU used for all that booting and starting. Awesome. But it does something else too: it allows the cloned MicroVMs to share unchanged ( clean ) memory pages with each other, significantly reducing memory demand (with fine-grained control over what is shared). This is a big saving, because a lot of the memory used by Linux, PostgreSQL, and the other processes on the box isn’t modified again after start-up. VMs get their own copies of pages they write to (we’re not talking about sharing writable memory here), ensuring that memory is still strongly isolated between each MicroVM. Another knock-on effect is that the shared pages can also appear only once in some levels of the CPU cache hierarchy, further improving performance. There’s a bit more plumbing that’s needed to make some things like random numbers work correctly in the cloned VMs 2 .

Last year, I wrote about our paper Resource management in Aurora Serverless . To understand these systems more deeply, let’s compare their approaches to one common challenge: Linux’s approach to memory management. At a high level, in stock Linux’s mind, an empty memory page is a wasted memory page. So it takes basically every opportunity it can to fill all the available physical memory up with caches, buffers, page caches, and whatever else it may think it’ll want later. This is a great general idea.
But in DSQL and Aurora Serverless, where the marginal cost of a guest VM holding onto a page is non-zero, it’s the wrong one for the overall system. As we say in the Aurora Serverless paper , Aurora Serverless fixes this with careful tracking of page access frequency:

Cold page identification: A kernel process called DARC [8] continuously monitors pages and identifies cold pages. It marks cold file-based pages as free and swaps out cold anonymous pages.

This works well, but is heavier than what we needed for DSQL. In DSQL, we take a much simpler approach: we terminate VMs after a fixed period of time. This naturally cleans up all that built-up cruft without the need for extra accounting. DSQL can do this because connection handling, caching, and concurrency control are handled outside the QP VM. In a lot of ways this is similar to the approach we took with MVCC garbage collection in DSQL. Instead of PostgreSQL’s VACUUM, which needs to carefully keep track of references to old versions from the set of running transactions, we instead bound the set of running transactions with a simple rule (no transaction can run longer than 5 minutes). This allows DSQL to simply discard versions older than that deadline, safe in the knowledge that they are no longer referenced. Simplicity, as always, is a system property.
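That five minute rule deserves a small sketch of its own (mine, not DSQL’s code): once transaction lifetime is bounded, deciding which versions are garbage needs nothing but a local clock, with no tracking of who might still be reading.

```python
# A minimal sketch of time-bounded version GC (not DSQL's implementation).
# If no transaction may run longer than MAX_TXN_SECONDS, any version that
# has been superseded for longer than that can never be visible to a live
# snapshot, so each replica can drop it using only its own clock.
import time

MAX_TXN_SECONDS = 5 * 60

def gc_versions(versions, now=None):
    """versions: list of (commit_timestamp, value), oldest first.
    Returns only the versions some still-running transaction might read."""
    horizon = (now or time.time()) - MAX_TXN_SECONDS
    keep_from = 0
    for i, (ts, _) in enumerate(versions):
        if ts <= horizon:
            keep_from = i   # the newest version at/below the horizon stays
    return versions[keep_from:]

# Example: with a 5 minute bound, the version superseded 15 minutes ago is
# unreachable by any snapshot a live transaction could still hold.
now = 1_000_000.0
history = [(now - 900, "v1"), (now - 400, "v2"), (now - 10, "v3")]
print(gc_versions(history, now))   # keeps v2 and v3, drops v1
```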

Marc Brooker 2 months ago

Dynamo, DynamoDB, and Aurora DSQL

Programming Model

Dynamo is a simple key-value store that doesn’t offer transactions of any kind:

Dynamo does not provide any isolation guarantees and permits only single key updates.

DynamoDB offers single-shot serializable ACID transactions, with a single transaction consisting of multiple reads and writes. DSQL has the richest programming model, offering interactive transactions, full SQL support, and a rich type system.

Availability and Latency

The Dynamo paper makes a number of claims about the trade-offs between consistency, availability, and latency that have not stood the test of time. I’m not trying to call out the paper authors (several are personal friends of mine, and many are long-time colleagues), but to point out that we’ve learned a lot about building distributed databases in 20 years. Cloud infrastructure has also advanced considerably.

Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability.

This was true in the mid 2000s, but many ACID systems offer excellent availability today. That includes DynamoDB, DSQL, and others like Aurora Postgres. DynamoDB and DSQL can tolerate the failure of hosts, or an entire availability zone, without losing consistency, durability, or availability.

From the very early replicated database works, it is well known that when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously.

Here, the Dynamo paper is citing Bernstein and Goodman (from 1984) and Lindsay et al 1 (from 1979) to highlight the inherent trade-offs between availability and consistency. These results aren’t in any way wrong, but ( as I’ve argued before ) they aren’t as practically important as the Dynamo paper implies they are. Strongly consistent systems offer excellent availability in the face of failures of many kinds ( including entire region failures ). Dynamo also allows applications to pick different trade-offs for performance, losing durability, consistency, or availability in the process.

The main advantage of Dynamo is that its client applications can tune the values of N, R and W to achieve their desired levels of performance, availability and durability.

This made complete sense in the mid-2000s. But better ways of thinking about replication and failure correlation, vastly improved system performance (thanks SSDs!), and much better datacenter networks have made this kind of tunability uninteresting. It’s notable that both DynamoDB and DSQL offer significantly lower latencies than Dynamo while making none of the associated trade-offs discussed in the paper.

The Amazon Dynamo paper is a classic. You should read it if you haven’t. But time has marched on, we’ve learned a ton, we’ve got better hardware and better ideas, and much of what the Dynamo paper says doesn’t make sense in the real world anymore. That’s a good thing!
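For readers who haven’t read the paper: N is the number of replicas for a key, W the number that must acknowledge a write, and R the number consulted on a read. A toy illustration (mine, not the paper’s) of the overlap rule behind those knobs:

```python
# A toy illustration of the N/R/W knobs: a read quorum and a write quorum
# are guaranteed to overlap (and so to observe the latest acknowledged
# write) exactly when R + W > N. Not from the paper; just the pigeonhole.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3), (3, 3, 1)]:
    kind = ("overlapping (reads see latest write)" if quorums_overlap(n, r, w)
            else "non-overlapping (reads may be stale)")
    print(f"N={n} R={r} W={w}: {kind}")
```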

Marc Brooker 2 months ago

LLMs as Parts of Systems

Towers of Hanoi is a boring game, anyway. Over on the Kiro blog, I wrote a post about Kiro and the future of AI spec-driven software development , looking at where I think the space of AI-agent-powered development tools is going. In that post, I made a bit of a cheeky, oblique reference to a topic I think is super important. I asked Kiro to build a Towers of Hanoi game. It’s an oblique reference to Apple’s The Illusion of Thinking paper, and the discourse that followed it. The question of whether LLMs can scalably play Towers of Hanoi is interesting theoretically and scientifically, but not the most important question. The more important one is: can systems built with LLMs play these games? By picking Towers of Hanoi in that other post, I was pointing out that the answer is clearly yes . And has been for several LLM generations.

As a system builder, I’m much more interested in what systems of LLMs and tools can do together. LLMs and code interpreters. LLMs and databases. LLMs and browsers. LLMs and SMT solvers. These systems can do things, today, that LLMs alone simply can’t, and will never be able to do. More importantly, they can do things today orders of magnitude more cheaply and quickly than LLMs can, even in the case where they can do the same things. You know, this kind of thing:

Trivial? Yes. But I’ve now created a system that can solve problems that this LLM can’t. A better LLM can, but at about six orders of magnitude higher cost per example. Systems, fundamentally, are more than the sum of their components. A good system can do things that no component can do alone. The trivial example is trivial, but you can imagine how that power could extend to being able to use decades of progress in algorithms. And not only , but much more powerful things like SMT solvers, or ILP approximation, or MCMC. But, given the efficiency, even trivial things like , , and are interesting.

A more advanced example is Amazon Bedrock’s Automated Reasoning Checks . Automated Reasoning checks use LLMs to extract the underlying rules from a set of documents, and facts from an LLM or agent response, and then use an SMT solver to verify that the facts are logically consistent with the rules. Danilo’s blog post goes through a detailed example of what this looks like, if you’d like to understand more. Automated Reasoning checks, like my trivial example above, combine LLMs with other methods of reasoning to create a system that’s greater than the sum of its parts. LLMs are used for what they’re good at - extracting facts and rules from the messy natural language that humans use. SMT solvers are used for what they’re great at - reasoning precisely through a set of logical steps, and providing formal justification for that reasoning. The current generation of LLMs can’t do this type of reasoning alone, but systems composed of LLMs and other tools (SMT solvers in this case) can do it.

The hype and buzz around LLMs makes it easy to forget this point, but it’s a critical one. LLMs are more powerful, more dependable, more efficient, and more flexible when deployed as a component of a carefully designed system. It’s a very exciting time to be a systems person, because we’ve been handed a new and extremely powerful component that can be used to build better systems with new capabilities. Some old ideas will go away, but the fundamental ideas won’t. They’ll be more relevant than ever.

What About The Bitter Lesson?

Is this view of LLMs as parts of systems consistent with Rich Sutton’s The Bitter Lesson ?
Sutton says:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. We have to learn the bitter lesson that building in how we think we think does not work in the long run.

I’m not proposing that these systems, systems composed of LLMs and other computational and storage tools, should build in how we think we think 1 . The way I read it, Sutton’s point is not at all incompatible with the idea that there are better (more efficient, more reliable, etc) and worse (less efficient, less reliable, etc) ways to do computing. For example, Sutton doesn’t seem to be implying that generating and executing (or memoizing, retrieving, and executing) code to do a task is less good than doing it with a lot of linear algebra.

We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

Indeed. Computing, from Python to SMT, has been a powerful tool of discovery for over eighty years. Making these tools available to the systems we build, and specifically to the AI agents we build, gives them more power and leverage. Not by encoding the way we think, but by encoding the things computer science has learned about the way the universe works.
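Going back to the trivial example above, the shape of “an LLM runs tools in a loop” is small enough to sketch. Everything here is hypothetical glue (call_llm stands in for whatever model API you use), not any particular framework’s interface:

```python
# A sketch of "an LLM runs tools in a loop", with hypothetical glue code.
# The deterministic tool does the work the model is bad (or expensive) at.
def solve_hanoi(n: int, a="A", b="B", c="C") -> list[tuple[str, str]]:
    """Classic recursive solver: the 'decades of algorithms' tool."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, a, c, b)
            + [(a, c)]
            + solve_hanoi(n - 1, b, a, c))

TOOLS = {"solve_hanoi": lambda args: solve_hanoi(int(args["disks"]))}

def agent(goal: str, call_llm) -> str:
    """Run tools in a loop until the model produces a final answer."""
    transcript = [f"GOAL: {goal}"]
    while True:
        step = call_llm(transcript)   # e.g. {"tool": ..., "args": {...}}
        if "final_answer" in step:
            return step["final_answer"]
        result = TOOLS[step["tool"]](step["args"])
        transcript.append(f"TOOL {step['tool']} -> {result}")

if __name__ == "__main__":
    # A stub "model" that reaches for the tool once, then summarizes.
    def stub_llm(transcript):
        if not any(line.startswith("TOOL") for line in transcript):
            return {"tool": "solve_hanoi", "args": {"disks": 7}}
        return {"final_answer": f"Solved in {transcript[-1].count('(')} moves."}

    print(agent("Solve 7-disk Towers of Hanoi", stub_llm))
```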

Marc Brooker 3 months ago

Career advice, or something like it

Cynicism is bad. If I could offer you a single piece of career advice, it’s this: avoid negativity echo chambers. Every organization and industry has watering holes where the whiners hang out. The cynical. The jaded. These spots feel attractive. Everybody has something they can complain about, and complaining is fun. These places are inviting and inclusive: as long as you’re whining, or complaining, or cynical, you’re in. If you’re positive, optimistic, or ambitious, you’re out. Avoid these places. That doesn’t mean you need to be 100% up-beat all the time, or be a pushover, or never complain. Those things are normal human behavior. But strongly avoid communities that make complaining the core of their identity. My personal limit is about 20%. I’ll stop engaging with communities when 20% of the content is negative.

Nobody but you can decide how much you care about your career. Or how much you like your employer, colleagues, or industry. Not everything is in your control, even in a field like software engineering where autonomy and flexibility are widely available. I recommend you choose one of two paths. If you want to move your career or industry forward, focus on the positive parts of your role, and spend energy making things better. Alternatively, if you don’t want to advance your career, spend the right amount of energy to stay where you are. Then, instead of joining that whiny waterhole, go home and mow the lawn, play with your dog, take a walk in the woods with your kids, or whatever you enjoy. These complaining places aren’t just bad for your career, they’re bad for your mental and physical health too. Avoid them even if the content resonates with you. Especially if the content resonates with you. Positive echo chambers can be bad too, but I’ve run into very few of those during my career.

What I can say with certainty, though, is that none of the people I look up to in this industry are spending their time in on Slack. They recognize when stuff sucks, and either fix it or accept it. My advice: find the “yes, and” communities, and spend time there. Find the people doing cool stuff you admire, and spend time with them. Find the people doing the work you want to do, or living the life you want to live, and find ways to learn from them. You’re not going to find them in . Spend your energy at work doing great work. Spend your energy outside of work taking time with the people you love, doing things you enjoy, and showing up for your community. This doesn’t imply you shouldn’t work with others to push for change. Just be clear that’s what you’re doing. DMing away about how much things suck doesn’t make change. It just makes you sad and angry. Sadness and anger are seldom the emotions that drive real change.

The corollary to the above point is that you should protect your communities. There are communities you care about and love. If those communities become negativity echo chambers, a lot of good folks will disengage. The communities will look busier, more active, but the conversation will all be one-note. Then, slowly, your community will lose what made it good. The folks who made your community interesting will find other places to be, and other things to do with their time. Doing this is socially hard. It, at least, requires directly modelling what good looks like. Sharing the perspectives you want to see. Doing the work that needs to be done. It also requires some level of moderation, which is tough. It’s especially tough to turn the trend around once the echo starts.

Marc Brooker 4 months ago

Systems Fun at HotOS

One day somebody will tell me what systems means. Last week I attended HotOS 1 for the first time. It was super fun. Just the kind of conference I like: single-track, a mix of academic and industry, a mix of normal practical ideas and less-normal less-practical big thinking. I went partially because a colleague twisted my arm, and partially because of this line in the CFP: The program committee will explicitly favor papers likely to stimulate reflection and discussion. That sounds like a good time.

Some Papers or Talks I Enjoyed

I thought I’d give a shout-out to some of the papers I enjoyed the most. These aren’t the best papers, just ones I particularly liked for my own arbitrary reasons 2 .

The NIC should be part of the OS by Xu and Roscoe exposes a lot of what was traditionally the OS kernel’s state (like the run queue for each core) to the NIC, allowing the NIC to use that state to decide which packets to deliver and when. Their goal is to do better than existing kernel bypass methods. After all, many modern cloud applications are short bursts of compute between waiting for packets. They do this with CXL 3.0’s cache-coherent peripheral interconnect, but I don’t think peripheral cache coherence is actually critical to the big picture here. This approach of pushing scheduling work down to the NIC seems especially interesting for straight packet processing workloads, and for Serverless workloads, both of which tend to be very run and wait heavy with high concurrency. Just super fun deep systems work 3 .

Batching with End-to-End Performance Estimation by Borisov et al. This paper starts by questioning the common wisdom that batching is good for throughput and bad for latency, by presenting a set of scenarios that show that it can be good or bad for both latency and throughput depending on small timing differences. I love questioning the common wisdom, so this is my jam. They then point out that this situation can be improved by making batching logic aware of end-to-end latency, and propose a smart way to estimate that latency using Little’s Law.

Spork: A posix_spawn you can use as a fork by Vögele et al. One day, I decided to move to a new house. So I built a perfect replica of my old house in a new place, then tore it down, and built a new house 4 . That’s fork. Fork sucks. posix_spawn is better, but nobody uses it. So what do we do? In this paper, the authors propose to trap fork, analyze the forking binary (!) to find the exec that corresponds to the fork, and then dynamically rewrite (!) the forking program to call posix_spawn instead. If you like deep OS level optimizations, with just the right amount of unhingedness, you’ll enjoy this paper 5 .

Real Life is Uncertain. Consensus Should Be Too! by Frank et al 7 . You know how consensus papers talk about the f-threshold failure model, and safety and liveness, and every implementer of those papers realizes that f-threshold is kinda bunk because things don’t actually fail that way and the world is messy and probabilistic and correlated? Yeah. This paper looks at better ways to talk about these things, building a bridge between theory and practice. It goes on to propose some ideas for building better consensus algorithms that deal with the messiness of the real world. Cool paper, important conversation.

From Ahead-of- to Just-in-Time and Back Again: Static Analysis for Unix Shell Programs by Lazarek et al. Just read that title. Static analysis for shell programs. Static analysis for shell programs .
Shell programs, like , are a little bit of technical debt from a prior era 6 that we just can’t make go away. But what if we can make them way safer with static analysis? If your initial thought is ‘wow, that sounds hard’, you’re on the right track.

Analyzing Metastable Failures by Isaacs et al. This is the start of something I think is super important: making conceptual and practical progress on catching metastable failures at design time. Figure 1 and Figure 3 are worth the price of admission by themselves, a pair of important concepts that might turn out to be super important to the future of large-scale systems building.

Overall, HotOS was a great time, with tons of new and interesting ideas.

Interesting Trends

Marc Brooker 4 months ago

Good Performance for Bad Days

Good things are good, one finds. Two weeks ago, I flew to Toronto to give one of the keynotes at the International Conference on Performance Evaluation. It was fun. Smart people. Cool dark squirrels. Interesting conversations. The core of what I tried to communicate is that, in my view, a lot of the performance evaluation community is overly focused on happy case performance (throughput, latency, scalability), and not focusing as much as we need to on performance under saturation and overload. In fact, the opposite is potentially more interesting. For builders and operators of large systems, a lack of performance predictability under overload is a big driver of unavailability. This is a common theme in postmortems and outage reports across the industry. Overload drives systems into regimes they aren’t used to handling, which leads to downtime. Sometimes, in the case of metastable failures , this leads to downtime that persists even after the overload has passed. How did we get into this situation?

Not Measuring the Hard Stuff

At least one reason is immediately obvious if you pay attention to the performance evaluation in the majority of systems papers. Most of them show throughput, latency, or some other measure of goodness at a load far from the saturation point of the system. The first-order reason for this is unsurprising: folks want to show the system they built in a good light. But there are some second-order reasons too. One is that performance evaluation is easiest, and most repeatable, in this part of the performance curve, and it takes expertise that many don’t have to push beyond it. Some bolder authors will compare saturation points, showing that their systems are able to do more good stuff even when the load is excessive. Only the boldest will go beyond this saturation point to show the performance of their system under truly excessive amounts of load, after the point where performance starts to drop.

This regime is important, because it’s very hard to compose reliable end-to-end systems without knowing where the saturation points of components are, and how they perform beyond that point. Even if you try to do things like rate limiting and throttling at the front door, which you should, you still need to know how much you can send, and what the backend looks like when it starts saturating. As a concrete example, TCP uses latency and loss as a signal to slow down, and assumes that if everybody slows down congestion will go away. These nice, clean properties don’t tend to be true of more complex systems.

Open and Closed

If you read only one performance-related systems paper in your life, make it Open Versus Closed: A Cautionary Tale . This paper provides a crucial mental model for thinking about the performance of systems. Here’s the key image from that paper: When we look at the performance space, we see two things: real-world workloads are mostly open, and most benchmarks (and benchmarking tools) are closed. That doesn’t make sense. The most famous downside of this disconnect is coordinated omission , where we massively underestimate the performance impact of tail latency. But that’s far from the whole picture. Closed benchmarks are too kind to realistically reflect how performance changes with load, for the simple reason that they slow their load down when latency goes up. The real world isn’t that kind to systems. In most cases, if you slow down, you just have more work to be done later.

Metastability

As I’ve written about before ( a few times , in different ways ), metastability is a problem that distributed systems engineers need to pay more attention to.
Not paying attention to performance under overload means not paying attention to metastability, where the majority of real-world triggers are overload-related. Metastability isn’t some esoteric problem. It can be triggered by retries, or by caches, as I’ve written about before .

Disconnect between benchmark and real-world workloads

The other issue, as it always has been, is a disconnect between benchmark workloads and real-world workloads. This is true in that benchmarks don’t reflect the size and scale of work done by real-world systems. And in that they don’t reflect the coordination and contention patterns present in real-world workloads. The example I used was TPC-C, which has coordination patterns that are much easier to scale than most real-world workloads. When visualized as a graph of rows that transact together , that becomes clear - you can basically partition on warehouse and avoid all cross-partition write-write conflicts.

Performance evaluation is super important to system designers and operators. This is a community I care about a lot. But I think there’s a growing disconnect between practice and theory in this space, which we need to close to keep the field relevant and alive.
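To make the open-versus-closed point concrete, here’s a toy single-server simulation (all rates and sizes invented, and not from the talk): an open Poisson arrival stream keeps arriving no matter how slow the server gets, while a closed loop with a fixed number of clients politely backs off.

```python
# A toy M/M/1-style comparison of open and closed load, invented parameters.
import random
import heapq

random.seed(1)
MU = 1.0                 # service rate
N_JOBS = 200_000

def open_loop(arrival_rate):
    """Open system: Poisson arrivals, FIFO service (Lindley recursion)."""
    wait, total = 0.0, 0.0
    for _ in range(N_JOBS):
        service = random.expovariate(MU)
        total += wait + service
        wait = max(0.0, wait + service - random.expovariate(arrival_rate))
    return total / N_JOBS

def closed_loop(n_clients, think_rate):
    """Closed system: each client thinks, submits, waits, repeats."""
    heap = [(random.expovariate(think_rate), c) for c in range(n_clients)]
    heapq.heapify(heap)
    server_free, total = 0.0, 0.0
    for _ in range(N_JOBS):
        submit, client = heapq.heappop(heap)
        start = max(server_free, submit)
        finish = start + random.expovariate(MU)
        server_free = finish
        total += finish - submit
        heapq.heappush(heap, (finish + random.expovariate(think_rate), client))
    return total / N_JOBS

# The open loop's latency explodes as load approaches capacity; the closed
# loop's latency is capped by the number of clients in the system.
for load in (0.5, 0.9, 0.99):
    print(f"open   rho={load}: mean latency {open_loop(load * MU):7.1f}")
print(f"closed 50 clients : mean latency {closed_loop(50, 0.1):7.1f}")
```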

Marc Brooker 6 months ago

Decomposing Aurora DSQL

Finally, and again because of the MVCC scheme and use of physical clocks, DSQL doesn’t require a validate step on read-only transactions, or the second ordering step. Like most databases, it also doesn’t require a persist step following a read-only commit. Transactions that only do reads have an even simpler breakdown.

Coordination

Another useful way to use Alex Miller’s model is to think about which steps require coordination, either fundamentally or in any given implementation. Fundamentally, execution doesn’t (or, at least, coordination can be replaced with synchrony during execution using MVCC and physical time). Ordering doesn’t strictly require it either (physical time again, though you have to be super careful here). Validation does seem to require coordination. Persistence requires replication (at least in distributed databases), but that doesn’t imply it requires coordination. So, if coordination is what you’re optimizing for (such as because you’re running across multiple regions, or care deeply about scalability), then optimizing the validation phase makes sense. This mirrors the choice we made with DSQL, and reflects the core reasoning behind our choice of MVCC and OCC. In DSQL’s design, execution requires no coordination, persistence requires no cross-shard or cross-replica coordination, and validation and ordering require coordination between only the minimal number of shards.

Marc Brooker 6 months ago

One or Two? How Many Queues?

At low utilization, as expected, the queue length remains short and both systems offer reasonable service times (see my 2021 post on this topic to dive a bit deeper here ). However, as utilization approaches 100%, the mean wait times for the single-queue variant increase much more quickly than for the multi-queue variant. Keep in mind there’s some bias on the mean here - because the utilization of each variant is calculated independently, there are a lot more total arrivals for a quicker visit. Another way to think about it is how the result changes with the ratio of visit service times (i.e. the ratio between the mean latency of a standing or sitting visit). Perhaps counter-intuitively, the single-queue variant is slower even when the service times are equal, and things degrade from there (with loads of simulation noise).

What Can We Learn?

Perhaps one thing we’ll learn is whether this was a wise choice of framing for a blog post. But, more usefully to everybody but me, the go-to rule that one queue is better than multiple queues breaks down in this case of precommitment . If you have multiple different types of work in your system, a queue per type of work may be a good choice.
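The post’s own simulation isn’t reproduced here, but here’s my toy version of the setup as I read it (parameters invented): two visit types, one resource per type, and either one shared FIFO line with head-of-line blocking or a separate line per type.

```python
# Toy precommitment queueing sketch (my model, invented parameters):
# the person at the front of a shared line waits for *their* resource even
# if the other one is free, versus a separate FIFO line per resource.
import random

random.seed(7)
SERVICE_MEANS = (1.0, 2.0)          # a "fast" visit type and a "slow" one
N_CUSTOMERS = 200_000

def make_arrivals(rate):
    arrivals, t = [], 0.0
    for _ in range(N_CUSTOMERS):
        t += random.expovariate(rate)
        arrivals.append((t, random.randrange(2)))   # (arrival time, type)
    return arrivals

def shared_line(arrivals):
    """One FIFO line with head-of-line blocking."""
    free_at = [0.0, 0.0]
    prev_start, total_wait = 0.0, 0.0
    for arr, kind in arrivals:
        start = max(arr, prev_start, free_at[kind])
        free_at[kind] = start + random.expovariate(1.0 / SERVICE_MEANS[kind])
        total_wait += start - arr
        prev_start = start
    return total_wait / len(arrivals)

def per_type_lines(arrivals):
    """A FIFO line per visit type, each feeding its own resource."""
    free_at = [0.0, 0.0]
    total_wait = 0.0
    for arr, kind in arrivals:
        start = max(arr, free_at[kind])
        free_at[kind] = start + random.expovariate(1.0 / SERVICE_MEANS[kind])
        total_wait += start - arr
    return total_wait / len(arrivals)

# With a 50/50 mix, the slow resource sees rate/2 customers of mean 2.0,
# so its utilization equals the arrival rate.
for rho in (0.5, 0.8, 0.95):
    arrivals = make_arrivals(rate=rho)
    print(f"rho={rho}: one line {shared_line(arrivals):8.2f}, "
          f"line per type {per_type_lines(arrivals):8.2f}")
```

At high utilization the shared line does much worse: fast visits get stuck behind slow visits that are waiting for the busy slow resource, which is the precommitment effect the post is describing.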

Marc Brooker 8 months ago

What Fekete's Anomaly Can Teach Us About Isolation

Is it just fancy write skew? In the first draft of yesterday’s post , the example I used was one that showed Fekete’s anomaly. After drafting, I realized the example distracted too much from the story. But there’s still something I want to say about the anomaly, and so now we’re here. What is Fekete’s anomaly? It’s an example of a snapshot isolation behavior first described in Fekete, O’Neil, and O’Neil’s paper A Read-Only Transaction Anomaly Under Snapshot Isolation . The first time I read about it, I found it spooky. As in five stages of grief spooky. But the more I’ve thought about it, the more I think it’s just a great teaching example.

To understand the anomaly, let’s talk about two people. We’ll call them Pat and Betty . Pat and Betty share a pair of bank accounts - a savings account and a current account. They bank at Alan’s bank, which charges a $1 overdraft fee any time a withdrawal will reduce the total value of their accounts below $0. One day, Pat and Betty are running separate errands. Pat goes to the ATM, checks the savings balance and sees $0, then deposits $20. After completing his transaction, Pat comes back to the ATM, checks their balance, and sees $20 in savings, and $0 in current. Around the same time, Betty goes to the ATM and withdraws $10 from the current account. Checking their account later, they notice a balance of -$11 in the current account, and $20 in savings. But that’s impossible! Pat saw $0 and $20, so Alan’s bank shouldn’t have charged them that $1. Did they get ripped off?

Let’s tell the same story again, in SQL. Starting with some setup: Then we get to the anomaly itself, showing Pat’s two transactions ( and ), and Betty’s one ( ): Under snapshot isolation (SI), and some other implementations of weaker-than-serializable isolation levels, this SQL can run as-is. Under serializable isolation, at least one of these transactions would be rejected (standard Postgres would reject the commit of ). What’s interesting about this anomaly is that, if it wasn’t for , we could simply say that happens before and Alan’s Bank’s behavior was justified. But, by doing a read-only transaction, caught them in a weak isolation anomaly. But why, and what can we learn from this?

In a Picture

Before we answer that question, let’s draw a picture of the transactions and the database state: The interesting part, fittingly, is the bit marked the interesting part is here : the decision whether to commit . Why does commit under snapshot isolation? We can answer that in three ways:

The OCC view: is allowed to commit under SI, because it has no write-write conflicts with the transactions that committed between its start and end. ’s write set is , ’s is , and ’s is . No conflict there. would not be allowed to commit under serializability because of a read-write conflict with : ’s read set is , which intersects with ’s write set .

The 2PL MVCC view: Under SI, and read from different MVCC snapshots, and the write lock taken by on row doesn’t conflict with the write lock taken by on row .

The theory view: To quote from my favorite transaction theory paper, Crooks et al’s Seeing is Believing :

Like serializability, SI prevents transactions T from seeing the effects of concurrently running transactions. The commit test enforces this requirement by having all operations in \(T\) read from the same state \(s\), produced by a transaction that precedes \(T\) …

You can see that in the diagram: reads the results from our setup transaction , which directly precedes it in the history.
However, SI no longer insists on that state \(s\) being \(T\)’s parent state \(s_p\): other transactions, whose operations T will not observe, may commit in between \(s\) and \(s_p\).

Here, \(s_p\) is the state that applies its changes to, which isn’t the same state as the one it reads.

The commit test only forbids \(T\) from modifying any of the keys that changed value as the system’s state progressed from \(s\) to \(s_p\).

Which it hasn’t: only row has changed, and only needs to change .

But What Does It Mean?

Fekete’s anomaly sure is weird: by introducing a third read-only transaction, we get the database to show an anomalous behavior that would otherwise appear serializable. On the other hand, it also seems like a relatively straightforward case of constraint violation caused by write skew. To quote A Critique of ANSI SQL Isolation Levels :

Transactions must preserve the constraint predicate to maintain consistency: if the database is consistent when the transaction starts, the database will be consistent when the transaction commits. If a transaction reads a database state that violates the constraint predicate, then the transaction suffers from a constraint violation concurrency anomaly.

From the perspective of the database system builder, it’s a direct consequence of what we wrote about yesterday : SI allows a database (single-system or distributed) to avoid the coordination (inter-machine, inter-process, or inter-thread) necessary to detect read-write conflicts. Because that coordination scales with reads, and reads tend to be more common than writes in many DB workloads, this can be a big win. What about the perspective of the application builder?

The Application Builder’s View

The application builder needs to answer two questions about this anomaly (and about write skew and constraint violation more generally): As much as it makes DB-heads uncomfortable, evidence shows that many (if not most) application builders have concluded that the answer to question 1 is no . This isn’t an unreasonable answer. But let’s assume they do care, what do they do? One option is to choose a serializable database and use its serializable isolation level. This can work for some workloads, but certainly not all. Serializability comes with significant performance and concurrency implications. But there are some more surgical fixes. For example, in Aurora DSQL’s snapshot mode you can use 1 . This works because the theoretical model of isolation levels doesn’t map to special SQL features like in a super clean way, but roughly they are ways to increase isolation for some transactions. The second way is to force the write-write conflict.

Overall, Fekete’s anomaly isn’t something to be spooked about, but is an interesting example of the trade-off between serializability and weaker isolation levels. The anomaly is a direct result of the reduced coordination needed for SI: the very same reduced coordination that brings significant performance and scalability.
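The SQL isn’t reproduced here, but the interleaving is small enough to replay in a sketch (my model of the story, not the original SQL): each transaction reads from the snapshot it started with, and commits check only for write-write conflicts.

```python
# A toy replay of Pat and Betty's day under snapshot isolation.
state = {"savings": 0, "current": 0}
history = [dict(state)]                       # committed states, in order

def begin():
    return dict(history[-1])                  # snapshot = latest committed state

def commit(writes):
    new_state = dict(history[-1])
    new_state.update(writes)
    history.append(new_state)

betty = begin()                               # Betty's snapshot: 0 / 0
pat = begin()                                 # Pat's deposit snapshot: 0 / 0
commit({"savings": pat["savings"] + 20})      # Pat deposits $20 and commits

pat_check = begin()                           # Pat's read-only transaction
print("Pat sees:", pat_check)                 # savings 20, current 0

# Betty withdraws $10, using her (older) snapshot to decide on the fee.
total = betty["savings"] + betty["current"]
fee = 1 if total - 10 < 0 else 0
commit({"current": betty["current"] - 10 - fee})   # allowed under SI: no
                                                   # write-write conflict
print("Final state:", history[-1])            # savings 20, current -11
```

Pat’s read-only transaction saw $20 and $0, yet the fee was charged anyway, which is exactly the anomalous history the paper describes.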

Marc Brooker 8 months ago

Versioning versus Coordination

Spoiler: Versioning Wins. Today, we’re going to build a little database system. For availability, latency, and scalability, we’re going to divide our data into multiple shards, have multiple replicas of each shard, and allow multiple concurrent queries. As a block diagram, it’s going to look something like this: Next, borrowing heavily from Hermitage , we’re going to run some SQL. So far so good. We’ve inserted three rows into our database. Next, we’re going to run two concurrent transactions (from two different connections, call them and ), like so: There’s only one valid serializable 1 ordering of these transactions: at line , has seen the world before commits, and so must see that same pre- world until it commits. Therefore must happen before in the serial order.

How might we implement this requirement in our distributed architecture? We could use locking: takes a shared lock on at , blocks on it when trying to get an exclusive lock to update the row, and can complete. There are two practical problems with this approach. First, we’re blocking a writer on a reader, which reduces concurrency and throughput. Second, specific to our distributed architecture, needs to take its lock in a single place where needs to look for it. With multiple replicas, it’s not obvious where this single place should be. That can be solved by choosing a primary replica, implementing a single lock manager, or by locking on all replicas 2 . In all three cases, read scalability is lost and read coordination is required.

Enter David P. Reed’s 1979 work on versions . Instead of making its desired changes in-place, it creates a new version of the rows that only becomes visible to transactions that start after commits. , which started earlier, does not see these new versions. The storage layer needs to provide a way to request its reads as of a particular version, which it does by storing multiple copies of the data. The effect this has on the coordination in our database is significant: never has to block on . In fact, doesn’t even need to know that exists at all. could be off in the corner, doing its reads happily against one of a million data replicas, and is none the wiser. This helps scalability ( avoiding coordination is key to scalability ), but also helps throughput (writers never have to wait for readers, readers never have to wait for writers, readers never have to wait for readers), and performance consistency (no waiting means other transactions can’t slow you down).

Since the early 1980s, multi-versioning has been a core technique in the implementation of database systems, but its role in avoiding coordination in distributed systems is less well-known. The reason multi-versioning is so powerful is that it allows the system to have an extra piece of information (when was this data created?) about the data that it doesn’t need to discover from coordination patterns. As Reed wrote in 1979:

Since [versions] are objects that are used by programs, they give a tool for programming that allows explicit recognition of consistent states within the program. In contrast, traditional synchronization mechanisms, such as semaphores, locking, monitors, send-receive, etc. do not give a tool for representing or naming consistent states – one can deduce the states assumed by the system by timing relationships among the execution of steps of programs.

Versions are the difference between knowing consistent states and having to deduce consistent states! That’s a powerful idea.
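To make that concrete, here’s a toy multi-version store (a sketch, nothing like a real engine’s data structures): writers install new versions at their commit timestamp, readers read as of their snapshot timestamp, and neither ever waits for the other.

```python
# A toy multi-version key-value store: no reader ever blocks a writer.
import itertools

class MVStore:
    def __init__(self):
        self.versions = {}                    # key -> list of (commit_ts, value)
        self.clock = itertools.count(1)       # stand-in for a physical clock

    def now(self):
        return next(self.clock)

    def read(self, key, snapshot_ts):
        """Newest version visible at snapshot_ts; never takes a lock."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

    def commit(self, writes, commit_ts):
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((commit_ts, value))

db = MVStore()
db.commit({"x": "original"}, commit_ts=db.now())       # setup

t1_snapshot = db.now()                    # T1 begins, fixes its snapshot
t2_snapshot = db.now()                    # T2 begins concurrently
db.commit({"x": "t2's update"}, commit_ts=db.now())    # T2 writes and commits

# T1 still sees the world as of its snapshot, without ever having waited on
# T2, and without T2 knowing T1 exists.
print(db.read("x", t1_snapshot))          # -> original
print(db.read("x", db.now()))             # a later snapshot sees T2's write
```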
Picking A Version, Serving A Version

Above, I mentioned that requests its reads as of a particular version. This raises two questions: how to pick the version, and how the storage engine keeps track of all the versions. How to pick a version depends a lot on the properties you want. Serializability, in one common definition, would allow read-only transactions to pick almost any version (including the beginning of time , returning empty results for all reads). This definition is silly. Let’s go back to SQL to think about the results we want: Here, lines and are doing the same thing as in our first snippet, but we’ve introduced a third transaction . At line , we’re showing what most programmers would expect: a new transaction that starts after commits sees the results of ’s writes. This goal is, informally, called read-after-write consistency (generally considered a type of strong consistency ) 3 .

There are many ways to achieve this goal. One would be to have a version authority that hands out transaction version numbers in a strict order - but this re-introduces the exact coordination we were trying to avoid! In Aurora DSQL , we pick this time using a physical clock (EC2’s microsecond-accurate time-sync service ). This allows us to avoid all coordination between readers, including reads inside read-write transactions (e.g. notice for ’s has to be a read-modify-write to find the new for each row). The fundamental idea of using physical time this way dates back to the late 1970s, although mostly with folks acknowledging the difficulty of the synchronization problem. Somewhat amusingly, Reed says:

Synchronizing the system clocks whenever they come up by using the operator’s watch will usually get the system time accurate within a few minutes

before going on to note that Lamport clocks allow the system to do better. The 1970s consensus seems to be that adequate physical clock synchronization wasn’t possible - today it’s easily available in the cloud.

Keeping Track of All Those Versions!

The next question is how to keep track of all those versions. This is a deep question of its own, with tons of interesting trade-offs and different approaches. I won’t dive into those here, but instead take a different tack. Let’s isolate from our last example: which we can then rewrite as: Similarly, would be re-written as an at a new version number. I don’t know of any database system that’s implemented this way, but it’s a good illustration which brings us to two invariants we need to maintain around versions: In other words, we must retain the last version (or we lose durability), and we must retain at least until is done. The former property is a local one, that can be implemented by each replica with no coordination. The latter is a distributed one, which brings us back to our theme of coordination.

Again, we could clearly solve this problem with coordination: register each running transaction in a list, unregister it on commit, keep track of the low-water-mark timestamp. That’s possible to build (and even scale arbitrarily), but it’s nice to avoid that coordination. In Aurora DSQL we avoid that coordination in a simple way: transactions are limited in time (five minutes in the current preview release), and versions are tied to physical time. This turns invariant 2 into a local property too, once again avoiding coordination 4 . In distributed database systems, versioning and physical clocks allow coordination to be avoided in nearly all read cases.
This is an extremely powerful tool, because avoiding coordination improves throughput and scalability, reduces latency and cost, helps availability, and simplifies the design of systems.

Marc Brooker 10 months ago

Snapshot Isolation vs Serializability

Under serializability, once is committed, can only commit if $R_2 \cap W_1 = \emptyset$ (or, more generally, the set of all writes accepted between and does not intersect with 1 ). Under snapshot isolation, once is committed, can only commit if $W_2 \cap W_1 = \emptyset$ (or, more generally, the set of all writes accepted between and does not intersect with ). Notice how similar those two statements are: one about $R_2 \cap W_1$ and one about $W_2 \cap W_1$. That’s the only real difference in the rules. But it’s a crucial difference. It’s a crucial difference because of one of the cool and powerful things that SQL databases make easy: s. You can grow a transaction’s write set with and and friends, but most OLTP applications don’t tend to. You can grow a transaction’s read set with any , and many applications do that. If you don’t believe me, go look at the ratio between predicate (i.e. not exact PK equality) s in your code base versus predicate s and s. If the ratios are even close, you’re a little unusual.

Concurrency versus Isolation

This is where we enter a world of trade-offs 3 : avoiding SI’s write skew requires the database to abort (or, sometimes, just block) transactions based on what they read . That’s true for OCC, for PCC, or for nearly any scheme you can devise. To get good performance (and scalability) in the face of concurrency, applications using serializable isolation need to be extremely careful about reads . The trouble is that isolation primarily exists to simplify the lives of application programmers, and make it so they don’t have to deal with concurrency. SQL-like isolation models do that quite effectively, and are (in my opinion) one of the best ideas in the history of computing. But as we move up the isolation levels to serializability, we start pushing more complexity onto the application programmer by forcing them to worry more about concurrency from a performance and throughput perspective. This is the cause of my belief that snapshot isolation, combined with strong consistency, is the right default for most applications and most teams of application programmers: it provides a useful minimum in the sum of worries about anomalies and performance. Fundamentally, it does that by observing that write sets are smaller than read sets, for the majority of OLTP applications (often MUCH smaller).

How Low Can We Go? How Low Should We Go?

That raises the question of whether it’s worth going even lower in the isolation spectrum. I don’t have as crisp a general answer, but in Aurora DSQL’s case the answer is, mostly no . Reducing the isolation level from SI to does not save any distributed coordination, because of the use of physical clocks and MVCC to provide a consistent read snapshot to transactions without coordination. It does save some local coordination on each storage replica, and the implementation of multiversioning in storage, but our measurements indicate those are minimal. If you’d like to learn more about that architecture, here’s me talking about it at re:Invent 2024:

Optimistic versus Pessimistic Concurrency Control

What I said above about SI is, from my perspective, equally true in both OCC and PCC designs, when combined with MVCC to form read snapshots and timestamps to choose a read snapshot. In both cases, the only coordination strictly required for SI is to look for write-write conflicts ($W_2 \cap W_1$) that occurred between and , and to choose a commit timestamp based on that detection.
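To make the $R_2 \cap W_1$ versus $W_2 \cap W_1$ difference concrete, here’s a minimal sketch of the two commit checks as stated above (sets of keys, backward validation; not any particular engine’s code):

```python
# The two commit rules, side by side, as set intersections.
def can_commit_serializable(reads, committed_writes):
    """Abort if anything this transaction read was concurrently overwritten."""
    return all(not (reads & w) for w in committed_writes)

def can_commit_snapshot_isolation(writes, committed_writes):
    """Only write-write overlap matters; the read set is never examined."""
    return all(not (writes & w) for w in committed_writes)

# A typical OLTP shape: a big predicate read, a tiny write.
reads = {f"order:{i}" for i in range(1000)}
writes = {"order:42"}
committed_writes = [{"order:7"}, {"order:999"}]   # committed concurrently

print(can_commit_serializable(reads, committed_writes))          # False
print(can_commit_snapshot_isolation(writes, committed_writes))   # True
```

The example is also the point about set sizes: the read set is a thousand keys, the write set is one, and only the first rule ever looks at the big one.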
The big advantage OCC has in avoiding coordination has to do with how those write-write conflicts are detected. In backward validation OCC, the only state that is needed to decide whether to commit transaction $T$ is from already committed transactions . This means that all state in the commit protocol is transient, and can be reconstructed from the log of committed transactions, without causing any false aborts. This is a significant benefit! The other benefit of OCC, as I covered in my first post on DSQL, is that we can avoid all coordination during the process of executing a transaction, and only coordinate at COMMIT time. This is a huge advantage when coordination latency is considerable, like in the multi-region setting. As with SI versus serializability, the story of trade-offs between optimistic and pessimistic approaches is a long one. But, for similar reasons, I think OCC (when combined with MVCC) is the right choice for most transactional workloads.
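Here's a small sketch, under the same illustrative assumptions as above, of why backward-validation state is transient: the validator only needs the write sets of already committed transactions, so it can be rebuilt from the commit log without aborting in-flight transactions. None of this reflects DSQL's actual adjudicator code.

```python
class Validator:
    """Backward validation: commit decisions depend only on committed transactions."""

    def __init__(self):
        self.committed = []  # list of (commit_ts, frozenset of written keys)

    def try_commit(self, start_ts, write_keys, commit_ts):
        # Write-write conflict check against transactions that committed
        # after this transaction's read snapshot was taken.
        for ts, writes in self.committed:
            if ts > start_ts and not writes.isdisjoint(write_keys):
                return False  # conflict: abort
        self.committed.append((commit_ts, frozenset(write_keys)))
        return True

def rebuild_from_log(commit_log):
    # Recreate validator state purely from the log of committed transactions,
    # e.g. after the machine holding it is lost. No false aborts are needed,
    # because running transactions contributed nothing to this state.
    v = Validator()
    for commit_ts, write_keys in commit_log:
        v.committed.append((commit_ts, frozenset(write_keys)))
    return v
```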

Marc Brooker 10 months ago

DSQL Vignette: Wait! Isn't That Impossible?

Laws of physics are real. In today’s post, I’m going to look at how Aurora DSQL is designed for availability, and how we work within the constraints of the laws of physics. If you’d like to learn more about the product first, check out the official documentation , which is always a great place to go for the latest information on Aurora DSQL, and how to fit it into your architecture. In yesterday’s post, I mentioned that Aurora DSQL is designed to remain available, durable, and strongly consistent even in the face of infrastructure failures and network partitions. In this post, we’re going to dive a little deeper into DSQL’s architecture, focussing on multi-region active-active. Aurora DSQL is designed both for single-region applications (looking for a fast, serverless, scalable, relational database), and multi-region active-active applications (looking for all those things, plus multi-region, synchronous replication, and the ability to support active-active applications). Aurora DSQL’s Multi-Region Architecture Let’s start by dipping into Aurora DSQL’s multi-region architecture. We’ll focus on multi-region clusters here, because they highlight the trade-offs best, but the same rules apply to single-region clusters if we substitute AZ for region . In a multi-region DSQL cluster 1 , each of the two regions runs a nearly complete copy of the cluster’s infrastructure: a full copy of storage, enough Query Processors (QPs) to handle the load, and so on. The exception is the adjudicator: the leader adjudicator for each shard exists in only one region at a time. We’ll come back to that, because it’s a key part of the story. What benefits does this architecture offer? You’ll notice there are three regions here: two regions with DSQL endpoints and full copies of the data, and one witness region with only a copy of the Journal. That’s going to become important as we turn to discussing failures. What happens during a partition? Next, we’ll turn our attention to what happens when one of the three regions is disconnected and unavailable. If it’s the witness region that’s disconnected, nothing customer-observable happens (except a small increase in latency for some configurations 3 ). However, if one of the two full regions becomes uncontactable, then DSQL makes an important decision. When one of the two full regions becomes unavailable ( partitioned off if you like the CAP terminology), the DSQL endpoint in that region becomes unavailable for both reads and writes. The endpoint in the other region (the one on the majority side ) remains available, and continues to offer strong consistency, isolation, and multi-region durability. Applications can send their customers to the healthy region, and end customers see no unavailability. Some statements of the CAP theorem make it sound like this is an impossible option, but that’s not true: there’s no theoretical or practical constraint on continuing to provide availability and strong consistency on the majority side of a network partition, as long as we don’t accept writes on the smaller side. I wrote a blog post about this a couple of months back: Let’s Consign CAP to the Cabinet of Curiosities . Going back to the diagram, you’ll notice that one thing did need to move: the adjudicators that were in the disconnected region. Moving the adjudicator leader to the majority partition allows that partition to start making adjudication decisions.
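Before moving on to the adjudicator itself, here's a toy illustration of the majority-side rule described above, for a cluster with two full regions plus a witness. Region names and the function are invented for illustration; this isn't DSQL's actual failover logic.

```python
FULL_REGIONS = {"region-a", "region-b"}   # hypothetical region names
WITNESS = "witness-region"

def endpoint_should_serve(my_region, reachable):
    # An endpoint keeps serving reads and writes only if its side of the
    # partition holds a majority (2 of 3) of the cluster's regions.
    voters = FULL_REGIONS | {WITNESS}
    my_side = {my_region} | (reachable & voters)
    return len(my_side) >= 2

# Fully isolated region: stops serving, so the other side can stay consistent.
print(endpoint_should_serve("region-a", reachable=set()))       # False
# Region that can still reach the witness: keeps serving, strongly consistent.
print(endpoint_should_serve("region-b", reachable={WITNESS}))   # True
```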
In the Aurora DSQL design, the adjudicator contains no durable or persistent state, so this move requires only recreating its small amount of transient state on the majority side. This side knows all committed transactions, and so knows everything it needs to know to recreate the adjudicator state. Storage was already fully replicated, and so there was no need to move it. This is another benefit of the choice of OCC. In a pessimistic locking design, if lock state is lost there’s no choice but to abort all ongoing transactions. In our OCC-based design, the adjudicator’s state doesn’t depend on ongoing transactions at all, and can be reconstructed from the set of committed transactions alone. The adjudicators don’t even know about running transactions, and never learn about read-only transactions at all. Taking advantage of these properties We’ve covered what happens inside DSQL’s architecture. Next, let’s consider what that means for patterns of building multi-region applications on DSQL. First, I think that active-active architectures are the best choice for many multi-region application architectures. This simply means that both sides are actively taking customer traffic at all times (except during failures, when you move everybody over to one side). I think this for a few reasons: Here’s what that looks like: Here, we’ve built out an application architecture across multiple AZs in each region, used the local DSQL endpoint in each region, and routed customers to the right region using latency-based DNS routing. Taking advantage of an active-active distributed database means that many applications don’t even need to know they’re running across multiple regions - and don’t need to be modified to handle failover, switchover, and other tasks. Re-routing traffic at the front door, and handing hard problems like replication and consistency over to Aurora DSQL, greatly simplifies this architecture.
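As a sketch of the routing layer in that picture, here's roughly what creating latency-based DNS records could look like with Route 53 via boto3. The hosted zone ID, domain name, and per-region application endpoints are placeholders, and the application code itself doesn't have to change for this to work.

```python
import boto3

route53 = boto3.client("route53")

def add_latency_record(zone_id, name, region, target):
    # One record per region; Route 53 answers each query with the record
    # for the region that has the lowest latency to the client.
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,             # e.g. "app.example.com"
                "Type": "CNAME",
                "SetIdentifier": region,  # distinguishes the per-region records
                "Region": region,         # enables latency-based routing
                "TTL": 60,
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

add_latency_record("Z123EXAMPLE", "app.example.com", "us-east-1", "app-use1.example.com")
add_latency_record("Z123EXAMPLE", "app.example.com", "us-west-2", "app-usw2.example.com")
```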

Marc Brooker 10 months ago

DSQL Vignette: Transactions and Durability

Not Just Scale I’ve written a lot about scalability so far, but DSQL’s disaggregated architecture is about a lot more than scalability. We get significant durability, fault tolerance, availability, and performance consistency benefits from this approach. There’s no single machine, link, or even datacenter whose failure can stop reads or writes for a DSQL cluster, and no single place where data is stored. Availability is the most important property we get from disaggregation and distribution. Aurora DSQL is designed to remain available, strongly consistent, strongly isolated, and durable even when an AZ becomes unavailable (for single-region clusters), or when a region becomes unavailable (for multi-region clusters). Tomorrow’s post will go into how that works.

Marc Brooker 10 months ago

DSQL Vignette: Reads and Compute

Another key benefit of this coordination-free approach is that we can send reads to the nearest read replica (in the same region, and generally the same AZ) to reduce cost and latency. Reads never have to go to a leader or a primary to be sequenced or have their lock state maintained, simply because they don’t have any lock state. This is true in read-only transactions, read-write transactions, and even for the reads triggered by writes (e.g. an UPDATE is a read-modify-write). Avoiding Caching and Coherence Aurora DSQL uses a logical interface to storage. The QP doesn’t ask for pages, it asks for rows. Knowing the logical structure of the data it holds allows DSQL’s storage to offer quite a high-level interface to the QP: the QP can ask storage to do work like filtering , aggregation , projection , and other common tasks on its behalf. Unlike SQL designs that build on K/V stores, this allows DSQL to do much of the heavy lifting of filtering and finding data right next to the data itself, on the storage replicas, without sacrificing scalability of storage or compute. This, in turn, allows us to avoid the scalability bottleneck of having to have a large, coherent cache 8 on-box with SQL execution. In-AZ (or closer) networking, combined with carefully-designed protocols and the ability to push chatty work down, keeps storage fast without the need to cache. We still cache some low-write-rate information (like the list of tables and their definitions). You can see this in action with EXPLAIN: Here, the index-only scan on the primary key index on this table (Aurora DSQL tables are index organized) is pushed down to storage, along with the projection of the selected columns. This significantly reduces the number of round-trips between the QP and storage system, with a great impact on performance. Pushing operations down to storage is a good bet for another reason: Latency Lags Bandwidth . Networks have gotten a lot faster over the last couple of decades, but the rate of change of latency has been much slower than the rate of change of bandwidth (partially, this just has to do with speed-of-light limitations). This has been true over multiple decades, and looks set to continue for decades more. That trend means that pushdown, which moves operations close to the storage devices themselves and removes a lot of round-trips, is a good bet for the long-term. The Big Picture The overall approach here is disaggregation : we’ve taken each of the critical components of an OLTP database and made it a dedicated service. Each of those services is independently horizontally scalable, most of them are shared-nothing, and each can make the design choices that are optimal in its domain. This approach is enabled by the extremely fast and reliable networking available in modern data centers, and by designing each component as part of the overall architecture. Tomorrow we’ll go into the write path, which will reveal how the whole picture comes together.
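Coming back to the EXPLAIN example above: here's a small sketch of running it through a PostgreSQL client (psycopg) against a DSQL cluster endpoint. The table, columns, and connection string are placeholders, and the exact plan text will depend on your schema; the point is that a primary-key range query with a narrow projection is the kind of work that can be handled close to the data.

```python
import psycopg

# Placeholder DSN: substitute your cluster endpoint and credentials.
conn = psycopg.connect("host=<your-cluster-endpoint> dbname=postgres user=admin")

with conn.cursor() as cur:
    cur.execute("""
        EXPLAIN
        SELECT id, balance
        FROM accounts
        WHERE id BETWEEN 100 AND 200
    """)
    for (line,) in cur.fetchall():
        print(line)
```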

Marc Brooker 10 months ago

DSQL Vignette: Aurora DSQL, and A Personal Story

It's happening. In this morning’s re:Invent keynote, Matt Garman announced Aurora DSQL. We’re all excited, some of us extremely so, to have this preview release in customers’ hands. Over the next few days, I’m going to be writing a few posts about what DSQL is, how it works, and how to make the best use of it. This post is going to look at the product itself, and a little bit of a personal story. The official AWS documentation for Aurora DSQL is a great place to start to understand what DSQL is and how to use it. What is Aurora DSQL? Aurora DSQL is a new serverless SQL database, optimized for transaction processing, and designed for the cloud. DSQL is designed to scale up and down to serve workloads of nearly any size, from your hobby project to your largest enterprise application. All the SQL stuff you expect is there: transactions, schemas, indexes, joins, and so on, all with strong consistency and isolation 5 . DSQL offers active-active multi-writer capabilities in multiple availability zones (AZs) in a single region, or across multiple regions. Reads and writes, even in read-write transactions, are fast and local, requiring no cross-region communication (or cross-AZ communication in single-region setups). Transaction commit goes across regions (for multi-region setups) or AZs (for single-region setups), ensuring that your transactions are durable, isolated, and atomic. DSQL is PostgreSQL compatible, offering a subset of PostgreSQL’s (huge) SQL feature set. You can connect with your favorite PostgreSQL client (even the psql CLI), use your favorite ORMs and frameworks, etc. We’ll be adding more PostgreSQL-compatible features over time, making it easy to bring your existing code to DSQL. DSQL is serverless. Here, we mean that you create a cluster in the AWS console (or API or CLI), and that cluster will include an endpoint. You connect your PostgreSQL client to that endpoint. That’s all you have to do: management, scalability, patching, fault tolerance, durability, etc are all built right in. You never have to worry about infrastructure. As we launch Aurora DSQL, we’re talking a lot about multi-region active-active, but that’s not the only thing it’s for. We built DSQL to be a great choice for single-region applications of all sizes - from a few requests per day to thousands per second and beyond. A Personal Story In 2020 I was working on serverless compute at AWS, spending most of my time with the great AWS Lambda team 1 . As always, I spent a lot of time talking to customers, and realized that I was hearing two consistent things from serverless and container customers: Existing relational database offerings weren’t a great fit for the fast-moving scalable world of serverless and containers. These customers loved relational databases and SQL, for all the reasons folks have loved relational for forty years, but felt a lot of friction between the needs of serverless compute and existing relational products. Amazon RDS Proxy helped with some of this friction, but it wasn’t going away. Large, highly-regulated, AWS customers with global businesses were building applications across multiple AWS regions, but running into a tricky architectural trade-off. They could pick multi-region active-active (with DynamoDB Global Tables, for example), but lose out on SQL, ACID, and strong cross-region consistency.
Or they could choose active-standby (with Aurora Global Database, for example), but lose the peace of mind of having their application actively running in multiple places, and the ability to serve strongly consistent data to customers from their closest region. These customers wanted both things. At the same time, a few pieces of technology were coming together. One was a set of new virtualization capabilities, including Caspian (which can dynamically and securely scale the resources allocated to a virtual machine up and down), Firecracker 3 (a lightweight VMM for fast-scaling applications), and the VM snapshotting technology we were using to build Lambda Snapstart . We used Caspian to build Aurora Serverless V2 2 , bringing vertical auto scaling to Aurora’s full feature set. The second was EC2 time sync , which brings microsecond-accurate time to EC2 instances around the globe. High-quality physical time is hugely useful for all kinds of distributed system problems . Most interestingly, it unlocks ways to avoid coordination within distributed systems, offering better scalability and better performance. The new horizontal sharding capability for Aurora Postgres, Aurora Limitless Database , uses these clocks to make cross-shard transactions more efficient. The third was Journal, the distributed transaction log we’d used to build critical parts of multiple AWS services (such as MemoryDB , the Valkey-compatible durable in-memory database 4 ). Having a reliable, proven, primitive that offers atomicity, durability, and replication between both availability zones and regions simplifies a lot of things about building a database system (after all, Atomicity and Durability are half of ACID). The fourth was AWS’s strong formal methods and automated reasoning tool set . Formal methods allow us to explore the space of design and implementation choices quickly, and also help us build reliable and dependable distributed system implementations 6 . Distributed databases, and especially fast distributed transactions, are a famously hard design problem, with tons of interesting trade-offs, lots of subtle traps, and a need for a strong correctness argument. Formal methods allowed us to move faster and think bigger about what we wanted to build. Finally, AWS has been building big cloud systems for a long time (S3 is turning 19 next year, can you believe it?), and we have a ton of experience. Along with that experience is an incredible pool of talented engineers, scientists, and leaders who know how to build and operate things. If there’s one thing that’s AWS’s real secret sauce, it’s that our engineers and leaders are close to the day-to-day operation of our services 7 , bringing a constant flow of real-world lessons of how to improve our existing services and build better new ones. The combination of all of these things made it the right time to think big about building a new distributed relational database. We knew we wanted to solve some really hard problems on behalf of our customers, and we were starting to see how to solve them. So, in 2021 I started spending a lot more time with the databases teams at AWS, including the incredibly talented teams behind Aurora and QLDB. We built a team to go do something audacious: build a new distributed database system, with SQL and ACID, global active-active, scalability both up and down (with independent scaling of compute, reads, writes, and storage), PostgreSQL compatibility, and a serverless operational model.
I’m proud of the incredibly talented group of people that built this, and can’t wait to see how our customers use it. One Big Thing There are a lot of interesting benefits to the approach we’ve taken with DSQL, but there’s one I’m particularly excited about (the same one Matt highlighted in the keynote): the way that latency scales with the number of statements in a transaction. For cross-region active-active, latency is all about round-trip times. Even if you’re 20ms away from the quorum of regions, making a round trip (such as to a lock server) on every statement really hurts latency. In DSQL, local in-region reads are as fast as 1.2ms, so 20ms on top of that would really hurt. From the beginning, we took avoiding this as a key design goal for our transaction protocol, and we achieved it. In Aurora DSQL, you only incur additional cross-region latency on COMMIT, not for each individual SELECT, UPDATE, or INSERT in your transaction (from any of the endpoints in an active-active setup). That’s important, because even in the relatively simple world of OLTP, having 10s or even 100s of statements in a transaction is common. It’s only when you COMMIT (and then only when you COMMIT a read-write transaction) that you incur cross-region latency. Read-only transactions, and read-only autocommit SELECTs, are always in-region and fast (and strongly consistent and isolated). In designing DSQL, we wanted to make sure that developers can take advantage of the full power of transactions, and the full power of SQL. Later this week I’ll share more about how that works under the covers. The goal was to simplify the work of developers and architects, and make it easier to build reliable, scalable systems in the cloud. A Few Other Things In Aurora DSQL, we’ve chosen to offer strong consistency and snapshot isolation . Having observed teams at Amazon build systems for over twenty years, we’ve found that application programmers find dealing with eventual consistency difficult, and exposing eventual consistency by default leads to application bugs. Eventual consistency absolutely does have its place in distributed systems 8 , but strong consistency is a good default. We’ve designed DSQL for strongly consistent in-region (and in-AZ) reads, giving many applications strong consistency with few trade-offs. We’ve also picked snapshot isolation by default. We believe that snapshot isolation 9 is, in distributed databases, a sweet spot that offers both a high level of isolation and few performance surprises. Again, our goal here is to simplify the lives of operators and application programmers. Higher isolation levels push a lot of performance tuning complexity onto the application programmer, and lower levels tend to be hard to reason about. As we talk more about DSQL’s architecture, we’ll get into how we built for snapshot isolation from the ground up. Picking a serverless operational model, and PostgreSQL compatibility, was also based on our goal of simplifying the work of operators and builders. Tons of folks know (and love) Postgres already, and we didn’t want them to have to learn something new. For many applications, moving to Aurora DSQL is as simple as changing a few connection-time lines. Other applications may need larger changes, but we’ll be working to reduce and simplify that work over time.
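Coming back to the latency point under One Big Thing: here's a back-of-envelope comparison using the rough numbers from that section (1.2 ms local reads, a 20 ms cross-region round trip) and a made-up 50-statement transaction.

```python
local_statement_ms = 1.2      # fast in-region read
cross_region_rtt_ms = 20.0    # distance to the quorum of regions
statements = 50               # a moderately chatty OLTP transaction

# Hypothetical design that pays a cross-region round trip per statement.
per_statement = statements * (local_statement_ms + cross_region_rtt_ms)

# The approach described above: local statements, cross-region work only at COMMIT.
commit_only = statements * local_statement_ms + cross_region_rtt_ms

print(f"cross-region round trip per statement: {per_statement:.0f} ms")  # ~1060 ms
print(f"cross-region only at COMMIT:           {commit_only:.0f} ms")    # ~80 ms
```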
