Posts in Serverless (3 found)
Marc Brooker 10 months ago

DSQL Vignette: Aurora DSQL, and A Personal Story

It's happening. In this morning's re:Invent keynote, Matt Garman announced Aurora DSQL. We're all excited, and some extremely excited, to have this preview release in customers' hands. Over the next few days, I'm going to be writing a few posts about what DSQL is, how it works, and how to make the best use of it. This post looks at the product itself, plus a little bit of a personal story. The official AWS documentation for Aurora DSQL is a great place to start to understand what DSQL is and how to use it.

What is Aurora DSQL?

Aurora DSQL is a new serverless SQL database, optimized for transaction processing and designed for the cloud. DSQL is designed to scale up and down to serve workloads of nearly any size, from your hobby project to your largest enterprise application. All the SQL stuff you expect is there: transactions, schemas, indexes, joins, and so on, all with strong consistency and isolation [5].

DSQL offers active-active multi-writer capabilities in multiple availability zones (AZs) in a single region, or across multiple regions. Reads and writes, even in read-write transactions, are fast and local, requiring no cross-region communication (or cross-AZ communication in single-region setups). Transaction commit goes across regions (for multi-region setups) or AZs (for single-region setups), ensuring that your transactions are durable, isolated, and atomic.

DSQL is PostgreSQL compatible, offering a subset of PostgreSQL's (huge) SQL feature set. You can connect with your favorite PostgreSQL client (even the CLI), use your favorite ORMs and frameworks, etc. We'll be adding more PostgreSQL-compatible features over time, making it easy to bring your existing code to DSQL.

DSQL is serverless. Here, we mean that you create a cluster in the AWS console (or API or CLI), and that cluster will include an endpoint. You connect your PostgreSQL client to that endpoint. That's all you have to do: management, scalability, patching, fault tolerance, durability, etc. are all built right in. You never have to worry about infrastructure.

As we launch Aurora DSQL, we're talking a lot about multi-region active-active, but that's not the only thing it's for. We built DSQL to be a great choice for single-region applications of all sizes, from a few requests per day to thousands a second and beyond.

A Personal Story

In 2020 I was working on serverless compute at AWS, spending most of my time with the great AWS Lambda team [1]. As always, I spent a lot of time talking to customers, and realized that I was hearing two consistent things from serverless and container customers:

Existing relational database offerings weren't a great fit for the fast-moving, scalable world of serverless and containers. These customers loved relational databases and SQL, for all the reasons folks have loved relational for forty years, but felt a lot of friction between the needs of serverless compute and existing relational products. Amazon RDS Proxy helped with some of this friction, but it wasn't going away.

Large, highly regulated AWS customers with global businesses were building applications across multiple AWS regions, but running into a tricky architectural trade-off. They could pick multi-region active-active (with DynamoDB Global Tables, for example), but lose out on SQL, ACID, and strong cross-region consistency.
Or they could choose active-standby (with Aurora Global Database, for example), but lose the peace of mind of having their application actively running in multiple places, and the ability to serve strongly consistent data to customers from their closest region. These customers wanted both things.

At the same time, a few pieces of technology were coming together. One was a set of new virtualization capabilities, including Caspian (which can dynamically and securely scale the resources allocated to a virtual machine up and down), Firecracker [3] (a lightweight VMM for fast-scaling applications), and the VM snapshotting technology we were using to build Lambda SnapStart. We used Caspian to build Aurora Serverless V2 [2], bringing vertical auto scaling to Aurora's full feature set.

The second was EC2 time sync, which brings microsecond-accurate time to EC2 instances around the globe. High-quality physical time is hugely useful for all kinds of distributed system problems. Most interestingly, it unlocks ways to avoid coordination within distributed systems, offering better scalability and better performance. The new horizontal sharding capability for Aurora Postgres, Aurora Limitless Database, uses these clocks to make cross-shard transactions more efficient.

The third was Journal, the distributed transaction log we'd used to build critical parts of multiple AWS services (such as MemoryDB, the Valkey-compatible durable in-memory database [4]). Having a reliable, proven primitive that offers atomicity, durability, and replication between both availability zones and regions simplifies a lot of things about building a database system (after all, Atomicity and Durability are half of ACID).

The fourth was AWS's strong formal methods and automated reasoning tool set. Formal methods allow us to explore the space of design and implementation choices quickly, and also help us build reliable and dependable distributed system implementations [6]. Distributed databases, and especially fast distributed transactions, are a famously hard design problem, with tons of interesting trade-offs, lots of subtle traps, and a need for a strong correctness argument. Formal methods allowed us to move faster and think bigger about what we wanted to build.

Finally, AWS has been building big cloud systems for a long time (S3 is turning 19 next year, can you believe it?), and we have a ton of experience. Along with that experience is an incredible pool of talented engineers, scientists, and leaders who know how to build and operate things. If there's one thing that's AWS's real secret sauce, it's that our engineers and leaders are close to the day-to-day operation of our services [7], bringing a constant flow of real-world lessons about how to improve our existing services and build better new ones.

The combination of all of these things made it the right time to think big about building a new distributed relational database. We knew we wanted to solve some really hard problems on behalf of our customers, and we were starting to see how to solve them. So, in 2021 I started spending a lot more time with the database teams at AWS, including the incredibly talented teams behind Aurora and QLDB. We built a team to go do something audacious: build a new distributed database system, with SQL and ACID, global active-active, scalability both up and down (with independent scaling of compute, reads, writes, and storage), PostgreSQL compatibility, and a serverless operational model.
I'm proud of the incredibly talented group of people that built this, and I can't wait to see how our customers use it.

One Big Thing

There are a lot of interesting benefits to the approach we've taken with DSQL, but there's one I'm particularly excited about (the same one Matt highlighted in the keynote): the way that latency scales with the number of statements in a transaction. For cross-region active-active, latency is all about round-trip times. Even if you're 20ms away from the quorum of regions, making a round trip (such as to a lock server) on every statement really hurts latency. In DSQL, local in-region reads are as fast as 1.2ms, so 20ms on top of that would really hurt. From the beginning, we made avoiding this a key design goal for our transaction protocol, and we achieved it.

In Aurora DSQL, you only incur additional cross-region latency on COMMIT, not for each individual SELECT, INSERT, or UPDATE in your transaction (from any of the endpoints in an active-active setup). That's important, because even in the relatively simple world of OLTP, having tens or even hundreds of statements in a transaction is common. It's only when you commit (and then only when you commit a read-write transaction) that you incur cross-region latency. Read-only transactions, and read-only autocommit statements, are always in-region and fast (and strongly consistent and isolated).

In designing DSQL, we wanted to make sure that developers can take advantage of the full power of transactions and the full power of SQL. Later this week I'll share more about how that works under the covers. The goal was to simplify the work of developers and architects, and make it easier to build reliable, scalable systems in the cloud.

A Few Other Things

In Aurora DSQL, we've chosen to offer strong consistency and snapshot isolation. Having observed teams at Amazon build systems for over twenty years, we've found that application programmers find dealing with eventual consistency difficult, and that exposing eventual consistency by default leads to application bugs. Eventual consistency absolutely does have its place in distributed systems [8], but strong consistency is a good default. We've designed DSQL for strongly consistent in-region (and in-AZ) reads, giving many applications strong consistency with few trade-offs.

We've also picked snapshot isolation by default. We believe that snapshot isolation [9] is, in distributed databases, a sweet spot that offers both a high level of isolation and few performance surprises. Again, our goal here is to simplify the lives of operators and application programmers. Higher isolation levels push a lot of performance tuning complexity onto the application programmer, and lower levels tend to be hard to reason about. As we talk more about DSQL's architecture, we'll get into how we built for snapshot isolation from the ground up.

Picking a serverless operational model and PostgreSQL compatibility was also based on our goal of simplifying the work of operators and builders. Tons of folks know (and love) Postgres already, and we didn't want them to have to learn something new. For many applications, moving to Aurora DSQL is as simple as changing a few connection-time lines. Other applications may need larger changes, but we'll be working to reduce and simplify that work over time.
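The post has no code, but to make "changing a few connection-time lines" concrete, here is a minimal sketch of a multi-statement read-write transaction from Go, using the standard database/sql package with the lib/pq PostgreSQL driver. The endpoint, credentials, and tables are hypothetical placeholders (real DSQL authentication details may differ); the point is simply that the individual statements stay local and only the commit pays the cross-region round trip.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any PostgreSQL wire-compatible driver should do
)

func main() {
	// Hypothetical cluster endpoint and credentials; substitute your own.
	dsn := "host=your-cluster-endpoint.example.com port=5432 user=admin dbname=postgres sslmode=require"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A read-write transaction with several statements. Per the post, each
	// statement runs fast and local; only Commit incurs the cross-region
	// (or cross-AZ) coordination.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	var balance int
	if err := tx.QueryRow("SELECT balance FROM accounts WHERE id = $1", 1).Scan(&balance); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if _, err := tx.Exec("UPDATE accounts SET balance = balance - 10 WHERE id = $1", 1); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if _, err := tx.Exec("INSERT INTO transfers (account_id, amount) VALUES ($1, $2)", 1, -10); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil { // the only cross-region round trip
		log.Fatal(err)
	}
	log.Printf("balance before transfer: %d", balance)
}
```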

Phil Eaton 1 year ago

Build a serverless ACID database with this one neat trick (atomic PutIfAbsent)

Delta Lake is an open protocol for serverless ACID databases. Due to its simplicity, scalability, and the number of open-source implementations, it's quickly becoming the DuckDB of serverless transactional databases for analytics workloads. Iceberg is a contender too, and is similar in many ways. But since Delta Lake is simpler (simple != better), that's where we'll focus in this post.

Delta Lake has one of the most accessible database papers I've read (link). It's kind of like the movfuscator of databases. Thanks to its simplicity, in this post we'll implement a Delta Lake-inspired serverless ACID database in 500 lines of Go code with zero dependencies. It will support creating tables, inserting rows into a table, and scanning all rows in a table, all while allowing concurrent readers and writers and achieving snapshot isolation.

There are other critical parts of Delta Lake we'll ignore: updating rows, deleting rows, checkpointing the transaction metadata log, compaction, and probably much more I'm not aware of. We must start somewhere. All code for this post is available on GitHub.

Delta Lake writes immutable data files to blob storage. It stores the names of the new data files for a transaction in a metadata file. It handles concurrency (i.e. achieves snapshot isolation) with an atomic PutIfAbsent operation on the metadata file for the transaction. This method of concurrency control works because the metadata files follow a naming scheme that includes the transaction id in the file name. When a new transaction starts, it finds all existing metadata files and picks its own transaction id by adding 1 to the largest transaction id it sees. When a transaction goes to commit, writing the metadata file will fail if another transaction has already picked the same transaction id. If a transaction does no writes and creates no tables, it does not attempt to write any metadata file. Snapshot isolation!

Let's dig into the implementation. First, let's give ourselves some nice assertion methods, a debug method, and a uuid generator. Is that uuid method correct? Hopefully. Efficient? No. But it's preferable to avoid dependencies in pedagogical projects.

As mentioned above, the basic requirement is that we support atomically writing some bytes to a location if the location doesn't already exist. On top of that, we also need the ability to list locations by prefix and the ability to read the bytes at some location.

We'll diverge from Delta Lake in how we name files on disk. For one, we'll keep all files in the same directory, with a fixed prefix for metadata and a table-name prefix for each data file. This simplifies the implementation a bit. However, it also diverges from Delta Lake in that our transactions cover all tables. In Delta Lake that is not so: Delta Lake has a per-table transaction log, and only transactions that read and write the same table achieve snapshot isolation.

So let's set up an interface to describe these requirements. And this is literally all we need to get ACID transactions. That's crazy! We could implement the atomic part of this interface in 2024 using conditional writes on S3. Or we could implement it with a conditional request header on Azure Blob Storage or on Google Cloud Storage. Indeed, a good exercise for the reader would be to implement this interface for other blob storage providers and see your serverless cloud database in action!

But the simplest method of all is to implement it on the filesystem, which is what we'll do next. If we had a server, we could make the atomic write safe with a mutex. But we're serverless, baby. Thankfully, POSIX supports an atomic link operation, which fails if the new name already exists. So we'll create a temporary file, write out all the bytes, and finally link the temporary file to the permanent name we intended. For cleanliness (not correctness), if there is an error at any point, we'll remove the temporary file.

yencabulator on HN pointed out that an earlier version of this post had a buggy implementation (one that tried to manage atomicity solely through exclusive creation of the final file), which could leave potentially bad metadata files around if a write ever failed. The link-based approach works around that, because the file is already fully and correctly written by the time we do the link.
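The post's repository has the real definitions; since they aren't reproduced here, the following is a rough sketch of what the interface and its filesystem-backed implementation could look like. Every name is my own, not necessarily the repo's.

```go
package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

// objectStorage is a hypothetical name for the blob storage abstraction the
// post describes: atomic create-if-absent, list-by-prefix, and read.
type objectStorage interface {
	// putIfAbsent atomically writes bytes to name, failing if name already exists.
	putIfAbsent(name string, bytes []byte) error
	// listPrefix returns the names of all objects whose names start with prefix.
	listPrefix(prefix string) ([]string, error)
	// read returns the bytes stored at name.
	read(name string) ([]byte, error)
}

// fileObjectStorage implements objectStorage on a local directory.
type fileObjectStorage struct {
	basedir string
}

// putIfAbsent writes to a temporary file first, then atomically links it to
// the final name. link(2) fails if the name already exists, which is exactly
// the create-if-absent behavior the metadata files rely on.
func (fos *fileObjectStorage) putIfAbsent(name string, bytes []byte) error {
	tmp, err := os.CreateTemp(fos.basedir, ".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()

	if _, err := tmp.Write(bytes); err != nil {
		tmp.Close()
		os.Remove(tmpName) // cleanliness, not correctness
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		os.Remove(tmpName)
		return err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmpName)
		return err
	}

	// The atomic step: only one concurrent writer can ever win this link.
	if err := os.Link(tmpName, filepath.Join(fos.basedir, name)); err != nil {
		os.Remove(tmpName)
		return err
	}
	return os.Remove(tmpName)
}

// listPrefix reads directory entries in batches (rather than all at once) and
// keeps only the names that start with prefix.
func (fos *fileObjectStorage) listPrefix(prefix string) ([]string, error) {
	dir, err := os.Open(fos.basedir)
	if err != nil {
		return nil, err
	}
	defer dir.Close()

	var names []string
	for {
		batch, err := dir.Readdirnames(100)
		for _, n := range batch {
			if strings.HasPrefix(n, prefix) {
				names = append(names, n)
			}
		}
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func (fos *fileObjectStorage) read(name string) ([]byte, error) {
	return os.ReadFile(filepath.Join(fos.basedir, name))
}
```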
The list and read operations are minimal wrappers around filesystem APIs. It is worth talking a bit about reading a directory, though. Go doesn't provide a nice iterator API for us, and I didn't want to implement this as callbacks with filepath.WalkDir. We could use os.ReadDir, but it allocates entries for every file in the directory. Sure, in a pedagogical project we don't worry about millions of files, but that API (its error cases in particular) also isn't much simpler than reading directory entries in batches. What's more, even though we iterate through batches of directory entries and do prefix filtering before accumulating, we could still have considered returning an iterator here ourselves: it seems possible, even likely, that the number of data files grows quite large in a production system. But I was lazy. It would be nice if Go introduced an actual iterator API for reading a directory. :)

In any case, the ACID properties of Delta Lake (and Iceberg) don't depend on being able to read up-to-date data. This is because concurrent (or stale) transactions that write will fail on commit, and because all files written (even metadata files) are immutable. Since all data is immutable, we will always be able to read at least a consistent snapshot of data. But we will never be able to get SERIALIZABLE read-only transactions. This is just how Delta Lake and Iceberg work. And it is a similar or better consistency level to what any major SQL database gives you by default. You'll see what I mean later on when we implement transaction commits.

Now that we've got a blob storage abstraction and a filesystem implementation of it, let's start sketching out what a client and a transaction look like. In Delta Lake, a transaction consists of a list of actions. An action might be to define a table's schema, to add a data file, to remove a data file, and so on. In this post we'll only implement the first two. The action fields are all exported (i.e. capitalized, if you're not familiar with Go) because we will write them to disk as the transaction's metadata when the transaction commits. In fact, the actions and the transaction's id will be the only parts of the transaction we write to disk; everything else is in-memory state.

For convenience, we will track in memory a history of all previous actions, a mapping of table schemas, and a buffer of unflushed rows per table. Only the current transaction will ever have unflushed rows. The table mapping is populated when the transaction starts, by reading through the previous actions for create-table actions, and we also add to it when we create a table in the current transaction. We append to the current transaction's actions every time we write a new data file and every time we create a new table.
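Continuing the same sketch (same hypothetical package, not the repo's actual code), the transaction and client state described above might look something like this; the field names are guesses:

```go
// action mirrors the two kinds of actions we support: creating a table and
// adding a data file. Fields are exported so they can be serialized to JSON
// in the transaction's metadata file.
type action struct {
	CreateTable *createTableAction `json:",omitempty"`
	AddDataFile *addDataFileAction `json:",omitempty"`
}

type createTableAction struct {
	Name    string
	Columns []string
}

type addDataFileAction struct {
	Table string
	Name  string // name of the immutable data object in blob storage
}

// transaction holds the durable part (Id, Actions) plus in-memory bookkeeping.
type transaction struct {
	Id      int
	Actions []action

	// In-memory only: not serialized at commit time.
	previousActions []action            // actions accumulated from prior transactions
	tables          map[string][]string // table name -> column names
	unflushedRows   map[string][][]any  // table name -> buffered rows
}

// client is just blob storage plus an optional current transaction. There is
// no central database server.
type client struct {
	os objectStorage
	tx *transaction
}
```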
We will add rows to the in-memory buffer for a table until the buffer for that table reaches the flush threshold, at which point we will write that data to disk and add an add-data-file entry to the transaction's actions.

A client consists of a blob storage implementation and a possibly empty transaction, empty meaning there is no current transaction. In a previous version of my code I called this struct the database, but that's misleading: there is no central database. There is just the client and the blob storage. Clients work with transactions directly, and only when attempting to commit does the blob storage abstraction let the client know whether the transaction succeeded.

When we start a transaction, we first read all existing transactions from disk and accumulate the actions from each prior transaction. We interpret the create-table actions and materialize them into a current view of all tables. And we assign this transaction an id that is 1 greater than the largest existing transaction id we see. Again, it doesn't matter whether the list call returns an up-to-date list; blob storage notably gives few guarantees about the recency of LIST operations, and the Delta Lake paper mentions this too. Out-of-date transactions that attempt to write will be caught when we go to commit. Out-of-date transactions that only read will still read a consistent snapshot. And we're set.

When we create a table, we need to add a create-table action to the transaction's actions, and we also want to add the table info to the in-memory table mapping. We don't do any of this durably; the change is written to disk on commit (if the transaction succeeds). Easy peasy.

Now for the fun part: writing data! This is the next area where we'll diverge from Delta Lake. For the sake of zero dependencies, we store data in memory as an array of arrays of values, and when we later write rows to disk we write them as JSON. A real Delta Lake implementation would keep data in memory in Apache Arrow format and write it to disk as Parquet. In line with Delta Lake, though, we buffer data in memory until we have 64K rows for a particular table, at which point we flush those rows to disk. (When we go to commit a transaction, we flush any outstanding rows.)

Now let's implement flushing. Recall that data objects in Delta Lake (and Iceberg) are immutable. Once we've got enough data to write a data object, we give it a unique name, write it to disk, and add an add-data-file action to the transaction's actions. That's it for writing data!

Let's now look at reading data. We're going to make scanning mildly more complicated than it needs to be in pedagogical code, because scans will return an iterator rather than an array of all rows. The iterator first reads from in-memory (unflushed) data, and then reads through every data object for the table that is still a part of this transaction. We know which data objects are still part of this transaction by reading through all the actions. (A future version of this project would also drop data objects from that list by observing remove-data-file actions, but we don't do that in this post.) The iterator needs to track where we are in the in-memory rows, in the list of data objects, and within a particular data object, and it is driven by a method that goes through in-memory data first and then through what's on disk. That's it for scanning a table!

The final piece of the puzzle is committing a transaction. When we commit, we must flush any remaining data. A read-only transaction (one with no actions) is immediately done; there is no concurrency check. Otherwise, we serialize the transaction state and attempt to atomically write it as this transaction's metadata file.
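The commit path, the crux of the whole thing, could then look roughly like this. It builds on the hypothetical types above and skips the final flush of buffered rows; the metadata naming scheme here is illustrative, not the repo's.

```go
import (
	"encoding/json"
	"fmt"
)

// commitTx attempts the single atomic step that gives us snapshot isolation:
// writing this transaction's metadata file. If a concurrent transaction
// already claimed the same id, putIfAbsent fails and so does the commit.
func (c *client) commitTx() error {
	if c.tx == nil {
		return fmt.Errorf("no transaction in progress")
	}
	tx := c.tx
	c.tx = nil

	// (In the full implementation we'd flush any still-buffered rows here.)

	// A read-only transaction has no actions: nothing to write, nothing to
	// conflict with, so it's immediately done.
	if len(tx.Actions) == 0 {
		return nil
	}

	bytes, err := json.Marshal(tx)
	if err != nil {
		return err
	}
	// Fixed-width ids keep metadata file names lexically sortable,
	// e.g. "_log_0000000001".
	name := fmt.Sprintf("_log_%010d", tx.Id)
	return c.os.putIfAbsent(name, bytes)
}
```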
The only way this will fail is if there is another concurrent writer. This is the crux of Delta Lake. It's simple, and honestly it's a bit shocking. Real Delta Lake does support automatic retries in some cases, but primarily you are limited to a single writer per table, even if the writers are writing non-conflicting rows. Iceberg is basically the same here; it's just how metadata is tracked that differs. As mentioned in a note above, our implementation is actually stricter than Delta Lake, since it manages the transaction log for all tables together. This means you can get snapshot isolation across all tables (which Delta Lake doesn't support), but it also means significantly more contention and more failed write transactions.

The Delta Lake and Iceberg folks apparently wanted to avoid FoundationDB (i.e. the Snowflake architecture, which is mentioned in the Delta Lake paper) so much that they'd give up row-level concurrency to be mostly serverless. Is it worth it? Dunno. Delta Lake and Iceberg are getting massive adoption, many very smart people have worked (and continue to work) on both, and it is apparently what the market wants: every database-like product is implementing, or is planning to implement, Delta Lake or Iceberg.

Let's add a test to see what happens with concurrent writers; follow the comments and debug logs for details, and try it out (a rough sketch of the collision follows at the end of this post). That's pretty cool. And what about a reader and a concurrent writer? Observe that the reader always reads a snapshot; follow the comments again for the details.

As mentioned, we didn't touch a lot of things: handling updates and deletes, transaction log checkpoints, data object compaction, and so on. Take a close look at the Delta Lake paper and the Delta Lake Spec and see what you can do!
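Here is the forward-referenced sketch of the concurrent-writer collision, again using the hypothetical types from the earlier sketches in the same package. Transaction start and the insert paths are elided, so the two conflicting transactions are constructed by hand.

```go
import "log"

func main() {
	if err := os.MkdirAll("testdata", 0o755); err != nil {
		log.Fatal(err)
	}
	storage := &fileObjectStorage{basedir: "testdata"}

	// Two clients share the same blob storage and, having both listed the
	// same existing metadata files, both pick transaction id 1.
	c1 := &client{os: storage, tx: &transaction{
		Id:      1,
		Actions: []action{{CreateTable: &createTableAction{Name: "users", Columns: []string{"name"}}}},
	}}
	c2 := &client{os: storage, tx: &transaction{
		Id:      1,
		Actions: []action{{CreateTable: &createTableAction{Name: "orders", Columns: []string{"id"}}}},
	}}

	// The first commit wins and writes _log_0000000001.
	log.Println("c1 commit:", c1.commitTx())

	// The second commit loses: putIfAbsent fails because _log_0000000001
	// already exists, so the whole transaction fails. That's the one neat trick.
	log.Println("c2 commit:", c2.commitTx())
}
```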

Matthias Endler 3 years ago

zerocal - A Serverless Calendar App in Rust Running on shuttle.rs

Every once in a while my buddies and I meet for dinner. I value these evenings, but the worst part is scheduling them! We send out a message to the group. We wait for a response. We decide on a date. Someone sends out a calendar invite. Things finally happen. None of that is fun except for the dinner.

Being the reasonable person you are, you would think: "Why don't you just use a scheduling app?". I have tried many of them. None of them are any good. They are all… too much!

I don't want to have to create an account for your calendar/scheduling/whatever app. I don't want to have to add my friends. I don't want to have to add my friends' friends. I don't want to have to add my friends' friends' friends. You get the idea: I just want to send out an invite and get no response from you. Just let me send out an invite and whoever wants can show up.

The nerdy, introvert engineer's solution 💡

What we definitely need is yet another calendar app, one which allows us to create events and send out an invite with a link to that event! You probably didn't see that coming, did you? Oh, and I don't want to use Google Calendar to create the event, because I don't trust them. Like any reasonable person, I wanted a way to create calendar entries from my terminal.

That's how I pitched the idea to my buddies last time. The answer was: "I don't know, sounds like a solution in search of a problem." But you know what they say: never ask a starfish for directions.

Show, don't tell

That night I went home and built a website that creates a calendar entry from URL parameters. It lets you create a calendar event from the convenience of your command line. You can then save the output to a file and open it with your calendar app. In a sense, it's a "serverless calendar app", haha: there is no state on the server, it just generates a calendar event on the fly and returns it.

How I built it

You probably noticed that the URL contains "shuttleapp.rs". That's because I'm using shuttle.rs to host the website. Shuttle is a hosting service for Rust projects, and I had wanted to try it out for a long time. I initialized the project with shuttle's CLI, using the awesome axum web framework, and was greeted with everything I needed to get started. Let's quickly commit the changes. To deploy the code, I needed to sign up for a shuttle account. This can be done over at https://www.shuttle.rs/login . It will ask you to authorize access to your GitHub account, and then you can deploy. Now let's head over to zerocal.shuttleapp.rs. Deploying the first version took less than 5 minutes. Neat! We're all set for our custom calendar app.

Writing the app

To create the calendar event, I used the icalendar crate (shout out to hoodie for creating this nice library!). iCalendar is a standard for creating calendar events that is supported by most calendar apps. Creating a demo calendar event is simple enough.

How to return a file!?

Now that we have a calendar event, we need to return it to the user. But how do we return it as a file? There's an example of how to return a file dynamically in axum here. Some interesting things to note: every calendar file is a collection of events, so we wrap the event in a Calendar object, which represents the collection. IntoResponse is a trait that allows us to return any type that implements it, so our response type is a thin newtype wrapper that implements IntoResponse. In the implementation, we create a response, set the Content-Type header to the correct MIME type for iCalendar files (text/calendar), and return the response.

Add date parsing

This part is a bit hacky, so feel free to glance over it. We need to parse the date and duration from the query string. I used dateparser, because it supports sooo many different date formats. It would be nice to support even more human-readable formats, but I'll leave that for another time. Let's test it: nice, it works! Opening it in the browser creates a new event in the calendar. Of course, it also works on Chrome, but you do support the open web, right?
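zerocal itself is Rust (axum plus the icalendar crate), but the core trick of generating the iCalendar text and serving it with the text/calendar MIME type is framework-agnostic. For illustration only, here's a rough sketch of the same idea in Go's net/http, with hand-rolled iCalendar output and hypothetical parameter names:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// calendarHandler builds a minimal single-event iCalendar file from query
// parameters and serves it with the text/calendar MIME type.
func calendarHandler(w http.ResponseWriter, r *http.Request) {
	title := r.URL.Query().Get("title") // hypothetical parameter name
	if title == "" {
		title = "Dinner"
	}
	start := time.Now().Add(24 * time.Hour).UTC()
	end := start.Add(time.Hour)

	const stamp = "20060102T150405Z" // iCalendar UTC timestamp layout
	ics := fmt.Sprintf(
		"BEGIN:VCALENDAR\r\nVERSION:2.0\r\nPRODID:-//zerocal-sketch//EN\r\n"+
			"BEGIN:VEVENT\r\nUID:%d@example.com\r\nDTSTAMP:%s\r\nDTSTART:%s\r\nDTEND:%s\r\nSUMMARY:%s\r\n"+
			"END:VEVENT\r\nEND:VCALENDAR\r\n",
		time.Now().UnixNano(), time.Now().UTC().Format(stamp),
		start.Format(stamp), end.Format(stamp), title)

	w.Header().Set("Content-Type", "text/calendar")
	w.Header().Set("Content-Disposition", `attachment; filename="event.ics"`)
	fmt.Fprint(w, ics)
}

func main() {
	http.HandleFunc("/", calendarHandler)
	http.ListenAndServe(":8080", nil)
}
```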
And for all the odd people who don't use a terminal to create a calendar event, let's also add a form to the website.

Add a form

I modified the handler a bit to return the form if the query string is empty. After some more tweaking, we got ourselves a nice little form in all of its web 1.0 glory. And that's it! We now have a little web app that can create calendar events. Well, almost: we still need to deploy it.

Deploying

One more deploy command, and that's all. It's that easy. Thanks to the folks over at shuttle.rs for making this possible. The calendar app is now available at zerocal.shuttleapp.rs. Now I can finally send my friends a link to a calendar event for our next pub crawl. They'll surely appreciate it.

From zero to calendar in 100 lines of Rust

Boy, it feels good to be writing some plain HTML again. Building little apps never gets old. Check out the source code on GitHub and help me make it better! 🙏 Here are some ideas; check out the issue tracker and feel free to open a PR!

✅ Add location support. Thanks to sigaloid.
Add support for more human-readable date formats.
Add support for recurring events.
Add support for timezones.
Add Google Calendar short-links.
Add an example bash command to create a calendar event from the command line.
Shorten the URL?
