GreatReads - Blog Aggregator · Phoenix Framework

0 views

Andrew Atkinson 8 months ago

Avoid UUID Version 4 Primary Keys

Over the last decade, when working on databases with UUID Version 4 1 as the primary key data type, these databases have usually have bad performance and excessive IO. UUID is a native data type in Postgres can be stored as binary data. Various versions are in the RFC. Version 4 has mostly random bits, obfuscating information like when the value was created or where it was generated. Version 4 UUIDs are easy to generate in Postgres using the 2 function since version 13 (released in 2020). I’ve learned there are misconceptions about UUID Version 4, and sometimes these are the reasons users pick this data type. Because of the poor performance, misconceptions, and available alternatives, I’ve come around to a simple position: Avoid UUID Version 4 for primary keys . My more controversial take is to avoid UUIDs in general, but I understand there are some legitimate reasons for them without practical alternatives. As a database enthusiast, I wanted to have an articulated position on this classic “Integer v. UUID” debate. Among databases folks, debating this may be tired and clichéd. However, from my consulting work, I can say I work with databases using UUID v4 in 2024 and 2025, and still see the issues discussed in this post. Let’s dig in. Although unreleased as of this writing, and pulled from Postgres 17 previously, UUID V7 is part of Postgres 18 4 scheduled for release in the Fall of 2025. What kind of app databases are in scope for this post? The kinds of web applications I’m thinking of with this post are monolithic web apps, with Postgres as their primary OLTP database. The apps could be in categories like social media, e-commerce, click tracking, or business process automation apps. The types of performance issues discussed here are related to inefficient storage and retrieval, meaning they happen for all of these types of apps. What’s the core issue with UUID v4? The core issue with UUID Version 4, given 122 bits are “random or pseudo-randomly generated values” 1 and primary keys are backed by indexes, is the impact to inserts and retrieval of individual items or ranges of values from the index. Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have “byte ordering,” but this has no useful meaning for how they’re accessed. What use cases for UUID are there? One use case for UUIDs is when there’s a need to generate an identifier on a client or from multiple services, then passed to Postgres for persistence. For web apps, generally they instantiate objects in memory and don’t expect an identifier to be used for lookups until after an instance is persisted as a row (where the database generates the identifier). In a microservices architecture where the apps have their own databases, the ability to generate identifiers from each database without collisions is a use case for UUIDs. The UUID could also identify the database a value came from later, vs. an integer. For collision avoidance (see HN discussion 5 ), we can’t practically make the same guarantee with sequence-backed integers. There are hacks, like generating even and odd integers between two instances, or using different ranges in the int8 range. There are also alternative identifiers like using composite primary keys (CPKs), however the same set of 2 values wouldn’t uniquely identify a particular table. The avoidance of collisions is described this way on Wikipedia: 6 The number of random version-4 UUIDs which need to be generated in order to have a 50% probability of one collision: 2.71 quintillion This number would be equivalent to: Generating 1 billion UUIDs per second for about 86 years. Are UUIDs secure? One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.” From RFC 4122 1 Section 6 Security Considerations: Do not assume that UUIDs are hard to guess; they should not be used as security capabilities How can we create obfuscated codes from integers? While UUID V4s obfuscate their creation time, the values can’t be ordered to see when they were created relative to each other. We can achieve those properties with integers with a little more work. One option is to generate a pseudo-random code from an integer, then use that value externally, while still using integers internally. To see the full details of this solution, please check out: Short alphanumeric pseudo random identifiers in Postgres 7 We’ll summarize it here. The obfuscated id is stored in a generated column. By reviewing the generated values, they are similar, but aren’t ordered by their creation order. The values in insertion order were , , then . With alphabetical order, the last two would be flipped: first, then second, then third, sorting on the fifth character. If I wanted to use this approach for all tables, I’d try a centralized table that was polymorphic, storing a record for each table that’s using a code (and a foreign key constraint). That way I’d know where the code was used. Why else might we want to skip UUIDs? UUIDs are 16 bytes (128 bits) per value, which is double the space of bigint (8 bytes), or quadruple the space of 4-byte integers. This extra space adds up once many tables have millions of rows, and copies of a database are being moved around as backups and restores. A more considerable impact to performance though is the poor characteristics of writing and reading random data into indexes. For random UUID v4s, Postgres incurs more latency for every insert operation. For integer primary key rows, their values are maintained in index pages with “append-mostly” operations on “leaf nodes,” since their values are orderable, and since B-Tree indexes store entries in sorted order. For UUID v4s, primary key values in B-Tree indexes are problematic. Inserts are not appended to the right most leaf page. They are placed into a random page, and that could be mid-page or an already-full page, causing a page split that would have been unnecessary with an integer. Planet Scale has a nice visualization of index page splits and rebalancing. 8 Unnecessary splits and rebalancing add space consumption and processing latency to write operations. This extra IO shows up in Write Ahead Log (WAL) generation as well. Buildkite reported a 50% reduction in write IO for the WAL by moving to time-ordered UUIDs. Given fixed size pages, we want high density within the pages. Later on we’ll use pageinspect to check the average leaf density between integer and UUID to help compare the two. B-Tree page layout means you can fit fewer UUIDs per 8KB page. Since we have the limitation of fixed page sizes, we at least want them to be as densely packed as possible. Since UUID indexes are ~40% larger in leaf pages than bigint (int8) for the same logical number of rows, they can’t be as densely packed with values. As Lukas says, “ All in all, the physical data structure matters as much as your server configuration to achieve the best I/O performance in Postgres .” 9 This means that for individual lookups, range scans, or UPDATES, we will incur ~40% more I/O on UUID indexes, as more pages are scanned. Remember that even to access one row, in Postgres the whole page is accessed where the row is, and copied into a shared memory buffer. Let’s insert and query data and take a look at numbers between these data types. Let’s create integer, UUID v4, and UUID v7 fields, index them, load them into the buffer cache with pg_prewarm . I will use the schema examples from the Cybertec post Unexpected downsides of UUID keys in PostgreSQL by Ants Aasma. View andyatkinson/pg_scripts PR #20 . On my Mac, I compiled the extension. Once compiled and enabled for Postgres, I could use the extension functions to generate UUID V7 values. Another extension is used. It’s a module included with Postgres, so it just needs to be enabled per database where it’s used. The difference in latency and the enormous difference in buffers from the post was reproducible in my testing. “Holy behemoth buffer count batman” - Ants Aasma Cybertec post results: Since these are buffer hits we’re accessing them from memory, which is faster than disk. We can focus then on only the difference in latency based on the data types. How many more pages are accessed for the UUID index? 8,535,628 (8.5 million!) more 8KB pages were accessed, a 31229.4% increase. In terms of MB and MB/s that is: Calculating a low and high estimate of access speeds for memory: Accessing 68.3 GB of data from memory ( in PostgreSQL) would add: That’s between ~1 and ~3.4 seconds of additional latency solely based on the data type. Here we used 10 million rows and performed 1 million updates, but the latencies will get worse as data and query volumes increase. We can inspect the average fill percentage (density) of leaf pages using the pageinspect extension. The ( andyatkinson/pg_scripts PR #20 ) query in the repo gets the indexes for the integer and v4 and v7 uuid columns, their total page counts, their page stats, and the number of leaf pages. Using the leaf pages, the query calculates an average fill percentage. After performing the 1 million updates on the 10 million rows mentioned in the example, I got these results from that query: This shows the index had an average fill percentage of nearly 98%, while the UUID v4 index was around 79%. The Postgres buffer cache is a critical part of good performance. For good performance, we want our queries to produce cache “hits” as much as possible. The buffer cache has limited space. Usually 25-40% of system memory is allocated to it, and the total database size including table and index data is usually much larger than that amount of memory. That means we’ll have trade-offs, as all data will not fit into system memory. This is where the challenges come in! When pages are accessed they’re copied into the buffer cache as buffers. When write operations happen, buffers are dirtied before being flushed. 10 Since the UUIDs are randomly located, additional buffers will need to be copied to the cache compared to ordered integers. Buffers might be evicted to make space that are needed, decreasing hit rates. Since the tables and indexes are more likely to be fragmented, it makes sense to rebuild the tables and indexes periodically. Rebuilding tables can be done using pg_repack, pg_squeeze, or if you can afford to perform the operation offline. Indexes can be rebuilt online using . While the newly laid out data in pages, they will still not have correlation, and thus not be smaller. The space formerly occupied by deletes will be reclaimed for reuse though. If possible, size your primary instance to have 4x the amount of memory of your size of database. In order words if your database is 25GB, try and run a 128GB memory instance. This gives around 32GB to 50GB of memory for buffer cache ( ) which is hopefully enough to store all accessed pages and index entries. Use pg_buffercache 11 to inspect the contents, and pg_prewarm 12 to populate tables into it. One tactic I’ve used when working with UUID v4 random values where sorting is happening, is to provide more memory to sort operations. To do that in Postgres, we can change the setting. This setting can be changed for the whole database, a session, or even for individual queries. Check out Configuring work_mem in Postgres on PgMustard for an example of setting this in a session. Since Rails 6, we can control implicit_order_column. 13 The database_consistency gem even has a checker for folks using UUID primary keys. When ORDER BY is generated in queries implicitly, it may be worth ordering on a different high cardinality field that’s indexed, like a timestamp field. Cluster on a column that’s high cardinality and indexed could be a mitigation option. For example, imagine your UUID primary table has a timestamp column that’s indexed with , and clustering on that. I don’t see CLUSTER used ever really though as it takes an access exclusive lock. The CLUSTER is a one-time operation that would also need to be repeated regularly to maintain its benefits. For new databases that may be small, with unknown growth, I recommend plain old integers and an identity column (backed by a sequence) 14 for primary keys. These are signed 32 bit (4-byte) values. This provides about 2 billion positive unique values per table. For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts. I guess it comes down to what you know about your data size, how you can project growth. There are plenty of low growth business apps out there, in constrained industries, and constrained sets of business users. For Internet-facing consumer apps with expected high growth, like social media, click tracking, sensor data, telemetry collection types of apps, or when migrating an existing medium or large database with 100s of millions or billions of rows, then it makes sense to start with (int8), 64-bit, 8-byte integer primary keys. Since Postgres 18 is not yet released, generating UUID V7s now in Postgres is possible using the extension. If you have an existing UUID v4 filled database and can’t afford a costly migration to another primary key data type, then starting to populate new values using UUID v7 will help somewhat. Fortunately the binary data type in Postgres can be used whether you’re storing V4 or V7 UUID values. Another alternative that relies on an extension is sequential_uuids . 15 Do you see any errors or have any suggested improvements? Please contact me . Thanks for reading! https://datatracker.ietf.org/doc/html/rfc4122#section-4.4 ↩ ↩ 2 ↩ 3 https://www.postgresql.org/docs/current/functions-uuid.html ↩ https://stackoverflow.com/a/6953207/126688 ↩ https://www.thenile.dev/blog/uuidv7 ↩ https://news.ycombinator.com/item?id=36429986 ↩ https://en.wikipedia.org/wiki/Universally_unique_identifier ↩ https://andyatkinson.com/generating-short-alphanumeric-public-id-postgres ↩ https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql ↩ https://pganalyze.com/blog/5mins-postgres-io-basics ↩ https://stringintech.github.io/blog/p/postgresql-buffer-cache-a-practical-guide/ ↩ https://www.postgresql.org/docs/current/pgbuffercache.html ↩ https://www.postgresql.org/docs/current/pgprewarm.html ↩ https://www.bigbinary.com/blog/rails-6-adds-implicit_order_column ↩ https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/ ↩ https://pgxn.org/dist/sequential_uuids ↩ UUIDs (or GUID in Microsoft speak) 3 ) are long strings of 36 characters, 32 digits, 4 hyphens, stored as 128 bits (16 byte) values, stored using the binary data type in Postgres The RFC documents how the 128 bits are set The bits for UUID Version 4 are mostly random values UUID Version 7 includes a timestamp in the first 48 bits, which works much better with database indexes compared with random values Convert a decimal integer like “2” into binary bits. E.g. a 4 byte, 32 bit integer: 00000000 00000000 00000000 00000010 Perform an exclusive OR (XOR) operation on all the bits using a key Encode each bit using a base62 alphabet 27,332 buffer hits, index only scan on the column 8,562,960 buffer hits, index only scan on the UUID V4 index scan 68,285,024 MB or ~68.3 GB! more data that’s accessed Low estimate: 20 GB/s High estimate: 80 GB/s ~3.4 seconds of latency (low speed) ~0.86 seconds of latency (high speed) UUID v4s increase latency for lookups, as they can’t take advantage of fast ordered lookups in B-Tree indexes For new databases, don’t use for primary key types, which generates random UUID v4 values UUIDs consume twice the space of UUID v4 values are not meant to be secure per the UUID RFC UUID v4s are random. For good performance, the whole index must be in buffer cache for index scans, which is increasingly unlikely for bigger data. UUID v4s cause more page splits, which increase IO for writes with increased fragmentation, and increased size of WAL logs For non-guessable, obfuscated pseudo-random codes, we can generate those from integers, which could be an alternative to using UUIDs If you must use UUIDs, use time-orderable UUIDs like UUID v7 Franck Pachot for AWS Heroes has an interesting take on UUID in PostgreSQL Brandur has a great post: Identity Crisis: Sequence v. UUID as Primary Key 5mins of Postgres: UUIDs vs Serial for Primary Keys - what’s the right choice? andyatkinson/pg_scripts PR #20 https://datatracker.ietf.org/doc/html/rfc4122#section-4.4 ↩ ↩ 2 ↩ 3 https://www.postgresql.org/docs/current/functions-uuid.html ↩ https://stackoverflow.com/a/6953207/126688 ↩ https://www.thenile.dev/blog/uuidv7 ↩ https://news.ycombinator.com/item?id=36429986 ↩ https://en.wikipedia.org/wiki/Universally_unique_identifier ↩ https://andyatkinson.com/generating-short-alphanumeric-public-id-postgres ↩ https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql ↩ https://pganalyze.com/blog/5mins-postgres-io-basics ↩ https://stringintech.github.io/blog/p/postgresql-buffer-cache-a-practical-guide/ ↩ https://www.postgresql.org/docs/current/pgbuffercache.html ↩ https://www.postgresql.org/docs/current/pgprewarm.html ↩ https://www.bigbinary.com/blog/rails-6-adds-implicit_order_column ↩ https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/ ↩ https://pgxn.org/dist/sequential_uuids ↩

Performance

HTML

0 views

Andrew Atkinson 8 months ago

CORE Database Schema Design: Constraint-driven, Optimized, Responsive, and Efficient

In this post, we’ll cover some database design principles and package them up into a catchy mnemonic acronym. Software engineering is loaded with acronyms like this. For example, SOLID principles describe 5 principles, Single responsibility, Open-closed, Liskov substitution, Interface segregation and Dependency inversion, that promote good object-oriented design. Databases are loaded with acronyms, for example “ACID” for the properties of a transaction, but I wasn’t familiar with one the schema designer could keep in mind while they’re working. Thus, the motivation for this acronym was to help the schema designer, by packaging up some principles of good design practices for database schema design. It’s not based in research or academia though, so don’t take this too seriously. That said, I’d love your feedback! Let’s get into it. In picking an acronym, I wanted it to be short and have each letter describe a word that’s useful, practical, and grounded in experience. I preferred a real word for memorability! The result was “ CORE .” Let’s explore each letter and the word behind it. The first word (technically two) is “constraint-driven.” Relational databases offer rigid structures, but the ability to be changed while online, a form of flexibility in their evolution. We evolve their structure through DDL . They use data types and constraints to make changes, as entities and relationships evolve. Constraint-driven refers to leveraging all the constraint objects available, designing for our needs today, but also in a more general sense applying constraints (restrictions) to designs in the pursuit of data consistency and quality. Let’s look at some examples. Choose the appropriate data types, like a numeric data type and not a character data type when storing a number. Use for columns by default. Create foreign key constraints for table relationships by default. Validate expected data inputs using check constraints. For small databases, use primary keys. If tables get huge later, no problem, we can migrate the data into a bigger more suitable structure. The mindset is to prefer rigidity initially, design for today, then leverage the flexibility available to evolve later, as opposed to designing for a hypothetical future state. Databases present loads of optimization opportunities. Relational data is initially stored in a normalized form to eliminate duplication, but later denormalizations can be performed when read access is more important. When our use cases are not known at the outset, plan to iterate on the design, changing the structure to better support the use cases that emerge. This will mean evolving the schema design. This applies to tables, columns, constraints, indexes, parameters, queries, and anything that can be optimized to better support real use cases. Queries are restructured and indexes are added to reduce data access. Strive for highly selective data access (a small proportion of rows) on high cardinality (uniqueness) data to reduce latency. Critical background processes like VACUUM get optimized too. Resources (workers, memory, parallelization) are increased proportionally. When problems emerge like column or row level unexpected data, missing referential integrity, or query performance problems, engineers inspect logs, catalog statistics, and parameters, from the core engine and third party extensions to diagnose issues. When DDL changes are ready, the engineer applies them in a non-blocking way, in multiple steps as needed. Operations are performed “online” by default when practical. DDL changes are in a source code file, reviewed, tracked, and a copy of the schema design is kept in sync across environments. Parameter (GUC) tuning (Postgres: , etc.) happens in a trackable way. Parameters are tuned online when possible, and scoped narrowly, to optimize their values for real queries and use cases. It’s relatively costly to store data in the database, compared with file storage! The data consumes limited space and accessing data unnecessarily adds latency. Data that’s stored is queried later or it’s archived. To minimize space consumption and latency, tables, columns, constraints, and indexes are removed continually by default, when they no longer are required, to reduce system complexity. Server software is upgraded at least annually so that performance and security benefits can be leveraged. Huge tables are split into smaller tables using table partitioning for more predictable administration. There’s lots more to evolving a database schema design, but these principles are a few I keep in mind. Did you notice anything missing? Do you have other feedback? Please contact me with your thoughts. Over the years, I’ve learned a lot from Postgres.fm hosts Nikolay and Michael , and other community leaders like Lukas and Franck , as they’ve shaped my database design choices. I’m grateful to them for sharing their knowledge and experience with the community. Thanks for reading!

Programming

0 views

Andrew Atkinson 9 months ago

Tip: Put your Rails app on a SQL query diet

Much of the time taken processing HTTP requests in web apps is SQL queries. To minimize that, we want to avoid unnecessary and duplicate queries, and generally perform as few queries as possible. Think of the work that needs to happen for every query. The database engine parses it, creates a query execution plan, executes it, and then sends the response to the client. When the response reaches the client, there’s even more work to do. The response is transformed into application objects in memory. How do we see how many queries are being created for our app actions? When doing backend work in a web app like Rails, monitor the number of queries being created directly, by the ORM, or by libraries. ORMs like Active Record can generate more than one query from a given line of code. Libraries can generate queries that are problematic and may be unnecessary. Over time, developers may duplicate queries unknowingly. These are all real causes of unnecessary queries from my work experience. Why are excessive queries a problem? Besides parsing, planning, executing, and serializing the response, the client is subject to a hard upper limit on the number of TCP connections it can send to the database server. In Postgres that’s configured as . The application will have a variable number of open connections based on use, and its configuration of processes, threads and its connection pool. Keeping the query count low helps avoid exceeding the upper limit. What about memory use? With Ruby on Rails, the cost of repeated queries is shifted because the SQL Cache is enabled by default, which stores and serves results for matching repeated queries, at the cost of some memory use. As an side, from Rails 7.1 the SQL Cache uses a least recently used (LRU) algorithm . We can also configure the max number of queries to cache, 100 by default, to control how much memory is used. Prior to Rails 7.2, I recommend adding the query_count gem which does a simple thing, it shows the count of SQL queries processed for an action. The count is in the Rails log file like this: . In this case, 100 queries were performed and 50 used the SQL Cache. From Rails 7.2 onward, the count of queries is now built in, so query_count is no longer needed . Rails 7.2 onward looks like this: . Here 10 queries ran, and 1 used the SQL Cache. While the SQL Cache saves the roundtrip for a repeated query, ideally we want to eliminate the repeated query. It’s worth hunting for it and considering refactoring or restructuring data access. Another tactic is using memoization to store results for the duration of processing one controller action. Read more about that: Speeding up Rails with Memoization . How do I get started? To get started, identify some slow API endpoints in production, run them locally in development, and begin monitoring their quantity of SQL queries. Find the Source code locations for database queries in Rails with Marginalia and Query Logs . Determine how to factor out data access that can be shared. It’s hard to give a generic number. However, duplicate queries are a category to remove. Let’s say you’ve got a Book model for your bookstore app. Scan your Rails log file for a pattern like this: If you see that sort of pattern, track down the source locations, and eliminate any repeated loads. Let’s assume this is not a N + 1 queries problem , but repeated access to the same data from different source code locations. You may be able to factor out and consolidate a data load. You may be able to use an existing loaded collection for an existence check, or use memoization to use previously calculated results. Using these tactics, I’ve reduced controller actions with 250+ SQL queries (a ton!) to 50 or fewer (still a lot), by going through these steps. Monitor the log, find source locations for first party code, ORM generated queries, query code from libraries (gems), Rails controller action “before filters,” and other sources, then eliminate and consolidate. When faced with a lot of queries, I find it helpful to study the bare minimum of what’s needed by the client, working outside in, then look to see if it’s possible to reduce the tables, rows, and columns to only what’s needed. Track the count of SQL queries performed in different versions of Rails Remove unnecessary queries so they don’t use limited system resources Eliminate repeated queries to keep the count as low as possible Only access data that’s needed for client application use cases

0 views

Andrew Atkinson 9 months ago

Big Problems From Big IN lists with Ruby on Rails and PostgreSQL

If you’ve created web apps with relational databases and ORMs like Active Record (part of Ruby on Rails), you’ve probably experienced database performance problems after a certain size of data and query volume. In this post, we’re going to look at a specific type of problematic query pattern that’s somewhat common. We’ll refer to this pattern as “Big s,” which are queries with an clause that has a big list of values. As data grows, the length of the list of values will grow. These queries tend to perform poorly for big lists, causing user experience problems or even partial outages. We’ll dig into the origins of this pattern, why the performance of it is poor, and explore some alternatives that you can use in your projects. The technical term for values are a parenthesized list of scalar expressions . For example in the SQL query below, the clause portion is and the list of scalar expressions is . The purpose of this clause is to perform filtering. Looking at a query execution plan in Postgres, we’ll see something like this fragment below: This of course filters the full set of books down to ones that match on . Filtering is a typical database operation. Why are these slow? Remember that our queries are parsed, planned, and executed. A big list of values are treated like constants, and don’t have associated statistics. Queries with big lists of values take more time to parse and use more memory. Without pre-collected table statistics for planning decisions, PostgreSQL is more likely to mis-estimate cardinality and row selectivity. This can mean the planner chooses a sequential scan over an index scan, causing a big slowdown. How do we create this pattern? In Active Record, a developer might create this query pattern by using to collect some ids in a list, then pass that list as an argument to another query. Here’s an example of that: The are supplied as the argument querying by foreign key: Another scenario is when this query is created from ORM methods. What does that look like? This query pattern can happen when using eager loading methods like or . This Crunchy Data post mentions how eager loading methods produce clause SQL queries. The post links to the Eager Loading Associations documentation which has examples in Active Record and the resulting SQL that we’ll use here. Let’s first discuss N+1 with these examples. Let’s study the examples here. Here’s some Active Record for books and authors: The issue above is the undesirable N+1 query pattern, where a table is repeatedly queried in a loop, instead of bulk loading all of the desired authors. To fix the N+1, we’ll add the eager loading method to the code above. That looks like this: We’ve now eliminated the N+1 queries, but we’ve opened ourselves up to a new possible problem. While the fixed the N+1 queries, Active Record is now creating two queries, with the second one having an clause. Here’s the example from above as SQL: Here we only have 10 values for the clause, so performance will be fine. However, once we’ve got hundreds or thousands of values, we will run into the problems described above. Performance will tank if the primary key index isn’t used for this filtering operation. Are there alternatives for eager loading? Besides and which create two queries with the second having an clause, there’s another way to do eager loading in Active Record. An alternative method works a little bit differently. It produces a single SQL query that uses a . Here’s an example of from the Active Record documentation: The following single SQL query is produced. Note that it has no clause. Since we’re now using a join operation, we’ve got statistics available from both tables. This makes it much more likely PostgreSQL can correctly estimate selectivity and cardinality. The planner also isn’t needing to parse and store a large list of constant values. While clauses might perform fine with smaller inputs, e.g. 100 values or fewer, 1 for large lists we should try and restructure the query to use a join operation instead. Besides restructuring the queries into joins, are there other alternatives? Crunchy Data’s post Postgres Query Boost: Using ANY Instead of IN describes how is more restrictive on the input. A more usable alternative to can be using or , which has more flexibility in handling the list of values. Here’s A CTE example using : However, is not generated by Active Record. What if we want to generate these queries using Active Record? One option is to use the method provided by the ActiveRecordExtended gem. Let’s talk at another alternative approach using a clause. In the comments in the PR above, Vlad and Sean discussed an alternative for using a clause. Let’s look at an example with a CTE and clause: Or we can write this as a subquery: This is better because the list is a big list of scalar expressions, where the clause is treated like a relation (or table). This can help with join strategy selection. Yet another option for big lists of values is to put these into a temporary table for the session. The temporary table can even index the ids. Another form is using with an ARRAY: The form can perform better. With an list, the values are parsed like a chain of OR operations, with the planner handling one branch at a time. is treated like a single functional expression. This form also supports prepared statements. With prepared statements, the statement is parsed and planned once and then can be reused. Here’s an example of fetching books by author: Unfortunately generic guidelines here won’t guarantee success in your specific database. Row counts, data distributions, cardinality, or correlation are just some of the factors that affect query execution. My recommended process is to test on production-like data, work in the SQL layer, then try out restructured queries using these tactics, and study their query execution plans collected using . Query plan collection and analysis is outside the scope of this post, but in brief, you’ll want to compare the plans and look to access fewer buffers, at lower costs, with fewer rows evaluated, fewer loops, for more efficient execution. If you’re working in Active Record, you’d then translate your SQL back into the Active Record source code location where the queries were generated. How do we find problematic queries that ran earlier in Postgres? To find out if your query stats include the problematic queries, let’s search the results of by querying the field. Unfortunately these don’t always group up well, so there can be duplicates or near-duplicates. You may have lots of PGSS results to sift through. Here’s a basic query to filter on for : See the linked PR for a reproduction set of commands to create these tables, queries, and then inspect the query statistics using PGSS. While you can find and restructure your queries towards more efficient patterns, are there any changes coming to Postgres itself to better handle these? As part of the PostgreSQL 17 release in 2024, the developers made improvements to more efficiently work with scalar expressions and indexes, resulting in fewer repeated scans, and thus faster execution. This reduces latency by reducing IO, and the benefits are available to all Postgres users without the need to change their SQL queries or ORM code! There are more usability improvements coming for Postgres users, pg_stat_statements, and clause queries. One problem with these has been that similar entries aren’t collapsed together when they have a different numbers of scalar array expressions. For example was not grouped with . Having the statistics for nearly identical entries split across multiple results makes them less useful. Fortunately, fixes are coming. On the Ruby on Rails side, Sean Linsley is working on a fix by replacing the use of with which solves the grouping problem. Here’s the PR: https://github.com/rails/rails/pull/49388#issuecomment-2680362607 On the PostgreSQL side, there are fixes coming for PostgreSQL 18. Related improvements are coming to PostgreSQL 18 for 2025. This commit 1 implements the automatic conversion of into ScalarArrayOpExpr. Another noteworthy commit is: “Squash query list jumbling” from Álvaro Herrera. 2 pg_stat_statements produces multiple entries for queries like depending on the number of parameters, because every element of ArrayExpr is individually jumbled. Most of the time that’s undesirable, especially if the list becomes too large. This commit 3 mentions the original design was for a GUC query_id_squash_values, but that was removed in favor of making this the default behavior. In this post, we looked at a problematic query pattern, big lists. You may have instances of this pattern in your codebase from direct means or from using some ORM methods. This type of query performs poorly for big lists of values, as they take more resources to parse, plan, and execute. There are fewer indexing options compared with an alternative structured as a join operation. Join queries provide two sets of table statistics from both tables being joined, that help with query planning. We learned how to find instances of these using pg_stat_statements for PostgreSQL. The post then considers several alternatives. Our main tactics are to convert these queries to joins when possible. Outside of that, we could consider using the operator with an array of values, a clause, and consider using a prepared statement. The next time you see big lists causing database performance problems, hopefully you feel more prepared to restructure and optimize them! Thanks for reading this post. I’d love to hear about any tips or tricks you have for these types of queries! https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c0962a113d1f2f94cb7222a7ca025a67e9ce3860 ↩ ↩ 2 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=62d712ecfd940f60e68bde5b6972b6859937c412 ↩ https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fbd53dea5d513a78ca04834101ca1aa73b63e59 ↩ https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c0962a113d1f2f94cb7222a7ca025a67e9ce3860 ↩ ↩ 2 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=62d712ecfd940f60e68bde5b6972b6859937c412 ↩ https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fbd53dea5d513a78ca04834101ca1aa73b63e59 ↩

Performance

0 views

Andrew Atkinson 9 months ago

Short alphanumeric pseudo random identifiers in Postgres

In this post, we’ll cover a way to generate short, alphanumeric, pseudo random identifiers using native Postgres tactics. These identifiers can be used for things like transactions or reservations, where users need to read and share them easily. This approach is an alternative to using long, random generated values like UUID values, which have downsides for usability and performance. We’ll call the identifier a and store it in a column with that name. Here are some example values: In database design, we can use natural or surrogate keys to identify rows. We won’t cover the differences here as that’s out of scope. For our identifier, we’re going to generate it from a conventional surrogate primary key called . We aren’t using natural keys here. The is intended for use outside the database, while the primary key is used inside the database to be referenced by foreign key columns on other tables. Whle is short which minimizes space and speeds up access, the main reason for it is for usability. With that said, the target for total space consumption was to be fewer bytes than a 16-byte UUID. This was achieved with an primary key and this additional 5 character generated value, targeting a smaller database where this provides plenty of unique values now and into the future. Let’s get into the design details. Here were the desired design properties: Additional details: Here are the functions used: This function obfuscates the integer value using exclusive or (XOR) obfuscation. Main entrypoint function: This converts the obfuscated value into the alphanumeric value, used within . This is “base 62” with the 26 upper and lower case characters, and 10 numbers (0-9). Reverses the back into the original integer. Used within : For a length of 5 with this system, we can create up to around ~1 billion unique values. This was sufficiently large for the original use case. For use cases requiring more values, by storing 6 characters for then up to ~56 billion values could be generated, based on a primary key. Let’s create a sample table with an primary key with a generated identity column. Besides the use in the identity column, we’ll again use the keyword to create a column for the . The column uses the column as input, obfuscates it, encodes it to base 62, producing a 5 character value. How do we guarantee conforms to our expected data properties? Constraints! For an existing system, we could add a unique index first as follows: Then we can add the unique constraint using the unique index, along with the constraint: Let’s insert data into the table: Let’s query the data, and also make sure it’s reversed (using the function) properly: Let’s compare the time spent inserting 1 million rows into an equivalent table without the column or value generation. That took an average of 2037.906 milliseconds, or around 2 seconds on my machine. Inserting 1 million rows with the took an average of 6954.070 or around 7 seconds, or about 3.41x slower. Note that these times were with the indexes and constraints in place on the table in the second example, but not the first, meaning their presence contributed to the total time. Summary: Creating this identifier made the write operations 3.4x slower for me locally, which was an acceptable amount of overhead for the intended use case. Compared with random values, the pseudo random remains orderable, which means that lookups for individual rows or ranges of rows can use indexes, running fast and reliably even as row counts grow. We can add a unique index on the column like this: We can very that individual lookups or range scans use this index, by inspecting query execution plans for this table. https://github.com/andyatkinson/pg_scripts/pull/15 Feedback on this approach is welcomed! Please use my contact form to provide feedback or leave comments on the PR. Future enhancements to this could include unit tests using pgTAP for the functions, packaging them into an extension, or supporting more features like case insensitivity or a modified input alphabet. Thanks for reading! A fixed size, 5 characters in length, regardless of the size of the input integer (and within the range of the data type) Fewer bytes of space than a data type An obfuscated value, pseudo random, not easily guessable. While not easily guessable, this is not meant to be “secure” Reversibility back into the original integer Only native Postgres capabilities, no extensions, client web app language can be anything as it’s within Postgres Non math-heavy implementation is stored using not not , following recommendations for best practices PL/PgSQL functions, native Postgres data types and constraints are used, like UNIQUE, NOT NULL, and CHECK, and a stored generated column. Converts integers to bits, uses exclusive-or (XOR) bitwise operation and modulo operations. Did not set out to support case insensitivity now, possible future enhancement Did not try to exclude similar-looking characters (see: Base32 Crockford below), possible future enhancement Uses a Hexadecimal key (make this any key you want) Sets a max value for the data type range , which is just under 1 billion possible values. This was enough for this system and into the future, but a bigger system would want to use Converts integer bytes into bits gets a constraint and , so we know we have a unique value A constraint is added to validate the length Base32 Crockford - An emphasis ease of use for humans: removing similar looking characters, case insensitivity. ULID - Also 128 bits/8 bytes like UUIDs, so I had ruled these out for space consumption, and they’re slightly less “usable” NanoIDs at PlanetScale - I like aspects of NanoID. This is random generation though like UUID vs. encoding a unique integer.

PostgreSQL

0 views

Andrew Atkinson 10 months ago

Source code locations for database queries in Rails with Marginalia and Query Logs

Back in 2022, we covered how to log database query generation information from a web app using for Postgres. https://andyatkinson.com/blog/2022/10/07/pgsqlphriday-2-truths-lie The application context annotations can look like this. They’ve been re-formatted for printing: I use to identify costly queries generated in the web application, often ORM queries (the ORM is Active Record in Ruby on Rails), with the goal of working on efficiency and performance improvements. The annotations above are included in the field and formatted as SQL-compatible comments. Application context usually includes the app name and app concepts like MVC controller names, action names, or even more precise info which we’ll cover next. How can we make these even more useful? For Ruby on Rails, we’ve used the Marginalia Ruby gem to create these annotations. Besides the context above, a super useful option is the option which captures the source code file and line number. Given how dynamic Ruby code can be, including changes that can happen at runtime, the level logging takes these annotations from “nice to have” to “critical” to find opportunities for improvements. What’s more, is that besides Marginalia, we now have a second option that’s built-in to Ruby on Rails. In Rails 7.1, Ruby on Rails gained similar functionality to Marginalia directly in the framework. While nice to have directly in the framework, the initial version didn’t have the source code line-level capability. That changed in the last year! Starting from PR 50969 to Rails linked below, for Rails 7.2.0 and 8.0.2, the option was added to Active Record Query Logs , equivalent to the option in Marginalia. PR: Support tag option for query log tags by fatkodima https://github.com/rails/rails/pull/50969#issuecomment-2797357558 An example of in action looks like this: Nice, now we’ve got the class name, line number, and Ruby method. In this example, we can get to work optimizing the method. Dima described how the Marginalia option was costly in production and even managed to improve that with the Query Logs change. If you’re unsure about source code line logging in production, but want to get started using it, a great place to start is using it in your local development enrivonment. To avoid enabling the option for all environments, we’ll use an environment variable that’s enabled only for local development. Here’s a real example I use for Marginalia: For Query Logs, in (adjust to be for the environments you prefer), the equivalent could look like this: The configuration above was tested with Rails 7.2.2. If your team uses Query Logs in development or production, I’d love to know! Having source code line-level logging for query statistics is critical information that allows backend engineers to quickly zero in on where to fix database performance issues. With this combo of information, engineers can identify the most heavy queries, then go backwards into the source code to know where to redesign, refactor, or restructure, or even remove costly queries. The extension is critical for this workflow, but it’s not without opportunities for improvement. One issue that has is that many of the entries can be duplicates or near-duplicates, making it tougher to sift through. Fortunately, fixes are coming for that too! Stay tuned for a future post where we’ll cover upcoming improvements in future versions of Postgres that will help de-duplicate entries, as well as options to achieve that with Ruby on Rails even for older versions of Postgres. Consider subscribing to my newsletter, where I send out occasional issues linking to blog posts, conferences, industry news, and more, so you don’t miss that post! Thanks for reading.

Performance

0 views

Andrew Atkinson 10 months ago

🎙️ Talking Postgres Podcast: Helping Rails developers learn Postgres with Andrew Atkinson

Back in November, I met with Claire Giordano, host of the Talking Postgres podcast, who asked a ton of great questions about my experience writing a Postgres book aimed at Ruby on Rails web developers. Claire had a lot of thoughtful questions. Here’s a few: Claire also brought up a lot of fun points and reactions. Here’s a sample: Quote mentioned by Claire: Writing is thinking. To write well is to think clearly. That’s why it’s so hard. —David McCullough 👉 Helping Rails developers learn Postgres with Andrew Atkinson It was a real honor to be a guest on this prestigious podcast. I’m lucky to call Claire a friend as well! Thank you for the opportunity Claire, Aaron, and team! Check out more episodes of Talking Postgres ! Why write a technical book? Was there some moment or spark? Why write a book about Postgres? Why Ruby on Rails? The importance planting seeds and encouraging others with ambitious projects Would I consider writing a book for Django and Python for Postgres? Where does the book fit in the landscape? How long did it take to write this book? Did I ever want to quit writing, even for a moment? Did I have a party when the book was fully complete? I talked about “little parties” with Rails developer communities at events like Rails World and Sin City Ruby What was my experience like in working with other publishers I shared my deep appreciation for the efforts of the technical reviewers of the book! We talked about cheese! 🧀 (stories and connections with Postgres icons David Rowley and Melanie Plageman) What was my favorite chapter? Is there a frequently asked question I get about databases from Rails developers? For my consulting services, do clients hire me for my Rails expertise or my Postgres expertise?

Programming

Python

0 views

Andrew Atkinson 1 years ago

Django and Postgres for the Busy Rails Developer

About 10 years ago I wrote a post PostgreSQL for the Busy MySQL Developer , as part of switching from MySQL to Postgres for my personal and professional projects wherever I could. Recently I had the chance to work with Python, Django, and Postgres as a long-time and busy Rails developer. There were some things I thought were really nice. So am I switching? The team I worked with was experienced with Django so I was curious to learn from them which libraries and tools are popular, and how how to write idiomatic code. In this post I’ll briefly cover the database parts of Django using Postgres (of course!), highlight libraries and tools, and compare aspects to the Ruby on Rails framework. You’ll find a small Django repo towards the end as well. Ruby and Python are both general purpose programming languages. On the similarity side, they can both be used to write script style code, or organize code into classes using object oriented paradigms. In local development, it felt like the execution of Python was perhaps faster than Ruby, however I’ve noticed that new apps are always fast to work with, given how little code is being loaded and executed. As a developer we typically need to run multiple versions of Ruby, Python, Node, and other runtimes, to support different codebases, and to avoid modifying our system installation. In Ruby I use rbenv to manage multiple versions of Ruby, and to avoid using the version of Ruby that was installed by macOS, which is usually outdated compared with the version I want for a new app. In Python, I used pyenv to accomplish the same thing, which seemed quite similar in use. Both have concepts of a local and global version, and roughly similar commands to install and change versions. In Ruby on Rails, Bundler has been the de facto standard forever, as a way to pull in Ruby library code and make sure it’s loaded and accessible in the Rails application. In Python, the team selected the poetry dependency management tool. Commands are similar to Bundler commands, for example is about the same as . Dependencies can be expressed in a file and poetry creates a lock file with specific library versions. TOML and YAML are similar. In Ruby on Rails, although I personally resisted rule detection etc. for years, Rubocop has become the standard, even being built in to the most recent Rails version 8. Rubocop has configurable rules that can automatically reformat code or lint code for issues. Formatters like standardrb are commonly used as well. For the Django app the team selected ruff , which performed formatting of code and linting for issues like missing imports. I found ruff fast and easy to use and genuinely helpful. For example, sometimes I’d fire up a Django shell, having skipped running ruff, only to realize there are issues it would have caught. On this small codebase, ruff ran instantly, so it was a no-brainer to run regularly, or even include in my code editor. In Rails and Django, SQLite is the default database, however I wanted to use Postgres. In Ruby, we have the pg gem which connects the application to Postgres as a driver. This does work at a lower level than the application like sending TCP requests, mapping Postgres query result responses into Ruby data types, and much more. In Python, we used the psycopg2 library and I found it pretty easy to use. Besides being used by the framework ORM, I created a wrapper class using psycopg2 to use for sending SQL queries outside of models. For example, we inspected Postgres system catalog views to capture certain data as part of the product features. Both Ruby on Rails and Django have the concept of Migrations , which are Ruby or Python code files that describe a database structure change, and have a version. From the Ruby or Python code files, SQL DDL (or DML) statements are generated which are run against the configured database. For example, to add a table in Rails typically a developer uses the Ruby helper as opposed to writing a SQL statement. Adding or dropping an index or modifying a column type are other types of DDL statements that typically are performed via migrations. The Django approach has noteworthy differences and a slightly different workflow that I enjoyed more in some ways. For example, changes are started in a file, which contains all the application models (multiple models in a single file), and the database layer details about each model attribute. This means that we specify database data types for columns, whether fields are unique, indexed, and more in the models file. The interesting difference compared with Rails is that the next step in Django is to run , which generates Python migration files. This is different from Rails, where Rails developers would first generate a migration file to place changes into. In Django the generated migration file can be inspected or simply applied using the command. This command is nearly identical to the Rails equivalent command . For a new project where we were rapidly iterating on the models and their attributes, I preferred the way Django worked to how Rails works, or found it at least as productive. Here are some commands like running , or running commands like or , to give you a flavor. Both Django and Rails use interpreted languages, Python and Ruby respectively, that each support an interactive execution environment. This environment is called a read, eval, print loop or REPL for short. In Rails, the Ruby REPL “irb” is launched and Rails application code is loaded automatically when running the rails console command. In Django the equivalent command is running shell , however application code needs to be imported before it can be used, using statements. Both frameworks also support opening a database client, by running in Rails or in Django. When Postgres is configured, these both open a psql session. In Django projects and applications are separate concepts. In my experimental project I made a “booksproject” project and a “books” app. Check out the booksproject repo . The books app models are Author, Publisher, and Books. The tables for those models are contained in a custom schema , and Django is configured to access it. The application connects to Postgres as the user and the dev database is called . There’s no concept of what I’d call “safety” for migrations for either framework out of the box. Operations like adding indexes in Postgres don’t use the concurrently keyword by default for example. We can add safety using additional libraries like Strong Migrations in Ruby. At a smaller scale of data and query volume, even unsafe operations will be fine. With that said, I think some visibility into blocking database operations, and how to perform them using safe alternatives is valuable. In models add to a field definition to add a unique constraint (via a unique index). After running a migration for a unique index will be created. In Active Record we’d generate the migration file first, then fill in the create statement adding a unique index. When querying a model like Book we’d use which returns a QuerySet object with one or more books. The method will generate a SQL query with a clause to filter down the rows or all rows can be accessed using . For example: Statements in Python are whitespace sensitive so we’d indent the attributes in below by 4 spaces: The generated SQL DDL isn’t displayed when running by default. Unlike Rails, Django provides a mechanism to preview it. To do that run the command instead of . For example, to print the 0001 migration DDL: Note that Django uses an identity column for the primary key, and as of Rails 8 Active Record does not. For the basics of an Author, Publisher, and Books models, or Postgres configuration including a custom schema and user, check out booksproject repo. To collect random Django tips I’ve created a django-tips page. This page can be used in a similar way as my rails-tips and postgresql-tips pages, mostly as a reference for myself and possibly as a useful resource for others. Do you have any similarities and differences between Django and Rails to share? I’d love to hear from you. 😅 And no, I’m not “switching” from Rails and Ruby, but I did enjoy working with Python, Django, and Postgres! Thanks for reading.

Web Development

Shell

Python

0 views

Andrew Atkinson 1 years ago

PostgreSQL 17: JSON_TABLE(), MERGE with RETURNING, and Updatable Views

It’s time for a new Postgres release! PostgreSQL 17 shipped a few weeks ago, with lots of new features to explore. As a mature database system, prized for reliability, stability, and backwards compatibility, new features aren’t often the most splashy. However, there are still goodies that could become new tools in the toolboxes of data application builders. The Postgres 17 release notes is a good starting point, as it covers a breadth of items. In this post, we’ll pick out three items, and create some runnable examples with commands that can be copied and pasted into a Postgres 17 instance. Let’s dive in! To easily try out PostgreSQL 17, let’s use Docker. As an aside: for macOS, if you’re interested in using pg_upgrade , please see the post in-place upgrade from Postgres 14 to 15 as an example on how to upgrade your locally installed, earlier version. From here, we’ll assume you’re connected to a 17 instance, ready to run commands. Postgres supports SQL/JSON, which is like a selector style expressional language that provides methods to extract data from JSON. SQL/JSON path expressions specify item(s) to be retrieved from a JSON value, similarly to XPath expressions used for access to XML content. When we combine SQL/JSON expressions with a new function JSON_TABLE(), we can do powerful transformations of JSON text data into query results that match what you’d get from a traditional table. Let’s take a look at an example! Each of these examples will be on this PostgreSQL 17 branch of my pg_scripts repo. We’ll create a table “books” and insert a row into it. The books table has a “data” column with the “jsonb” data type. Create the table: Use the function to prepare JSON data for storage: Now comes the cool part. Use the new function as shown below to pull out attribute data from the JSON. This pulls the publisher, isbn, and author properties from within the JSON. If you’ve used other methods to achieve this before, this syntax hopefully looks much clearer. The nice part is the query result presents these attributes as if they were traditional columns in a table. Read more about JSON_TABLE . There’s more JSON functionality mentioned in the PostgreSQL 17 Release Notes . SQL/JSON constructors (JSON, JSON_SCALAR, JSON_SERIALIZE) and query functions (JSON_EXISTS, JSON_QUERY, JSON_VALUE). What’s next? Although the keyword was added in Postgres 15, it was enhanced in 17. Let’s try it out. Make two tables, “people” and “employees”. Insert a person: We can use the keyword to perform an “upsert” operation, meaning data is inserted when it doesn’t exist, or updated when it does. gained support for the clause in PostgreSQL 17. What’s the clause? The clause provides one or more fields in the result of INSERT, UPDATE, DELETE, and now MERGE statements, all classified as Data Manipulation Language (DML). avoids a SELECT query following a modification to get the inserted or updated values. operations are arguably more declarative than using the clause, and is based on a SQL standard with implementations in other database systems. Let’s try out without : Now we can try it with , and include the result: Database views gained some enhancements in Postgres 17. There are both regular views, and materialized views. We’ll focus on regular views for this section. Besides being defined and queried using , views can be updated using an statement. Simple views are automatically updatable: the system will allow , , , and statements to be used with a view. What’s new? Besides being updatable using statements, views can also now be updated indirectly by triggers. Let’s create an example. Create an employees table where some employees are “admins” and some aren’t. Non-admins can access only non-admins. Let’s create a view for non-admins: What would it look like to perform an update using a trigger? Let’s create an function that’s called from a trigger. Now a trigger that calls the function: Now we can run an on non-admins (updating the view), and verify that the trigger calls the function, updating the underlying “employees” table. We can even add the clause. After this update, querying non_admins no longer includes user id=2, since they were turned into an admin via the update. That wraps up the hands-on items. What are some other noteworthy enhancements? A big one that’s practical for Ruby on Rails apps, is internal improvements for queries with clauses and many values. The Postgres core team was able to eliminate repeated unnecessary index scans, reducing query execution latency, and improving performance. The cool thing about this one will be that no code changes are required, and the real-world benefits are looking promising. Check out these blog posts for more info: The release notes explain that PostgreSQL 17 uses a new internal memory structure for vacuum that consumes up to 20x less memory. This is great because it means that there will be more server memory available for other purposes. For audio coverage, check out the Postgres.fm PostgreSQL 17 episode . Thanks for checking this out! If you spot any SQL issues, please send a Pull Request . https://dev.to/lifen/as-rails-developers-why-we-are-excited-about-postgresql-17-27nj https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans

JSON

Tutorial