Short alphanumeric pseudo random identifiers in Postgres
In this post, we’ll cover a way to generate short, alphanumeric, pseudo random identifiers using native Postgres tactics. These identifiers can be used for things like transactions or reservations, where users need to read and share them easily. This approach is an alternative to using long, random generated values like UUID values, which have downsides for usability and performance. We’ll call the identifier a and store it in a column with that name. Here are some example values: In database design, we can use natural or surrogate keys to identify rows. We won’t cover the differences here as that’s out of scope. For our identifier, we’re going to generate it from a conventional surrogate primary key called . We aren’t using natural keys here. The is intended for use outside the database, while the primary key is used inside the database to be referenced by foreign key columns on other tables. Whle is short which minimizes space and speeds up access, the main reason for it is for usability. With that said, the target for total space consumption was to be fewer bytes than a 16-byte UUID. This was achieved with an primary key and this additional 5 character generated value, targeting a smaller database where this provides plenty of unique values now and into the future. Let’s get into the design details. Here were the desired design properties: Additional details: Here are the functions used: This function obfuscates the integer value using exclusive or (XOR) obfuscation. Main entrypoint function: This converts the obfuscated value into the alphanumeric value, used within . This is “base 62” with the 26 upper and lower case characters, and 10 numbers (0-9). Reverses the back into the original integer. Used within : For a length of 5 with this system, we can create up to around ~1 billion unique values. This was sufficiently large for the original use case. For use cases requiring more values, by storing 6 characters for then up to ~56 billion values could be generated, based on a primary key. Let’s create a sample table with an primary key with a generated identity column. Besides the use in the identity column, we’ll again use the keyword to create a column for the . The column uses the column as input, obfuscates it, encodes it to base 62, producing a 5 character value. How do we guarantee conforms to our expected data properties? Constraints! For an existing system, we could add a unique index first as follows: Then we can add the unique constraint using the unique index, along with the constraint: Let’s insert data into the table: Let’s query the data, and also make sure it’s reversed (using the function) properly: Let’s compare the time spent inserting 1 million rows into an equivalent table without the column or value generation. That took an average of 2037.906 milliseconds, or around 2 seconds on my machine. Inserting 1 million rows with the took an average of 6954.070 or around 7 seconds, or about 3.41x slower. Note that these times were with the indexes and constraints in place on the table in the second example, but not the first, meaning their presence contributed to the total time. Summary: Creating this identifier made the write operations 3.4x slower for me locally, which was an acceptable amount of overhead for the intended use case. Compared with random values, the pseudo random remains orderable, which means that lookups for individual rows or ranges of rows can use indexes, running fast and reliably even as row counts grow. We can add a unique index on the column like this: We can very that individual lookups or range scans use this index, by inspecting query execution plans for this table. https://github.com/andyatkinson/pg_scripts/pull/15 Feedback on this approach is welcomed! Please use my contact form to provide feedback or leave comments on the PR. Future enhancements to this could include unit tests using pgTAP for the functions, packaging them into an extension, or supporting more features like case insensitivity or a modified input alphabet. Thanks for reading! A fixed size, 5 characters in length, regardless of the size of the input integer (and within the range of the data type) Fewer bytes of space than a data type An obfuscated value, pseudo random, not easily guessable. While not easily guessable, this is not meant to be “secure” Reversibility back into the original integer Only native Postgres capabilities, no extensions, client web app language can be anything as it’s within Postgres Non math-heavy implementation is stored using not not , following recommendations for best practices PL/PgSQL functions, native Postgres data types and constraints are used, like UNIQUE, NOT NULL, and CHECK, and a stored generated column. Converts integers to bits, uses exclusive-or (XOR) bitwise operation and modulo operations. Did not set out to support case insensitivity now, possible future enhancement Did not try to exclude similar-looking characters (see: Base32 Crockford below), possible future enhancement Uses a Hexadecimal key (make this any key you want) Sets a max value for the data type range , which is just under 1 billion possible values. This was enough for this system and into the future, but a bigger system would want to use Converts integer bytes into bits gets a constraint and , so we know we have a unique value A constraint is added to validate the length Base32 Crockford - An emphasis ease of use for humans: removing similar looking characters, case insensitivity. ULID - Also 128 bits/8 bytes like UUIDs, so I had ruled these out for space consumption, and they’re slightly less “usable” NanoIDs at PlanetScale - I like aspects of NanoID. This is random generation though like UUID vs. encoding a unique integer.