At Epsio we’re building a streaming SQL engine which natively integrates with databases/warehouses (both as “sources” and “sinks”). Epsio acts as a stream processing “middle man”: receiving data from one (or more) sources, applying transformations (according to the users’ queries), and sinking results into multiple tables in the users’ sink database. As part of this, we make it a priority to optimize database interactions so they run with maximum efficiency and impose minimal overhead on our users’ databases.
To sink data into the users’ databases, Epsio needs to be able to insert and delete rows efficiently. Inserting rows is fairly easy: with most databases we simply do INSERT INTO (or perhaps COPY .. for large amounts of data). Deletions, however, are more complicated. We could perhaps delete by all of the row’s fields- DELETE FROM pasten WHERE user=nadav AND age=21 AND … - but a few issues would arise:
- It’s crazy inefficient- the users’ database would need to do a lot of work to find that row, given you don’t want a massive index covering every field of the table (see the sketch after this list)
- What if there are multiple rows with identical values in every field, but Epsio only wants to delete one of them? Some databases offer a way to do this, but Postgres, for example, does not.
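To make the first issue concrete, here’s a rough sketch of what deleting by every field looks like from a Python client (psycopg2 and the pasten table are just stand-ins for the example above, and the column names are made up):

```python
import psycopg2  # assumed Postgres driver; any client library hits the same problem

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()

# No primary key: the only way to address the row is to match on every column.
# Unless a (huge) index covers all of these columns the database has to scan for it,
# and if several rows happen to hold identical values, they all get deleted.
cur.execute(
    'DELETE FROM pasten WHERE "user" = %s AND age = %s AND city = %s',
    ("nadav", 21, "tel aviv"),
)
conn.commit()
```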
Enter primary keys.
Primary keys are a cornerstone of any relational table. They give you a handle to a specific row, sort of like a pointer. The primary key is usually indexed, which means that given a primary key, the database can quickly jump to its corresponding row. These keys are often used to perform fast UPDATEs/DELETEs, and they’re also what other tables’ foreign keys point to.
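By contrast, a delete keyed on an indexed primary key is a single index lookup. A minimal sketch, again assuming psycopg2 and a hypothetical pasten table with a uuid id column:

```python
import uuid

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()
psycopg2.extras.register_uuid()  # lets psycopg2 pass uuid.UUID values to Postgres

# The client holds the key, so a later delete addresses exactly one row,
# through a single lookup in the primary key index.
row_id = uuid.uuid4()
cur.execute('INSERT INTO pasten (id, "user", age) VALUES (%s, %s, %s)', (row_id, "nadav", 21))
cur.execute("DELETE FROM pasten WHERE id = %s", (row_id,))
conn.commit()
```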
There are multiple data types used for primary keys. The main two (illustrated in the sketch after this list) are:
- Auto incrementing integers- every new row inserted receives the “next” number (usually starting at 1)
- UUIDs (128 bits)- every row receives a randomly generated UUID.
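As a rough illustration, here’s how the two styles typically look as Postgres table definitions (a sketch only- the table and column names are made up):

```python
import psycopg2  # assumed driver; the DDL inside the strings is what matters here

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()

# Auto incrementing integer key: the server hands out the "next" number on insert.
cur.execute("""
    CREATE TABLE posts_serial (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        content text
    )
""")

# UUID key: the value can be generated by the server or by the client.
cur.execute("""
    CREATE TABLE posts_uuid (
        id      uuid PRIMARY KEY,
        content text
    )
""")
conn.commit()
```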
Auto incrementing keys must be generated by the server, since if two clients insert at the same time, the server has to be the one that coordinates which insert receives which number.
UUIDs, however, can be generated by the client. This is because the probability of a UUID collision is astronomically low- so low in fact that most systems rely on them to be absolutely unique.
But why does it even matter whether the client is the one creating the identifier? To answer this, take for example an application that allows users to make posts, and to comment on and edit them. If the client is the one creating the identifier, the application can be “optimistic” about the server successfully receiving the post, and allow the user to edit or comment on the post even before the server has responded with its OK. This is because the client already has the identifier of the entity it’s creating, even before the server responds.
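Here’s a small, hypothetical sketch of that flow in Python- the requests library and the endpoint URLs are stand-ins, the point is only that the client mints the identifier itself:

```python
import uuid

import requests  # stand-in HTTP client; the endpoints below are made up

def create_post(content: str) -> uuid.UUID:
    # The client mints the identifier itself...
    post_id = uuid.uuid4()
    requests.post("https://example.com/posts", json={"id": str(post_id), "content": content})
    return post_id

# ...so follow-up actions can reference the post immediately,
# without waiting for the server to acknowledge the insert.
post_id = create_post("hello world")
requests.post(f"https://example.com/posts/{post_id}/comments", json={"text": "first!"})
```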
At Epsio, we also leverage UUIDs for primary keys in the users’ databases. This is mainly because there are situations, such as COPY … in Postgres, where you can’t get back auto-incremented keys for the rows you inserted. UUIDs allow us to generate identifiers before the insert, and later use those identifiers to delete from the database.
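A minimal sketch of that flow, assuming psycopg2 and a stand-in pasten table (the real Epsio sink logic is of course more involved):

```python
import io
import uuid

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()
psycopg2.extras.register_uuid()

# Generate the primary keys on our side *before* the bulk insert,
# since COPY returns nothing about the rows it wrote.
rows = [(uuid.uuid4(), "nadav", 21), (uuid.uuid4(), "noam", 30)]

buf = io.StringIO("".join(f"{rid}\t{user}\t{age}\n" for rid, user, age in rows))
cur.copy_expert('COPY pasten (id, "user", age) FROM STDIN', buf)

# Later, when the streaming query says these rows should go away,
# delete them by the identifiers we already hold.
cur.execute("DELETE FROM pasten WHERE id = ANY(%s)", ([rid for rid, _, _ in rows],))
conn.commit()
```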
So up till now, UUIDs may sound like a gift from the Statistical Lords of the Universe, but there are some serious caveats.
The Problem with Indexing UUIDs (and specifically, UUIDv4)
The reason primary keys are so efficient for lookups is that they are indexed. This means the database holds some data structure, like a B-Tree, containing all the UUIDs (with the associated value perhaps being the physical location of the row). B-Trees work very well when the writes are as condensed as possible; as the writes become more scattered, the B-Tree touches more leaves, and thus more pages need to be brought into cache and more splits occur. Sequential numbers are the best case for a B-Tree: you’re always hitting the rightmost leaf, filling it up, splitting it, and so on. UUIDv4, the type most DBs and clients use by default, is the opposite: it’s the worst case for a B-Tree. The very thing that makes UUIDs so magical- their global uniqueness- also means they’re completely random numbers across a 122-bit spectrum. To put that in perspective, that’s far more possible values than there are stars in the observable universe!
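This is easy to see even without a real B-Tree. The toy sketch below tracks, for a stream of keys, how often a new key lands at the end of the sorted order (the analogue of “always hitting the rightmost leaf”) versus somewhere in the middle:

```python
import bisect
import uuid

def rightmost_insert_fraction(keys):
    """Fraction of inserts that land at the end of the sorted order --
    a rough stand-in for "always hitting the rightmost B-Tree leaf"."""
    sorted_keys, at_end = [], 0
    for k in keys:
        pos = bisect.bisect(sorted_keys, k)
        at_end += pos == len(sorted_keys)
        sorted_keys.insert(pos, k)
    return at_end / len(sorted_keys)

sequential = list(range(10_000))
random_uuids = [uuid.uuid4().int for _ in range(10_000)]

print(rightmost_insert_fraction(sequential))    # 1.0    -- every insert extends the right edge
print(rightmost_insert_fraction(random_uuids))  # ~0.001 -- inserts scatter across the whole range
```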
This matters because inserts will suffer a serious hit, while lookups may suffer slightly. This means our throughput for insertions will go down, and the database will need to work harder. Not great.
Enter UUIDv7
UUIDv4 is fairly simple- as stated above, it’s just a random number across a 122-bit spectrum. UUIDv7, however, is a bit more interesting: its first 48 bits contain the current timestamp (i.e. the timestamp of when the UUID was generated), a few more bits hold the version and variant, and the remaining 74 bits store a random number. Its exact layout is as follows:
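Here’s a minimal sketch of that layout in Python (not production code- real implementations also worry about monotonicity within the same millisecond):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    # 48-bit Unix timestamp in milliseconds
    unix_ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    # 74 random bits, split into rand_a (12 bits) and rand_b (62 bits)
    rand = int.from_bytes(os.urandom(10), "big")
    rand_a = rand & 0xFFF
    rand_b = (rand >> 12) & ((1 << 62) - 1)
    # Assemble: timestamp | version (0111) | rand_a | variant (10) | rand_b
    value = (unix_ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

print(uuid7())  # the leading hex digits encode the generation time
```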
This means that if we generate multiple UUIDv7s at consecutive times (say, one millisecond after the other), they’ll all sort one after the other! This solves the index issue, and actually adds a very nice bonus: it’s now possible to recover the insert time of a row from its primary key. This helps with debugging, and in internal “chaos” simulators we’ve used it to understand and track down issues in Epsio.
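A quick way to see both properties- the sort order and the recoverable timestamp- using the third-party uuid6 package (the sketch above would do just as well):

```python
import datetime
import time
import uuid

from uuid6 import uuid7  # third-party "uuid6" package providing UUIDv7

def uuid7_time(u: uuid.UUID) -> datetime.datetime:
    # The top 48 bits of a UUIDv7 are the Unix timestamp in milliseconds
    return datetime.datetime.fromtimestamp((u.int >> 80) / 1000, tz=datetime.timezone.utc)

ids = []
for _ in range(5):
    ids.append(uuid7())
    time.sleep(0.002)  # cross a millisecond boundary so the ordering is unambiguous

assert ids == sorted(ids)   # generation order == sort order
print(uuid7_time(ids[0]))   # roughly "now": the insert time is recoverable from the key
```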
To give a taste of the performance increase, I made a simple benchmark with my friend Claude to insert 10 million rows with UUIDv4 and then 10 million with UUIDv7. Claude of course proceeded to write code with an abundance of emojis, as one does.
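The actual script isn’t reproduced here, but the shape of it is roughly the following (a sketch assuming Postgres, psycopg2 and the third-party uuid6 package; table names, batch size and payload are illustrative):

```python
import time
import uuid

import psycopg2
import psycopg2.extras
from uuid6 import uuid7  # third-party package providing UUIDv7

N = 10_000_000  # the post's benchmark size; shrink this for a quick local run

def bench(table: str, make_id) -> float:
    conn = psycopg2.connect("dbname=bench")
    psycopg2.extras.register_uuid()
    cur = conn.cursor()
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"CREATE TABLE {table} (id uuid PRIMARY KEY, payload text)")
    conn.commit()

    start = time.monotonic()
    for _ in range(N // 1000):
        # Insert in batches of 1000 rows, each with a freshly generated key
        psycopg2.extras.execute_values(
            cur,
            f"INSERT INTO {table} (id, payload) VALUES %s",
            [(make_id(), "x" * 100) for _ in range(1000)],
        )
        conn.commit()
    return time.monotonic() - start

print("uuid4:", bench("bench_v4", uuid.uuid4))
print("uuid7:", bench("bench_v7", uuid7))
```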
Results: