Skip to content

Commit

Permalink
add upid blog post
Browse files Browse the repository at this point in the history
  • Loading branch information
carderne committed Aug 16, 2024
1 parent 39843c9 commit 393a5c2
Show file tree
Hide file tree
Showing 2 changed files with 145 additions and 0 deletions.
145 changes: 145 additions & 0 deletions _posts/2024-08-16-upid.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
---
layout: single
title: UPID is as UPID does
date: 2024-08-16
image: /assets/images/2024/upid.png
---
## Context
I developed a new type of universally-unique ID, and it was pretty fun. I was inspired to do so when I read [this blog post](https://brandur.org/nanoglyphs/026-ids) a month or two ago, and it mentioned Stripe's pretty prefixed [IDs](https://dev.to/stripe/designing-apis-for-humans-object-ids-3o5a). They look like this:

```elm
cus_MJA953cFzEuO1z
└─┘ └────────────┘
└─ prefix └─ random
```

I've used Stripe's API before and thought these IDs were great. Any time I dumped out some `JSON` I didn't have to wonder what each bunch of random characters stood for, it told me!

The downside is that you have to make a choice when you implement them. Either

1. Store them as `TEXT` in the database, which means you'll have slower lookups and waste MBs, or
2. Store just the random part as a 128bit UUID in the database, which means your APIs have to strip/prepend the prefix at some boundary each time.

If your ORM is clever or extensible, you can solve option 2. For example, [TypeID](https://github.com/jetify-com/typeid/) has an [Elixir implementation](https://github.com/sloanelybutsurely/typeid-elixir) that will handle this back-and-forthing for you. But TypeID is a pretty general spec and not able to make this universal. There's no way to make this work with [Prisma](https://www.prisma.io/), for example.

## UPID
But the rest of use are stuck with one of these imperfect solutions. So I came up with [UPID](https://github.com/carderne/upid). It looks like this in its text encoding:

```elm
user_2accvpp5guht4dts56je5a
└──┘ └──────┘└───────────┘└─ version
└─ prefix └─ time └─ random
```

And something like this, in binary:
```bash
00000001 10010001 01011100 10110100
01111110 11101101 11100011 00011110
01001111 10100010 00010101 00101011
10101011 11010110 00010101 01110110
```

That's right, it's just 128 bits! That means anywhere you have a 128 bit datatype (like UUID), you can drop in a UPID with no issue. So you can use UPID in your server code and store them as UUIDs in (for example) Postgres. You don't need to strip/prepend the prefix, because it's always there.

It is most similar to a [ULID](https://github.com/ulid/spec), but with bits taken away from the randomness and time components to make space for the prefix. Like ULID, it uses a 32 character alphabet (making 5 bits per character), but unlike [Crockford's base32](https://www.crockford.com/base32.html), it keeps the whole alpha part, so you can write anything with a-z in the prefix.

Also like ULID, it is "lexicographically sortable", which basically just means it can be ordered by date. This is useful on its own, but is also valuable for things like ensuring index locality, WAL efficiency and easy sharding. However, UPID has lower time precision (256ms vs 1ms), which is arguably better for two reasons: [some people](https://github.com/paralleldrive/cuid2?tab=readme-ov-file#note-on-k-sortablesequentialmonotonically-increasing-ids) are concerned you can leak information with time-based IDs, and 256ms leaks a lot less; and many system clocks have worse than 1ms precision, so you won't have a false sense of monotonicity that might not exist.

The great part about it, and the part that lets it solve the problem of the two options from further up, is that

So, how does it work? Here's that same ID from further up, with the separate components broken up.

```elm
user _ 2accvpp5 guht4dts56je5 a
└────┘ └────────┘ └─────────────┘ └─────┘
prefix time random version total
4 chars 8 chars 13 chars 1 char 26 chars
└────────│────────────────│───────────┐ │
│ │ │ │
│ │ │ │
40 bits 64 bits 24 bits 128 bits
5 bytes 8 bytes 3 bytes 16 bytes
time random prefix+version
```

Notice that in the binary format, the time bits are stored first, so that if you sort a bunch of UPIDs, they will be neatly ordered by time.

Any library that implements the UPID spec just needs to be able to shuffle back and forth between the 128 bit binary representation and the text representation. There are already [Python](https://github.com/carderne/upid), [Rust](https://github.com/carderne/upid), [Postgres](https://github.com/carderne/upid) (those are all links to the same repo btw) implementations, along with one in [TypeScript](https://github.com/carderne/upid-ts) that powers a little [demo website](https://upid.rdrn.me/) that looks like this:

{% include image.html url="/assets/images/2024/upid.png" description="Cool" class="narrow-img" %}

## Cool?
This was the rare case where writing it in Rust was actually easier than in Python, because it's more explicit about what's a `u8` or `u64` or `u128`, whereas in Python it's all just `bytes` and good luck figuring out how many or what the precision is.

This matters when you're bit-shifting like a madman and need to keep track of which bits came from where:

```rust
let res = (time_bits << 88)
| (random << 24)
| ((prefix[0] as u128) << 16)
| ((prefix[1] as u128) << 8)
| prefix[2] as u128;
```

You can also do stuff like this, and Rust will shout at you if you accidentally make your alphabet too long:
```rust
pub const ENCODE: &[u8; 32] = b"234567abcdefghijklmnopqrstuvwxyz";
```

That's the base32 alphabet by the way. The numbers come first so that, if for some reason you want to sort the text-encoded strings, they'll still be sorted in the right order!

## Using it
In Python:
```python
from upid import upid
upid("prod") # prod_2acptrhhfi7asnb5iessba
```

Rust:
```rust
use upid::Upid;
UPID::new("cust") // cust_2acptrk7ypgl2hl45g3vca
```

Or TypeScript:
```typescript
import { upid } from "upid-ts";
upid("acct") // acct_2acptrqnipixsl2xugym7a
```

In all of this cases, the actual data is just a 128 bit binary blob. And if you store it somewhere as a `UUID` or `u128` or something, you can just load it right back into a `UPID`.

If anyone wants to write an implementation in another language (ChatGPT should be able to do most of the work), let me know and I'll add it to [the list](https://github.com/carderne/upid/tree/main?tab=readme-ov-file#implementations).

Wait a second, what about Postgres!?
## Postgres
Thanks to [pgrx](https://github.com/pgcentralfoundation/pgrx), it's now dead-easy to write extensions for Postgres in Rust. And this is the best part. If you install the extension, then you get the binary-text back-and-forthing directly in Postgres.

So you can do stuff like this:
```sql
CREATE EXTENSION upid_pg;

CREATE TABLE members (
id upid NOT NULL DEFAULT gen_upid('memb') PRIMARY KEY,
name text NOT NULL
);

INSERT INTO members (name) VALUES('Bob');

SELECT * FROM members;
-- id | name
-- -----------------------------+------
-- memb_2acptt2ytgmpf5cmj6iy5a | Bob
```

That means your library code doesn't even need to care about UPID, or install any extra libraries. It will be transmitted over the wire as a pretty string, and stored in the DB as an efficient 128 bit blob.

The only downside is that, as easy as it is to write Postgres extensions, it's not yet very easy to install them. I've created `.deb` packages (get yours at the [Releases](https://github.com/carderne/upid/releases)) so you can install it quite easily on Debian-based distros. Alternatively, you can try out the Docker image that has Postgres 16 with the UPID extension pre-installed:

```bash
docker run -e POSTGRES_HOST_AUTH_METHOD=trust \
-p 5432:5432 carderne/postgres-upid:16
```

## Go on...
So that's UPID! I hope someone finds it useful, or at least interesting. I'll be using it in some of my personal projects, and trying to talk whoever I can into using it in their critical production projects. There's not much risk — at the end of the day it's just a 128 bit ID, and the full implementations are a few hundreds lines of code at most. Go on...
Binary file added assets/images/2024/upid.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 393a5c2

Please sign in to comment.