add upid blog post

carderne · Aug 16, 2024 · 393a5c2 · 393a5c2
1 parent 39843c9
commit 393a5c2
Show file tree

Hide file tree

Showing 2 changed files with 145 additions and 0 deletions.
diff --git a/_posts/2024-08-16-upid.md b/_posts/2024-08-16-upid.md
@@ -0,0 +1,145 @@
+---
+layout: single
+title: UPID is as UPID does
+date: 2024-08-16
+image: /assets/images/2024/upid.png
+---
+## Context
+I developed a new type of universally-unique ID, and it was pretty fun. I was inspired to do so when I read [this blog post](https://brandur.org/nanoglyphs/026-ids) a month or two ago, and it mentioned Stripe's pretty prefixed [IDs](https://dev.to/stripe/designing-apis-for-humans-object-ids-3o5a). They look like this:
+
+```elm
+cus_MJA953cFzEuO1z
+└─┘ └────────────┘
+ └─ prefix    └─ random
+```
+
+I've used Stripe's API before and thought these IDs were great. Any time I dumped out some `JSON` I didn't have to wonder what each bunch of random characters stood for, it told me!
+
+The downside is that you have to make a choice when you implement them. Either
+
+1. Store them as `TEXT` in the database, which means you'll have slower lookups and waste MBs, or
+2. Store just the random part as a 128bit UUID in the database, which means your APIs have to strip/prepend the prefix at some boundary each time.
+
+If your ORM is clever or extensible, you can solve option 2. For example, [TypeID](https://github.com/jetify-com/typeid/) has an [Elixir implementation](https://github.com/sloanelybutsurely/typeid-elixir) that will handle this back-and-forthing for you. But TypeID is a pretty general spec and not able to make this universal. There's no way to make this work with [Prisma](https://www.prisma.io/), for example.
+
+## UPID
+But the rest of use are stuck with one of these imperfect solutions. So I came up with [UPID](https://github.com/carderne/upid). It looks like this in its text encoding:
+
+```elm
+user_2accvpp5guht4dts56je5a
+└──┘ └──────┘└───────────┘└─ version
+ └─ prefix └─ time   └─ random
+```
+
+And something like this, in binary:
+```bash
+00000001 10010001 01011100 10110100
+01111110 11101101 11100011 00011110
+01001111 10100010 00010101 00101011
+10101011 11010110 00010101 01110110
+```
+
+That's right, it's just 128 bits! That means anywhere you have a 128 bit datatype (like UUID), you can drop in a UPID with no issue. So you can use UPID in your server code and store them as UUIDs in (for example) Postgres. You don't need to strip/prepend the prefix, because it's always there.
+
+It is most similar to a [ULID](https://github.com/ulid/spec), but with bits taken away from the randomness and time components to make space for the prefix. Like ULID, it uses a 32 character alphabet (making 5 bits per character), but unlike [Crockford's base32](https://www.crockford.com/base32.html), it keeps the whole alpha part, so you can write anything with a-z in the prefix.
+
+Also like ULID, it is "lexicographically sortable", which basically just means it can be ordered by date. This is useful on its own, but is also valuable for things like ensuring index locality, WAL efficiency and easy sharding. However, UPID has lower time precision (256ms vs 1ms), which is arguably better for two reasons: [some people](https://github.com/paralleldrive/cuid2?tab=readme-ov-file#note-on-k-sortablesequentialmonotonically-increasing-ids) are concerned you can leak information with time-based IDs, and 256ms leaks a lot less; and many system clocks have worse than 1ms precision, so you won't have a false sense of monotonicity that might not exist.
+
+The great part about it, and the part that lets it solve the problem of the two options from further up, is that 
+
+So, how does it work? Here's that same ID from further up, with the separate components broken up.
+
+```elm
+    user   _   2accvpp5      guht4dts56je5       a
+   └────┘     └────────┘    └─────────────┘   └─────┘
+   prefix       time            random        version     total
+   4 chars      8 chars         13 chars      1 char      26 chars
+       └────────│────────────────│───────────┐  │
+                │                │           │  │
+                │                │           │  │
+             40 bits            64 bits      24 bits     128 bits
+             5 bytes            8 bytes      3 bytes      16 bytes
+             time               random       prefix+version
+```
+
+Notice that in the binary format, the time bits are stored first, so that if you sort a bunch of UPIDs, they will be neatly ordered by time.
+
+Any library that implements the UPID spec just needs to be able to shuffle back and forth between the 128 bit binary representation and the text representation. There are already [Python](https://github.com/carderne/upid), [Rust](https://github.com/carderne/upid), [Postgres](https://github.com/carderne/upid) (those are all links to the same repo btw) implementations, along with one in [TypeScript](https://github.com/carderne/upid-ts) that powers a little [demo website](https://upid.rdrn.me/) that looks like this:
+
+{% include image.html url="/assets/images/2024/upid.png" description="Cool" class="narrow-img" %}
+
+## Cool?
+This was the rare case where writing it in Rust was actually easier than in Python, because it's more explicit about what's a `u8` or `u64` or `u128`, whereas in Python it's all just `bytes` and good luck figuring out how many or what the precision is.
+
+This matters when you're bit-shifting like a madman and need to keep track of which bits came from where:
+
+```rust
+let res = (time_bits << 88)
+	| (random << 24)
+	| ((prefix[0] as u128) << 16)
+	| ((prefix[1] as u128) << 8)
+	| prefix[2] as u128;
+```
+
+You can also do stuff like this, and Rust will shout at you if you accidentally make your alphabet too long:
+```rust
+pub const ENCODE: &[u8; 32] = b"234567abcdefghijklmnopqrstuvwxyz";
+```
+
+That's the base32 alphabet by the way. The numbers come first so that, if for some reason you want to sort the text-encoded strings, they'll still be sorted in the right order!
+
+## Using it
+In Python:
+```python
+from upid import upid
+upid("prod")            # prod_2acptrhhfi7asnb5iessba
+```
+
+Rust:
+```rust
+use upid::Upid;
+UPID::new("cust")      // cust_2acptrk7ypgl2hl45g3vca
+```
+
+Or TypeScript:
+```typescript
+import { upid } from "upid-ts";
+upid("acct")           // acct_2acptrqnipixsl2xugym7a
+```
+
+In all of this cases, the actual data is just a 128 bit binary blob. And if you store it somewhere as a `UUID` or `u128` or something, you can just load it right back into a `UPID`.
+
+If anyone wants to write an implementation in another language (ChatGPT should be able to do most of the work), let me know and I'll add it to [the list](https://github.com/carderne/upid/tree/main?tab=readme-ov-file#implementations).
+
+Wait a second, what about Postgres!?
+## Postgres
+Thanks to [pgrx](https://github.com/pgcentralfoundation/pgrx), it's now dead-easy to write extensions for Postgres in Rust. And this is the best part. If you install the extension, then you get the binary-text back-and-forthing directly in Postgres.
+
+So you can do stuff like this:
+```sql
+CREATE EXTENSION upid_pg;
+
+CREATE TABLE members (
+    id   upid NOT NULL DEFAULT gen_upid('memb') PRIMARY KEY,
+    name text NOT NULL
+);
+
+INSERT INTO members (name) VALUES('Bob');
+
+SELECT * FROM members;
+--              id              | name
+-- -----------------------------+------
+--  memb_2acptt2ytgmpf5cmj6iy5a | Bob
+```
+
+That means your library code doesn't even need to care about UPID, or install any extra libraries. It will be transmitted over the wire as a pretty string, and stored in the DB as an efficient 128 bit blob.
+
+The only downside is that, as easy as it is to write Postgres extensions, it's not yet very easy to install them. I've created `.deb` packages (get yours at the [Releases](https://github.com/carderne/upid/releases)) so you can install it quite easily on Debian-based distros. Alternatively, you can try out the Docker image that has Postgres 16 with the UPID extension pre-installed:
+
+```bash
+docker run -e POSTGRES_HOST_AUTH_METHOD=trust \
+    -p 5432:5432 carderne/postgres-upid:16
+```
+
+## Go on...
+So that's UPID! I hope someone finds it useful, or at least interesting. I'll be using it in some of my personal projects, and trying to talk whoever I can into using it in their critical production projects. There's not much risk — at the end of the day it's just a 128 bit ID, and the full implementations are a few hundreds lines of code at most. Go on...
diff --git a/assets/images/2024/upid.png b/assets/images/2024/upid.png