Identifiers are data that distinguish a unique entity from other entities. The concept of identifiers has many uses in the world. In software, identifiers are found in every facet of development. Some types of identifiers, like UUID and URI, are standardized. Most identifiers, however, are not externally defined in a specification and depend on many factors specific to their application.
In practice, identifiers are serialized strings that must be interpreted, parsed, encoded and decoded along software system pathways. They transit multiple systems in many kinds of media, like JSON, emails and log files. Software that must interpret this data along the way has to know how to consume the identifier and interpret its value.
To illustrate this problem, consider a string identifier encoded with Base64. The generator of the identifier converts the identifier value into a byte array or string, transforms this array into Base64, and sends or stores the result. Later, another application encounters this Base64 string and must make several determinations:
- Is this string encoded?
- If so, how should it be decoded?
- Once decoded into a byte array, should it be transformed into another data structure?
- Once it is transformed, what are the semantics of the value?
The developer must find a source of truth to answer their questions about this multi-step process. Often docs are out-of-date, the developers are unavailable, or they provide incorrect guidance. This process is hard, error-prone and the source of many bugs, failures, and other negative outcomes.
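The ambiguity described above can be sketched in a few lines. This is only an illustration: the `produce` function, the account number, and the choice of a big-endian 64-bit layout are hypothetical and are not part of the Identifiers project.

```python
import base64
import struct


def produce(account_number: int) -> str:
    """Hypothetical producer: pack a 64-bit value as big-endian bytes, then Base64 it."""
    raw = struct.pack(">q", account_number)
    return base64.b64encode(raw).decode("ascii")


original = 1234567890123
encoded = produce(original)

# A consumer that only sees the string has to guess every step.
raw = base64.b64decode(encoded)             # guess 1: the string is Base64
right_guess = struct.unpack(">q", raw)[0]   # guess 2: big-endian signed 64-bit
wrong_guess = struct.unpack("<q", raw)[0]   # a wrong byte order still decodes without error

print(encoded, right_guess, wrong_guess)
```

Nothing in the encoded string says which interpretation is correct, and the wrong one fails silently. That is the gap a self-describing identifier format is meant to close.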
The Identifiers project hopes to tackle this problem by defining sharable identifier types that can be applied across software domains. It intends to make it simple to convert a data identifier into a string, transmit it or store it, and then allow a different application to convert the encoded identifier into a semantic data value for processing.
Identifier types can be primitive values, semantic values or structured identifiers.
- string (UTF-8)
- boolean
- integer (32-bit signed)
- float (64-bit signed decimal, IEEE 754)
- long (64-bit signed)
- bytes (array of bytes)
All identifier types have collection variants that hold multiple values of the type. The two collection types are list and map. Collections can only hold same-typed values at this time.
A list identifier is a list of values. It is not a list of identifiers, but a single identifier composed of multiple values of the same type.
A map identifier is a map of values. Maps are useful to create a single identifier composed of multiple labeled values of the same type. These values are labeled by the map keys. The keys are stored in alphabetically-sorted order for consistency.
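The two collection rules (same-typed values, keys kept in sorted order) can be sketched as follows. The function name `canonical_map` is hypothetical. Using Python's default string ordering as the "alphabetical" order is also an assumption, since the exact collation is not stated here.

```python
def canonical_map(values: dict) -> list:
    """Return (key, value) pairs in sorted key order, rejecting mixed value types."""
    kinds = {type(v) for v in values.values()}
    if len(kinds) > 1:
        raise TypeError("a map identifier holds values of a single type")
    # Sorting by key makes the result independent of insertion order.
    return sorted(values.items(), key=lambda pair: pair[0])


first = canonical_map({"b": 2, "a": 1})
second = canonical_map({"a": 1, "b": 2})
print(first == second)
```

Because the key order is fixed, two systems that build the same map in a different order produce the same identifier.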
Composite identifiers combine other identifiers of mixed types into a single identifier. One can combine primitive identifiers, semantic identifiers and structured variants together into one composite identifier. They can be either a list or a map of other identifiers.
Semantic identifiers are based on either single or structured primitive identifiers. They can be considered to "extend" a base identifier type.
type | base type | structure | notes |
---|---|---|---|
uuid | bytes | 16 bytes | Supports all uuid versions. https://en.wikipedia.org/wiki/Universally_unique_identifier |
datetime | long | single value | Time in Unix/Posix Epoch, in milliseconds. https://en.wikipedia.org/wiki/Epoch_(computing) |
geo | float-list | [latitude, longitude] | decimal latitude & longitude. https://en.wikipedia.org/wiki/Geotagging |
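As a rough illustration of the table, the sketch below converts ordinary Python values into the base values listed for each semantic type. The helper names are hypothetical and are not taken from any Identifiers implementation.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid_base_value(value: uuid.UUID) -> bytes:
    """uuid -> bytes: the 16 raw bytes of the UUID."""
    return value.bytes


def datetime_base_value(moment: datetime) -> int:
    """datetime -> long: whole milliseconds since the Unix epoch (moment must be timezone-aware)."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def geo_base_value(latitude: float, longitude: float) -> list:
    """geo -> float-list: [latitude, longitude] in decimal degrees."""
    return [float(latitude), float(longitude)]


print(len(uuid_base_value(uuid.uuid4())))
print(datetime_base_value(datetime(2020, 1, 1, tzinfo=timezone.utc)))
print(geo_base_value(45.5, -122.6))
```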
- IP: https://stackoverflow.com/questions/8105629/ip-addresses-stored-as-int-results-in-overflow
- IPv6: https://technet.microsoft.com/en-us/library/cc781672(v=ws.10).aspx#w2k3tr_ipv6_how_thcz
- MAC: https://en.wikipedia.org/wiki/MAC_address
- Flicks: https://github.com/OculusVR/Flicks
- Currency: https://www.iso.org/iso-4217-currency-codes.html
- Location: http://www.unece.org/cefact/locode/service/location.html
If you have suggestions, please file an issue to start a discussion.
Semantic identifiers are guaranteed safe passage through older systems that do not understand the semantics of the identifier. Such systems can consume a semantic identifier, parse its data, and pass it through to another system without losing the semantic type information.
As an example, if a system encounters an IPv6 semantic identifier but has no explicit support for IPv6 identifiers, it will interpret the value as its base identifier type, which is a fixed list of 2 longs. If this system then passes the identifier on to another system that does understand IPv6 identifiers, that system will interpret it as an IPv6 identifier. The IPv6 type information is not lost along the way.
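The pass-through behavior can be sketched as a decoder that keeps the type code next to the value, even when it does not recognize the semantic type. The sketch uses `geo` (type code `0x28b`, as defined in this document) instead of IPv6, because no IPv6 type code is given here. The `interpret` function and the registry dictionaries are hypothetical.

```python
from typing import Any, Callable, Dict, NamedTuple, Tuple

GEO = 0x28B  # geo type code from this document


class Geo(NamedTuple):
    latitude: float
    longitude: float


def interpret(type_code: int, value: Any,
              known: Dict[int, Callable[[Any], Any]]) -> Tuple[int, Any]:
    """Return (type_code, value); build a semantic object only if the type is known."""
    factory = known.get(type_code)
    if factory is None:
        # Unknown semantic type: keep the base value and the untouched type code.
        return type_code, value
    return type_code, factory(value)


OLD_SYSTEM: Dict[int, Callable[[Any], Any]] = {}            # no semantic support
NEW_SYSTEM = {GEO: lambda pair: Geo(pair[0], pair[1])}      # understands geo

code, seen_by_old = interpret(GEO, [45.5, -122.6], OLD_SYSTEM)
_, seen_by_new = interpret(code, seen_by_old, NEW_SYSTEM)
print(seen_by_old, seen_by_new)
```

The older system only ever sees a list of floats, but because it forwards the original type code, the newer system can still recover the `geo` semantics.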
Identifiers have two forms of string encoding: Data and Human. These forms have different uses.
The data form is intended for identifiers that go into transmitted data like JSON and XML, as well as data storage like a SQL database. Identifiers in this form are not intended for use in URIs and are not human-enterable, though they are composed of visible characters.
Identifiers serialized for data purposes are encoded with a Base128 symbol set for minimum size bloat and safe transferability.
The human form is for identifiers that are consumed and entered by humans, which places different constraints on them. Examples of this form are account identifiers, URLs and serial numbers. These identifiers are often encountered in messages like text and email. Details of the encoding can be found in the Base32 specification.
The following projects implement the Identifiers specification:
This section applies to library authors who build implementations of the Identifiers spec for platforms of their choosing.
The primitive identifiers should map to existing platform types. Most platforms have `string`, `boolean`, and the other primitive types natively implemented. If one is not available, the implementer is encouraged to build the type support into the library rather than requiring the consumer to explicitly use a third-party library. For instance, JS does not support a full 64-bit long value, so the JS implementation uses a popular Long library to support the long number space.
All primitive identifier types are associated with a 1-byte type code. Semantic identifiers have a second type code to identify themselves. The type codes are calculated with bitwise operators to accumulate the various flags that compose their full value.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
primitive | primitive | primitive | list | map | list-of | map-of | semantic |
All identifier types also have structured variants that hold their values in collections. Their type codes combine the structural flags and the type code of the value. To create the full type code, bitwise-OR (`|`) the appropriate structured type code with the base primitive type code.
type | code | MsgPack family |
---|---|---|
list | 0x8 | array |
map | 0x10 | map |
String-encoded identifiers are compressed using MsgPack. More details are in the following section, but the related MsgPack information is included in the type tables for easy reference.
Here are the type codes for primitive types, as well as their list and map structured types.
type | code | MsgPack family | list | map |
---|---|---|---|---|
string | 0x0 | string | 0x8 | 0x10 |
boolean | 0x1 | bool | 0x9 | 0x11 |
integer | 0x2 | int | 0xa | 0x12 |
float | 0x3 | float | 0xb | 0x13 |
long | 0x4 | int | 0xc | 0x14 |
bytes | 0x5 | bin | 0xd | 0x15 |
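The list and map columns follow directly from OR-ing a structural flag onto the primitive code. A minimal sketch, with the constant and function names chosen here for illustration only:

```python
# Primitive type codes and structural flags, copied from the tables in this section.
PRIMITIVES = {
    "string": 0x0,
    "boolean": 0x1,
    "integer": 0x2,
    "float": 0x3,
    "long": 0x4,
    "bytes": 0x5,
}
LIST = 0x8
MAP = 0x10


def structured_code(primitive: str, structure: int) -> int:
    """Combine a primitive type code with a structural flag using bitwise OR."""
    return PRIMITIVES[primitive] | structure


for name in PRIMITIVES:
    print(name, hex(structured_code(name, LIST)), hex(structured_code(name, MAP)))
```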
Composite identifiers can be either lists or maps of other identifiers. Composite identifiers are not typed with primitive type flags. They contain fully-formed identifiers of any type. They can be used to define Semantic identifiers.
When encoded to MsgPack, the outer type is either composite-list or composite-map. The contents of composite identifiers are fully-encoded identifiers.
type | code | MsgPack family |
---|---|---|
composite-list | 0x38 | array |
composite-map | 0x58 | map |
Semantic identifiers have 2-byte type values. The first byte holds the primitive and structural information, and the second byte holds the "slot" number. The integer type value is computed by starting with the base type (including structural type), adding the semantic flag (`0x80`), and then adding the slot position shifted left by `0x8`. The left shift pushes the slot position into the second byte. For example, the `geo` type code is calculated like this:
float | list | semantic | slot |
---|---|---|---|
0x3 | 0x8 | 0x80 | 2 << 0x8 |

`0x3 | 0x8 | 0x80 | (2 << 0x8) = 0x28b`
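The same calculation can be written as a small function. The names `semantic_code` and `SEMANTIC` are illustrative and are not taken from an actual implementation.

```python
SEMANTIC = 0x80  # semantic flag, bit 7 of the first byte


def semantic_code(base_code: int, slot: int) -> int:
    """Base type code (including structure) + semantic flag + slot in the second byte."""
    return base_code | SEMANTIC | (slot << 0x8)


FLOAT = 0x3
LIST = 0x8

geo = semantic_code(FLOAT | LIST, slot=2)
print(hex(geo))
```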
The following table lists the defined semantic types:
type | slot | code | MsgPack format | list | map |
---|---|---|---|---|---|
uuid | 0 | 0x85 | bin 16 size 16 | 0x8d | 0x95 |
datetime | 1 | 0x184 | int | 0x18c | 0x194 |
geo | 2 | 0x28b | fixarray size 2 floats | 0x2ab | 0x2cb |
It is possible for a semantic identifier's base type to be a list or map of primitives. An example of this is the `geo` identifier. In order to create a list or map of these identifiers, the structured types must be marked as either a `list-of` or `map-of` the semantic identifier. These type code addenda are defined in the following table:
type | code | MsgPack family |
---|---|---|
list-of | 0x20 | array[semantic] |
map-of | 0x40 | map[semantic] |
For example, to create the type code for a list of `geo`s, set the `list-of` flag bit (`0x20`):

`0x28b | 0x20 = 0x2ab`
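Going the other way, a type code can be split back into its flags by masking the bits from the bit-layout table. The `describe` helper below is a hypothetical sketch, not part of the specification. It only covers primitive and semantic codes, and makes no attempt to interpret the composite codes `0x38` and `0x58`.

```python
LIST_OF = 0x20
MAP_OF = 0x40
GEO = 0x28B


def describe(type_code: int) -> dict:
    """Split a primitive or semantic type code into its flags and slot number."""
    return {
        "primitive": type_code & 0x7,         # bits 0-2
        "list": bool(type_code & 0x8),        # bit 3
        "map": bool(type_code & 0x10),        # bit 4
        "list_of": bool(type_code & LIST_OF), # bit 5
        "map_of": bool(type_code & MAP_OF),   # bit 6
        "semantic": bool(type_code & 0x80),   # bit 7
        "slot": type_code >> 0x8,             # second byte
    }


geo_list = GEO | LIST_OF
geo_map = GEO | MAP_OF
print(hex(geo_list), hex(geo_map), describe(geo_list))
```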
In order to encode an identifier, one must first encode it to bytes using the MsgPack encoding format. These bytes are then encoded using either Base128 for data uses or Base32 for human uses. Implementations will auto-detect the encoding format and decode into an identifier value correctly.
Internally, encoded identifiers are compressed MsgPack data structures. In order to inter-operate with MsgPack correctly, one must pass the MsgPack encoder the following array:
[type-code, identifier-encode-value]
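To make the array shape concrete, the sketch below hand-encodes a short string identifier into MsgPack bytes. A real implementation would use a MsgPack library. This hand-written encoder only handles type codes up to `0x7f` and strings up to 31 UTF-8 bytes. It also assumes that the encode value of a string identifier is the string itself. The Base128 or Base32 step that follows is not shown.

```python
STRING_CODE = 0x0  # string type code from the primitive type table


def pack_small_int(number: int) -> bytes:
    """MsgPack positive fixint: a single byte for values 0..127."""
    if not 0 <= number <= 0x7F:
        raise ValueError("sketch only supports positive fixint")
    return bytes([number])


def pack_short_str(text: str) -> bytes:
    """MsgPack fixstr: 0xa0 | length, followed by the UTF-8 bytes (max 31 bytes)."""
    data = text.encode("utf-8")
    if len(data) > 31:
        raise ValueError("sketch only supports fixstr")
    return bytes([0xA0 | len(data)]) + data


def pack_string_identifier(value: str) -> bytes:
    """Encode [type-code, value] as a two-element MsgPack fixarray (0x92)."""
    return bytes([0x92]) + pack_small_int(STRING_CODE) + pack_short_str(value)


packed = pack_string_identifier("abc")
print(packed.hex())
```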
Each identifier type has a specific encode value shape that must be met. Implementations will often have platform-specific formats of the identifier values, like native class representations, but these must be transformed into formats that are usable by MsgPack.
Most MsgPack implementations have cross-platform quirks that will require fine-tuning or even fixing. For instance, the Java version of MsgPack treats all doubles as `FLOAT64`, while other platforms encode float values as either `FLOAT32` or `FLOAT64`. The Java version of Identifiers has to manually emit `FLOAT32` for single-precision values. The Test Compatibility Kit will aid the implementer in discovering and mitigating their platform's MsgPack anomalies.
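One way to make the float width decision explicit is to check whether a value survives a round trip through single precision. This is a sketch of that idea, not the rule used by any particular implementation. The "round-trips exactly" criterion is an assumption, and `pack_float` is a hypothetical name. The marker bytes `0xca` (float 32) and `0xcb` (float 64) come from the MsgPack format.

```python
import struct


def pack_float(value: float) -> bytes:
    """Emit MsgPack float 32 when the value round-trips exactly, else float 64."""
    try:
        as_single = struct.pack(">f", value)
    except OverflowError:
        # Too large for single precision.
        return b"\xcb" + struct.pack(">d", value)
    if struct.unpack(">f", as_single)[0] == value:
        return b"\xca" + as_single
    return b"\xcb" + struct.pack(">d", value)


print(pack_float(1.5).hex())   # exactly representable in single precision
print(pack_float(0.1).hex())   # needs double precision
```

A compatibility test suite can then compare these bytes across platforms, which is where differences such as the Java behavior described above would show up.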
It is expected that encoded identifiers created in one system will be consumed in another system of a different architecture. For instance, a Java server will encode an Identifier that will be consumed by a JavaScript client. To support this goal, all implementations must pass the Test Compatibility Kit.