-
Notifications
You must be signed in to change notification settings - Fork 264
ProtocolBuffersV2Wishlist
Protocol Buffers are a brilliant technology, but they aren’t perfect. This page documents things about Protocol Buffers that could be improved (in my opinion), and ideas for a v2 (if such a thing were ever to happen).
Protocol Buffers encode variable-length integers using continuation bits. Each byte sets the high bit (0x80
) if there are more bytes to come, or clears it if not. Bit-level examples:
0aaa aaaa -> aaa aaaa 1aaa aaaa 0bbb bbbb -> aa aaaa abbb bbbb 1aaa aaaa 1bbb bbbb 1ccc cccc 0ddd dddd -> aaaa aaab bbbb bbcc cccc cddd dddd
This coding is good in terms of bit density (it takes b / 7 + 1
bytes to encode b
bits), but is not the most efficient in terms of CPU time to encode or decode. There are two main problems:
- When decoding, you don’t get the information up-front about how many bytes there will be. You have to branch on every input byte. (This isn’t strictly true, but the workarounds hurt the common case).
- When encoding or decoding, you have to perform a nontrivial transformation to either insert or remove the continuation bits.
A coding that is more efficient to encode/decode puts all length information in the first byte, like UTF-8:
0aaa aaaa -> 0aaa aaaa 10aa aaaa bbbb bbbb -> 00aa aaaa bbbb bbbb 110a aaaa bbbb bbbb cccc cccc -> 000a aaaa bbbb bbbb cccc cccc ... 1111 1111 [8 bytes] -> [8 bytes]
This encoding has exactly the same bit density as the existing varint decoding (except when all 64 bits are used, in which case it uses only 9 bytes instead of 10), but since all the length information is encoded in the first byte, the decoder doesn’t need to look at or branch on the subsequent bytes; it only needs to copy them. Also, with this encoding there is no need to shift the data for every byte; since the raw data is not interleaved with continuation bits, the data can be decoded with a single load and bitwise-AND. The combination of these two factors would significantly increase encoding and decoding efficiency.
Authors of .proto files are forced to choose between several different types that all represent the same range of values. For example, an int32
value can be represented by int32
, sint32
, and sfixed32
. They all work the same from the user’s perspective, but they have very different bit density. In this case:
-
int32
is best if the value will rarely be <0. -
sint32
is best if the value will often be <0. -
sfixed32
is best if the value will often be >2^28 or <-2^28 (or something like that)
In other words, developers have to ask them each time they add a field what its expected data distribution will be. This is a burden on developer productivity. If the developer picks wrong, or isn’t aware of what the optimal choices are, or the data distribution changes later, the encoded data will have poor bit density. It is not possible to change between them later, without making a backward-incompatible change. It also means that .proto files are littered with these implementation details about the encoding:
message Foo {
optional sint32 bar;
optional fixed64 baz;
optional int64 quux;
}
…instead of just letting developers say what they mean:
message Foo {
optional int32 bar;
optional uint64 baz;
optional int64 quux;
}
If .proto files only contained the latter, the encoder could optimize at encode time and pick the most efficient encoding. The encoding would be present in the byte stream, and decoders would decode based on the specified encoding.
Unfortunately this is not possible with the current byte format, because the wire format does not completely specify the integer encoding. Specifically, the wire format does not specify whether a given integer is zig-zag encoded. As a result, implementing this feature would require a backward-incompatible change.
There is a run-time cost inherent in this idea, since decoders would have to branch based on the observed wire type. Right now decoders expect a specific wire type given a .proto type, so the only question is “is this wire type the right one?” (which it nearly always is). In the world I describe, this would be a branch that would possibly be difficult to predict. The cost would have to be measured.
Application developers could save memory if small types (uint8
, int8
, uint16
, int16
) were supported.
I often forget (despite having a fair amount of experience with protobufs) that you have to specify “optional” before every non-required field. This is annoying and verbose. It doesn’t add any useful information. And worse, the fields might not truly be optional; if your application has any validation rules that are any more complex than “this field must be present,” then they cannot fully be expressed in .proto syntax. It might be the case that removing a field could make an valid message invalid (according to the application), despite it being labeled “optional.” Examples:
- this repeated field must have >2 elements.
- exactly one of this set of fields must be set (ie. union/polymorphic types)
- if version == 1 then foo must be set, elseif version == 2 then bar must be set (think backward compatible message revisions).
It should be allowed to omit “optional” from fields.
Submessages and Groups are the same thing from a logical perspective (they are both nested messages), but they have different syntax in .proto files and a different on-the-wire encoding. In proto2 groups are considered deprecated. But groups have the nice property that they can be serialized without knowing up-front the size of the entire group. This means that unlike submessages, groups can be encoded in a streaming way; the cost is that it’s not as easy to skip them when deserializing. For example, if you are streaming protocol buffer data through a filter that mutates part of the protobuf, groups stream nicely whereas submessages require the filter to buffer the whole submessage before emitting the first byte.
If groups and submessages were treated the same (ie. you could choose at encode time which encoding to use), then developers who are serializing protobufs would have an extra trade-off available to them. Groups could be more appropriate if resources in the encoding process are very scarce. This would require decoders to be capable of decoding both.
Currently strings and submessages are encoded with the same wire type. This means that if you encounter a delimited field with an unknown tag, you cannot know if it represents a submessage or a string. There are some useful algorithms that can only run on arbitrary data if they can distinguish between the two; for example, a canonicalizer that takes a serialized protobuf and transforms it into a canonical serialization.
If “true” and “false” were their own wire types, booleans could be encoded with only one byte, which would be a 50% savings. On the other hand this would use two extra wire types, and there are currently only two wire types to spare. So you couldn’t do this and also give submessages their own wire type.
Right now unions can be emulated by specifying several optional fields of distinct types and documenting that only one of them should be set at a time. If protocol buffers had real unions, implementations could save space in-memory, and also provide an easy way to ask “what union member is currently set?”