feat(encoding): implement UTF-8 support in metric and label names #236

fedetorres93 · 2024-11-05T15:26:13Z

Adds UTF-8 support for metric and label names. Addresses #190.

These changes are based on the work done in the Prometheus common library (prometheus/common#537 and prometheus/common#570).

Encoders use the new quoting syntax {"foo"} if and only if the metric name does not conform to the legacy name format (foo{}).
The Registry struct has two new fields: name_validation_scheme, which determines whether validation uses the legacy or the UTF-8 scheme, and escaping_scheme, which determines the default escaping scheme used when scrapers don't support UTF-8.
Scrapers can announce via content negotiation that they support UTF-8 names by adding escaping=allow-utf-8 to the Accept header. When UTF-8 is not available, metric providers can be configured to escape names in one of three ways: values (U__ UTF value escaping for perfect round-tripping), underscores (all invalid chars become _), or dots (dots become _dot_, _ becomes __, all other values become ___). Escaping can be set as a global default (escaping_scheme, as mentioned above) or specified in the Accept header with the escaping= term, which can be allow-utf-8 (for UTF-8-compatible scrapers), underscores, dots, or values. Existing functionality is maintained.
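
As a rough illustration of the negotiation and escaping described above, here is a minimal sketch that reads the escaping= term from an Accept header and applies the underscores scheme. The names `negotiate_escaping`, `escape_underscores`, and `is_legacy_char` are hypothetical and are not taken from this PR's code. The enum only mirrors the scheme names listed in the description, and the legacy name format is assumed to be `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```rust
/// Hypothetical enum mirroring the scheme names from the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EscapingScheme {
    AllowUtf8,
    Underscores,
    Dots,
    Values,
}

/// Picks a scheme from the `escaping=` term of an Accept header value,
/// falling back to `default` when the term is missing or unrecognized.
fn negotiate_escaping(accept: &str, default: EscapingScheme) -> EscapingScheme {
    accept
        .split(|c: char| c == ';' || c == ',')
        .filter_map(|part| part.trim().strip_prefix("escaping="))
        .find_map(|value| match value.trim() {
            "allow-utf-8" => Some(EscapingScheme::AllowUtf8),
            "underscores" => Some(EscapingScheme::Underscores),
            "dots" => Some(EscapingScheme::Dots),
            "values" => Some(EscapingScheme::Values),
            _ => None,
        })
        .unwrap_or(default)
}

/// Assumed legacy character rule: letters, `_` and `:` anywhere,
/// digits anywhere except the first position.
fn is_legacy_char(c: char, index: usize) -> bool {
    c.is_ascii_alphabetic() || c == '_' || c == ':' || (index > 0 && c.is_ascii_digit())
}

/// "underscores" scheme: every character that is invalid in a legacy
/// name is replaced with a single `_`.
fn escape_underscores(name: &str) -> String {
    name.chars()
        .enumerate()
        .map(|(i, c)| if is_legacy_char(c, i) { c } else { '_' })
        .collect()
}
```

With this sketch, a scraper that sends `escaping=allow-utf-8` would receive names unescaped, while a scraper that sends no escaping= term would get whatever default the registry was configured with.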

Work towards prometheus/prometheus#13095.


fedetorres93 · 2024-11-11T18:26:04Z

@mxinden Please take a look, thanks!

fedetorres93 · 2024-12-09T19:45:53Z

@mxinden I just wanted to check in regarding this PR. Did you have a chance to take an initial look?

Looking forward to any feedback you might have. Thanks!

mxinden · 2024-12-13T15:22:55Z

Hi @fedetorres93,

I am sorry for the delay. I appreciate the time you invested making a solid patch!

As I assume many of us do, I am maintaining this crate in my free time. Skimming the proposal, I don't see it adding a lot of value and thus I don't see myself prioritizing this work over other patches on this repository.

Now, me not investing time into this obviously doesn't mean others can't. Maybe you can find a Prometheus maintainer to champion this work and do extensive reviews.

On a high level, I would want this work to (a) not introduce significant complexity to the library and (b) not introduce performance regressions on the hot paths (metric recording and metric encoding).

ywwg · 2024-12-13T16:59:22Z

Hi @mxinden, thanks for your reply and continued maintenance of this library! With the release of Prometheus 3.0, one of our goals in adding support for UTF-8 everywhere in Prometheus is updating all of the officially supported client libraries, of which Rust is one. One issue we've had with getting these libraries updated is finding the necessary language experts such as yourself to certify that the changes we're making are safe and well-written. So far, you're the only person we've found who is an expert in both Rust and Prometheus.

As far as correctness goes, I can help do a review to make sure the test cases are covering all of the situations where quoting and escaping are required. For performance, I'm guessing there are ways of benchmarking Rust but I am not familiar with them. @fedetorres93 and I can work to try to generate those numbers. But for language correctness, we need a Rust expert to verify that we're adhering to Rust style.

I think splitting up the work this way will keep the burden on you as low as possible. Fede and I can go back and forth on the test coverage and performance and then call upon you again when we think we're in good shape. Does that sound workable?

mxinden · 2024-12-14T16:42:56Z

Works for me. Thanks!

ywwg · 2024-12-16T15:03:44Z

Thank you for your flexibility! @fedetorres93, let's start by really ramping up on unit tests. While working on the Go common library, I had to add a bunch of edge cases.

fedetorres93 · 2025-01-20T15:46:19Z

@mxinden I added more tests to cover more edge cases. I think we're in better shape now, in case you want to take a look to see if we're okay regarding language correctness.

As for performance, the only difference I see in the benchmarks is a regression in text encoding (about +39%, or roughly +12 ms), which can be attributed to the new name validation and escaping.

encode                  time:   [43.848 ms 44.070 ms 44.317 ms]
                        change: [+37.747% +39.110% +40.430%] (p = 0.00 < 0.05)
                        Performance has regressed.

Please let me know if you think I may have missed some optimization opportunities since I'm not that familiar with Rust.

Thanks!

This makes four changes:

1. The `EscapingScheme` and `ValidationScheme` enums are now `Copy`, since they are very small and cheap to copy. They are passed by value rather than by reference.
2. The `escape_name` function now returns a `Cow` rather than a `String`, which avoids allocations in many cases.
3. `escape_name` also preallocates a buffer for the escaped name rather than starting with an empty `String` and growing it, to amortize the allocations.
4. `is_ascii_alphabetic` and `is_ascii_digit` are used to check for characters that are valid in metric and label names.

Based on profiles, I suspect that change 2 has the highest impact, but I haven't split these out to see how much of a difference each one makes.

Signed-off-by: Ben Sully <[email protected]>
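
To make changes 2 and 3 concrete, here is a minimal sketch of a `Cow`-returning escape function with a preallocated buffer. It is not the PR's actual `escape_name`: this stand-in takes no scheme argument, handles only the underscores scheme, and assumes the legacy name format is `[a-zA-Z_:][a-zA-Z0-9_:]*`. The helper `is_legacy_char` is hypothetical.

```rust
use std::borrow::Cow;

/// Assumed legacy character rule: letters, `_` and `:` anywhere,
/// digits anywhere except the first position.
fn is_legacy_char(c: char, index: usize) -> bool {
    c.is_ascii_alphabetic() || c == '_' || c == ':' || (index > 0 && c.is_ascii_digit())
}

/// Hypothetical stand-in for `escape_name`, underscores scheme only.
fn escape_name(name: &str) -> Cow<'_, str> {
    // Fast path: nothing needs escaping, so borrow the input and
    // perform no allocation at all.
    if name.chars().enumerate().all(|(i, c)| is_legacy_char(c, i)) {
        return Cow::Borrowed(name);
    }

    // Slow path: reserve the buffer once up front. For this scheme the
    // output never needs more bytes than the input, so `name.len()`
    // is a sufficient capacity.
    let mut escaped = String::with_capacity(name.len());
    for (i, c) in name.chars().enumerate() {
        escaped.push(if is_legacy_char(c, i) { c } else { '_' });
    }
    Cow::Owned(escaped)
}
```

In a workload where most names already fit the legacy format, the fast path means the common case costs one scan and no heap allocation, which is consistent with the suspicion that the `Cow` change matters most.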

fedetorres93 · 2025-01-22T18:28:38Z

I pushed some changes by @sd2k (thanks for your help!) which improve the text encode benchmark results I shared before:

encode                  time:   [37.371 ms 37.433 ms 37.499 ms]
                        change: [+33.474% +34.031% +34.591%] (p = 0.00 < 0.05)
                        Performance has regressed.

mxinden · 2025-01-23T13:30:49Z

encode time: [37.371 ms 37.433 ms 37.499 ms]
change: [+33.474% +34.031% +34.591%] (p = 0.00 < 0.05)
Performance has regressed.

A 35% performance regression seems significant, especially for a feature that the majority of users won't need. Do you see any ways to improve this? With the latest optimizations, is this inherent to using UTF-8?

fedetorres93 added 17 commits October 1, 2024 10:41

WIP Quote non-legacy metric names in descriptor

f25f0f4

Signed-off-by: Federico Torres <[email protected]>

Quote non-legacy label names and put non-legacy quoted metric names i…

9d1e518

…nside braces Signed-off-by: Federico Torres <[email protected]>

Add quoted metric and label names tests for text encoding

378d0e4

Signed-off-by: Federico Torres <[email protected]>

Refactor metric and label names validation

b7f6396

Signed-off-by: Federico Torres <[email protected]>

[WIP] Add content negotiation for non-legacy characters in metric and…

2339769

… label names Signed-off-by: Federico Torres <[email protected]>

Fix text encoding tests

4dc8859

Signed-off-by: Federico Torres <[email protected]>

Move name validation functions to encoding

ce522cd

Signed-off-by: Federico Torres <[email protected]>

Add getters for name_validation_scheme and escaping_scheme

81cd210

Signed-off-by: Federico Torres <[email protected]>

Add RegistryBuilder

9eee74c

Signed-off-by: Federico Torres <[email protected]>

Add documentation

e25fceb

Signed-off-by: Federico Torres <[email protected]>

Remove name validation scheme and escaping scheme constructors

37ca910

Signed-off-by: Federico Torres <[email protected]>

Fix metric and label name escaping when UTF-8 validation is enabled

68d9ae9

Signed-off-by: Federico Torres <[email protected]>

Remove commented code

c0b0b41

Signed-off-by: Federico Torres <[email protected]>

Remove unused structs in axum UTF-8 example

8b85cdb

Signed-off-by: Federico Torres <[email protected]>

Remove unnecessary escape_name calls

4b1ce00

Signed-off-by: Federico Torres <[email protected]>

Merge remote-tracking branch 'upstream/master' into ftorres/utf-8

8749a9c

Formatting

b0d7ae2

Signed-off-by: Federico Torres <[email protected]>

fedetorres93 marked this pull request as ready for review November 7, 2024 19:39

Merge branch 'master' into ftorres/utf-8

e62a619

fedetorres93 added 3 commits January 16, 2025 15:16

Add more test cases

598fe97

Signed-off-by: Federico Torres <[email protected]>

Merge branch 'master' into ftorres/utf-8

813b459

Additional tests

ca92109

Signed-off-by: Federico Torres <[email protected]>

fedetorres93 force-pushed the ftorres/utf-8 branch from f11d11c to 0c3ad82 Compare January 22, 2025 18:15

fedetorres93 force-pushed the ftorres/utf-8 branch from 0c3ad82 to 1ff883d Compare January 22, 2025 18:16
