Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry sampling conventions #793

Closed
wants to merge 32 commits into from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Mar 5, 2024

Changes

Introduces 3 conventional attributes to describe sampling in OpenTelemetry collection pipelines. This specification refers to OTEP 235.

Related to the specification change (from OTEP 235) in open-telemetry/opentelemetry-specification#3910.

Related to the specification of about representing trace context in logs in open-telemetry/opentelemetry-specification#3909.

Prototype for a Collector sampler based on these attributes in open-telemetry/opentelemetry-collector-contrib#29720 and [WIP] open-telemetry/opentelemetry-collector-contrib#24811.

Part of open-telemetry/opentelemetry-specification#1413

Merge requirement checklist

@jmacd jmacd marked this pull request as ready for review March 5, 2024 23:52
@jmacd jmacd requested review from a team March 5, 2024 23:52
model/otel/sampling.yaml Outdated Show resolved Hide resolved
model/otel/sampling.yaml Outdated Show resolved Hide resolved
model/otel/sampling.yaml Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
docs/otel/sampling.md Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor Author

jmacd commented Mar 6, 2024

@lmolkova @pyohannes Thank you! I realized from your responses that this repository needs to contain more detail, whereas I had been planning on rolling that detail into open-telemetry/opentelemetry-specification#3910 and now I've put some of that text into this PR.

I created a file in the attributes registry and moved the attribute definitions there.

I am hoping to justify the use of "sampling" as a prefix, for historical reasons. I removed the tag and decided to wait for more feedback before introducing a requirements-level statement, because these attributes are somewhat unusual.

Since spans record tracestate, unlike log records, we are proposing two new attributes that are only useful for log records corresponding with fields that are defined for use in tracestate. I had been using the tag to de-select those two attributes in the span attributes and leave them in the logs attributes. Now, there is just one section called "Sampling attributes" and the docs explain this.

Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the additional context, I still have some questions and suggestions

docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
model/registry/sampling.yaml Outdated Show resolved Hide resolved
model/registry/sampling.yaml Outdated Show resolved Hide resolved
.chloggen/793.yaml Outdated Show resolved Hide resolved
Copy link
Member

@kentquirk kentquirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very happy about this document.

@jmacd jmacd requested a review from tigrannajaryan as a code owner March 7, 2024 17:56
@jmacd
Copy link
Contributor Author

jmacd commented Mar 25, 2024

Reviewers: I see no reason to continue promoting sampling.priority as a convention, despite its definition in OpenTracing. I found a very different definition for a near-identical concept in the OpenTelemetry contrib-collector probabilistic sampler processor for logs. If we used the logs processor definition, "sampling.priority" would equal a request to sample at a particular percentage, where there is an implicit conjunction. In the terminology of OTEP 250, the priority mechanism is the FIRST in the conjunction (making it possible to drop before evaluating the second).

Therefore, I have revised this PR only to specify sampling.threshold and sampling.randomness.

@trask
Copy link
Member

trask commented Mar 26, 2024

I have revised this PR only to specify sampling.threshold and sampling.randomness.

it looks like you need to regenerate the markdown to fully remove sampling.priority

model/trace/sampling.yaml Outdated Show resolved Hide resolved
model/logs/sampling.yaml Outdated Show resolved Hide resolved
docs/attributes-registry/sampling.md Outdated Show resolved Hide resolved
docs/sampling/README.md Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor Author

jmacd commented Mar 27, 2024

Updated: sampling.priority is really removed, now.
This leaves the question in #793 (review) (and #793 (comment)) about whether we agree to use log record attributes to convey metadata about the collection path.

The OpenTelemetry sampling decision is defined in terms of a Threshold
value and a Randomness value, each containing 56 bits of information.

A constant known as _maximum adjusted count_ (`MaxAdjustedCount`),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be just me, but I think that max-something suggests inclusiveness, so this can be confusing. How about AdjustedCountLimit?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with MaxAdjustedCount. It is inclusive with respect to the adjusted count. However, I understand that it can be a little confusing as it is also used as an exclusive upper limit for the threshold and the random value.

docs/sampling/README.md Show resolved Hide resolved
The OpenTelemetry sampling decision is defined in terms of a Threshold
value and a Randomness value, each containing 56 bits of information.

A constant known as _maximum adjusted count_ (`MaxAdjustedCount`),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with MaxAdjustedCount. It is inclusive with respect to the adjusted count. However, I understand that it can be a little confusing as it is also used as an exclusive upper limit for the threshold and the random value.

Copy link
Member

@kentquirk kentquirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor wording clarification, but otherwise this looks good.

docs/sampling/README.md Outdated Show resolved Hide resolved
### Sampling randomness

When determining the Randomness value from an item of telemetry,
sampler implementations SHOULD:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sampler implementations SHOULD:
sampler implementations SHOULD evaluate the following in order:

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 21, 2024
@lmolkova
Copy link
Contributor

@jmacd do we have a consensus on this PR within the sampling WG?

Copy link

github-actions bot commented May 8, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 8, 2024
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this May 15, 2024
@jmacd
Copy link
Contributor Author

jmacd commented May 31, 2024

Miscellaneous updates:

Apologies for stalling. I am re-opening this with the intention to start it moving again, next week.

Related: @kalyanaj is working on minor revisions to OTEP 235 based on our analysis of the W3C tracestate level 2 "random" flag.

Related: The semantic conventions here go slightly beyond OTEP 235, which focuses on tracestate and does not explicitly document our approach to logs sampling. The new semantic conventions in this PR could be added back into OTEP 235 as there is no disagreement in the Sampling SIG, or we could just approve them here -- I am referring to the new sampling.randomness and sampling.threshold attributes.

The OTel Collector prototype for this is in review in its final stage. If we don't get thus work approved and merged soon, the work done there will be at-risk for near-future breaking changes. Please see open-telemetry/opentelemetry-collector-contrib#31894.

@jmacd jmacd reopened this May 31, 2024
jmacd and others added 2 commits May 31, 2024 09:24
@github-actions github-actions bot removed the Stale label Jun 1, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Jun 12, 2024

I will re-open this PR soon, will close it for now.
I think we might want to submit another OTEP with some missing details in OTEP 235 and/or see open-telemetry/oteps#261.

In particular, the review for open-telemetry/opentelemetry-collector-contrib#31894 raises the question of whether the behavior adopted in probabilisticsamplerprocessor should be a semantic conventional approach, namely:

If you have a hashing algorithm to construct some number of bits used for a consistent threshold-based approach and you want to record your sampling decision in a collection-path sampler such as the referenced component, you SHOULD synthesize an rv corresponding with your threshold-based decision (interpreted as a rejection threshold), which along with th means you can interpret sampling probability for a wide family of samplers (including the legacy probabilisticsamplerprocessor technique). See https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31894/files#r1637230966.

@jmacd jmacd closed this Jun 12, 2024
jpkrohling added a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Jun 13, 2024
…rt OTEP 235) (#31894)

**Description:** Creates new sampler modes named "equalizing" and
"proportional". Preserves existing functionality under the mode named
"hash_seed".

Fixes #31918

This is the final step in a sequence, the whole of this work was
factored into 3+ PRs, including the new `pkg/sampling` and the previous
step, #31946. The two new Sampler modes enable mixing OTel sampling SDKs
with Collectors in a consistent way.

The existing hash_seed mode is also a consistent sampling mode, which
makes it possible to have a 1:1 mapping between its decisions and the
OTEP 235 randomness and threshold values. Specifically, the 14-bit hash
value and sampling probability are mapped into 56-bit R-value and
T-value encodings, so that all sampling decisions in all modes include
threshold information.

This implements the semantic conventions of
open-telemetry/semantic-conventions#793, namely
the `sampling.randomness` and `sampling.threshold` attributes used for
logs where there is no tracestate.

The default sampling mode remains HashSeed. We consider a future change
of default to Proportional to be desirable, because:

1. Sampling probability is the same, only the hashing algorithm changes
2. Proportional respects and preserves information about earlier
sampling decisions, which HashSeed can't do, so it has greater
interoperability with OTel SDKs which may also adopt OTEP 235 samplers.

**Link to tracking Issue:** 

Draft for
open-telemetry/opentelemetry-specification#3602.
Previously
#24811,
see also open-telemetry/oteps#235
Part of #29738

**Testing:** New testing has been added.

**Documentation:** ✅

---------

Co-authored-by: Juraci Paixão Kröhling <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

10 participants