Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff:

| | main | #11956 | +/- |
| --- | --- | --- | --- |
| Coverage | 91.67% | 91.70% | +0.02% |
| Files | 455 | 462 | +7 |
| Lines | 24039 | 24749 | +710 |
| Hits | 22038 | 22695 | +657 |
| Misses | 1629 | 1672 | +43 |
| Partials | 372 | 382 | +10 |

☔ View full report in Codecov by Sentry.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
> The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
>
> Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. (If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose.)
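For illustration, a minimal Go sketch of the wrapper approach described in the quoted paragraph might look like the following. The package name and the `fromDownstream` helper are made up for the example; only `downstreamError` itself comes from the quoted text.

```go
package pipelinetelemetry

import "errors"

// downstreamError wraps an error that has already been attributed to a
// downstream component, so that upstream instrumentation layers can report
// it as "rejected" rather than "failure". Sketch only; names are assumptions.
type downstreamError struct {
	err error
}

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// fromDownstream reports whether err was already tagged by a downstream layer.
func fromDownstream(err error) bool {
	var de downstreamError
	return errors.As(err, &de)
}
```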
If I understand correctly, this may be a breaking change for some components, IF they are checking for types of errors using something other than `errors.As`. I think it's ok though, and those components should update to use `errors.As` instead anyway. However, we should be aware of this when implementing changes in case it is a widespread problem.
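As a hypothetical illustration of the kind of check that would break, consider the contrast below; the `backendError` type is invented for the example and is not a real Collector type.

```go
package example

import "errors"

// backendError is a made-up component error type used only for illustration.
type backendError struct{ msg string }

func (e *backendError) Error() string { return e.msg }

func classify(err error) string {
	// Brittle: a plain type assertion stops matching once the error is
	// wrapped (e.g. by a permanentError or downstreamError wrapper).
	if _, ok := err.(*backendError); ok {
		return "matched via type assertion"
	}

	// Robust: errors.As walks the wrapping chain and still finds the error.
	var be *backendError
	if errors.As(err, &be) {
		return "matched via errors.As"
	}
	return "no match"
}
```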
You're absolutely right. I think those components would already be broken anyway, because of the `permanentError` and `multiError` wrappers we already use.
will we need / want to update this once #11085 is merged?
If #11085 gets merged before this PR, I'll update this paragraph to only include the parenthetical. If this PR gets merged first, I think presenting the two alternatives is probably good enough? But we could make a second amendment if we feel the need to.
One minor suggestion for simpler language, otherwise looks great, thank you!
Given the two approvals and the announcements on #otel-collector-dev and the Collector SIG meeting from last week, this is entering final comment period. cc @open-telemetry/collector-approvers
> For both metrics, an `outcome` attribute with possible values `success`, `failure`, and `rejected` should be automatically recorded, based on whether the corresponding function call returned successfully, returned an internal error, or propagated an error from a component further downstream.
I would like to ensure there is detail about the OTLP PartialSuccess fields `rejected_spans`, `rejected_log_records`, `rejected_data_points`. Here's what I propose:

An OTLP exporter that receives success with one of the rejected counts set will:
- return `nil`, indicating success
- count `N-j` success points, where N is the item count and j is the number rejected
- count `j` rejected points
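A rough sketch of the proposed counting, assuming the exporter has already read the relevant `rejected_*` count out of the OTLP partial-success response; the counter passed in here is a placeholder, not one of the RFC's actual instruments.

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordPartialSuccess counts N-j items as success and j items as rejected,
// and still reports the export call itself as successful (nil error).
// sent is N (items in the request); rejected is j (e.g. rejected_spans).
func recordPartialSuccess(ctx context.Context, counter metric.Int64Counter, sent, rejected int64) error {
	counter.Add(ctx, sent-rejected, metric.WithAttributes(attribute.String("outcome", "success")))
	counter.Add(ctx, rejected, metric.WithAttributes(attribute.String("outcome", "rejected")))
	return nil // a partial success is still a success from the caller's perspective
}
```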
In this RFC, the exporter wouldn't be the one outputting the metrics; this would be done by an external mechanism: a wrapper of the `Consumer` API placed in between each component.
This means that adding information about partial successes to pipeline instrumentation metrics would require a whole new mechanism for propagating partial successes through the Collector pipeline (and probably other considerations such as how retry exporters should handle them), which sounds like a much broader discussion...
Considering this mechanism would likely only be used by the OTLP exporter, I would suggest emitting a custom metric in said exporter instead of including it as part of general pipeline telemetry. Even in the latter case, I think this may warrant a separate PR, or even an RFC of its own.
From the conversation above this seems like something that can be dealt with independently from this PR.
@jmacd I will merge this on Tuesday unless you consider this a blocker for this PR
I think the consequences of not getting this right are pretty severe, as OpenTelemetry has advised vendors to use the rejected counts and partial success to indicate when well-formed data cannot be accepted for backend-specific reasons. Also, I think OpenTelemetry should prioritize the user experience for its own protocol over others.
There is already a red flag for me, in this PR, because the term "rejected" is being introduced without a definition.
We have an existing definition for rejected items in OpenTelemetry, which is what happens following a partial success, and we used to refer to failures as "refused" or "failed" in various Collector observability metrics. That said, I'm ready to accept a wider definition for "rejected", but if an OTLP exporter returns partial success and we count 0 rejected points, while using a separate OTLP-specific metric to count rejected points, I think the user experience will be bad, especially for OTLP users.
> In this RFC, the exporter wouldn't be the one outputting the metrics; this would be done by an external mechanism: a wrapper of the `Consumer` API placed in between each component.

I think we could improve this situation by having senders return `(error, Details)`, although that would be a pretty big change. The best way to avoid my concerns, in the short term, is not to overload the term "rejected" in favor of the term we've used in the past, "refused".
I have pretty strong feelings about how we define "dropped" as well. I don't think data should ever count as both rejected/refused as well as dropped, so I think some definitions would help, and keep in mind that "rejected" is already defined.
> The best way to avoid my concerns, in the short term, is not to overload the term "rejected" in favor of the term we've used in the past, "refused".
This seems reasonable to me. I don't think we're intending to modify any definitions outside of what is defined in this RFC.
The larger problem, if I understand correctly, boils down to reconciling the "all or nothing" interface which stands in between components vs. the notion of partial success between the exporter and destination. I suggest this is a separate discussion from this PR.
I don't wish to block this effort, and I agree we should disambiguate requests that fail inside a component vs downstream. The term "reject" was chosen for its dictionary definition, "dismiss as inadequate", "failure to meet standards", because it describes a returned judgement about the data. If there's a downstream failure because of timeout, unavailable destination, and so on, the term "reject" feels less applicable to me. For "refuse" the dictionary has a more applicable definition ("not willing to perform an action").
I think "rejected" is a OK term for downstream failures, but receivers have been using "refused" for this. Mostly, hope we can eventually count the OTLP partial success rejections. This is discussed in #9243.
I see, I wasn't at all aware of the ongoing discussions about partial successes, and this distinction between "internal error / invalid data" and "data is valid but intentionally rejected for backend-specific reasons". Given this and the current use of the word "refused" in receiver metrics, I absolutely agree we should use that instead. To be honest, I hadn't thought very hard about the exact word used, so I'll update the PR with that in mind.
I'll give this a few more days since there was a minor change, and merge this on Tuesday next week.
Context
The Pipeline Component Telemetry RFC was recently accepted (#11406). The document states the following regarding error monitoring:
Observability requirements for stable pipeline components were also recently merged (#11772). The document states the following regarding error monitoring:
Because errors are typically propagated across `ConsumeX` calls in a pipeline (except for components with an internal queue like `processor/batch`), the error observability mechanism proposed by the RFC implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error. This does not match the goals set out in the observability requirements, and makes it much harder to tell from the emitted telemetry which component errors are coming from.

Description
This PR amends the Pipeline Component Telemetry RFC with the following:
- restricting the `outcome=failure` value to cases where the error comes from the very next component (the component on which `ConsumeX` was called);
- adding a new value for the `outcome` attribute, `rejected`, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected").

The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported `downstream` struct, which upstream layers could check with `errors.As` to determine whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors. Please check the diff for details.
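For concreteness, here is a rough sketch of how such an instrumentation wrapper could classify and tag errors. The `instrumentedConsumer` type and the `recordOutcome` hook are hypothetical, and `downstreamError` repeats the sketch from the review thread above; none of this is the actual implementation.

```go
package pipelinetelemetry

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// downstreamError marks an error already attributed to a downstream component.
type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// instrumentedConsumer is a hypothetical wrapper placed between two components.
type instrumentedConsumer struct {
	next          consumer.Traces
	recordOutcome func(ctx context.Context, items int, outcome string) // placeholder for the real instruments
}

func (w *instrumentedConsumer) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	err := w.next.ConsumeTraces(ctx, td)

	outcome := "success"
	var de downstreamError
	switch {
	case err == nil:
	case errors.As(err, &de):
		// Already attributed to a component further downstream: this
		// interface only propagated the error, so count it as "rejected".
		outcome = "rejected"
	default:
		// The very next component failed: count it as "failure" and tag
		// the error so upstream layers see it as coming from downstream.
		outcome = "failure"
		err = downstreamError{err: err}
	}

	w.recordOutcome(ctx, td.SpanCount(), outcome)
	return err
}
```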
Possible alternatives
There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:
- changing the `Consumer` API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the `batch` processor);