ddtrace/tracer: add integration tag to spans_started/finished #3023
base: main
Conversation
Datadog Report
Branch report: ✅ 0 Failed, 5127 Passed, 70 Skipped, 2m 53.72s Total Time
New Flaky Tests (2)

Benchmarks
Benchmark execution time: 2025-01-08 18:52:07
Comparing candidate commit f4e7820 in PR branch
Found 2 performance improvements and 9 performance regressions! Performance is the same for 48 metrics, 0 unstable metrics.
scenario:BenchmarkHttpServeTrace-24
scenario:BenchmarkInjectW3C-24
scenario:BenchmarkOTelApiWithCustomTags/datadog_otel_api-24
scenario:BenchmarkOTelApiWithCustomTags/otel_api-24
scenario:BenchmarkPartialFlushing/Disabled-24
scenario:BenchmarkPartialFlushing/Enabled-24
scenario:BenchmarkSetTagString-24
scenario:BenchmarkSetTagStringer-24
scenario:BenchmarkSingleSpanRetention/no-rules-24
scenario:BenchmarkSingleSpanRetention/with-rules/match-all-24
scenario:BenchmarkSingleSpanRetention/with-rules/match-half-24
Ok, so you're reporting spansStarted/spansFinished on span.Start/span.Finish if the integration is not empty, and leaving the chunk reporting to any spans that are manual... I understand why you did this, but I'm not totally sure about the approach.
span.Start and span.Finish are typically called quite frequently, so if a majority of the spans are from automatic integrations, this will be very noisy (and defeats the purpose of reporting the metrics at a specified interval, which is to reduce noise).
One alternative idea:
Change the way we track spansStarted and spansFinished to be some kind of counter map that includes the integration name, e.g. map[string]uint32, where the key is the integration name and the value is the count of spans started/finished for that integration. Then, in this goroutine, we'd iterate over the map and report the spans started/finished per integration (a rough sketch of this idea follows below)
(or some other idea I haven't thought of?)
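For illustration, here is a minimal sketch of that counter-map idea, assuming a mutex-guarded map keyed by integration name with a periodic flush that reports and resets each entry. The names (spanCounter, Inc, Flush) are made up for this example and are not identifiers from the PR.

package main

import (
	"fmt"
	"sync"
)

// spanCounter tracks spans started (or finished) per integration name.
type spanCounter struct {
	mu     sync.Mutex
	counts map[string]uint32
}

func newSpanCounter() *spanCounter {
	return &spanCounter{counts: make(map[string]uint32)}
}

// Inc bumps the count for one integration (e.g. "net/http" or "manual").
func (c *spanCounter) Inc(integration string) {
	c.mu.Lock()
	c.counts[integration]++
	c.mu.Unlock()
}

// Flush returns the accumulated counts and resets them, so a reporting
// goroutine can emit one statsd Count per integration at each interval.
func (c *spanCounter) Flush() map[string]uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]uint32, len(c.counts))
	for name, v := range c.counts {
		out[name] = v
		c.counts[name] = 0
	}
	return out
}

func main() {
	c := newSpanCounter()
	c.Inc("net/http")
	c.Inc("net/http")
	c.Inc("manual")
	fmt.Println(c.Flush()) // map[manual:1 net/http:2]
}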
@mtoffl01 Good points! A map would probably work better; I was hesitant at first since I didn't want to change too much of what already exists, but knowing that these metrics are pretty old... I'm more down to change it up now.
@hannahkm I'm approving this, but we should investigate why the benchmarks report increased allocations.
Overall, I definitely have some concerns 🤔 Maybe you can write some additional tests to provide peace of mind:
- Tests designed to try to make the system fail -- what happens when multiple goroutines hit the start span / finish span methods at once? Can we prove that we've protected against a race condition?
- Maybe you want to write dedicated benchmarks to show how much performance is impacted (a sketch of one is below)
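A rough sketch of such a benchmark, written against a placeholder counterMap type rather than the tracer's real internals; running it with go test -bench=. -race also exercises the race detector for the concurrency concern above.

package counters

import (
	"sync"
	"testing"
)

// counterMap is a stand-in for whatever structure the tracer ends up using.
type counterMap struct {
	mu     sync.Mutex
	counts map[string]uint32
}

func (c *counterMap) inc(name string) {
	c.mu.Lock()
	c.counts[name]++
	c.mu.Unlock()
}

// BenchmarkCounterMapParallel measures contention when many goroutines bump
// counters concurrently, approximating spans starting/finishing under load.
func BenchmarkCounterMapParallel(b *testing.B) {
	c := &counterMap{counts: make(map[string]uint32)}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.inc("net/http")
		}
	})
}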
ddtrace/tracer/tracer.go
Outdated
spansStarted, spansFinished struct {
	mu    sync.Mutex
	spans map[string]uint32
}
I do wonder whether:
- You should use sync.Map instead
- Because spans are started and finished so often, is the requirement to lock going to cause more contention (and thus degrade performance) than this integration tag is worth?
Maybe @darccio can weigh in on this.
There is a lot to be said about optimizing these sorts of things, but last time I checked sync.Map was not state of the art. Instead I would recommend taking a look at this package.
I previously coded up a similar counter map using v2 of this library here: https://github.com/felixge/countermap/blob/main/xsync_map_counter_map.go
It was like 20x faster than a naive mutex implementation and also significantly faster than sync.Map. But please try for yourself and let me know what you find. Feel free to ping me on slack (I don't always manage to keep up with GitHub). Thanks!
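For reference, a rough sketch of a counter map in that style, assuming v3 of github.com/puzpuzpuz/xsync (the linked example uses v2); the CounterMap type and its methods here are illustrative, not the linked repo's exact code.

package main

import (
	"fmt"

	"github.com/puzpuzpuz/xsync/v3"
)

type CounterMap struct {
	m *xsync.MapOf[string, int64]
}

func NewCounterMap() *CounterMap {
	return &CounterMap{m: xsync.NewMapOf[string, int64]()}
}

// Inc atomically bumps the counter for the given integration name.
func (c *CounterMap) Inc(name string) {
	c.m.Compute(name, func(old int64, _ bool) (int64, bool) {
		return old + 1, false // false: keep the entry in the map
	})
}

// Snapshot copies the current counts, resetting each entry to zero.
func (c *CounterMap) Snapshot() map[string]int64 {
	out := make(map[string]int64)
	c.m.Range(func(name string, _ int64) bool {
		v, _ := c.m.LoadAndStore(name, 0)
		out[name] = v
		return true
	})
	return out
}

func main() {
	c := NewCounterMap()
	c.Inc("net/http")
	c.Inc("manual")
	fmt.Println(c.Snapshot()) // map[manual:1 net/http:1]
}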
Still have some of the same concerns as before regarding lock contention 🤔. Maybe try some of Felix's ideas!
ddtrace/tracer/metrics.go
Outdated
t.spansStarted.mu.Lock()
for name, v := range t.spansStarted.spans {
	t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
	t.spansStarted.spans[name] = 0
Interesting. At first I was going to suggest that you use delete instead, since delete is generally more performant (constant time complexity, and it removes the key from the map, freeing up memory). But in this case it's possible and even likely that the same key will be reused in the map in the future, so setting the value to 0 may be more performant, since you won't need to re-allocate space for the key in the map. I'm curious: did you choose to set the value to 0, rather than use delete, for this reason?
Yeah, I assumed that it would be more likely for a key to be reused than to be a one-off. But I do wonder if we have data about which situation is more common (i.e. many spans being created for a certain integration over time vs. just a few, all at once).
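A small sketch of the two flush styles being compared, over a plain map[string]uint32 like the one in this diff; the function names and the report callback are made up for illustration.

package main

import "fmt"

// resetFlush keeps keys around, so a recurring integration never re-allocates
// its map entry on the next interval.
func resetFlush(counts map[string]uint32, report func(string, uint32)) {
	for name, v := range counts {
		report(name, v)
		counts[name] = 0
	}
}

// deleteFlush drops keys entirely; better when most integrations report once
// and never again, at the cost of re-inserting hot keys each interval.
func deleteFlush(counts map[string]uint32, report func(string, uint32)) {
	for name, v := range counts {
		report(name, v)
		delete(counts, name)
	}
}

func main() {
	counts := map[string]uint32{"net/http": 3, "manual": 1}
	resetFlush(counts, func(name string, v uint32) {
		fmt.Printf("spans_started integration:%s = %d\n", name, v)
	})
	fmt.Println(counts) // keys retained with zero values: map[manual:0 net/http:0]
}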
ddtrace/tracer/metrics.go
Outdated
t.statsd.Count("datadog.tracer.spans_finished", int64(atomic.SwapUint32(&t.spansFinished, 0)), nil, 1) | ||
t.spansStarted.mu.Lock() | ||
for name, v := range t.spansStarted.spans { | ||
t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1) |
I think you may want to use Incr here instead.
Hm, why use Incr? If there has been more than one span started from a certain integration, the value of v should be greater than 1, so we would want to use Count.
ddtrace/tracer/metrics.go
Outdated
t.spansStarted.mu.Unlock()
t.spansFinished.mu.Lock()
for name, v := range t.spansFinished.spans {
	t.statsd.Count("datadog.tracer.spans_finished", int64(v), []string{"integration:" + name}, 1)
Same as line 98: Use Incr.
assert.Equal(counts["datadog.tracer.spans_started"], int64(1))
for _, c := range statsdtest.FilterCallsByName(tg.CountCalls(), "datadog.tracer.spans_started") {
	if slices.Equal(c.Tags(), []string{"integration:contrib"}) {
		return
Same comment as line 112.
Co-authored-by: Mikayla Toffler <[email protected]>
ddtrace/tracer/tracer.go
Outdated
spansStarted, spansFinished, tracesDropped uint32
// These maps keep count of the number of spans started and finished from
// each component, including contribs and "manual" spans.
spansStarted, spansFinished *xsync.MapOf[string, int64]
To fix the perf issue you're seeing, try using an *atomic.Int64 type instead of int64 for the map value. Also use the Load + LoadOrStore methods on the map to do the update, as shown in the example I posted earlier.
After the initial phase of initializing the map keys, the code will end up taking only the Load path, which is much faster than Compute. The latter always needs to take a lock, whereas the former uses optimistic concurrency control that requires no locks or retries during steady state.
Let me know if that works.
(I know you mentioned you tried Load+Store in your Slack DM, but I suspect you didn't try with an *atomic.Int64 pointer like I'm suggesting?)
What does this PR do?
Add an integration tag to the existing datadog.tracer.spans_started and datadog.tracer.spans_finished metrics. The value of the tag will be the name of the component from which the span was started. For example, for a contrib, it will be the name of the contrib package (chi, net/http, etc.). For spans that were created manually, the tag will say manual.
Motivation
We want to know, in addition to when a span is started, where the span originated from. This could be a contrib or a manual implementation.
Reviewer's Checklist
v2-dev branch and reviewed by @DataDog/apm-go.
Unsure? Have a question? Request a review!