
ddtrace/tracer: add integration tag to spans_started/finished #3023

Open
hannahkm wants to merge 32 commits into main from apm-rd/span-source-health-metric

Conversation

@hannahkm (Contributor) commented Dec 10, 2024

What does this PR do?

Add an integration tag to the existing datadog.tracer.spans_started and datadog.tracer.spans_finished metrics. The value of the tag will be the name of the component from which the span was started. For example, for a contrib, it will be the name of the contrib package (chi, net/http, etc.). For spans that were created manually, the tag will say manual.

Motivation

We want to know, in addition to when a span is started, where the span originated: a contrib or a manual implementation.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.
  • For internal contributors, a matching PR should be created to the v2-dev branch and reviewed by @DataDog/apm-go.

Unsure? Have a question? Request a review!

@datadog-datadog-prod-us1 bot commented Dec 10, 2024

Datadog Report

Branch report: apm-rd/span-source-health-metric
Commit report: 50b5f15
Test service: dd-trace-go

✅ 0 Failed, 5127 Passed, 70 Skipped, 2m 53.72s Total Time
❄️ 2 New Flaky

New Flaky Tests (2)

  • TestHealthMetricsRaceCondition - gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer - Last Failure

     
     === RUN   TestHealthMetricsRaceCondition
         metrics_test.go:213: 
             	Error Trace:	/home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/metrics_test.go:213
             	Error:      	Not equal: 
             	            	expected: 5
             	            	actual  : 4
             	Test:       	TestHealthMetricsRaceCondition
     --- FAIL: TestHealthMetricsRaceCondition (0.16s)
    
  • TestReportHealthMetricsAtInterval - gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer - Last Failure

     
     === RUN   TestReportHealthMetricsAtInterval
         metrics_test.go:65: 
             	Error Trace:	/home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/metrics_test.go:65
             	Error:      	Not equal: 
             	            	expected: 1
             	            	actual  : 0
             	Test:       	TestReportHealthMetricsAtInterval
         metrics_test.go:66: 
     ...
    

@pr-commenter bot commented Dec 10, 2024

Benchmarks

Benchmark execution time: 2025-01-08 18:52:07

Comparing candidate commit f4e7820 in PR branch apm-rd/span-source-health-metric with baseline commit 4f57a47 in branch main.

Found 2 performance improvements and 9 performance regressions! Performance is the same for 48 metrics, 0 unstable metrics.

scenario:BenchmarkHttpServeTrace-24

  • 🟥 execution_time [+521.960ns; +611.440ns] or [+3.277%; +3.839%]

scenario:BenchmarkInjectW3C-24

  • 🟥 execution_time [+171.438ns; +204.362ns] or [+4.275%; +5.096%]

scenario:BenchmarkOTelApiWithCustomTags/datadog_otel_api-24

  • 🟥 execution_time [+174.136ns; +200.064ns] or [+3.688%; +4.237%]

scenario:BenchmarkOTelApiWithCustomTags/otel_api-24

  • 🟥 execution_time [+321.193ns; +383.807ns] or [+4.529%; +5.412%]

scenario:BenchmarkPartialFlushing/Disabled-24

  • 🟥 execution_time [+12.557ms; +15.539ms] or [+4.640%; +5.743%]

scenario:BenchmarkPartialFlushing/Enabled-24

  • 🟥 execution_time [+10.812ms; +14.014ms] or [+3.928%; +5.091%]

scenario:BenchmarkSetTagString-24

  • 🟩 execution_time [-9.201ns; -6.739ns] or [-7.468%; -5.470%]

scenario:BenchmarkSetTagStringer-24

  • 🟩 execution_time [-7.251ns; -4.749ns] or [-4.811%; -3.151%]

scenario:BenchmarkSingleSpanRetention/no-rules-24

  • 🟥 execution_time [+6.966µs; +9.023µs] or [+2.897%; +3.753%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-all-24

  • 🟥 execution_time [+10.346µs; +12.554µs] or [+4.336%; +5.261%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-half-24

  • 🟥 execution_time [+10.048µs; +12.087µs] or [+4.204%; +5.057%]

@hannahkm changed the title from "ddtrace/tracer: add source tag to spans_started health metric" to "ddtrace/tracer: add integration tag to spans_started/finished" on Dec 12, 2024
@github-actions bot added the apm:ecosystem (contrib/* related feature requests or bugs) label on Dec 12, 2024
@mtoffl01 (Contributor) left a comment

OK, so you're reporting spansStarted/spansFinished on span.Start/span.Finished if the integration is not empty, and leaving the chunk reporting to any spans that are manual... I understand why you did this, but I'm not totally sure about the approach.

span.Start and span.Stop are typically called quite frequently, so if a majority of the spans are from automatic integrations, this will be very noisy (and defeats the purpose of reporting the metrics at a specified interval, to reduce noise)

One alternative idea:
Change the way we track spansStarted and spansFinished to be some kind of counter map that includes the integration name, e.g. map[string]uint32 where the key is the integration name and the value is the count of spans started/finished for that integration. Then, in the reporting goroutine, we'll iterate over the map and report the spans started/finished per integration
(or some other idea I haven't thought of?)
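
A rough sketch of that counter-map idea, assuming a mutex-protected map and a hypothetical statsdClient interface standing in for the real statsd client (names are illustrative, not the PR's actual code):

    import "sync"

    // statsdClient is a stand-in for the statsd client's Count method.
    type statsdClient interface {
        Count(name string, value int64, tags []string, rate float64) error
    }

    type spanCounter struct {
        mu     sync.Mutex
        counts map[string]uint32 // keyed by integration name, e.g. "chi" or "manual"
    }

    // inc is called on every span start/finish and only touches the in-memory map.
    func (c *spanCounter) inc(integration string) {
        c.mu.Lock()
        c.counts[integration]++
        c.mu.Unlock()
    }

    // flush runs from the periodic reporting goroutine: one count per integration,
    // then the counters are reset for the next interval.
    func (c *spanCounter) flush(client statsdClient, metric string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        for name, v := range c.counts {
            client.Count(metric, int64(v), []string{"integration:" + name}, 1)
            c.counts[name] = 0
        }
    }

The mutex keeps the read-modify-write on the map safe across goroutines; the later review discussion explores lock-free alternatives because this lock sits on the span start/finish hot path.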

@hannahkm (Contributor, Author) commented

@mtoffl01 Good points! A map would probably work better; I was hesitant at first since I didn't want to change too much of what already exists, but knowing that these metrics are pretty old... I'm more down to change it up now.

@hannahkm marked this pull request as ready for review on December 19, 2024 21:27
@hannahkm requested review from a team as code owners on December 19, 2024 21:27
@darccio (Member) commented Dec 20, 2024

@hannahkm I'm approving this, but we should investigate why the benchmarks report increased allocations.

@mtoffl01 (Contributor) left a comment

Overall, I definitely have some concerns 🤔 Maybe you can write some additional tests to provide peace of mind....

  1. Tests designed to try and make the system fail -- what happens when multiple goroutines call the start span / finish span methods at the same time? Can we prove that we've protected against a race condition?
  2. Maybe you want to write dedicated benchmarks to show how much performance is impacted (rough sketches of both follow below).
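
Rough sketches of both suggestions (imports: sync, testing), using a plain mutex-protected counter rather than the PR's actual types; the test relies on go test -race to surface unsynchronized access, and the parallel benchmark gives a feel for lock contention on the hot path:

    // Hammer the counter from many goroutines; `go test -race` flags any
    // unprotected access, and the final assertion catches lost increments.
    func TestConcurrentSpanCounting(t *testing.T) {
        var (
            mu     sync.Mutex
            counts = map[string]uint32{}
        )
        inc := func(integration string) {
            mu.Lock()
            counts[integration]++
            mu.Unlock()
        }

        const n = 1000
        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                inc("manual")
            }()
        }
        wg.Wait()

        if counts["manual"] != n {
            t.Fatalf("expected %d spans counted, got %d", n, counts["manual"])
        }
    }

    // Measure how much the shared lock costs when every "span start" contends on it.
    func BenchmarkConcurrentSpanCounting(b *testing.B) {
        var (
            mu     sync.Mutex
            counts = map[string]uint32{}
        )
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.Lock()
                counts["manual"]++
                mu.Unlock()
            }
        })
    }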

    spansStarted, spansFinished struct {
        mu    sync.Mutex
        spans map[string]uint32
    }
A Contributor commented:

I do wonder whether:

  1. You should use sync.Map instead
  2. Because spans are started and finished so often, is the requirement to lock going to cause more contention (and thus degrade performance) than this integration tag is worth?

Maybe @darccio can weigh in on this.

@felixge (Member) commented:

⚠️ Yes, I think this is potentially a dangerous place to put a heavy sync.Mutex. The benchmark platform seems to have detected it as a problem as well:

[screenshot: benchmark platform report for this PR flagging the regression, 2025-01-03]

There is a lot to be said about optimizing these sorts of things, but last time I checked sync.Map was not state of the art. Instead I would recommend taking a look at this package.

I previously coded up a similar counter map using v2 of this library here: https://github.com/felixge/countermap/blob/main/xsync_map_counter_map.go

It was like 20x faster than a naive mutex implementation and also significantly faster than sync.Map. But please try for yourself and let me know what you find. Feel free to ping me on slack (I don't always manage to keep up with GitHub). Thanks!
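
For reference, a sketch in the spirit of the linked countermap example, assuming the github.com/puzpuzpuz/xsync/v3 API (the exact package and version the PR ends up using may differ); values are *atomic.Int64 so the steady-state increment takes no locks:

    import (
        "sync/atomic"

        "github.com/puzpuzpuz/xsync/v3"
    )

    type counterMap struct {
        counts *xsync.MapOf[string, *atomic.Int64]
    }

    func newCounterMap() *counterMap {
        return &counterMap{counts: xsync.NewMapOf[string, *atomic.Int64]()}
    }

    // inc takes the lock-free Load path once the key exists; LoadOrStore only
    // runs the first time a given integration is seen.
    func (c *counterMap) inc(integration string) {
        if v, ok := c.counts.Load(integration); ok {
            v.Add(1)
            return
        }
        v, _ := c.counts.LoadOrStore(integration, new(atomic.Int64))
        v.Add(1)
    }

    // getAndReset snapshots and zeroes the counters for the periodic flush.
    func (c *counterMap) getAndReset() map[string]int64 {
        out := make(map[string]int64)
        c.counts.Range(func(name string, v *atomic.Int64) bool {
            if n := v.Swap(0); n > 0 {
                out[name] = n
            }
            return true
        })
        return out
    }

The flush swaps each counter back to zero rather than deleting the key, so an integration that keeps producing spans reuses its existing entry.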

@mtoffl01 (Contributor) left a comment

Still have some of the same concerns as before regarding lock contention 🤔 . Maybe try some of Felix's ideas!

    t.spansStarted.mu.Lock()
    for name, v := range t.spansStarted.spans {
        t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
        t.spansStarted.spans[name] = 0
A Contributor commented:

Interesting. At first I was going to suggest using delete instead, since delete is generally more performant (constant time complexity, and it removes the key from the map, freeing up memory). But in this case it's possible, and even likely, that the same key will be reused in the future, so setting the value to 0 may be more performant since you won't need to re-allocate space for the key in the map. I'm curious: did you choose to set the value to 0, rather than use delete, for this reason?

@hannahkm (Contributor, Author) replied:

Yeah, I assumed that it would be more likely for a key to be reused than to be a one-off. But I do wonder whether we have data about which situation is more common (i.e. many spans being created for a certain integration vs. just a few, all at once).
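
For illustration, the delete alternative under discussion would only change the body of the flush loop quoted above (a sketch against the same fields):

    for name, v := range t.spansStarted.spans {
        t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
        // t.spansStarted.spans[name] = 0  // current approach: the key stays allocated for reuse
        delete(t.spansStarted.spans, name) // alternative: the entry is freed until that integration reappears
    }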

    t.statsd.Count("datadog.tracer.spans_finished", int64(atomic.SwapUint32(&t.spansFinished, 0)), nil, 1)
    t.spansStarted.mu.Lock()
    for name, v := range t.spansStarted.spans {
        t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
A Contributor commented:

I think you may want to use Incr here instead.

@hannahkm (Contributor, Author) replied:

Hm, why use Incr? If there has been more than one span started from a certain integration, the value of v should be greater than 1, so we would want to use Count.
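
For context, in the datadog-go statsd client Incr(name, tags, rate) is shorthand for Count(name, 1, tags, rate), so the aggregated per-interval delta fits a single Count call (a sketch reusing the hypothetical statsdClient interface from the earlier counter-map sketch):

    // reportStarted flushes the per-interval aggregate for one integration.
    // A single Count call covers all v spans counted since the last flush,
    // which would otherwise take v separate Incr calls.
    func reportStarted(client statsdClient, integration string, v uint32) {
        client.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + integration}, 1)
    }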

    t.spansStarted.mu.Unlock()
    t.spansFinished.mu.Lock()
    for name, v := range t.spansFinished.spans {
        t.statsd.Count("datadog.tracer.spans_finished", int64(v), []string{"integration:" + name}, 1)
A Contributor commented:

Same as line 98: Use Incr.

    assert.Equal(counts["datadog.tracer.spans_started"], int64(1))
    for _, c := range statsdtest.FilterCallsByName(tg.CountCalls(), "datadog.tracer.spans_started") {
        if slices.Equal(c.Tags(), []string{"integration:contrib"}) {
            return
A Contributor commented:

same comment as line 112

    spansStarted, spansFinished, tracesDropped uint32
    // These maps keep count of the number of spans started and finished from
    // each component, including contribs and "manual" spans.
    spansStarted, spansFinished *xsync.MapOf[string, int64]
@felixge (Member) commented Jan 7, 2025

To fix the perf issue you're seeing, try using an *atomic.Int64 type instead of int64 for the map value. Also use the Load+LoadOrStore methods on the map to do the update, as shown in the example I posted earlier.

After the initial phase of initializing the map keys, the code will end up taking only the Load path which is much faster than Compute. The latter always needs to take a lock whereas the former uses optimistic concurrency control that requires no locks or retries during steady state.

Let me know if that works.

(I know you mentioned you tried Load+Store in your slack DM, but I suspect you didn't try with an *atomic.Int64 pointer like I'm suggesting?)
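
A minimal sketch of the Compute path being compared against, again assuming the xsync v3 API; the Compute callback runs under the map's internal bucket lock on every call, whereas the Load-first increment (as in the counterMap sketch earlier) stays lock-free once a key exists:

    // Every increment via Compute synchronizes on the map's bucket, so each
    // span start pays for a lock even when the key already exists.
    func incCompute(m *xsync.MapOf[string, int64], integration string) {
        m.Compute(integration, func(old int64, _ bool) (int64, bool) {
            return old + 1, false // second return value: don't delete the key
        })
    }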
