Fix orphaned spans on Celery workers #822
Comments
@timmc-edx: I wanted to bring your attention to Alex's note from Slack:
> This might be additional info about things that are off on the spans you will be looking into. If not, you can ignore as far as this ticket is concerned, other than to report back your findings eventually. Thanks.
**Numbers across resources**

On prod LMS for the past 2 days, all …
Key: [A] = always top-level; [S] = sometimes top-level, sometimes child

**Numbers drilldown**

Filtering on the most common top-level `celery.apply` over what I hope is a representative smaller time window in prod (no spikes):
So we have:
However, over a different time period that contained some spikes:
Here, the …

**Trace-level analysis**

The transmit task has a recalculate span as parent. Analysis of this relationship over a recent time period:
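The trace data behind that analysis isn't reproduced above, but as a rough illustration of how such a parent/child relationship arises: one Celery task enqueues another, and with distributed tracing enabled the producer-side `celery.apply` span becomes the parent of the downstream task's work. A minimal sketch (task names, queue, and broker URL are hypothetical, not the actual edx-platform tasks):

```python
# Minimal sketch, not the actual edx-platform tasks: "recalculate" enqueues
# "transmit", which is how the transmit task can end up with a recalculate
# span as its parent in the trace.
from celery import Celery

app = Celery("sketch", broker="redis://localhost:6379/0")  # hypothetical broker


@app.task(name="sketch.transmit")
def transmit(result_id):
    # Worker-side work; its celery.run span should parent back to the
    # celery.apply span created when recalculate() called .delay() below.
    print(f"transmitting {result_id}")


@app.task(name="sketch.recalculate")
def recalculate(course_id):
    # Producer side: enqueuing the follow-up task is where the
    # parent/child link between the two tasks comes from.
    transmit.delay(f"grades:{course_id}")
```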
Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support.
Answering the question "do other celery workers have this problem?" …
Results, filtered down to edX services:
So edxapp has multiple kinds of top-level spans, but the other workers have at most …
A proposed fix was released in ddtrace 2.17.3 and we deployed it today, but it doesn't seem to have helped.
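(As a quick sanity check that a worker actually picked up the new tracer after a deploy, something like the following can be run in the worker environment; this is a generic sketch, not tooling from this ticket.)

```python
# Sketch: report which ddtrace version this environment is actually running.
from importlib.metadata import version

print("ddtrace version:", version("ddtrace"))  # expected to show 2.17.3 after the deploy
```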
(Also posted on DD support ticket.)

Some good but mixed news: I just checked the graphs again, and apparently the issue has partially disappeared, coinciding with the first deploy after our winter deployment freeze.

I created a dashboard to better track the issue, with two graphs: one to look at all top-level spans on an affected Celery worker (broken out by operation type, and excluding `celery.run`), and another showing the version of edx-platform code running on the worker (to check when deploys happened, although it's not guaranteed to show all of them).

At the time of the Jan 6 deploy, a number of operation names stop appearing as top-level spans: … However, the following continue to appear as top-level spans: …

The previous deployment (76cceaa-6643) occurred on Dec 20 and picked up ddtrace 2.18.0. The newer deployment (b510cef-6659) on Jan 6 picked up ddtrace 2.18.1. It's possible that this was caused by a config change on our side, and I'll need to check that hypothesis too. However, 2.18.1 does include a Celery-related bugfix, and I wonder if that might have been the culprit here.

There are still orphaned spans, mostly from redis, but this should improve observability a good deal already.

Comparing before/after deploys:
ddtrace is the only plausible-looking change.
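For the before/after dependency comparison, one generic approach (a sketch, not the exact tooling used here; the file names are hypothetical) is to diff the frozen requirements captured from the two deployments:

```python
# Sketch: diff two pip-freeze style dependency lists (e.g. from the Dec 20 and
# Jan 6 deployments) and print only the packages whose pinned version changed.
def parse_freeze(path):
    pins = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "==" in line:
                name, ver = line.split("==", 1)
                pins[name.lower()] = ver
    return pins


before = parse_freeze("freeze-76cceaa-6643.txt")  # hypothetical capture from the old deploy
after = parse_freeze("freeze-b510cef-6659.txt")   # hypothetical capture from the new deploy

for name in sorted(set(before) | set(after)):
    if before.get(name) != after.get(name):
        print(f"{name}: {before.get(name)} -> {after.get(name)}")
```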
Datadog Support suggested that we try enabling `DD_CELERY_DISTRIBUTED_TRACING` to see if it changes anything about the orphaned spans on the edxapp workers (unexpected top-level spans that are not `operation_name:celery.run`, but instead are other spans that have lost their parent association). Just going to enable this on stage so we don't mess with traces in prod, for now. See edx/edx-arch-experiments#822
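For reference, a minimal sketch of turning the setting on. The environment variable is the one named in this ticket; the in-code `ddtrace.config.celery` toggle is my assumption about the equivalent programmatic form and should be checked against the ddtrace docs for the version actually deployed:

```python
# Sketch: enable Celery distributed tracing so the worker-side celery.run span
# continues the producer's trace (via the celery.apply span) instead of
# starting a fresh one.
#
# Option 1 (env var, as named in this ticket), set in the worker environment:
#   DD_CELERY_DISTRIBUTED_TRACING=true
#
# Option 2 (in code, assumed equivalent -- verify against the ddtrace docs):
from ddtrace import config, patch

config.celery["distributed_tracing"] = True  # assumed config key for the Celery integration
patch(celery=True)
```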
We're seeing orphaned spans on several of our Celery workers, identifiable as service entry spans that are not `operation_name:celery.run`. We noticed this because of missing code owner on root spans. We've restricted our monitors to just `celery.run` root spans, but we still want to fix this because about 10% of traces are broken.

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support. Update: Their proposed fix: DataDog/dd-trace-py#11272
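To make "orphaned" concrete, here is a hedged, generic sketch: given the spans of a single trace (the span shape and field names are assumptions for illustration, not a Datadog export format), flag spans whose parent id doesn't resolve within the trace and whose operation name isn't `celery.run`:

```python
# Sketch: find "orphaned" spans in one trace -- spans that reference a parent
# that isn't present and that aren't the expected celery.run service entry span.
# The span structure here is assumed for illustration only.
from typing import Optional, TypedDict


class Span(TypedDict):
    span_id: int
    parent_id: Optional[int]  # None for a genuine root span
    name: str                 # e.g. "celery.run", "celery.apply", "redis.command"


def find_orphans(trace: list[Span]) -> list[Span]:
    span_ids = {s["span_id"] for s in trace}
    return [
        s for s in trace
        if s["name"] != "celery.run"
        and (s["parent_id"] is None or s["parent_id"] not in span_ids)
    ]


# Toy example: the second redis span points at a parent that never arrived.
trace: list[Span] = [
    {"span_id": 1, "parent_id": None, "name": "celery.run"},
    {"span_id": 2, "parent_id": 1, "name": "redis.command"},
    {"span_id": 3, "parent_id": 99, "name": "redis.command"},  # orphaned
]
print([s["name"] for s in find_orphans(trace)])  # -> ['redis.command']
```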
**Acceptance criteria**

- Orphaned spans are gone (only `operation_name:celery.run` when looking at top-level spans)

**Things to try**

- `DD_CELERY_DISTRIBUTED_TRACING` in stage or edge?