Fix orphaned spans on Celery workers #822

Open
robrap opened this issue Oct 8, 2024 · 6 comments

robrap commented Oct 8, 2024

We're seeing orphaned spans on several of our Celery workers, identifiable as service entry spans that are not operation_name:celery.run:

| Service | Operation names |
| --- | --- |
| edx-edxapp-lms-workers | celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode |
| edx-edxapp-cms-workers | celery.apply, get_course.split_modulestore, django.template.render |
| edx-ai-translations-worker | celery.apply |
| license-manager-worker | celery.apply |
| edx-video-encode-manager-worker | celery.apply |

We noticed this because of missing code owner on root spans. We've restricted our monitors to just celery.run root spans, but we still want to fix this because about 10% of traces are broken.
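
For reference, a span query of roughly this shape is what surfaces these, using the same facets that appear in the comments below (@_top_level for local-root/service-entry spans, operation_name, service); the service shown is just one of the affected workers:

```
env:prod service:edx-edxapp-lms-workers @_top_level:* -operation_name:celery.run
```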

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support. Update: Their proposed fix: DataDog/dd-trace-py#11272

Acceptance criteria

Things to try

  • Are the broken links between run/apply happening across different queues?
    • Try to identify one of these broken pairs and look for any other features that might be different from a "generic" celery scenario.
  • Try enabling DD_CELERY_DISTRIBUTED_TRACING in stage or edge?
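
For illustration only, a minimal sketch of what enabling that flag involves, assuming ddtrace's convention of reading DD_* settings from the environment before the tracer initializes (in practice it would be set in the worker's deployment config, as in the stage-only commit referenced at the end of this issue, not in application code):

```python
import os

# DD_CELERY_DISTRIBUTED_TRACING asks ddtrace's Celery integration to propagate
# trace context from the producer (apply_async) to the consumer (run), so the
# worker's spans join the caller's trace instead of starting their own.
# It must be in the environment before ddtrace initializes.
os.environ.setdefault("DD_CELERY_DISTRIBUTED_TRACING", "true")
```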
robrap added this to Arch-BOM Oct 8, 2024
robrap converted this from a draft issue Oct 8, 2024

robrap commented Oct 8, 2024

@timmc-edx: I wanted to bring your attention to Alex's note from Slack:

> Edit: ah right, it's in the edx-edxapp-lms-workers APM. But why is the apply_async operation associated with the worker? Isn't the webapp the thing that calls apply_async() and does the redis operation to put the task and its args/kwargs on the queue?

This might be additional info about things that are off in the spans you will be looking into. If not, you can ignore it as far as this ticket is concerned, other than to report back your findings eventually. Thanks.


timmc-edx commented Oct 8, 2024

Numbers across resources

On prod LMS for the past 2 days, all celery.apply resource names that are at least sometimes top-level [query], in descending order by frequency:

[A] integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
[S] openedx.core.djangoapps.credentials.tasks.v1.tasks.send_grade_to_credentials
[S] integrated_channels.integrated_channel.tasks.transmit_single_learner_data
[A] openedx.core.djangoapps.programs.tasks.award_program_certificates
[S] openassessment.workflow.tasks.update_workflow_for_submission_task
[S] lms.djangoapps.gating.tasks.task_evaluate_subsection_completion_milestones
[S] openedx.core.djangoapps.programs.tasks.award_course_certificate
[S] openassessment.workflow.tasks.update_workflows_for_ora_block_task
[S] lms.djangoapps.certificates.tasks.generate_certificate
[A] openassessment.workflow.tasks.update_workflows_for_course_task
[S] openedx.core.djangoapps.notifications.tasks.send_notifications
[A] lms.djangoapps.grades.tasks.compute_grades_for_course_v2
[S] lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
[S] lms.djangoapps.bulk_email.tasks.send_course_email
[A] openedx.core.djangoapps.programs.tasks.revoke_program_certificates

Key: [A] = always top-level; [S] = sometimes top-level, sometimes child
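
For reference, a span search of roughly this shape (the exact filter is an assumption; it uses the same facets as the drilldown below, grouped by resource_name) would produce a list like this:

```
env:prod service:edx-edxapp-lms-workers operation_name:celery.apply @_top_level:*
```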

Numbers drilldown

Filtering on the most common top-level celery.apply over what I hope is a representative smaller time window in prod (no spikes):

  • env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
    • @_top_level:* (local root)
      • operation_name:celery.run service:edx-edxapp-lms-workers 35,065
      • operation_name:celery.apply service:edx-edxapp-lms-workers 3,104
    • -@_top_level:* (child)
      • operation_name:celery.apply service:edx-edxapp-lms-workers 31,711
      • operation_name:celery.apply service:edx-edxapp-lms 30

So we have:

  • A handful of celery.apply spans on the webapp; three orders of magnitude fewer than the total on the workers. Probably mostly kicked off as a result of a different celery task. No celery.run on the webapp, as expected since the task is invoked with apply_async.
  • On the worker, about as many celery.apply spans as celery.run.
  • On the worker, celery.apply is about 10x more common as a child span than as a root span. (Spot check: the parent is lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3; see the sketch after this list.)
  • The total celery.apply count (root + child) matches the celery.run count to within 1%.
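
A hypothetical sketch (simplified stand-ins for the real task names, with an in-memory broker just for illustration) of why celery.apply spans appear on the worker service at all: a task already running on a worker enqueues another task, and ddtrace records the producer-side celery.apply span under the worker's own service.

```python
from celery import Celery

app = Celery("sketch", broker="memory://")  # hypothetical app/broker for the sketch

@app.task
def transmit_learner_data(subsection_id):
    ...  # stands in for integrated_channels...transmit_single_subsection_learner_data

@app.task
def recalculate_subsection_grade(subsection_id):
    # Runs on a worker as a celery.run span. The enqueue below is what ddtrace
    # traces as a celery.apply span attributed to the worker service, which is
    # why the workers show celery.apply spans at all, and why (when linked) the
    # parent of such a span is a celery.run span.
    transmit_learner_data.apply_async(args=[subsection_id])
```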

However, over a different time period that contained some spikes:

  • env:prod resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
    • @_top_level:* (local root)
      • operation_name:celery.run service:edx-edxapp-lms-workers 1,179
      • operation_name:celery.apply service:edx-edxapp-lms-workers 937
    • -@_top_level:* (child)
      • operation_name:celery.apply service:edx-edxapp-lms-workers 1,470
      • operation_name:celery.apply service:edx-edxapp-lms 1

Here, the celery.apply count is over twice the celery.run count, and celery.apply is roughly as common as a child as it is as a root span.

Trace-level analysis

The transmit task has a recalculate span as parent. Analysis of this relationship over a recent time period:

a = env:prod service:edx-edxapp-lms-workers resource_name:lms.djangoapps.grades.tasks.recalculate_subsection_grade_v3
b = env:prod service:edx-edxapp-lms-workers resource_name:integrated_channels.integrated_channel.tasks.transmit_single_subsection_learner_data
  • When there is a parent/child relationship (a => b), the parent is always celery.run and the child is celery.apply.
  • I'd like to check whether there are cases where the parent does not have this child (a && NOT(a => b)), but Datadog has a bug that makes it silently return zero results. So instead, we'll have to look for traces with a and b missing/present anywhere and assume that when both are present, it's always parent/child.
    • a && b 1.58k
    • a && NOT(b) 6.62k
    • NOT(a) && b 4.58k
  • It's possible that a doesn't always call b—or that b is being orphaned some of the time. Unclear.
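
For clarity, the trace-level queries described above, spelled out in the same notation with the counts as reported:

```
a => b             parent/child pairs; the parent is always celery.run, the child celery.apply
a && NOT(a => b)   the check I wanted, but it silently returns zero results (Datadog bug)
a && b             1.58k   both spans present somewhere in the trace
a && NOT(b)        6.62k   recalculate present, transmit absent
NOT(a) && b        4.58k   transmit present without recalculate (orphaned, or enqueued some other way)
```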

timmc-edx commented

Filed https://help.datadoghq.com/hc/en-us/requests/1877349 ("Orphaned spans on celery worker") with Datadog Support.


timmc-edx commented Oct 10, 2024

Answering the question "do other celery workers have this problem?":

  • Services that have top-level celery.run, which are therefore probably celery workers: Service entry spans, by service, matching env:prod operation_name:celery.run
  • The subset of those that also have other top-level spans: Service entry spans, by service and operation name, matching env:prod -operation_name:celery.run service:(a OR b OR ...)

Results, filtered down to edX services:

| Service | Operation names |
| --- | --- |
| edx-edxapp-lms-workers | celery.apply, has_access, get_course.split_modulestore, enqueue_send_task, mako.template.render_unicode |
| edx-edxapp-cms-workers | celery.apply, get_course.split_modulestore, django.template.render |
| edx-ai-translations-worker | celery.apply |
| license-manager-worker | celery.apply |
| edx-video-encode-manager-worker | celery.apply |

So edxapp has multiple kinds of top-level spans, but the other workers have at most celery.apply. (That doesn't necessarily mean that edxapp is different, though, other than having a wide variety of spans that can be orphaned.)

timmc-edx changed the title from "Discovery: Do we have orphaned celery.apply spans" to "Fix orphaned spans on Celery workers" Oct 15, 2024
timmc-edx moved this from In Progress to Blocked in Arch-BOM Oct 15, 2024
timmc-edx commented

A proposed fix was released in ddtrace 2.17.3 and we deployed it today, but it doesn't seem to have helped.

dianakhuang moved this from Blocked to Ready For Development in Arch-BOM Jan 6, 2025

timmc-edx commented Jan 9, 2025

(Also posted on DD support ticket.)

Some good but mixed news -- I just checked the graphs again, and apparently the issue has partially disappeared, coinciding with the first deploy after our winter deployment freeze.

I created a dashboard to better track the issue, with two graphs: One to look at all top-level spans on an affected Celery worker (broken out by operation type, and excluding celery.run), and another showing the version of edx-platform code running on the worker (to check when deploys happened, although it's not guaranteed to show all of them).

At the time of the Jan 6 deploy, a number of operation names stopped appearing as top-level spans: celery.apply, mysql.query, django.cache, MySQLdb.connection.commit, has_access, kafka.produce.

However, the following continue to appear as top-level spans: redis.command, requests.request, s3.command

The previous deployment (76cceaa-6643) occurred on Dec 20 and picked up ddtrace-2.18.0. The newer deployment (b510cef-6659) on Jan 6 picked up ddtrace-2.18.1. It's possible that this was caused by a config change on our side, and I'll need to check that hypothesis too. However, 2.18.1 does include a celery-related bugfix, and I wonder if that might have been the culprit here. There are still orphaned spans, mostly from redis, but this should improve observability a good deal already.


Comparing before/after deploys:

ddtrace is the only plausible-looking change.

timmc-edx moved this from Ready For Development to In Progress in Arch-BOM Jan 9, 2025
timmc-edx added a commit to edx/configuration that referenced this issue Jan 9, 2025
Datadog Support suggested that we try enabling this to see if it changes
anything about the orphaned spans on the edxapp workers (unexpected
top-level spans that are not `operation_name:celery.run`, but instead are
other spans that have lost their parent association).

Just going to enable this on stage so we don't mess with traces in prod,
for now.

See edx/edx-arch-experiments#822