Deadlock when using ProcessPoolExecutor with stack v2 profiler #11762
Comments
Whoa! Very meaningful issue, @oranav. Deepest apologies for the inconvenience, but thank you for the report. Will work on this ASAP.
Took a careful look at this, and thanks to some insight from @taegyunkim I think #11768 fixes it.
Will update this thread with
@oranav, thanks again for the comprehensive overview, the thoughtful discussion, and the effective reproduction. This really improves our ability to provide a speedy fix. 🙇
Thanks @sanchda! Indeed, one of my suspicions was that the process forks while the mutex is held, but I thought forking must happen while the GIL is held, and I thought Echion's mutices are all taken under the GIL, but apparently I missed something :)
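The suspicion above (a fork happening while another thread holds a mutex) can be illustrated with a small pure-Python sketch. This is only an analogy: it uses `threading.Lock` rather than the native mutexes in Echion or stack v2, and the function name `child_can_acquire_after_fork` is hypothetical. The child inherits the lock in its locked state, but the thread that would release it does not exist in the child, so a blocking acquire there would never return.

```python
import os
import threading


def child_can_acquire_after_fork(timeout=0.5):
    """Fork while a helper thread holds a lock; report whether the child can take it."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()       # tell the main thread the lock is now held
            release.wait()   # keep holding the lock across the fork

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread survives, so nobody will ever
        # release the inherited lock. Use a timeout instead of hanging.
        ok = lock.acquire(timeout=timeout)
        os._exit(0 if ok else 1)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    t.join()
    return os.WEXITSTATUS(status) == 0
```

On a POSIX system this is expected to return `False`: the acquire in the child times out, which is the bounded equivalent of the hang described in this issue.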
I'm not sure why it took so long to surface this defect, but it turns out that stack v2 can deadlock applications because not all mutices are reset. The repro in #11762 appears to be pretty durable. I need to investigate it a bit more in order to distill it down into a native stress test we can use moving forward.

In practice, this patch suppresses the noted behavior in the repro.

## Checklist
- [X] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Taegyun Kim
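The idea of resetting mutexes after a fork can be sketched in Python as follows. This is not the code from #11768, which changes native code; it is a minimal analogy assuming a hypothetical `ForkSafeLock` wrapper that uses `os.register_at_fork` to swap in a fresh, unlocked lock inside each forked child. The closest native counterpart would be a handler registered with `pthread_atfork`, though whether the actual patch uses that mechanism is not stated in this thread.

```python
import os
import threading


class ForkSafeLock:
    """Hypothetical wrapper: forked children get a fresh, unlocked lock."""

    def __init__(self):
        self._lock = threading.Lock()
        # Runs in every forked child before os.fork() returns there.
        # Handlers cannot be unregistered, so this keeps the object alive.
        os.register_at_fork(after_in_child=self._reset)

    def _reset(self):
        # The inherited lock may be held by a thread that no longer exists.
        self._lock = threading.Lock()

    def acquire(self, timeout=-1):
        return self._lock.acquire(timeout=timeout)

    def release(self):
        self._lock.release()


def child_acquires_after_reset(timeout=0.5):
    """Fork while a helper thread holds the lock; the child should still get it."""
    lock = ForkSafeLock()
    held = threading.Event()
    done = threading.Event()

    def holder():
        lock.acquire()
        held.set()    # the lock is now held in the parent
        done.wait()   # keep holding it across the fork
        lock.release()

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: the at-fork handler already replaced the inherited lock.
        ok = lock.acquire(timeout=timeout)
        os._exit(0 if ok else 1)

    _, status = os.waitpid(pid, 0)
    done.set()
    t.join()
    return os.WEXITSTATUS(status) == 0
```

The sketch only shows why a lock that is missed by the reset step stays locked forever in the child; any state the lock was protecting may also need to be reinitialized, which this example does not attempt.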
#11768 has been merged, so this should be fixed in an upcoming release.
I'm going to keep this open until an end-user reports that the new release fixes the problem. @oranav, if you like, I can direct you to a workflow for getting access to a pre-release build, but otherwise I can update this thread when a known official release has been made available.
I'm not sure why it took so long to surface this defect, but it turns out that stack v2 can deadlock applications because not all mutices are reset. The repro in #11762 appears to be pretty durable. I need to investigate it a bit more in order to distill it down into a native stress test we can use moving forward.

In practice, this patch suppresses the noted behavior in the repro.

---------

Co-authored-by: Taegyun Kim

(cherry picked from commit d855c4a)
FYI, this will be released (later today, I hope) in 2.18.1. I'm also attempting to back-port to the 2.17 and 2.16 lines (🤞). It'll be part of mainline starting in the 2.19.0 release.
Confirming that 2.18.1 shipped. Would love to hear some folks weigh in on whether or not it solved this problem for them.
We've encountered deadlocks which prevent the process from exiting successfully when using the V2 stack profiler.
Python version is 3.11.11, ddtrace version is 2.17.3, architecture is linux/amd64.
The child process is stuck on `Datadog::Sampler::register_thread`:

And then the parent process waits in `join()` forever.

Here's a simple reproduction script:
Run with:
Just run it a few times (it's non-deterministic), and you'll see it hang every now and then.
I've also noticed that the stuck process is always the first worker process created, which might be a clue.
Thanks!