SIGSEGV in memory profiler (when memalloc_add_event calls traceback_free) #11751

Open
oranav opened this issue Dec 17, 2024 · 7 comments

@oranav
Contributor

oranav commented Dec 17, 2024

We're hitting SIGSEGVs every now and then with the memory profiler.

Python version is 3.11.11. ddtrace is 2.17.3. We're using the amd64 architecture.

I've extracted a native stack traceback from the coredump:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fb47c227f1f in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fb47c1d8fb2 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x00007fb47a2f222f in ?? () from /app/venv/lib/python3.11/site-packages/ddtrace/internal/datadog/profiling/ddup/../libdd_wrapper-glibc-x86_64.so
#4  <signal handler called>
#5  0x00007fb47c523964 in _Py_Dealloc (op=<unknown at remote 0x7fb47ac9c230>) at Objects/object.c:2390
#6  0x00007fb4781a86e5 in traceback_free () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#7  0x00007fb4781a7a40 in memalloc_add_event.part () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#8  0x00007fb4781a7cbe in memalloc_malloc () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#9  0x00007fb47c545be6 in PyObject_Malloc (size=44) at Objects/obmalloc.c:712
#10 _PyBytes_FromSize (size=11, use_calloc=0) at Objects/bytesobject.c:103
#11 0x00007fb47c583a47 in PyBytes_FromStringAndSize (size=11, str=0x7fb3efd2d820 "<REDACTED>") at Objects/bytesobject.c:136

It seems to me that this call accesses some invalid memory.

I believe #11460 might fix it. A possible explanation: if reservoir sampling yields the same index in two threads, both threads decide to ditch the same traceback, and we might then call traceback_free twice on the same pointer (as long as it isn't guarded by a lock).
I'm not sure that's the case, though; it's only a possible explanation.
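
To make the suspected race concrete, here is a minimal Python sketch. It is not ddtrace's actual C code and not the patch in #11460. It assumes the sampler keeps a fixed-size reservoir and frees whichever traceback it discards; `TracebackReservoir`, `Traceback`, the Python `traceback_free` stand-in, and `run_demo` are hypothetical names used only for illustration. Holding one lock across the index choice, the slot swap, and the free means two threads can never pick and free the same entry.

```python
import random
import threading


class Traceback:
    """Hypothetical stand-in for a sampled traceback."""

    def __init__(self, ident):
        self.ident = ident
        self.free_count = 0  # in C, a value above 1 would be a double free


def traceback_free(tb):
    # Stand-in for the C traceback_free(); here it only counts calls.
    tb.free_count += 1


class TracebackReservoir:
    """Fixed-size reservoir sample (Algorithm R), guarded by a lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []
        self.seen = 0
        self._lock = threading.Lock()
        self._rng = random.Random()

    def add(self, tb):
        # The lock covers the index choice, the swap, and the free, so no
        # two threads can evict (and free) the same slot entry.
        with self._lock:
            self.seen += 1
            if len(self.slots) < self.capacity:
                self.slots.append(tb)
                return
            idx = self._rng.randrange(self.seen)
            if idx < self.capacity:
                old = self.slots[idx]
                self.slots[idx] = tb
                traceback_free(old)  # evicted entry is freed exactly once
            else:
                traceback_free(tb)  # new entry was not sampled


def run_demo(num_threads, per_thread, capacity):
    """Feed the reservoir from several threads; return it and all tracebacks."""
    reservoir = TracebackReservoir(capacity)
    created = []
    created_lock = threading.Lock()

    def worker(worker_id):
        for i in range(per_thread):
            tb = Traceback((worker_id, i))
            with created_lock:
                created.append(tb)
            reservoir.add(tb)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reservoir, created
```

Without the lock, two threads could both read the same `old` entry from a slot before either one overwrites it, and both would then free it, which is the double free I'm speculating about.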

@sanchda
Contributor

sanchda commented Dec 17, 2024

👋 Thank you for the report, @oranav. #11460 is indeed the fix for this.

@taegyunkim taegyunkim added the Profiling Continous Profling label Dec 17, 2024
@sanchda sanchda self-assigned this Dec 17, 2024
@sanchda
Contributor

sanchda commented Dec 20, 2024

FYI, this will be released (later today, I hope) in 2.18.1. I'm also attempting to back-port to the 2.17 and 2.16 lines (🤞). It'll be part of mainline starting in the 2.19.0 release.

@sanchda
Contributor

sanchda commented Dec 20, 2024

Confirming that 2.18.1 shipped. Would love to hear some folks weigh in on whether or not it solved this problem for them.
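
For anyone checking whether an environment already has the fix, a small sketch follows. It relies only on what is stated in this thread: the fix first shipped in 2.18.1 and is part of mainline from 2.19.0. `FIRST_FIXED`, `parse_version`, and `includes_fix` are hypothetical helper names. Any 2.16.x or 2.17.x backport is not covered, so `False` here means "not known from this thread", not "definitely unfixed". Pre-release suffixes are ignored, which is also an assumption.

```python
import re

# From this thread: the fix first shipped in ddtrace 2.18.1.
FIRST_FIXED = (2, 18, 1)


def parse_version(text):
    """Return the leading major.minor.patch of a version string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(text):
    """True if the version is 2.18.1 or later; backports are not detected."""
    return parse_version(text) >= FIRST_FIXED
```

To check the installed package, pass in the result of the standard library's `importlib.metadata.version("ddtrace")`.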

@apenney

apenney commented Jan 8, 2025

We have some really serious production issues with 2.18.1 (and previous versions), with 2.14.1 being the last working one for us. We get a bunch of SIGSEGVs and cripplingly high CPU/memory usage until everything dies. I opened a support ticket (1984876) that I wanted to highlight here in case it's related. (It might be a separate issue, but this is the only issue we found that seemed semi-related.)

@sanchda
Contributor

sanchda commented Jan 8, 2025

@apenney following up with our support organization for the circumstantial details in your ticket. Will respond with top priority.

@sanchda
Contributor

sanchda commented Jan 8, 2025

@apenney I'm not sure yet whether your report is related to this one. I'm going to try to override our support processes and will iterate through there (it's a lot easier for me to review customer environments in the context of a support ticket than to ask for some kind of painful back-and-forth over GitHub Issues).

Note that my fix here DID introduce a performance regression, which has been fixed and backported by @nsrip-dd, but it has yet to land in a release.

@sanchda
Contributor

sanchda commented Jan 8, 2025

@apenney still investigating this. Let me break down where we're at.

  1. The crashes I see don't appear to be related to this ticket. Maybe they were at one point, or maybe I'm missing them (sorry!), but the memalloc issues described by the OP no longer appear to be relevant.
  2. However, there are a number of crashes from other components of dd-trace-py. I'm coordinating with the appropriate engineers to gain some insight into things.
  3. The SIGILL trap you posted is problematic, and we don't have automatic detection for SIGILL just yet. I'm somewhat hoping that these issues are resolved by addressing the problems in category 2. If not, I'm also proposing that we upgrade our crash analysis infrastructure simultaneously.

Anyway @apenney, in terms of timeline, here's what you can expect.

  1. I'll back off on posting updates on this issue in this thread, unless you have something tactical to share or our other lines of communication don't sync up within an appropriate amount of time.
  2. Ownership of your ticket will transition to a different part of our org (not me, probably).
  3. Focus is on understanding the crashes. Unfortunately, it's really hard to pinpoint overhead until we have crashes sorted out. If you have evidence for overhead being a totally orthogonal issue, please share your findings in a ticket and we might be able to divide-and-conquer things more effectively on this end.
