[not ready for merge] Userspace stack tracing from kernel programs #466

Open · wants to merge 6 commits into main
Conversation

brenns10 (Contributor) commented on Feb 5, 2025

This has always been a bit of a far-off idea, but with the module API it now works some of the time (definitely not all of the time). So I thought I'd share it to see whether some of the tweaks necessary to make it happen would be reasonable.

Basically, there's a GDB script called pstack that attaches GDB to a process and takes a stack trace of it. You can also use /proc/$PID/stack to get the kernel stack (assuming the task is running in kernel mode or blocked). I was hoping to come up with a way to replicate that behavior in drgn, in a way that works against /proc/kcore or /proc/vmcore. Essentially, it would let you get userspace stack traces from the crashed kernel (though not the whole userspace core dump, as contrib/gcore.py produces) in the kdump kernel just before or after dumping the vmcore. (Presumably userspace pages are filtered in 99.9% of kdump configurations, so you'd need to run this while /proc/kcore is still available.)

The main part of this requires creating a custom program that has a memory reader, as well as specifying all the required Modules, their biases, and their address ranges. From there, you can get the userspace struct pt_regs from the kernel program, copy it to the user program, and then unwind the stack.
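
To make the shape of that concrete, here is a rough sketch of the setup. This is not the actual contrib/pstack.py: make_user_prog is a hypothetical helper, the 1 << 47 user range is an x86-64 simplification, and the idea that the bottom kernel frame exposes the saved user registers is my assumption; access_process_vm(), find_task(), add_memory_segment(), and StackFrame.registers() are existing drgn APIs.

from drgn import Program
from drgn.helpers.linux.mm import access_process_vm
from drgn.helpers.linux.pid import find_task

def make_user_prog(kprog: Program, pid: int) -> Program:
    """Build a userspace Program whose memory is read through the kernel program."""
    task = find_task(kprog, pid)

    # Share the kernel program's platform so register names and layout match.
    uprog = Program(kprog.platform)

    def read_user(address, count, offset, physical):
        # Resolve userspace virtual addresses through the task's page tables.
        # access_process_vm() raises FaultError for unmapped pages; the
        # unwinder needs to see that as a fault (tweak 1 in the list below).
        return access_process_vm(task, address, count)

    # One big virtual segment covering the (x86-64, non-LA57) user range.
    uprog.add_memory_segment(0, 1 << 47, read_user)

    # Module registration (binary, libc, ...) with address ranges and biases
    # goes here; see the sketch after the numbered list below.
    return uprog

# The saved userspace registers come from the kernel side: in the example
# output below, the bottom frame of the kernel trace is already a user PC,
# and StackFrame.registers() returns whatever the unwinder recovered there.
# user_regs = kprog.stack_trace(find_task(kprog, pid))[-1].registers()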

To do this, I've needed to tweak drgn a bit:

  1. Python memory readers may raise FaultError, but the resulting drgn_error has the wrong error code. So I added a special case to pass fault errors through to the underlying drgn error. This could be made more general, but I don't think doing it generally would actually be a good idea.
  2. I made the loaded & debug file biases writable, so that I could update them (see the sketch after this list).
  3. I included a crude patch to support .gnu_debugdata, which is helpful for my use case.
  4. I needed to rip out the compatibility checks for stack tracing, because otherwise drgn would fail, saying that stack tracing is not supported for this program.
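
To make item 2 concrete, here is roughly what per-mapping module registration could look like on the user program. The module-API spellings here (extra_module(), try_file(), address_range, loaded_file_bias, debug_file_bias) are approximate rather than quoted from this branch, the writable biases only exist with these patches applied, and the VMA walk that would produce path/start/end/bias is elided:

def add_user_module(uprog, path, start, end, bias):
    # Extra modules are the escape hatch for ELF files that drgn would not
    # associate with the program on its own.
    module = uprog.extra_module(path, create=True)   # signature approximate
    module.address_range = (start, end)

    # Attach the on-disk ELF (and/or its debug file) to the module.
    module.try_file(path)                            # signature approximate

    # drgn did not locate this file itself, so any bias it computed is not
    # meaningful for this mapping; overwrite it.  Making these attributes
    # writable is tweak 2 above.
    module.loaded_file_bias = bias
    module.debug_file_bias = bias
    return module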

At the end of the day, I'm quite confident that none of this is ready for merge, but I did wonder if any of the individual changes make sense to include?

For fun, here's the result of running the contrib/pstack.py script against the current bash process:

$ python -m drgn -q --kernel-dir ~/vmlinux_repo/$(uname -r) -k contrib/pstack.py $$
#0  context_switch (kernel/sched/core.c:5328:2)
#1  __schedule (kernel/sched/core.c:6693:8)
#2  __schedule_loop (kernel/sched/core.c:6770:3)
#3  schedule (kernel/sched/core.c:6785:2)
#4  do_wait (kernel/exit.c:1697:3)
#5  kernel_wait4 (kernel/exit.c:1851:8)
#6  __do_sys_wait4 (kernel/exit.c:1879:13)
#7  do_syscall_x64 (arch/x86/entry/common.c:52:14)
#8  do_syscall_64 (arch/x86/entry/common.c:89:7)
#9  entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
#10 0x7ff5ba4d8b7a
------ userspace ---------
#0  wait4+0x1a/0xab
#1  waitchld.constprop.0+0xbb/0xa5f
#2  wait_for+0x4ca/0xc0e
#3  execute_command_internal+0x2768/0x2ef6
#4  execute_command+0xc8/0x1b8
#5  reader_loop+0x289/0x3d9
#6  main+0x15be/0x198b
#7  __libc_start_call_main+0x80/0xac
#8  __libc_start_main@@GLIBC_2.34+0x80/0x148
#9  _start+0x25/0x26
#10 ???

Extra modules exist to allow loading additional ELF files, but the
"file->is_loadable" check blocks some unusual ELF files from being
loaded. Allow users the freedom to load unusual files for extra
modules, but not for other kinds of modules.

Signed-off-by: Stephen Brennan <[email protected]>
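
As a concrete example of an "unusual" file: .gnu_debugdata is an XZ-compressed mini-debuginfo ELF embedded as a section, and the extracted image carries little beyond a symbol table, so it is the kind of file the is_loadable check can reject. Here is a rough userspace sketch of extracting it and attaching it to an extra module; this is not how the crude patch in this branch handles it, and pyelftools plus the module-API calls in the trailing comments are assumptions:

import lzma
import tempfile

from elftools.elf.elffile import ELFFile   # pyelftools

def extract_gnu_debugdata(path):
    """Decompress the embedded mini-debuginfo ELF and return a temp file path."""
    with open(path, "rb") as f:
        section = ELFFile(f).get_section_by_name(".gnu_debugdata")
        if section is None:
            raise LookupError(f"{path} has no .gnu_debugdata section")
        data = lzma.decompress(section.data())   # the section is an XZ stream

    tmp = tempfile.NamedTemporaryFile(suffix=".minidebug", delete=False)
    tmp.write(data)
    tmp.close()
    return tmp.name

# Hypothetical usage, with made-up library names:
# module = uprog.extra_module("libfoo-minidebug", create=True)
# module.try_file(extract_gnu_debugdata("/usr/lib64/libfoo.so"))
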
Some parts of libdrgn detect drgn error codes and handle them
appropriately. For instance, the stack tracing code expects to get a
fault error. A drgn error that has been translated into a Python
exception, and back to a drgn error, no longer retains its code. This
means that if the stack tracing code is used with Python memory readers,
the fault errors will not be treated as fault errors.

For now, special case fault errors so that they are translated back into
drgn errors with the fault error code. In general, translating Python
exceptions back to drgn errors is not a great idea, because information
is lost. But in this specific case, no information is lost, and it
allows custom memory readers to behave more like built-in memory
readers.

Signed-off-by: Stephen Brennan <[email protected]>
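
For reference, the path this commit is about looks like this in its smallest form: a FaultError raised from a Python memory-reader callback and surfaced by prog.read(). As far as I can tell, the Python-visible behavior is the same either way; what the special case preserves is libdrgn's internal fault error code, which the stack tracer keys off:

from drgn import FaultError, Program, host_platform

prog = Program(host_platform)

def reader(address, count, offset, physical):
    # Pretend that nothing at or above 0x2000 is mapped.
    if address >= 0x2000:
        raise FaultError("address not mapped", address)
    return bytes(count)

prog.add_memory_segment(0x1000, 0x10000, reader)

prog.read(0x1000, 16)        # fine: sixteen zero bytes
try:
    prog.read(0x3000, 16)    # the callback raises FaultError
except FaultError as e:
    # With this commit, libdrgn internally still sees the fault error code at
    # this point, so the stack unwinder can stop cleanly at the faulting
    # address instead of treating it as a generic error.
    print("fault at", hex(e.address))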