HARD FAULT when aborting essential thread #84460

Open
fsammoura1980 opened this issue Jan 23, 2025 · 5 comments
Labels
area: Kernel, bug, priority: low

Comments

@fsammoura1980
Contributor

fsammoura1980 commented Jan 23, 2025

Describe the bug
I created a test for aborting an essential thread in this CL: fsammoura1980@f68690f. I set valid_fault to true to tell the test that I expect a kernel panic. Although the log clearly shows "Fatal error expected as part of test case.", the test terminates with a HARD FAULT, which I believe is a Zephyr bug.

  • The test output logs:
(.venv) fsammoura@fsammoura1:~/zephyrproject/zephyr$ west build -t run -b qemu_cortex_m3 tests/kernel/threads/thread_error_case
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: cortex-m3
qemu-system-arm: warning: nic stellaris_enet.0 has no peer
Timer with period zero, disabling
*** Booting Zephyr OS build v4.0.0-3779-gf68690fca63a ***
Running TESTSUITE thread_error_case
===================================================================
START - test_abort_essential_thread
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1334
        aborting essential thread 0x200000f0
E: r0/a1:  0x00000004  r1/a2:  0x00000536  r2/a3:  0x90000000
E: r3/a4:  0x00000004 r12/ip:  0x00000000 r14/lr:  0x000028fd
E:  xpsr:  0x21000000
E: Faulting instruction address (r15/pc): 0x00003a1e
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x200000f0 (unknown)
Caught system error -- reason 4 1
Fatal error expected as part of test case.
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1334
        aborting essential thread 0x200000f0
E: ***** HARD FAULT *****
E:   Fault escalation (see below)
E: ARCH_EXCEPT with reason 4

E: r0/a1:  0x00000004  r1/a2:  0x00000536  r2/a3:  0x90000000
E: r3/a4:  0x00000004 r12/ip:  0x00000000 r14/lr:  0x000028fd
E:  xpsr:  0x2100000b
E: Faulting instruction address (r15/pc): 0x00003a1e
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Fault during interrupt handling

E: Current thread: 0x200000f0 (unknown)
Caught system error -- reason 4 0
Fatal error was unexpected, aborting...

This issue was also reproduced on a Google board using this CL: https://chromium-review.googlesource.com/c/chromiumos/platform/ec/+/6096994

To Reproduce

Steps to reproduce the behavior:

  1. Apply the CL: fsammoura1980@f68690f
  2. Run the test: west build -t run -b qemu_cortex_m3 tests/kernel/threads/thread_error_case
  3. Observe the HARD FAULT in the output

Expected behavior

The test expects a kernel panic, so it should finish without a HARD FAULT. I believe this is a Zephyr bug.

Impact

Adding proper test infrastructure to a Google project

Environment (please complete the following information):

Additional context

None

fsammoura1980 added the bug label Jan 23, 2025
@fsammoura1980
Contributor Author

I am not sure why the details of the bug did not show up:
I created a test for aborting an essential thread: fsammoura1980@f68690f
ZTEST_USER(thread_error_case, test_abort_essential_thread)

I set fault_valid to true to signal that a kernel panic is expected. Per the logs:

(.venv) fsammoura@fsammoura1:~/zephyrproject/zephyr$ west build -t run -b qemu_cortex_m3 tests/kernel/threads/thread_error_case
-- west build: running target run
[0/1] To exit from QEMU enter: 'CTRL+a, x'[QEMU] CPU: cortex-m3
qemu-system-arm: warning: nic stellaris_enet.0 has no peer
Timer with period zero, disabling
*** Booting Zephyr OS build v4.0.0-3779-gf68690fca63a ***
Running TESTSUITE thread_error_case
===================================================================
START - test_abort_essential_thread
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1334
        aborting essential thread 0x200000f0
E: r0/a1:  0x00000004  r1/a2:  0x00000536  r2/a3:  0x90000000
E: r3/a4:  0x00000004 r12/ip:  0x00000000 r14/lr:  0x000028fd
E:  xpsr:  0x21000000
E: Faulting instruction address (r15/pc): 0x00003a1e
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: 0x200000f0 (unknown)
Caught system error -- reason 4 1
Fatal error expected as part of test case.
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1334
        aborting essential thread 0x200000f0
E: ***** HARD FAULT *****
E:   Fault escalation (see below)
E: ARCH_EXCEPT with reason 4

E: r0/a1:  0x00000004  r1/a2:  0x00000536  r2/a3:  0x90000000
E: r3/a4:  0x00000004 r12/ip:  0x00000000 r14/lr:  0x000028fd
E:  xpsr:  0x2100000b
E: Faulting instruction address (r15/pc): 0x00003a1e
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Fault during interrupt handling

E: Current thread: 0x200000f0 (unknown)
Caught system error -- reason 4 0
Fatal error was unexpected, aborting...

The test did indeed report "Fatal error expected as part of test case." However, a HARD FAULT occurred afterwards, which is unexpected. I believe this is a bug in Zephyr.

This was also reproduced in a similar test run on a Google board:
https://chromium-review.googlesource.com/c/chromiumos/platform/ec/+/6096994

@nashif
Member

nashif commented Jan 24, 2025

I am not sure why the details of the bug did not show up:

You had your details enclosed in '<! ... >'.

@fsammoura1980
Contributor Author

Thanks Anas.

@andyross
Contributor

Someone familiar with the ztest ztest_set_fault_valid() implementation is probably going to have to chime in. My guess is that this is getting wires crossed: you can't just ignore the panic that happens in k_thread_abort(), because if it gets magically "unpanicked" you're still left with an illegal state where an "aborted" thread now runs to completion.

This may need special handling if we have to support it. My vague guess is that we'll just need to document the limitation and pull a will-not-fix on this. But in theory it should be possible; we just need to work out how to get back to a legal state.
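
For readers following along, here is a small host-side C model of the sequence the log seems to show. It is only a sketch, not Zephyr code: every name in it (model_thread, model_state, model_abort, model_fatal, clear_essential_on_expected) is hypothetical. It assumes that the "fault is expected" flag is consumed by the first fault, and that the fatal-error path tries to abort the faulting thread once the handler decides to continue. Under those assumptions, the second abort hits the same "essential thread" check and is reported as unexpected, which matches the "reason 4 1" followed by "reason 4 0" lines above. The clear_essential_on_expected switch models one possible way of getting back to a legal state; it is an illustration, not a proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Host-side model only: none of these names are Zephyr APIs. */
struct model_thread {
	bool essential; /* models the "essential" thread flag */
	bool aborted;   /* set once the abort really takes effect */
};

struct model_state {
	bool fault_valid;                 /* one-shot "next fault is expected" flag */
	bool clear_essential_on_expected; /* hypothetical recovery policy */
	int expected_faults;              /* faults accepted by the handler */
	int unexpected_faults;            /* faults rejected by the handler */
	bool halted;                      /* models the system halting */
};

void model_abort(struct model_state *st, struct model_thread *t);

/* Models the fatal-error path: consult the one-shot flag, then (assumption)
 * abort the faulting thread if the handler chose to continue.
 */
void model_fatal(struct model_state *st, struct model_thread *t)
{
	assert(st != NULL && t != NULL);

	if (!st->fault_valid) {
		st->unexpected_faults++;
		st->halted = true;
		return;
	}

	st->fault_valid = false; /* consumed by the first fault */
	st->expected_faults++;

	if (st->clear_essential_on_expected) {
		t->essential = false; /* hypothetical: restore a legal state */
	}

	model_abort(st, t);
}

/* Models the abort call: an essential thread panics instead of aborting. */
void model_abort(struct model_state *st, struct model_thread *t)
{
	assert(st != NULL && t != NULL);

	if (t->essential) {
		model_fatal(st, t);
		return;
	}

	t->aborted = true;
}
```

Calling model_abort() on an essential thread with fault_valid set records one expected fault and then one unexpected fault, and the model halts. With clear_essential_on_expected also set, the model records only the expected fault and the thread ends up aborted.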

fabiobaltieri added the priority: low label Jan 28, 2025