Investigate test hang #1954

Open
oleksandr-pavlyk opened this issue Jan 4, 2025 · 7 comments

Comments

@oleksandr-pavlyk
Collaborator

Sometimes, and more often than I would like, the Linux test run times out. It appears to hang, and this is the last line of output:

tests/elementwise/test_bitwise_xor.py::test_bitwise_xor_inplace_python_scalar[u2] PASSED [  7%]

Rerunning the test_linux step usually succeeds.

I have not yet been able to reproduce the hang locally.
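Because the only symptom in CI is a silent timeout, it can help to have the test process dump its own stacks before the job is killed. The sketch below is not part of dpctl. It wraps the standard-library `faulthandler.dump_traceback_later` in a context manager that prints the Python stack of every thread if the body runs longer than a time budget, without interrupting it. `dump_stacks_if_slow` and `capture_slow_dump` are hypothetical helper names. pytest provides similar per-test behaviour through its `faulthandler_timeout` ini option, for example `pytest -o faulthandler_timeout=300`. Only Python frames are shown. A thread blocked inside native SYCL/OpenCL code appears at the Python line that called into the extension, so native frames still require gdb.

```python
import faulthandler
import sys
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_slow(timeout_s, file=None):
    """Dump the Python stack of every thread if the body runs past timeout_s.

    The body is not interrupted (exit=False): a hung process keeps hanging,
    but the log shows where each thread was when the timeout fired.
    """
    out = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)
    try:
        yield
    finally:
        # Disarm the watchdog once the body returns (or raises).
        faulthandler.cancel_dump_traceback_later()


def capture_slow_dump(busy_s, timeout_s):
    """Self-check: sleep for busy_s under the watchdog and return what it wrote."""
    with tempfile.TemporaryFile() as f:
        with dump_stacks_if_slow(timeout_s, file=f):
            time.sleep(busy_s)
        f.seek(0)
        return f.read().decode(errors="replace")
```

In a long-running script, wrap the suspect section as `with dump_stacks_if_slow(600): ...`. The budget should be shorter than the CI job timeout, so that the dump reaches the log before the job is killed.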

@ndgrigorian
Collaborator

I have reproduced the hang locally using the following script, test_bitwise_xor.py:

import pytest

import dpctl.tensor as dpt
from dpctl.tests.helper import get_queue_or_skip, skip_if_dtype_not_supported
from dpctl.tests.elementwise.utils import _integral_dtypes

@pytest.mark.parametrize("iters", range(1, 2001))
@pytest.mark.parametrize("dtype", ["?"] + _integral_dtypes)
def test_bitwise_xor_inplace_python_scalar(dtype, iters):
    assert iters
    q = get_queue_or_skip()
    skip_if_dtype_not_supported(dtype, q)
    X = dpt.zeros((10, 10), dtype=dtype, sycl_queue=q)
    dt_kind = X.dtype.kind
    if dt_kind == "b":
        X ^= False
    else:
        X ^= int(0)

and then running

$ CL_CONFIG_CPU_TARGET_ARCH=corei7-avx ONEAPI_DEVICE_SELECTOR=*:cpu pytest test_bitwise_xor.py

the test eventually hung:

$ CL_CONFIG_CPU_TARGET_ARCH=corei7-avx ONEAPI_DEVICE_SELECTOR=*:cpu pytest test_bitwise_xor.py
================================================= test session starts ==================================================
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/ngrigori/test
plugins: hypothesis-6.104.2
collected 18000 items

test_bitwise_xor.py ................................................................................

Adding q.wait() as the last line of the test fixed the hang:

=========================================== 18000 passed in 69.62s (0:01:09) ===========================================
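The 18000 collected items come from the two stacked `parametrize` decorators: 2000 values of `iters` multiplied by 9 dtypes (`"?"` for bool plus the integer types). The quick check below assumes that `_integral_dtypes` holds the eight signed and unsigned integer type codes listed in it. The authoritative list is in `dpctl/tests/elementwise/utils.py`.

```python
from itertools import product

# Assumption: mirrors dpctl.tests.elementwise.utils._integral_dtypes.
INTEGRAL_DTYPES = ["i1", "u1", "i2", "u2", "i4", "u4", "i8", "u8"]
DTYPES = ["?"] + INTEGRAL_DTYPES  # "?" is the type code for bool
ITERS = range(1, 2001)            # 2000 repetitions, as in the reproducer

# pytest generates one test item per (dtype, iters) combination.
CASES = list(product(DTYPES, ITERS))
print(len(CASES))  # 9 * 2000 = 18000
```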

@ndgrigorian
Collaborator

ndgrigorian commented Jan 5, 2025

It turns out that specifying the architecture is not necessary:

$ ONEAPI_DEVICE_SELECTOR=*:cpu pytest test_bitwise_xor.py
================================================= test session starts ==================================================
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/ngrigori/test
plugins: hypothesis-6.104.2
collected 18000 items

test_bitwise_xor.py ............................................................................................ [  0%]
................................................................................................................ [  1%]
................................................................................................................ [  1%]
................................................................................................................ [  2%]
................................................................................................................ [  3%]
................................................................................................................ [  3%]
................................................................................................................ [  4%]
................................

This means that, in principle, the hang can be reproduced on any CPU.

@oleksandr-pavlyk
Collaborator Author

I ran the test twice on my Core i7-1185G7 CPU, and all 18000 tests passed both times. This is consistent with the hang occurring only occasionally in CI.

@ndgrigorian If you can reproduce the hang fairly reliably, please run it under gdb and use the GDB command thread apply all bt to inspect the backtraces of the main thread and of all other threads.
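Interactive gdb sessions page long output, which is why "--Type <RET> for more" prompts get mixed into pasted dumps. One alternative is to attach to the already-hung process in batch mode (`gdb -p PID -batch -ex "thread apply all bt"`). Batch mode disables paging and exits once the commands have run. The dump can then be condensed to one line per thread. The sketch below is hypothetical: `gdb_all_threads_cmd`, `dump_backtraces`, `summarize_backtraces` and the skip list are illustrative names and heuristics, not part of any tool. For each thread it reports the innermost frame that is not a low-level wait primitive, which makes threads blocked in the OpenCL CPU runtime easier to tell apart from idle TBB or OpenBLAS workers.

```python
import re
import subprocess

# Heuristic: frames whose names contain these substrings are low-level wait
# primitives; skipping them surfaces the caller that is actually blocked.
DEFAULT_SKIP = ("syscall", "futex", "semaphore", "pthread_cond_wait", "sleep_node")

_THREAD_HEADER = re.compile(r"^Thread (\d+) \(")
_FRAME = re.compile(r"^#\d+\s+(.*)$")
_ADDRESS_PREFIX = re.compile(r"^0x[0-9a-fA-F]+ in ")


def gdb_all_threads_cmd(pid):
    """gdb command line that attaches to `pid`, prints every thread's backtrace
    and exits. -batch also disables paging, so no pager prompts appear."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid):
    """Run the command above; requires gdb and permission to ptrace `pid`."""
    result = subprocess.run(gdb_all_threads_cmd(pid), capture_output=True, text=True)
    return result.stdout


def summarize_backtraces(text, skip=DEFAULT_SKIP):
    """Map each gdb thread number to its innermost frame not matching `skip`."""
    summary = {}
    thread = None
    for line in text.splitlines():
        header = _THREAD_HEADER.match(line)
        if header:
            thread = int(header.group(1))
            continue
        frame = _FRAME.match(line)
        if frame is None or thread is None or thread in summary:
            continue
        # Drop the "0x... in " return address, then cut at the argument list.
        rest = _ADDRESS_PREFIX.sub("", frame.group(1))
        name = rest.split(" (", 1)[0]
        if not any(token in name for token in skip):
            summary[thread] = name
    return summary


# Two threads excerpted (and shortened) from the dump in this issue.
EXCERPT = """\
Thread 15 (Thread 0x7fffa2ffd6c0 (LWP 657190) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527ef24, comparand=2) at semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527ef24) at semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527ef20) at rml_thread_monitor.h:235

Thread 8 (Thread 0x7fffe2bf96c0 (LWP 657154) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true) at ./nptl/futex-internal.c:57
#4  ___pthread_cond_wait (cond=0x7ffff186ba38 <thread_status+824>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from libscipy_openblas64_.so
"""

print(summarize_backtraces(EXCERPT))
```

On the excerpt, thread 15 reduces to an idle TBB worker (`thread_monitor::wait`) and thread 8 to an idle OpenBLAS server thread. Threads whose summary lands in `libintelocl.so` frames are the ones worth examining.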

@oleksandr-pavlyk
Collaborator Author

@ndgrigorian I was able to reproduce the hang using CL_CONFIG_CPU_VECTORIZER_MODE=4 ONEAPI_DEVICE_SELECTOR=*:cpu gdb --args python -m pytest -s test_hang.py.
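Since the hang appears only after many attempts, a small driver can rerun the reproducer and stop at the first attempt that exceeds a time budget. That attempt is left running, so gdb can attach to it by PID (`gdb -p`) rather than every attempt having to start under `gdb --args`. The sketch below uses the hypothetical helper `run_until_hang` and a hypothetical `REPRODUCER` command line. The budget is an assumption: a passing run in this issue took about 70 s, so a few minutes should leave ample margin.

```python
import subprocess
import sys

# Hypothetical reproducer command; adjust the test path and pytest flags.
REPRODUCER = [sys.executable, "-m", "pytest", "-q", "test_bitwise_xor.py"]


def run_until_hang(cmd, timeout_s, max_attempts, env=None):
    """Re-run `cmd` until one attempt is still running after timeout_s seconds.

    Returns (attempt_number, process) for the first attempt that did not finish
    in time. That process is left alive so a debugger can attach to process.pid.
    Returns (None, None) if every attempt completed within the budget. Exit
    codes of completed attempts are ignored.
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(
            cmd, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return attempt, proc
    return None, None
```

For example, `run_until_hang(REPRODUCER, timeout_s=300, max_attempts=50, env={**os.environ, "ONEAPI_DEVICE_SELECTOR": "*:cpu"})`, followed by `gdb -p <proc.pid>` on the returned process.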

I hit the hang on the umpteenth attempt, and the backtraces appear to indicate a deadlock in the CPU runtime:

GDB backtraces of all threads in the hung state
(gdb) thread apply all bt

Thread 16 (Thread 0x7fffa27fc6c0 (LWP 657191) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb6242064 in tbb::detail::r1::futex_wait (futex=0x7fffa27fa330, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffa27fa330) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>::wait (this=0x7fffa27fa300) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:170
#4  0x00007fffb6241c8f in tbb::detail::r1::concurrent_monitor_base<tbb::detail::r1::market_context>::commit_wait (this=0x7fffb4ff3a80, node=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::concurrent_monitor_base<tbb::detail::r1::market_context>::wait<tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>, tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}&>(tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}&, tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>&&) (this=0x7fffb4ff3a80, node=..., pred=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:262
#6  tbb::detail::r1::sleep_waiter::sleep<tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}>(unsigned long, tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}) (this=0x7fffa27fa4a8, uniq_tag=<optimized out>, wakeup_condition=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/waiters.h:133
#7  tbb::detail::r1::external_waiter::pause (this=this@entry=0x7fffa27fa4a8) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/waiters.h:160
#8  0x00007fffb6241364 in tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::external_waiter> (this=0x7fffb4fe0f00, tls=..., ed=..., waiter=..., isolation=0, fifo_allowed=<optimized out>, critical_allowed=<optimized out>) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:232
#9  tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffb4fe0f00, t=<optimized out>, t@entry=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:362
#10 0x00007fffb623f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffb4fe0f00, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#11 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#12 0x00007fffc1e97e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#13 0x00007fffc1e9d055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#14 0x00007fffb6228a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#15 0x00007fffc1e9cc9d in Intel::OpenCL::TaskExecutor::TaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#16 0x00007fffc1e9cb5d in tbb::detail::d1::task_arena_function<TaskGroupWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#17 0x00007fffb622a41d in tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#1}::operator()() const (this=<optimized out>) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:734
#18 tbb::detail::d0::try_call_proxy<tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#1}>::on_completion<tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#2}>(tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#2}) (this=<optimized out>, on_completion_body=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/detail/_template_helpers.h:230
#19 tbb::detail::r1::delegated_task::execute (this=0x7fffffff9b40, ed=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:735
#20 0x00007fffb62410ae in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffb4fe0f00, t=0x7fffffff9b40, t@entry=0x0, waiter=...)at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/task_group.h:382
#21 0x00007fffb623f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffb4fe0f00, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#22 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#23 0x00007fffc1e97e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#24 0x00007fffc1e9d055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#25 0x00007fffb6228a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#26 0x00007fffc1e9cdd0 in Intel::OpenCL::TaskExecutor::SpawningTaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#27 0x00007fffc1e83436 in Intel::OpenCL::TaskExecutor::out_of_order_executor_task::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#28 0x00007fffc1e9ca88 in tbb::detail::d1::enqueue_task<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task> >::execute(tbb::detail::d1::execution_data&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#29 0x00007fffb62410ae in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffb4fe0f00, t=0x7fffb4fef300, t@entry=0x0, waiter=...)at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/task_group.h:382
#30 0x00007fffb623f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffb4fe0f00, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#31 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#32 0x00007fffc1e97e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#33 0x00007fffc1e9d055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#34 0x00007fffb6228a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#35 0x00007fffc1e9cdd0 in Intel::OpenCL::TaskExecutor::SpawningTaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#36 0x00007fffc1e83436 in Intel::OpenCL::TaskExecutor::out_of_order_executor_task::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#37 0x00007fffc1e9cb12 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task>, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#38 0x00007fffb6228a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task>, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#39 0x00007fffc1e99c42 in Intel::OpenCL::TaskExecutor::out_of_order_command_list::LaunchExecutorTask(bool, Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#40 0x00007fffc1e9933e in Intel::OpenCL::TaskExecutor::base_command_list::InternalFlush(bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#41 0x00007fffc1e991e2 in Intel::OpenCL::TaskExecutor::base_command_list::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#42 0x00007fffc1e713f5 in Intel::OpenCL::CPUDevice::CPUDevice::clDevCommandListWaitCompletion(void*, cl_dev_cmd_desc*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#43 0x00007fffc1e2a149 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::Framework::QueueEvent> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#44 0x00007fffc1df82a4 in Intel::OpenCL::Framework::EventsManager::WaitForEvents(unsigned int, _cl_event* const*, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#45 0x00007fffc1e0d033 in Intel::OpenCL::Framework::ExecutionModule::WaitForEvents(unsigned int, _cl_event* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#46 0x00007fffc1d66856 in clWaitForEvents () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#47 0x00007fffca9df36a in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0
#48 0x00007ffff28f79eb in ur_loader::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#49 0x00007ffff290ac27 in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#50 0x00007ffff32bec64 in sycl::_V1::detail::DispatchHostTask::waitForEvents() const () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#51 0x00007ffff32bdaa4 in sycl::_V1::detail::DispatchHostTask::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#52 0x00007ffff31f8dd5 in sycl::_V1::detail::ThreadPool::worker() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#53 0x00007ffff5720db4 in std::execute_native_thread_routine (__p=0x55555a9a6d70) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
#54 0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#55 0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 15 (Thread 0x7fffa2ffd6c0 (LWP 657190) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527ef24, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527ef24) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527ef20) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527ef00) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527ef24) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 14 (Thread 0x7fffa3fff6c0 (LWP 657188) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527eea4, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527eea4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527eea0) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527ee80) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527eea4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 13 (Thread 0x7fffa37fe6c0 (LWP 657189) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527ee24, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527ee24) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527ee20) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527ee00) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527ee24) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 12 (Thread 0x7fffa89316c0 (LWP 657187) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527f024, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527f024) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527f020) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527f000) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527f024) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 11 (Thread 0x7fffa99336c0 (LWP 657186) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527f124, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527f124) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527f120) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527f100) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527f124) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 10 (Thread 0x7fffa91326c0 (LWP 657185) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527efa4, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527efa4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527efa0) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527ef80) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527efa4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 9 (Thread 0x7fffaa1346c0 (LWP 657184) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb623af06 in tbb::detail::r1::futex_wait (futex=0x7fffb527f0a4, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffb527f0a4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffb527f0a0) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffb527f080) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffb623adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffb527f0a4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 8 (Thread 0x7fffe2bf96c0 (LWP 657154) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186ba60 <thread_status+864>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186ba60 <thread_status+864>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186ba60 <thread_status+864>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186ba10 <thread_status+784>, cond=0x7ffff186ba38 <thread_status+824>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186ba38 <thread_status+824>, mutex=0x7ffff186ba10 <thread_status+784>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 7 (Thread 0x7fffe33fa6c0 (LWP 657153) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b9e0 <thread_status+736>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b9e0 <thread_status+736>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b9e0 <thread_status+736>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b990 <thread_status+656>, cond=0x7ffff186b9b8 <thread_status+696>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b9b8 <thread_status+696>, mutex=0x7ffff186b990 <thread_status+656>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 6 (Thread 0x7fffe5bfb6c0 (LWP 657152) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b960 <thread_status+608>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b960 <thread_status+608>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b960 <thread_status+608>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b910 <thread_status+528>, cond=0x7ffff186b938 <thread_status+568>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b938 <thread_status+568>, mutex=0x7ffff186b910 <thread_status+528>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 5 (Thread 0x7fffe83fc6c0 (LWP 657151) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b8e0 <thread_status+480>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b8e0 <thread_status+480>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b8e0 <thread_status+480>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b890 <thread_status+400>, cond=0x7ffff186b8b8 <thread_status+440>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b8b8 <thread_status+440>, mutex=0x7ffff186b890 <thread_status+400>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 4 (Thread 0x7fffeabfd6c0 (LWP 657150) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b860 <thread_status+352>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b860 <thread_status+352>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b860 <thread_status+352>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b810 <thread_status+272>, cond=0x7ffff186b838 <thread_status+312>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b838 <thread_status+312>, mutex=0x7ffff186b810 <thread_status+272>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 3 (Thread 0x7fffef3fe6c0 (LWP 657149) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b7e0 <thread_status+224>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b7e0 <thread_status+224>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b7e0 <thread_status+224>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b790 <thread_status+144>, cond=0x7ffff186b7b8 <thread_status+184>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b7b8 <thread_status+184>, mutex=0x7ffff186b790 <thread_status+144>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2 (Thread 0x7fffefbff6c0 (LWP 657148) "python"):
#0  0x00007ffff7d42d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff186b760 <thread_status+96>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff186b760 <thread_status+96>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff186b760 <thread_status+96>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d457dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff186b710 <thread_status+16>, cond=0x7ffff186b738 <thread_status+56>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff186b738 <thread_status+56>, mutex=0x7ffff186b710 <thread_status+16>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff07bfb8b in blas_thread_server () from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d46a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd3c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7ffff7ca3b80 (LWP 657133) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffb622a244 in tbb::detail::r1::futex_wait (futex=0x7fffffff9b38, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffffff9b38) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::sleep_node<unsigned long>::wait (this=0x7fffffff9b10) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:170
#4  0x00007fffb62287a3 in tbb::detail::r1::concurrent_monitor_base<unsigned long>::commit_wait (this=0x7fffb4fe0628, node=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:797
#6  0x00007fffc1e9a0f4 in Intel::OpenCL::TaskExecutor::out_of_order_command_list::WaitForIdle() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#7  0x00007fffc1e991f2 in Intel::OpenCL::TaskExecutor::base_command_list::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#8  0x00007fffc1e713f5 in Intel::OpenCL::CPUDevice::CPUDevice::clDevCommandListWaitCompletion(void*, cl_dev_cmd_desc*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#9  0x00007fffc1e2a149 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::Framework::QueueEvent> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#10 0x00007fffc1df82a4 in Intel::OpenCL::Framework::EventsManager::WaitForEvents(unsigned int, _cl_event* const*, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#11 0x00007fffc1e0d033 in Intel::OpenCL::Framework::ExecutionModule::WaitForEvents(unsigned int, _cl_event* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#12 0x00007fffc1d66856 in clWaitForEvents () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#13 0x00007fffca9df36a in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0
#14 0x00007ffff28f79eb in ur_loader::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#15 0x00007ffff290ac27 in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#16 0x00007ffff31eeaef in sycl::_V1::detail::event_impl::waitInternal(bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#17 0x00007ffff31eefa2 in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#18 0x00007ffff32f176e in sycl::_V1::event::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#19 0x00007fffdd782ec4 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so
#20 0x00007fffdd340849 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so
#21 0x00007fffdd3235b7 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so
#22 0x000055555577f401 in cfunction_call (func=0x7fffde27f510, args=0x555555b90d88 <_PyRuntime+76264>, kwargs=0x7fffa158e580) at /usr/local/src/conda/python-3.12.3/Objects/methodobject.c:537
#23 0x000055555575f04b in _PyObject_MakeTpCall (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7fffde27f510, args=<optimized out>, nargs=0, keywords=0x7ffff6f01ad0) at /usr/local/src/conda/python-3.12.3/Objects/call.c:240
#24 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae5ac8, throwflag=<optimized out>) at Python/bytecodes.c:2706
#25 0x00007fffda5b151e in __pyx_pf_5dpctl_6tensor_9_usmarray_11usm_ndarray_68__setitem__(PyUSMArrayObject*, _object*, _object*) () from /home/opavlyk/repos/dpctl/dpctl/tensor/_usmarray.cpython-312-x86_64-linux-gnu.so
#26 0x000055555567289e in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae59d8, throwflag=<optimized out>) at Python/bytecodes.c:552
#27 0x00007fffda5a55a2 in __pyx_pw_5dpctl_6tensor_9_usmarray_11usm_ndarray_127__ixor__(_object*, _object*) () from /home/opavlyk/repos/dpctl/dpctl/tensor/_usmarray.cpython-312-x86_64-linux-gnu.so
#28 0x000055555582aa2f in binary_iop1 (v=0x7fffa1589690, w=0x555555b7f068 <_PyRuntime+3272>, iop_slot=<optimized out>, op_slot=112) at /usr/local/src/conda/python-3.12.3/Objects/abstract.c:1179
#29 0x000055555582a962 in binary_iop (v=0x7fffa1589690, w=0x555555b7f068 <_PyRuntime+3272>, iop_slot=<optimized out>, op_slot=<optimized out>, op_name=0x5555558b4e0f "^=", op_slot=<optimized out>,
 iop_slot=<optimized out>) at /usr/local/src/conda/python-3.12.3/Objects/abstract.c:1204
#30 0x0000555555668749 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae56d0, throwflag=<optimized out>) at Python/bytecodes.c:3382
#31 0x0000555555761ad1 in _PyObject_FastCallDictTstate (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff76bf060, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at
 /usr/local/src/conda/python-3.12.3/Objects/call.c:144
#32 0x000055555578d839 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555bee728 <_PyRuntime+459656>, callable=callable@entry=0x7ffff76bf060, obj=obj@entry=0x7ffff6f4acf0, args=args@entry=0x555555b90d88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7fffa158e1c0) at /usr/local/src/conda/python-3.12.3/Objects/call.c:508
#33 0x0000555555862203 in slot_tp_call (self=0x7ffff6f4acf0, args=0x555555b90d88 <_PyRuntime+76264>, kwds=0x7fffa158e1c0) at /usr/local/src/conda/python-3.12.3/Objects/typeobject.c:8770
#34 0x000055555575f04b in _PyObject_MakeTpCall (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff6f4acf0, args=<optimized out>, nargs=0, keywords=0x7ffff70c8760) at /usr/local/src/conda/python-3.12.3/Objects/call.c:240
#35 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae5340, throwflag=<optimized out>) at Python/bytecodes.c:2706
#36 0x0000555555761ad1 in _PyObject_FastCallDictTstate (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff76bf060, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at
 /usr/local/src/conda/python-3.12.3/Objects/call.c:144
#37 0x000055555578d839 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555bee728 <_PyRuntime+459656>, callable=callable@entry=0x7ffff76bf060, obj=obj@entry=0x7ffff6f4aed0, args=args@entry=0x555555b90d88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7fffcc849980) at /usr/local/src/conda/python-3.12.3/Objects/call.c:508
#38 0x0000555555862203 in slot_tp_call (self=0x7ffff6f4aed0, args=0x555555b90d88 <_PyRuntime+76264>, kwds=0x7fffcc849980) at /usr/local/src/conda/python-3.12.3/Objects/typeobject.c:8770
#39 0x0000555555790575 in _PyObject_Call (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff6f4aed0, args=0x555555b90d88 <_PyRuntime+76264>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.3/Objects/call.c:367
#40 0x0000555555667d0d in PyCFunction_Call (kwargs=0x7fffcc849980, args=0x555555b90d88 <_PyRuntime+76264>, callable=0x7ffff6f4aed0) at /usr/local/src/conda/python-3.12.3/Objects/call.c:387
#41 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae4fd8, throwflag=<optimized out>) at Python/bytecodes.c:3254
#42 0x0000555555761ad1 in _PyObject_FastCallDictTstate (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff76bf060, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at
 /usr/local/src/conda/python-3.12.3/Objects/call.c:144
#43 0x000055555578d839 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555bee728 <_PyRuntime+459656>, callable=callable@entry=0x7ffff76bf060, obj=obj@entry=0x7ffff6f4b060, args=args@entry=0x555555b90d88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7fffcaf4e980) at /usr/local/src/conda/python-3.12.3/Objects/call.c:508
#44 0x0000555555862203 in slot_tp_call (self=0x7ffff6f4b060, args=0x555555b90d88 <_PyRuntime+76264>, kwds=0x7fffcaf4e980) at /usr/local/src/conda/python-3.12.3/Objects/typeobject.c:8770
#45 0x000055555575f04b in _PyObject_MakeTpCall (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff6f4b060, args=<optimized out>, nargs=0, keywords=0x7ffff719b780) at /usr/local/src/conda/python-3.12.3/Objects/call.c:240
#46 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae4a00, throwflag=<optimized out>) at Python/bytecodes.c:2706
#47 0x0000555555761ad1 in _PyObject_FastCallDictTstate (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff76bf060, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at
 /usr/local/src/conda/python-3.12.3/Objects/call.c:144
#48 0x000055555578d839 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555bee728 <_PyRuntime+459656>, callable=callable@entry=0x7ffff76bf060, obj=obj@entry=0x7ffff6f4b150, args=args@entry=0x555555b90d88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7ffff6e67880) at /usr/local/src/conda/python-3.12.3/Objects/call.c:508
#49 0x0000555555862203 in slot_tp_call (self=0x7ffff6f4b150, args=0x555555b90d88 <_PyRuntime+76264>, kwds=0x7ffff6e67880) at /usr/local/src/conda/python-3.12.3/Objects/typeobject.c:8770
#50 0x000055555575f04b in _PyObject_MakeTpCall (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff6f4b150, args=<optimized out>, nargs=0, keywords=0x7ffff7586e60) at /usr/local/src/conda/python-3.12.3/Objects/call.c:240
#51 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae4728, throwflag=<optimized out>) at Python/bytecodes.c:2706
#52 0x0000555555761ad1 in _PyObject_FastCallDictTstate (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff76bf060, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at
 /usr/local/src/conda/python-3.12.3/Objects/call.c:144
#53 0x000055555578d839 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555bee728 <_PyRuntime+459656>, callable=callable@entry=0x7ffff76bf060, obj=obj@entry=0x7ffff6f4a390, args=args@entry=0x555555b90d88 <_PyRuntime+76264>, kwargs=kwargs@entry=0x7ffff6e85500) at /usr/local/src/conda/python-3.12.3/Objects/call.c:508
#54 0x0000555555862203 in slot_tp_call (self=0x7ffff6f4a390, args=0x555555b90d88 <_PyRuntime+76264>, kwds=0x7ffff6e85500) at /usr/local/src/conda/python-3.12.3/Objects/typeobject.c:8770
#55 0x000055555575f04b in _PyObject_MakeTpCall (tstate=0x555555bee728 <_PyRuntime+459656>, callable=0x7ffff6f4a390, args=<optimized out>, nargs=0, keywords=0x7ffff75874f0) at /usr/local/src/conda/python-3.12.3/Objects/call.c:240
#56 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae42a8, throwflag=<optimized out>) at Python/bytecodes.c:2706
#57 0x000055555581940e in PyEval_EvalCode (co=co@entry=0x7ffff6ed3870, globals=globals@entry=0x7ffff7c45bc0, locals=locals@entry=0x7ffff7c45bc0) at /usr/local/src/conda/python-3.12.3/Python/ceval.c:578
#58 0x00005555558341ad in builtin_exec_impl (module=<optimized out>, closure=<optimized out>, locals=0x7ffff7c45bc0, globals=0x7ffff7c45bc0, source=0x7ffff6ed3870) at /usr/local/src/conda/python-3.12.3/Python/bltinmodule.c:1096
#59 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.3/Python/clinic/bltinmodule.c.h:586
#60 0x0000555555774236 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x7ffff7ae4180, nargsf=<optimized out>, kwnames=0x0) at /usr/local/src/conda/python-3.12.3/Objects/methodobject.c:438
#61 0x0000555555773fbf in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775810, args=0x7ffff7ae4180, callable=0x7ffff7be1e40, tstate=0x555555bee728 <_PyRuntime+459656>) at /usr/local/src/conda/python-3.12.3/Include/internal/pycore_call.h:92
#62 PyObject_Vectorcall (callable=0x7ffff7be1e40, args=0x7ffff7ae4180, nargsf=9223372036854775810, kwnames=0x0) at /usr/local/src/conda/python-3.12.3/Objects/call.c:325
#63 0x0000555555666fd9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7ae40d8, throwflag=<optimized out>) at Python/bytecodes.c:2706
#64 0x00005555558496d8 in pymain_run_module (modname=<optimized out>, set_argv0=set_argv0@entry=1) at /usr/local/src/conda/python-3.12.3/Modules/main.c:300
#65 0x0000555555848af5 in pymain_run_python (exitcode=0x7fffffffc744) at /usr/local/src/conda/python-3.12.3/Modules/main.c:623
#66 Py_RunMain () at /usr/local/src/conda/python-3.12.3/Modules/main.c:709
#67 0x0000555555801997 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.12.3/Modules/main.c:763
#68 0x00007ffff7cd41ca in __libc_start_call_main (main=main@entry=0x5555558018d0 <main>, argc=argc@entry=5, argv=argv@entry=0x7fffffffc9d8) at ../sysdeps/nptl/libc_start_call_main.h:58
#69 0x00007ffff7cd428b in __libc_start_main_impl (main=0x5555558018d0 <main>, argc=5, argv=0x7fffffffc9d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffc9c8) at ../csu/libc-start.c:360
#70 0x0000555555801811 in _start ()
(gdb)
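A backtrace like the one above can also be collected without an interactive gdb session. This is handy on CI, where no one is available to answer gdb's pager, and it keeps long frames from being split across lines. The sketch below is an illustration rather than part of dpctl. It assumes `gdb` is on `PATH` and that the current user may ptrace the hung process (for example, the same user and a permissive `kernel.yama.ptrace_scope`). The helper names `gdb_backtrace_cmd` and `dump_native_backtraces` are hypothetical.

```python
import shutil
import subprocess


def gdb_backtrace_cmd(pid: int) -> list[str]:
    """Build a gdb command line that attaches to `pid`, prints the native
    backtrace of every thread, and exits (detaching from the process)."""
    return [
        "gdb",
        "-p", str(pid),
        # Batch mode runs the -ex commands and then exits; it also disables
        # paging and uses unlimited terminal width, so frames are not split.
        "-batch",
        "-ex", "thread apply all bt",
    ]


def dump_native_backtraces(pid: int, timeout: float = 120.0) -> str:
    """Return gdb's stdout for the given (presumably hung) process.

    Raises RuntimeError if gdb is not installed, and
    subprocess.TimeoutExpired if gdb itself does not finish in time.
    """
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb is not on PATH")
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For example, `print(dump_native_backtraces(pid))` with the PID of the stuck `pytest` process prints the same per-thread listing as typing `thread apply all bt` at the `(gdb)` prompt.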

@ndgrigorian
Collaborator

@oleksandr-pavlyk
I get pretty much the same backtrace, except that I didn't need to set CL_CONFIG_CPU_VECTORIZER_MODE=4.

backtrace
(gdb) thread apply all bt

Thread 8 (Thread 0x7fffaeffd6c0 (LWP 568168) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffc0a42064 in tbb::detail::r1::futex_wait (futex=0x7fffaeffb330, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffaeffb330) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>::wait (this=0x7fffaeffb300) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:170
#4  0x00007fffc0a41c8f in tbb::detail::r1::concurrent_monitor_base<tbb::detail::r1::market_context>::commit_wait (this=0x7fffbf7f3a80, node=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::concurrent_monitor_base<tbb::detail::r1::market_context>::wait<tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>, tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}&>(tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}&, tbb::detail::r1::sleep_node<tbb::detail::r1::market_context>&&) (this=0x7fffbf7f3a80, node=..., pred=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:262
#6  tbb::detail::r1::sleep_waiter::sleep<tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}>(unsigned long, tbb::detail::r1::external_waiter::pause(tbb::detail::r1::arena_slot&)::{lambda()#1}) (this=0x7fffaeffb4a8, uniq_tag=<optimized out>, wakeup_condition=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/waiters.h:133
#7  tbb::detail::r1::external_waiter::pause (this=this@entry=0x7fffaeffb4a8) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/waiters.h:160
#8  0x00007fffc0a41364 in tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::external_waiter> (this=0x7fffbf7e2880, tls=..., ed=..., waiter=..., isolation=0, fifo_allowed=<optimized out>, critical_allowed=<optimized out>) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:232
#9  tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffbf7e2880, t=<optimized out>, t@entry=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:362
#10 0x00007fffc0a3f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffbf7e2880, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#11 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#12 0x00007fffcc654e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#13 0x00007fffcc65a055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#14 0x00007fffc0a28a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#15 0x00007fffcc659c9d in Intel::OpenCL::TaskExecutor::TaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#16 0x00007fffcc659b5d in tbb::detail::d1::task_arena_function<TaskGroupWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#17 0x00007fffc0a2a41d in tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#1}::operator()() const (this=<optimized out>) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:734
#18 tbb::detail::d0::try_call_proxy<tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#1}>::on_completion<tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#2}>(tbb::detail::r1::delegated_task::execute(tbb::detail::d1::execution_data&)::{lambda()#2}) (this=<optimized out>, on_completion_body=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/detail/_template_helpers.h:230
#19 tbb::detail::r1::delegated_task::execute (this=0x7fffffff8f00, ed=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:735
#20 0x00007fffc0a410ae in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffbf7e2880, t=0x7fffffff8f00, t@entry=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/task_group.h:382
#21 0x00007fffc0a3f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffbf7e2880, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#22 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#23 0x00007fffcc654e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#24 0x00007fffcc65a055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#25 0x00007fffc0a28a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#26 0x00007fffcc659dd0 in Intel::OpenCL::TaskExecutor::SpawningTaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#27 0x00007fffcc640436 in Intel::OpenCL::TaskExecutor::out_of_order_executor_task::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#28 0x00007fffcc659a88 in tbb::detail::d1::enqueue_task<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task> >::execute(tbb::detail::d1::execution_data&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#29 0x00007fffc0a410ae in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=this@entry=0x7fffbf7e2880, t=0x7fffbf7efa00, t@entry=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/../../include/oneapi/tbb/task_group.h:382
#30 0x00007fffc0a3f10c in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (this=0x7fffbf7e2880, t=0x0, waiter=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.h:470
#31 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/task_dispatcher.cpp:168
#32 0x00007fffcc654e8a in tbb::detail::d2::task_group_base::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#33 0x00007fffcc65a055 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#34 0x00007fffc0a28a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorWaiter, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#35 0x00007fffcc659dd0 in Intel::OpenCL::TaskExecutor::SpawningTaskGroup::WaitForAll() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#36 0x00007fffcc640436 in Intel::OpenCL::TaskExecutor::out_of_order_executor_task::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#37 0x00007fffcc659b12 in tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task>, void>::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#38 0x00007fffc0a28a37 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<Intel::OpenCL::TaskExecutor::ArenaFunctorRunner<Intel::OpenCL::TaskExecutor::out_of_order_executor_task>, void>'
...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:821
#39 0x00007fffcc656c42 in Intel::OpenCL::TaskExecutor::out_of_order_command_list::LaunchExecutorTask(bool, Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#40 0x00007fffcc65633e in Intel::OpenCL::TaskExecutor::base_command_list::InternalFlush(bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#41 0x00007fffcc6561e2 in Intel::OpenCL::TaskExecutor::base_command_list::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#42 0x00007fffcc62e3f5 in Intel::OpenCL::CPUDevice::CPUDevice::clDevCommandListWaitCompletion(void*, cl_dev_cmd_desc*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#43 0x00007fffcc5e7149 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::Framework::QueueEvent> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#44 0x00007fffcc5b52a4 in Intel::OpenCL::Framework::EventsManager::WaitForEvents(unsigned int, _cl_event* const*, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#45 0x00007fffcc5ca033 in Intel::OpenCL::Framework::ExecutionModule::WaitForEvents(unsigned int, _cl_event* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#46 0x00007fffcc523856 in clWaitForEvents () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#47 0x00007fffd5b5436a in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0
#48 0x00007ffff3a7f9eb in ur_loader::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#49 0x00007ffff3a92c27 in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#50 0x00007ffff4446c64 in sycl::_V1::detail::DispatchHostTask::waitForEvents() const () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#51 0x00007ffff4445aa4 in sycl::_V1::detail::DispatchHostTask::operator()() const () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#52 0x00007ffff4380dd5 in sycl::_V1::detail::ThreadPool::worker() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#53 0x00007ffff654cdb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#54 0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#55 0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 7 (Thread 0x7fffaffff6c0 (LWP 568167) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffc0a3af06 in tbb::detail::r1::futex_wait (futex=0x7fffbfa7f024, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffbfa7f024) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffbfa7f020) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffbfa7f000) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffc0a3adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffbfa7f024) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 6 (Thread 0x7fffaf7fe6c0 (LWP 568166) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffc0a3af06 in tbb::detail::r1::futex_wait (futex=0x7fffbfa7f124, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffbfa7f124) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffbfa7f120) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffbfa7f100) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffc0a3adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffbfa7f124) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 5 (Thread 0x7fffb4a346c0 (LWP 568165) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffc0a3af06 in tbb::detail::r1::futex_wait (futex=0x7fffbfa7f0a4, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffbfa7f0a4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::rml::internal::thread_monitor::wait (this=0x7fffbfa7f0a0) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/rml_thread_monitor.h:235
#4  tbb::detail::r1::rml::private_worker::run (this=0x7fffbfa7f080) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:273
#5  0x00007fffc0a3adc6 in tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fffbfa7f0a4) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/private_server.cpp:221
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 4 (Thread 0x7fffebbfd6c0 (LWP 568159) "python"):
#0  0x00007ffff7d3fd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff29c8860 <thread_status+352>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff29c8860 <thread_status+352>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff29c8860 <thread_status+352>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d427dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff29c8810 <thread_status+272>, cond=0x7ffff29c8838 <thread_status+312>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff29c8838 <thread_status+312>, mutex=0x7ffff29c8810 <thread_status+272>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff191cb8b in blas_thread_server () from /home/ngrigori/miniforge3/envs/dpctl_dev/lib/python3.11/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 3 (Thread 0x7ffff03fe6c0 (LWP 568158) "python"):
#0  0x00007ffff7d3fd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff29c87e0 <thread_status+224>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff29c87e0 <thread_status+224>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff29c87e0 <thread_status+224>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d427dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff29c8790 <thread_status+144>, cond=0x7ffff29c87b8 <thread_status+184>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff29c87b8 <thread_status+184>, mutex=0x7ffff29c8790 <thread_status+144>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff191cb8b in blas_thread_server () from /home/ngrigori/miniforge3/envs/dpctl_dev/lib/python3.11/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2 (Thread 0x7ffff0bff6c0 (LWP 568157) "python"):
#0  0x00007ffff7d3fd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff29c8760 <thread_status+96>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7ffff29c8760 <thread_status+96>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff29c8760 <thread_status+96>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7d427dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff29c8710 <thread_status+16>, cond=0x7ffff29c8738 <thread_status+56>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff29c8738 <thread_status+56>, mutex=0x7ffff29c8710 <thread_status+16>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff191cb8b in blas_thread_server () from /home/ngrigori/miniforge3/envs/dpctl_dev/lib/python3.11/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007ffff7d43a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7dd0c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7ffff7ca0b80 (LWP 568154) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fffc0a2a244 in tbb::detail::r1::futex_wait (futex=0x7fffffff8ef8, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffffff8ef8) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::sleep_node<unsigned long>::wait (this=0x7fffffff8ed0) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:170
#4  0x00007fffc0a287a3 in tbb::detail::r1::concurrent_monitor_base<unsigned long>::commit_wait (this=0x7fffbf7e23a8, node=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:797
#6  0x00007fffcc6570f4 in Intel::OpenCL::TaskExecutor::out_of_order_command_list::WaitForIdle() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#7  0x00007fffcc6561f2 in Intel::OpenCL::TaskExecutor::base_command_list::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#8  0x00007fffcc62e3f5 in Intel::OpenCL::CPUDevice::CPUDevice::clDevCommandListWaitCompletion(void*, cl_dev_cmd_desc*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#9  0x00007fffcc5e7149 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::Framework::QueueEvent> const&) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#10 0x00007fffcc5b52a4 in Intel::OpenCL::Framework::EventsManager::WaitForEvents(unsigned int, _cl_event* const*, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#11 0x00007fffcc5ca033 in Intel::OpenCL::Framework::ExecutionModule::WaitForEvents(unsigned int, _cl_event* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#12 0x00007fffcc523856 in clWaitForEvents () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#13 0x00007fffd5b5436a in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0
#14 0x00007ffff3a7f9eb in ur_loader::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#15 0x00007ffff3a92c27 in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#16 0x00007ffff4376aef in sycl::_V1::detail::event_impl::waitInternal(bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#17 0x00007ffff4376fa2 in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#18 0x00007ffff447976e in sycl::_V1::event::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#19 0x00007fffe8788024 in ?? () from /home/ngrigori/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-311-x86_64-linux-gnu.so
#20 0x00007fffe834672f in ?? () from /home/ngrigori/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-311-x86_64-linux-gnu.so
#21 0x00007fffe832a06f in ?? () from /home/ngrigori/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-311-x86_64-linux-gnu.so
#22 0x0000555555755b06 in cfunction_call (func=0x7fffe9209da0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/methodobject.c:542
#23 0x00005555557348b3 in _PyObject_MakeTpCall (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7fffe9209da0, args=<optimized out>, nargs=0, keywords=0x7ffff6fbebb0) at /usr/local/src/conda/python-3.11.9/Objects/call.c:214
#24 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae2ab8, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#25 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae2ab8, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#26 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7fffffffa228, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#27 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffffffa228, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#28 0x00007fffe55b9a36 in __pyx_pf_5dpctl_6tensor_9_usmarray_11usm_ndarray_68__setitem__(PyUSMArrayObject*, _object*, _object*) () from /home/ngrigori/repos/dpctl/dpctl/tensor/_usmarray.cpython-311-x86_64-linux-gnu.so
#29 0x0000555555746071 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae2768, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:2297
#30 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae2768, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#31 _PyEval_Vector (kwnames=<optimized out>, argcount=3, args=0x7fffffffa480, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#32 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffffffa480, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#33 0x00007fffe55ae5db in __pyx_pw_5dpctl_6tensor_9_usmarray_11usm_ndarray_127__ixor__(_object*, _object*) () from /home/ngrigori/repos/dpctl/dpctl/tensor/_usmarray.cpython-311-x86_64-linux-gnu.so
#34 0x00005555557c5984 in binary_iop1 (op_slot=112, iop_slot=216, w=0x555555aa7ac0 <_Py_FalseStruct>, v=0x7fffd5e36960) at /usr/local/src/conda/python-3.11.9/Objects/abstract.c:1190
#35 binary_iop (op_name=0x555555889b71 "^=", op_slot=112, iop_slot=216, w=0x555555aa7ac0 <_Py_FalseStruct>, v=0x7fffd5e36960) at /usr/local/src/conda/python-3.11.9/Objects/abstract.c:1215
#36 PyNumber_InPlaceXor (v=0x7fffd5e36960, w=0x555555aa7ac0 <_Py_FalseStruct>) at /usr/local/src/conda/python-3.11.9/Objects/abstract.c:1248
#37 0x0000555555742fe9 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae26c8, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5548
#38 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae26c8, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#39 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7fffd5e48058, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#40 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffd5e48058, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#41 0x000055555576f6b0 in _PyVectorcall_Call (kwargs=<optimized out>, tuple=<optimized out>, callable=0x7ffff5f867a0, func=0x555555765800 <_PyFunction_Vectorcall>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:257
#42 _PyObject_Call (kwargs=<optimized out>, args=<optimized out>, callable=0x7ffff5f867a0, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:328
#43 PyObject_Call (callable=0x7ffff5f867a0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:355
#44 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x7fffd5e45ec0, callargs=0x555555ab65f8 <_PyRuntime+58904>, func=0x7ffff5f867a0, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#45 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae2610, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#46 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae2610, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#47 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd71062a8, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#48 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffd71062a8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#49 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7fffd7106290, func=0x7ffff7118fe0, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#50 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae23a8, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#51 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae23a8, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#52 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd5e28058, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#53 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffd5e28058, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#54 0x0000555555739420 in _PyObject_FastCallDictTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff76ae480, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:152
#55 0x000055555576d399 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, callable=callable@entry=0x7ffff76ae480, obj=obj@entry=0x7ffff6de5bc0, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwargs=kwargs@entry=0x7fffd5e45a80) at /usr/local/src/conda/python-3.11.9/Objects/call.c:482
#56 0x000055555583f108 in slot_tp_call (self=0x7ffff6de5bc0, args=0x555555ab65f8 <_PyRuntime+58904>, kwds=0x7fffd5e45a80) at /usr/local/src/conda/python-3.11.9/Objects/typeobject.c:7624
#57 0x00005555557348b3 in _PyObject_MakeTpCall (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff6de5bc0, args=<optimized out>, nargs=0, keywords=0x7ffff7113af0) at /usr/local/src/conda/python-3.11.9/Objects/call.c:214
#58 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae22c8, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#59 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae22c8, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#60 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff6ebfaa8, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#61 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff6ebfaa8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#62 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7ffff6ebfa90, func=0x7ffff704efc0, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#63 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae2060, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#64 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae2060, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#65 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd5e48418, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#66 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffd5e48418, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#67 0x0000555555739420 in _PyObject_FastCallDictTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff76ae480, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:152
#68 0x000055555576d399 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, callable=callable@entry=0x7ffff76ae480, obj=obj@entry=0x7ffff6de5da0, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwargs=kwargs@entry=0x7fffd7cf2800) at /usr/local/src/conda/python-3.11.9/Objects/call.c:482
#69 0x000055555583f108 in slot_tp_call (self=self@entry=0x7ffff6de5da0, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwds=0x7fffd7cf2800) at /usr/local/src/conda/python-3.11.9/Objects/typeobject.c:7624
#70 0x000055555576f63e in _PyObject_Call (kwargs=<optimized out>, args=0x555555ab65f8 <_PyRuntime+58904>, callable=0x7ffff6de5da0, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:343
#71 PyObject_Call (callable=0x7ffff6de5da0, args=0x555555ab65f8 <_PyRuntime+58904>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:355
#72 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x7fffd7cf2800, callargs=0x555555ab65f8 <_PyRuntime+58904>, func=0x7ffff6de5da0, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#73 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1d00, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#74 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1d00, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#75 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7fffe92ad618, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#76 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7fffe92ad618, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#77 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7fffe92ad600, func=0x7ffff704ed40, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#78 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1a98, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#79 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1a98, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#80 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff715b878, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#81 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff715b878, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#82 0x0000555555739420 in _PyObject_FastCallDictTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff76ae480, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:152
#83 0x000055555576d399 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, callable=callable@entry=0x7ffff76ae480, obj=obj@entry=0x7ffff6de5f30, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwargs=kwargs@entry=0x7fffd5dd8800) at /usr/local/src/conda/python-3.11.9/Objects/call.c:482
#84 0x000055555583f108 in slot_tp_call (self=0x7ffff6de5f30, args=0x555555ab65f8 <_PyRuntime+58904>, kwds=0x7fffd5dd8800) at /usr/local/src/conda/python-3.11.9/Objects/typeobject.c:7624
#85 0x00005555557348b3 in _PyObject_MakeTpCall (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff6de5f30, args=<optimized out>, nargs=0, keywords=0x7ffff735d180) at /usr/local/src/conda/python-3.11.9/Objects/call.c:214
#86 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1a00, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#87 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1a00, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#88 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff6ebcda8, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#89 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff6ebcda8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#90 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7ffff6ebcd90, func=0x7ffff704fe20, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#91 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1798, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#92 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1798, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#93 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff715b898, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#94 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff715b898, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#95 0x0000555555739420 in _PyObject_FastCallDictTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff76ae480, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:152
#96 0x000055555576d399 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, callable=callable@entry=0x7ffff76ae480, obj=obj@entry=0x7ffff6de6020, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwargs=kwargs@entry=0x7ffff6e91480) at /usr/local/src/conda/python-3.11.9/Objects/call.c:482
#97 0x000055555583f108 in slot_tp_call (self=0x7ffff6de6020, args=0x555555ab65f8 <_PyRuntime+58904>, kwds=0x7ffff6e91480) at /usr/local/src/conda/python-3.11.9/Objects/typeobject.c:7624
#98 0x00005555557348b3 in _PyObject_MakeTpCall (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff6de6020, args=<optimized out>, nargs=0, keywords=0x7ffff75a90f0) at /usr/local/src/conda/python-3.11.9/Objects/call.c:214
#99 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae15f0, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#100 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae15f0, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#101 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff6ebcaa8, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#102 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff6ebcaa8, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#103 0x00005555557466e4 in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7ffff6ebca90, func=0x7ffff704fce0, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
#104 _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1388, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:5376
#105 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1388, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#106 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7ffff715b858, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#107 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff715b858, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#108 0x0000555555739420 in _PyObject_FastCallDictTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff76ae480, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:152
#109 0x000055555576d399 in _PyObject_Call_Prepend (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, callable=callable@entry=0x7ffff76ae480, obj=obj@entry=0x7ffff6de5350, args=args@entry=0x555555ab65f8 <_PyRuntime+58904>, kwargs=kwargs@entry=0x7ffff6eab5c0) at /usr/local/src/conda/python-3.11.9/Objects/call.c:482
#110 0x000055555583f108 in slot_tp_call (self=0x7ffff6de5350, args=0x555555ab65f8 <_PyRuntime+58904>, kwds=0x7ffff6eab5c0) at /usr/local/src/conda/python-3.11.9/Objects/typeobject.c:7624
#111 0x00005555557348b3 in _PyObject_MakeTpCall (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffff6de5350, args=<optimized out>, nargs=0, keywords=0x7ffff75a97b0) at /usr/local/src/conda/python-3.11.9/Objects/call.c:214
#112 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae11b8, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#113 0x00005555557f9a8d in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae11b8, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#114 _PyEval_Vector (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, func=func@entry=0x7ffff7c21f80, locals=locals@entry=0x7ffff7c3eb80, args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#115 0x00005555557f911f in PyEval_EvalCode (co=co@entry=0x7ffff6fdddf0, globals=globals@entry=0x7ffff7c3eb80, locals=locals@entry=0x7ffff7c3eb80) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:1148
#116 0x00005555558106ee in builtin_exec_impl (module=<optimized out>, closure=<optimized out>, locals=0x7ffff7c3eb80, globals=0x7ffff7c3eb80, source=0x7ffff6fdddf0) at /usr/local/src/conda/python-3.11.9/Python/bltinmodule.c:1077
#117 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/clinic/bltinmodule.c.h:465
#118 0x000055555574efbf in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7ffff7bd0f90, args=0x7ffff7ae1180, nargsf=<optimized out>, kwnames=0x0) at /usr/local/src/conda/python-3.11.9/Include/cpython/methodobject.h:52
#119 0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff7bd0f90, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#120 PyObject_Vectorcall (callable=0x7ffff7bd0f90, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:299
#121 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=tstate@entry=0x555555ad0998 <_PyRuntime+166328>, frame=<optimized out>, frame@entry=0x7ffff7ae1020, throwflag=throwflag@entry=0) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#122 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7ae1020, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#123 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7ffff779ea98, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#124 _PyFunction_Vectorcall (func=<optimized out>, stack=0x7ffff779ea98, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#125 0x0000555555823158 in pymain_run_module (modname=<optimized out>, set_argv0=set_argv0@entry=1) at /usr/local/src/conda/python-3.11.9/Modules/main.c:300
#126 0x0000555555822ad9 in pymain_run_python (exitcode=0x7fffffffc724) at /usr/local/src/conda/python-3.11.9/Modules/main.c:595
#127 Py_RunMain () at /usr/local/src/conda/python-3.11.9/Modules/main.c:680
#128 0x00005555557e9027 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.9/Modules/main.c:734
#129 0x00007ffff7cd11ca in __libc_start_call_main (main=main@entry=0x5555557e8f80 <main>, argc=argc@entry=4, argv=argv@entry=0x7fffffffc988) at ../sysdeps/nptl/libc_start_call_main.h:58
#130 0x00007ffff7cd128b in __libc_start_main_impl (main=0x5555557e8f80 <main>, argc=4, argv=0x7fffffffc988, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffc978) at ../csu/libc-start.c:360
#131 0x00005555557e8ecd in _start ()

@oleksandr-pavlyk
Collaborator Author

I modified the test script, saved as test_hang.py, so that it also prints the data type, device, and PID on the first iteration for each data type:

import pytest
import os

import dpctl.tensor as dpt
from dpctl.tests.helper import get_queue_or_skip, skip_if_dtype_not_supported
from dpctl.tests.elementwise.utils import _integral_dtypes

@pytest.mark.parametrize("iters", range(1, 2001))
@pytest.mark.parametrize("dtype", ["?"] + _integral_dtypes)
def test_bitwise_xor_inplace_python_scalar(dtype, iters):
    assert iters
    q = get_queue_or_skip()
    skip_if_dtype_not_supported(dtype, q)
    if iters == 1:
        print((dtype, q.sycl_device, os.getpid()))
    X = dpt.zeros((10, 10), dtype=dtype, sycl_queue=q)
    dt_kind = X.dtype.kind
    if dt_kind == "b":
        X ^= False
    else:
        X ^= int(0)

Then I created a shell script that runs the test repeatedly to reproduce the hang:

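#!/bin/bash
# Expose only CPU devices of the OpenCL backend to SYCL
export ONEAPI_DEVICE_SELECTOR=opencl:cpu

# Repeat the test 61 times, pinning each run to cores 0-3;
# -s disables pytest output capture so the printed PID is visible
for i in $(seq 0 60);
do
    echo "Run ${i}"
    taskset -c 0-3 python -m pytest -s test_hang.py
done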
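The same loop can be sketched in Python so that a hung run is detected automatically instead of silently stalling the loop. This is a minimal sketch, not part of dpctl: `run_until_hang`, `CMD`, and `ENV` are hypothetical names, and the timeout is an assumed upper bound for a healthy run. Because `taskset` replaces itself with the command it launches, the reported PID should be that of the pytest process.

```python
import os
import subprocess
import sys

# Hypothetical equivalents of the shell script's settings.
ENV = dict(os.environ, ONEAPI_DEVICE_SELECTOR="opencl:cpu")
CMD = ["taskset", "-c", "0-3", sys.executable, "-m", "pytest", "-s", "test_hang.py"]


def run_until_hang(cmd, runs, timeout, env=None):
    """Run `cmd` up to `runs` times.

    Returns the Popen object of the first run that is still alive after
    `timeout` seconds (left running so gdb can attach to it), or None if
    every run finished in time.
    """
    for i in range(runs):
        print(f"Run {i}", flush=True)
        proc = subprocess.Popen(cmd, env=env)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Do not kill the process: its state is what we want to inspect.
            print(f"Run {i} still running after {timeout} s; "
                  f"attach with: gdb -p {proc.pid}", flush=True)
            return proc
    return None
```

For example, `run_until_hang(CMD, runs=61, timeout=600, env=ENV)` mirrors the 61 iterations of the shell loop; the 600 s value is an assumption, not a measured run time.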

Once a run hangs (the hang can occur for any data type), attach gdb to the hung process using the printed PID.
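As an alternative to stepping through threads interactively, gdb's batch mode can capture every thread's backtrace in one pass. A minimal sketch, assuming gdb is on `PATH` and the kernel permits attaching to the process (for example, running as root or with `kernel.yama.ptrace_scope` set to 0); `gdb_backtrace_cmd` and `dump_backtraces` are hypothetical helper names.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command line that attaches to `pid`, lists its threads,
    prints a backtrace for every thread, then detaches and exits."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "info threads",
        "-ex", "thread apply all bt",
        "-ex", "detach",
    ]


def dump_backtraces(pid, timeout=300):
    """Run gdb against the hung process and return its combined output.

    Assumes gdb is installed and ptrace attach to `pid` is permitted.
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For the session below, `print(dump_backtraces(1458892))` would target the main thread's LWP, which on Linux equals the process PID.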

In gdb, the thread states are as follows:

(gdb) info threads
  Id   Target Id                                    Frame
* 1    Thread 0x7fd8a1f4db80 (LWP 1458892) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7fd85f2716c0 (LWP 1458911) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fd85fa726c0 (LWP 1458910) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7fd8602736c0 (LWP 1458909) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  5    Thread 0x7fd860a746c0 (LWP 1458908) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6    Thread 0x7fd899bfd6c0 (LWP 1458896) "python" 0x00007fd8a1fecd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x7fd89c92c860 <thread_status+352>) at ./nptl/futex-internal.c:57
  7    Thread 0x7fd89a3fe6c0 (LWP 1458895) "python" 0x00007fd8a1fecd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x7fd89c92c7e0 <thread_status+224>) at ./nptl/futex-internal.c:57
  8    Thread 0x7fd89abff6c0 (LWP 1458894) "python" 0x00007fd8a1fecd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x7fd89c92c760 <thread_status+96>) at ./nptl/futex-internal.c:57

Threads 6-8 were spawned by OpenBLAS, as their backtraces show (thread 6 below):

(gdb) thread 6
[Switching to thread 6 (Thread 0x7fd899bfd6c0 (LWP 1458896))]
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/futex-internal.c.
#0  0x00007fd8a1fecd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fd89c92c860 <thread_status+352>)
    at ./nptl/futex-internal.c:57
warning: 57     ./nptl/futex-internal.c: No such file or directory
(gdb) bt
#0  0x00007fd8a1fecd61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fd89c92c860 <thread_status+352>)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7fd89c92c860 <thread_status+352>) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd89c92c860 <thread_status+352>, expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fd8a1fef7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fd89c92c810 <thread_status+272>, cond=0x7fd89c92c838 <thread_status+312>)
    at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7fd89c92c838 <thread_status+312>, mutex=0x7fd89c92c810 <thread_status+272>) at ./nptl/pthread_cond_wait.c:627
#5  0x00007fd89b880b8b in blas_thread_server ()
   from /home/opavlyk/mamba/envs/dev_dpctl/lib/python3.12/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#6  0x00007fd8a1ff0a94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007fd8a207dc3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The top of the thread 1 backtrace is:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fd871a2a244 in tbb::detail::r1::futex_wait (futex=0x7fffc288a2b8, comparand=2) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:101
#2  tbb::detail::r1::binary_semaphore::P (this=0x7fffc288a2b8) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/semaphore.h:253
#3  tbb::detail::r1::sleep_node<unsigned long>::wait (this=0x7fffc288a290) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:170
#4  0x00007fd871a287a3 in tbb::detail::r1::concurrent_monitor_base<unsigned long>::commit_wait (this=0x7fd8707e23a8, node=...)
    at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/concurrent_monitor.h:232
#5  tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /localdisk/tmp/onetbb-ci/onetbb_source_code/src/tbb/arena.cpp:797
#6  0x00007fd87d5cd0f4 in Intel::OpenCL::TaskExecutor::out_of_order_command_list::WaitForIdle() () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#7  0x00007fd87d5cc1f2 in Intel::OpenCL::TaskExecutor::base_command_list::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::TaskExecutor::ITaskBase> const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#8  0x00007fd87d5a43f5 in Intel::OpenCL::CPUDevice::CPUDevice::clDevCommandListWaitCompletion(void*, cl_dev_cmd_desc*) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#9  0x00007fd87d55d149 in Intel::OpenCL::Framework::IOclCommandQueueBase::WaitForCompletion(Intel::OpenCL::Utils::SharedPtr<Intel::OpenCL::Framework::QueueEvent> const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#10 0x00007fd87d52b2a4 in Intel::OpenCL::Framework::EventsManager::WaitForEvents(unsigned int, _cl_event* const*, bool) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#11 0x00007fd87d540033 in Intel::OpenCL::Framework::ExecutionModule::WaitForEvents(unsigned int, _cl_event* const*) () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#12 0x00007fd87d499856 in clWaitForEvents () from /opt/intel/oneapi/compiler/2025.0/lib/libintelocl.so
#13 0x00007fd87f83736a in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0
#14 0x00007fd89da80c27 in urEventWait () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#15 0x00007fd89e364aef in sycl::_V1::detail::event_impl::waitInternal(bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#16 0x00007fd89e364fa2 in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#17 0x00007fd89e46776e in sycl::_V1::event::wait() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#18 0x00007fd892786ec4 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so
#19 0x00007fd892344849 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so
#20 0x00007fd8923275b7 in ?? () from /home/opavlyk/repos/dpctl/dpctl/tensor/_tensor_impl.cpython-312-x86_64-linux-gnu.so

Frame 2 of the backtrace indicates that the hang is in the loop of the binary_semaphore::P() method (https://github.com/uxlfoundation/oneTBB/blob/v2022.0.0/src/tbb/semaphore.h#L253), which for some reason never terminates.
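Frame 1 shows futex_wait(futex=..., comparand=2) called from binary_semaphore::P. A small model of a futex-based binary semaphore helps in reasoning about how such a loop can block indefinitely. This is a Python sketch that assumes the linked P()/V() follow the three-state futex mutex from Ulrich Drepper's "Futexes Are Tricky": 0 = released, 1 = not released, 2 = not released and a waiter may be sleeping. `FutexWord` is a hypothetical stand-in that emulates futex wait and wake with a `threading.Condition`. The model shows that P() leaves its loop only after some V() stores 0. A V() that is missed or never issued therefore leaves the waiter in wait(2) forever, which is consistent with the observed stack.

```python
import threading


class FutexWord:
    """Toy stand-in for a futex-backed int; the lock makes each op atomic."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._cond = threading.Condition()

    def compare_exchange(self, expected: int, desired: int) -> int:
        with self._cond:
            old = self._value
            if old == expected:
                self._value = desired
            return old

    def exchange(self, desired: int) -> int:
        with self._cond:
            old, self._value = self._value, desired
            return old

    def wait(self, comparand: int) -> None:
        # Like futex_wait: sleep only if the word still equals `comparand`.
        with self._cond:
            if self._value == comparand:
                self._cond.wait()

    def wake_one(self) -> None:
        with self._cond:
            self._cond.notify(1)


class BinarySemaphore:
    """Assumed states: 0 = released, 1 = taken, 2 = taken with a sleeper."""

    def __init__(self) -> None:
        self.word = FutexWord(1)

    def P(self) -> None:
        s = self.word.compare_exchange(0, 1)
        if s != 0:
            if s != 2:
                s = self.word.exchange(2)
            while s != 0:  # the loop also absorbs spurious wake-ups
                self.word.wait(2)
                s = self.word.exchange(2)

    def V(self) -> None:
        if self.word.exchange(0) == 2:
            self.word.wake_one()


if __name__ == "__main__":
    sem = BinarySemaphore()
    threading.Timer(0.05, sem.V).start()
    sem.P()  # returns once V() has stored 0 (and woken the sleeper)
    print("released")

    stuck = BinarySemaphore()
    waiter = threading.Thread(target=stuck.P, daemon=True)
    waiter.start()
    waiter.join(timeout=0.2)
    print(waiter.is_alive())  # True: with no V(), P() sleeps in wait(2)
```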

The registers in frame 0 (syscall.S:38) are as follows:

(gdb) info registers
rax            0xfffffffffffffe00  -512
rbx            0x7fffc288a290      140736457122448
rcx            0x7fd8a207b25d      140568408076893
rdx            0x2                 2
rsi            0x80                128
rdi            0x7fffc288a2b8      140736457122488
rbp            0x7fffc288a440      0x7fffc288a440
rsp            0x7fffc288a1d8      0x7fffc288a1d8
r8             0x0                 0
r9             0x7fd800000000      140565689663488
r10            0x0                 0
r11            0x346               838
r12            0xffffffffffffffff  -1
r13            0x7fffc288a298      140736457122456
r14            0x7fffc288a2b8      140736457122488
r15            0x7fd8707e23b8      140567576978360
rip            0x7fd8a207b25d      0x7fd8a207b25d <syscall+29>
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0xc2c0c0c2          3267412162
k1             0x0                 0
k2             0x7f                127
k3             0xffffff80          4294967168
k4             0x0                 0
k5             0x1                 1
k6             0x0                 0
k7             0x1                 1
fs_base        0x7fd8a1f4db80      140568406842240
gs_base        0x0                 0

The value of the %rax register (-512) is suspicious, as it does not correspond to any of the documented error codes (https://lwn.net/Articles/22172/) defined in https://elixir.bootlin.com/linux/v6.12.3/source/include/uapi/asm-generic/errno.h.
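A raw syscall reports failure by returning a value in [-4095, -1], i.e. -errno, so the raw %rax value must first be read as a signed 64-bit integer. The sketch below does that and looks up the magnitude in Python's errno.errorcode table, which contains only user-space codes. Note that 512 coincides with ERESTARTSYS, a kernel-internal code from include/linux/errno.h (not the uapi header). The kernel uses it when a blocking syscall is interrupted, for example by a debugger attaching. Whether that accounts for the value seen here is an assumption, not something established in this thread. The helper names are hypothetical.

```python
import errno


def as_signed64(value: int) -> int:
    """Reinterpret a 64-bit register value as a two's-complement integer."""
    return value - (1 << 64) if value & (1 << 63) else value


def describe_syscall_return(rax: int) -> str:
    ret = as_signed64(rax)
    # Raw syscalls signal failure with a value in [-4095, -1], i.e. -errno.
    if -4095 <= ret <= -1:
        name = errno.errorcode.get(-ret)
        return f"-{-ret} ({name})" if name else f"-{-ret} (not a user-space errno)"
    return f"success: {ret}"


print(describe_syscall_return(0xfffffffffffffe00))  # -512 (not a user-space errno)
```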

So this hang is likely an adverse interaction between the worker threads spawned by openBLAS (blas_thread_server in threads 6-8) and TBB threads.
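To see which threading runtimes actually share the process, one can list the shared objects mapped into it. This shows, for example, whether libgomp is loaded at all, or only openBLAS with its own thread pool. Below is a minimal Linux-only sketch. It assumes it is run inside the affected interpreter, or fed the contents of /proc/<pid>/maps of the stuck pytest process. The `RUNTIMES` patterns and the `loaded_runtimes` helper are illustrative, not an exhaustive list.

```python
import re
from pathlib import Path

# Hypothetical patterns for the runtimes discussed in this issue.
RUNTIMES = {
    "openBLAS": re.compile(r"openblas", re.IGNORECASE),
    "GNU OpenMP": re.compile(r"libgomp"),
    "Intel OpenMP": re.compile(r"libiomp5"),
    "oneTBB": re.compile(r"libtbb"),
    "Intel OpenCL CPU runtime": re.compile(r"libintelocl"),
}


def loaded_runtimes(maps_text: str) -> dict[str, set[str]]:
    """Group mapped file paths (6th column of /proc/<pid>/maps) by runtime."""
    found: dict[str, set[str]] = {}
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[5].startswith("/"):
            continue  # anonymous mapping, [heap], [stack], ...
        path = fields[5]
        for name, pattern in RUNTIMES.items():
            if pattern.search(path):
                found.setdefault(name, set()).add(path)
    return found


if __name__ == "__main__" and Path("/proc/self/maps").exists():
    maps = Path("/proc/self/maps").read_text()
    for name, paths in sorted(loaded_runtimes(maps).items()):
        print(name, sorted(paths))
```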

It might be instructive to try to reproduce the hang with NumPy from the Intel channel, which is linked against Intel MKL, and to run the test with MKL_THREADING_LAYER=TBB.
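Because the hang is intermittent, the experiment needs repeated runs with a time limit. Below is a minimal driver sketch. It assumes pytest and the test_bitwise_xor.py reproducer from earlier in this issue are available. The environment variables are the ones used in this issue, and the `run_once` helper is hypothetical. subprocess.run kills the child on timeout, so to attach gdb to a hung run one would instead poll a subprocess.Popen and leave the process alive.

```python
import os
import subprocess
import sys


def run_once(cmd: list[str], extra_env: dict[str, str], timeout_s: float) -> str:
    """Run `cmd` once with extra environment variables; report 'hang' on timeout."""
    env = {**os.environ, **extra_env}
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s,
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:  # run() has already killed the child
        return "hang"
    return "ok" if proc.returncode == 0 else f"failed ({proc.returncode})"


if __name__ == "__main__" and os.path.exists("test_bitwise_xor.py"):
    env = {
        "CL_CONFIG_CPU_TARGET_ARCH": "corei7-avx",
        "ONEAPI_DEVICE_SELECTOR": "*:cpu",
        "MKL_THREADING_LAYER": "TBB",
    }
    cmd = [sys.executable, "-m", "pytest", "-q", "test_bitwise_xor.py"]
    for attempt in range(1, 11):
        outcome = run_once(cmd, env, timeout_s=600)
        print(f"attempt {attempt}: {outcome}")
        if outcome == "hang":
            break
```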

@oleksandr-pavlyk
Collaborator Author

The hang is still reproducible when using NumPy linked against MKL rather than openBLAS.

We could not reproduce the hang on a NUC with a Tiger Lake CPU running Ubuntu 22.04 and NumPy powered by openBLAS.
