
Deadlock in hello world example with UCX error #39

Open
piotrm-nvidia opened this issue Jan 17, 2025 · 0 comments


piotrm-nvidia commented Jan 17, 2025

User branch: https://github.com/triton-inference-server/triton-distributed/tree/piotrm-add-nats-hosts

Start the NATS.io server:

nats-server -js

Start the example, passing the default host and port:

python3 single_file.py  --request-plane-uri htp://127.0.0.1:4222
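Note that the `--request-plane-uri` in the reproduction command uses the scheme `htp` rather than `nats`. A quick check with Python's `urllib.parse` (a hedged illustration, not part of the example's code) shows that the malformed scheme still parses cleanly, so it could slip past naive URI handling:

```python
from urllib.parse import urlparse

# The URI exactly as passed on the command line (note "htp", not "nats"):
uri = "htp://127.0.0.1:4222"
parts = urlparse(uri)
print(parts.scheme, parts.hostname, parts.port)  # → htp 127.0.0.1 4222
```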

The log indicates that no request was processed:

Starting Workers
22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50000)
	<SpawnProcess name='encoder.0' parent=2466 initial>

22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='decoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_GPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50001)
	<SpawnProcess name='decoder.0' parent=2466 initial>

22:01:56 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x769741a60f40>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder_decoder',
                                       implementation=<class '__main__.EncodeDecodeOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=100,
                                       parameters={},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder_decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50002)
	<SpawnProcess name='encoder_decoder.0' parent=2466 initial>

Sending Requests
Sending Requests:   0%| 

Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stdout.log:

22:03:14 single_file.py:55[OPERATOR('encoder_decoder', 1)] INFO: got request!

Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stderr.log:

future: <Task finished name='Task-21' coro=<Worker._process_request_task() done, defined at /workspace/worker/src/python/triton_distributed/worker/worker.py:176> exception=DataPlaneError('Error Referencing Tensor:\n{remote_tensor}')>
Traceback (most recent call last):
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 400, in _create_remote_tensor_reference
    endpoint = await self._create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/icp/src/python/triton_distributed/icp/ucp_data_plane.py", line 497, in _create_endpoint
    endpoint = await ucp.create_endpoint(host, port)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 1016, in create_endpoint
    return await _get_ctx().create_endpoint(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 328, in create_endpoint
    peer_info = await exchange_peer_info(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ucp/core.py", line 60, in exchange_peer_info
    await asyncio.wait_for(
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
ucp._libs.exceptions.UCXUnreachable: <stream_recv>:

The above exception was the direct cause of the following exception:
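The `UCXUnreachable` surfaces out of `exchange_peer_info`, which wraps the peer handshake in `asyncio.wait_for`. A minimal sketch of that failure mode in plain asyncio, with a stand-in coroutine in place of the real UCX handshake (here the symptom is a `TimeoutError`, whereas UCX raises its own `UCXUnreachable` from the failed receive):

```python
import asyncio

async def exchange_peer_info_stub():
    # Stand-in for ucp.core.exchange_peer_info: the peer never answers.
    await asyncio.sleep(3600)

async def main():
    try:
        await asyncio.wait_for(exchange_peer_info_stub(), timeout=0.1)
    except asyncio.TimeoutError:
        return "peer unreachable"

print(asyncio.run(main()))  # → peer unreachable
```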

Network configuration in Docker:

root@ulalegionbuntu:/workspace/examples/hello_world/logs# ifconfig 
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:e8:c4:ef:de  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 88:a4:c2:b6:c3:a5  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3155  bytes 853385 (853.3 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3155  bytes 853385 (853.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlp4s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.9.193  netmask 255.255.252.0  broadcast 192.168.11.255
        inet6 fe80::86a5:f982:b461:e36d  prefixlen 64  scopeid 0x20<link>
        ether c0:3c:59:4c:02:c0  txqueuelen 1000  (Ethernet)
        RX packets 33102  bytes 12013650 (12.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5005  bytes 1052106 (1.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

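For reference, 127.0.0.1 is only reachable from within the same network namespace. If the data plane advertises a loopback (or otherwise container-internal) address and the peer tries `create_endpoint` against it from a different namespace, the connection is unreachable. A quick hypothetical sanity check of what `localhost` resolves to in this environment:

```python
import socket

# Hypothetical check: in the configuration above, "localhost" resolves to
# loopback, which a peer outside this network namespace cannot reach.
print(socket.gethostbyname("localhost"))  # → 127.0.0.1
```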
All Python threads are stopped at idle:

root@ulalegionbuntu:/workspace/examples/hello_world# ps aux | grep python
root         648  0.0  0.0  17048 12416 pts/0    S    21:52   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root         652  0.6  1.5 10636956 451144 pts/0 Sl   21:52   0:04 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        1127  0.0  0.0  17048 12544 pts/0    S    21:52   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        1133  0.6  1.5 10636960 450712 pts/0 Sl   21:52   0:03 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        1609  0.0  0.0  17048 12288 pts/0    S    21:59   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        1610  1.2  2.5 9600068 719432 pts/0  Sl   21:59   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
root        2015  0.0  0.0  17048 12544 pts/0    S    22:00   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        2021  1.6  1.5 10637080 450428 pts/0 Sl   22:00   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        2466  3.8  0.9 9539144 282900 pts/0  Sl   22:01   0:03 python3 single_file.py --request-plane-uri htp://127.0.0.1:4222
root        2505  0.0  0.0  17048 12544 pts/0    S    22:01   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
root        2506  2.9  2.5 13794504 719692 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
root        2507  3.3  2.5 13794632 724220 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=23) --multiprocessing-fork
root        2511  3.0  2.5 11235536 716496 pts/0 Sl   22:01   0:02 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
root        2776  2.4  0.7 4448284 219448 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
root        2789  2.3  0.7 4448280 219508 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
root        2963  0.0  0.0   3532  1792 pts/0    S+   22:03   0:00 grep --color=auto python
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2466
Process 2466: python3 single_file.py --request-plane-uri htp://127.0.0.1:4222
Python v3.12.3 (/usr/bin/python3.12)

Thread 2466 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    <module> (hello_world/single_file.py:215)
Thread 2514 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2515 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2518 (idle): "Thread-2"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# ps aux | grep triton
root        2776  0.8  0.7 4448284 219448 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
root        2789  0.7  0.7 4448280 219508 pts/0  Sl   22:01   0:01 /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
root        2967  0.0  0.0   3532  1792 pts/0    S+   22:05   0:00 grep --color=auto triton
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 648 
Process 648: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 648 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 652
Process 652: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 652 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1127
Process 1127: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 1127 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1133
Process 1133: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 1133 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1609
Process 1609: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 1609 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 1610
Process 1610: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 1610 (idle): "MainThread"
    load (tritonserver/_api/_server.py:931)
    __init__ (worker/triton_core_operator.py:89)
    _import_operators (worker/worker.py:146)
    serve (worker/worker.py:256)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 1762 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 1768 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2015
Process 2015: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 2015 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2021
Process 2021: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2021 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2505
Process 2505: /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(18)
Python v3.12.3 (/usr/bin/python3.12)

Thread 2505 (idle): "MainThread"
    main (multiprocessing/resource_tracker.py:227)
    <module> (<string>:1)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2506
Process 2506: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=21) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2506 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:376)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2655 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2663 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2507
Process 2507: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=23) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2507 (idle): "MainThread"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2656 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2661 (idle): "Thread-1 (_run_event_loop)"
    select (selectors.py:468)
    _run_once (asyncio/base_events.py:1949)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2511
Process 2511: /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=19, pipe_handle=25) --multiprocessing-fork
Python v3.12.3 (/usr/bin/python3.12)

Thread 2511 (idle): "MainThread"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    create_input_tensor_reference (icp/ucp_data_plane.py:461)
    _set_model_infer_request_inputs (worker/remote_request.py:78)
    to_model_infer_request (worker/remote_request.py:199)
    async_infer (worker/remote_operator.py:108)
    execute (hello_world/single_file.py:56)
    _process_request (worker/worker.py:172)
    _process_request_task (worker/worker.py:188)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    start (worker/worker.py:370)
    _start_worker (worker/deployment.py:64)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 2654 (idle): "asyncio_0"
    _worker (concurrent/futures/thread.py:89)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 2662 (idle): "Thread-1 (_run_event_loop)"
    wait (threading.py:355)
    result (concurrent/futures/_base.py:451)
    release_tensor (icp/ucp_data_plane.py:491)
    __del__ (worker/remote_tensor.py:105)
    __enter__ (contextlib.py:132)
    inner (contextlib.py:80)
    _arm_worker (ucp/continuous_ucx_progress.py:98)
    _run (asyncio/events.py:88)
    _run_once (asyncio/base_events.py:1987)
    run_forever (asyncio/base_events.py:641)
    run_until_complete (asyncio/base_events.py:674)
    run (asyncio/runners.py:118)
    run (asyncio/runners.py:194)
    _run_event_loop (icp/ucp_data_plane.py:150)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
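The stack of thread 2662 is the telling one: `remote_tensor.__del__` fires inside the UCX event-loop thread itself, and `release_tensor` then blocks on a `concurrent.futures` result that only that same loop could produce. A minimal sketch of that pattern in plain asyncio (hypothetical names, assuming `release_tensor` uses something like `run_coroutine_threadsafe(...).result()` under the hood):

```python
import asyncio
import concurrent.futures
import threading
import time

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def release():
    # Stand-in for the async release of a remote tensor.
    return "released"

def release_tensor():
    # Mirrors the suspected pattern: submit to the data-plane loop,
    # then block the calling thread until the result arrives.
    fut = asyncio.run_coroutine_threadsafe(release(), loop)
    return fut.result(timeout=0.5)

# From a foreign thread this works: the loop is free to run release().
print(release_tensor())  # → released

# From inside the loop's own thread it deadlocks: the loop is stuck in
# this callback, so release() never runs and .result() never returns.
def on_loop_thread():
    try:
        release_tensor()
        print("released (unexpected)")
    except concurrent.futures.TimeoutError:
        print("deadlock (timed out waiting on our own loop)")

loop.call_soon_threadsafe(on_loop_thread)
time.sleep(1.0)
```

Without the `timeout=0.5` the second call would hang forever, which matches the idle `wait (threading.py:355)` frames seen in processes 652, 1133, and 2021.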
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2776
Process 2776: /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/encoder/1/model.py triton_python_backend_shm_region_64d9fc50-5d6a-4fdc-882d-37e5a33bbcdc 1048576 1048576 2506 /opt/tritonserver/backends/python 336 encoder_0_0 DEFAULT
Python v3.12.3 (/opt/tritonserver/backends/python/triton_python_backend_stub)

Thread 2776 (idle): "MainThread"
root@ulalegionbuntu:/workspace/examples/hello_world# py-spy dump --pid 2789
Process 2789: /opt/tritonserver/backends/python/triton_python_backend_stub /workspace/examples/hello_world/operators/triton_core_models/decoder/1/model.py triton_python_backend_shm_region_a3dc8bb9-70b3-4f3b-a160-0a3ad569266d 1048576 1048576 2507 /opt/tritonserver/backends/python 336 decoder_0_0 DEFAULT
Python v3.12.3 (/opt/tritonserver/backends/python/triton_python_backend_stub)

Thread 2789 (idle): "MainThread"
