Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default execution for single_file.py`` fails when connecting to nats #35

Open
piotrm-nvidia opened this issue Jan 17, 2025 · 2 comments

Comments

@piotrm-nvidia
Copy link
Contributor

The single_file.py fails with default execution:

root@ulalegionbuntu:/workspace/examples/hello_world# python3 single_file.py 
Starting Workers
20:23:42 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x714b1c759080>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_CPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50000)
	<SpawnProcess name='encoder.0' parent=130 initial>

20:23:42 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x714b1c759080>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='decoder',
                                       implementation=<class 'triton_distributed.worker.triton_core_operator.TritonCoreOperator'>,
                                       repository='/workspace/examples/hello_world/operators/triton_core_models',
                                       version=1,
                                       max_inflight_requests=1,
                                       parameters={'config': {'instance_group': [{'count': 1,
                                                                                  'kind': 'KIND_GPU'}],
                                                              'parameters': {'delay': {'string_value': '0'}}}},
                                       log_level=None)],
             triton_log_path=None,
             name='decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50001)
	<SpawnProcess name='decoder.0' parent=130 initial>

20:23:42 deployment.py:115[triton_distributed.worker.deployment] INFO: 

Starting Worker:

	Config:
	WorkerConfig(request_plane=<class 'triton_distributed.icp.nats_request_plane.NatsRequestPlane'>,
             data_plane=<function UcpDataPlane at 0x714b1c759080>,
             request_plane_args=(['nats://localhost:4223'], {}),
             data_plane_args=([], {}),
             log_level=1,
             operators=[OperatorConfig(name='encoder_decoder',
                                       implementation=<class '__main__.EncodeDecodeOperator'>,
                                       repository=None,
                                       version=1,
                                       max_inflight_requests=100,
                                       parameters={},
                                       log_level=None)],
             triton_log_path=None,
             name='encoder_decoder.0',
             log_dir='/workspace/examples/hello_world/logs',
             metrics_port=50002)
	<SpawnProcess name='encoder_decoder.0' parent=130 initial>

Sending Requests
nats: encountered error
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/client.py", line 1353, in _select_next_server
    await self._transport.connect(
  File "/usr/local/lib/python3.12/dist-packages/nats/aio/transport.py", line 121, in connect
    r, w = await asyncio.wait_for(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/streams.py", line 48, in open_connection
    transport, _ = await loop.create_connection(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1122, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1104, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1007, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/selector_events.py", line 691, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('127.0.0.1', 4223)
Sending Requests:   0%|                                                                                                                | 0/100 [00:00<?, ?request/s]

@piotrm-nvidia
Copy link
Contributor Author

#34

The control plane implementation doesn't support other address for NATS.io than localhost. My machine is configured in such a way that default bind for 0.0.0.0 used in NATS.io is not binding to 127.0.0.1 but to real WiFi port. I can't pass host to control plane so I can't run example.

@nnshah1
Copy link
Contributor

nnshah1 commented Jan 19, 2025

@piotrm-nvidia - is there a way we can detect the binding from the NatsServer class directly and then use that to encapsulate the uri?

For the single_file example - would like to see if we can avoid any arguments .... maybe we could use environment variables - but goal for single file is to avoid any customization outside of the file - maybe we can add an arg in the file itself - or in the NatsServer class to toggle between localhost and others ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants