Distributed inference is easy with the help of the RPC backend. Basically, start one or more RPC server(s); they can then be used just like a backend device. Each RPC server corresponds to a single backend device. For example, if there are two GPUs in a box, then two RPC servers need to be started, one for each GPU.
Use `--serve_rpc SPEC` to start an RPC server. Examples of `SPEC`:

- `8080`: start a server on `0.0.0.0:8080` with device #0.
- `127.0.0.1:9000`: start a server on `127.0.0.1:9000` with device #0.
- `8080@1`: start a server on `0.0.0.0:8080` with device #1.
- `127.0.0.1:9000@1`: start a server on `127.0.0.1:9000` with device #1.
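For the two-GPU box mentioned above, this means launching two servers, one per device. A minimal sketch, assuming ports 9000 and 9001 (arbitrary choices, not defaults):

```
# server for GPU #0 on 0.0.0.0:9000
main --serve_rpc 9000

# server for GPU #1 on 0.0.0.0:9001
main --serve_rpc 9001@1
```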
Don't forget to use `--show_devices` to check device IDs, and `--log_level 2` to view more logs. Example:
```
main --serve_rpc 80 --log_level 2
I trying to start RPC server at 0.0.0.0:80, using
I Vulkan - Vulkan0 (NVIDIA GeForce ...)
I     type: GPU
I     memory total: .. B
I     memory free : .. B
```
After the RPC servers are started, they can be used. Use `--rpc_endpoints EPS` to register RPC servers (each one is called an endpoint). Each endpoint is specified by `HOST:PORT`; when `HOST` is omitted, `127.0.0.1` is assumed. Multiple endpoints are joined by `;`.
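For example, to register a local server together with one running on another machine (the host `192.168.1.5` and port `9000` here are just placeholders), quote the argument so the shell does not treat `;` as a command separator:

```
main --rpc_endpoints "9000;192.168.1.5:9000" --show_devices
```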
Use `--show_devices` to check if everything is okay:
```
main --rpc_endpoints 80 --show_devices
0: .....
1: RPC - RPC[127.0.0.1:80] (NVIDIA GeForce ...)
    type: GPU
    memory total: .. B
    memory free : .. B
2: CPU - CPU
...
```
Now, let's go with distributed inference:

```
main --rpc_endpoints 80 -ngl 1:all -m ...
```
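For the two-GPU box from the beginning, assuming both servers from the earlier sketch are listening on ports 9000 and 9001, both endpoints can be registered at once. Check the reported device IDs with `--show_devices` first, and adjust the `-ngl` mapping to match the IDs listed for your setup:

```
main --rpc_endpoints "9000;9001" --show_devices
main --rpc_endpoints "9000;9001" -ngl 1:all -m ...
```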