Setup for Actor-Learner API for Distributed Collection and Training in a Cluster #676

Open
JCMiles opened this issue Nov 17, 2021 · 2 comments

JCMiles commented Nov 17, 2021

Hi Team, I'm trying to run the Actor-Learner API for Distributed Collection and Training as explained here:
https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/distributed/examples/sac
but on multiple machines.

Based on the Reverb docs, let's say I have 3 machines:
A > IP: 227.57.48.210
B > IP: 227.57.48.211
C > IP: 227.57.48.212

    1. On machine A -> run sac_reverb_server.py on port 8008

    2. On machine B -> run sac_collect.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'

    3. On machine C -> run sac_train.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'
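For context, here is a minimal sketch of what I understand the Reverb server job on machine A to be doing; the table names, sizes, and selectors below are my assumptions, not copied from sac_reverb_server.py:

```python
import reverb

# Assumed replay table for experience collected on machine B.
replay_table = reverb.Table(
    name='uniform_table',
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=1000000,
    rate_limiter=reverb.rate_limiters.MinSize(1))

# Assumed variable-container table holding the latest train-job variables.
variable_table = reverb.Table(
    name='variable_container',
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=1,
    rate_limiter=reverb.rate_limiters.MinSize(1))

# One server hosts both tables, which is why both flags on machines B
# and C can point at the same address, 227.57.48.210:8008.
server = reverb.Server(tables=[replay_table, variable_table], port=8008)
server.wait()  # Block forever, serving the tables.
```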

But in the example from the link above, both sac_reverb_server.py and sac_collect.py wait for sac_train.py to write the policies into a given folder on the same machine before running their respective operations.
In a multi-machine setup, how can sac_reverb_server.py and sac_collect.py be told where to load the policy from?
Is there a tf_agents built-in function or a defined procedure to manage that, or does this need to be implemented from scratch with a custom script?
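To make the question concrete, the wait I mean looks roughly like this in the collect job; the exact directory layout is my assumption:

```python
import os
from tf_agents.train.utils import train_utils

# The collect and reverb-server jobs poll a local policy directory until
# sac_train.py has saved the policy there. root_dir is just an example.
root_dir = '/tmp/sac'
policy_dir = os.path.join(root_dir, 'policies', 'greedy_policy')

# wait_for_policy blocks until a valid saved-model policy shows up in
# policy_dir, then loads and returns it. On a single machine this works
# because all jobs share the same filesystem.
policy = train_utils.wait_for_policy(policy_dir, load_specs_from_pbtxt=True)
```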

@tfboyd tfboyd self-assigned this Nov 19, 2021

tfboyd commented Nov 19, 2021

@JCMiles I have failed before at setting expectations, so I'll be careful here. I am looking at distributed training (using multiple machines) on Google Cloud. To share the model, I think you can use a GCS bucket as the location. I am 99% sure we are using the TensorFlow checkpoint reader, which supports a number of network storage options that are not file-system native, e.g. GCS and maybe even S3 (although when I was doing AWS I just mounted a shared disk...it has been 3+ years since I had AWS knowledge). I suspect you want something that scales larger, but you can get a lot of scale out of a single machine by spinning up a bunch of agents on it.
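For example, something like this should work from any of the three machines, assuming GCS credentials are set up; the bucket name is hypothetical:

```python
import os
import tensorflow as tf

# Because TF-Agents reads and writes policies through tf.io.gfile,
# pointing root_dir at a GCS bucket lets all three machines see the
# same policy files without a shared local disk.
root_dir = 'gs://my-rl-experiments/sac_run_0'
policy_dir = os.path.join(root_dir, 'policies', 'greedy_policy')

# tf.io.gfile treats gs:// like a local path (given GCS credentials),
# so the same polling/loading code works unchanged across machines.
if tf.io.gfile.exists(policy_dir):
    files = tf.io.gfile.listdir(policy_dir)
    print('Policy files visible from this machine:', files)
```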

I hope that, thanks to this other project, I can update those documents. But if you want to and can, I am happy to chat back and forth to try to help here. I assigned myself so I should see comments, and you can also @-mention me.


JCMiles commented Dec 1, 2021

@tfboyd sorry for the delay, I just saw this. Thanks for your effort. It would be amazing to have clear documentation about the correct setup for this type of training pipeline. My timezone is UTC+2, so let me know what time fits you best. I'm available to chat tomorrow and on Friday.
