Setup for Actor-Learner API for Distributed Collection and Training in a Cluster #676

Open
JCMiles opened this issue Nov 17, 2021 · 2 comments

JCMiles commented Nov 17, 2021

Hi Team, I'm trying to run the Actor-Learner API for Distributed Collection and Training as explained here:
https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/distributed/examples/sac
but on multiple machines.

Based on the Reverb docs, let's say I have 3 machines:
A > IP: 227.57.48.210
B > IP: 227.57.48.211
C > IP: 227.57.48.212

    1. On machine A -> run sac_reverb_server.py on port 8008

    2. On machine B -> run sac_collect.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'

    3. On machine C -> run sac_train.py with:
          --replay_buffer_server_address='227.57.48.210:8008'
          --variable_container_server_address='227.57.48.210:8008'
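For context, here is a minimal sketch of what I understand the Reverb server job on machine A to be doing; the table names, sizes, and selectors below are my assumptions, not copied from sac_reverb_server.py:

```python
import reverb

# Assumed replay table for experience collected on machine B.
replay_table = reverb.Table(
    name='uniform_table',
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=1000000,
    rate_limiter=reverb.rate_limiters.MinSize(1))

# Assumed variable-container table holding the latest train-job variables.
variable_table = reverb.Table(
    name='variable_container',
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=1,
    rate_limiter=reverb.rate_limiters.MinSize(1))

# One server hosts both tables, which is why both flags on machines B
# and C can point at the same address, 227.57.48.210:8008.
server = reverb.Server(tables=[replay_table, variable_table], port=8008)
server.wait()  # Block forever, serving the tables.
```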

But in the example from the link above, both sac_reverb_server.py and sac_collect.py wait for sac_train.py to write the policies into a given folder on the same machine before running their respective operations.
In a multi-machine setup, how can sac_reverb_server.py and sac_collect.py be told where to load the policy from?
Is there a tf_agents built-in function or a defined procedure to manage that, or does this need to be implemented from scratch with a custom script?
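To make the question concrete, the wait I mean looks roughly like this in the collect job; the exact directory layout is my assumption:

```python
import os
from tf_agents.train.utils import train_utils

# The collect and reverb-server jobs poll a local policy directory until
# sac_train.py has saved the policy there. root_dir is just an example.
root_dir = '/tmp/sac'
policy_dir = os.path.join(root_dir, 'policies', 'greedy_policy')

# wait_for_policy blocks until a valid saved-model policy shows up in
# policy_dir, then loads and returns it. On a single machine this works
# because all jobs share the same filesystem.
policy = train_utils.wait_for_policy(policy_dir, load_specs_from_pbtxt=True)
```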

@tfboyd tfboyd self-assigned this Nov 19, 2021

tfboyd commented Nov 19, 2021

@JCMiles I have failed before at setting expectations, so I'll be careful here. I am looking at distributed training (using multiple machines) on Google Cloud. To share the model, I think you can use a GCS bucket as the location. I am 99% sure we are using the TensorFlow checkpoint reader, which supports a number of network storage options that are not file-system native, e.g. GCS and maybe even S3 (although when I was doing AWS I just mounted a shared disk...it has been 3+ years since I had AWS knowledge). I suspect you want something that scales larger, but you can get a lot of scale out of a single machine by spinning up a bunch of agents on it.
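For example, something like this should work from any of the three machines, assuming GCS credentials are set up; the bucket name is hypothetical:

```python
import os
import tensorflow as tf

# Because TF-Agents reads and writes policies through tf.io.gfile,
# pointing root_dir at a GCS bucket lets all three machines see the
# same policy files without a shared local disk.
root_dir = 'gs://my-rl-experiments/sac_run_0'
policy_dir = os.path.join(root_dir, 'policies', 'greedy_policy')

# tf.io.gfile treats gs:// like a local path (given GCS credentials),
# so the same polling/loading code works unchanged across machines.
if tf.io.gfile.exists(policy_dir):
    files = tf.io.gfile.listdir(policy_dir)
    print('Policy files visible from this machine:', files)
```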

I hope that, thanks to this other project, I can update those documents. But if you want to and can, I am happy to chat back and forth to try to help here. I assigned myself so I should see comments, and you can also @-mention me.


JCMiles commented Dec 1, 2021

@tfboyd sorry for the delay, I just saw this. Thanks for your effort. It would be amazing to have clear documentation about the correct setup for this type of training pipeline. My timezone is UTC+2, so let me know what time fits you best. I'm available to chat tomorrow and on Friday.
