Hi Team, I'm trying to run the Actor-Learner API for Distributed Collection and Training as explained here:
https://github.com/tensorflow/agents/tree/master/tf_agents/experimental/distributed/examples/sac
but on multiple machines.
Based on the Reverb docs, let's say I have 3 machines:
A > IP: 227.57.48.210
B > IP: 227.57.48.211
C > IP: 227.57.48.212
1. On machine A -> run sac_reverb_server.py on port 8008 (a sketch of what this server could host follows after these steps)
2. On machine B -> run sac_collect.py with:
--replay_buffer_server_address='227.57.48.210:8008'
--variable_container_server_address='227.57.48.210:8008'
3. On machine C -> run sac_train.py with:
--replay_buffer_server_address='227.57.48.210:8008'
--variable_container_server_address='227.57.48.210:8008'
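For orientation, here is a minimal, hedged sketch of what a Reverb server like the one started on machine A could host: one table for replay experience and one for the variable container that distributes the policy weights. Table names, sizes, and selector choices are illustrative assumptions, not copied from sac_reverb_server.py.

```python
# Hedged sketch of a Reverb server on machine A (port 8008) hosting both the
# replay buffer table and a variable-container table. Table names, sizes and
# selectors are assumptions, not taken from sac_reverb_server.py.
import reverb

PORT = 8008

server = reverb.Server(
    tables=[
        # Experience written by sac_collect.py (machine B), sampled by
        # sac_train.py (machine C).
        reverb.Table(
            name='uniform_table',  # assumed name
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=1_000_000,
            rate_limiter=reverb.rate_limiters.MinSize(1)),
        # Policy weights / train step pushed by the trainer and pulled by the
        # collect job, which is why both flags point at 227.57.48.210:8008.
        reverb.Table(
            name='variables',  # assumed name
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=1,
            rate_limiter=reverb.rate_limiters.MinSize(1)),
    ],
    port=PORT)

# Block so machines B and C can connect for the lifetime of the experiment.
server.wait()
```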
But in the example from the link above, both sac_reverb_server.py and sac_collect.py wait for sac_train.py to write the policies to a given folder on the same machine before running their respective operations.
In a multi-device setup, how can sac_reverb_server.py and sac_collect.py be told where to load the policy from?
Is there a tf_agents built-in function or a defined procedure to manage that, or does this need to be implemented from scratch with a custom script?
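For context on the "wait for sac_train.py to write the policies" behaviour: the collect job in the linked example blocks until a saved policy appears under root_dir and then loads it. Below is a minimal sketch of that step, assuming tf_agents' train_utils.wait_for_policy helper and the directory-name constants in tf_agents.train.learner; treat the exact names as assumptions if your tf_agents version differs.

```python
# Sketch of how the collect job waits for the policy that sac_train.py
# exports under root_dir. Helper and constant names are believed to match
# tf_agents.train, but treat them as assumptions for your version.
import os

from tf_agents.train import learner
from tf_agents.train.utils import train_utils

root_dir = '/tmp/sac_experiment'  # hypothetical; must be visible to this job

collect_policy_dir = os.path.join(
    root_dir,
    learner.POLICY_SAVED_MODEL_DIR,          # typically 'policies'
    learner.COLLECT_POLICY_SAVED_MODEL_DIR)  # typically 'collect_policy'

# Polls until the trainer has exported the saved model, then loads it as a
# Python policy the Actor can step. This is why, across machines, root_dir
# has to live somewhere every job can reach (shared disk, GCS, ...).
collect_policy = train_utils.wait_for_policy(
    collect_policy_dir, load_specs_from_pbtxt=True)
```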
@JCMiles I have failed to set expectations before, so: I am looking at distributed training (using multiple machines) on Google Cloud. To share the model, I think you can put it in a GCS bucket and use that as the location. I am 99% sure we are using the TensorFlow checkpoint reader, which supports a bunch (I think a bunch is the right word) of network storage options that are not file-system native, e.g. GCS and maybe even S3 (although I think when I was doing AWS I just mounted a shared disk... it has been 3+ years since I had AWS knowledge). I suspect you want something that scales larger, but you can get a lot of scale by spinning up a bunch of agents on a single machine.
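To make the GCS suggestion concrete: because the policy export and checkpointing go through TensorFlow's file APIs, pointing every job at the same gs:// path should be all that is needed. A minimal sketch with a hypothetical bucket name; nothing here is specific to the example scripts.

```python
# Hedged sketch of the GCS-bucket suggestion. The bucket name and prefix are
# hypothetical; the point is only that every machine is handed the same
# root_dir and that TensorFlow's file APIs resolve gs:// paths directly.
import tensorflow as tf

root_dir = 'gs://my-rl-bucket/sac_experiment'  # hypothetical bucket

# Behaves like a local path: the trainer on machine C exports the policy
# under this prefix and the collect job on machine B reads it back, with no
# manual copying between machines.
tf.io.gfile.makedirs(root_dir)
print('Contents so far:', tf.io.gfile.listdir(root_dir))
```

The same idea extends to an NFS-mounted directory, as mentioned for the AWS case: any location that TensorFlow's filesystem layer can resolve from every machine works.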
I hope that, thanks to this other project, I can update those documents. But if you want/can, I am happy to chat back and forth to try to help here. I assigned myself so I should see comments; you can also @ me.
@tfboyd sorry for the delay, I just saw this. Thanks for your effort. It would be amazing to have clear documentation about the correct setup for this type of training pipeline. My timezone is UTC+2, so let me know what time fits you better. I'm available to chat tomorrow and on Friday.