-
-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributed support (rework) #996
base: master
Are you sure you want to change the base?
Distributed support (rework) #996
Conversation
It seems that unit tests failed due to a temporary error while downloading CIFAR. I triggered a re-run and unit tests are ok! |
return avalanche_forward(self.model, self.mb_x, self.mb_task_id) | ||
|
||
@final | ||
def model_adaptation(self, model=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this. Local adaptation probably needs all the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason here is that we want the users to override _model_adaptation
instead of model_adaptation
. The _model_adaptation is then called from within the with self.use_local_model()
context. This is needed to ensure the users are changing the local model instead of the model
(which may be the wrapped one).
def forward(self): | ||
"""Compute the model's output given the current mini-batch.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need _forward
and forward
? I would keep only one of them. Does it make sense to move this method to the DistributedModelStrategy helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as _model_adaptation
and model_adaptation
. The wrapper is mainly needed to add the context managers.
return avalanche_forward(self.model, self.mb_x, self.mb_task_id) | ||
|
||
@final | ||
def model_adaptation(self, model=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as _forward
above.
An intermediate abstract class in charge of synchronizing data batches. | ||
|
||
This class can handle batches as either tuples of elements (as usual) or | ||
even single values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. Why do we need a single class to deal with both tuples and single values? It seems easier to have different classes each with their own merge and sync methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A class that can handle both situations may be useful for unsupervised strategies that may have a batch made of the X value only. In that case, the batch is not a tuple and things need to be managed in a slightly different way. Switching classes depending on that may be more complicated than doing this...
I left some general comments about the API and some things I didn't understand about the implementation. Overall, I think it's now much easier to add distributed support and integrate it. I think the API is a bit more verbose than it needs to be, especially due to the large number of methods that we need to add to a new distributed value. Personally, I would have chosen a more minimal set of methods but the solution as is is already quite good. One thing that is missing, and it's quite important, is a test that checks which plugins actually work with the distributed training, which may not be obvious right now. |
Yes, the current issue is to check which plugins may not work. It seems that replay and the scheduler plugin are currently working as expected, but there are many more to test. I'm working on this! |
@AntonioCarta I fixed the naming of the methods you mentioned and added the We still have to finalize a choice regarding the |
Thanks!
Maybe we could have a |
…ed_support_rework
…Additional tests. Detection WIP.
Oh no! It seems there are some PEP8 errors! 😕
|
…efault loggers creation. Added distributed training integration unit tests.
Oh no! It seems there are some PEP8 errors! 😕
|
@AntonioCarta The PR is almost completed and I'd like to add more unit tests to be sure everything works fine. The only real part that I still need to figure out is the
The good part is that we can use instances of this class to populate the collate_fn field of AvalancheDatasets because they have a call method that calls the collate_fn. For the moment, I have implemented the Collate class for classification and detection. However, this works for the input batch. For distributed support, we also need to implement the same thing, but on the outputs. Synchronizing the output across processes is important to have the plugins and metrics aligned. I already took care of the loss by averaging it across processes, but mb_output needs a proper collate. The problem is that it has to be different from the input one (as the format may be different for inputs and outputs). We can't leverage the collate_fn from the experience dataset unless we enrich the Collate class to also contain the collate methods needed to manage the output (this is one possible solution). A different solution is to have the strategy itself define the collate for the output. Do you any suggestion on this? |
I need to think about this. Unfortunately the output collate cannot be part of the dataset since it depends on the model's definition and its output. At this point, I'm wondering whether it's just easier to enforce a more structured and general mini-batch definition. Something like tensordict. It would provide a single and well defined method to collate values (also for scalar and tensor metrics). |
I agree that I think if we use dictionaries and 1. a collate for single values + 2. a collate that merges single value batches (often the same as 1.) we could recover all the other collate operations. |
I was definitely underestimating the complexity of possible collate functions.
There is also an additional problem. In general, it may not be true that the single collate is the same for all the values in the minibatch. |
Oh no! It seems there are some PEP8 errors! 😕
|
Oh no! It seems there are some PEP8 errors! 😕
|
Hi, Thanks for the amazing job! |
I'm moving the development status to #1315. This PR, in its current state, is very difficult to merge in a single step as it is too big and based on a too old codebase. |
This draft contains the implementation for supporting distributed (multi-GPU) training.
The examples folder contains a simple example (to run naive, replay, replay+scheduler) and a bash script to launch it.
This is an alternative to #907 created with the idea to minimize the changes required to the strategies.