v0.5.0 Dispatch batches from main DataLoader
This release introduces support for iterating through a DataLoader
only on the main process, which then dispatches the batches to all processes.
Dispatch batches from main DataLoader
The motivation behind this comes from dataset streaming, which introduces two difficulties:
- some elements of the dataset might time out, and those timeouts might differ in each launched process, so it's impossible to make sure the data is iterated through the same way on each process
- when using an IterableDataset, each process goes through the whole dataset and thus applies the preprocessing to all elements. This can slow down training.
This new feature is activated by default for all IterableDataset.
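The idea can be illustrated with a small, library-free sketch. It is not Accelerate's implementation: the function name `dispatch_from_main` and its parameters are hypothetical, and the "processes" are simulated as plain list slices. It only shows the pattern described above, where a single reader consumes the stream once and hands each process its own slice of every global batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def dispatch_from_main(
    stream: Iterable[int], per_process_batch_size: int, num_processes: int
) -> Iterator[List[List[int]]]:
    """Hypothetical helper: yield, for each step, one slice per process."""
    # Only the simulated "main process" ever touches the stream, so any
    # preprocessing or timeout handling happens exactly once per element.
    iterator = iter(stream)
    global_batch_size = per_process_batch_size * num_processes
    while True:
        global_batch = list(islice(iterator, global_batch_size))
        if len(global_batch) < global_batch_size:
            # Assumption for this sketch: drop an incomplete final batch.
            return
        # Slice the global batch so that rank r receives the r-th chunk.
        yield [
            global_batch[rank * per_process_batch_size:(rank + 1) * per_process_batch_size]
            for rank in range(num_processes)
        ]


# Simulate 2 processes, each receiving batches of 2 elements.
steps = list(dispatch_from_main(range(10), per_process_batch_size=2, num_processes=2))
for step, shards in enumerate(steps):
    print(step, shards)
```

Because every slice is cut from the same global batch, all processes see a consistent view of the data even if the underlying stream is not reproducible across readers.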
Various fixes
- fix fp16 convert back to fp32 for issue: unsupported operand type(s) for /: 'dict' and 'int' #149 (@Doragd)
- [Docs] Machine config is yaml not json #151 (@patrickvonplaten)
- Fix gather for 0d tensor #152 (@sgugger)
- [DeepSpeed] allow untested optimizers deepspeed #150 (@patrickvonplaten)
- Raise errors instead of warnings with better tests #170 (@sgugger)