Training the canary model with the default sampler #9511
Hi, the samplers currently used with the Lhotse datasets to train the Canary encoder-decoder model fail with:

TypeError: You seem to have configured a sampler in your DataLoader which does not provide `__len__` method. The sampler was about to be replaced by `DistributedSamplerWrapper` since `use_distributed_sampler` is True and you are using distributed training. Either provide `__len__` method in your sampler, remove it from DataLoader or set `use_distributed_sampler=False` if you want to handle distributed sampling yourself.

Have you trained the Canary model with `use_distributed_sampler=False`? With that setting, training hangs with the following CUDA error:

[rank1]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

I guess this happens because on some GPUs the batches are exhausted early. What is the preferred approach in this case? Should I implement the `__len__` method on my own?
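For reference, here is a minimal sketch of what I imagine such a wrapper could look like; the `SizedSamplerWrapper` name and the `num_batches` value are my own placeholders, not anything from NeMo or Lhotse:

```python
from torch.utils.data import Sampler


class SizedSamplerWrapper(Sampler):
    """Hypothetical wrapper that adds __len__ to an iterable-style sampler
    so that Lightning's DistributedSamplerWrapper check passes."""

    def __init__(self, sampler, num_batches: int):
        # num_batches is an estimate supplied by the user; it cannot be derived
        # from the wrapped sampler, which has no length by design.
        self.sampler = sampler
        self.num_batches = num_batches

    def __iter__(self):
        # Delegate iteration to the wrapped sampler unchanged.
        return iter(self.sampler)

    def __len__(self) -> int:
        return self.num_batches
```

Even with a length reported, I suppose ranks could still run out of data at different times, so this alone would not necessarily prevent the NCCL hang.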
Are you using Lhotse? We trained Canary with it, and Lhotse is the default data backend for ASR going forward. FYI @pzelasko in case you have any suggestions.
Yes, we trained Canary with Lhotse dataloaders. Please check that you have specified all required PyTorch Lightning trainer flags as described in this example in the documentation https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration
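For reference, a rough Python sketch of the trainer flags that example is concerned with (in NeMo you would normally set these through the YAML/Hydra config rather than construct the Trainer by hand, and the numeric values below are placeholders, not recommendations):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=-1,
    accelerator="gpu",
    strategy="ddp",
    # Lhotse samplers are iterable-style and have no __len__, so Lightning
    # must not try to wrap them in DistributedSamplerWrapper:
    use_distributed_sampler=False,
    # Because the sampler has no length, bound the "epoch" explicitly and
    # schedule validation by step count instead of by epoch:
    limit_train_batches=20_000,
    val_check_interval=20_000,
    max_steps=100_000,
)
```

The idea is that every rank stops at the same step count, so ranks don't get out of sync when one of them runs out of data first (assuming the sampler can supply at least that many batches on every rank).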
If that doesn't help, please share the config you're running it with.