Training the canary model with the default sampler #9511
Hi, the samplers currently used with the Lhotse datasets to train the Canary encoder-decoder model fail with:

TypeError: You seem to have configured a sampler in your DataLoader which does not provide `__len__` method. The sampler was about to be replaced by `DistributedSamplerWrapper` since `use_distributed_sampler` is True and you are using distributed training. Either provide `__len__` method in your sampler, remove it from DataLoader or set `use_distributed_sampler=False` if you want to handle distributed sampling yourself.

Have you trained the Canary model with `use_distributed_sampler=False`? With that setting, training hangs with the following CUDA error:

[rank1]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

I guess this happens because on some GPUs the batches are exhausted early. What is the preferred approach in this case? Should I implement the `__len__` method on my own?
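For reference, here is a minimal sketch of what I imagine such a wrapper could look like; the `SizedSamplerWrapper` name and the `num_batches` value are my own placeholders, not anything from NeMo or Lhotse:

```python
from torch.utils.data import Sampler


class SizedSamplerWrapper(Sampler):
    """Hypothetical wrapper that adds __len__ to an iterable-style sampler
    so that Lightning's DistributedSamplerWrapper check passes."""

    def __init__(self, sampler, num_batches: int):
        # num_batches is an estimate supplied by the user; it cannot be derived
        # from the wrapped sampler, which has no length by design.
        self.sampler = sampler
        self.num_batches = num_batches

    def __iter__(self):
        # Delegate iteration to the wrapped sampler unchanged.
        return iter(self.sampler)

    def __len__(self) -> int:
        return self.num_batches
```

Even with a length reported, I suppose ranks could still run out of data at different times, so this alone would not necessarily prevent the NCCL hang.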
Are you using Lhotse? We trained Canary with it, and Lhotse is the default data backend for ASR going forward. FYI @pzelasko in case you have any suggestions.
Yes, we trained Canary with Lhotse dataloaders. Please check that you have specified all required PyTorch Lightning trainer flags as described in this example in the documentation https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#enabling-lhotse-via-configuration
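For reference, a rough Python sketch of the trainer flags that example is concerned with (in NeMo you would normally set these through the YAML/Hydra config rather than construct the Trainer by hand, and the numeric values below are placeholders, not recommendations):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=-1,
    accelerator="gpu",
    strategy="ddp",
    # Lhotse samplers are iterable-style and have no __len__, so Lightning
    # must not try to wrap them in DistributedSamplerWrapper:
    use_distributed_sampler=False,
    # Because the sampler has no length, bound the "epoch" explicitly and
    # schedule validation by step count instead of by epoch:
    limit_train_batches=20_000,
    val_check_interval=20_000,
    max_steps=100_000,
)
```

The idea is that every rank stops at the same step count, so ranks don't get out of sync when one of them runs out of data first (assuming the sampler can supply at least that many batches on every rank).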
If that doesn't help, please share the config you're running it with.