The "overall_throughput" calculated in https://github.com/instructlab/training/blob/main/src/instructlab/training/main_ds.py#L422 uses args.samples_per_gpu as the batch size instead of the actual "micro_batch_size".
The batch size differs at each step, but overall_throughput is calculated from a constant value.
Here is part of a log, for example, with batch_size values of 125, 112, and 121:
Epoch 0: 97%|█████████▋| 76/78 [03:54<00:05, 2.94s/it]
{
"epoch": 0,
"step": 76,
"rank": 0,
"overall_throughput": 44.94857943825548,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2444758415222168,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25623,
"batch_size": 125,
"total_loss": 3.9130468719509817,
"samples_seen": 9661,
"timestamp": "2024-12-20T13:51:34.253834"
}
Epoch: 0, Step: 77, Rank: 3, loss = 0.95703125
Epoch: 0, Step: 77, Rank: 1, loss = 0.71484375
Epoch: 0, Step: 77, Rank: 2, loss = 0.64453125
Epoch: 0, Step: 77, Rank: 5, loss = 2.953125
Epoch: 0, Step: 77, Rank: 7, loss = 12.5
Epoch: 0, Step: 77, Rank: 6, loss = 10.5
Epoch: 0, Step: 77, Rank: 4, loss = 1.765625
Epoch: 0, Step: 77, Rank: 0, loss = 0.921875
Epoch 0: 99%|█████████▊| 77/78 [03:57<00:02, 2.89s/it]
{
"epoch": 0,
"step": 77,
"rank": 0,
"overall_throughput": 47.957271498777644,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2483596801757812,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23046,
"batch_size": 112,
"total_loss": 3.8774624663716044,
"samples_seen": 9773,
"timestamp": "2024-12-20T13:51:37.052739"
}
Epoch: 0, Step: 78, Rank: 0, loss = 0.8671875
Epoch: 0, Step: 78, Rank: 5, loss = 2.15625
Epoch: 0, Step: 78, Rank: 7, loss = 12.75
Epoch: 0, Step: 78, Rank: 3, loss = 0.72265625
Epoch: 0, Step: 78, Rank: 4, loss = 1.1640625
Epoch: 0, Step: 78, Rank: 6, loss = 14.8125
Epoch: 0, Step: 78, Rank: 2, loss = 0.57421875
Epoch: 0, Step: 78, Rank: 1, loss = 0.2314453125
Epoch 0: 100%|██████████| 78/78 [04:00<00:00, 2.91s/it]
{
"epoch": 0,
"step": 78,
"rank": 0,
"overall_throughput": 45.40726680806918,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2466816902160645,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27044,
"batch_size": 121,
"total_loss": 4.160331311936104,
"samples_seen": 9894,
"timestamp": "2024-12-20T13:51:39.872213"
}
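For illustration, here is a minimal sketch of what the corrected calculation could look like. The function name and signature are hypothetical (this is not the actual code in main_ds.py); the point is only that the metric should use the per-step batch size that the log already reports, not the constant args.samples_per_gpu:

```python
def overall_throughput(batch_size: int, elapsed_seconds: float) -> float:
    # Use the actual batch size of the current step (125, 112, 121, ...
    # in the log above) rather than the constant args.samples_per_gpu.
    return batch_size / elapsed_seconds

# Step 76 from the log above: 125 samples in roughly 2.94 s per step.
print(round(overall_throughput(125, 2.94), 2))  # → 42.52
```

With varying batch sizes produced by the sampler, this makes the throughput figure track the work actually done in each step instead of staying pinned to a configured constant.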