I can run all of the workloads except cosmoflow, but I get stuck when executing the following command:
./benchmark.sh run --hosts x.x.x.x --workload cosmoflow --accelerator-type h100 --num-accelerators 8 --results-dir resultsdir/cosmoflow --param dataset.num_files_train=6000 --param dataset.data_folder=/home/cyc/dataset/cosmoflow
[INFO] 2024-03-06T15:01:12.348520 Ending block 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:216]
[INFO] 2024-03-06T15:01:12.350001 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 74.2248 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:219]
[INFO] 2024-03-06T15:01:12.350112 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1692.2993 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:220]
[INFO] 2024-03-06T15:01:12.350413 Ending epoch 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:159]
[INFO] 2024-03-06T15:01:12.350998 Starting epoch 2: 750 steps expected [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-03-06T15:01:12.351128 Starting block 1 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
Error executing job with overrides: ['workload=cosmoflow_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=6000', '++workload.dataset.data_folder=/home/cyc/dataset/cosmoflow', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
    outputs = pipe.share_outputs()
  File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
    raise StopIteration
StopIteration
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 395, in main
    benchmark.run()
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
    x = func(*args, **kwargs)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 342, in run
    steps = self._train(epoch)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
    outputs = pipe.share_outputs()
  File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
    raise StopIteration
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
    x = func(*args, **kwargs)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 203, in iter
    for v in func:
RuntimeError: generator raised StopIteration
StopIteration
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
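For what it's worth, the RuntimeError: generator raised StopIteration part looks like standard PEP 479 behaviour on Python 3.7+: the StopIteration raised by pipe.share_outputs() at the end of the epoch escapes the data loader's next() generator, and the interpreter converts it into a RuntimeError. A minimal sketch (no DALI involved, all names hypothetical) reproduces the same chained error:

```python
# Minimal sketch of the PEP 479 behaviour seen in the traceback above:
# a StopIteration that escapes a generator body is re-raised by the
# interpreter as "RuntimeError: generator raised StopIteration".

def exhausted_pipeline():
    # Stand-in for pipe.share_outputs() once the epoch's data is used up.
    raise StopIteration

def loader_next():
    # Stand-in for the data loader's next() generator.
    while True:
        yield exhausted_pipeline()  # StopIteration leaks out of the generator body

try:
    for batch in loader_next():
        pass
except RuntimeError as err:
    print(err)                  # generator raised StopIteration
    print(repr(err.__cause__))  # StopIteration()
```

So the pipeline seems to signal end-of-epoch with StopIteration, but nothing catches it before epoch 2 starts iterating again.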
It seems the benchmark runs the very first epoch fine but gets stuck at the start of the second.
I also tried the suggestions from issue #41, but they didn't work. I did not add the subfolder parameter when executing ./benchmark.sh datagen or ./benchmark.sh run.
I don't know how to debug this. Please help, thanks a lot!