failed on running cosmoflow with v1.0 branch #49

caspaseyc · 2024-03-06T07:13:56Z

All of the workloads run fine for me except cosmoflow, which fails when I execute the following command:

./benchmark.sh run --hosts x.x.x.x --workload cosmoflow --accelerator-type h100 --num-accelerators 8 --results-dir resultsdir/cosmoflow --param dataset.num_files_train=6000 --param dataset.data_folder=/home/cyc/dataset/cosmoflow

[INFO] 2024-03-06T15:01:12.348520 Ending block 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:216]
[INFO] 2024-03-06T15:01:12.350001 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 74.2248 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:219]
[INFO] 2024-03-06T15:01:12.350112 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1692.2993 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:220]
[INFO] 2024-03-06T15:01:12.350413 Ending epoch 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:159]
[INFO] 2024-03-06T15:01:12.350998 Starting epoch 2: 750 steps expected [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-03-06T15:01:12.351128 Starting block 1 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
Error executing job with overrides: ['workload=cosmoflow_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=6000', '++workload.dataset.data_folder=/home/cyc/dataset/cosmoflow', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
    outputs = pipe.share_outputs()
  File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
    raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 395, in main
    benchmark.run()
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
    x = func(*args, **kwargs)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 342, in run
    steps = self._train(epoch)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
    outputs = pipe.share_outputs()
  File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
    raise StopIteration
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
    x = func(*args, **kwargs)
  File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 203, in iter
    for v in func:
RuntimeError: generator raised StopIteration
StopIteration

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

It seems the first epoch runs to completion and the failure happens at the start of the second.
I also tried the answers from issue #41, but they didn't work. I didn't pass the subfolder parameter when running ./benchmark.sh datagen or ./benchmark.sh run.
I don't know how to debug this. Please help, thanks a lot!

zhenghh04 · 2024-03-07T15:49:25Z

We have a PR in DLIO to fix this.
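
For readers hitting the same traceback: the "RuntimeError: generator raised StopIteration" line is standard Python behavior since 3.7 (PEP 479). A StopIteration that escapes from inside a generator body is converted into a RuntimeError instead of silently ending the loop. The sketch below reproduces that pattern with a toy stand-in. It is not DALI code and not the DLIO fix, which this thread does not show. ToyPipeline, its reset() method, leaky_batches, guarded_batches and run_epochs are all hypothetical names; only share_outputs is borrowed from the traceback, as a label.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that signals end of epoch by
    raising StopIteration and must be re-armed before the next epoch."""

    def __init__(self, steps_per_epoch):
        self.steps_per_epoch = steps_per_epoch
        self.step = 0

    def share_outputs(self):
        # Name mirrors the traceback; the behavior here is an assumption.
        if self.step >= self.steps_per_epoch:
            raise StopIteration
        self.step += 1
        return self.step

    def reset(self):
        # Hypothetical re-arm step for the toy class.
        self.step = 0


def leaky_batches(pipe, steps):
    # StopIteration from share_outputs() escapes the generator body,
    # so Python (PEP 479) turns it into RuntimeError.
    for _ in range(steps):
        yield pipe.share_outputs()


def guarded_batches(pipe):
    # Catch the end-of-epoch signal, re-arm the source, and end the
    # generator with a plain return.
    while True:
        try:
            batch = pipe.share_outputs()
        except StopIteration:
            pipe.reset()
            return
        yield batch


def run_epochs(make_batches, epochs=2, steps=3):
    # Reuse one pipeline across epochs and count the batches per epoch.
    pipe = ToyPipeline(steps_per_epoch=steps)
    return [sum(1 for _ in make_batches(pipe)) for _ in range(epochs)]
```

With the leaky generator, the first epoch completes and the second fails on its first step, which matches the shape of the log above. Whether the real failure has the same root cause is an assumption.

```python
print(run_epochs(guarded_batches))  # both epochs yield all their batches
try:
    run_epochs(lambda p: leaky_batches(p, 3))
except RuntimeError as err:
    print(err)  # generator raised StopIteration
```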

caspaseyc closed this as completed Mar 25, 2024
