failed on running cosmoflow with v1.0 branch #49

Closed
caspaseyc opened this issue Mar 6, 2024 · 1 comment

caspaseyc commented Mar 6, 2024

I can run every workload successfully except cosmoflow, which fails when I execute the following command:

./benchmark.sh run --hosts x.x.x.x --workload cosmoflow --accelerator-type h100 --num-accelerators 8 --results-dir resultsdir/cosmoflow --param dataset.num_files_train=6000 --param dataset.data_folder=/home/cyc/dataset/cosmoflow

[INFO] 2024-03-06T15:01:12.348520 Ending block 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:216]
[INFO] 2024-03-06T15:01:12.350001 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 74.2248 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:219]
[INFO] 2024-03-06T15:01:12.350112 Epoch 1 - Block 1 [Training] Throughput (samples/second): 1692.2993 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:220]
[INFO] 2024-03-06T15:01:12.350413 Ending epoch 1 - 750 steps completed in 3.55 s [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:159]
[INFO] 2024-03-06T15:01:12.350998 Starting epoch 2: 750 steps expected [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-03-06T15:01:12.351128 Starting block 1 [/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
Error executing job with overrides: ['workload=cosmoflow_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=6000', '++workload.dataset.data_folder=/home/cyc/dataset/cosmoflow', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
outputs = pipe.share_outputs()
File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
raise StopIteration
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
Traceback (most recent call last):
File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 395, in main
benchmark.run()
File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
x = func(*args, **kwargs)
File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 342, in run
steps = self._train(epoch)
File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/data_loader/dali_data_loader.py", line 116, in next
outputs = pipe.share_outputs()
File "/root/anaconda3/envs/mlperf/lib/python3.10/site-packages/nvidia/dali/pipeline.py", line 1132, in share_outputs
raise StopIteration
File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 183, in wrapper
x = func(*args, **kwargs)
File "/home/cyc/storage-1.0-branch/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
for batch in dlp.iter(loader.next()):
File "/home/cyc/dlio-profiler-0.0.3/dlio_profiler/logger.py", line 203, in iter
for v in func:
RuntimeError: generator raised StopIteration
StopIteration

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
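
A note on the last error in the trace: since Python 3.7 (PEP 479), a `StopIteration` that escapes from inside a generator body is converted into `RuntimeError: generator raised StopIteration`. The minimal sketch below reproduces that message in isolation. `ExhaustedPipe` and `batches` are hypothetical stand-ins for illustration only, not DALI or DLIO code.

```python
class ExhaustedPipe:
    """Hypothetical stand-in for a pipeline that has no more data."""

    def share_outputs(self):
        raise StopIteration


def batches(pipe):
    # StopIteration escapes the generator body here, so Python 3.7+ (PEP 479)
    # replaces it with RuntimeError instead of quietly ending the loop.
    while True:
        yield pipe.share_outputs()


try:
    for _ in batches(ExhaustedPipe()):
        pass
except RuntimeError as err:
    message = str(err)
    # The original StopIteration is kept as the direct cause.
    cause = type(err.__cause__).__name__

print(message)
print(cause)
```

This matches the shape of the trace above: a `StopIteration` first, then a `RuntimeError` reported as its direct consequence.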

The first epoch appears to complete, and the failure occurs at the start of the second.
I also tried the suggestions from issue #41, but they did not help. I did not pass the subfolder parameter to either ./benchmark.sh datagen or ./benchmark.sh run.
I am not sure how to debug this. Please help, thanks a lot!

@zhenghh04
Contributor

We have a PR in DLIO to fix this.
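
For anyone who hits the same trace before picking up that fix: the details of the DLIO change are not shown in this thread, so the following is only a generic sketch of how a loader generator can end an epoch cleanly and rewind for the next one. `FakePipe`, its `reset()` method, and `epoch_batches` are hypothetical names chosen for illustration, not DLIO or DALI APIs.

```python
class FakePipe:
    """Hypothetical finite source: returns `steps` batches, then raises StopIteration."""

    def __init__(self, steps):
        self.steps = steps
        self.pos = 0

    def share_outputs(self):
        if self.pos >= self.steps:
            raise StopIteration
        self.pos += 1
        return self.pos

    def reset(self):
        # Rewind so the next epoch starts from the first batch again.
        self.pos = 0


def epoch_batches(pipe):
    """Yield one epoch of batches, then rewind the source."""
    while True:
        try:
            batch = pipe.share_outputs()
        except StopIteration:
            # Catch the exhaustion signal inside the generator and return,
            # so it is not converted into RuntimeError.
            pipe.reset()
            return
        yield batch


pipe = FakePipe(steps=3)
steps_per_epoch = [len(list(epoch_batches(pipe))) for _ in range(2)]
print(steps_per_epoch)
```

The design choice here is to handle exhaustion at the point where it occurs, so the training loop sees a normal end of iteration and the second epoch starts from a rewound source.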
