Commit

Merge pull request #53 from mlcommons/main
v1.0 rc1 release
johnugeorge authored Mar 15, 2024
2 parents edd2a28 + 88e4f59 commit bb3a1d2
Showing 8 changed files with 89 additions and 31 deletions.
60 changes: 58 additions & 2 deletions README.md
@@ -7,6 +7,8 @@ MLPerf Storage is a benchmark suite to characterize the performance of storage s
- [Configuration](#configuration)
- [Workloads](#workloads)
- [U-Net3D](#u-net3d)
- [ResNet-50](#resnet-50)
- [CosmoFlow](#cosmoflow)
- [Parameters](#parameters)
- [CLOSED](#closed)
- [OPEN](#open)
@@ -72,7 +74,7 @@ sudo apt-get install mpich
Clone the latest release from the [MLCommons Storage](https://github.com/mlcommons/storage) repository and install the Python dependencies.

```bash
- git clone -b v1.0-rc0 --recurse-submodules https://github.com/mlcommons/storage.git
+ git clone -b v1.0-rc1 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt
```
@@ -207,10 +209,12 @@ Note: The `reportgen` script must be run in the launcher client host.
## Workloads
Currently, the storage benchmark suite supports benchmarking of three deep learning workloads:
- Image segmentation using U-Net3D model
- Image classification using ResNet-50 model
- Cosmology parameter prediction using CosmoFlow model

### U-Net3D

- Calculate minimum dataset size required for the benchmark run
+ Calculate minimum dataset size required for the benchmark run based on your client configuration

```bash
./benchmark.sh datasize --workload unet3d --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
@@ -234,6 +238,58 @@ All results will be stored in the directory configured using `--results-dir`(or
./benchmark.sh reportgen --results-dir resultsdir
```

### ResNet-50

Calculate minimum dataset size required for the benchmark run based on your client configuration

```bash
./benchmark.sh datasize --workload resnet50 --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```
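As a rough sanity check on what `datasize` returns (assuming, as a rule of thumb, that the benchmark dataset must be at least 5x the aggregate memory of the client hosts so that reads cannot be served from the page cache), the floor for the configuration above can be sketched as:

```shell
# Back-of-the-envelope estimate only -- the authoritative number comes from
# `./benchmark.sh datasize`. The 5x aggregate-host-memory factor is an
# assumption, not a value taken from this repository.
NUM_CLIENT_HOSTS=2
CLIENT_HOST_MEMORY_GB=128
MULTIPLIER=5   # assumed cache-defeating factor

MIN_DATASET_GB=$((MULTIPLIER * NUM_CLIENT_HOSTS * CLIENT_HOST_MEMORY_GB))
echo "minimum dataset size: ${MIN_DATASET_GB} GB"   # -> 1280 GB here
```

The same arithmetic applies to the U-Net3D and CosmoFlow `datasize` invocations, since only the host count and per-host memory enter the estimate.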

Generate data for the benchmark run

```bash
./benchmark.sh datagen --workload resnet50 --accelerator-type h100 --num-parallel 8 --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
```

Run the benchmark.

```bash
./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=resnet50_data
```

All results will be stored in the directory configured via the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
./benchmark.sh reportgen --results-dir resultsdir
```
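To translate a dataset-size floor into the `dataset.num_files_train` override used above, one can divide by the per-file size implied by the resnet50 workload configs shown later in this commit (`record_length` of roughly 114660 bytes per sample, 1251 samples per file). This is illustrative arithmetic only; the `datasize` subcommand remains the authority:

```shell
# Hypothetical conversion from a dataset-size floor to a file count, using
# record_length (~114660 B/sample) and num_samples_per_file (1251) from the
# resnet50 workload configs in this release.
TARGET_BYTES=$((1280 * 1024 * 1024 * 1024))   # assumed 1280 GiB floor
BYTES_PER_FILE=$((114660 * 1251))             # sample size x samples per file

# Round up so the generated dataset never falls below the floor.
NUM_FILES=$(( (TARGET_BYTES + BYTES_PER_FILE - 1) / BYTES_PER_FILE ))
echo "dataset.num_files_train >= ${NUM_FILES}"
```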

### CosmoFlow

Calculate minimum dataset size required for the benchmark run based on your client configuration

```bash
./benchmark.sh datasize --workload cosmoflow --accelerator-type a100 --num-accelerators 8 --num-client-hosts 2 --client-host-memory-in-gb 128
```

Generate data for the benchmark run

```bash
./benchmark.sh datagen --workload cosmoflow --accelerator-type h100 --num-parallel 8 --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
```

Run the benchmark.

```bash
./benchmark.sh run --hosts 10.117.61.121,10.117.61.165 --workload cosmoflow --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir --param dataset.num_files_train=1200 --param dataset.data_folder=cosmoflow_data
```

All results will be stored in the directory configured via the `--results-dir` (or `-r`) argument. To generate the final report, run the following on the launcher client host.

```bash
./benchmark.sh reportgen --results-dir resultsdir
```
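The three workload sections follow the same datagen, run, reportgen sequence, so the steps can be chained in a small wrapper. The script below is a hypothetical convenience sketch; the host IPs, counts, and the `DRY_RUN` guard are placeholders, not features of `benchmark.sh`:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the documented steps for one workload.
# DRY_RUN=1 (the default here) only prints each command instead of running it.
set -euo pipefail

WORKLOAD=${WORKLOAD:-cosmoflow}
ACCEL=${ACCEL:-h100}
HOSTS=${HOSTS:-10.117.61.121,10.117.61.165}
RESULTS=${RESULTS:-resultsdir}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run ./benchmark.sh datagen --workload "$WORKLOAD" --accelerator-type "$ACCEL" --num-parallel 8
run ./benchmark.sh run --hosts "$HOSTS" --workload "$WORKLOAD" \
    --accelerator-type "$ACCEL" --num-accelerators 2 --results-dir "$RESULTS"
run ./benchmark.sh reportgen --results-dir "$RESULTS"
```

With `DRY_RUN=0` it would execute the same commands shown in the sections above, in order, aborting on the first failure.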

## Parameters

### CLOSED
2 changes: 1 addition & 1 deletion dlio_benchmark
Submodule dlio_benchmark updated 35 files
+28 −10 .github/workflows/python-package-conda.yml
+0 −0 dlio_benchmark/configs/workload/bert_v100.yaml
+0 −24 dlio_benchmark/configs/workload/cosmoflow.yaml
+5 −3 dlio_benchmark/configs/workload/cosmoflow_a100.yaml
+5 −3 dlio_benchmark/configs/workload/cosmoflow_h100.yaml
+0 −25 dlio_benchmark/configs/workload/cosmoflow_pt.yaml
+4 −2 dlio_benchmark/configs/workload/cosmoflow_v100.yaml
+0 −24 dlio_benchmark/configs/workload/resnet50.yaml
+8 −9 dlio_benchmark/configs/workload/resnet50_a100.yaml
+8 −9 dlio_benchmark/configs/workload/resnet50_h100.yaml
+24 −0 dlio_benchmark/configs/workload/resnet50_tf.yaml
+7 −8 dlio_benchmark/configs/workload/resnet50_v100.yaml
+0 −34 dlio_benchmark/configs/workload/unet3d.yaml
+2 −2 dlio_benchmark/configs/workload/unet3d_a100.yaml
+2 −2 dlio_benchmark/configs/workload/unet3d_h100.yaml
+3 −1 dlio_benchmark/data_generator/csv_generator.py
+4 −4 dlio_benchmark/data_generator/data_generator.py
+3 −1 dlio_benchmark/data_generator/hdf5_generator.py
+7 −3 dlio_benchmark/data_generator/indexed_binary_generator.py
+3 −1 dlio_benchmark/data_generator/jpeg_generator.py
+3 −1 dlio_benchmark/data_generator/npy_generator.py
+3 −1 dlio_benchmark/data_generator/npz_generator.py
+3 −1 dlio_benchmark/data_generator/png_generator.py
+3 −1 dlio_benchmark/data_generator/tf_generator.py
+38 −24 dlio_benchmark/data_loader/dali_data_loader.py
+10 −8 dlio_benchmark/data_loader/native_dali_data_loader.py
+5 −4 dlio_benchmark/data_loader/tf_data_loader.py
+41 −13 dlio_benchmark/data_loader/torch_data_loader.py
+1 −1 dlio_benchmark/main.py
+3 −7 dlio_benchmark/reader/dali_image_reader.py
+2 −2 dlio_benchmark/reader/dali_npy_reader.py
+6 −10 dlio_benchmark/reader/dali_tfrecord_reader.py
+1 −0 dlio_benchmark/reader/npz_reader.py
+5 −1 dlio_benchmark/reader/reader_handler.py
+21 −12 dlio_benchmark/utils/config.py
8 changes: 5 additions & 3 deletions storage-conf/workload/cosmoflow_a100.yaml
@@ -9,7 +9,6 @@ workflow:
dataset:
data_folder: data/cosmoflow_pt
num_files_train: 524288
- num_files_eval: 65536
num_samples_per_file: 1
record_length: 2828486
record_length_stdev: 71311
@@ -19,7 +18,10 @@ reader:
data_loader: dali
read_threads: 4
batch_size: 1

+ dont_use_mmap: True
+ file_shuffle: seed
+ sample_shuffle: seed

train:
- epochs: 4
+ epochs: 5
computation_time: 0.00551
8 changes: 5 additions & 3 deletions storage-conf/workload/cosmoflow_h100.yaml
@@ -9,7 +9,6 @@ workflow:
dataset:
data_folder: data/cosmoflow_pt
num_files_train: 524288
- num_files_eval: 65536
num_samples_per_file: 1
record_length: 2828486
record_length_stdev: 71311
@@ -19,7 +18,10 @@ reader:
data_loader: dali
read_threads: 4
batch_size: 1

+ dont_use_mmap: True
+ file_shuffle: seed
+ sample_shuffle: seed

train:
- epochs: 4
+ epochs: 5
computation_time: 0.00350
17 changes: 8 additions & 9 deletions storage-conf/workload/resnet50_a100.yaml
@@ -7,21 +7,20 @@ workflow:
train: True

dataset:
- num_files_train: 1281167
- num_files_eval: 50000
- num_samples_per_file: 1
+ num_files_train: 1024
+ num_samples_per_file: 1251
record_length: 114660.07
record_length_std: 136075.82
record_length_resize: 150528
data_folder: data/resnet50
- format: png
+ format: tfrecord

train:
- computation_time: 0.151
+ computation_time: 0.435
epochs: 5

reader:
- data_loader: pytorch
+ data_loader: dali
read_threads: 8
computation_threads: 8
- batch_size: 64
- batch_size_eval: 128
+ batch_size: 400
+ dont_use_mmap: True
17 changes: 8 additions & 9 deletions storage-conf/workload/resnet50_h100.yaml
@@ -7,21 +7,20 @@ workflow:
train: True

dataset:
- num_files_train: 1281167
- num_files_eval: 50000
- num_samples_per_file: 1
+ num_files_train: 1024
+ num_samples_per_file: 1251
record_length: 114660.07
record_length_std: 136075.82
record_length_resize: 150528
data_folder: data/resnet50
- format: png
+ format: tfrecord

train:
- computation_time: 0.103
+ computation_time: 0.224
epochs: 5

reader:
- data_loader: pytorch
+ data_loader: dali
read_threads: 8
computation_threads: 8
- batch_size: 64
- batch_size_eval: 128
+ batch_size: 400
+ dont_use_mmap: True
4 changes: 2 additions & 2 deletions storage-conf/workload/unet3d_a100.yaml
@@ -18,14 +18,14 @@ dataset:

reader:
data_loader: pytorch
- batch_size: 4
+ batch_size: 7
read_threads: 4
file_shuffle: seed
sample_shuffle: seed

train:
epochs: 5
- computation_time: 0.375
+ computation_time: 0.636

checkpoint:
checkpoint_folder: checkpoints/unet3d
4 changes: 2 additions & 2 deletions storage-conf/workload/unet3d_h100.yaml
@@ -18,14 +18,14 @@ dataset:

reader:
data_loader: pytorch
- batch_size: 4
+ batch_size: 7
read_threads: 4
file_shuffle: seed
sample_shuffle: seed

train:
epochs: 5
- computation_time: 0.188
+ computation_time: 0.323

checkpoint:
checkpoint_folder: checkpoints/unet3d
