[ Back to index ]

Tutorial: modularizing and automating MLPerf (part 2)


Introduction

We expect that you have completed the 1st part of this tutorial and managed to run the MLPerf inference benchmark for object detection with RetinaNet FP32, Open Images and ONNX Runtime on a CPU target.

This tutorial shows you how to customize the MLPerf inference benchmark and run it with a C++ implementation, CUDA and PyTorch.

Update CM framework and automation repository

Note that the CM automation meta-framework and the repository with automation scripts are being continuously updated by the community to improve the portability and interoperability of all reusable components for MLOps and DevOps.

You can get the latest version of the CM framework and automation repository as follows (though be careful since CM CLI and APIs may change):

python3 -m pip install cmind -U
cm pull repo mlcommons@ck --checkout=master
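The same actions are also available programmatically through the cmind Python package installed above. Below is a minimal sketch, assuming the standard mapping of the CM command line onto a cmind.access dictionary (action, automation, artifact, with flags passed as extra keys):

# Minimal sketch: update the CM automation repository from Python.
# Assumes the standard CLI-to-dictionary mapping of cmind.access();
# the 'checkout' key mirrors the --checkout flag used above.
import cmind

r = cmind.access({'action': 'pull',
                  'automation': 'repo',
                  'artifact': 'mlcommons@ck',
                  'checkout': 'master'})
if r['return'] > 0:
    cmind.error(r)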

CM automation for the MLPerf benchmark

MLPerf inference - C++ - RetinaNet FP32 - Open Images - ONNX - CPU - Offline

Let's now run a universal and modular C++ implementation of the MLPerf inference benchmark (developed by Thomas Zhu during his internship at OctoML).

Note that CM will reuse the Open Images dataset, model and tools that were already installed and preprocessed in the CM cache during the 1st part of this tutorial, and will only install the ONNX Runtime library with C++ bindings for your system.

If you want to reinstall all dependencies from scratch, you can clean the CM cache and then rerun the benchmark command below:

cm rm cache -f
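Before cleaning everything, you may want to check which artifacts are already cached (for example, the preprocessed Open Images dataset from the 1st part of this tutorial). Here is a minimal Python sketch using the generic search action of the cache automation; the tags are illustrative and can be adjusted to the artifacts you care about:

# Minimal sketch: list CM cache entries matching some tags from Python.
# Assumes the generic 'search' action available for CM automations;
# the tags below are illustrative.
import cmind

r = cmind.access({'action': 'search',
                  'automation': 'cache',
                  'tags': 'dataset,openimages'})
if r['return'] > 0:
    cmind.error(r)

for artifact in r['list']:
    print(artifact.path)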

You can run the C++ implementation by simply changing the _python variation to the _cpp variation in our high-level CM MLPerf script, which will then set up the correct dependencies and run the C++ implementation of the benchmark:

cm run script "app mlperf inference generic _cpp _retinanet _onnxruntime _cpu" \
     --adr.python.version_min=3.8 \
     --adr.compiler.tags=gcc \
     --adr.openimages-preprocessed.tags=_500 \
     --scenario=Offline \
     --mode=accuracy \
     --test_query_count=10 \
     --rerun

CM will download the ONNX Runtime binaries for your system, compile our C++ implementation with the ONNX Runtime backend and run the MLPerf inference benchmark. You should normally see the following output:

...

loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.01s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.10s).
Accumulating evaluation results...
DONE (t=0.12s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.548
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.787
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.714
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.304
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.631
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.433
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.663
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.731
mAP=54.814%

    - running time of script "run,mlperf,mlcommons,accuracy,mlc,process-accuracy": 1.18 sec.
  - running time of script "app,vision,language,mlcommons,mlperf,inference,reference,generic,ref": 53.81 sec.

You can then measure performance using the C++ implementation of the MLPerf inference benchmark as follows:

cm run script "app mlperf inference generic _cpp _retinanet _onnxruntime _cpu" \
     --adr.python.version_min=3.8 \
     --adr.compiler.tags=gcc \
     --adr.openimages-preprocessed.tags=_500 \
     --scenario=Offline \
     --mode=performance \
     --test_query_count=10 \
     --rerun

You should get the following output (QPS will depend on the speed of your machine):

================================================
MLPerf Results Summary
================================================
SUT name : QueueSUT
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 0.631832
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

================================================
Additional Stats
================================================
Min latency (ns)                : 14547257820
Max latency (ns)                : 15826999233
Mean latency (ns)               : 15129106642
50.00 percentile latency (ns)   : 15045448544
90.00 percentile latency (ns)   : 15826999233
95.00 percentile latency (ns)   : 15826999233
97.00 percentile latency (ns)   : 15826999233
99.00 percentile latency (ns)   : 15826999233
99.90 percentile latency (ns)   : 15826999233

================================================
Test Parameters Used
================================================
samples_per_query : 10
target_qps : 1
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 0
max_duration (ms): 0
min_query_count : 1
max_query_count : 10
qsl_rng_seed : 14284205019438841327
sample_index_rng_seed : 4163916728725999944
schedule_rng_seed : 299063814864929621
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 64

No warnings encountered during test.

No errors encountered during test.

  - running time of script "app,vision,language,mlcommons,mlperf,inference,reference,generic,ref": 50.24 sec.

We plan to continue optimizing this implementation of the MLPerf inference benchmark together with the community across different ML engines, models, data sets and systems.

Summary

You can now test end-to-end benchmarking and submission with the C++ implementation and ONNX Runtime on CPU using a Python virtual environment as follows (just replace "Community" with your name, organization or any other identifier):

cm pull repo mlcommons@ck

cm run script "get sys-utils-cm" --quiet

cm run script "install python-venv" --version=3.10.8 --name=mlperf

cm run script --tags=run,mlperf,inference,generate-run-cmds,_submission,_short,_dashboard \
      --adr.python.name=mlperf \
      --adr.python.version_min=3.8 \
      --adr.compiler.tags=gcc \
      --adr.openimages-preprocessed.tags=_500 \
      --submitter="Community" \
      --implementation=cpp \
      --hw_name=default \
      --model=retinanet \
      --backend=onnxruntime \
      --device=cpu \
      --scenario=Offline \
      --test_query_count=10 \
      --clean

In case of a successful run, you should see your crowd-testing results at this live W&B dashboard.

MLPerf inference - Python - RetinaNet FP32 - Open Images - ONNX - GPU - Offline

Prepare CUDA

If your system has an Nvidia GPU, you can run the MLPerf inference benchmark on this GPU using the CM automation.

First, you need to detect the CUDA and cuDNN installation using CM as follows:

cm run script "get cuda" --out=json

You should see output similar to the following (for CUDA 11.3):

{
  "deps": [],
  "env": {
    "+CPLUS_INCLUDE_PATH": [
      "/usr/local/cuda-11.3/include"
    ],
    "+C_INCLUDE_PATH": [
      "/usr/local/cuda-11.3/include"
    ],
    "+DYLD_FALLBACK_LIBRARY_PATH": [],
    "+LD_LIBRARY_PATH": [],
    "+PATH": [
      "/usr/local/cuda-11.3/bin"
    ],
    "CM_CUDA_CACHE_TAGS": "version-11.3",
    "CM_CUDA_INSTALLED_PATH": "/usr/local/cuda-11.3",
    "CM_CUDA_PATH_BIN": "/usr/local/cuda-11.3/bin",
    "CM_CUDA_PATH_INCLUDE": "/usr/local/cuda-11.3/include",
    "CM_CUDA_PATH_LIB": "/usr/local/cuda-11.3/lib64",
    "CM_CUDA_PATH_LIB_CUDNN": "/usr/local/cuda-11.3/lib64/libcudnn.so",
    "CM_CUDA_PATH_LIB_CUDNN_EXISTS": "yes",
    "CM_CUDA_VERSION": "11.3",
    "CM_NVCC_BIN": "nvcc",
    "CM_NVCC_BIN_WITH_PATH": "/usr/local/cuda-11.3/bin/nvcc"
  },
  "new_env": {
    "+CPLUS_INCLUDE_PATH": [
      "/usr/local/cuda-11.3/include"
    ],
    "+C_INCLUDE_PATH": [
      "/usr/local/cuda-11.3/include"
    ],
    "+DYLD_FALLBACK_LIBRARY_PATH": [],
    "+LD_LIBRARY_PATH": [],
    "+PATH": [
      "/usr/local/cuda-11.3/bin"
    ],
    "CM_CUDA_CACHE_TAGS": "version-11.3",
    "CM_CUDA_INSTALLED_PATH": "/usr/local/cuda-11.3",
    "CM_CUDA_PATH_BIN": "/usr/local/cuda-11.3/bin",
    "CM_CUDA_PATH_INCLUDE": "/usr/local/cuda-11.3/include",
    "CM_CUDA_PATH_LIB": "/usr/local/cuda-11.3/lib64",
    "CM_CUDA_PATH_LIB_CUDNN": "/usr/local/cuda-11.3/lib64/libcudnn.so",
    "CM_CUDA_PATH_LIB_CUDNN_EXISTS": "yes",
    "CM_CUDA_VERSION": "11.3",
    "CM_NVCC_BIN": "nvcc",
    "CM_NVCC_BIN_WITH_PATH": "/usr/local/cuda-11.3/bin/nvcc"
  },
  "new_state": {},
  "return": 0,
  "state": {}
}
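If you need these values programmatically (for example, CM_CUDA_VERSION or the cuDNN library path), you can run the same detection from Python and read the returned environment. A minimal sketch, assuming the new_env dictionary shown in the JSON output above:

# Minimal sketch: detect CUDA from Python and read selected environment keys.
# Assumes the 'new_env' dictionary returned by the "get cuda" CM script,
# as shown in the JSON output above.
import cmind

r = cmind.access({'action': 'run',
                  'automation': 'script',
                  'tags': 'get,cuda'})
if r['return'] > 0:
    cmind.error(r)

env = r.get('new_env', {})
print('CUDA version :', env.get('CM_CUDA_VERSION'))
print('cuDNN library:', env.get('CM_CUDA_PATH_LIB_CUDNN'))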

You can obtain information about your GPU using CM as follows:

cm run script "get cuda-devices"

Prepare Python with virtual environment

We suggest you install a Python virtual environment to avoid interfering with your local Python installation:

cm run script "get sys-utils-cm" --quiet

cm run script "install python-venv" --version=3.10.8 --name=mlperf-cuda

Run MLPerf inference benchmark (offline, accuracy)

You are now ready to run the MLPerf object detection benchmark on the GPU with the Python virtual environment as follows:

cm run script "app mlperf inference generic _python _retinanet _onnxruntime _cuda" \
     --adr.python.name=mlperf-cuda \
     --scenario=Offline \
     --mode=accuracy \
     --test_query_count=10 \
     --clean

This CM script will automatically find or install all dependencies described in its CM meta description, aggregate all environment variables, preprocess all files and assemble the MLPerf benchmark command line.
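The same run can also be scripted from Python. Here is a minimal sketch, assuming the usual CLI-to-dictionary mapping of cmind.access, where dotted flags such as --adr.python.name become nested dictionaries:

# Minimal sketch: launch the same MLPerf accuracy run via the CM Python API.
# Assumes the CLI-to-dictionary convention of cmind.access(), where
# --adr.python.name=mlperf-cuda becomes a nested 'adr' dictionary.
import cmind

r = cmind.access({'action': 'run',
                  'automation': 'script',
                  'tags': 'app,mlperf,inference,generic,_python,_retinanet,_onnxruntime,_cuda',
                  'adr': {'python': {'name': 'mlperf-cuda'}},
                  'scenario': 'Offline',
                  'mode': 'accuracy',
                  'test_query_count': '10',
                  'clean': True})
if r['return'] > 0:
    cmind.error(r)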

It will take a few minutes to run, and you should see the following accuracy:

loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.09s).
Accumulating evaluation results...
DONE (t=0.11s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.548
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.787
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.714
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.304
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.631
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.433
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.663
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.731

mAP=54.814%

Run MLPerf inference benchmark (offline, performance)

Let's run the MLPerf object detection benchmark on the GPU while measuring performance:

cm run script "app mlperf inference generic _python _retinanet _onnxruntime _cuda" \
     --adr.python.name=mlperf-cuda \
     --scenario=Offline \
     --mode=performance \
     --clean

It will run for 2-5 minutes, and at the end you should see output similar to the following (the QPS is the performance result of this benchmark and depends on the speed of your system):

TestScenario.Offline qps=8.44, mean=4.7238, time=78.230, queries=660, tiles=50.0:4.8531,80.0:5.0225,90.0:5.1124,95.0:5.1658,99.0:5.2730,99.9:5.3445


================================================
MLPerf Results Summary
================================================
...

No warnings encountered during test.

No errors encountered during test.

  - running time of script "app,vision,language,mlcommons,mlperf,inference,reference,generic,ref": 86.90 sec.

Summary

You can now run MLPerf in the submission mode (accuracy and performance) on the GPU using the following CM command with a Python virtual environment (just replace "Community" with your organization or any other identifier):

cm pull repo mlcommons@ck

cm run script "get sys-utils-cm" --quiet

cm run script "install python-venv" --version=3.10.8 --name=mlperf-cuda

cm run script --tags=run,mlperf,inference,generate-run-cmds,_submission,_short,_dashboard \
      --adr.python.name=mlperf-cuda \
      --adr.python.version_min=3.8 \
      --adr.compiler.tags=gcc \
      --adr.openimages-preprocessed.tags=_500 \
      --submitter="Community" \
      --implementation=python \
      --hw_name=default \
      --model=retinanet \
      --backend=onnxruntime \
      --device=gpu \
      --scenario=Offline \
      --test_query_count=10 \
      --clean

In case of a successful run, you should see your crowd-testing results at this live W&B dashboard.

MLPerf inference - C++ - RetinaNet FP32 - Open Images - ONNX - GPU - Offline

Summary

After detecting CUDA and cuDNN using CM in the previous section, you can also run the C++ implementation of the MLPerf vision benchmark with CUDA as follows (just replace "Community" with your organization or any other identifier):

cm pull repo mlcommons@ck

cm run script "get sys-utils-cm" --quiet

cm run script "install python-venv" --version=3.10.8 --name=mlperf-cuda

cm run script --tags=run,mlperf,inference,generate-run-cmds,_submission,_short,_dashboard \
      --adr.python.name=mlperf-cuda \
      --adr.python.version_min=3.8 \
      --adr.compiler.tags=gcc \
      --adr.openimages-preprocessed.tags=_500 \
      --submitter="Community" \
      --implementation=cpp \
      --hw_name=default \
      --model=retinanet \
      --backend=onnxruntime \
      --device=gpu \
      --scenario=Offline \
      --test_query_count=10 \
      --clean

In case of a successful run, you should see your crowd-testing results at this live W&B dashboard.

MLPerf inference - Python - RetinaNet FP32 - Open Images - PyTorch - CPU - Offline

Summary

You can now try to use PyTorch instead of ONNX as follows:

cm pull repo mlcommons@ck

cm run script "get sys-utils-cm" --quiet

cm run script "install python-venv" --version=3.10.8 --name=mlperf

cm run script --tags=run,mlperf,inference,generate-run-cmds,_submission,_short,_dashboard \
      --adr.python.name=mlperf \
      --adr.python.version_min=3.8 \
      --adr.compiler.tags=gcc \
      --adr.ml-engine-torchvision.version_max=0.12.1 \
      --adr.openimages-preprocessed.tags=_500 \
      --submitter="Community" \
      --implementation=python \
      --hw_name=default \
      --model=retinanet \
      --backend=pytorch \
      --device=cpu \
      --scenario=Offline \
      --test_query_count=10 \
      --num_threads=1 \
      --clean

CM will install PyTorch and PyTorch Vision (torchvision) <= 0.12.1 (this is needed because the current MLPerf inference implementation fails with newer torchvision versions; this will be fixed by the MLCommons inference WG) and will run this benchmark with 1 thread (this is needed because the current PyTorch implementation sometimes fails with a higher number of threads; this will also be fixed by the MLCommons inference WG).

In case of a successful run, you should see your crowd-testing results at this live W&B dashboard.

The next steps

Please check the other parts of this tutorial to learn how to customize and optimize the MLPerf inference benchmark using MLCommons CM (under preparation):

  • 1st part: customize MLPerf inference (Python ref implementation, Open images, ONNX, CPU)
  • 3rd part: customize MLPerf inference (ResNet50 Int8, ImageNet, TVM)
  • To be continued

You are welcome to join the open MLCommons taskforce on automation and reproducibility to contribute to this project, continue optimizing this benchmark, and prepare an official submission for MLPerf inference v3.0 (March 2023) with the help of the community.

See the development roadmap here.

Authors

Acknowledgments

We thank Hai Ah Nam, Steve Leak, Vijay Janappa Reddi, Tom Jablin, Ramesh N Chukka, Peter Mattson, David Kanter, Pablo Gonzalez Mesa, Thomas Zhu, Thomas Schmid and Gaurav Verma for their suggestions and contributions.