
[IE CLDNN] Updated GPU device config (openvinotoolkit#6040)
vladimir-paramuzov authored Jun 9, 2021
1 parent f0cf1df commit aa4a18d
Showing 29 changed files with 301 additions and 278 deletions.
15 changes: 12 additions & 3 deletions docs/IE_DG/API_Changes.md
@@ -14,6 +14,15 @@ The sections below contain detailed list of changes made to the Inference Engine
* InferenceEngine::Parameter(std::shared_ptr<ngraph::Variant>& var)
* std::shared_ptr<ngraph::Variant> InferenceEngine::Parameter::asVariant() const
* InferenceEngine::Parameter::operator std::shared_ptr<ngraph::Variant>() const
* KEY_CLDNN_NV12_TWO_INPUTS GPU plugin option. Use KEY_GPU_NV12_TWO_INPUTS instead
* KEY_CLDNN_PLUGIN_PRIORITY GPU plugin option. Use KEY_GPU_PLUGIN_PRIORITY instead
* KEY_CLDNN_PLUGIN_THROTTLE GPU plugin option. Use KEY_GPU_PLUGIN_THROTTLE instead
* KEY_CLDNN_MEM_POOL GPU plugin option
* KEY_CLDNN_GRAPH_DUMPS_DIR GPU plugin option
* KEY_CLDNN_SOURCES_DUMPS_DIR GPU plugin option
* KEY_DUMP_KERNELS GPU plugin option
* KEY_TUNING_MODE GPU plugin option
* KEY_TUNING_FILE GPU plugin option
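
The sketch below is a minimal migration example for the renamed GPU options, assuming the replacement keys are exposed through a `GPU_CONFIG_KEY` macro in `<gpu/gpu_config.hpp>` (analogous to the old `CLDNN_CONFIG_KEY` macro) and that `"1"` is a valid throttle level; it is an illustration, not the plugin's documented usage:

```cpp
#include <inference_engine.hpp>
#include <gpu/gpu_config.hpp>  // assumed location of the GPU_CONFIG_KEY macros

int main() {
    InferenceEngine::Core core;

    // Before (deprecated): CLDNN-prefixed keys, e.g.
    // core.SetConfig({{CLDNN_CONFIG_KEY(PLUGIN_THROTTLE), "1"}}, "GPU");

    // After: GPU-prefixed keys (names taken from the deprecation notes above).
    core.SetConfig({{GPU_CONFIG_KEY(PLUGIN_THROTTLE), "1"}}, "GPU");
    core.SetConfig({{GPU_CONFIG_KEY(NV12_TWO_INPUTS),
                     InferenceEngine::PluginConfigParams::YES}}, "GPU");
    return 0;
}
```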

## 2021.3

@@ -528,7 +537,7 @@ The sections below contain detailed list of changes made to the Inference Engine
* DLIA_CONFIG_KEY(ENABLE_STREAMING) config key

### Removed API

* InferenceEngine::EltwiseLayer::Select from InferenceEngine::EltwiseLayer::eOperation enumeration

## 2019 R2
@@ -577,7 +586,7 @@ The sections below contain detailed list of changes made to the Inference Engine
* DLIA_CONFIG_KEY(IO_TRANSFORMATIONS_NATIVE) config key
* DLIA_CONFIG_KEY(DUMP_SUPPORTED_LAYERS_INFORMATION) config key
* GNA_CONFIG_VALUE(SW_FP32) config value for GNA_CONFIG_KEY(DEVICE_MODE) key
* MULTI_CONFIG_KEY(DEVICE_PRIORITIES) config key for `MULTI` device
* InferenceEngine::CNNNetReader::ReadNetwork(const std::wstring &filepath) new method
* InferenceEngine::CNNNetReader::ReadWeights(const std::wstring &filepath) new method
* InferenceEngine::ExecutableNetwork::ExecutableNetwork(IExecutableNetwork::Ptr actual, InferenceEnginePluginPtr plg) constructor with additional `plg` parameter
@@ -593,7 +602,7 @@ The sections below contain detailed list of changes made to the Inference Engine
* InferenceEngine::EltwiseLayer::Logical_NOT, InferenceEngine::EltwiseLayer::Mean, InferenceEngine::EltwiseLayer::Select extensions to InferenceEngine::EltwiseLayer::eOperation enumeration
* InferenceEngine::OneHotLayer new class
* InferenceEngine::SelectLayer new class
* InferenceEngine::BroadcastLayer new class
* InferenceEngine::MathLayer new class
* InferenceEngine::ReduceLayer new class
* InferenceEngine::TopKLayer new class
16 changes: 0 additions & 16 deletions docs/IE_DG/Extensibility_DG/GPU_Kernel.md
@@ -219,22 +219,6 @@ __kernel void example_relu_kernel(
## Debugging Tips<a name="debugging-tips"></a>
* **Dumping the Resulting Kernels**.
It is recommended to get a dump of the kernel with all of
the values set by the Inference Engine, such as tensor sizes,
floating-point, and integer kernel parameters. To get the dump, add the
following line to your code that configures the GPU plugin to output the
custom kernels:
@snippet snippets/GPU_Kernel.cpp part1
When the Inference Engine compiles the kernels for the specific network,
it also outputs the resulting code for the custom kernels. In the
directory of your executable, find files like
`clDNN_program0.cl`, `clDNN_program1.cl`. There are as many files as there are
distinct sets of parameters for your custom kernel: different input
tensor sizes and kernel parameters.
* **Using `printf` in the OpenCL™ Kernels**.
To debug the specific values, you can use `printf` in your kernels.
However, be careful: for instance, do not output excessively
39 changes: 0 additions & 39 deletions docs/IE_DG/GPU_Kernels_Tuning.md

This file was deleted.

20 changes: 5 additions & 15 deletions docs/IE_DG/Intro_to_Performance.md
@@ -22,7 +22,7 @@ $ benchmark_app -m <model.xml> -enforcebf16=false
Notice that for quantized (e.g., INT8) models the bfloat16 calculations (of the layers that remain in FP32) are disabled by default.
Refer to the [CPU Plugin documentation](supported_plugins/CPU.md) for more details.

Similarly, the GPU device has a dedicated config key to enable FP16 execution of the layers that remain in FP32 in quantized models (as quantization is typically performed on FP32 models); refer to the ENABLE_FP16_FOR_QUANTIZED_MODELS key in the [GPU Plugin documentation](supported_plugins/CL_DNN.md)
Similarly, the GPU device has a dedicated config key to enable FP16 execution of the layers that remain in FP32 in quantized models (as quantization is typically performed on FP32 models); refer to the ENABLE_FP16_FOR_QUANTIZED_MODELS key in the [GPU Plugin documentation](supported_plugins/GPU.md)

## Latency vs. Throughput
One way to increase computational efficiency is batching, which combines many (potentially tens) of
@@ -72,30 +72,20 @@ Instead, it is possible to keep a separate infer request per camera or another s

## Benchmark App
The [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample is the best performance reference.
It has a lot of device-specific knobs, but the primary usage is as simple as:
```bash
$ ./benchmark_app -d GPU -m <model> -i <input>
```
to measure the performance of the model on the GPU.
Or
```bash
$ ./benchmark_app -d CPU -m <model> -i <input>
```
to execute on the CPU instead.

For example, for the CPU throughput mode from the previous section, you can play with the number of streams (the `-nstreams` command-line parameter).
Try different values of the `-nstreams` argument from `1` to the number of CPU cores and find the one that provides the best performance. For example, on an 8-core CPU, compare `-nstreams 1` (a latency-oriented scenario) to `2`, `4`, and `8` streams. Notice that `benchmark_app` automatically queries/creates/runs the number of requests required to saturate the given number of streams.

Finally, notice that when you don't specify the number of streams with `-nstreams`, the "AUTO" value for the streams is used, e.g. for the CPU this is [CPU_THROUGHPUT_AUTO](supported_plugins/CPU.md). You can spot the actual value behind "AUTO" for your machine in the application output.
Notice that the "AUTO" number is not necessarily the most optimal, so it is generally recommended to experiment either with `benchmark_app`'s `-nstreams` as described above, or via the [new Workbench tool](@ref workbench_docs_Workbench_DG_Introduction). This allows you to simplify the app logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance.
Instead, it is possible to keep a separate infer request per camera or another source of input and process the requests in parallel using the Async API.
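
The same streams knob can also be set from application code. Below is a rough sketch (assuming a hypothetical `model.xml` and the standard `CONFIG_KEY`/`CONFIG_VALUE` macros from `ie_plugin_config.hpp`) that requests the AUTO value and then reads back what the CPU plugin resolved it to for the loaded network:

```cpp
#include <iostream>
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;

    // Let the CPU plugin pick the number of streams ("AUTO" behavior).
    core.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS),
                     CONFIG_VALUE(CPU_THROUGHPUT_AUTO)}}, "CPU");

    // A fixed value found by sweeping -nstreams could be set instead, e.g. "4":
    // core.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), "4"}}, "CPU");

    auto network = core.ReadNetwork("model.xml");  // hypothetical model path
    auto exec = core.LoadNetwork(network, "CPU");

    // The executable network is expected to report the resolved stream count here.
    std::cout << exec.GetConfig(CONFIG_KEY(CPU_THROUGHPUT_STREAMS)).as<std::string>()
              << std::endl;
    return 0;
}
```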

## Kernels Tuning for GPU

The GPU backend comes with a feature that allows model tuning, so the workload is configured to better fit the hardware.

Tuning is a time-consuming process that internally executes every layer several (or even hundreds of) times to find the most performant configuration.

This configuration is saved into a JSON-formatted file whose name can be passed as a plugin parameter for the network. The GPU backend will process this data to configure kernels for the best performance.

For more details about kernels tuning and how to use it, please refer to [GPU Kernels Tuning](GPU_Kernels_Tuning.md).
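
For reference, here is a minimal sketch of how a tuning file was typically passed to the GPU plugin before these keys were deprecated; it assumes the generic `CONFIG_KEY`/`CONFIG_VALUE` macros (KEY_TUNING_MODE / KEY_TUNING_FILE) and uses a hypothetical model path and cache file name:

```cpp
#include <map>
#include <string>
#include <inference_engine.hpp>

int main() {
    InferenceEngine::Core core;
    auto network = core.ReadNetwork("model.xml");  // hypothetical model path

    // Create (or extend) a JSON tuning cache while compiling the network for GPU.
    // Note: KEY_TUNING_MODE and KEY_TUNING_FILE are deprecated by this commit.
    std::map<std::string, std::string> config = {
        {CONFIG_KEY(TUNING_MODE), CONFIG_VALUE(TUNING_CREATE)},
        {CONFIG_KEY(TUNING_FILE), "gpu_tuning_cache.json"}  // hypothetical file name
    };
    auto exec = core.LoadNetwork(network, "GPU", config);
    return 0;
}
```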
