QNN Backend Overview

nullname edited this page Mar 1, 2025 · 15 revisions

Summary

This fork builds on zhouwg's initial PR, with further refactoring and improvements, to add support for the Qualcomm QNN backend to GGML.

This backend is organized into three distinct integration layers:

graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Maps a GGML computation graph into a corresponding QNN graph so that operations can be offloaded to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance.
      • Translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (see op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class.
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path.
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
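
The fallback loading and search-path handling described above can be illustrated with a minimal sketch. It is not the code in qnn-lib.cpp: the names `lib_loader`, `prepend_search_path`, and `try_load_with_fallback` are hypothetical, the loader is injected as a callback (a real backend would wrap `dlopen()` or `LoadLibrary()`), and both the ':' separator and the "bare name first, then each directory" order are assumptions.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical loader signature: returns a non-null handle on success.
// A real implementation would wrap dlopen() or LoadLibrary() here.
using lib_loader = std::function<void *(const std::string & path)>;

// Prepend `dir` to a ':'-separated search path value (for example the
// contents of LD_LIBRARY_PATH), keeping the existing entries.
// Assumption: ':' is the separator, which holds on Linux and Android.
inline std::string prepend_search_path(const std::string & dir, const std::string & current) {
    if (dir.empty()) {
        return current;
    }
    if (current.empty()) {
        return dir;
    }
    return dir + ":" + current;
}

// Try the bare library name first so the system loader can resolve it,
// then fall back to each search directory in order.
// Returns nullptr if every attempt fails.
inline void * try_load_with_fallback(const std::string & lib_name,
                                     const std::vector<std::string> & search_dirs,
                                     const lib_loader & loader) {
    assert(loader && "a loader callback is required");
    if (void * handle = loader(lib_name)) {
        return handle;
    }
    for (const auto & dir : search_dirs) {
        if (dir.empty()) {
            continue;
        }
        std::string full_path = dir;
        if (full_path.back() != '/') {
            full_path += '/';
        }
        full_path += lib_name;
        if (void * handle = loader(full_path)) {
            return handle;
        }
    }
    return nullptr;
}
```

Injecting the loader keeps the fallback order easy to exercise without real shared libraries; the returned handle would then be tracked so it can be released during finalization.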

Key Features and Improvements

  • Graph Mapping Mechanism:

    • GGML graphs are mapped into QNN graphs so that computation graphs can be offloaded to and executed on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance.
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations.
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
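
As a rough illustration of the graph caching idea, the sketch below builds a string key from the ops and tensor shapes of a graph and reuses a previously built graph when the key matches. The types `node_desc`, `compiled_graph`, and `graph_cache`, and the key format, are hypothetical stand-ins and do not reflect how backend-ops.cpp actually keys or stores its graphs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified description of one graph node.
struct node_desc {
    std::string          op;     // operation name, e.g. "MUL_MAT"
    std::vector<int64_t> shape;  // output tensor dimensions
};

// Hypothetical stand-in for a finalized accelerator graph.
struct compiled_graph {
    std::string key;
};

// Build a cache key from op names and shapes, so two graphs with the same
// structure and dimensions map to the same entry.
inline std::string make_graph_key(const std::vector<node_desc> & nodes) {
    std::string key;
    for (const auto & node : nodes) {
        key += node.op;
        for (int64_t dim : node.shape) {
            key += '_';
            key += std::to_string(dim);
        }
        key += ';';
    }
    return key;
}

class graph_cache {
  public:
    using builder_fn =
        std::function<std::shared_ptr<compiled_graph>(const std::vector<node_desc> &)>;

    // Return the cached graph for `nodes`, or build and cache it on a miss.
    std::shared_ptr<compiled_graph> get_or_build(const std::vector<node_desc> & nodes,
                                                 const builder_fn & build) {
        assert(build && "a builder callback is required");
        const std::string key = make_graph_key(nodes);
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;  // cache hit: skip graph creation
        }
        auto graph = build(nodes);
        if (graph) {
            cache_.emplace(key, graph);  // only cache successful builds
        }
        return graph;
    }

    std::size_t size() const { return cache_.size(); }

  private:
    std::unordered_map<std::string, std::shared_ptr<compiled_graph>> cache_;
};
```

Including shapes in the key matters because a graph finalized for one set of tensor dimensions generally cannot be reused for another; a production key would likely also need data types and op parameters.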

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows using test-backend-ops, which runs in the pipeline for each commit on the dev-refactoring branch.

    Platform | test-backend-ops | Full console output
    Android | (image) | test-backend-ops_all_android_ff033e1.log
    Linux | (image) | test-backend-ops_all_linux_ff033e1.log
    Windows | To be filled |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • The table below shows GIFs of the QNN backend running on different platforms.

    Platform | SoC | Model | GIF | Original video
    Android | 8 Gen 2 | llama-3-8B-Instruct-Q4_K_M | Recording_Muted_hevc_14_126_640 | Recording_Muted_hevc.mp4
    Windows | To be filled | | |

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized models, the CPU backend may be used as a fallback.

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantization data types, with efforts underway to map GGML's block quantization structure into QNN.
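
To make "block quantization structure" concrete, here is a simplified sketch of a 4-bit block layout in the style of GGML's Q4_0, where 32 values share one scale and are packed two per byte. It is illustrative only: `block_q4_0_sketch` and `dequantize_blocks` are hypothetical names, the scale is held as a `float` for simplicity (GGML stores it as fp16), and nothing here describes how the QNN mapping will be implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // elements per block, as in GGML's Q4_0

// Simplified block: one shared scale plus 32 packed 4-bit values.
// Assumption: float scale instead of GGML's fp16 to keep the sketch short.
struct block_q4_0_sketch {
    float   scale;
    uint8_t quants[kBlockSize / 2];  // two 4-bit values per byte
};

// Expand blocks back to floats. Low nibbles hold the first half of each
// block, high nibbles the second half; stored values are offset by 8.
inline std::vector<float> dequantize_blocks(const std::vector<block_q4_0_sketch> & blocks) {
    std::vector<float> out(blocks.size() * kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const block_q4_0_sketch & blk = blocks[b];
        float * dst = out.data() + b * kBlockSize;
        for (int j = 0; j < kBlockSize / 2; ++j) {
            const int low  = (blk.quants[j] & 0x0F) - 8;
            const int high = (blk.quants[j] >> 4) - 8;
            dst[j]                  = static_cast<float>(low) * blk.scale;
            dst[j + kBlockSize / 2] = static_cast<float>(high) * blk.scale;
        }
    }
    return out;
}
```

Any mapping into an accelerator format has to preserve this per-block scale and nibble ordering, or convert the data into a layout the target natively supports.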

Resources

Project

qnn backend

Build and run tests
