
feat: Add TensorRT support for GNNs #4016

Open

benjaminhuth wants to merge 18 commits into main
Conversation

@benjaminhuth (Member) commented Jan 9, 2025

Currently this cannot be compiled in the CI.

--- END COMMIT MESSAGE ---


Summary by CodeRabbit

Release Notes

  • New Features

    • Added TensorRT support for edge classification in the ExaTrkX plugin.
    • Introduced the TensorRTEdgeClassifier, a TensorRT-based GNN edge classifier (a construction sketch follows this summary).
  • Infrastructure

    • Updated build configuration to support TensorRT integration.
    • Added a new build job for TensorRT-enabled components.
  • Technical Improvements

    • Enhanced plugin with GPU-accelerated inference capabilities.
    • Integrated TensorRT runtime and execution context for efficient processing.
    • Improved build flexibility by removing previous dependency checks in the configuration.
    • Introduced a CMake module for finding and configuring TensorRT.
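
For orientation, a minimal construction sketch (illustration only, not code from this PR): the modelPath field name is an assumption, cut and deviceID are the configuration members discussed in the review below, and the constructor-with-logger pattern mirrors the Python binding shown there.

#include <Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp>
#include <Acts/Utilities/Logger.hpp>

#include <memory>

int main() {
  Acts::TensorRTEdgeClassifier::Config cfg;
  cfg.modelPath = "gnn.engine";  // hypothetical field name: serialized TensorRT engine
  cfg.cut = 0.21f;               // edge-score threshold (default noted in the review)
  cfg.deviceID = 0;              // CUDA device used for inference

  // Same construction pattern as the Python binding in
  // Examples/Python/src/ExaTrkXTrackFinding.cpp: config plus an Acts logger.
  auto classifier = std::make_shared<Acts::TensorRTEdgeClassifier>(
      cfg, Acts::getDefaultLogger("TensorRTEdgeClassifier", Acts::Logging::INFO));

  // The classifier then serves as the edge-classification stage of the
  // ExaTrkX track-finding pipeline, as an alternative to the Torch/ONNX backends.
  return classifier ? 0 : 1;
}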


coderabbitai bot commented Jan 9, 2025

Walkthrough

A new job, build_gnn_tensorrt, to the .gitlab-ci.yml added has been. The build environment for TensorRT-powered edge classification, this job prepares. Introduced, the TensorRTEdgeClassifier is, enabling GPU-accelerated inference in track finding. Multiple files, the changes span: Python bindings, CMake modifications, and the core implementation files that support this new functionality.

Changes

File | Change Summary
.gitlab-ci.yml | Added build_gnn_tensorrt job with TensorRT Docker image and dependency configuration
Examples/Python/src/ExaTrkXTrackFinding.cpp | Introduced TensorRTEdgeClassifier class and Python bindings
Plugins/ExaTrkX/CMakeLists.txt | Added TensorRT package discovery and library linking
Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp | Defined TensorRTEdgeClassifier header with configuration and interface
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp | Implemented TensorRTEdgeClassifier with TensorRT inference logic
CMakeLists.txt | Removed conditional check for ACTS_EXATRKX_ENABLE_ONNX or ACTS_EXATRKX_ENABLE_TORCH
cmake/FindTensorRT.cmake | Introduced module for locating and configuring TensorRT

Possibly related PRs

  • feat: Add CUDA connected components & track building #4015: The changes in this PR introduce a new job in the .gitlab-ci.yml file related to CUDA functionality, which aligns with the new job build_gnn_tensorrt added in the main PR. Both PRs focus on enhancing the build process for components that utilize CUDA and TensorRT.

Suggested Labels

automerge

Suggested Reviewers

  • paulgessinger
  • stephenswat

Poem

In circuits of silicon bright, 🖥️
TensorRT dances with neural might, 🤖
Edges classified with grace,
GPU's computational embrace,
Track finding's quantum delight! 🚀


@github-actions bot added labels: Component - Examples (Affects the Examples module), Component - Plugins (Affects one or more Plugins) — Jan 9, 2025
@github-actions bot added this to the next milestone — Jan 9, 2025

github-actions bot commented Jan 9, 2025

📊: Physics performance monitoring for 7243544

Full contents

physmon summary

@benjaminhuth marked this pull request as ready for review — January 15, 2025 13:53
@coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (4)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)

98-100: ACTS logging instead of std::cout, prefer you should.

For consistency within the codebase, replace std::cout with ACTS logging macros.

Apply this diff to use the logging framework:

 ~TimePrinter() {
-  std::cout << name << ": " << milliseconds(t0, t1) << std::endl;
+  ACTS_INFO(name << ": " << milliseconds(t0, t1) << " ms");
 }
Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (2)

38-41: Destructor to be marked override, consider you should.

Since the base class has a virtual destructor, marking the destructor in the derived class with override good practice it is.

Apply this diff for clarity:

-  ~TensorRTEdgeClassifier();
+  ~TensorRTEdgeClassifier() override;

49-58: Member variables' initialization order, ensure you must.

Initialize member variables in the order they are declared to avoid warnings.

Ensure that m_cfg is initialized before m_trtLogger, as declared.
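
Illustrate the ordering, this reduced standalone sketch does (assumed, the member names are; verbatim from the header, they are not):

#include <memory>

class TensorRTEdgeClassifierSketch {
 public:
  struct Config {
    float cut = 0.21f;
    int deviceID = 0;
  };

  // m_cfg is declared before m_trtLogger, so it also comes first in the
  // mem-initializer list; members are constructed in declaration order
  // regardless, and matching the two silences -Wreorder warnings.
  explicit TensorRTEdgeClassifierSketch(const Config &cfg)
      : m_cfg(cfg), m_trtLogger(std::make_unique<int>(0)) {}

 private:
  Config m_cfg;
  std::unique_ptr<int> m_trtLogger;  // stand-in for the TensorRT logger type
};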

Examples/Python/src/ExaTrkXTrackFinding.cpp (1)

110-128: Logger name, more specific make you should.

For clarity and consistency, use a distinct logger name for TensorRTEdgeClassifier.

Apply this diff to specify the logger name:

                       return std::make_shared<Alg>(
-                          c, getDefaultLogger("EdgeClassifier", lvl));
+                          c, getDefaultLogger("TensorRTEdgeClassifier", lvl));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0e65bc and f5a819a.

📒 Files selected for processing (5)
  • .gitlab-ci.yml (1 hunks)
  • Examples/Python/src/ExaTrkXTrackFinding.cpp (2 hunks)
  • Plugins/ExaTrkX/CMakeLists.txt (1 hunks)
  • Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1 hunks)
  • Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: CI Bridge / build_gnn_tensorrt
  • GitHub Check: linux_physmon
  • GitHub Check: linux_examples_test
  • GitHub Check: missing_includes
  • GitHub Check: linux_ubuntu_extra (ubuntu2204_clang, 20)
  • GitHub Check: build_debug
🔇 Additional comments (3)
Examples/Python/src/ExaTrkXTrackFinding.cpp (1)

126-126: Missing configuration member useEdgeFeatures, verify you should.

Inconsistent the configuration is with other classifiers. Include useEdgeFeatures if required.

Ensure that all necessary configuration options are included.

Plugins/ExaTrkX/CMakeLists.txt (2)

Line range hint 1-38: Well-structured, this CMake configuration is!

Follow consistent patterns for different backends, it does. Proper organization and clarity, I sense.


26-38: Version constraints for TensorRT, specify we must!

Hmmmm, missing version constraints for TensorRT package, I see. Dangerous this can be, yes. Compatibility issues, it may cause.

Apply this change, you should:

-    find_package(TensorRT REQUIRED)
+    find_package(TensorRT 8.6 REQUIRED)
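
Complement the CMake constraint at run time, a small sanity check can. A sketch only, this is: getInferLibVersion() and the NV_TENSORRT_* macros are standard TensorRT API, while 8.6 above remains only the suggested minimum, not a verified requirement of this plugin.

#include <NvInfer.h>

#include <cstdint>
#include <cstdio>

int main() {
  // Version of the headers the plugin was compiled against.
  std::printf("TensorRT headers: %d.%d.%d (encoded %d)\n", NV_TENSORRT_MAJOR,
              NV_TENSORRT_MINOR, NV_TENSORRT_PATCH, NV_TENSORRT_VERSION);

  // Version of the TensorRT library loaded at run time.
  const std::int32_t runtime = getInferLibVersion();
  std::printf("TensorRT runtime: encoded %d\n", static_cast<int>(runtime));

  if (runtime != NV_TENSORRT_VERSION) {
    std::printf("Warning: header/library version mismatch\n");
  }
  return 0;
}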

Resolved (outdated) review threads: 5 on Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp, 1 on .gitlab-ci.yml
@github-actions bot added the Infrastructure (Changes to build tools, continuous integration, ...) label — Jan 16, 2025
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
cmake/FindTensorRT.cmake (4)

1-28: Hmmmm, missing minimum version requirements, I sense.

Document the minimum required TensorRT version for this module, you should. Help users avoid compatibility issues, it will.


30-34: Validate components list, we must.

Check for invalid component names in TensorRT_FIND_COMPONENTS, wise it would be. Prevent configuration errors early, this will.

 if(NOT TensorRT_FIND_COMPONENTS)
     set(TensorRT_FIND_COMPONENTS nvinfer nvinfer_plugin nvonnxparser nvparsers)
 endif()
+set(_valid_components nvinfer nvinfer_plugin nvonnxparser nvparsers)
+foreach(component IN LISTS TensorRT_FIND_COMPONENTS)
+    if(NOT component IN_LIST _valid_components)
+        message(FATAL_ERROR "Invalid TensorRT component specified: ${component}")
+    endif()
+endforeach()
 set(TensorRT_LIBRARIES)

44-53: More helpful error message, provide we should.

Include the searched paths in the error message, helpful it would be. Guide users to correct configuration, this will.

     if(TensorRT_FIND_REQUIRED)
         message(
             FATAL_ERROR
-            "Fail to find TensorRT, please set TensorRT_ROOT. Include path not found."
+            "Failed to find TensorRT header NvInfer.h. Searched in:\n"
+            "  - ${TensorRT_ROOT}/include\n"
+            "  - $ENV{TensorRT_ROOT}/include\n"
+            "Please set TensorRT_ROOT to the installation directory."
         )
     endif()

171-174: Process components in parallel, consider we should.

For large component lists, parallel processing could speed up configuration. Optional it is, but beneficial for large projects.

+if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.19")
+    cmake_policy(SET CMP0114 NEW)
+    foreach(component IN LISTS TensorRT_FIND_COMPONENTS)
+        cmake_language(DEFER DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+            CALL _find_trt_component ${component})
+    endforeach()
+else()
     foreach(component IN LISTS TensorRT_FIND_COMPONENTS)
         _find_trt_component(${component})
     endforeach()
+endif()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b983ba9 and d70d26f.

📒 Files selected for processing (2)
  • Examples/Python/src/ExaTrkXTrackFinding.cpp (2 hunks)
  • cmake/FindTensorRT.cmake (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Examples/Python/src/ExaTrkXTrackFinding.cpp
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: linux_ubuntu_extra (ubuntu2204_clang, 20)
  • GitHub Check: linux_ubuntu_extra (ubuntu2204, 20)
  • GitHub Check: macos
  • GitHub Check: linux_ubuntu
  • GitHub Check: unused_files
  • GitHub Check: missing_includes
  • GitHub Check: build_debug
  • GitHub Check: CI Bridge / build_gnn_tensorrt
🔇 Additional comments (2)
cmake/FindTensorRT.cmake (2)

176-182: Well implemented, the package handling is.

Following CMake best practices, you are. Approve this section, I do.


56-114: Handle malformed version strings, we must.

Verify that version strings are properly extracted, essential it is. Add error handling for malformed version strings, we should.

cmake/FindTensorRT.cmake — resolved review thread
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.gitlab-ci.yml (1)

189-224: ⚠️ Potential issue

Align with existing ExaTrkX jobs and add missing configurations, you must!

Similar to the past review, improvements needed there are:

  1. CUDA architectures with other ExaTrkX jobs, align we must
  2. Artifacts for downstream jobs, configure we should
  3. Testing stage for TensorRT functionality, define we must
  4. Torch disabled while TensorRT enabled, verify this approach we should

Apply these changes, you should:

 build_gnn_tensorrt:
   stage: build
   image: ghcr.io/acts-project/ubuntu2404_tensorrt:74
   variables:
     DEPENDENCY_URL: https://acts.web.cern.ch/ACTS/ci/ubuntu-24.04/deps.$DEPENDENCY_TAG.tar.zst
+    TORCH_CUDA_ARCH_LIST: "8.0 8.6 8.9 9.0"

   cache:
     key: ccache-${CI_JOB_NAME}-${CI_COMMIT_REF_SLUG}-${CCACHE_KEY_SUFFIX}
     fallback_keys:
       - ccache-${CI_JOB_NAME}-${CI_DEFAULT_BRANCH}-${CCACHE_KEY_SUFFIX}
     when: always
     paths:
       - ${CCACHE_DIR}

+  artifacts:
+    paths:
+      - build/
+    exclude:
+      - build/**/*.o
+    expire_in: 6 hours

   tags:
     - docker-gpu-nvidia

   script:
     - git clone $CLONE_URL src
     - cd src
     - git checkout $HEAD_SHA
     - source CI/dependencies.sh
     - cd ..
     - mkdir build
     - >
       cmake -B build -S src
       -DACTS_BUILD_PLUGIN_EXATRKX=ON
-      -DACTS_EXATRKX_ENABLE_TORCH=OFF
       -DACTS_EXATRKX_ENABLE_CUDA=ON
       -DACTS_EXATRKX_ENABLE_TENSORRT=ON
       -DPython_EXECUTABLE=$(which python3)
       -DCMAKE_CUDA_ARCHITECTURES="75;86"

     - ccache -z
     - cmake --build build -- -j6
     - ccache -s

+test_gnn_tensorrt:
+  stage: test
+  needs:
+    - build_gnn_tensorrt
+  image: ghcr.io/acts-project/ubuntu2404_tensorrt:74
+  tags:
+    - docker-gpu-nvidia
+  script:
+    - apt-get update -y
+    - git clone $CLONE_URL src
+    - cd src
+    - git checkout $HEAD_SHA
+    - source CI/dependencies.sh
+    - cd ..
+    - ctest --test-dir build -R TensorRT
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d70d26f and 0d18424.

📒 Files selected for processing (1)
  • .gitlab-ci.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: merge-sentinel
  • GitHub Check: CI Bridge / build_gnn_tensorrt
  • GitHub Check: unused_files
  • GitHub Check: macos
  • GitHub Check: linux_ubuntu_extra (ubuntu2204_clang, 20)
  • GitHub Check: missing_includes
  • GitHub Check: build_debug
  • GitHub Check: linux_ubuntu
  • GitHub Check: docs
🔇 Additional comments (1)
.gitlab-ci.yml (1)

191-191: Verify the image tag, we must!

Different from other jobs using tag '63', this one uses tag '74'. Intentional this change is?

✅ Verification successful

Valid and intentional, this image tag is!

Created recently, the image tag '74' has been. Tagged as 'latest', it is, hmmmm. Trust in the Force and the wisdom of the maintainers, we must.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the image exists and compare its creation date with other images
gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /orgs/acts-project/packages/container/ubuntu2404_tensorrt/versions | jq '.[] | select(.metadata.container.tags[] | contains("74"))'

Length of output: 2322

.gitlab-ci.yml — resolved review thread
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1)

49-49: ⚠️ Potential issue

Return a proper torch::Device object, you must.

Return device type directly, you do. A proper torch::Device instance, needed it is.

Apply this diff to return the correct device:

-  torch::Device device() const override { return torch::kCUDA; };
+  torch::Device device() const override { return torch::Device(torch::kCUDA); };
🧹 Nitpick comments (1)
Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1)

29-38: Document the configuration parameters, you must.

Mysterious, these magic numbers are. Documentation for cut = 0.21 and validation for deviceID, needed they are. Add comments explaining (a documentation sketch follows this list):

  • The meaning and impact of the cut threshold
  • Valid ranges for deviceID
  • Purpose of each configuration parameter
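
One possible shape for that documentation, sketched here it is (assumptions, the modelPath name and the defaults other than cut = 0.21 are; the header's actual content, this is not):

#include <string>

struct Config {
  /// Path to the serialized TensorRT engine (hypothetical field name).
  std::string modelPath;

  /// Edge-score threshold: edges scoring below this value are discarded.
  /// The origin of the 0.21 default should be documented here.
  float cut = 0.21f;

  /// CUDA device ordinal used for inference; must lie in
  /// [0, cudaGetDeviceCount()) and should be validated at construction.
  int deviceID = 0;

  /// Whether edge features are passed to the network in addition to node
  /// features (see the useEdgeFeatures comment earlier in this review).
  bool useEdgeFeatures = false;
};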
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6337bb and bc06d42.

📒 Files selected for processing (1)
  • Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: linux_examples_test
  • GitHub Check: linux_physmon
  • GitHub Check: linux_ubuntu_extra (ubuntu2204_clang, 20)
  • GitHub Check: macos
  • GitHub Check: build_debug
  • GitHub Check: CI Bridge / build_gnn_tensorrt
  • GitHub Check: missing_includes
🔇 Additional comments (2)
Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (2)

61-62: Wise use of mutex for thread safety, I sense.

Protected well, the execution contexts are. Thread-safe design, you have achieved.


18-23: Verify TensorRT headers in CI environment, we must.

Failed to compile in CI, the PR has. Check if TensorRT headers are available in CI environment, we should.

✅ Verification successful

Expected behavior in the Force, this is. Worry, you must not.

In specialized container with TensorRT, the CI runs. Missing headers in standard paths, natural this is. When building locally, set TensorRT_ROOT to your installation path, you must. In CI environment, the dedicated container ghcr.io/acts-project/ubuntu2404_tensorrt:74 handles this automatically, it does.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for TensorRT headers in system paths
find /usr/include -name "NvInfer.h" 2>/dev/null || echo "TensorRT headers not found in /usr/include"
find /usr/local/include -name "NvInfer.h" 2>/dev/null || echo "TensorRT headers not found in /usr/local/include"

# Check if TensorRT is mentioned in package configuration
if command -v pkg-config &> /dev/null; then
    pkg-config --list-all | grep -i tensorrt || echo "No TensorRT package config found"
fi

Length of output: 303


Script:

#!/bin/bash
# Check CMake configuration for TensorRT
fd CMakeLists.txt --exec grep -l "TensorRT" {} \;
fd "*.cmake" --exec grep -l "TensorRT" {} \;

# Check documentation for TensorRT requirements
fd "README|INSTALL" --extension md --exec grep -i -A 5 -B 5 "tensorrt" {} \;

# Check environment variable usage in build files
rg "TENSORRT|TensorRT" --type cmake

Length of output: 6096


Script:

#!/bin/bash
# Check build configuration options
fd "BuildOptions.cmake|CMakeCache.txt" --exec grep -A 5 -B 5 "EXATRKX.*TENSORRT" {} \;

# Look for any build documentation
rg -i "tensorrt|cuda" docs/ || echo "No direct mentions in docs/"

# Check CI configuration for TensorRT setup
fd "\.gitlab-ci\.yml|\.github/workflows" --exec grep -A 10 -B 10 -i "tensorrt" {} \;

Length of output: 2251

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)

37-37: Const correctness in log method, improve we must.

The log method should mark msg parameter as const char* const, it should.

-  void log(Severity severity, const char *msg) noexcept override {
+  void log(Severity severity, const char* const msg) noexcept override {
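
For context, a standalone sketch of such a logger, this is (the PR's implementation, it is not): nvinfer1::ILogger and its Severity enum are standard TensorRT API, and the ACTS_* macros assume the usual in-scope logger() accessor.

#include <NvInfer.h>

#include <Acts/Utilities/Logger.hpp>

#include <memory>
#include <utility>

class TensorRTLoggerSketch : public nvinfer1::ILogger {
 public:
  explicit TensorRTLoggerSketch(std::unique_ptr<const Acts::Logger> logger)
      : m_logger(std::move(logger)) {}

  // Map TensorRT severities onto the Acts logging levels.
  void log(Severity severity, const char *const msg) noexcept override {
    switch (severity) {
      case Severity::kINTERNAL_ERROR:
      case Severity::kERROR:
        ACTS_ERROR(msg);
        break;
      case Severity::kWARNING:
        ACTS_WARNING(msg);
        break;
      case Severity::kINFO:
        ACTS_INFO(msg);
        break;
      case Severity::kVERBOSE:
        ACTS_VERBOSE(msg);
        break;
    }
  }

 private:
  // The ACTS_* macros expect a logger() accessor in scope.
  const Acts::Logger &logger() const { return *m_logger; }

  std::unique_ptr<const Acts::Logger> m_logger;
};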
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91d9864 and a36cab6.

📒 Files selected for processing (2)
  • Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp (1 hunks)
  • Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp
🧰 Additional context used
📓 Learnings (1)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)
Learnt from: benjaminhuth
PR: acts-project/acts#4016
File: Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp:44-46
Timestamp: 2025-01-21T10:16:57.820Z
Learning: In Acts project, when a derived class implements a virtual method, the documentation can be found in the base class and doesn't need to be repeated. For example, TensorRTEdgeClassifier's operator() documentation is in EdgeClassificationBase.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: CI Bridge / build_gnn_tensorrt
  • GitHub Check: merge-sentinel
🔇 Additional comments (1)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)

181-183: ⚠️ Potential issue

Fix undefined variable and improve memory management, we must.

Variable 'outputMem' undefined it is, and memory management improved it must be.

-  auto scores = torch::from_blob(
-      outputMem, edgeIndex.size(1), 1, [](void *ptr) { cudaFree(ptr); },
-      torch::TensorOptions().device(torch::kCUDA).dtype(torch::kFloat32));
+  // scores tensor already created and used for inference
+  // no need to create new tensor from blob

Likely invalid or redundant comment.

Comment on lines +104 to +108
  std::size_t freeMem, totalMem;
  cudaMemGetInfo(&freeMem, &totalMem);
  ACTS_DEBUG("Used CUDA memory after TensorRT initialization: "
             << (totalMem - freeMem) * 1e-9 << " / " << totalMem * 1e-9
             << " GB");

⚠️ Potential issue

Check cudaMemGetInfo status, you must.

Handle errors from CUDA memory information retrieval, we should.

 std::size_t freeMem, totalMem;
-cudaMemGetInfo(&freeMem, &totalMem);
+cudaError_t err = cudaMemGetInfo(&freeMem, &totalMem);
+if (err != cudaSuccess) {
+  throw std::runtime_error("Failed to get CUDA memory info: " + 
+                          std::string(cudaGetErrorString(err)));
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  std::size_t freeMem, totalMem;
  cudaMemGetInfo(&freeMem, &totalMem);
  ACTS_DEBUG("Used CUDA memory after TensorRT initialization: "
             << (totalMem - freeMem) * 1e-9 << " / " << totalMem * 1e-9
             << " GB");
  std::size_t freeMem, totalMem;
  cudaError_t err = cudaMemGetInfo(&freeMem, &totalMem);
  if (err != cudaSuccess) {
    throw std::runtime_error("Failed to get CUDA memory info: " +
                             std::string(cudaGetErrorString(err)));
  }
  ACTS_DEBUG("Used CUDA memory after TensorRT initialization: "
             << (totalMem - freeMem) * 1e-9 << " / " << totalMem * 1e-9
             << " GB");

Comment on lines +139 to +145
  while (context == nullptr) {
    std::lock_guard<std::mutex> lock(m_contextMutex);
    if (!m_contexts.empty()) {
      context = std::move(m_contexts.back());
      m_contexts.pop_back();
    }
  }

⚠️ Potential issue

Prevent deadlock in context acquisition, we must.

Infinite loop without timeout or condition variable, dangerous it is. Add timeout or condition variable to prevent deadlock, we should.

+  auto start = std::chrono::steady_clock::now();
   while (context == nullptr) {
     std::lock_guard<std::mutex> lock(m_contextMutex);
     if (!m_contexts.empty()) {
       context = std::move(m_contexts.back());
       m_contexts.pop_back();
+    } else {
+      auto now = std::chrono::steady_clock::now();
+      auto elapsed = std::chrono::duration_cast<std::chrono::seconds>
+                    (now - start).count();
+      if (elapsed > 60) {  // 1 minute timeout
+        throw std::runtime_error("Timeout waiting for available context");
+      }
+      std::this_thread::sleep_for(std::chrono::milliseconds(100));
     }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  while (context == nullptr) {
    std::lock_guard<std::mutex> lock(m_contextMutex);
    if (!m_contexts.empty()) {
      context = std::move(m_contexts.back());
      m_contexts.pop_back();
    }
  }
  auto start = std::chrono::steady_clock::now();
  while (context == nullptr) {
    std::lock_guard<std::mutex> lock(m_contextMutex);
    if (!m_contexts.empty()) {
      context = std::move(m_contexts.back());
      m_contexts.pop_back();
    } else {
      auto now = std::chrono::steady_clock::now();
      auto elapsed =
          std::chrono::duration_cast<std::chrono::seconds>(now - start).count();
      if (elapsed > 60) {  // 1 minute timeout
        throw std::runtime_error("Timeout waiting for available context");
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }
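
As an alternative to the polling above, the condition-variable variant mentioned in the comment, roughly like this it could look. A standalone sketch, this is, with an assumed m_contextCv member and a placeholder type standing in for the TensorRT execution context.

#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <utility>
#include <vector>

struct ExecutionContextPoolSketch {
  std::vector<std::unique_ptr<int>> m_contexts;  // placeholder for IExecutionContext
  std::mutex m_contextMutex;
  std::condition_variable m_contextCv;  // assumed additional member

  std::unique_ptr<int> acquire() {
    std::unique_lock<std::mutex> lock(m_contextMutex);
    // Sleep until a context is returned, or give up after one minute.
    if (!m_contextCv.wait_for(lock, std::chrono::minutes(1),
                              [&] { return !m_contexts.empty(); })) {
      throw std::runtime_error("Timeout waiting for available context");
    }
    auto context = std::move(m_contexts.back());
    m_contexts.pop_back();
    return context;
  }

  void release(std::unique_ptr<int> context) {
    {
      std::lock_guard<std::mutex> lock(m_contextMutex);
      m_contexts.push_back(std::move(context));
    }
    m_contextCv.notify_one();
  }
};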

Comment on lines +124 to +133
  auto nodeFeatures =
      std::any_cast<torch::Tensor>(inNodeFeatures).to(execContext.device);

  auto edgeIndex =
      std::any_cast<torch::Tensor>(inEdgeIndex).to(execContext.device);
  ACTS_DEBUG("edgeIndex: " << detail::TensorDetails{edgeIndex});

  auto edgeFeatures =
      std::any_cast<torch::Tensor>(inEdgeFeatures).to(execContext.device);
  ACTS_DEBUG("edgeFeatures: " << detail::TensorDetails{edgeFeatures});

🛠️ Refactor suggestion

Validate input tensors before device transfer, we must.

Check tensor validity and dimensions before device transfer, essential it is.

   auto nodeFeatures =
       std::any_cast<torch::Tensor>(inNodeFeatures).to(execContext.device);
+  if (!nodeFeatures.defined() || nodeFeatures.dim() != 2) {
+    throw std::runtime_error("Invalid node features tensor");
+  }
 
   auto edgeIndex =
       std::any_cast<torch::Tensor>(inEdgeIndex).to(execContext.device);
+  if (!edgeIndex.defined() || edgeIndex.dim() != 2 ||
+      edgeIndex.size(0) != 2) {
+    throw std::runtime_error("Invalid edge index tensor");
+  }
 
   auto edgeFeatures =
       std::any_cast<torch::Tensor>(inEdgeFeatures).to(execContext.device);
+  if (!edgeFeatures.defined() || edgeFeatures.dim() != 2) {
+    throw std::runtime_error("Invalid edge features tensor");
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  auto nodeFeatures =
      std::any_cast<torch::Tensor>(inNodeFeatures).to(execContext.device);
  auto edgeIndex =
      std::any_cast<torch::Tensor>(inEdgeIndex).to(execContext.device);
  ACTS_DEBUG("edgeIndex: " << detail::TensorDetails{edgeIndex});
  auto edgeFeatures =
      std::any_cast<torch::Tensor>(inEdgeFeatures).to(execContext.device);
  ACTS_DEBUG("edgeFeatures: " << detail::TensorDetails{edgeFeatures});
  auto nodeFeatures =
      std::any_cast<torch::Tensor>(inNodeFeatures).to(execContext.device);
  if (!nodeFeatures.defined() || nodeFeatures.dim() != 2) {
    throw std::runtime_error("Invalid node features tensor");
  }
  auto edgeIndex =
      std::any_cast<torch::Tensor>(inEdgeIndex).to(execContext.device);
  if (!edgeIndex.defined() || edgeIndex.dim() != 2 || edgeIndex.size(0) != 2) {
    throw std::runtime_error("Invalid edge index tensor");
  }
  ACTS_DEBUG("edgeIndex: " << detail::TensorDetails{edgeIndex});
  auto edgeFeatures =
      std::any_cast<torch::Tensor>(inEdgeFeatures).to(execContext.device);
  if (!edgeFeatures.defined() || edgeFeatures.dim() != 2) {
    throw std::runtime_error("Invalid edge features tensor");
  }
  ACTS_DEBUG("edgeFeatures: " << detail::TensorDetails{edgeFeatures});

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)

197-203: Structured performance monitoring system, implement we should.

Instead of scattered timing measurements, a dedicated performance monitoring system, beneficial it would be. Consider these improvements:

  • Structured timing data collection
  • Memory usage tracking at key points
  • Performance metrics aggregation

Example implementation structure:

struct PerformanceMetrics {
    double deviceTransferTime;
    double inferenceTime;
    double postProcessingTime;
    size_t peakMemoryUsage;
    // Add more metrics as needed
};

// Add to class members
std::vector<PerformanceMetrics> m_metrics;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a36cab6 and 7243544.

📒 Files selected for processing (1)
  • Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1 hunks)
🧰 Additional context used
📓 Learnings (1)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (1)
Learnt from: benjaminhuth
PR: acts-project/acts#4016
File: Plugins/ExaTrkX/include/Acts/Plugins/ExaTrkX/TensorRTEdgeClassifier.hpp:44-46
Timestamp: 2025-01-21T10:16:57.820Z
Learning: In Acts project, when a derived class implements a virtual method, the documentation can be found in the base class and doesn't need to be repeated. For example, TensorRTEdgeClassifier's operator() documentation is in EdgeClassificationBase.
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: merge-sentinel
  • GitHub Check: CI Bridge / build_gnn_tensorrt
  • GitHub Check: macos
🔇 Additional comments (4)
Plugins/ExaTrkX/src/TensorRTEdgeClassifier.cpp (4)

30-57: Well implemented, the logger class is!

Clean and proper mapping of TensorRT severity levels to Acts logging levels, I see. RAII principles with unique_ptr, you follow.


104-108: Check cudaMemGetInfo status, you must.

Handle errors from CUDA memory information retrieval, we should.


124-133: Validate input tensors before device transfer, we must.

Check tensor validity and dimensions before device transfer, essential it is.


139-145: Prevent deadlock in context acquisition, we must.

Infinite loop without timeout or condition variable, dangerous it is.

Comment on lines +167 to +173
  auto stream = execContext.stream.value().stream();
  auto status = context->enqueueV3(stream);
  if (!status) {
    throw std::runtime_error("Failed to execute TensorRT model");
  }
  ACTS_CUDA_CHECK(cudaStreamSynchronize(stream));


🛠️ Refactor suggestion

Ensure proper cleanup after inference failure, you must.

If inference fails, CUDA stream and resources must be properly cleaned up, they should be.

Apply this diff to ensure proper cleanup:

 auto stream = execContext.stream.value().stream();
 auto status = context->enqueueV3(stream);
 if (!status) {
+  {
+    std::lock_guard<std::mutex> lock(m_contextMutex);
+    m_contexts.push_back(std::move(context));
+  }
   throw std::runtime_error("Failed to execute TensorRT model");
 }
 ACTS_CUDA_CHECK(cudaStreamSynchronize(stream));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  auto stream = execContext.stream.value().stream();
  auto status = context->enqueueV3(stream);
  if (!status) {
    throw std::runtime_error("Failed to execute TensorRT model");
  }
  ACTS_CUDA_CHECK(cudaStreamSynchronize(stream));
  auto stream = execContext.stream.value().stream();
  auto status = context->enqueueV3(stream);
  if (!status) {
    {
      std::lock_guard<std::mutex> lock(m_contextMutex);
      m_contexts.push_back(std::move(context));
    }
    throw std::runtime_error("Failed to execute TensorRT model");
  }
  ACTS_CUDA_CHECK(cudaStreamSynchronize(stream));
