
[Core] Refactor Dot product implementation in MathUtils to use std::inner_prod #13106

Open · wants to merge 8 commits into master
Conversation

@loumalouomega (Member) commented Feb 7, 2025

📝 Description

[Figure_1: benchmark results, with a detail view for vector size 3]

This PR refactors the implementation of the Dot product for better performance and introduces a new benchmark for the Dot product function in the MathUtils class.

  1. Refactor Dot Product Implementation:

    • In math_utils.h, the old Dot product implementation is replaced with a version based on std::inner_product, a standard C++ algorithm, for improved readability and performance (a minimal sketch follows after this list).
  2. New Benchmark:

    • A new benchmark file mathutils_benchmark.cpp is added to the project. It contains a benchmark for the Dot product function based on benchmark::State.
    • The benchmark computes the dot product of two 3D vectors, each with a single non-zero component in a different position, and checks that the result is 0.0.
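
For reference, a minimal sketch of what the std::inner_product version could look like. The free-function form, template parameter, and argument names are assumptions for illustration; the actual declaration is a member of MathUtils in math_utils.h and may differ:

#include <numeric> // std::inner_product

// Hypothetical sketch, not the exact Kratos signature.
template <class TVectorType>
double Dot(const TVectorType& rFirstVector, const TVectorType& rSecondVector)
{
    // Accumulates sum_i a_i * b_i, starting from 0.0.
    return std::inner_product(rFirstVector.begin(), rFirstVector.end(),
                              rSecondVector.begin(), 0.0);
}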

Related #13101

🆕 Changelog

@loumalouomega added the Kratos Core, Performance and FastPR labels Feb 7, 2025
@loumalouomega requested a review from a team as a code owner February 7, 2025 21:41
@matekelemen (Contributor)

👍

What's up with the CPU time being much higher for std::inner_product than the old implementation?

@loumalouomega (Member, Author)

> What's up with the CPU time being much higher for std::inner_product than the old implementation?

🤷

@RiccardoRossi (Member)

Can I suggest a simple for loop?

My guess is that it will be faster than any of the previous ones.

@loumalouomega (Member, Author)

> Can I suggest a simple for loop?
>
> My guess is that it will be faster than any of the previous ones.

A priori it will be closer to the original one?

@loumalouomega (Member, Author)

> > Can I suggest a simple for loop?
> > My guess is that it will be faster than any of the previous ones.
>
> A priori it will be closer to the original one?

Something like this?

double temp = 0.0;
for (Vector::const_iterator i = rFirstVector.begin(), j = rSecondVector.begin();
     i != rFirstVector.end();
     ++i, ++j) {
    temp += *i * *j;
}
return temp;

@RiccardoRossi (Member)

I would suggest avoiding iterators:

double temp = 0.0;
for (unsigned int i=0; i<rFirstVector.size(); ++i){
    temp += rFirstVector[i]+rSecondVector[i];
}
return temp;

The difference is potentially in the vectorization.
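
(One way to check whether a loop actually gets vectorized is the compiler's own report, e.g. -fopt-info-vec on GCC or -Rpass=loop-vectorize on Clang.)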

@loumalouomega (Member, Author)

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

I guess the inner_prod version would use SIMD as well. I will try your suggestion.

@loumalouomega (Member, Author)

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

[Figure_1: benchmark results including the plain for loop]

@RiccardoRossi was right (take a screenshot).

@matekelemen (Contributor) commented on the diff:

  // removed:
  while(i != rFirstVector.end()) {
      temp += *i++ * *j++;
  // added:
  for (std::size_t i=0; i<rFirstVector.size(); ++i) {
      temp += rFirstVector[i]+rSecondVector[i];

this does not compute a dot product.
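
The culprit is the + between the components where a * is needed; a corrected version of the new loop would read:

double temp = 0.0;
for (std::size_t i=0; i<rFirstVector.size(); ++i) {
    temp += rFirstVector[i] * rSecondVector[i]; // multiply the components, don't add them
}
return temp;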

@matekelemen (Contributor)

> [Figure_1: benchmark results including the plain for loop]
>
> @RiccardoRossi was right (take a screenshot).

I'm even more confused about these plots now. @loumalouomega can you please provide the raw output from the benchmark?

@matekelemen (Contributor) commented Feb 10, 2025

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

(Properly implemented) contiguous iterators are optimized away by the compiler. I'd do the benchmarks and believe the results (after making sure that we're actually benchmarking what we want to measure and can make sense of the results).
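
This is easy to verify on something like Compiler Explorer: with optimizations enabled, mainstream compilers typically emit identical machine code for an index loop and an iterator loop over contiguous storage. A minimal sketch for such a comparison (illustration only, not code from the PR):

#include <cstddef>
#include <vector>

// Index-based dot product.
double DotIndex(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Iterator-based dot product; at -O2 this typically compiles to the same code as DotIndex.
double DotIter(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (auto i = a.begin(), j = b.begin(); i != a.end(); ++i, ++j)
        sum += *i * *j;
    return sum;
}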

@loumalouomega (Member, Author)

> I'm even more confused about these plots now. @loumalouomega can you please provide the raw output from the benchmark?

output_new.json
output_old.json
output_for.json

@loumalouomega (Member, Author)

Now for some reason the CI is failing, so maybe better to revert.

@matekelemen (Contributor)

> Now for some reason the CI is failing, so maybe better to revert.

As I already wrote, you're not computing a dot product but summing up components.

By the way, the benchmark is probably optimized away completely, which is why we're seeing gibberish results. It says the average CPU time is around 1 nanosecond, which is about 3 cycles on a 3 GHz CPU (no matter what magic your compiler or CPU does, that's way below the theoretical limit for a 3-component double-precision floating-point dot product).
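
(Back-of-the-envelope, assuming a 3 GHz clock: 1 ns is roughly 3 cycles, while a 3-component double dot product needs 6 loads plus a multiply and two adds, or two FMAs, whose dependency chain alone takes longer than 3 cycles on any current core.)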

This reverts commit 9d562dc.
This reverts commit 0372cda.
@matekelemen (Contributor) commented Feb 10, 2025

Take a look at the following benchmark (it computes a dot product with a C-style loop, with C++11 iterators, and finally with std::inner_product, on a compile-time vector size ARRAY_SIZE).

As you play around with ARRAY_SIZE between 3 and 48, you'll notice that the performance stays identical for all three benchmarks: 4x the no-op time. But somewhere between 48 and 64 it suddenly degrades, jumping almost 2 orders of magnitude.

What I think is going on here is that the compiler does small-vector optimizations and can compute the result on tiny arrays (at least up to 48 entries) at compile time and just hard-code the result. At some point between 48 and 64 entries, it probably decides that the vector belongs on the heap and thus cannot compute the dot product at compile time anymore.

My takeaway: double-check your benchmark code and results and make sure they make sense (or at least that they're not complete nonsense).

#include <benchmark/benchmark.h> // for benchmark::State and BENCHMARK_TEMPLATE; missing from the snippet as posted
#include <cstddef>
#include <vector>
#include <numeric>

constexpr std::size_t ARRAY_SIZE = 64;

template <class T, std::size_t Dimensions>
void CLoop(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = static_cast<T>(0);
    for (typename Container::size_type i=0; i<left.size(); ++i) {
      out += left[i] * right[i];
    }
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}


template <class T, std::size_t Dimensions>
void IteratorLoop(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = static_cast<T>(0);
    for (auto it_left=left.begin(), it_right=right.begin(); it_left != left.end(); ++it_left, ++it_right) {
      out += *it_left * *it_right;
    }
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}


template <class T, std::size_t Dimensions>
void StandardInnerProduct(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = std::inner_product(left.begin(),
                               left.end(),
                               right.begin(),
                               static_cast<T>(0));
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}

//BENCHMARK_TEMPLATE(CLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(CLoop, double, ARRAY_SIZE);

//BENCHMARK_TEMPLATE(IteratorLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(IteratorLoop, double, ARRAY_SIZE);

//BENCHMARK_TEMPLATE(StandardInnerProduct, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(StandardInnerProduct, double, ARRAY_SIZE);

BENCHMARK_MAIN(); // entry point, needed to run the snippet standalone
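
(With Google Benchmark installed, the snippet builds with something like g++ -O3 -std=c++17 bench.cpp -lbenchmark -pthread, where bench.cpp is an arbitrary file name.)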

@loumalouomega (Member, Author)

> Take a look at the following benchmark [...] My takeaway: double-check your benchmark code and results and make sure they make sense (or at least that they're not complete nonsense).

[Figure_1: results with the new benchmark]

@matekelemen (Contributor) commented Feb 11, 2025

Thanks, makes sense now!
