
[Core] Refactor Dot product implementation in MathUtils to use std::inner_prod #13106

Open · wants to merge 8 commits into master
Conversation

@loumalouomega (Member) commented Feb 7, 2025

📝 Description

[Figure_1: benchmark results, with a detail view for vector size 3]

This PR refactors the implementation of the Dot product for better performance and introduces a new benchmark for the Dot product function in the MathUtils class.

  1. Refactor Dot Product Implementation:

    • In math_utils.h, the old Dot product implementation is replaced with a version based on std::inner_product, a standard C++ algorithm, for improved readability and performance (a minimal sketch follows after this list).
  2. New Benchmark:

    • A new benchmark file mathutils_benchmark.cpp is added to the project. It contains a benchmark for the Dot product function based on benchmark::State.
    • The benchmark computes the dot product of two 3D vectors, each with a single non-zero component in a different position, and checks that the result is 0.0.
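
For reference, a minimal sketch of what the std::inner_product version could look like. The free-function form, template parameter, and argument names are assumptions for illustration; the actual declaration is a member of MathUtils in math_utils.h and may differ:

#include <numeric> // std::inner_product

// Hypothetical sketch, not the exact Kratos signature.
template <class TVectorType>
double Dot(const TVectorType& rFirstVector, const TVectorType& rSecondVector)
{
    // Accumulates sum_i a_i * b_i, starting from 0.0.
    return std::inner_product(rFirstVector.begin(), rFirstVector.end(),
                              rSecondVector.begin(), 0.0);
}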

Related #13101

🆕 Changelog

@loumalouomega added the Kratos Core, Performance and FastPR labels Feb 7, 2025
@loumalouomega requested a review from a team as a code owner February 7, 2025 21:41
@matekelemen (Contributor)

👍

What's up with the CPU time being much higher for std::inner_product than the old implementation?

@loumalouomega (Member, Author)

> What's up with the CPU time being much higher for std::inner_product than the old implementation?

🤷

@RiccardoRossi (Member)

Can I suggest a simple for loop?

My guess is that it will be faster than any of the previous ones.

@loumalouomega (Member, Author)

> Can I suggest a simple for loop?
>
> My guess is that it will be faster than any of the previous ones.

A priori it will be closer to the original one?

@loumalouomega (Member, Author)

> > Can I suggest a simple for loop?
> > My guess is that it will be faster than any of the previous ones.
>
> A priori it will be closer to the original one?

Something like this?

double temp = 0.0;
for (Vector::const_iterator i = rFirstVector.begin(), j = rSecondVector.begin();
     i != rFirstVector.end();
     ++i, ++j) {
    temp += *i * *j;
}
return temp;

@RiccardoRossi (Member)

I would suggest avoiding iterators:

double temp = 0.0;
for (unsigned int i=0; i<rFirstVector.size(); ++i){
    temp += rFirstVector[i]+rSecondVector[i];
}
return temp;

The difference is potentially in the vectorization.
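
(One way to check whether a loop actually gets vectorized is the compiler's own report, e.g. -fopt-info-vec on GCC or -Rpass=loop-vectorize on Clang.)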

@loumalouomega (Member, Author)

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

I guess the inner_prod version would use SIMD as well. I will try your suggestion.

@loumalouomega (Member, Author)

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

[Figure_1: benchmark results including the plain for loop]

@RiccardoRossi was right (take a screenshot).

@matekelemen (Contributor) commented on the diff:

  // removed:
  while(i != rFirstVector.end()) {
      temp += *i++ * *j++;
  // added:
  for (std::size_t i=0; i<rFirstVector.size(); ++i) {
      temp += rFirstVector[i]+rSecondVector[i];

this does not compute a dot product.
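
The culprit is the + between the components where a * is needed; a corrected version of the new loop would read:

double temp = 0.0;
for (std::size_t i=0; i<rFirstVector.size(); ++i) {
    temp += rFirstVector[i] * rSecondVector[i]; // multiply the components, don't add them
}
return temp;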

@matekelemen (Contributor)

> [Figure_1: benchmark results including the plain for loop]
>
> @RiccardoRossi was right (take a screenshot).

I'm even more confused about these plots now. @loumalouomega can you please provide the raw output from the benchmark?

@matekelemen (Contributor) commented Feb 10, 2025

> I would suggest avoiding iterators: [...]
>
> The difference is potentially in the vectorization.

(Properly implemented) contiguous iterators are optimized away by the compiler. I'd do the benchmarks and believe the results (after making sure that we're actually benchmarking what we want to measure and can make sense of the results).
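
This is easy to verify on something like Compiler Explorer: with optimizations enabled, mainstream compilers typically emit identical machine code for an index loop and an iterator loop over contiguous storage. A minimal sketch for such a comparison (illustration only, not code from the PR):

#include <cstddef>
#include <vector>

// Index-based dot product.
double DotIndex(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Iterator-based dot product; at -O2 this typically compiles to the same code as DotIndex.
double DotIter(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (auto i = a.begin(), j = b.begin(); i != a.end(); ++i, ++j)
        sum += *i * *j;
    return sum;
}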

@loumalouomega (Member, Author)

> I'm even more confused about these plots now. @loumalouomega can you please provide the raw output from the benchmark?

output_new.json
output_old.json
output_for.json

@loumalouomega (Member, Author)

Now for some reason the CI is failing, so maybe better to revert.

@matekelemen (Contributor)

> Now for some reason the CI is failing, so maybe better to revert.

As I already wrote, you're not computing a dot product but summing up components.

By the way, the benchmark is probably optimized away completely, which is why we're seeing gibberish results. It says the average CPU time is around 1 nanosecond, which is about 3 cycles on a 3 GHz CPU (no matter what magic your compiler or CPU does, that's way below the theoretical limit for a 3-component double-precision floating-point dot product).
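
(Back-of-the-envelope, assuming a 3 GHz clock: 1 ns is roughly 3 cycles, while a 3-component double dot product needs 6 loads plus a multiply and two adds, or two FMAs, whose dependency chain alone takes longer than 3 cycles on any current core.)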

This reverts commit 9d562dc.
This reverts commit 0372cda.
@matekelemen (Contributor) commented Feb 10, 2025

Take a look at the following benchmark (it computes a dot product with a C-style loop, with C++11 iterators, and finally with std::inner_product, on a compile-time vector size ARRAY_SIZE).

As you play around with ARRAY_SIZE between 3 and 48, you'll notice that the performance stays identical for all three benchmarks: 4x the no-op time. But somewhere between 48 and 64 it suddenly degrades, jumping almost 2 orders of magnitude.

What I think is going on here is that the compiler does small-vector optimizations and can compute the result on tiny arrays (at least up to 48 entries) at compile time and just hard-code the result. At some point between 48 and 64 entries, it probably decides that the vector belongs on the heap and thus cannot compute the dot product at compile time anymore.

My takeaway: double-check your benchmark code and results and make sure they make sense (or at least that they're not complete nonsense).

#include <benchmark/benchmark.h> // for benchmark::State and BENCHMARK_TEMPLATE; missing from the snippet as posted
#include <cstddef>
#include <vector>
#include <numeric>

constexpr std::size_t ARRAY_SIZE = 64;

template <class T, std::size_t Dimensions>
void CLoop(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = static_cast<T>(0);
    for (typename Container::size_type i=0; i<left.size(); ++i) {
      out += left[i] * right[i];
    }
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}


template <class T, std::size_t Dimensions>
void IteratorLoop(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = static_cast<T>(0);
    for (auto it_left=left.begin(), it_right=right.begin(); it_left != left.end(); ++it_left, ++it_right) {
      out += *it_left * *it_right;
    }
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}


template <class T, std::size_t Dimensions>
void StandardInnerProduct(benchmark::State& rState) {
  using Container = std::vector<T>;
  Container left(Dimensions), right(Dimensions);
  std::iota(left.begin(), left.end(), 0);
  std::iota(right.begin(), right.end(), Dimensions);

  T dummy = 0;
  for ([[maybe_unused]] auto _ : rState) {
    T out = std::inner_product(left.begin(),
                               left.end(),
                               right.begin(),
                               static_cast<T>(0));
    //rState.PauseTiming();
    dummy += out;
    //rState.ResumeTiming();
  }

  benchmark::DoNotOptimize(dummy);
}

//BENCHMARK_TEMPLATE(CLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(CLoop, double, ARRAY_SIZE);

//BENCHMARK_TEMPLATE(IteratorLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(IteratorLoop, double, ARRAY_SIZE);

//BENCHMARK_TEMPLATE(StandardInnerProduct, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(StandardInnerProduct, double, ARRAY_SIZE);

BENCHMARK_MAIN(); // entry point, needed to run the snippet standalone
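
(With Google Benchmark installed, the snippet builds with something like g++ -O3 -std=c++17 bench.cpp -lbenchmark -pthread, where bench.cpp is an arbitrary file name.)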

@loumalouomega (Member, Author)

> Take a look at the following benchmark [...] My takeaway: double-check your benchmark code and results and make sure they make sense (or at least that they're not complete nonsense).

[Figure_1: results with the new benchmark]

@matekelemen (Contributor) commented Feb 11, 2025

Thanks, makes sense now!
