[Core] Refactor Dot product implementation in MathUtils to use std::inner_prod #13106 (base: master)
Conversation
…oduct for improved readability and performance
👍 What's up with the CPU time being much higher for |
🤷 |
Can I suggest a simple for loop? My guess is that it will be faster than any of the previous |
A priori it will be closer to the original one? |
Something like this?

```cpp
double temp = 0.0;
for (Vector::const_iterator i = rFirstVector.begin(), j = rSecondVector.begin();
     i != rFirstVector.end();
     ++i, ++j) {
    temp += *i * *j;
}
return temp;
```
|
I would suggest to avoid iterators:

```cpp
double temp = 0.0;
for (unsigned int i=0; i<rFirstVector.size(); ++i){
    temp += rFirstVector[i]+rSecondVector[i];
}
return temp;
```

The difference is potentially in the vectorization |
I guess the inner_prod would take SIMD as well. I will try your suggestion. |
@RiccardoRossi was right (take a screenshot) |
kratos/utilities/math_utils.h (Outdated)

```diff
-while(i != rFirstVector.end()) {
-    temp += *i++ * *j++;
+for (std::size_t i=0; i<rFirstVector.size(); ++i) {
+    temp += rFirstVector[i]+rSecondVector[i];
```
|
this does not compute a dot product.
I'm even more confused about these plots now. @loumalouomega can you please provide the raw output from the benchmark? |
(Properly implemented) contiguous iterators are optimized away by the compiler. I'd do the benchmarks and believe the results (after making sure that we're actually benchmarking what we want to measure and can make sense of the results). |
Now for some reason the CI is failing, so maybe better revert |
As I already wrote, you're not computing a dot product but summing up components. By the way the benchmark is probably optimized away completely, which is why we're seeing gibberish results. It says the average CPU time is around ~1 nanosecond, that is about 3 cycles on a 3 GHz CPU (=> no matter what magic your compiler or CPU does, that's way below the theoretical limit for a 3-component double precision floating point dot product). |
Take a look at the following benchmark (it computes a dot product with a C-style loop, C++11 iterators, and finally `std::inner_product`). As you play around with `ARRAY_SIZE`, the timings shift accordingly. What I think is going on here is that the compiler does small vector optimizations and can compute the result on tiny arrays (at least until `ARRAY_SIZE` grows past a certain size). My takeaway: double check your benchmark code and results and make sure they make sense (or at least that they're not complete nonsense).

```cpp
#include <benchmark/benchmark.h> // required for benchmark::State etc.; link against benchmark_main
#include <vector>
#include <numeric>

constexpr std::size_t ARRAY_SIZE = 64;

template <class T, std::size_t Dimensions>
void CLoop(benchmark::State& rState) {
    using Container = std::vector<T>;
    Container left(Dimensions), right(Dimensions);
    std::iota(left.begin(), left.end(), 0);
    std::iota(right.begin(), right.end(), Dimensions);
    T dummy = 0;
    for ([[maybe_unused]] auto _ : rState) {
        T out = static_cast<T>(0);
        for (typename Container::size_type i=0; i<left.size(); ++i) {
            out += left[i] * right[i];
        }
        //rState.PauseTiming();
        dummy += out;
        //rState.ResumeTiming();
    }
    benchmark::DoNotOptimize(dummy);
}

template <class T, std::size_t Dimensions>
void IteratorLoop(benchmark::State& rState) {
    using Container = std::vector<T>;
    Container left(Dimensions), right(Dimensions);
    std::iota(left.begin(), left.end(), 0);
    std::iota(right.begin(), right.end(), Dimensions);
    T dummy = 0;
    for ([[maybe_unused]] auto _ : rState) {
        T out = static_cast<T>(0);
        for (auto it_left=left.begin(), it_right=right.begin(); it_left != left.end(); ++it_left, ++it_right) {
            out += *it_left * *it_right;
        }
        //rState.PauseTiming();
        dummy += out;
        //rState.ResumeTiming();
    }
    benchmark::DoNotOptimize(dummy);
}

template <class T, std::size_t Dimensions>
void StandardInnerProduct(benchmark::State& rState) {
    using Container = std::vector<T>;
    Container left(Dimensions), right(Dimensions);
    std::iota(left.begin(), left.end(), 0);
    std::iota(right.begin(), right.end(), Dimensions);
    T dummy = 0;
    for ([[maybe_unused]] auto _ : rState) {
        T out = std::inner_product(left.begin(),
                                   left.end(),
                                   right.begin(),
                                   static_cast<T>(0));
        //rState.PauseTiming();
        dummy += out;
        //rState.ResumeTiming();
    }
    benchmark::DoNotOptimize(dummy);
}

//BENCHMARK_TEMPLATE(CLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(CLoop, double, ARRAY_SIZE);
//BENCHMARK_TEMPLATE(IteratorLoop, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(IteratorLoop, double, ARRAY_SIZE);
//BENCHMARK_TEMPLATE(StandardInnerProduct, int, ARRAY_SIZE);
BENCHMARK_TEMPLATE(StandardInnerProduct, double, ARRAY_SIZE);
```
|
Co-authored-by: Máté Kelemen <[email protected]>
With the new benchmark |
thx, makes sense now! |
📝 Description

This PR refactors the implementation of the Dot product for better performance and introduces a new benchmark for the `Dot` product function in the `MathUtils` class.

Refactor Dot Product Implementation:
- In `math_utils.h`, the old `Dot` product implementation is replaced with a more modern and efficient version that utilizes `std::inner_product` for improved readability and performance.
- `std::inner_product` is a standard C++ algorithm.

New Benchmark:
- `mathutils_benchmark.cpp` is added to the project. It contains a benchmark for testing the performance of the `Dot` product function using `benchmark::State`.
- The accumulation is initialized to `0.0`.

Related #13101
🆕 Changelog
- Replaced the `Dot` product implementation with `std::inner_product`, which is simpler, more efficient, and part of the standard library
- Added the `BM_MathUtilsDot` benchmark to measure the performance of the `Dot` product between two vectors