RFC for histogram CPU implementation #1930
Let me share my thoughts:
In my understanding, an RFC is not a book. So I would prefer a short, concise, and precise description of what is offered, without frills, like a mathematical theorem. For example:
"The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the oneAPI Specification for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends."
Yes, I've accepted your language here. Thanks.
Since this is not storytelling, I would suggest omitting introductory expressions like "It may make more sense" or "It's natural for a user to expect". Only short and exact information.
For example,
"There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification."
taken suggestion. Thanks
Taking into account the thought I shared above, I would propose rephrasing it to keep the main point shorter:
"A histogram algorithm is a memory-bound algorithm. So, the implementation should take care to reduce memory accesses and minimize temporary memory traffic."
Taken mostly. Thanks
I think we can omit this topic entirely. It says nothing about 'histogram'; it is just general wording which could be applied to any new feature in oneDPL.
I've removed some of the general language and added something which is important for histogram in an attempt to answer feedback from @akukanov to clarify where the implementation of the algorithm will live.
The second sentence may be omitted.
Based on the first sentence we can conclude that "OneDPL does not directly use any intrinsics..."
applied.
But we can calculate the bin indexes for the input data in SIMD manner.
After that we can process the result in a serial loop.
No?
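To make the suggestion concrete, here is a rough sketch of the two-step idea for the even-binned case only. It is not taken from the RFC or the PR: the function name `histogram_even_two_step`, the half-open range `[lo, hi)`, and the temporary index buffer are all assumptions made for illustration. Note that the index buffer adds memory traffic, which may cancel any gain from vectorizing the first loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a oneDPL API: even-binned histogram over [lo, hi).
// Values outside [lo, hi) are ignored.
std::vector<std::uint32_t> histogram_even_two_step(const std::vector<float>& input, std::size_t num_bins,
                                                   float lo, float hi)
{
    assert(num_bins > 0 && lo < hi);
    const float scale = static_cast<float>(num_bins) / (hi - lo);
    const std::int64_t last_bin = static_cast<std::int64_t>(num_bins) - 1;

    // Step 1: per-element bin index. Iterations are independent of each other,
    // which gives the compiler the chance to vectorize this loop.
    std::vector<std::int64_t> idx(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const float v = input[i];
        // The clamp guards against rounding up to num_bins for values just below hi.
        idx[i] = (v >= lo && v < hi) ? std::min(static_cast<std::int64_t>((v - lo) * scale), last_bin)
                                     : std::int64_t{-1};
    }

    // Step 2: serial accumulation, since several elements may hit the same bin.
    std::vector<std::uint32_t> bins(num_bins, 0);
    for (std::size_t i = 0; i < idx.size(); ++i)
        if (idx[i] >= 0)
            ++bins[static_cast<std::size_t>(idx[i])];
    return bins;
}
```

Whether the first loop is actually vectorized depends on the compiler and flags, so this would need to be measured rather than assumed.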
This is applicable only for the even binned case. Without using intrinsic operations, we must do this with omp simd and the `ordered` structured block. Initial investigation seemed to indicate that this was unsuccessful for generating vectorized code, and my suspicion is that it will not really help anyway. I can revisit this and attempt it, but the intention for now was to omit vectorizations from this first phase.
For now I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future, but starts without vectorization for this phase.
We can add an issue to explore using `simd ordered` to get some improvement for histogram even, and leave it out for this RFC and the initial PR implementation.
I see. OK, let's leave it as described.
https://github.com/oneapi-src/oneDPL/pull/1930/files#diff-fb5f6394ad0d350d719b9f31b139fa60c347ec64795c78e56875c4f002aeb0e7R25
We already have the `key requirements` topic where we enumerate all backends that we propose to support. I think that is good enough, and we can also omit this "Serial Backend" topic.
An explanation of what "Serial Backend" means, like what the other backends mean, is a kind of "oneDPL general description" and is not related to an RFC for the histogram feature, IMHO.
With some recent changes, there are some specifics about the serial implementation I wanted to add here, so I've kept the section.
If we intend to give some information about the oneDPL parallel backend patterns on which histogram can be based, I would note that there is no "count_if" pattern; there is a "reduce" ("transform_reduce") pattern.
Once one says "reduce", it becomes more or less obvious that a histogram calculation based on reduce is not effective at all.
I clarified the language a little here to make it clearer that count_if uses reduce internally. I still think it deserves some text describing it, as it may not be immediately obvious to everyone that reduce is not well matched.
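As an illustration of why a reduce-style pattern is a poor match, the sketch below builds a histogram two ways. The function names are hypothetical, and the input is assumed to already hold bin indices in `[0, num_bins)`. The reduce-style version runs one `std::count_if` per bin, so it reads the input `num_bins` times, while the direct version reads it once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; bin_of[i] is assumed to be the bin
// index of element i and to lie in [0, num_bins).

// Reduce-style: one count_if (a reduction) per bin, i.e. num_bins passes over the input.
std::vector<std::uint32_t> histogram_by_count_if(const std::vector<std::size_t>& bin_of, std::size_t num_bins)
{
    std::vector<std::uint32_t> bins(num_bins, 0);
    for (std::size_t b = 0; b < num_bins; ++b)
        bins[b] = static_cast<std::uint32_t>(
            std::count_if(bin_of.begin(), bin_of.end(), [b](std::size_t x) { return x == b; }));
    return bins;
}

// Direct: a single pass over the input, one increment per element.
std::vector<std::uint32_t> histogram_single_pass(const std::vector<std::size_t>& bin_of, std::size_t num_bins)
{
    std::vector<std::uint32_t> bins(num_bins, 0);
    for (std::size_t x : bin_of)
    {
        assert(x < num_bins);
        ++bins[x];
    }
    return bins;
}
```

Both return the same counts; for a memory-bound algorithm the difference is the number of times the input is read.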
BTW, out of curiosity: which approach does NVIDIA use?
NVIDIA has a similar API within CUB but not within Thrust, and therefore does not have a CPU implementation that I am aware of, only one specifically for a GPU device.
What does the term `Embarrassingly Parallel` mean?
Update: got it. https://en.wikipedia.org/wiki/Embarrassingly_parallel
Of course, if you are solving a concrete task, are allowed to use all the machine resources, and there are no other workloads on the node, the best way to calculate a histogram is to divide the work statically: each thread calculates a local histogram, and afterwards the local histograms are reduced into one.
But when talking about parallelism in a general library, we have to keep in mind that a final user's application can run in "different circumstances", depending on the application type, task, real-time data, other workloads on the same host, and many other things.
When we were developing the TBB backend we kept those things in mind and preferred to use the TBB `auto partitioner` (instead of static, for example). Composability reasons also make sense here.
BTW, have you considered a "common parallel reduce" pattern in general (and the `tbb::parallel_reduce` pattern in particular) for histogram calculation? It seems the parallel histogram calculation matches the common reduce (with a certain "big" grainsize): each `Body` calculates a local histogram (bins), and the `Combiner` sums all the local bins into the final ones.
Additionally, if the number of bins is "big", we can apply a second level of parallelism within the `Combiner` code: SIMD, or even "parallel_for" and SIMD if the number of bins is "too big".
Yes, although I think there is no reason here to do static division of work, but rather rely upon our existing `parallel_for` implementation to provide a good composable implementation.
Agree, which is why the intent is to use the existing `parallel_for` structure (including partitioners) to implement the parallelism. If we were to do it from scratch, we would do it in a similarly composable way, but better to rely upon existing infrastructure.
Yes, I thought about this. For TBB and even more for OpenMP, the built-in reduction functionality is geared toward very simple lightweight types as the reduction variable, where we may have an arbitrarily large array. Especially since we want a unified implementation, it does not seem like these backends are really set up to handle these large reduction variables. It seems we should take more control to ensure no unnecessary copies are made, and that the final combination is done performantly, based on knowledge we have of the task. The implementation remains quite simple and unified.
`bins[thread_id]`?
a) My understanding is that in TBB, while it may be technically possible to get the local thread id within an arena, it is an undocumented API and is generally discouraged and against the TBB mindset. Using TLS seems to be the preferred method specifically with TBB.
b) While what you suggest perhaps fits better within OpenMP, we want to create a single implementation and not require a `__parallel_histogram` within every current and future backend, but rather depend upon existing functionality within the backend as much as we can (in this case `__parallel_for`).
c) With smaller values of `n` and `num_bins` and a larger number of threads, not all threads should be used because of the overhead associated with allocation and initialization of each temporary bin copy. We can let the partitioner decide how many blocks to employ, but we want to avoid unnecessary allocation and initialization overheads wherever possible.
I will mention a downside for completeness, but it is outweighed here in my opinion:
It requires implementation of a thread local storage class for each backend. This is only non-trivial for OpenMP. It has been written generically to serve future patterns, though, so it is nice to have.
`num_bins`), and accumulating across the temporary histograms which were used from different threads. TBB does guarantee that each thread will always use its own TLS for each grain of work though, when retrieved through `local()`.
Let me try to explain with an example:
Using the general TBB pattern `tbb::parallel_for` does not involve using system threads directly. There is only a "Body", which is called by an executing thread with a part of the data (a TBB range). Imagine the input range is split into 4 parts. Two threads process 2 parts simultaneously. The Body stores local bin results in the TLS associated with those threads.
After that, to "finalize the work", TBB has to call the Body two more times to process the final 2 parts of the input range. These final two calls may be made by other threads, which have other associated TLSs. So, it is impossible to make the final reduction of the local bins located in the TLSs.
There are 2 parallel_for calls, each of which is "embarrassingly parallel" where no thread body depends on previous thread bodies. The first parallel_for must complete before the second one starts though. The first parallel_for uses the TLS as normal, and just accumulates sections of the input data into each thread's individual TLS.
The second parallel_for call does not use the TLS as normal, but rather has every thread visit a section of every TLS which was created one by one, processing a section of the histogram bins in parallel, combining the work of different threads from the first loop into the final global histogram.
The TLS we propose here (that is also implemented in the PR) supports this, and we obtain the correct result. We will not have perfect cache effects when accessing TLS from different threads than it was created upon but that is just something we have to deal with.
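A minimal sketch of the two loops described above, for readers who want to see the shape of it. This is not the code from the PR: `std::thread` with a static split stands in for the backend's `parallel_for`, a vector of per-worker histograms stands in for the TLS, and the name `histogram_two_phase` and the pre-computed bin indices in `bin_of` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch: bin_of[i] is assumed to be the bin index of element i, in [0, num_bins).
std::vector<std::uint32_t> histogram_two_phase(const std::vector<std::size_t>& bin_of, std::size_t num_bins,
                                               std::size_t num_workers)
{
    assert(num_workers > 0);
    // One temporary histogram per worker (stand-in for thread local storage).
    std::vector<std::vector<std::uint32_t>> local(num_workers, std::vector<std::uint32_t>(num_bins, 0));

    // Runs body(w) for w in [0, num_workers) on separate threads and waits for all of them.
    auto run = [num_workers](auto body) {
        std::vector<std::thread> threads;
        for (std::size_t w = 0; w < num_workers; ++w)
            threads.emplace_back(body, w);
        for (auto& t : threads)
            t.join();
    };

    // First loop: each worker accumulates its section of the input into its own copy.
    const std::size_t n = bin_of.size();
    run([&](std::size_t w) {
        const std::size_t first = n * w / num_workers;
        const std::size_t last = n * (w + 1) / num_workers;
        for (std::size_t i = first; i < last; ++i)
            ++local[w][bin_of[i]];
    });

    // Second loop (starts only after the first completes): each worker owns a
    // section of the bins and sums that section across every temporary copy.
    std::vector<std::uint32_t> global(num_bins, 0);
    run([&](std::size_t w) {
        const std::size_t first = num_bins * w / num_workers;
        const std::size_t last = num_bins * (w + 1) / num_workers;
        for (const auto& hist : local)
            for (std::size_t b = first; b < last; ++b)
                global[b] += hist[b];
    });
    return global;
}
```

In the second loop no two workers write the same bin, so no synchronization is needed beyond waiting for the first loop to finish.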
I don't understand your answer, Dan...
It seems you didn't catch my question/concern.
Let me try to explain my concern again:
`tbb::parallel_for` produces several calls of the `Body` which is passed to `tbb::parallel_for`. You don't know how many calls there are, because the TBB auto partitioner is applied by default.
Each call of this body may be made by a different thread. Moreover, the first calls of the body may be made by threads with "ids" 0-3, and the last calls may be made by other threads, with "ids" 4-7 for example. Each TLS is associated with its own thread, and you don't know the IDs of the threads. I don't understand how you can get the calculated local bins from all the TLSs.
We are basically using TBB's `enumerable_thread_specific` as a model here, and implementing a stripped down version for OpenMP and a trivial version for the serial backend.
`enumerable_thread_specific` has two ways of accessing the data:
a) `local()`, which gets the TLS for the current thread,
b) `begin()` and `end()`, which provide iterators to the sequence of all local storage from all threads.
This allows us to use (a) in the first parallel loop and (b) in the second parallel loop. The second parallel loop does not use the `enumerable_thread_specific` as a "Thread Local Storage" but rather as a 2-D array space which it iterates over, summing across columns (corresponding to individual histogram bins from different threads). This allows us to accumulate the data from all threads into the global space histogram copy no matter which threads are used and when.
I'm not sure how else I can explain it. The code in the implementation is tested, working, and pretty concise; if you want to see the details you can look at the PR.
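To illustrate the two access paths, here is a simplified stand-in that mimics only the `local()` / `begin()` / `end()` part of the `tbb::enumerable_thread_specific` interface. It is neither TBB's implementation nor the class from the PR; the names `enumerable_histograms` and `combine_bin`, and the mutex-guarded map keyed by thread id, are assumptions chosen for brevity rather than performance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified container of per-thread histograms.
class enumerable_histograms
{
  public:
    using histogram = std::vector<std::uint32_t>;

    explicit enumerable_histograms(std::size_t num_bins) : num_bins_(num_bins) {}

    // (a) Storage of the calling thread, created zero-initialized on first use.
    histogram& local()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(std::this_thread::get_id());
        if (it == index_.end())
        {
            storage_.push_back(histogram(num_bins_, 0));
            it = index_.emplace(std::this_thread::get_id(), &storage_.back()).first;
        }
        return *it->second;
    }

    // (b) Iteration over every copy created so far, regardless of the owning thread.
    std::list<histogram>::const_iterator begin() const { return storage_.begin(); }
    std::list<histogram>::const_iterator end() const { return storage_.end(); }

  private:
    std::size_t num_bins_;
    std::mutex mutex_;
    std::list<histogram> storage_; // std::list keeps element addresses stable on growth
    std::map<std::thread::id, histogram*> index_;
};

// Sums one column (a single bin) across all per-thread copies, as the second
// loop would do for each bin in its section.
std::uint32_t combine_bin(const enumerable_histograms& tls, std::size_t bin)
{
    std::uint32_t total = 0;
    for (const auto& hist : tls)
        total += hist[bin];
    return total;
}
```

The point is that the second loop never needs thread ids: it only walks whatever copies exist, which is why the result does not depend on which threads ran which chunks.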
You mean "TBB TLS", not system TLS...
I clarified that question with Alexey.
TBB TLS is a kind of container that allows iterating over all the local bins. I was not aware of that.
Now I get it.