Faster Call for CompiledSDFG
#1467
Conversation
Before, this was generating an error because there was no `nan` object inside the `dace::math` namespace. This commit adds a `nan` object to the namespace; the implementation is based on `typeless_pi`.
There were two bugs:
- On the CI, `std::enable_if_t` was not available, so I changed it to `typename std::enable_if<...>::type` (the same alias is also not used for `pi`).
- Operations combining `typeless_nan` with itself were missing, so I added them (`typeless_pi` is still missing them).
Previously, `__call__()` had to reorder the arguments (from the order given by `argnames` to the order of `_sig`) and transform the `ndarray`s to pointers, which was quite expensive. This commit introduces the `_fast_call()` method, which allows bypassing these operations. It expects that the arguments (both the call arguments and the initialization arguments) were prepared outside, and no further checks are done. This can be used, for example, during benchmarking to avoid measuring the overhead of the transformations mentioned above.
The function was rather big, so I split it into its different subtasks. Furthermore, I refactored it a bit to reduce the number of operations, which gives a small performance increase.
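A minimal sketch of the intended prepare-once, call-many pattern. The exact API here is an assumption based on this discussion: the method name `fast_call()` (the underscore was dropped later in review), the `do_gpu_check` flag, and `_construct_args()` returning the call and initialization tuples may differ from the final code.

```python
import time
import numpy as np
import dace

@dace.program
def axpy(A: dace.float64[1000], B: dace.float64[1000]):
    B[:] = 2.0 * A + B

csdfg = axpy.to_sdfg().compile()
A = np.random.rand(1000)
B = np.random.rand(1000)

# Reorder and convert the arguments once, outside the timed region
# (assumed return value: the call tuple and the initialization tuple).
callargs, initargs = csdfg._construct_args({'A': A, 'B': B})

# Benchmark only the raw invocation of the compiled function.
start = time.perf_counter()
for _ in range(1000):
    csdfg.fast_call(callargs, initargs, do_gpu_check=False)  # assumed signature
print(f"mean per call: {(time.perf_counter() - start) / 1000:.3e} s")
```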
@BenWeber42 Now ready for review.
Hi, thanks for the PR! Unfortunately, I'm not too familiar with these parts of the codebase, so it would take me quite some time to do a proper review. Let's see if we can find someone who could do the review more efficiently...
@philip-paul-mueller general comment before the review: please remove the unrelated (NaN) changes from the PR.
@tbennun Sorry, that somehow got mixed up.
The idea for having an argument-check-free and/or pregenerated argument fastcall is good. However, the PR may introduce different performance issues. The original code was written to be "fast" (using tuples, comprehensions, and fewer function calls in the original code), and the new fastcall should be just as fast.
Additionally, it sounds like this should be an external API for use by users (otherwise the PR does not make much sense), but the call is prefixed with an underscore. Moreover, some of the code (preparing outputs) should also not be in the context of the fastcall. See comments.
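A toy, self-contained illustration (not DaCe code) of this concern: building the argument tuple with a single comprehension versus appending in a loop, as in the "append version" discussed later in this thread. All names here are made up for the example.

```python
import timeit

values = list(range(32))  # stand-in for the SDFG arguments

def _convert(v):
    # stand-in for per-argument conversion work
    return v * 2

def comprehension_style():
    # one pass over the arguments, built directly into a tuple
    return tuple(_convert(v) for v in values)

def append_style():
    # same work, but with an explicit loop and list.append per argument
    out = []
    for v in values:
        out.append(_convert(v))
    return tuple(out)

print("comprehension:", timeit.timeit(comprehension_style, number=100_000))
print("append+loop  :", timeit.timeit(append_style, number=100_000))
```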
dace/codegen/compiled_sdfg.py (outdated)
     if self.do_not_execute is False:
-        self._cfunc(self._libhandle, *argtuple)
+        self._cfunc(self._libhandle, *callargs)

     if self.has_gpu_code:
This section may also not belong in a fast call
I do not understand this. According to my understanding, this calls the actual compiled function. Thus, without it nothing would be done, or am I missing something here?
Look at the line the comment points to: the GPU runtime check belongs in the normal call, not the fast call. If you want to ensure execution is fast, you can skip this check, which might be expensive.
I decided to give fast_call() the possibility of performing the check; however, it is disabled by default. The main reason for that was to avoid code duplication.
Thanks for the explanation, so the comment always points to the "last line that is shown"?
Co-authored-by: Tal Ben-Nun <[email protected]>
Thanks for your work, and especially for correcting my typos. Regarding your comments about my splitting of …
Thank you for addressing the formatting changes. However, my major comments were still not addressed.
dace/codegen/compiled_sdfg.py (outdated)
     for desc, arr in zip(self._retarray_shapes, self._return_arrays):
         kwargs[desc[0]] = arr

     # Argument construction
     arglist, argtypes, argnames = self._construct_args_arglist(kwargs)
Function calls can be expensive, performance-wise.
If you think they are not, please show some runtime results that say otherwise.
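As a rough, generic measurement of this point (not DaCe code): the same dictionary-processing work done in one monolithic function versus routed through two helper functions, showing the extra per-call overhead of additional Python function calls.

```python
import timeit

def monolithic(kwargs):
    names = tuple(kwargs)
    values = tuple(kwargs.values())
    return names, values

def _split_names(kwargs):
    return tuple(kwargs)

def _split_values(kwargs):
    return tuple(kwargs.values())

def split(kwargs):
    # identical work, but two extra function calls per invocation
    return _split_names(kwargs), _split_values(kwargs)

kwargs = {f"arg{i}": i for i in range(16)}
print("monolithic:", timeit.timeit(lambda: monolithic(kwargs), number=500_000))
print("split     :", timeit.timeit(lambda: split(kwargs), number=500_000))
```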
As requested, I performed some experiments. I used the attached function, which is based on the spmv tutorial (randomly selected; if I should test another function, please provide one). I got the following results for the current master (09d37e9e):
min: 0.0001241610007127747s
max: 0.0007202319975476712s
mean: 0.0001336343956439426s
std: 3.586189058597028e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00012455 0.00012489 0.00012555 0.00012643 0.00012851 0.00014832 0.00037179]
For my previous state, denoted the append version (eb0198a5), I got the following results:
min: 0.00010327700147172436s
max: 0.0006421329999284353s
mean: 0.00011245609261338056s
std: 3.2908162087660314e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010416 0.00010479 0.0001056 0.00010633 0.00010738 0.00012325 0.00030589]
And for the current version (ff3ced35):
min: 0.00010886099698836915s
max: 0.0006989000030444004s
mean: 0.00011409736964318047s
std: 2.089365416625181e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010925 0.00010962 0.00011023 0.000111 0.00011243 0.00011744 0.00020909]
As you can see from these results, even the append version is faster than the current master. As far as I remember, the majority of these gains come from caching some type-checking results. We also see that the new version is a little bit slower (in terms of min) than the append version, but the differences are very small, and its 99th percentile is much lower. I think this suffices to show that this version is not slower than the original one.
call_args.py.gz
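The attached script is not reproduced here; the following is a hypothetical sketch of how such statistics can be gathered, producing the same kind of summary as the numbers above. The `summarize()` helper and its defaults are made up for illustration.

```python
import timeit
import numpy as np

def summarize(fn, repeat=1000):
    # Time fn() `repeat` times (one call per sample) and print summary statistics.
    samples = np.array(timeit.repeat(fn, repeat=repeat, number=1))
    print(f"min: {samples.min()}s")
    print(f"max: {samples.max()}s")
    print(f"mean: {samples.mean()}s")
    print(f"std: {samples.std()}")
    print("q_{1, 5, 25, 50, 75, 95, 99}:",
          np.quantile(samples, [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99]))

# Usage (hypothetical): summarize(lambda: csdfg(**kwargs)) for the normal call,
# or summarize(lambda: csdfg.fast_call(callargs, initargs)) for the fast path.
```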
Because you insisted so much on it, I decided to put everything back into one function (1a1fca4b). However, I also ran the test again, with the following results:
min: 0.00010630900214891881s
max: 0.0006737619987688959s
mean: 0.00011147427259614536s
std: 2.1202032309282628e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010679 0.00010708 0.00010762 0.00010819 0.00010998 0.00011565 0.00020852]
As you can see, there is not much difference compared to the previous version (ff3ced35), where the _construct_args() function was separated into smaller pieces.
Hello Tal, I wish you a happy new year. Best, Philip
LGTM
This PR adds a new way in which a CompiledSDFG can be called.
Before, __call__() performed a lot of operations, which can be summarized as reordering the arguments and converting them to the expected types (e.g. ndarray to pointers). Especially when benchmarking smaller SDFGs, this is a lot of overhead, which actually dominates the execution. Furthermore, the runtime heavily depends on the number of arguments.
To solve this, this PR introduces the _fast_call() function, which expects that its arguments are already in the right order and cast to the right type. In addition, the PR does some refactoring and splits _construct_args() into multiple parts.
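As a rough illustration of the per-call work that __call__() performs and that the fast path avoids, here is a toy (non-DaCe) sketch of reordering keyword arguments into the compiled signature's order and converting ndarrays to raw pointers. The signature order and argument names are hypothetical.

```python
import ctypes
import numpy as np

# Hypothetical signature order (the role played by `_sig`); names are made up.
signature_order = ("alpha", "x", "y")

def construct_call_args(**kwargs):
    callargs = []
    for name in signature_order:
        value = kwargs[name]  # reorder: keyword arguments -> signature order
        if isinstance(value, np.ndarray):
            # convert ndarray to a raw pointer, as needed by the C interface
            callargs.append(ctypes.c_void_p(value.ctypes.data))
        else:
            callargs.append(value)
    return tuple(callargs)

# Doing this once, outside the measured region, means a timed loop
# only pays for the call itself.
prepared = construct_call_args(alpha=2.0, x=np.zeros(10), y=np.zeros(10))
print(prepared)
```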