
Faster Call for CompiledSDFG #1467

Merged (28 commits), Jan 4, 2024

Conversation

philip-paul-mueller
Collaborator

This PR adds a new way to call a CompiledSDFG.

Previously, __call__() performed a number of operations on every invocation, which can be summarized as:

  • Ensuring that the arguments are ordered correctly.
  • Transforming them to the right types (ndarray to pointer).

Especially when benchmarking smaller SDFGs, this overhead can dominate the execution time.
Furthermore, the runtime depends heavily on the number of arguments.

To solve this, this PR introduces the _fast_call() function, which expects that its arguments are already in the right order and cast to the right types.
In addition, the PR does some refactoring and splits _construct_args() into multiple parts (a usage sketch follows below).
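To make the intended usage concrete, here is a minimal sketch of how a benchmark could use the fast-call path. The exact method names and signatures (`_construct_args()` returning a call tuple and an initializer tuple, and a fast-call method taking them plus a `do_gpu_check` flag) are assumptions based on the description above and the review discussion below, not a verbatim copy of the merged API.

```python
import numpy as np
import dace

# Toy program used only for illustration.
@dace.program
def axpy(A: dace.float64[1000], B: dace.float64[1000]):
    B += 2.0 * A

csdfg = axpy.compile()              # returns a CompiledSDFG
A = np.random.rand(1000)
B = np.random.rand(1000)

# Regular path: __call__ reorders the kwargs and casts the ndarrays to
# pointers on every single invocation.
csdfg(A=A, B=B)

# Fast path (names assumed): construct the argument tuples once, then
# reuse them so repeated calls skip the reordering/casting overhead.
callargs, initargs = csdfg._construct_args({'A': A, 'B': B})
for _ in range(1000):
    csdfg.fast_call(callargs, initargs, do_gpu_check=False)
```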

philip-paul-mueller and others added 8 commits November 21, 2023 14:22
Previously this generated an error because there was no `nan` object inside the `dace::math` namespace.
This commit adds a `nan` object to the namespace; the implementation is based on `typeless_pi`.
There were two bugs:
- On the CI `std::enable_if_t` was not available, so I changed it to `typename std::enable_if`.
- Also added operations of `typeless_nan` with itself (`typeless_pi` is missing them).
Previously, `__call__()` had to do some reordering (from the order given by `argnames` to `_sig`) and transform the `ndarray`s to pointers, which was quite expensive.
This commit introduces the `_fast_call()` method, which allows bypassing these operations.
It expects that the arguments (the ones for the call and the ones for the initialization) have been prepared outside; no further checks are done.
This can be used, for example, during benchmarking to avoid measuring the overhead of the previously mentioned transformations.
The function was rather big, so I split it into its different subtasks.
Furthermore, I refactored it a bit to reduce the number of operations, which gives a small speedup.
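For readers unfamiliar with the conversion the commits above refer to, here is a minimal, self-contained sketch of what turning an ndarray into a pointer involves (illustrative only; the real logic in compiled_sdfg.py handles scalars, symbols, and many more cases):

```python
import ctypes
import numpy as np

def to_pointer(arr: np.ndarray) -> ctypes.c_void_p:
    # numpy exposes the address of the underlying buffer; wrapping it in a
    # ctypes void pointer is what a C entry point ultimately receives.
    return ctypes.c_void_p(arr.ctypes.data)

A = np.zeros(10, dtype=np.float64)
print(hex(to_pointer(A).value))
```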
@philip-paul-mueller philip-paul-mueller marked this pull request as draft December 4, 2023 11:52
@philip-paul-mueller philip-paul-mueller marked this pull request as ready for review December 5, 2023 07:35
@philip-paul-mueller
Collaborator Author

@BenWeber42 Now ready for review.

@BenWeber42
Contributor

Hi, thanks for the PR!

Unfortunately, I'm not too familiar with these parts of the codebase, so it would take me quite some time to do a proper review. Let's see if we can find someone who could do the review more efficiently...

@BenWeber42 BenWeber42 removed their request for review December 11, 2023 12:56
@tbennun tbennun self-requested a review December 15, 2023 08:11
@tbennun
Collaborator

tbennun commented Dec 15, 2023

@philip-paul-mueller general comment before the review: please remove the unrelated (NaN) changes from the PR.

@philip-paul-mueller
Collaborator Author

@tbennun Sorry, those changes somehow got mixed in.

edopao added a commit to edopao/dace that referenced this pull request Dec 19, 2023
Collaborator

@tbennun tbennun left a comment

The idea of having an argument-check-free and/or pregenerated-argument fastcall is good. However, the PR may introduce different performance issues. The original code was written to be "fast" (using tuples, comprehensions, and few function calls), and the new fastcall should be just as fast.

Additionally, it sounds like this should be an external API for use by users (otherwise the PR does not make much sense), but the call is prefixed with an underscore. Moreover, some of the code (preparing outputs) should also not be in the context of the fastcall. See comments.

  if self.do_not_execute is False:
-     self._cfunc(self._libhandle, *argtuple)
+     self._cfunc(self._libhandle, *callargs)

  if self.has_gpu_code:
Collaborator

This section may also not belong in a fast call

Collaborator Author

I do not understand this. According to my understanding, this calls the actual compiled function. Thus without it nothing would be done, or am I missing something here?

Collaborator

Look at the line the comment points to: the GPU runtime check belongs in the normal call, not fast call. If you want to ensure execution is fast you can skip this check, which might be expensive.

Collaborator Author

@philip-paul-mueller philip-paul-mueller Dec 27, 2023

I decided to give fast_call() the option of performing the check; however, it is disabled by default. The main reason for that was removing code duplication.

Thanks for the explanation; so the comment always points to the "last line that is shown"?

dace/codegen/compiled_sdfg.py: 9 review comments (outdated, resolved)
@philip-paul-mueller
Collaborator Author

Thanks for your work and especially for correcting my typos.
I have addressed most of your suggestions, but for some I am unsure (see comment).

Regarding your comments about my splitting of _construct_args(), I would be fine with reverting this if you would prefer that.
When I did the split (and all further modifications), I constantly checked the runtime, and according to those measurements it is now faster.
One reason is that it spends less time on type checking, since the results are cached.
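A minimal illustration of the caching idea mentioned above (purely illustrative; the helper names below are made up and the actual caching in the PR lives in the CompiledSDFG's argument handling):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def _dtype_matches(dtype_name: str, expected: str) -> bool:
    # Stand-in for a more expensive compatibility check; the result for a
    # given (dtype, expected) pair is computed only once.
    return dtype_name == expected

def validate(args: dict, expected: dict) -> bool:
    return all(_dtype_matches(np.asarray(v).dtype.name, expected[k])
               for k, v in args.items())

expected = {'A': 'float64'}
args = {'A': np.zeros(4)}
print(validate(args, expected))   # performs the check
print(validate(args, expected))   # served from the cache
```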

Collaborator

@tbennun tbennun left a comment

Thank you for addressing the formatting changes. However, my major comments were still not addressed.

  if self.do_not_execute is False:
-     self._cfunc(self._libhandle, *argtuple)
+     self._cfunc(self._libhandle, *callargs)

  if self.has_gpu_code:
Collaborator

Look at the line the comment points to: the GPU runtime check belongs in the normal call, not fast call. If you want to ensure execution is fast you can skip this check, which might be expensive.

dace/codegen/compiled_sdfg.py: 2 review comments (outdated, resolved)
    for desc, arr in zip(self._retarray_shapes, self._return_arrays):
        kwargs[desc[0]] = arr

    # Argument construction
    arglist, argtypes, argnames = self._construct_args_arglist(kwargs)
Collaborator

Function calls can be expensive, performance-wise.
If you think they are not, please show some runtime results that say otherwise.
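As a side note, one simple way to quantify the cost of an extra call layer (not part of the PR; just an illustration of the concern raised above):

```python
import timeit

def direct(x):
    return x + 1

def helper(x):
    return x + 1

def indirect(x):
    return helper(x)

# Each extra Python call adds on the order of tens of nanoseconds per
# invocation; whether that matters depends on how hot the code path is.
print(timeit.timeit('direct(1)', globals=globals(), number=1_000_000))
print(timeit.timeit('indirect(1)', globals=globals(), number=1_000_000))
```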

Collaborator Author

As requested, I performed an experiment.
I used the attached function, which is based on the spmv tutorial (randomly selected; if I should test another function, please provide one).

I got the following results for the current master (09d37e9e):

min: 0.0001241610007127747s
max: 0.0007202319975476712s
mean: 0.0001336343956439426s
std: 3.586189058597028e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00012455 0.00012489 0.00012555 0.00012643 0.00012851 0.00014832 0.00037179]

For my previous state, denoted the append version (eb0198a5), I got the following results:

min: 0.00010327700147172436s
max: 0.0006421329999284353s
mean: 0.00011245609261338056s
std: 3.2908162087660314e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010416 0.00010479 0.0001056 0.00010633 0.00010738 0.00012325 0.00030589]

And for the current version (ff3ced35):

min: 0.00010886099698836915s
max: 0.0006989000030444004s
mean: 0.00011409736964318047s
std: 2.089365416625181e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010925 0.00010962 0.00011023 0.000111 0.00011243 0.00011744 0.00020909]

As you can see from these results, even the append version is faster than the current master.
If I remember correctly, the majority of these gains come from caching some type-checking results.
We also see that the new version is a little bit slower (in terms of min) than the append version, but the difference is very small, and its 99th percentile is much lower.
I think this should suffice to show that this version is not slower than the original version.
call_args.py.gz
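The attached script is not reproduced here, but the reported statistics can be computed along these lines (sketch only; the workload below is a placeholder for the actual SDFG call):

```python
import time
import numpy as np

def run_once():
    time.sleep(1e-4)            # placeholder for csdfg(...) / the fast-call path

samples = []
for _ in range(1000):
    t0 = time.perf_counter()
    run_once()
    samples.append(time.perf_counter() - t0)

samples = np.asarray(samples)
print(f"min: {samples.min()}s")
print(f"max: {samples.max()}s")
print(f"mean: {samples.mean()}s")
print(f"std: {samples.std()}")
print("q_{1, 5, 25, 50, 75, 95, 99}:", np.percentile(samples, [1, 5, 25, 50, 75, 95, 99]))
```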

Collaborator Author

Because you insisted so much on it, I decided to put everything back into one function (1a1fca4b).
However, I also ran the test again, with the following results:

min: 0.00010630900214891881s
max: 0.0006737619987688959s
mean: 0.00011147427259614536s
std: 2.1202032309282628e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010679 0.00010708 0.00010762 0.00010819 0.00010998 0.00011565 0.00020852]

As you can see, there is not much difference compared to the previous version (ff3ced35), where the _construct_args() function was split into smaller pieces.

dace/codegen/compiled_sdfg.py: 1 review comment (outdated, resolved)
@philip-paul-mueller
Collaborator Author

Hello Tal,
thanks for your patience with me; I have addressed all your comments.
Except for one, I have removed all .append() loops; the one I kept had the advantage of fusing two loops.
I also did the benchmarking you requested (see below), which shows that the new code is at least as fast as the original one and that reverting would hurt performance.

I wish you a happy new year.

Best, Philip

Collaborator

@tbennun tbennun left a comment

LGTM

@tbennun tbennun added this pull request to the merge queue Jan 4, 2024
Merged via the queue into spcl:master with commit bfe6923 Jan 4, 2024
11 checks passed