
Faster Call for CompiledSDFG #1467

Merged (28 commits), Jan 4, 2024

Conversation

philip-paul-mueller
Collaborator

This PR adds a new way to call a CompiledSDFG.

Previously, __call__() performed a number of operations on every invocation, which can be summarized as:

  • Ensuring that the arguments are ordered correctly.
  • Transforming them to the right types (ndarray to pointer).

Especially when benchmarking smaller SDFGs, this overhead can dominate the execution time.
Furthermore, the runtime depends heavily on the number of arguments.

To solve this, this PR introduces the _fast_call() function, which expects that its arguments are already in the right order and cast to the right types.
In addition, the PR does some refactoring and splits _construct_args() into multiple parts (a usage sketch follows below).
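To make the intended usage concrete, here is a minimal sketch of how a benchmark could use the fast-call path. The exact method names and signatures (`_construct_args()` returning a call tuple and an initializer tuple, and a fast-call method taking them plus a `do_gpu_check` flag) are assumptions based on the description above and the review discussion below, not a verbatim copy of the merged API.

```python
import numpy as np
import dace

# Toy program used only for illustration.
@dace.program
def axpy(A: dace.float64[1000], B: dace.float64[1000]):
    B += 2.0 * A

csdfg = axpy.compile()              # returns a CompiledSDFG
A = np.random.rand(1000)
B = np.random.rand(1000)

# Regular path: __call__ reorders the kwargs and casts the ndarrays to
# pointers on every single invocation.
csdfg(A=A, B=B)

# Fast path (names assumed): construct the argument tuples once, then
# reuse them so repeated calls skip the reordering/casting overhead.
callargs, initargs = csdfg._construct_args({'A': A, 'B': B})
for _ in range(1000):
    csdfg.fast_call(callargs, initargs, do_gpu_check=False)
```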

philip-paul-mueller and others added 8 commits November 21, 2023 14:22
Previously this generated an error because there was no `nan` object inside the `dace::math` namespace.
This commit adds a `nan` object to the namespace; the implementation is based on `typeless_pi`.
There were two bugs:
- On the CI `std::enable_if_t` was not available, so I changed it to `typename std::enable_if`.
- Also added operations of `typeless_nan` with itself (`typeless_pi` is missing them).
Previously, `__call__()` had to do some reordering (from the order given by `argnames` to `_sig`) and transform the `ndarray`s to pointers, which was quite expensive.
This commit introduces the `_fast_call()` method, which allows bypassing these operations.
It expects that the arguments (the ones for the call and the ones for the initialization) have been prepared outside; no further checks are done.
This can be used, for example, during benchmarking to avoid measuring the overhead of the previously mentioned transformations.
The function was rather big, so I split it into its different subtasks.
Furthermore, I refactored it a bit to reduce the number of operations, which gives a small speedup.
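For readers unfamiliar with the conversion the commits above refer to, here is a minimal, self-contained sketch of what turning an ndarray into a pointer involves (illustrative only; the real logic in compiled_sdfg.py handles scalars, symbols, and many more cases):

```python
import ctypes
import numpy as np

def to_pointer(arr: np.ndarray) -> ctypes.c_void_p:
    # numpy exposes the address of the underlying buffer; wrapping it in a
    # ctypes void pointer is what a C entry point ultimately receives.
    return ctypes.c_void_p(arr.ctypes.data)

A = np.zeros(10, dtype=np.float64)
print(hex(to_pointer(A).value))
```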
@philip-paul-mueller philip-paul-mueller marked this pull request as draft December 4, 2023 11:52
@philip-paul-mueller philip-paul-mueller marked this pull request as ready for review December 5, 2023 07:35
@philip-paul-mueller
Collaborator Author

@BenWeber42 Now ready for review.

@BenWeber42
Contributor

Hi, thanks for the PR!

Unfortunately, I'm not too familiar with these parts of the codebase, so it would take me quite some time to do a proper review. Let's see if we can find someone who could do the review more efficiently...

@BenWeber42 BenWeber42 removed their request for review December 11, 2023 12:56
@tbennun tbennun self-requested a review December 15, 2023 08:11
@tbennun
Collaborator

tbennun commented Dec 15, 2023

@philip-paul-mueller general comment before the review: please remove the unrelated (NaN) changes from the PR.

@philip-paul-mueller
Collaborator Author

@tbennun Sorry, those changes somehow got mixed in.

edopao added a commit to edopao/dace that referenced this pull request Dec 19, 2023
Collaborator

@tbennun tbennun left a comment

The idea of having an argument-check-free and/or pregenerated-argument fastcall is good. However, the PR may introduce different performance issues. The original code was written to be "fast" (using tuples, comprehensions, and few function calls), and the new fastcall should be just as fast.

Additionally, it sounds like this should be an external API for use by users (otherwise the PR does not make much sense), but the call is prefixed with an underscore. Moreover, some of the code (preparing outputs) should also not be in the context of the fastcall. See comments.

  if self.do_not_execute is False:
-     self._cfunc(self._libhandle, *argtuple)
+     self._cfunc(self._libhandle, *callargs)

  if self.has_gpu_code:
Collaborator

This section may also not belong in a fast call

Collaborator Author

I do not understand this. According to my understanding, this calls the actual compiled function. Thus without it nothing would be done, or am I missing something here?

Collaborator

Look at the line the comment points to: the GPU runtime check belongs in the normal call, not fast call. If you want to ensure execution is fast you can skip this check, which might be expensive.

Collaborator Author

@philip-paul-mueller philip-paul-mueller Dec 27, 2023

I decided to give fast_call() the option of performing the check; however, it is disabled by default. The main reason for that was removing code duplication.

Thanks for the explanation; so the comment always points to the "last line that is shown"?

dace/codegen/compiled_sdfg.py: 9 review comments (outdated, resolved)
@philip-paul-mueller
Collaborator Author

Thanks for your work and especially for correcting my typos.
I have addressed most of your suggestions, but for some I am unsure (see comment).

Regarding your comments about my splitting of _construct_args(), I would be fine with reverting this if you would prefer that.
When I did the split (and all further modifications), I constantly checked the runtime, and according to those measurements it is now faster.
One reason is that it spends less time on type checking, since the results are cached.
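A minimal illustration of the caching idea mentioned above (purely illustrative; the helper names below are made up and the actual caching in the PR lives in the CompiledSDFG's argument handling):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)
def _dtype_matches(dtype_name: str, expected: str) -> bool:
    # Stand-in for a more expensive compatibility check; the result for a
    # given (dtype, expected) pair is computed only once.
    return dtype_name == expected

def validate(args: dict, expected: dict) -> bool:
    return all(_dtype_matches(np.asarray(v).dtype.name, expected[k])
               for k, v in args.items())

expected = {'A': 'float64'}
args = {'A': np.zeros(4)}
print(validate(args, expected))   # performs the check
print(validate(args, expected))   # served from the cache
```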

Collaborator

@tbennun tbennun left a comment

Thank you for addressing the formatting changes. However, my major comments were still not addressed.

  if self.do_not_execute is False:
-     self._cfunc(self._libhandle, *argtuple)
+     self._cfunc(self._libhandle, *callargs)

  if self.has_gpu_code:
Collaborator

Look at the line the comment points to: the GPU runtime check belongs in the normal call, not fast call. If you want to ensure execution is fast you can skip this check, which might be expensive.

dace/codegen/compiled_sdfg.py: 2 review comments (outdated, resolved)
    for desc, arr in zip(self._retarray_shapes, self._return_arrays):
        kwargs[desc[0]] = arr

    # Argument construction
    arglist, argtypes, argnames = self._construct_args_arglist(kwargs)
Collaborator

Function calls can be expensive, performance-wise.
If you think they are not, please show some runtime results that say otherwise.
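As a side note, one simple way to quantify the cost of an extra call layer (not part of the PR; just an illustration of the concern raised above):

```python
import timeit

def direct(x):
    return x + 1

def helper(x):
    return x + 1

def indirect(x):
    return helper(x)

# Each extra Python call adds on the order of tens of nanoseconds per
# invocation; whether that matters depends on how hot the code path is.
print(timeit.timeit('direct(1)', globals=globals(), number=1_000_000))
print(timeit.timeit('indirect(1)', globals=globals(), number=1_000_000))
```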

Collaborator Author

As requested, I performed an experiment.
I used the attached function, which is based on the spmv tutorial (randomly selected; if I should test another function, please provide one).

I got the following results for the current master (09d37e9e):

min: 0.0001241610007127747s
max: 0.0007202319975476712s
mean: 0.0001336343956439426s
std: 3.586189058597028e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00012455 0.00012489 0.00012555 0.00012643 0.00012851 0.00014832 0.00037179]

For my previous state, denoted the append version (eb0198a5), I got the following results:

min: 0.00010327700147172436s
max: 0.0006421329999284353s
mean: 0.00011245609261338056s
std: 3.2908162087660314e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010416 0.00010479 0.0001056 0.00010633 0.00010738 0.00012325 0.00030589]

And for the current version (ff3ced35):

min: 0.00010886099698836915s
max: 0.0006989000030444004s
mean: 0.00011409736964318047s
std: 2.089365416625181e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010925 0.00010962 0.00011023 0.000111 0.00011243 0.00011744 0.00020909]

As you can see from these results, even the append version is faster than the current master.
If I remember correctly, the majority of these gains come from caching some type-checking results.
We also see that the new version is a little bit slower (in terms of min) than the append version, but the difference is very small, and its 99th percentile is much lower.
I think this should suffice to show that this version is not slower than the original version.
call_args.py.gz
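The attached script is not reproduced here, but the reported statistics can be computed along these lines (sketch only; the workload below is a placeholder for the actual SDFG call):

```python
import time
import numpy as np

def run_once():
    time.sleep(1e-4)            # placeholder for csdfg(...) / the fast-call path

samples = []
for _ in range(1000):
    t0 = time.perf_counter()
    run_once()
    samples.append(time.perf_counter() - t0)

samples = np.asarray(samples)
print(f"min: {samples.min()}s")
print(f"max: {samples.max()}s")
print(f"mean: {samples.mean()}s")
print(f"std: {samples.std()}")
print("q_{1, 5, 25, 50, 75, 95, 99}:", np.percentile(samples, [1, 5, 25, 50, 75, 95, 99]))
```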

Collaborator Author

Because you insisted so much on it, I decided to put everything back into one function (1a1fca4b).
However, I also ran the test again, with the following results:

min: 0.00010630900214891881s
max: 0.0006737619987688959s
mean: 0.00011147427259614536s
std: 2.1202032309282628e-05
q_{1, 5, 25, 50, 75, 95, 99}: [0.00010679 0.00010708 0.00010762 0.00010819 0.00010998 0.00011565 0.00020852]

As you can see, there is not much difference compared to the previous version (ff3ced35), where the _construct_args() function was split into smaller pieces.

dace/codegen/compiled_sdfg.py: 1 review comment (outdated, resolved)
@philip-paul-mueller
Collaborator Author

Hello Tal,
thanks for your patience with me; I have addressed all your comments.
Except for one, I have removed all .append() loops; the one I kept had the advantage of fusing two loops.
I also did the benchmarking you requested (see below), which shows that the new code is at least as fast as the original one and that reverting would hurt performance.

I wish you a happy new year.

Best, Philip

Collaborator

@tbennun tbennun left a comment

LGTM

@tbennun tbennun added this pull request to the merge queue Jan 4, 2024
Merged via the queue into spcl:master with commit bfe6923 Jan 4, 2024
11 checks passed