Cache Dataset.xindexes #10074

Open · wants to merge 10 commits into main
Conversation

@benbovy (Member) commented Feb 24, 2025

Alternative to #10066, implementing #10066 (comment).

  • cache DataArray.xindexes
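The caching pattern at play here can be sketched roughly as follows. This is an illustrative sketch only, not xarray's actual implementation: the class and method names are assumptions, and a plain `dict` stands in for the real `Indexes` wrapper.

```python
# Illustrative sketch of the caching pattern in this PR, NOT xarray's
# actual code: the derived xindexes view is built lazily, cached, and
# reset to None whenever the underlying _indexes mapping changes.


class DatasetSketch:
    def __init__(self, indexes):
        self._indexes = dict(indexes)  # raw mapping, the source of truth
        self._xindexes = None          # lazily-built cached view

    @property
    def xindexes(self):
        # Build the (comparatively expensive) wrapper only once.
        if self._xindexes is None:
            self._xindexes = dict(self._indexes)  # stand-in for Indexes(...)
        return self._xindexes

    def _replace_indexes(self, new_indexes):
        # Every code path that mutates _indexes must also reset the
        # cache, otherwise xindexes silently goes stale.
        self._indexes = dict(new_indexes)
        self._xindexes = None
```

The crux of the design is that last comment: the speedup comes from the lazy build, but the correctness burden moves to remembering the reset in every mutating code path.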

@hmaarrfk could you try your benchmark #10066 (comment) with this branch, please?

@benbovy added the run-benchmark label Feb 24, 2025

@hmaarrfk (Contributor)
Funnily enough, this made things "worse" than #10066 in terms of raw performance. I'm not sure about the other axes you are looking at (I didn't look at the change request).

# This branch
mark@nuthatch $ (cd /home/mark/git/xarray/ && git describe --tags ) && python software_timestamp_benchmark.py
v2025.01.2-26-ge37c4f21
software_timestamp.nc
h5netcdf  lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 14413.86it/s]
h5netcdf  computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 16839.10it/s]
netcdf4   lazy      : 100%|██████████████████████████████| 100000/100000 [00:07<00:00, 14205.49it/s]
netcdf4   computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 16987.86it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 14617.49it/s]
h5netcdf  computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 17303.16it/s]
netcdf4   lazy      : 100%|██████████████████████████████| 100000/100000 [00:07<00:00, 14194.85it/s]
netcdf4   computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 16971.00it/s]

# 10066
(mcam_dev) ✔ /data/mark/benchmarks
mark@nuthatch $ (cd /home/mark/git/xarray/ && git describe --tags ) && python software_timestamp_benchmark.py
v2025.01.2-20-gfafd6b34
software_timestamp.nc
h5netcdf  lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 16102.34it/s]
h5netcdf  computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 19187.48it/s]
netcdf4   lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 15944.28it/s]
netcdf4   computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 19191.05it/s]
software_timestamp_float64.nc
h5netcdf  lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 16047.91it/s]
h5netcdf  computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 19592.05it/s]
netcdf4   lazy      : 100%|██████████████████████████████| 100000/100000 [00:06<00:00, 15758.92it/s]
netcdf4   computed  : 100%|██████████████████████████████| 100000/100000 [00:05<00:00, 19310.62it/s]

While it is not exactly the same dataset as before, here is the .nc file I was using:
xarray_indexing_benchmark.zip

I verified that it gives similar results.

Note that my benchmarks were run on a Threadripper 2950X (buying it was the worst decision I ever made); its single-core performance is pretty wimpy. I don't doubt that you get a 2x improvement on your laptop.

I hope this helps.

@benbovy (Member, Author) commented Feb 24, 2025

I could reproduce similar results, but I see why: I only cached Dataset.xindexes, while your benchmark uses DataArray.xindexes via DataArray.isel.

@hmaarrfk (Contributor) commented Feb 24, 2025

> Could reproduce similar results

Great!

> while your benchmark uses DataArray.xindexes via DataArray.isel.

Please take my benchmark as just a "single use case", but a real one from someone halfway smart trying to read through all the xarray docs.

The data is from a high-speed video, where we include some limited metadata on a per-frame basis. We often have calling functions "slice" the data that we want lower-level functions to analyze.

It may be that the loss of performance here is acceptable; that is for you all to decide.

@benbovy (Member, Author) commented Feb 24, 2025

Looks good now with the last commit:

# This branch
metadata.nc
h5netcdf  lazy      : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 61519.64it/s]
h5netcdf  computed  : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 78281.21it/s]
netcdf4   lazy      : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 59646.72it/s]
netcdf4   computed  : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 78860.76it/s]

# 10066
metadata.nc
h5netcdf  lazy      : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 59552.66it/s]
h5netcdf  computed  : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 75268.51it/s]
netcdf4   lazy      : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 58502.35it/s]
netcdf4   computed  : 100%|████████████████████████████████████████| 100000/100000 [00:01<00:00, 75858.55it/s]

@benbovy (Member, Author) commented Feb 24, 2025

> Please take my benchmark as just a "single use case", but a real one from someone halfway smart trying to read through all the xarray docs.

Yes, that's why I think it is worth caching the .xindexes property. It is also sometimes useful (or clearer, or easier) to use that property internally instead of ._indexes, so there is benefit beyond users' use cases.

@hmaarrfk (Contributor) commented Feb 24, 2025

I am somewhat shocked that your machine is slower than mine. May I ask which machine you are using?

Edit: I just can't read large numbers...

@benbovy (Member, Author) commented Feb 24, 2025

MacBook M4

@hmaarrfk (Contributor)

Oh, I misread "59k" as "5.9k"... that makes much more sense. Your machine is running at "59k" vs. "19k" on my Threadripper.

@benbovy benbovy marked this pull request as ready for review February 25, 2025 09:22
@benbovy (Member, Author) commented Feb 25, 2025

I was a bit confused and wondered if your comment was ironic :-). Yeah, I recently upgraded my machine and can feel the difference.

@benbovy (Member, Author) commented Feb 25, 2025

This PR is ready for review and is an improved solution over #10066. Hopefully I didn't miss any place where the Dataset._xindexes and DataArray._xindexes caches need to be reset (this should be covered by the existing tests).

@max-sixty (Collaborator)

Forgive me for dropping in without much context, and with something that could be read as negative, but:

if we want to add caching that requires xarray to do invalidation — i.e., requires remembering to set self._data._xindexes = None in multiple places — then I think we need a really robust test suite which catches it if we ever forget. Even if we're confident that this PR implements everything correctly, others will make changes to the code in the future. I would prioritize correctness of the core data structures far above additional features and most performance (and, very respectfully, I am not sure we've always made the correct tradeoffs there).

@benbovy (Member, Author) commented Feb 25, 2025

I agree with you @max-sixty and I think this PR needs careful review.

About prioritizing correctness over performance, I agree in general, but in this case I find @hmaarrfk's benchmark and comment #10074 (comment) relevant; here, performance is more than a nice-to-have. So caching might be worth it despite the risks.

To reduce the risk of breaking something in the future, we should definitely ensure that _indexes and the cached _xindexes are in sync in the tests' invariant checks. I can do that in this PR. I'm also happy to add something like assert self._xindexes == self._indexes as an extra safety guard.
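That kind of safety guard could look roughly like this. It is a self-contained sketch with assumed names (plain dicts stand in for xarray's real `Indexes` objects, which compare differently), showing how a forgotten invalidation fails loudly instead of silently:

```python
# Sketch of the extra safety guard proposed above (illustrative names;
# not xarray's actual code). The guard runs where the cached view is
# handed out, so a stale cache raises instead of returning bad data.


class GuardedDataset:
    def __init__(self, indexes):
        self._indexes = dict(indexes)   # source of truth
        self._xindexes = None           # cached derived view

    @property
    def xindexes(self):
        if self._xindexes is None:
            self._xindexes = dict(self._indexes)
        # Extra safety guard: the cache must always mirror _indexes.
        assert self._xindexes == self._indexes, "stale _xindexes cache"
        return self._xindexes


ds = GuardedDataset({"x": 1})
assert ds.xindexes == {"x": 1}

# Simulate a mutation that forgot to reset the cache:
ds._indexes["y"] = 2
try:
    ds.xindexes
    caught = False
except AssertionError:
    caught = True
assert caught  # the guard catches the forgotten invalidation
```

A test-suite invariant check could apply the same comparison after every mutating operation, which is one way to address the robustness concern raised above.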

@dcherian (Contributor)

> I'm also happy to add something like assert self._xindexes == self._indexes as an extra safety guard.

👍!

@hmaarrfk (Contributor)

> where performance is more than nice to have. So caching might be worth it despite the risks.

Just for clarity: at the time, I was hunting for 2x and 3x performance improvements. The small decrease I saw recently (1-5%) seems OK. Looking back at those merge requests, I think the big win was due to a timestamp conversion, and the rest of the performance improvements came from refactors I found during my hunt for the underlying bug.

I think the small performance decrease is definitely OK if the goal is "simplicity". I believe that caching can be a large barrier to contributors and is not warranted if the gains are small.

I would be honored if that metadata.nc made it into your CI benchmarks ;). I never had time to add it myself.

Labels: run-benchmark · topic-indexing