PERF-#6710: Don't materialize index in _groupby_shuffle internal function #6707

Merged (2 commits) on Nov 7, 2023

@anmyachev (Collaborator) commented on Nov 4, 2023

What do these changes do?

Length materialization is in most cases cheaper than index materialization.
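
To illustrate the claim, here is a minimal sketch, assuming a frame whose rows are split across partitions. The names `partition_labels`, `total_length`, and `full_index` are hypothetical and are not Modin's API. Computing the length needs only one integer per partition, while building the index needs every row label from every partition.

```python
# Hypothetical stand-in for a row-partitioned frame: each partition
# holds its own slice of row labels (not Modin's real data structure).
partition_labels = [
    ["r0", "r1", "r2"],
    ["r3", "r4"],
    ["r5", "r6", "r7", "r8"],
]


def total_length(parts):
    """Return the row count; only one integer per partition is needed."""
    return sum(len(labels) for labels in parts)


def full_index(parts):
    """Return the full index; every label of every partition is collected."""
    index = []
    for labels in parts:
        index.extend(labels)
    return index


n_rows = total_length(partition_labels)
index = full_index(partition_labels)
```

When partitions live in remote workers, the first function would move a handful of integers, whereas the second would move all labels, which is likely where the cost difference comes from.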

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Don't materialize index in _groupby_shuffle internal function #6710
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev changed the title from "PERF-#0000: Don't materialize index in _groupby_shuffle internal function" to "PERF-#6710: Don't materialize index in _groupby_shuffle internal function" on Nov 5, 2023
@anmyachev marked this pull request as ready for review on November 5, 2023 16:32
@dchigarev (Collaborator) commented, quoting the description:

"Length materialization is in most cases cheaper than index materialization."

It depends on which cache is available right now. If we want this kind of optimization, I would propose adding a __len__() method to ModinDataframe that computes the length from either the index cache or the row lengths cache, depending on which one is already materialized.
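
The proposal can be sketched as follows. This is a minimal illustration of the idea, not the code merged in this PR: the class `FrameSketch` and its attributes `_partitions`, `_index_cache`, and `_row_lengths_cache` are hypothetical names chosen for the example.

```python
from typing import List, Optional, Sequence


class FrameSketch:
    """Toy frame with two optional caches (hypothetical, not Modin's class)."""

    def __init__(
        self,
        partitions: Sequence[Sequence[str]],
        index_cache: Optional[List[str]] = None,
        row_lengths_cache: Optional[List[int]] = None,
    ) -> None:
        self._partitions = partitions
        self._index_cache = index_cache
        self._row_lengths_cache = row_lengths_cache

    def __len__(self) -> int:
        # If the index is already materialized, its length is available as is.
        if self._index_cache is not None:
            return len(self._index_cache)
        # Otherwise fall back to per-partition row lengths, computing and
        # caching them if needed, so the full index is never built.
        if self._row_lengths_cache is None:
            self._row_lengths_cache = [len(part) for part in self._partitions]
        return sum(self._row_lengths_cache)


parts = [["r0", "r1"], ["r2", "r3", "r4"]]
with_index = FrameSketch(parts, index_cache=["r0", "r1", "r2", "r3", "r4"])
with_lengths = FrameSketch(parts, row_lengths_cache=[2, 3])
with_nothing = FrameSketch(parts)
```

Callers then write `len(frame)` and do not need to know which cache is populated; in the last case only the row lengths get materialized and the index cache stays empty.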

@anmyachev (Collaborator, Author) replied:

@dchigarev I added __len__.

@dchigarev merged commit 19f035c into modin-project:master on Nov 7, 2023
36 checks passed
@anmyachev deleted the groupby-exp branch on November 7, 2023 11:55