PERF-#6710: Don't materialize index in `_groupby_shuffle` internal function #6707

anmyachev · 2023-11-04T19:23:29Z

What do these changes do?

Length materialization is in most cases cheaper than index materialization.

- first commit message and PR title follow format outlined here
  NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- signed commit with `git commit -s`
- Resolves #6710 (Don't materialize index in `_groupby_shuffle` internal function)
- tests added and passing
- module layout described at `docs/development/architecture.rst` is up-to-date

dchigarev · 2023-11-06T10:46:45Z

> Length materialization is in most cases cheaper than index materialization.

It depends on which cache is available at the moment. If we're willing to do this kind of optimization, I would propose adding a `__len__()` method to `ModinDataframe` that computes the length from either the index cache or the row-lengths cache, depending on which one is already materialized.
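To make the proposal concrete, here is a minimal sketch of a cache-aware `__len__`. It is not Modin's actual implementation: `ToyFrame`, its attributes, and the `index_materializations` counter are hypothetical names used only for illustration, and partitions are modeled as plain lists of row labels. Falling back to row lengths when neither cache is populated is also an assumption of this sketch.

```python
from typing import List, Optional


class ToyFrame:
    """Hypothetical stand-in for a partitioned dataframe (not a Modin class)."""

    def __init__(
        self,
        partitions: List[List[str]],
        index_cache: Optional[List[str]] = None,
        row_lengths_cache: Optional[List[int]] = None,
    ) -> None:
        self._partitions = partitions
        self._index_cache = index_cache
        self._row_lengths_cache = row_lengths_cache
        # Counts how often the full index had to be built (illustration only).
        self.index_materializations = 0

    @property
    def index(self) -> List[str]:
        # Building the index gathers every row label from every partition.
        if self._index_cache is None:
            self.index_materializations += 1
            self._index_cache = [
                label for part in self._partitions for label in part
            ]
        return self._index_cache

    @property
    def row_lengths(self) -> List[int]:
        # Row lengths need only one integer per partition.
        if self._row_lengths_cache is None:
            self._row_lengths_cache = [len(part) for part in self._partitions]
        return self._row_lengths_cache

    def __len__(self) -> int:
        # Prefer whichever cache is already materialized.
        if self._index_cache is not None:
            return len(self._index_cache)
        # Assumption: when the index is not cached, row lengths are used,
        # so the index is never materialized just to count rows.
        return sum(self.row_lengths)


frame = ToyFrame([["a", "b"], ["c"]])
n_rows = len(frame)  # counts rows without building the index
```

With this shape, a caller that only needs the row count can use `len(frame)` instead of `len(frame.index)` and leave the choice of cache to the frame.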

anmyachev · 2023-11-06T17:43:22Z

> > Length materialization is in most cases cheaper than index materialization.
>
> It depends on which cache is available at the moment. If we're willing to do this kind of optimization, I would propose adding a `__len__()` method to `ModinDataframe` that computes the length from either the index cache or the row-lengths cache, depending on which one is already materialized.

@dchigarev I added `__len__`.

anmyachev changed the title ~~PERF-#0000: Don't materialize index in _groupby_shuffle internal function~~ PERF-#6710: Don't materialize index in _groupby_shuffle internal function Nov 5, 2023

anmyachev marked this pull request as ready for review November 5, 2023 16:32

anmyachev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, dchigarev and a team as code owners November 5, 2023 16:32

anmyachev added 2 commits November 6, 2023 18:07

PERF-#0000: Don't materialize index in '_groupby_shuffle' internal fu…

71ebe9e

…nction Signed-off-by: Anatoly Myachev <[email protected]>

add __len__ for PandasDataFrame

4cc0b75

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev force-pushed the groupby-exp branch from 2fef468 to 4cc0b75 Compare November 6, 2023 17:07

dchigarev approved these changes Nov 7, 2023

dchigarev merged commit 19f035c into modin-project:master Nov 7, 2023
36 checks passed

anmyachev deleted the groupby-exp branch November 7, 2023 11:55
