FEAT-#4605: Add native query compiler #7259

Merged: 19 commits, Aug 26, 2024

arunjose696
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #4605 (Handle Empty/Small Data DataFrames as a separate case)
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@arunjose696 arunjose696 changed the title from "Adding small query compiler" to "FEAT-#4605: Adding small query compiler" May 13, 2024
@arunjose696 arunjose696 force-pushed the arun-sqc branch 3 times, most recently from f80e353 to 41bab97 Compare May 16, 2024 11:45
Collaborator

@devin-petersohn devin-petersohn left a comment

Great start on solving this problem! Is it possible to avoid so many of the test changes?

@@ -851,4 +851,11 @@ def _check_vars() -> None:
)


class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool):
Collaborator

This name is probably a little confusing for users. I suggest something like SmallDataframeMode. This can be set to None by default, and users can set it to "pandas" or some other option in the future (we may have some other single node options coming).

Collaborator

@devin-petersohn, do you think VanillaPandasMode is a good option? Also, why do you think we should make this config of string type to have choices None/pandas/etc.? Wouldn't it be sufficient to have this config boolean - enable/disable?

Collaborator

In the future we may add polars mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.
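To make the string-versus-boolean trade-off concrete, here is a minimal sketch of a string-valued setting that is unset by default and accepts "pandas" today. Everything in it is an assumption made for illustration: the class name `SmallDataframeModeSketch`, the variable name `EXAMPLE_SMALL_DATAFRAME_MODE`, and the `get` helper are hypothetical and are not Modin's `EnvironmentVariable` API.

```python
import os


class SmallDataframeModeSketch:
    """Hypothetical string-valued config; NOT Modin's EnvironmentVariable API."""

    varname = "EXAMPLE_SMALL_DATAFRAME_MODE"  # assumed variable name
    choices = ("pandas",)  # a later option such as "polars" would be appended here

    @classmethod
    def get(cls, environ=None):
        """Return the selected mode, or None when the variable is unset."""
        environ = os.environ if environ is None else environ
        raw = environ.get(cls.varname)
        if raw is None:
            return None
        # Normalize case and whitespace so "Pandas" and "pandas" are equivalent.
        value = raw.strip().lower()
        if value not in cls.choices:
            raise ValueError(
                f"{cls.varname} must be one of {cls.choices}, got {raw!r}"
            )
        return value
```

With this shape, supporting another single-node option would mean extending `choices` rather than adding a second boolean config, which is the point made above.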

Collaborator

Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?

Collaborator

Potentially, yes I think this is something we could support in the future.

Collaborator

@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?
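For illustration only, a threshold-based choice could be sketched as below. The function name `choose_query_compiler_kind`, the cell-count heuristic, and the default threshold of 10,000 cells are all assumptions made for this sketch; nothing in this PR implements or commits to such behavior.

```python
def choose_query_compiler_kind(n_rows, n_cols, threshold=10_000):
    """Pick "native" for small data and "distributed" otherwise.

    Hypothetical helper: the size measure (rows * columns) and the
    default threshold are illustrative assumptions, not Modin behavior.
    """
    if n_rows < 0 or n_cols < 0:
        raise ValueError("n_rows and n_cols must be non-negative")
    n_cells = n_rows * n_cols
    return "native" if n_cells <= threshold else "distributed"
```

A caller would evaluate this once at construction time, for example `choose_query_compiler_kind(100, 5)` for a 100-row, 5-column frame.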

Collaborator

I propose renaming UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler, by analogy with the HdkOnNative we had previously.

Collaborator

At a minimum, a more complete definition of this class in the docstring is required.

Collaborator Author

I will rename UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler.

@arunjose696
Collaborator Author

> Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the test changes disable a few checks that won't be supported without partitions, or are needed because the current changes don't yet support IO such as pd.read_csv(). Is there something specific that should be avoided?

@devin-petersohn
Collaborator

> Is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

@arunjose696 arunjose696 marked this pull request as draft May 22, 2024 19:49
@arunjose696 arunjose696 force-pushed the arun-sqc branch 2 times, most recently from e6b035f to d406414 Compare May 23, 2024 11:08
@arunjose696 arunjose696 marked this pull request as ready for review June 24, 2024 13:05
Co-authored-by: Iaroslav Igoshev <[email protected]>
Collaborator

@YarShev YarShev left a comment

@arunjose696, please fix the ci-required / lint (pydocstyle) (pull_request) job. Other than that, LGTM!

@arunjose696 arunjose696 force-pushed the arun-sqc branch 4 times, most recently from 3da8c6e to 17b12aa Compare July 8, 2024 10:51
Signed-off-by: arunjose696 <[email protected]>
YarShev previously approved these changes Jul 8, 2024
Collaborator

@YarShev YarShev left a comment

LGTM!

@YarShev YarShev changed the title from "FEAT-#4605: Adding small query compiler" to "FEAT-#4605: Add native query compiler" Jul 15, 2024
@YarShev YarShev merged commit da01571 into modin-project:main Aug 26, 2024
140 of 141 checks passed
Linked issue: Handle Empty/Small Data DataFrames as a separate case (#4605)