FEAT-#4605: Implementation of Small Query Compiler to support small and empty DataFrames #5113
base: master
8c0f7f8 to fe123de
This pull request introduces 9 alerts when merging fe123de71f23dfe852bbfa6bb74ae154f3c53f94 into f492ba9 - view on LGTM.com new alerts:
This pull request introduces 7 alerts when merging e0f36929ab8b4a6b5e7755714eaa72c61d891cf6 into 6f0ff79 - view on LGTM.com new alerts:
This pull request introduces 6 alerts when merging a3bac2c06ba1e6ce1d467f0281daf7ac870ada17 into 6f0ff79 - view on LGTM.com new alerts:
Looks pretty good! Left a couple of high level comments.
Do these changes need to be reflected in docs?
912654e to 140d0d8
Signed-off-by: Bill Wang <[email protected]>
…yCompiler Signed-off-by: Bill Wang <[email protected]>
fa06e5d to 43ff32b
test-small-query-compiler:
  - 'modin/experimental/core/storage_formats/pandas/small_query_compiler.py'
  - 'modin/core/storage_formats/pandas/query_compiler.py'
  - 'modin/core/storage_formats/base/query_compiler.py'
this looks clever, but I think it feels temporary to me - query compilers both use stuff from lower level and are used by higher level, so this integration should eventually be tested upon change in any of the things used throughout the pipeline... this should probably stay for now, but be removed towards the shiny future where the "auto-switch" would happen
queries for small data and empty ``PandasDataFrame``.
"""

from modin.config.envvars import InitializeWithSmallQueryCompilers
the current design for configuration is that all config variables come from modin.config regardless of what value source they have (even though we have one value source so far)
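To make the convention concrete, here is a minimal sketch of a flag that is read from the environment but exposed through one config namespace, so callers never depend on where the value comes from. The `EnvFlag` base class and the environment variable name are assumptions made for illustration; this is not Modin's actual config machinery.

```python
import os


class EnvFlag:
    """Hypothetical boolean config variable backed by an environment variable."""

    varname = None  # environment variable name, set by subclasses
    default = False

    @classmethod
    def get(cls):
        # Fall back to the default when the variable is not set at all.
        raw = os.environ.get(cls.varname)
        if raw is None:
            return cls.default
        return raw.strip().lower() in {"1", "true", "yes"}


class InitializeWithSmallQueryCompilers(EnvFlag):
    # The variable name below is an assumption made for this sketch.
    varname = "EXAMPLE_INITIALIZE_WITH_SMALL_QUERY_COMPILERS"
```

Under the convention stated in this comment, the import in the diff would presumably read `from modin.config import InitializeWithSmallQueryCompilers` rather than reaching into `modin.config.envvars`.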
from modin.utils import MODIN_UNNAMED_SERIES_LABEL
from modin.utils import (
    _inherit_docstrings,
    try_cast_to_pandas,
)
there's no need to import from single module in multiple statements
)


def _get_axis(axis):
I'm pretty sure we have this code somewhere already, we should reduce the copy-pasteness of this
    by_names.append(by[i].name)
elif isinstance(by[i], str):
    by_names.append(by[i])
if isinstance(by, pandas.DataFrame):
how can it be if by is squeezed at line 268?
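Assuming the squeeze at line 268 behaves like pandas' `squeeze(axis=1)` (an assumption, since that line is not shown in this thread), whether the `isinstance(by, pandas.DataFrame)` branch is reachable depends on how many columns `by` has. A small check of the pandas behavior:

```python
import pandas

single = pandas.DataFrame({"a": [1, 2, 3]})
multi = pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A one-column frame collapses to a Series, so a later DataFrame check fails.
single_squeezed = single.squeeze(axis=1)
# A frame with several columns comes back as a DataFrame from squeeze(axis=1).
multi_squeezed = multi.squeeze(axis=1)

print(type(single_squeezed).__name__)  # Series
print(type(multi_squeezed).__name__)   # DataFrame
```

So the branch would be dead code only for single-column `by`; a multi-column `by` could still reach it.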
if query_compiler is None and InitializeWithSmallQueryCompilers.get():
    small_dataframe = pandas.DataFrame(
        data=data, index=index, columns=columns, dtype=dtype, copy=copy
    )
    self._query_compiler = SmallQueryCompiler(small_dataframe)
this feels weird in terms of architecture
self._query_compiler = query_compiler.columnarize()
if name is not None:
    self._query_compiler = self._query_compiler
this removal feels completely wrong
also this is leaking architecture IMHO
def _validate_dtypes_sum_prod_mean(self, axis, numeric_only, ignore_axis=False):
def _validate_dtypes_sum_prod_mean(
    self, axis, numeric_only, ignore_axis=False
):  # noqa: PR01
    """
    Validate data dtype for `sum`, `prod` and `mean` methods.

    Parameters
    Parameter
why change?
InitializeWithSmallQueryCompilers.get(),
reason="SmallQueryCompiler does not currently support IO functions.",
doesn't this "not supporting" make current PR a draft until it really does support I/O?
if InitializeWithSmallQueryCompilers.get():
    return DataFrame(query_compiler=SmallQueryCompiler(df))
this feels completely off; I thought small query compiler should go the usual route for other QCs (in this case it should get picked by FactoryDispatcher according to user's settings)
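A minimal sketch of the routing idea raised here: the API layer asks a dispatcher for a query compiler, and the dispatcher consults a setting, instead of the API layer instantiating the small query compiler itself. Every name below (`Settings`, `Dispatcher`, `_REGISTRY`, and the stand-in compiler classes) is hypothetical and does not reproduce Modin's FactoryDispatcher API.

```python
class BaseQueryCompiler:
    """Stand-in base class; holds whatever frame object it was given."""

    def __init__(self, frame):
        self.frame = frame


class PartitionedQueryCompiler(BaseQueryCompiler):
    """Stand-in for the regular, partitioned query compiler."""


class SmallQueryCompiler(BaseQueryCompiler):
    """Stand-in for the small-data query compiler discussed in this PR."""


# Hypothetical registry mapping a setting value to a query compiler class.
_REGISTRY = {
    "partitioned": PartitionedQueryCompiler,
    "small": SmallQueryCompiler,
}


class Settings:
    # Hypothetical user-facing setting; a real one could come from config.
    storage_format = "partitioned"


class Dispatcher:
    @classmethod
    def from_pandas(cls, frame):
        # The only place that decides which query compiler gets built.
        qc_cls = _REGISTRY[Settings.storage_format]
        return qc_cls(frame)


Settings.storage_format = "small"
qc = Dispatcher.from_pandas({"a": [1, 2, 3]})
print(type(qc).__name__)  # SmallQueryCompiler
```

With this shape, the `if InitializeWithSmallQueryCompilers.get():` branches in the API layer would not be needed, since the choice is made in one place.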
@billiam-wang, do you have a plan to continue this?
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date