Handle Empty/Small Data DataFrames as a separate case #4605
Comments
Signed-off-by: Naren Krishna <[email protected]>
@modin-project/modin-core @modin-project/modin-contributors @RehanSD @vnlitvinov @anmyachev Currently, indexes are computed asynchronously, which makes it difficult to determine whether a dataframe is empty without waiting for the index computation to complete. Does anybody have suggestions on how to approach this problem? Some ideas we have include changes at the query compiler level, the API level, or the Modin core level wherever rows or columns may be added or removed.
In most cases the axes are known, and I'm pretty sure most operations can be analyzed to see what effect they have on the axes, so in the typical case both axes would be known. We can simply assume that either we know the axes (and so can use their sizes to decide which compiler to apply) or the dataframe is big. Only a few operations have an unpredictable effect on the resulting axes - filtering by some user-defined condition (like
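The "known axes or assume big" rule above could be sketched roughly as follows. This is only an illustration in plain Python, not Modin code: `AxisInfo`, `pick_compiler`, `SMALL_CELL_LIMIT`, and the `"pandas"` / `"distributed"` labels are all hypothetical names, and the threshold value is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: frames with at most this many cells count as "small".
SMALL_CELL_LIMIT = 10_000


@dataclass(frozen=True)
class AxisInfo:
    """Axis lengths tracked alongside a frame; None means "not known yet"."""

    n_rows: Optional[int]
    n_cols: Optional[int]

    def select_columns(self, k: int) -> "AxisInfo":
        # Selecting k columns has a predictable effect: rows are unchanged.
        return AxisInfo(self.n_rows, k)

    def head(self, n: int) -> "AxisInfo":
        # head(n) is predictable only if the row count was already known.
        rows = None if self.n_rows is None else min(self.n_rows, n)
        return AxisInfo(rows, self.n_cols)

    def filter_rows(self) -> "AxisInfo":
        # A user-defined filter makes the resulting row count unpredictable.
        return AxisInfo(None, self.n_cols)


def pick_compiler(info: AxisInfo) -> str:
    """Use axis sizes when both are known; otherwise assume the frame is big."""
    if info.n_rows is None or info.n_cols is None:
        return "distributed"
    if info.n_rows * info.n_cols <= SMALL_CELL_LIMIT:
        return "pandas"
    return "distributed"
```

With this sketch, `pick_compiler(AxisInfo(100, 5))` picks the pandas path, while the same frame after `filter_rows()` falls back to the distributed path because its row count is no longer known without waiting on the index.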
Co-authored-by: Igoshev, Iaroslav <[email protected]> Signed-off-by: arunjose696 <[email protected]>
Is your feature request related to a problem? Please describe.
In our current approach, we default empty dataframes to pandas at the query compiler level, which adds some overhead and has led to bugs with empty dataframes (#4306, #4307). Ideally, we would default to pandas not only for empty dataframes but also for dataframes small enough that the cost of distributing the data outweighs the benefit.
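As a rough illustration of the idea, the sketch below treats an empty frame as just the smallest case of a "small" frame and routes both to a local implementation. It uses plain Python lists rather than Modin or pandas objects; `run_with_fallback`, `chunked_sum`, and `SMALL_ROW_LIMIT` are hypothetical names, and the threshold is an arbitrary placeholder rather than a measured value.

```python
from typing import Callable, Sequence, Tuple, TypeVar

R = TypeVar("R")

# Hypothetical cutoff below which distributing is assumed not to pay off.
SMALL_ROW_LIMIT = 1_000


def run_with_fallback(
    rows: Sequence[int],
    local_fn: Callable[[Sequence[int]], R],
    distributed_fn: Callable[[Sequence[int]], R],
    small_row_limit: int = SMALL_ROW_LIMIT,
) -> Tuple[str, R]:
    """Run local_fn on empty or small inputs, distributed_fn otherwise."""
    if len(rows) <= small_row_limit:
        # Empty inputs (len == 0) take this branch too, so they need no
        # separate special case.
        return "local", local_fn(rows)
    return "distributed", distributed_fn(rows)


def chunked_sum(rows: Sequence[int], n_chunks: int = 4) -> int:
    """Stand-in for a distributed reduction: sum per chunk, then combine."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division, at least 1
    partials = [sum(rows[i:i + size]) for i in range(0, len(rows), size)]
    return sum(partials)
```

Here `run_with_fallback([], sum, chunked_sum)` and `run_with_fallback(list(range(10)), sum, chunked_sum)` both take the local path, while a 5,000-row input goes through `chunked_sum`. Where the real cutoff should sit would need to be measured per engine.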