@Garra1980 thank you for reporting this issue. I can reproduce it at version 86d3610.
The root cause is that when Modin defaults to pandas for the dataframe __getitem__, it converts the boolean indexer df['col1'].isin(['1','2']) to pandas but the result has the wrong dtype. So Modin indexes the pandas dataframe with pandas.Series([], dtype="object", name='col1') instead of pandas.Series([], dtype=bool, name='col1'). For some reason, indexing with the former gives a dataframe with no columns instead of one with the correct columns.
Here's modin getting the wrong dtype for the indexer when converting to pandas:
import modin.pandas as pd
df = pd.DataFrame(columns=['col1', 'col2'])
modin_indexer = df['col1'].isin(['1','2'])
# Modin dtype is bool
print(modin_indexer.dtype)
# _to_pandas() dtype is object
print(modin_indexer._to_pandas().dtype)
and here is the difference in behavior for the two indexers:
import pandas
pdf = pandas.DataFrame(columns=['col1', 'col2'])
bool_indexer = pandas.Series([], dtype=bool, name='col1')
object_indexer = pandas.Series([], dtype="object", name='col1')
# prints Index(['col1', 'col2'], dtype='object')
print(pdf[bool_indexer].columns)
# prints Index([], dtype='object')
print(pdf[object_indexer].columns)
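To illustrate why the dtype matters, here is a minimal pandas-only sketch of casting the empty indexer back to bool before indexing. The helper name coerce_empty_indexer is hypothetical and is not part of Modin or pandas; this is not the fix proposed in the draft PR, only a demonstration that a bool-dtype empty indexer keeps the columns.

```python
import pandas


def coerce_empty_indexer(indexer: pandas.Series) -> pandas.Series:
    """Hypothetical helper: cast an empty object-dtype indexer to bool.

    Non-empty indexers and indexers with any other dtype are returned unchanged.
    """
    if len(indexer) == 0 and indexer.dtype == object:
        return indexer.astype(bool)
    return indexer


pdf = pandas.DataFrame(columns=['col1', 'col2'])
# Same object-dtype empty indexer that _to_pandas() produces in this issue.
object_indexer = pandas.Series([], dtype="object", name='col1')

fixed_indexer = coerce_empty_indexer(object_indexer)
result = pdf[fixed_indexer]

print(fixed_indexer.dtype)
print(result.columns)
```

With the indexer cast to bool, pandas treats it as a boolean mask rather than as an empty list of column labels, so both columns survive.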
_to_pandas() is known to get incorrect dtypes for empty dataframes, e.g. in #4191 and #4060. #4605 tracks a way to robustly handle empty dataframes in general in Modin. We actually have a draft PR, #4606, ready for that feature. I think that PR should fix this bug.
System information
Another example of a difference between Modin and pure pandas, for the following snippet:
df = pd.DataFrame(columns=['col1', 'col2'])
df = df[df['col1'].isin(['1','2'])]
print(df)
Describe the problem
modin.pandas will print:
Empty DataFrame
Columns: []
Index: []
default pandas will print:
Empty DataFrame
Columns: [col1, col2]
Index: []
Not sure pandas is super correct here though
Source code / logs