-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue where quantify doesn't work on DataFrames containing string columns #217
Conversation
Nice fix! Add a test to pint_pandas/testsuite/test_issues.py. Make a class/function like
You can use this PR# as the issue #. the code in your post is basically the test already, just add an assert to check the two dfs are same
|
I've added a new |
Looks good, you'll need to run |
Out of interest, do you find the lack of visibility of the float64/object dtype making the PintArrays look the same but actually be different an issue? |
Yes, I think that might be the underlying issue here (for the testing). The two DataFrames should not pass the "isequal" test but the underlying pint arrays show that they are equal even though they really aren't as the data they contain is of a different type. Example: df = pd.DataFrame({'a':[1,2,3], 'b':['apple', 'pear', 'kiwi']})
p1 = PintArray(df.to_numpy()[:,0], 'm') #inner numpy array contains data of type 'object'
p2 = PintArray(np.array([1,2,3], dtype=np.float64), 'm')
p1.equals(p2) #returns True when it should return False |
Yea I think you're right, that's how pandas EA behaves: >>> p1.data
<NumpyExtensionArray>
[1, 2, 3]
Length: 3, dtype: object
>>> p2.data
<NumpyExtensionArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64
>>> p1.data==p2.data
<NumpyExtensionArray>
[True, True, True]
Length: 3, dtype: bool
>>> p1.data.equals(p2.data)
False |
thanks for fixing that |
I just stumbled upon the same problem, but for data frames that also contain numeric columns without units. The current fix only seems to fix the unwanted type conversion to dtype='object' for columns with units, while the same happens with numeric columns without units. In the following example, the "something_else" column has numerical values, but no associated units: import pandas as pd
import pint_pandas
df = pd.DataFrame(
{
'power': pd.Series([1.0, 2.0, 3.0], dtype='pint[W]'),
'torque': pd.Series([4.0, 5.0, 6.0], dtype='pint[N*m]'),
'fruits': pd.Series(['apple', 'pear', 'kiwi']),
'something_else': pd.Series([7.0, 8.0, 9.0]),
}
)
df1 = df.pint.dequantify().pint.quantify(level=-1)
print(df.dtypes)
print(df1.dtypes) This could probably be fixed by changing the "else" branch at code from |
a PR to fix that is welcomed. |
I'm hoping someone can help me a bit. I've never contributed to an open source project and am not sure what the protocol is or how to run tests so I appreciate your patience. I'd be happy to modify my request as needed.
The simple change I made enables the
quantify
method on a DataFrame to work even if the DataFrame contains string columns. Currently this works, but there's a unintended side effect. I've included a minimal example below to show what I mean.The current code would output:
After the update I made:
As can be seen, the current code converts the underlying data to an
object
data type when it should remainfloat64
. This occurs because the current code takes the entire DataFrame and converts it to a 2D array before indexing (at which point the entire array is converted toobject
datatypes. My modification indexes into individual columns so that the data is never cast to a different underlying type.pre-commit run --all-files
with no errors