Management of pandas object #140

Open
lionelkusch opened this issue Jan 23, 2025 · 4 comments
Labels
coding style (question regarding formatting and declaration of functions), method implementation (Question regarding methods implementations)

Comments

@lionelkusch
Collaborator

For the moment, pandas objects are not handled by all the methods. In the refactoring, I removed the pandas handling and replaced it with only a conversion of the data to numpy arrays.
In the long term, we need to provide a homogeneous way to handle pandas objects.

@lionelkusch added the method implementation and coding style labels Jan 23, 2025
@bthirion
Contributor

Can you be a bit more precise?
The basic answer is that we should support pandas arrays everywhere sklearn supports them.

@lionelkusch
Collaborator Author

For the moment, I am not sure that pandas arrays are supported everywhere.
However, to be more precise, the question is what support we want to offer.

The minimal support is to accept pandas objects as input and convert everything into numpy arrays.
A more elaborate support is that if the input is a pandas array, the output is also a pandas array. Additionally, if the input is a data frame, it would be nice to return a data frame with the correct names for the columns and rows.
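
The two levels of support described above can be sketched as follows. This is only an illustration under stated assumptions: `to_numpy_input` and `center_columns` are hypothetical names, not functions from the project, and column centering stands in for any real method.

```python
import numpy as np
import pandas as pd


def to_numpy_input(X):
    """Minimal support: accept a DataFrame (or array-like), return a numpy array."""
    return X.to_numpy() if isinstance(X, pd.DataFrame) else np.asarray(X)


def center_columns(X):
    """Hypothetical method: subtract each column's mean.

    More elaborate support: if X was a DataFrame, return a DataFrame
    that reuses its row index and column names.
    """
    values = to_numpy_input(X).astype(float)
    result = values - values.mean(axis=0)
    if isinstance(X, pd.DataFrame):
        return pd.DataFrame(result, index=X.index, columns=X.columns)
    return result


df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]}, index=["r1", "r2"])
out_df = center_columns(df)                      # DataFrame in, DataFrame out
out_np = center_columns([[1.0, 2.0], [3.0, 6.0]])  # array-like in, ndarray out
```

Keeping the conversion in one helper would make it easier to apply the same behaviour across all methods.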

@bthirion
Contributor

Thx for clarifying. When you say input/output, you think of CPI ?

Until recently, I would have said: we consider only numpy arrays (and thus convert dataframes in a first step), but I increasingly think that X can be provided as a DataFrame, because it handles heterogeneous data, e.g. strings, floats etc.
By contrast, the importance scores and p-values are always floats. So your suggestion is to reuse rows and columns names when the inputs were provided as dfs ?
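
As a rough sketch of that suggestion, float scores could be labelled with the column names of a DataFrame input. The helper name `label_scores` is hypothetical, and the score values below are placeholders, not results from any method.

```python
import numpy as np
import pandas as pd


def label_scores(scores, X):
    """Attach feature names to a 1-D array of scores when X is a DataFrame."""
    scores = np.asarray(scores, dtype=float)
    if isinstance(X, pd.DataFrame):
        return pd.Series(scores, index=X.columns, name="importance")
    return scores


# Heterogeneous input: one string column and one float column.
X = pd.DataFrame({"city": ["Paris", "Lyon"], "income": [3.2, 2.8]})
importance = label_scores([0.1, 0.9], X)  # placeholder values
```

The scores stay plain floats; only the labelling depends on the input type.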

@lionelkusch
Collaborator Author

lionelkusch commented Jan 27, 2025

> Thx for clarifying. When you say input/output, you think of CPI ?

I was thinking in general.

> So your suggestion is to reuse rows and columns names when the inputs were provided as dfs ?

Yes, this is one proposal. However, I am not an expert on pandas and I don't yet know the subtleties between the different pandas types. From a quick look at pandas, I see that the main type is DataFrame. In this case, if we want to support pandas, we should handle DataFrame.

> X can be provided as a DataFrame, because it handles heterogeneous data, e.g. strings, floats etc.

A short comment: numpy can also handle heterogeneous data, even if that is not really what it targets.
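
To illustrate the point, numpy offers at least two ways of holding mixed types: object arrays and structured arrays. The field names and values below are made up for the example.

```python
import numpy as np

# Object dtype: each cell holds an arbitrary Python object.
mixed = np.array([["Paris", 3.2], ["Lyon", 2.8]], dtype=object)

# Structured dtype: named, typed fields, closer to DataFrame columns.
records = np.array(
    [("Paris", 3.2), ("Lyon", 2.8)],
    dtype=[("city", "U10"), ("income", "f8")],
)
mean_income = records["income"].mean()
```

Object arrays lose vectorised numeric operations, and structured arrays are less convenient than a DataFrame for column-wise work, which is probably why they are rarely used for this purpose.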
