How to preprocess invalid CSV in a canonical way #1835

PanCakeConnaisseur · 2022-09-06T19:44:22Z

What is the canonical way of fixing a CSV file with illegal syntax and then continuing to work with it? I can't use type: pandas.CSVDataSet for it in the data catalog because parsing it would drop some of the illegal data.

So far I am using kedro.extras.datasets.text.TextDataSet and fixing the raw string of the file in a node. But how should I create the next catalog entry? I tried telling the node to output into a data entry of type: pandas.CSVDataSet, but I get an error that str does not have a to_csv attribute. Should I call pandas.read_csv() manually in my syntax-fixing method? Or how do I add preprocessing steps to fix the faulty CSV?

noklam · 2022-09-06T19:48:49Z

Collaborator

What do you mean by illegal syntax? If it's not a valid csv file then you will just treat it as a normal text file.

In that case doing pd.read_csv within your node may not be too bad; it acts as transformation logic (arguably a string2dataframe function) rather than I/O.
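A minimal sketch of such a string-to-DataFrame node, for illustration only. The function name `text_to_dataframe` is made up here, and the sketch assumes the text has already been repaired by an earlier node:

```python
import io

import pandas as pd


def text_to_dataframe(fixed_text: str) -> pd.DataFrame:
    """Parse already-repaired CSV text into a DataFrame inside a node."""
    # StringIO lets pandas read the in-memory string as if it were a file.
    return pd.read_csv(io.StringIO(fixed_text))
```

Because this node returns a DataFrame rather than a str, its output should be able to go to a catalog entry of type pandas.CSVDataSet without the to_csv error.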

1 reply

PanCakeConnaisseur Sep 7, 2022
Author

> What do you mean by illegal syntax? If it's not a valid csv file then you will just treat it as a normal text file.

Some of the (logical CSV) lines are split across multiple (file) lines, and I need to combine them into one line first. If I just treat the file as a CSV and declare a data entry with CSVDataSet, every fragment after the first one of a split line is assigned to the wrong column.
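A rough sketch of that merging step, under assumptions that are not confirmed for the actual file: the first line is an intact header, every logical line has as many delimiters as the header, and no field contains a quoted delimiter. The name `merge_split_lines` is hypothetical. A break inside the last field would not be detected by this heuristic.

```python
def merge_split_lines(raw_text: str, delimiter: str = ",") -> str:
    """Join physical lines until each record has as many delimiters as the header."""
    lines = raw_text.splitlines()
    if not lines:
        return ""
    # Assumption: the header is intact and defines the delimiter count per record.
    expected = lines[0].count(delimiter)
    merged = [lines[0]]
    buffer = ""
    for line in lines[1:]:
        buffer += line  # fragments are concatenated without a separator
        if buffer.count(delimiter) >= expected:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Keep a trailing incomplete fragment instead of silently dropping data.
        merged.append(buffer)
    return "\n".join(merged) + "\n"
```

The returned string could then be passed on to a node that calls pd.read_csv on it.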

> In that case doing pd.read_csv within your node may not be too bad; it acts as transformation logic (arguably a string2dataframe function) rather than I/O.

I thought there might be hooks or other built-in functionality for handling invalid CSV files, since this is probably a common occurrence; at least it was in my ML projects. I think I can do it "manually" with TextDataSet, but I assumed the Kedro developers had already accommodated that case, hence my question about a canonical solution.
