Pandas-cleaner is a Python package, built on top of pandas, that provides methods detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).
pip install pandas-cleaner
Pandas-cleaner offers functionnalities to automatically :
- detect different kind of potential errors in datasets such as outliers, inconsistencies, typos, wrong-typed ..., given predefined rules or statistiscal estimations, via an easy-to-use API extending pandas,
- analyze these errors, via reports and plots, to check the validity of the set and/or decide if any correction is needed,
- clean the datasets, either by dropping the lines with errors, emptying, correcting or replacing bad values,
- reapply the same rules to any other incoming fresh data.
Import the package
import pandas as pd
import pdcleaner
Create an example data series
series = pd.Series([1, 5, -6, 100, 10])
Detect the errors in the series with a given method (such as bounded
, iqr
, zscore
and many more depending the type of data...)
detector = series.cleaner.detect('bounded', lower=0, upper=10)
Inspect the result:
detector.report()
Detection report
==============================================================================
Method: bounded Nb samples: 5
Date: January 24,2022 Nb errors: 2
Time: 16:06:08 Nb rows with NaN: 0
------------------------------------------------------------------------------
lower 0 upper 10
inclusive both sided both
==============================================================================
Check the potential errors that have been detected
detector.detected()
2 -6
3 100
dtype: int64
Clean the detected errors from the series using the chosen method among drop
, to_na
, clip
, replace
...
series.cleaner.clean("drop", detector, inplace=True)
series
0 1
1 5
4 10
dtype: int64
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues