Skip to content

Pandas extension to detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).

License

Notifications You must be signed in to change notification settings

eurodecision/pandas-cleaner

Pandas-cleaner

Documentation Status

Pandas-cleaner is a Python package, built on top of pandas, that provides methods detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).

Installation

pip install pandas-cleaner

Features

Pandas-cleaner offers functionnalities to automatically :

  • detect different kind of potential errors in datasets such as outliers, inconsistencies, typos, wrong-typed ..., given predefined rules or statistiscal estimations, via an easy-to-use API extending pandas,
  • analyze these errors, via reports and plots, to check the validity of the set and/or decide if any correction is needed,
  • clean the datasets, either by dropping the lines with errors, emptying, correcting or replacing bad values,
  • reapply the same rules to any other incoming fresh data.

Usage

Import the package

import pandas as pd
import pdcleaner

Create an example data series

series = pd.Series([1, 5, -6, 100, 10])

Detect the errors in the series with a given method (such as bounded, iqr, zscore and many more depending the type of data...)

detector = series.cleaner.detect('bounded', lower=0, upper=10)

Inspect the result:

detector.report()
                                 Detection report                               
==============================================================================
Method:                      bounded      Nb samples:                        5
Date:                January 24,2022      Nb errors:                         2
Time:                       16:06:08      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower                              0      upper                             10
inclusive                       both      sided                           both
==============================================================================

Check the potential errors that have been detected

detector.detected()
 2     -6
 3    100
 dtype: int64

Clean the detected errors from the series using the chosen method among drop, to_na, clip , replace...

series.cleaner.clean("drop", detector, inplace=True)
   series
 0      1
 1      5
 4     10
 dtype: int64

Contributing to pandas-cleaner

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues

About

Pandas extension to detect, analyze and clean errors in datasets with different types of data (numerical, categorical, text, datetimes...).

Resources

License

Code of conduct

Stars

Watchers

Forks

Languages