Pandas-cleaner

Pandas-cleaner is a Python package, built on top of pandas, that provides methods to detect, analyze, and clean errors in datasets containing different types of data (numerical, categorical, text, datetimes...).

Installation

pip install pandas-cleaner

Features

Pandas-cleaner offers functionalities to automatically:

detect different kinds of potential errors in datasets, such as outliers, inconsistencies, typos, wrongly typed values, and more, given predefined rules or statistical estimations, via an easy-to-use API extending pandas,
analyze these errors, via reports and plots, to check the validity of the set and/or decide if any correction is needed,
clean the datasets, either by dropping the rows with errors or by emptying, correcting, or replacing bad values,
reapply the same rules to any other incoming fresh data.
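
To illustrate the last point, here is a minimal sketch in plain pandas of what "learning a rule once and reapplying it to fresh data" can look like. It does not use the pandas-cleaner API: the helpers `fit_iqr_bounds` and `flag_out_of_bounds` are hypothetical names introduced for this example, and the interquartile-range rule with a factor of 1.5 is an assumed choice, not necessarily what the package does internally.

```python
import pandas as pd


def fit_iqr_bounds(reference: pd.Series, k: float = 1.5) -> tuple:
    """Hypothetical helper: learn lower/upper fences from reference data."""
    q1, q3 = reference.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_out_of_bounds(data: pd.Series, lower: float, upper: float) -> pd.Series:
    """Hypothetical helper: boolean mask, True where a value is outside the fences."""
    return (data < lower) | (data > upper)


# Learn the rule once on a reference dataset...
reference = pd.Series([10, 12, 11, 13, 12, 11, 10, 14])
lower, upper = fit_iqr_bounds(reference)

# ...then reapply the same fences to fresh data without re-estimating them.
fresh = pd.Series([11, 40, 12, -3])
errors = flag_out_of_bounds(fresh, lower, upper)
print(fresh[errors])
```

Keeping the learned bounds separate from the data they were estimated on is what makes the rule reusable: the fresh data is judged against the reference distribution, not against itself.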

Usage

Import the package

import pandas as pd
import pdcleaner

Create an example data series

series = pd.Series([1, 5, -6, 100, 10])

Detect the errors in the series with a given method (such as bounded, iqr, zscore, and many more, depending on the type of data)

detector = series.cleaner.detect('bounded', lower=0, upper=10)
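
For intuition about the statistical methods named above, the following is a rough sketch of a z-score check written in plain pandas. It is not the package's implementation, and the threshold value is an assumption chosen for this tiny example: with only five values and the sample standard deviation, no point can reach the more common threshold of 3.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])

# Standardize each value; Series.std() uses the sample standard deviation (ddof=1).
z_scores = (series - series.mean()) / series.std()

# Assumed threshold, lowered for illustration because the sample is very small.
threshold = 1.5
flagged = series[z_scores.abs() > threshold]
print(flagged)
```

On this series, only the value 100 is flagged. The value -6 is not, because a statistical rule has no notion of a valid range; that is the kind of error a rule-based method such as bounded is meant to catch.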

Inspect the result:

detector.report()

                                 Detection report                               
==============================================================================
Method:                      bounded      Nb samples:                        5
Date:                January 24,2022      Nb errors:                         2
Time:                       16:06:08      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower                              0      upper                             10
inclusive                       both      sided                           both
==============================================================================

Check the potential errors that have been detected

detector.detected()

 2     -6
 3    100
 dtype: int64
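
As a point of comparison, the same bounded check can be sketched in plain pandas with a boolean mask. This is only an approximation of what the detector reports, assuming that both bounds are inclusive, as the `inclusive: both` line of the report suggests.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])
lower, upper = 0, 10

# True where the value lies within [lower, upper], both bounds included.
within_bounds = (series >= lower) & (series <= upper)

# The potential errors are the values outside the allowed range.
flagged = series[~within_bounds]
print(flagged)
```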

Clean the detected errors from the series using a method chosen among drop, to_na, clip, replace...

series.cleaner.clean("drop", detector, inplace=True)
series

 0      1
 1      5
 4     10
 dtype: int64
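
To make the differences between these cleaning strategies concrete, here is a sketch of rough plain-pandas equivalents applied to the same series. The mapping is an assumption based on the method names only; in particular, replacing bad values with the median of the valid values is an illustrative choice, and the package's replace method may behave differently.

```python
import pandas as pd

series = pd.Series([1, 5, -6, 100, 10])
errors = (series < 0) | (series > 10)  # same bounded rule as above

# "drop": remove the rows flagged as errors.
dropped = series[~errors]

# "to_na": keep the rows but blank out the bad values (the dtype becomes float).
emptied = series.mask(errors)

# "clip": pull out-of-range values back to the nearest bound.
clipped = series.clip(lower=0, upper=10)

# "replace" (assumed behaviour): substitute the median of the valid values.
replaced = series.mask(errors, series[~errors].median())

print(pd.DataFrame({"to_na": emptied, "clip": clipped, "replace": replaced}))
```

Dropping changes the length of the series, while the other three strategies preserve the index, which matters when the series is a column of a larger DataFrame.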

Contributing to pandas-cleaner

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

Issues and bugs can be reported at https://github.com/eurodecision/pandas-cleaner/issues
