Skip to content

ClustProject/KETIPrePartialDataPreprocessing

Repository files navigation

Clust.clust.preprocessing

Preprocessing module for one single dataset.

It includes cleaning, imputation, outlier detection modules. And It also has dataRemoveByNaN module which remove a part of data according to the NaN status.

1. main.py (+ main.ipynb)

This is the test code full DataProcessing pipeline. Input can be both file or data from inlxlufDB If you want to change db and measurement name or add another data ingestion methods, modify data_manager/multipleSourceIngestion.py

2. DataProcessing.py

2-1. function all_preprocessing(input_data, refine_param, outlier_param, imputation_param)

This function gets cleaner data by all possible data preprocessing modules from Clust.clust.preprocessing packages. Refinging, Outlier Detction, Imputation

2-2. function multiDataset_all_preprocessing(multiple_dataset, process_param)

This function make multiple-Datasets through All preprocessing method.

2-2. DataPreprocessing (class)

This class provdies several preprocessing Method from this package.

  • So far, data refining, outlier removal, imputation module are available.
  • There is a plan to expand more preprocessing modules.

2-2-1. get_refinedData(self, data, refine_param)

  • input: data, refine_param
         refine_param = {
        "removeDuplication":{"flag":True},
        "staticFrequency":{"flag":True, "frequency":None}
    }
  1. Clust.clust.preprocessing.data_cleaning.RefineData.RemoveDuplicateData: Remove duplicated data
  2. Clust.clust.preprocessing.data_cleaning.RefineData.make_staticFrequencyData: Let the original data have a static frequency
  • output: datafrmae type

2-2-2. get_errorToNaNData(self, data, outlier_param)

  • errorToNaN.errorToNaN:Let outliered data be.
    outlier_param  = {
        "certainErrorToNaN":{"flag":True},
        "unCertainErrorToNaN":{
            "flag":True,
            "param":{"neighbor":0.5}
        },
        "data_type":"air"
    }

2-2-3. get_imputedData(self, data, impuation_param)

  • Replace missing data with substituted values according to the imputation parameter.
     imputation_param = {
        "serialImputation":{
            "flag":True,
            "imputation_method":[{"min":0,"max":10,"method":"linear" , "parameter":{}},{"min":11,"max":20,"method":"mean" , "parameter":{}}],
            "totalNonNanRatio":80
        }
    }

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published