Preprocessing module for a single dataset.
It includes cleaning, imputation, and outlier detection modules. It also has a dataRemoveByNaN module, which removes part of the data according to its NaN status.
This is the test code for the full DataProcessing pipeline. Input can be either a file or data from InfluxDB. If you want to change the DB and measurement name, or add other data ingestion methods, modify data_manager/multipleSourceIngestion.py.
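A minimal sketch of the two ingestion paths is shown below, assuming the file path is a time-indexed CSV; the InfluxDB path is only delegated here, and the helper name get_data in multipleSourceIngestion is hypothetical:

import pandas as pd

def load_input(source, path_or_query):
    # File path: read a CSV whose first column is the time index.
    if source == "file":
        return pd.read_csv(path_or_query, index_col=0, parse_dates=True)
    # InfluxDB path: delegate to the ingestion module (hypothetical helper name).
    from data_manager import multipleSourceIngestion
    return multipleSourceIngestion.get_data(path_or_query)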
This function gets cleaner data by applying all available data preprocessing modules from the Clust.clust.preprocessing package: refining, outlier detection, and imputation.
This function makes multiple datasets by applying all preprocessing methods.
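A minimal sketch of chaining the three steps over several datasets; the class and method names (DataPreprocessing, get_refinedData, get_errorToNaNData, get_imputedData) are assumptions about this package's API and only mirror the descriptions below:

from Clust.clust.preprocessing.dataPreprocessing import DataPreprocessing  # assumed module path

def preprocess_all(dataset_dict, refine_param, outlier_param, imputation_param):
    # Apply refining, error-to-NaN conversion, and imputation to every dataset.
    processor = DataPreprocessing()
    result = {}
    for name, data in dataset_dict.items():
        refined = processor.get_refinedData(data, refine_param)                   # assumed method
        error_to_nan = processor.get_errorToNaNData(refined, outlier_param)       # assumed method
        result[name] = processor.get_imputedData(error_to_nan, imputation_param)  # assumed method
    return result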
This class provides several preprocessing methods from this package.
- So far, data refining, outlier removal, and imputation modules are available.
- There are plans to add more preprocessing modules.
- input: data, refine_param
refine_param = {
    "removeDuplication": {"flag": True},
    "staticFrequency": {"flag": True, "frequency": None}
}
- Clust.clust.preprocessing.data_cleaning.RefineData.RemoveDuplicateData: Remove duplicated data
- Clust.clust.preprocessing.data_cleaning.RefineData.make_staticFrequencyData: Make the original data have a static (regular) frequency
- output: dataframe type (see the pandas sketch below)
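To make the two refine steps concrete, the sketch below reproduces their effect with plain pandas (drop duplicated timestamps, then resample onto a regular grid); it only illustrates what the documented modules do, and the 1-minute frequency is an example value, not the package's default:

import pandas as pd

index = pd.to_datetime(["2022-01-01 00:00", "2022-01-01 00:00", "2022-01-01 00:03"])
data = pd.DataFrame({"value": [1.0, 1.0, 2.0]}, index=index)

# RemoveDuplicateData: keep the first row for each duplicated timestamp.
deduplicated = data[~data.index.duplicated(keep="first")]

# make_staticFrequencyData: put the data on a regular 1-minute grid.
static_freq = deduplicated.resample("1min").mean()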
- errorToNaN.errorToNaN: Convert outlier values to NaN (see the sketch after the parameter example below)
outlier_param = {
    "certainErrorToNaN": {"flag": True},
    "unCertainErrorToNaN": {
        "flag": True,
        "param": {"neighbor": 0.5}
    },
    "data_type": "air"
}
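To illustrate the certain-error case from the parameters above, the sketch below masks values outside a known valid range as NaN with plain pandas; the column name and the bounds for "air" data are made-up examples, and this is not the package's own implementation:

import numpy as np
import pandas as pd

data = pd.DataFrame({"pm10": [12.0, -4.0, 900.0, 35.0]})

# certainErrorToNaN: values outside the physically valid range become NaN.
valid_min, valid_max = 0.0, 600.0  # example bounds for an "air" measurement
data["pm10"] = data["pm10"].where(data["pm10"].between(valid_min, valid_max), np.nan)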
- Replace missing data with substituted values according to the imputation parameter (an illustration of the fill methods follows the parameter example below).
imputation_param = {
    "serialImputation": {
        "flag": True,
        "imputation_method": [
            {"min": 0, "max": 10, "method": "linear", "parameter": {}},
            {"min": 11, "max": 20, "method": "mean", "parameter": {}}
        ],
        "totalNonNanRatio": 80
    }
}
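To make the method names in imputation_method concrete, the sketch below reproduces the "linear" and "mean" fills with plain pandas on a toy series; it illustrates what the parameters request (under an assumed reading: each entry covers NaN gaps whose length is between min and max, and totalNonNanRatio skips columns with less than 80% valid data), not the package's own implementation:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# "linear": interpolate across the NaN gap using neighboring values.
linear_filled = s.interpolate(method="linear")

# "mean": replace NaNs with the overall series mean.
mean_filled = s.fillna(s.mean())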