Generic data reader for common ML routines.
Training multiple models through a pipeline, each stage with different input and output data formats, can be very annoying. For example, when we want to investigate the performance of each stage, we need to handle a different data format per stage, which usually requires a lot of hard-coding.

This package aims to simplify this process by providing an additional data format definition file. Through this file, data can be read, processed, and formed into a suitable dataframe for later use in a clean, readable way.
Features:

- Merge data into a single object across files from different locations.
- Data processing API for multi-dimensional data.
- Extensible API for custom data formats.

Planned:

- Enhance the data processing API.
- Support shared data across all events.
- Provide a data post-processing API to create missing columns automatically.
You can clone this project and install it with

```bash
pip3 install -e .
```

and walk through the examples, or install only the package with

```bash
pip3 install git+https://github.com/rlf23240/ExaTrkXDataIO
```
Before you start using this package, it is highly recommended to walk through the examples in the `examples` folder. To run the examples, you need to:

- Install the package with `pip3 install -e .`.
- Get data and place at least 10 events under `examples/data`. The examples use the `particles/event{evt_id}-particles.csv` and `feature_store/{evt_id}` files.
- Read through `examples/configs/reader/default.yaml` and `examples/read.py` to see how the configuration file works.
- Run `examples/read.py`.
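To give a sense of how these pieces fit together, here is a hypothetical sketch of what a script like `examples/read.py` might do. The `DataReader` name, its constructor arguments, and the `read` call are illustrative assumptions, not the package's confirmed API; consult the actual example for the real usage.

```python
# Hypothetical usage sketch; the class name, constructor arguments, and
# read() signature are assumptions, not the package's confirmed API.
from ExaTrkXDataIO import DataReader

# Point the reader at the data format definition file and the data folder.
reader = DataReader(
    'examples/configs/reader/default.yaml',
    'examples/data',
)

# Read a single event; the reader merges the files declared in the
# configuration into one object.
event = reader.read(evt_id=1000)
```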
`EventFileParser` is responsible for loading data from a file and extracting the desired columns from it. To customize file parsing, you can inherit from `EventFileParser` and implement the following two methods (a sketch follows the list):

- `load(self, path: Path) -> Any`: Load your data from the file and return it here.
- `extract(self, data: Any, tag: str) -> np.array`: Extract a column from the data you previously loaded in `load`.
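For example, a parser for plain CSV files might look like the following sketch. The import path of `EventFileParser` is an assumption; adjust it to wherever the package actually exposes the class.

```python
# A minimal sketch of a custom parser. The import path below is an
# assumption about where the package exposes EventFileParser.
from pathlib import Path
from typing import Any

import numpy as np
import pandas as pd

from ExaTrkXDataIO import EventFileParser


class CSVParser(EventFileParser):
    def load(self, path: Path) -> Any:
        # Load the whole file once; the returned object is handed
        # back to extract() for each requested column.
        return pd.read_csv(path)

    def extract(self, data: Any, tag: str) -> np.array:
        # Pick the requested column out of the loaded dataframe.
        return data[tag].to_numpy()
```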
Finally, declare your parser in the configuration file and you are ready to go.
`EventDataProcessor` is responsible for processing data into a form that fits into a column of a dataframe. For flexibility, the processing is broken into a series of procedures, and you are free to define your own custom steps. To customize data processing, you can inherit from `EventDataProcessor` and implement the following method (a sketch follows the list):

- `process(self, data: np.array, **kwargs) -> np.array`: Process the data and return your result here. Each step need not return a 1-D array; it is the user's responsibility to guarantee that the processing pipeline as a whole results in a 1-D array.
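For example, a step that flattens multi-dimensional input into one dimension might look like the following sketch. As above, the import path of `EventDataProcessor` is an assumption.

```python
# A minimal sketch of a custom processing step. The import path below
# is an assumption about where the package exposes EventDataProcessor.
import numpy as np

from ExaTrkXDataIO import EventDataProcessor


class Flatten(EventDataProcessor):
    def process(self, data: np.array, **kwargs) -> np.array:
        # Collapse any multi-dimensional input into a 1-D array so the
        # pipeline's final output can serve as a dataframe column.
        return np.asarray(data).reshape(-1)
```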
Finally, declare your processor in the configuration file and you are ready to go.