This repo is designed to follow up on the ongoing conversation about using tabular file formats for trx: tee-ar-ex/trx-python#63
Benchmarks rely on data generously provided at: https://usherbrooke-my.sharepoint.com/:f:/g/personal/rhef1902_usherbrooke_ca/Es5HfYK6fEpAg1o6wMbaQvEBjhRuX5lf-CjshwrVEZpQXg?e=ss5XXA. Place that data in the data directory.
A conda environment specification is in env.yml. A parquet prototype exists but is not used; instead, the benchmarks rely on a bare-bones implementation in trxparquet.py, so the trx-parquet repo does not need to be installed.
Benchmarks are orchestrated with pytest-benchmark. To run the benchmarks and save the results as JSON (in .benchmarks):
pytest --benchmark-autosave test_raw_io.py
For easier plotting, the resulting JSON files can be converted to CSV:
pytest-benchmark compare --csv
These benchmarks focus on the I/O for the building blocks of a trx file. The current implementation (that is, trx-python) writes and reads data with numpy. The proposed changes would instead leverage a format designed for tabular data, such as Apache Parquet or Feather. Are there differences in speed when working with single vectors of data?
Numpy arrays, Feather tables, and Parquet files can all be memory-mapped. Are there substantive differences in how long these memory maps take to create?
There are many ways in which the core tools could be used to build a trx file. The reference implementation stores several arrays in a (possibly compressed) zip archive. That method is compared against a similar method using parquet (see TrxParquet.to_file), a method where all parquet tables are stored as separate tables in a directory (TrxParquet.to_dir), an approach where all data is stored in a single, big parquet file (TrxParquet.to_table), and one in which tables are gathered together in a duckdb database (TrxParquet.to_duckdb).
Additional tests will be added that focus on computation (e.g., query speed, usability, and code structure).