psadil/trx_benchmarks

Initial Benchmarks for trx configurations

This repo is designed to follow up on the ongoing conversation about using tabular file formats for trx: tee-ar-ex/trx-python#63

To install

Benchmarks rely on data generously provided at: https://usherbrooke-my.sharepoint.com/:f:/g/personal/rhef1902_usherbrooke_ca/Es5HfYK6fEpAg1o6wMbaQvEBjhRuX5lf-CjshwrVEZpQXg?e=ss5XXA. Place that data in the data directory.

A conda environment specification is in env.yml. A parquet prototype exists but is not used; instead, the benchmarks rely on a bare-bones implementation in trxparquet.py, so the trx-parquet repo does not need to be installed.

To benchmark

Benchmarks are orchestrated with pytest-benchmark. To run them and save the results as JSON (in .benchmarks):

pytest --benchmark-autosave test_raw_io.py

For easier plotting, the resulting JSON files can be converted into a CSV:

pytest-benchmark compare --csv

Existing Benchmarks

raw_io

test_raw_io.py

These benchmarks focus on the I/O for the building blocks of a trx file. The current implementation (that is, trx-python) writes and reads data with numpy. The proposed changes would instead use a format designed for tabular data, such as Apache Parquet or Feather. Are there differences in speed when working with single vectors of data?
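
The kind of comparison described above can be sketched as follows. This is not the code in test_raw_io.py: the helper names (`timed`, `roundtrip_times`), the column name, and the file names are assumptions, and each operation is timed only once, whereas pytest-benchmark repeats calls to get stable statistics. The Parquet and Feather cases run only if pyarrow is installed.

```python
import tempfile
import time
from pathlib import Path

import numpy as np

try:  # pyarrow is optional here; without it only the NumPy case runs
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq
except ImportError:
    pa = None


def timed(fn) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def roundtrip_times(vector: np.ndarray, workdir: Path) -> dict:
    """Return {format: (write_seconds, read_seconds)} for one 1-D vector."""
    npy = workdir / "vector.npy"
    times = {
        "numpy": (timed(lambda: np.save(npy, vector)), timed(lambda: np.load(npy)))
    }
    if pa is not None:
        table = pa.table({"values": vector})  # single-column table
        parquet = str(workdir / "vector.parquet")
        arrow = str(workdir / "vector.feather")
        times["parquet"] = (
            timed(lambda: pq.write_table(table, parquet)),
            timed(lambda: pq.read_table(parquet)),
        )
        times["feather"] = (
            timed(lambda: feather.write_feather(table, arrow)),
            timed(lambda: feather.read_table(arrow)),
        )
    return times


with tempfile.TemporaryDirectory() as tmp:
    results = roundtrip_times(np.random.default_rng(0).random(1_000_000), Path(tmp))

for name, (write_s, read_s) in results.items():
    print(f"{name:8s} write {write_s:.4f}s  read {read_s:.4f}s")
```

Single-shot timings like these are noisy and sensitive to disk caching, which is one reason the actual benchmarks are run through pytest-benchmark.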

raw_mmap

Numpy arrays, feather tables, and parquet files can all be memory-mapped. Are there substantive differences in how long these memory maps take to create?
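
As a rough illustration of what "creating a memory map" means on the numpy side, the sketch below times `np.load` with `mmap_mode="r"` on a saved array. The function name, file name, and array shape are assumptions for the example, and the Feather and Parquet equivalents are not shown.

```python
import tempfile
import time
from pathlib import Path

import numpy as np


def time_mmap_creation(path: Path) -> float:
    """Return the seconds needed to open a memory map to an existing .npy file."""
    start = time.perf_counter()
    mapped = np.load(path, mmap_mode="r")  # maps the file; data is read lazily
    elapsed = time.perf_counter() - start
    assert isinstance(mapped, np.memmap)
    del mapped  # release the map so the file can be removed later
    return elapsed


workdir = Path(tempfile.mkdtemp())
npy_path = workdir / "positions.npy"
rng = np.random.default_rng(0)
np.save(npy_path, rng.random((100_000, 3), dtype=np.float32))

elapsed = time_mmap_creation(npy_path)
print(f"mmap created in {elapsed:.6f}s")
```

Because the array contents are only paged in on access, this measures the cost of setting up the mapping rather than the cost of reading the data.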

trx_io

There are many ways in which the core tools could be used to build a trx file. The reference implementation stores several arrays in a (possibly compressed) zip archive. That method is compared against four alternatives: a similar method with parquet (see TrxParquet.to_file), a method where all parquet tables are stored as separate tables in a directory (TrxParquet.to_dir), an approach where all data is stored in a single, big parquet file (TrxParquet.to_table), and one in which the tables are gathered together in a DuckDB database (TrxParquet.to_duckdb).
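
To make the "several arrays in a (possibly compressed) zip archive" idea concrete, here is a minimal sketch using only the standard library and numpy. It does not reproduce trx-python's actual member naming or on-disk layout, and it does not use the TrxParquet class; the function names and array names are hypothetical.

```python
import io
import tempfile
import zipfile
from pathlib import Path

import numpy as np


def write_arrays_to_zip(arrays: dict, path: Path, compress: bool = False) -> None:
    """Store each array as its own .npy member of a zip archive."""
    method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(path, "w", compression=method) as archive:
        for name, array in arrays.items():
            buffer = io.BytesIO()
            np.save(buffer, array)  # .npy keeps dtype and shape
            archive.writestr(f"{name}.npy", buffer.getvalue())


def read_arrays_from_zip(path: Path) -> dict:
    """Load every .npy member of the archive back into memory."""
    with zipfile.ZipFile(path) as archive:
        return {
            Path(member).stem: np.load(io.BytesIO(archive.read(member)))
            for member in archive.namelist()
        }


# Illustrative stand-ins for tractography data: points and per-streamline offsets.
example = {
    "positions": np.zeros((1_000, 3), dtype=np.float16),
    "offsets": np.arange(0, 1_000, 10, dtype=np.uint64),
}

with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "example.zip"
    write_arrays_to_zip(example, archive_path, compress=True)
    restored = read_arrays_from_zip(archive_path)

print({name: (arr.shape, str(arr.dtype)) for name, arr in restored.items()})
```

Toggling `compress` switches between stored and deflated members, which mirrors the "possibly compressed" choice that the benchmarks vary.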

[TODO]

Additional tests will be added that focus on computation and ergonomics (e.g., differences in query speed, usability, and code structure).
