This repo is designed to follow up on the ongoing conversation about using tabular file formats for trx: tee-ar-ex/trx-python#63
Benchmarks rely on data generously provided at: https://usherbrooke-my.sharepoint.com/:f:/g/personal/rhef1902_usherbrooke_ca/Es5HfYK6fEpAg1o6wMbaQvEBjhRuX5lf-CjshwrVEZpQXg?e=ss5XXA. Place that data in the data directory.
A conda environment specification is in env.yml. A parquet prototype exists but is not used; instead, the benchmarks rely on a bare-bones implementation in trxparquet.py, so the trx-parquet repo does not need to be installed.
Benchmarks are orchestrated with pytest-benchmark. To run the benchmarks and save the results as JSON (in .benchmarks):
pytest --benchmark-autosave test_raw_io.py
For easier plotting, the resulting JSON files can be converted to CSV:
pytest-benchmark compare --csv
These benchmarks focus on the I/O for the building blocks of a trx file. The current implementation (that is, trx-python) writes and reads data with numpy. The proposed changes would instead leverage a format designed for tabular data, such as Apache Parquet or Feather. Are there differences in speed when working with single vectors of data?
Numpy arrays, Feather tables, and Parquet files can all be memory-mapped. Are there substantive differences in how long these memory maps take to create?
There are many ways in which the core tools could be used to build a trx file. The reference implementation stores several arrays in a (possibly compressed) zip archive. That method is compared against a similar method using parquet (see TrxParquet.to_file), a method where all parquet tables are stored as separate tables in a directory (TrxParquet.to_dir), an approach where all data is stored in a single, big parquet file (TrxParquet.to_table), and one in which tables are gathered together in a duckdb database (TrxParquet.to_duckdb).
Additional tests will be added that focus on computation (e.g., query speed, usability, and code structure).