Skip to content

Repository for ELEET: Efficient Learned Query Execution over Text and Tables

License

Notifications You must be signed in to change notification settings

DataManagementLab/eleet

Repository files navigation

ELEET: Efficient Learned Query Execution over Text and Tables

ELEET is an Execution Engine to run multi-modal queries over datasets containing texts and tables. This is the implementation described in

Matthias Urban and Carsten Binnig: "ELEET: Efficient Learned Query Execution over Text and Tables.", PVLDB, 17(13): 4867-4880, 2024. [PDF]

Image of ELEET

Project Structure

  • Code regarding Pre-training (i.e. corpus construction and pre-training scripts) is located in "eleet_pretrain"
  • Code for everything else (e.g. query plans, MMOps, baselines, benchmark) is located in "eleet".
  • Some scripts are located in "scripts" and "slurm" as described below.

Setup

Install ELEET and all necessary requirements.

git clone [email protected]:DataManagementLab/eleet.git  # clone repo
cd eleet
git submodule update --init
conda env create -f environment.yml  # setup environment
conda activate eleet
pip install git+https://github.com/meta-llama/llama.git@llama_v2  # install LLaMA
pip install -e .
cd TaBERT/ && pip install -e . && cd ..  # install TaBERT
python -m spacy download en_core_web_sm

Download pre-trained model and datasets

gdown 1JIvXC0ajRZRCENMlLD3En7SGoodjP6O3
tar -xzvf pretrained-model.tar.gz
gdown 1hFCwdf8CIWDpE3KdHVfT8uQVrpE1Bv1c
tar -xzvf datasets.tar.gz

Finetuning + Evaluation

  1. Run finetuning: sbatch slurm/rotowire/train-ours.slurm (Repeat for other datasets and models). --> Will store finetuned model in models/rotowire/ours/finetuned
  2. Run evaluation: python eleet/benchmark.py --slurm-mode --use-test-set
  3. Visualize results using Jupyter notebooks located in scripts/*.ipynb

Pre-Training and generation of datasets

  1. You can download the pre-training dataset here

Alternatively, you can also generate the pre-training dataset from Wikidata:

  1. Run MongoDB and set environment variables (MONGO_USER, MONGO_PASSWORD, MONGO_HOST, MONGO_PORT, MONGO_DB) https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/
  2. Start data pre-processing: python scripts/load_data.py trex-wikidata --> preprocessed data will appear in datasets/preprocessed_data/preprocessed_trex-wikidata*
  3. Use slurm/pretrain.slurm for pre-training (Adjust path in file first to point to pre-training dataset). --> Will store pretrained model in models/pretrained
  4. Generate TREx Dataset: python eleet/datasets/trex/generate.py
  5. Generate Rotowire Dataset: python eleet/datasets/rotowire/generate.py

Reference

If you use code or the benchmarks of this repository then please cite our paper:

@inproceedings{eleet,
  title={ELEET: Efficient Learned query Execution over Text and Tables},
  author = {Matthias Urban and Carsten Binnig},
  journal={Proceedings of the VLDB Endowment},
  volume={17},
  number={13},
  pages={4867--4880},
  year={2024},
  publisher={VLDB Endowment}
}

About

Repository for ELEET: Efficient Learned Query Execution over Text and Tables

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published