ELEET is an execution engine for running multi-modal queries over datasets containing text and tables. This repository contains the implementation described in:
Matthias Urban and Carsten Binnig: "ELEET: Efficient Learned Query Execution over Text and Tables.", PVLDB, 17(13): 4867-4880, 2024. [PDF]
- Code regarding Pre-training (i.e. corpus construction and pre-training scripts) is located in "eleet_pretrain"
- Code for everything else (e.g. query plans, MMOps, baselines, benchmark) is located in "eleet".
- Some scripts are located in "scripts" and "slurm" as described below.
Install ELEET and all necessary requirements.
git clone git@github.com:DataManagementLab/eleet.git # clone repo
cd eleet
git submodule update --init
conda env create -f environment.yml # setup environment
conda activate eleet
pip install git+https://github.com/meta-llama/llama.git@llama_v2 # install LLaMA
pip install -e .
cd TaBERT/ && pip install -e . && cd .. # install TaBERT
python -m spacy download en_core_web_sm
gdown 1JIvXC0ajRZRCENMlLD3En7SGoodjP6O3 # download pre-trained model
tar -xzvf pretrained-model.tar.gz
gdown 1hFCwdf8CIWDpE3KdHVfT8uQVrpE1Bv1c # download datasets
tar -xzvf datasets.tar.gz
- Run finetuning:
sbatch slurm/rotowire/train-ours.slurm
(Repeat for other datasets and models.) --> Will store the finetuned model in models/rotowire/ours/finetuned
- Run evaluation:
python eleet/benchmark.py --slurm-mode --use-test-set
- Visualize results using Jupyter notebooks located in
scripts/*.ipynb
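The "repeat for other datasets and models" step could be scripted roughly as below. This is only a sketch: it assumes that the other training scripts follow the same slurm/<dataset>/train-<model>.slurm naming as the rotowire example, which this README does not confirm, and the helper name finetune_commands is made up for illustration. It prints the commands rather than submitting them.

```python
from pathlib import Path


def finetune_commands(slurm_dir="slurm"):
    """Return one sbatch command per training script found under slurm_dir.

    Assumption: other datasets and models use the same
    slurm/<dataset>/train-<model>.slurm layout as the rotowire example.
    """
    scripts = sorted(Path(slurm_dir).glob("*/train-*.slurm"))
    return [["sbatch", script.as_posix()] for script in scripts]


if __name__ == "__main__":
    for command in finetune_commands():
        # Print instead of submitting, so the list can be reviewed first.
        print(" ".join(command))
```

Run it from the repository root and, once the list looks right, paste the printed lines into the shell.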
- You can download the pre-training dataset here
Alternatively, you can also generate the pre-training dataset from Wikidata:
- Run MongoDB (installation guide: https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/) and set the environment variables MONGO_USER, MONGO_PASSWORD, MONGO_HOST, MONGO_PORT and MONGO_DB.
- Start data pre-processing: python scripts/load_data.py trex-wikidata --> preprocessed data will appear in datasets/preprocessed_data/preprocessed_trex-wikidata*
- Use slurm/pretrain.slurm for pre-training (Adjust path in file first to point to pre-training dataset). --> Will store pretrained model in models/pretrained
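Before starting the pre-processing, it can help to check that all five MongoDB variables are set. The following is a minimal pre-flight sketch, not part of ELEET: the function name mongo_uri_from_env is hypothetical, and how scripts/load_data.py actually consumes these variables is not shown here. It only assembles a standard mongodb:// connection URI from them.

```python
import os
from urllib.parse import quote_plus

REQUIRED_VARS = ("MONGO_USER", "MONGO_PASSWORD", "MONGO_HOST", "MONGO_PORT", "MONGO_DB")


def mongo_uri_from_env(env=None):
    """Build a standard MongoDB connection URI from the five variables.

    Raises KeyError naming every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Percent-encode the credentials so characters like '@' or ':' stay valid in the URI.
    user = quote_plus(env["MONGO_USER"])
    password = quote_plus(env["MONGO_PASSWORD"])
    return f"mongodb://{user}:{password}@{env['MONGO_HOST']}:{env['MONGO_PORT']}/{env['MONGO_DB']}"
```

Calling mongo_uri_from_env() without arguments reads the current shell environment. A KeyError at this point means at least one variable still has to be exported.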
- Generate TREx Dataset:
python eleet/datasets/trex/generate.py
- Generate Rotowire Dataset:
python eleet/datasets/rotowire/generate.py
If you use the code or the benchmarks from this repository, please cite our paper:
@article{eleet,
title={ELEET: Efficient Learned Query Execution over Text and Tables},
author = {Matthias Urban and Carsten Binnig},
journal={Proceedings of the VLDB Endowment},
volume={17},
number={13},
pages={4867--4880},
year={2024},
publisher={VLDB Endowment}
}