ELEET: Efficient Learned Query Execution over Text and Tables

ELEET is an execution engine for running multi-modal queries over datasets that contain both text and tables. This repository contains the implementation described in:

Matthias Urban and Carsten Binnig: "ELEET: Efficient Learned Query Execution over Text and Tables." PVLDB, 17(13): 4867-4880, 2024. [PDF]

Project Structure

Code for pre-training (i.e., corpus construction and pre-training scripts) is located in "eleet_pretrain".
Code for everything else (e.g., query plans, MMOps, baselines, benchmark) is located in "eleet".
Additional scripts are located in "scripts" and "slurm", as described below.

Setup

Install ELEET and all necessary requirements.

git clone git@github.com:DataManagementLab/eleet.git  # clone repo
cd eleet
git submodule update --init
conda env create -f environment.yml  # setup environment
conda activate eleet
pip install git+https://github.com/meta-llama/llama.git@llama_v2  # install LLaMA
pip install -e .
cd TaBERT/ && pip install -e . && cd ..  # install TaBERT
python -m spacy download en_core_web_sm
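
After the installation steps above, a quick check that the main packages can be found may save a failed job later. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module names `eleet`, `spacy`, and `en_core_web_sm` are assumed from the setup commands rather than confirmed by the source.

```python
import importlib.util

# Module names assumed from the setup commands above; adjust to your installation.
REQUIRED_MODULES = ["eleet", "spacy", "en_core_web_sm"]


def missing_modules(names):
    """Return the names from `names` that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated `eleet` conda environment. The check only locates the modules and does not import them, so it stays fast even for large packages.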

Download pre-trained model and datasets

gdown 1JIvXC0ajRZRCENMlLD3En7SGoodjP6O3
tar -xzvf pretrained-model.tar.gz
gdown 1hFCwdf8CIWDpE3KdHVfT8uQVrpE1Bv1c
tar -xzvf datasets.tar.gz
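
The README does not state which directories the two archives unpack into. As a hedged sketch, the top-level entries of an archive can be listed before extraction using Python's standard `tarfile` module. The function name `top_level_entries` is hypothetical and not part of ELEET.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a (possibly compressed) tar archive."""
    names = set()
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            # PurePosixPath drops a leading "./", so "./a/b" and "a/b" agree.
            parts = PurePosixPath(member.name).parts
            if parts:
                names.add(parts[0])
    return sorted(names)
```

For example, `top_level_entries("datasets.tar.gz")` shows where the datasets will be placed relative to the current directory.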

Finetuning + Evaluation

Run finetuning: sbatch slurm/rotowire/train-ours.slurm (repeat for the other datasets and models). This stores the finetuned model in models/rotowire/ours/finetuned.
Run evaluation: python eleet/benchmark.py --slurm-mode --use-test-set
Visualize the results using the Jupyter notebooks in scripts/*.ipynb.
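
Repeating the finetuning step for several datasets and models can be scripted. The sketch below assumes that the naming pattern of the one documented example (`slurm/rotowire/train-ours.slurm` and `models/rotowire/ours/finetuned`) carries over to other combinations, which the README does not confirm. The helper `finetune_job` and the dataset list are illustrative only, so check the `slurm` directory for the scripts that actually exist.

```python
def finetune_job(dataset, model):
    """Return (slurm script, output directory) for one dataset/model pair.

    The pattern is inferred from the single documented example and is an assumption.
    """
    script = f"slurm/{dataset}/train-{model}.slurm"
    output_dir = f"models/{dataset}/{model}/finetuned"
    return script, output_dir


if __name__ == "__main__":
    # Hypothetical job matrix; only rotowire/ours is documented in the README.
    for dataset in ["rotowire", "trex"]:
        for model in ["ours"]:
            script, output_dir = finetune_job(dataset, model)
            print(f"sbatch {script}  # expected output: {output_dir}")
```

The script only prints the `sbatch` commands, so they can be reviewed before anything is submitted to the cluster.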

Pre-Training and generation of datasets

You can download the pre-training dataset here

Alternatively, you can also generate the pre-training dataset from Wikidata:

Run MongoDB (installation guide: https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/) and set the environment variables MONGO_USER, MONGO_PASSWORD, MONGO_HOST, MONGO_PORT, and MONGO_DB.
Start data pre-processing with python scripts/load_data.py trex-wikidata (the preprocessed data will appear in datasets/preprocessed_data/preprocessed_trex-wikidata*).
Use slurm/pretrain.slurm for pre-training (first adjust the path in the file so that it points to the pre-training dataset). This stores the pretrained model in models/pretrained.
Generate the TREx dataset: python eleet/datasets/trex/generate.py
Generate the Rotowire dataset: python eleet/datasets/rotowire/generate.py
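
Before starting the pre-processing, it can help to confirm that all five MongoDB variables are set. The following is a minimal pre-flight sketch under stated assumptions: the function `mongo_uri` is hypothetical, and how ELEET's own scripts read these variables is not described in this README. It assembles a standard `mongodb://` connection string and percent-encodes the credentials.

```python
import os
from urllib.parse import quote_plus

REQUIRED_VARS = ("MONGO_USER", "MONGO_PASSWORD", "MONGO_HOST", "MONGO_PORT", "MONGO_DB")


def mongo_uri(env):
    """Build a MongoDB connection URI from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    # Credentials are percent-encoded so that characters like '@' or '/' stay valid.
    user = quote_plus(env["MONGO_USER"])
    password = quote_plus(env["MONGO_PASSWORD"])
    return f"mongodb://{user}:{password}@{env['MONGO_HOST']}:{env['MONGO_PORT']}/{env['MONGO_DB']}"


if __name__ == "__main__":
    try:
        mongo_uri(os.environ)
        print("MongoDB settings look complete.")
    except ValueError as error:
        print(error)
```

The script deliberately does not print the URI, because it contains the password.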

Reference

If you use the code or the benchmarks from this repository, please cite our paper:

@article{eleet,
  title={ELEET: Efficient Learned Query Execution over Text and Tables},
  author={Matthias Urban and Carsten Binnig},
  journal={Proceedings of the VLDB Endowment},
  volume={17},
  number={13},
  pages={4867--4880},
  year={2024},
  publisher={VLDB Endowment}
}
