Dataset URL: https://drive.google.com/drive/folders/1369YUMk-jdQFOjKb-48BWUBxwCiOA-3R?usp=sharing
This repository contains the code and data for the project.
The code is organized into the following directories:
The models directory contains the following .ipynb files:
models/e5.ipynb: Embed passages to vectors using the E5 model.
models/instructor.ipynb: Embed passages to vectors using the instructor model.
models/rocketqa.ipynb: Embed passages to vectors using rocketqa.
models/sentence_transformer.ipynb: Embed passages to vectors using sentence-t5, all-MiniLM-L6-v2 and paraphrase-multilingual-MiniLM-L12-v2.
The src/baselines directory contains the accuracy evaluation scripts for BM25, RocketQA and other models.
The src/data_preparation directory contains the data processing scripts.
The data used in this project includes the following files:
data/groundtruth/$model/: Contains the chunk embeddings, passage embeddings and query embeddings.
data/groundtruth/passages_chunk_avg_0708: Each document is split into chunks.
data/groundtruth/passages_table2text_0708: Contains the passages generated by our Document Parser.
data/groundtruth/passages_table2text_tid_0708: Contains the mapping between the generated passages and the original passages extracted from leaf nodes. For example:
"pid": 441,
"tid": 442,
"tag": "ttt"
This means that, after the generated passages are inserted into the original passage list, passage No. 442 in the new passage list comes from passage No. 441 in the original list, and the tag of passage No. 442 is ttt. The tags have the following meanings:
p: the original passage extracted from a leaf node.
ttt: the generated description for a table.
tttp: a paragraph in the generated description for a table.
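The pid/tid/tag mapping described above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: it assumes the mapping is available as a JSON list of records with exactly the three keys shown in the example, and the names build_tid_index and TAG_MEANINGS are hypothetical helpers introduced here for illustration. The actual on-disk layout of passages_table2text_tid_0708 may differ.

```python
import json

# Tag meanings as documented in this README.
TAG_MEANINGS = {
    "p": "original passage extracted from a leaf node",
    "ttt": "generated description for a table",
    "tttp": "paragraph in the generated description for a table",
}


def build_tid_index(records):
    """Map each tid (position in the new passage list) to (pid, tag)."""
    return {r["tid"]: (r["pid"], r["tag"]) for r in records}


# Assumed record format: one JSON object per mapping entry.
# The sample below just reuses the example values from this README.
sample = json.loads('[{"pid": 441, "tid": 442, "tag": "ttt"}]')

index = build_tid_index(sample)
pid, tag = index[442]
print(f"new passage 442 comes from original passage {pid}: {TAG_MEANINGS[tag]}")
```

Keying the index by tid makes it cheap to trace any passage in the new list back to its source passage in the original list.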
This project is licensed under the MIT License.