FinPR

Dataset URL: https://drive.google.com/drive/folders/1369YUMk-jdQFOjKb-48BWUBxwCiOA-3R?usp=sharing

Description

This repository contains the code and data for the FinPR project.

Code

The code is organized into the following directories:

The models directory contains the following .ipynb files:

  • models/e5.ipynb: Embeds passages into vectors using the E5 model.
  • models/instructor.ipynb: Embeds passages into vectors using the instructor model.
  • models/rocketqa.ipynb: Embeds passages into vectors using RocketQA.
  • models/sentence_transformer.ipynb: Embeds passages into vectors using sentence-t5, all-MiniLM-L6-v2 and paraphrase-multilingual-MiniLM-L12-v2.

The src/baselines directory contains the scripts that compute accuracy for BM25, RocketQA and the other models.
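As a rough illustration of what an accuracy computation over embeddings can look like, here is a minimal sketch. It is not the code in src/baselines. It assumes the metric is top-k accuracy under cosine similarity, which this README does not state. The function names `cosine` and `top_k_accuracy` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_accuracy(query_vecs, passage_vecs, gold_ids, k=1):
    """Fraction of queries whose gold passage index is among the k most similar passages.

    Assumption: each query has exactly one gold passage, given as an index
    into passage_vecs. The real evaluation may be defined differently.
    """
    if not query_vecs:
        return 0.0
    hits = 0
    for query, gold in zip(query_vecs, gold_ids):
        # Rank all passage indices by similarity to the query, best first.
        ranked = sorted(
            range(len(passage_vecs)),
            key=lambda i: cosine(query, passage_vecs[i]),
            reverse=True,
        )
        if gold in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy 2-dimensional embeddings, made up for illustration only.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 2]

print(top_k_accuracy(queries, passages, gold, k=1))
print(top_k_accuracy(queries, passages, gold, k=2))
```

With these toy vectors the first query retrieves its gold passage at rank 1, while the second only finds it at rank 2, so the score rises when k goes from 1 to 2.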

The src/data_preparation directory contains the data processing scripts.

Data

The data used in this project includes the following files:

  • data/groundtruth/$model/: Contains the chunk embeddings, passage embeddings and query embeddings.
  • data/groundtruth/passages_chunk_avg_0708: Contains the chunks that each document is split into.
  • data/groundtruth/passages_table2text_0708: Contains the passages generated by our Document Parser.
  • data/groundtruth/passages_table2text_tid_0708: Contains the mapping between the generated passages and the original passages extracted from leaf nodes. For example:
"pid": 441,
"tid": 442,
"tag": "ttt"

This means that, after the generated passages are inserted into the original passage list, passage No. 442 in the new list comes from passage No. 441 in the original list, and the tag of passage No. 442 is ttt. The tags have the following meanings:

p: an original passage extracted from a leaf node.
ttt: the generated description for a table.
tttp: a paragraph within the generated description for a table.
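
The pid/tid/tag records described above can be queried with a small lookup, sketched below. This is only an illustration. It assumes the records have already been loaded into a list of dictionaries with the keys "pid", "tid" and "tag", and the on-disk format of the file is not specified here. The helper names `index_by_tid`, `origin_of` and `tids_with_tag` are hypothetical, and every record other than the 441/442 example is made up.

```python
# Records in the shape shown above. Only the 441 -> 442 entry comes from
# the README; the other two are invented for illustration.
records = [
    {"pid": 440, "tid": 441, "tag": "p"},
    {"pid": 441, "tid": 442, "tag": "ttt"},
    {"pid": 441, "tid": 443, "tag": "tttp"},
]


def index_by_tid(records):
    """Map each position in the new passage list (tid) to its record."""
    return {record["tid"]: record for record in records}


def origin_of(tid, index):
    """Return (pid, tag) for a passage in the new list, or None if unknown."""
    record = index.get(tid)
    if record is None:
        return None
    return record["pid"], record["tag"]


def tids_with_tag(records, tag):
    """Sorted positions in the new list whose passage carries the given tag."""
    return sorted(record["tid"] for record in records if record["tag"] == tag)


index = index_by_tid(records)
print(origin_of(442, index))          # passage 442 traces back to 441, tag ttt
print(tids_with_tag(records, "ttt"))  # positions of generated table descriptions
```

Filtering on the tag p in the same way would recover only the original passages, which may be useful when comparing retrieval with and without the generated table descriptions.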

License

This project is licensed under the MIT License.