Enabling effective multi-modal financial passage retrieval with structural parsing and Table-to-text generation

lkpsg/FinPR

FinPR

Dataset URL: https://drive.google.com/drive/folders/1369YUMk-jdQFOjKb-48BWUBxwCiOA-3R?usp=sharing

Description

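This repository contains the code and data for FinPR, a project on multi-modal financial passage retrieval with structural parsing and table-to-text generation.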

Code

The code is organized into the following directories:

The models directory contains the following .ipynb files:

  • models/e5.ipynb: Embeds passages into vectors using the E5 model.
  • models/instructor.ipynb: Embeds passages into vectors using the Instructor model.
  • models/rocketqa.ipynb: Embeds passages into vectors using RocketQA.
  • models/sentence_transformer.ipynb: Embeds passages into vectors using sentence-t5, all-MiniLM-L6-v2, and paraphrase-multilingual-MiniLM-L12-v2.

The src/baselines directory contains the scripts that compute accuracy for BM25, RocketQA, and the other models.

The src/data_preparation directory contains the data processing scripts.
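
To make the evaluation step concrete, here is a minimal sketch of a top-k accuracy computation over query and passage embeddings, using cosine similarity. It is not the code from src/baselines: the function names, the toy vectors, and the choice of top-k accuracy as the metric are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, passage_vecs, k):
    """Return the indices of the k passages most similar to the query."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def top_k_accuracy(query_vecs, passage_vecs, gold, k):
    """Fraction of queries whose gold passage index appears in the top k."""
    hits = sum(
        1 for q, g in zip(query_vecs, gold) if g in top_k(q, passage_vecs, k)
    )
    return hits / len(query_vecs)


# Toy 2-d embeddings, invented for illustration only.
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 2]  # hypothetical ground-truth passage index for each query

print(top_k_accuracy(queries, passages, gold, k=1))
print(top_k_accuracy(queries, passages, gold, k=2))
```

In practice the vectors would be loaded from the embeddings stored under data/groundtruth/$model/ rather than typed in by hand.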

Data

The data used in this project includes the following files:

  • data/groundtruth/$model/: Contains the chunk embeddings, passage embeddings, and query embeddings for the given model.
  • data/groundtruth/passages_chunk_avg_0708: Contains the documents, each split into chunks.
  • data/groundtruth/passages_table2text_0708: Contains the passages generated by our Document Parser.
  • data/groundtruth/passages_table2text_tid_0708: Contains the mapping between the generated passages and the original passages extracted from leaf nodes. For example:
"pid": 441,
"tid": 442,
"tag": "ttt"

This means that, after the generated passages are inserted into the original passage list, passage No. 442 in the new list comes from passage No. 441 in the original list, and its tag is ttt. The tags have the following meanings:

p: an original passage extracted from a leaf node.
ttt: a generated description of a table.
tttp: a paragraph within the generated description of a table.
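
As an illustration of how these mapping records might be used, the sketch below looks up the original passage behind an entry in the new list and groups entries by tag. Only the pid 441 / tid 442 record comes from the example above. The other records and the helper names build_tid_index, origin_of, and group_by_tag are invented for illustration, and the actual layout of passages_table2text_tid_0708 may differ.

```python
from collections import defaultdict

# Mapping records in the shape shown above.
# Only the pid 441 / tid 442 entry is from the README; the others are invented.
records = [
    {"pid": 440, "tid": 440, "tag": "p"},
    {"pid": 441, "tid": 442, "tag": "ttt"},
    {"pid": 441, "tid": 443, "tag": "tttp"},
]


def build_tid_index(records):
    """Map each position in the new passage list (tid) to its record."""
    return {r["tid"]: r for r in records}


def origin_of(tid, index):
    """Return (pid, tag) for the passage at position tid in the new list."""
    record = index[tid]
    return record["pid"], record["tag"]


def group_by_tag(records):
    """Collect the new-list positions (tids) that carry each tag."""
    by_tag = defaultdict(list)
    for r in records:
        by_tag[r["tag"]].append(r["tid"])
    return dict(by_tag)


index = build_tid_index(records)
print(origin_of(442, index))
print(group_by_tag(records))
```

Grouping by tag makes it straightforward to, for example, restrict retrieval to original passages (p) or to generated table descriptions (ttt and tttp).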

License

This project is licensed under the MIT License.
