Many documents contain a mixture of content types, including text and images.
Yet, information captured in images is lost in most RAG applications.
With the emergence of multimodal LLMs, like GPT-4o, it is worth considering how to utilize images in RAG:
- Use a multimodal LLM (such as GPT-4o, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve image summaries with a reference to the raw image
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
This web app project highlights Option 3, as illustrated in the image shown below.
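To make Option 3 concrete, here is a minimal sketch of answer synthesis with GPT-4o: retrieved text chunks and a raw image (base64-encoded) are sent together in one request. This is only an illustration using the OpenAI Python client; the actual prompt and chain live in `query_rag_chain.py` and may differ.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_image(question: str, text_chunks: list[str], image_path: str) -> str:
    """Send retrieved text chunks plus a raw image to GPT-4o for answer synthesis."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    context = "\n\n".join(text_chunks)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Answer the question using the context and the image.\n\n"
                         f"Context:\n{context}\n\nQuestion: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```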
- We will use Unstructured to parse images, text, and tables from documents (PDFs).
- We will use the multi-vector retriever with Chroma to store raw text, tables, and images along with their summaries for retrieval (a minimal wiring sketch follows this list).
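The sketch below shows one way to wire the multi-vector retriever with LangChain and Chroma: summaries are embedded in the vector store, while the raw text, tables, and images live in a separate docstore linked by a shared `doc_id`. The collection name and parameters here are assumptions; the actual setup lives in `preprocess_vector_db.py` (the project persists the docstore as `data/docstore.pkl`).

```python
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vector store holds the summaries; the docstore holds the raw content.
vectorstore = Chroma(
    collection_name="multimodal_rag",  # assumed collection name
    embedding_function=OpenAIEmbeddings(),
    persist_directory="data/chroma_langchain_db",
)
docstore = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)


def index(summaries: list[str], raw_contents: list[str]) -> None:
    """Embed summaries for retrieval and link them to the raw content by doc_id."""
    doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, raw_contents)))
```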
The PDF partitioning used by Unstructured will use:
- tesseract for Optical Character Recognition (OCR)
- poppler for PDF rendering and processing
Refer to the poppler and tesseract installation instructions for your system.
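For reference, a typical `partition_pdf` call that extracts text, tables, and images looks like the sketch below; the input path is hypothetical and the exact arguments used in `render_preprocess_pdf.py` may differ.

```python
from unstructured.partition.pdf import partition_pdf

# Partition a PDF into text, table, and image elements.
# "hi_res" uses a layout model plus tesseract OCR; poppler renders the pages.
elements = partition_pdf(
    filename="data/content/example.pdf",            # hypothetical input path
    strategy="hi_res",
    infer_table_structure=True,                      # keep table structure in element metadata
    extract_images_in_pdf=True,
    extract_image_block_output_dir="data/content",   # where extracted images are written
)

tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category == "NarrativeText"]
```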
The front end only showcases the results and demonstrates how to integrate the APIs.
First, clone the repository to your local machine using the command:
git clone https://github.com/Abdulkadir19997/multimodal_rag_app.git
Keep the project architecture the same:
├── app
│ ├── ai_services
│ │ ├── create_image_summaries.py
│ │ ├── create_table_summaries.py
│ │ ├── preprocess_vector_db.py
│ │ ├── query_rag_chain.py
│ │ ├── render_preprocess_pdf.py
│ │ ├── __init__.py
│ ├── routes
│ │ ├── inference.py
│ │ ├── preprocess.py
│ │ ├── __init__.py
│ ├── schemas
│ │ ├── schemas.py
│ │ ├── __init__.py
│ ├── __init__.py
├── config.py
├── data
│ ├── chroma_langchain_db
│ │ ├── c4cb8f68-5169-4085-aede-44008a3a4ca7
│ │ ├── chroma.sqlite3
│ ├── content
│ │ ├── image_summaries.json
│ │ ├── table_summaries.json
│ ├── docstore.pkl
│ ├── readme_images
│ │ ├── multimodal_graph.png
├── front_end.py
├── main.py
├── readme.md
├── requirements.txt
├── .env
├── .gitignore
├── __init__.py
Inside the downloaded 'multimodal_rag_app' folder, create a Python environment (Python 3.10.12 was used). For example, to create an environment named 'multi_rag', use:
python -m venv multi_rag
Activate the environment with:
For Windows
.\multi_rag\Scripts\activate
For Linux
source multi_rag/bin/activate
After confirming that the multi_rag environment is active, install all necessary libraries from the 'requirements.txt' file:
pip install -r requirements.txt
On Debian/Ubuntu, install the poppler and tesseract system dependencies with:
sudo apt-get install poppler-utils tesseract-ocr
Create a .env file inside the 'multimodal_rag_app' folder and add the keys as in the following example:
# openai api key
OPENAI_API_KEY = "your_open_ai_api_key"
# langsmith traces
LANGCHAIN_API_KEY = "your_lang_smith_api_key"
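These variables are read from the environment at startup. Below is a minimal sketch of how they can be loaded with python-dotenv; the actual loading is presumably done in `config.py` and may differ.

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the project root

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")

if OPENAI_API_KEY is None:
    raise RuntimeError("OPENAI_API_KEY is missing; check your .env file")
```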
In the active 'multi_rag' environment, run the 'front_end.py' file with:
streamlit run front_end.py
Open a new terminal inside the 'multimodal_rag_app' folder and activate the 'multi_rag' environment again:
For Windows
.\multi_rag\Scripts\activate
For Linux
source multi_rag/bin/activate
In the second terminal with the 'multi_rag' environment active, start the FastAPI service with:
uvicorn main:app --host 127.0.0.1 --port 5004 --reload
To run locally, keep two terminals open, each with the 'multi_rag' environment active: one running 'streamlit run front_end.py', and the other running 'uvicorn main:app --host 127.0.0.1 --port 5004 --reload'.
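The Streamlit front end talks to the FastAPI service over HTTP. The snippet below is only an illustration of such a call; the endpoint paths (`/preprocess`, `/inference`) and payload fields are assumptions based on the route file names and may not match the actual schemas in `schemas.py`.

```python
import requests

API_URL = "http://127.0.0.1:5004"

# Upload a PDF for preprocessing (hypothetical endpoint and payload).
with open("example.pdf", "rb") as f:
    requests.post(f"{API_URL}/preprocess", files={"file": f}, timeout=300)

# Ask a question against the indexed content (hypothetical endpoint and payload).
response = requests.post(
    f"{API_URL}/inference",
    json={"question": "What does the chart on page 3 show?"},
    timeout=120,
)
print(response.json())
```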
Many thanks to these excellent projects
The current version is 1.0. Development is ongoing, and any support or suggestions are welcome. Please reach out to me: [email protected] & LinkedIn