Many documents contain a mixture of content types, including text and images.
Yet, information captured in images is lost in most RAG applications.
With the emergence of multimodal LLMs, like GPT-4o, it is worth considering how to utilize images in RAG:
- Use a multimodal LLM (such as GPT-4o, LLaVA, or FUYU-8b) to produce text summaries from images
- Embed and retrieve image summaries with a reference to the raw image
- Pass raw images and text chunks to a multimodal LLM for answer synthesis
This web app project highlights Option 3, as illustrated in the image shown below.
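To make Option 3 concrete, here is a minimal sketch of answer synthesis with GPT-4o: retrieved text chunks and a raw image (base64-encoded) are sent together in one request. This is only an illustration using the OpenAI Python client; the actual prompt and chain live in `query_rag_chain.py` and may differ.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_image(question: str, text_chunks: list[str], image_path: str) -> str:
    """Send retrieved text chunks plus a raw image to GPT-4o for answer synthesis."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    context = "\n\n".join(text_chunks)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Answer the question using the context and the image.\n\n"
                         f"Context:\n{context}\n\nQuestion: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```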
- We will use Unstructured to parse images, text, and tables from documents (PDFs).
- We will use the multi-vector retriever with Chroma to store raw text, tables, and images along with their summaries for retrieval (a minimal wiring sketch follows this list).
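The sketch below shows one way to wire the multi-vector retriever with LangChain and Chroma: summaries are embedded in the vector store, while the raw text, tables, and images live in a separate docstore linked by a shared `doc_id`. The collection name and parameters here are assumptions; the actual setup lives in `preprocess_vector_db.py` (the project persists the docstore as `data/docstore.pkl`).

```python
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# The vector store holds the summaries; the docstore holds the raw content.
vectorstore = Chroma(
    collection_name="multimodal_rag",  # assumed collection name
    embedding_function=OpenAIEmbeddings(),
    persist_directory="data/chroma_langchain_db",
)
docstore = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)


def index(summaries: list[str], raw_contents: list[str]) -> None:
    """Embed summaries for retrieval and link them to the raw content by doc_id."""
    doc_ids = [str(uuid.uuid4()) for _ in raw_contents]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, raw_contents)))
```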
The PDF partitioning used by Unstructured will use:
- tesseract for Optical Character Recognition (OCR)
- poppler for PDF rendering and processing
Refer to the poppler and tesseract installation instructions for your system.
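For reference, a typical `partition_pdf` call that extracts text, tables, and images looks like the sketch below; the input path is hypothetical and the exact arguments used in `render_preprocess_pdf.py` may differ.

```python
from unstructured.partition.pdf import partition_pdf

# Partition a PDF into text, table, and image elements.
# "hi_res" uses a layout model plus tesseract OCR; poppler renders the pages.
elements = partition_pdf(
    filename="data/content/example.pdf",            # hypothetical input path
    strategy="hi_res",
    infer_table_structure=True,                      # keep table structure in element metadata
    extract_images_in_pdf=True,
    extract_image_block_output_dir="data/content",   # where extracted images are written
)

tables = [el for el in elements if el.category == "Table"]
texts = [el for el in elements if el.category == "NarrativeText"]
```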
The front end only showcases the results and demonstrates how to integrate the APIs.
First, clone the repository to your local machine using the command:
git clone https://github.com/Abdulkadir19997/multimodal_rag_app.git
Keep the project architecture the same:
├── app
│ ├── ai_services
│ │ ├── create_image_summaries.py
│ │ ├── create_table_summaries.py
│ │ ├── preprocess_vector_db.py
│ │ ├── query_rag_chain.py
│ │ ├── render_preprocess_pdf.py
│ │ ├── __init__.py
│ ├── routes
│ │ ├── inference.py
│ │ ├── preprocess.py
│ │ ├── __init__.py
│ ├── schemas
│ │ ├── schemas.py
│ │ ├── __init__.py
│ ├── __init__.py
├── config.py
├── data
│ ├── chroma_langchain_db
│ │ ├── c4cb8f68-5169-4085-aede-44008a3a4ca7
│ │ ├── chroma.sqlite3
│ ├── content
│ │ ├── image_summaries.json
│ │ ├── table_summaries.json
│ ├── docstore.pkl
│ ├── readme_images
│ │ ├── multimodal_graph.png
├── front_end.py
├── main.py
├── readme.md
├── requirements.txt
├── .env
├── .gitignore
├── __init__.py
Inside the downloaded 'multimodal_rag_app' folder, create a Python environment (Python 3.10.12 was used). For example, to create an environment named 'multi_rag', use:
python -m venv multi_rag
Activate the environment with:
For Windows
.\multi_rag\Scripts\activate
For Linux
source multi_rag/bin/activate
After confirming that the multi_rag environment is active, install all necessary libraries from the 'requirements.txt' file:
pip install -r requirements.txt
On Debian/Ubuntu, install the poppler and tesseract system dependencies with:
sudo apt-get install poppler-utils tesseract-ocr
Create a .env file inside the 'multimodal_rag_app' folder and add the keys as in the following example:
# openai api key
OPENAI_API_KEY = "your_open_ai_api_key"
# langsmith traces
LANGCHAIN_API_KEY = "your_lang_smith_api_key"
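These variables are read from the environment at startup. Below is a minimal sketch of how they can be loaded with python-dotenv; the actual loading is presumably done in `config.py` and may differ.

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the project root

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")

if OPENAI_API_KEY is None:
    raise RuntimeError("OPENAI_API_KEY is missing; check your .env file")
```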
In the active 'multi_rag' environment, run the 'front_end.py' file with:
streamlit run front_end.py
Open a new terminal inside the 'multimodal_rag_app' folder and activate the 'multi_rag' environment again:
For Windows
.\multi_rag\Scripts\activate
For Linux
source multi_rag/bin/activate
In the second terminal with the 'multi_rag' environment active, start the FastAPI service with:
uvicorn main:app --host 127.0.0.1 --port 5004 --reload
To run locally, keep two terminals open, each with the 'multi_rag' environment active: one running 'streamlit run front_end.py', and the other running 'uvicorn main:app --host 127.0.0.1 --port 5004 --reload'.
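The Streamlit front end talks to the FastAPI service over HTTP. The snippet below is only an illustration of such a call; the endpoint paths (`/preprocess`, `/inference`) and payload fields are assumptions based on the route file names and may not match the actual schemas in `schemas.py`.

```python
import requests

API_URL = "http://127.0.0.1:5004"

# Upload a PDF for preprocessing (hypothetical endpoint and payload).
with open("example.pdf", "rb") as f:
    requests.post(f"{API_URL}/preprocess", files={"file": f}, timeout=300)

# Ask a question against the indexed content (hypothetical endpoint and payload).
response = requests.post(
    f"{API_URL}/inference",
    json={"question": "What does the chart on page 3 show?"},
    timeout=120,
)
print(response.json())
```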
Many thanks to these excellent projects
The current version is 1.0. Development is ongoing, and any support or suggestions are welcome. Please reach out to me: [email protected] & LinkedIn