This repository demonstrates Retrieval-Augmented Generation (RAG) techniques applied to healthcare data using the MIMIC-III database. RAG systems are particularly valuable in healthcare for answering questions about patient records, providing evidence-based clinical information, and supporting decision-making.
This project implements a RAG system that can answer questions about hospital admission data by:
- Querying MIMIC-III admission records
- Converting text to embeddings using sentence transformers
- Building a FAISS vector index for efficient retrieval
- Using prompt engineering to improve answer generation with OpenAI models
Please note:
- This pipeline is not comprehensive or production-ready. It intentionally omits advanced error handling, extensive logging, modularization, and configurable parameters.
- Potential future work (e.g., an upcoming RAG pipeline with clinical notes) will incorporate these enhancements along with more robust preprocessing, healthcare-specific embedding models, and improved evaluation metrics.
- Simple but functional RAG pipeline for healthcare data
- Demonstrates how improved prompting can enhance the accuracy and relevance of generated answers.
- Demonstration of date/time reasoning in medical context
- Provides a foundation for extending the pipeline to handle more complex clinical notes.
- MIMIC-III Access: Obtain access to the MIMIC-III database through PhysioNet.
- Google Cloud Account: Set up a Google Cloud account with access to BigQuery.
- OpenAI API Key: Obtain an API key from OpenAI.
-
Clone this repository:
git clone https://github.com/kulsoom-abdullah/mimic-simple-rag.git cd mimic-simple-rag
-
Install dependencies:
pip install -r requirements.txt
-
Setup the
.env
file with your OpenAI API key:echo OPENAI_API_KEY="your-api-key-here" > .env
Before running the notebook, please ensure the following:
- Google Cloud Authentication:
Authenticate your Google Cloud account for BigQuery access by running:gcloud auth application-default login
Alternatively, set up a service account with the required permissions and define the environment variable GOOGLE_APPLICATION_CREDENTIALS
to point to your JSON key file. For more details, refer to the Google Cloud Authentication Guide.
-
Notebook Execution:
Open the Jupyter Notebook (mimic_simple_rag.ipynb
) and run the cells step by step to verify that everything executes without errors. -
Dependencies and First-Run Experience:
Run the following command to install all required dependencies:pip install -r requirements.txt
Note that some large models (e.g., the sentence-transformers model) will be downloaded on first use. This may take a few minutes during the initial run.
The main notebook demonstrates the complete pipeline:
# Set your Google Cloud project
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-name"
The diagram (rag_pipeline.png
) illustrates the architecture of the Healthcare RAG Pipeline. Here's a breakdown of the steps:
- MIMIC-III Patient Admissions Data (BigQuery):
The pipeline starts by querying patient admission records from the MIMIC-III database stored in BigQuery.
- Text Chunking:
The admission records are preprocessed and split into smaller chunks using RecursiveCharacterTextSplitter to maintain clinical context.
- Embedding Generation:
Each text chunk is converted into a vector embedding using Sentence Transformers.
- FAISS Index:
The embeddings are stored in a FAISS index for efficient similarity search.
- User Query:
When a user submits a query, it is also converted into an embedding using Sentence Transformers.
- Vector Retrieval:
The FAISS index performs a similarity search to retrieve the most relevant text chunks based on the query embedding.
- Context-Aware Generation:
The retrieved chunks are combined with the user query and sent to OpenAI's GPT model for context-aware answer generation.
- Enhanced Prompting:
The system uses enhanced prompting techniques tailored for healthcare data to improve the relevance and accuracy of the generated answers.
- This is a minimal example using only admission data
- Limited to a small number of records for demonstration
- Does not include more complex healthcare-specific preprocessing
- Uses general-purpose embeddings rather than clinical embeddings
If you don’t have access to the MIMIC-III dataset or BigQuery, you can use the synthetic data version of this project. It provides a simplified RAG pipeline using randomly generated hospital admission data. Check out the synthetic_data_version folder for details.
The next phase will extend this approach to clinical notes in MIMIC-III, implementing:
- Advanced chunking strategies for longer medical texts
- Healthcare-specific embedding models
- Multi-chunk retrieval and context management
- Named entity recognition for medical terms
This project is licensed under the MIT License - see the LICENSE file for details.
- MIMIC-III Clinical Database
- PhysioNet for hosting the data
- LangChain for RAG components
- Sentence-Transformers for embeddings
- FAISS for vector indexing
- OpenAI for providing the GPT model used in this project.