
Dev #9

Open: wants to merge 34 commits into base: main

Commits (34)
b48bac5
Project Outline Update
Sep 30, 2024
1c0357e
Project Research Updates
Oct 7, 2024
1835f3a
EDA updates
Oct 21, 2024
3dc3ba7
POC commit.
Nov 4, 2024
cb24c02
extract text from pdf
Nov 6, 2024
a25dfc9
Merge pull request #1 from BU-Spark/dev_akshat
akshatgurbuxani Nov 7, 2024
2d5430a
Merge pull request #3 from BU-Spark/dev_akshat_eda
akshatgurbuxani Nov 7, 2024
eb80ec6
Merge pull request #4 from BU-Spark/dev_abhaya_poc
akshatgurbuxani Nov 12, 2024
49006b8
Add files via upload
xud0218 Nov 22, 2024
4474120
Delete project-poc/BPS_chatbot.ipynb
xud0218 Nov 22, 2024
02c0d9c
implemented new features
xud0218 Nov 25, 2024
b69b79a
developed ui part
xud0218 Nov 27, 2024
a5e1464
Delete BPS_beta_screenshot.png
xud0218 Nov 27, 2024
c39ad90
Add files via upload
xud0218 Nov 27, 2024
2f136d9
Changes to fetch links
Dec 3, 2024
7571c7b
UI commit.
mcakuraju5 Dec 5, 2024
246ec7b
Add files via upload
xud0218 Dec 6, 2024
a15197e
Merge pull request #7 from BU-Spark/dev_mounika
akshatgurbuxani Dec 8, 2024
0506cf1
Merge pull request #6 from BU-Spark/dev_akshat
akshatgurbuxani Dec 8, 2024
94b99ca
Chatbot commit
Dec 9, 2024
ea0f690
Merge pull request #8 from BU-Spark/akshat_final
akshatgurbuxani Dec 10, 2024
7ab9bce
Delete project-UI directory
xud0218 Dec 10, 2024
10b6659
Fixed file paths. Add data folder. Updated README.md and DATASETDOC-f…
Dec 11, 2024
5a4c4c7
Provided links to main branch
Dec 11, 2024
f172d9c
Delete BPS_chatbot_with_ui directory
xud0218 Dec 11, 2024
c1ed5dd
Add files via upload
xud0218 Dec 11, 2024
ae143e2
Delete BPS_chatbot_with_ui directory
xud0218 Dec 11, 2024
546f7d9
Add files via upload
xud0218 Dec 11, 2024
83819ae
Update README.md
xud0218 Dec 11, 2024
d343cbf
Update README.md
xud0218 Dec 11, 2024
99c4483
Update app.py
xud0218 Dec 11, 2024
a3effa1
Merge pull request #10 from BU-Spark/akshat_final
akshatgurbuxani Dec 12, 2024
5797c5d
Merge pull request #11 from BU-Spark/xud
trgardos Dec 16, 2024
14027c0
Update README.md
trgardos Dec 16, 2024
5 changes: 5 additions & 0 deletions .gitignore
@@ -168,3 +168,8 @@ $RECYCLE.BIN/

# .nfs files are created when an open file is removed but is still being accessed
.nfs*

# __pycache__ files ignored
project-chatbot/code/__pycache__/*

**/.DS_Store
129 changes: 129 additions & 0 deletions BPS_chatbot_with_ui/README.md
@@ -0,0 +1,129 @@
# Alternate Chatbot Implementation

This is an alternate implementation of the chatbot UI with a different set of capabilities:

1. User Authentication via Chainlit:<br>
User Permissions: Access restricted to retrieving policy documents.<br>
Admin Permissions: Additional capabilities:

- Retrieve policy documents.
- Reindex the vector store.
- Upload new document(s) to the vector store.
- Remove specific document(s) from the vector store.

2. Chatbot Interface:
- Fully functional and user-friendly interface designed in-house.

3. Project Structure Improvements:
- Modular and reusable functions for better code maintainability.
- Clear and descriptive class naming conventions.

4. ChatGPT API Integration:
- Enhanced performance through advanced prompt engineering.

## **Setup and Installation**

1. **Clone the Repository**:
```bash
git clone <repository-url>
cd <repository-folder>
```

2. **Install Requirements**:
```bash
pip install -r requirements.txt
```

3. **Environment Variables**:
- Set the `OPENAI_API_KEY` and `CHAINLIT_AUTH_SECRET` values in `.env`.
- You can generate `CHAINLIT_AUTH_SECRET` using:
```bash
chainlit create-secret
```
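- A minimal `.env` sketch (the variable names are the ones referenced above; the values are placeholders):
```bash
OPENAI_API_KEY=sk-your-key-here
CHAINLIT_AUTH_SECRET=paste-the-generated-secret-here
```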

4. **Run the Application**:
```bash
chainlit run app.py -w
```

---

## **Project Structure**

### **Key Files and Directories**

1. **`chunked_data_all_folders_with_links.json`**:
- This file contains the chunked policy documents with metadata such as folder name, file name, source link, and backup link.
- Each entry in the file represents a document chunk with the following structure:
```json
{
"content": "Document chunk content here...",
"folder_name": "Policy Folder Name",
"file_name": "Policy Document Name",
"source_link": "Original document source link",
"backup_link": "Backup document link"
}
```

2. **`source_links.json`**:
- This file contains all the links associated with each policy document.
- Update it periodically so the RAG model retrieves current links.
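- A hypothetical entry (the exact schema is not shown here; this sketch assumes one record per document, reusing the link fields from the chunk metadata above):
```json
{
  "CAO-06 GPA Calculation Method": {
    "source_link": "Original document source link",
    "backup_link": "Backup document link"
  }
}
```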

3. **`vector_store/` Directory**:
- Contains the FAISS index and metadata files for the embedded policy documents.
- Files:
- **`faiss_index`**: The FAISS index stores vector representations of document chunks for efficient similarity search.
- **`faiss_meta.json`**: Metadata associated with the FAISS index, mapping vector IDs to the corresponding documents.
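- A minimal sketch of loading and querying the saved store, reusing the same helper calls `app.py` makes (it imports them with a wildcard, so the assumption here is that they live in `utils.faiss_handler`):
```python
# Load the persisted FAISS index plus its metadata, then run a similarity search.
from utils.faiss_handler import get_embeddings, load_faiss_from_files, search_faiss

embeddings = get_embeddings()
faiss_db = load_faiss_from_files(
    "./data/vector_store/faiss_index",  # FAISS index path, as in app.py
    "./data/vector_store/faiss_meta",   # metadata path, as in app.py
    embeddings,
)
print(search_faiss(faiss_db, "GPA calculation policy"))
```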

4. **`app.py`**:
- The main entry point and backend of the application.
- Implements the chatbot interface using Chainlit and OpenAI's GPT API.

5. **`utils/` Directory**:
- Contains helper modules for core functionality:
- **`faiss_handler.py`**: Handles FAISS embedding, index loading, saving, adding to FAISS, removing from FAISS, and similarity searches.
- **`chunker.py`**: Processes and splits policy documents into smaller chunks.
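- A minimal sketch of rebuilding the index offline, mirroring the calls in `app.py`'s `reindex` (assuming `process_dataset` and `json_to_documents` come from `utils.chunker`):
```python
# Re-chunk every policy document, then embed the chunks into a fresh FAISS store.
from langchain_community.vectorstores import FAISS
from utils.chunker import process_dataset, json_to_documents
from utils.faiss_handler import get_embeddings, save_faiss

process_dataset(
    dataset_path="./data/documents/dataset",
    output_chunk_path="./data/chunked_data_all_folders_with_links.json",
)
documents = json_to_documents("./data/chunked_data_all_folders_with_links.json")
faiss_db = FAISS.from_documents(documents, get_embeddings())
save_faiss(faiss_db, "./data/vector_store/faiss_index", "./data/vector_store/faiss_meta")
```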

6. **`requirements.txt`**:
- Lists all Python dependencies required for the project (e.g., Chainlit, FAISS, OpenAI API, etc.).

---

## **Usage**

1. **Admin and User**:
- A user can only retrieve policy documents:
- username: user
- password: 123
- An admin can retrieve policy documents, reindex the vector store, upload new document(s) to the vector store, and remove specific documents from the vector store:
- username: admin
- password: 321

2. **Reindex**:
- Place the policy documents in the `data/documents/dataset` directory.
- Enter 'reindex' in the chatbot interface.
- The application will re-chunk and re-embed the policy documents in the `data/documents/dataset` directory.

3. **Upload**:
- Attach the document(s) in the chatbot interface and include the source link(s) for each document in the message.
- The application will chunk and embed ONLY the uploaded documents and add them to the vector store.

4. **Remove**:
- Enter 'remove: <file_name>' to remove a specific document from the vector store.
- e.g., `remove: CAO-06 GPA Calculation Method`
---

## **Error Management**

1. **Chainlit Timeout**:
- When running 'reindex', Chainlit will time out during the embedding process.
- The admin can use Stop Task once the console prints:
- FAISS index saved to ./data/vector_store/faiss_index
- Metadata saved to ./data/vector_store/faiss_meta

2. **Re-run reindex**:
- If an error occurs, the admin can rerun 'reindex' to recover.

3. **Contact**:
- Contact [email protected] with any further issues.
198 changes: 198 additions & 0 deletions BPS_chatbot_with_ui/app.py
@@ -0,0 +1,198 @@
import openai
import os
import tempfile
import asyncio
import re
import chainlit as cl
from openai import AsyncOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from utils.faiss_handler import *
from utils.chunker import *

# Initialize OpenAI client
client = AsyncOpenAI()

# OpenAI API key setup
openai.api_key = os.getenv("OPENAI_API_KEY")
cl.instrument_openai()

# OpenAI settings
settings = {
"model": "gpt-3.5-turbo",
"temperature": 0.5,
}

dataset_path = "./data/documents/dataset"
output_chunk_path = "./data/chunked_data_all_folders_with_links.json"
index_path = "./data/vector_store/faiss_index"
meta_path = "./data/vector_store/faiss_meta"
embeddings = get_embeddings()
faiss_db = load_faiss_from_files(index_path, meta_path, embeddings)

@cl.password_auth_callback
def auth_callback(username: str, password: str):
    # Hard-coded demo credentials; see the README's Usage section.
    if (username, password) == ("admin", "321"):
        return cl.User(
            identifier="admin", metadata={"role": "admin", "provider": "credentials"}
        )
    elif (username, password) == ("user", "123"):
        return cl.User(
            identifier="user", metadata={"role": "user", "provider": "credentials"}
        )
    else:
        return None

@cl.on_chat_start
async def on_chat_start():
    user = cl.user_session.get("user")
    if user:
        if user.identifier == "admin":
            await cl.Message(content="Welcome Admin! You can:\n"
                             "- Send a policy-related question.\n"
                             "- Upload new documents (attach with source links).\n"
                             "- Enter: 'reindex' to re-chunk and re-embed the FAISS vector store.\n"
                             "- Enter: 'remove: <file name>' to remove a document.").send()
        else:
            await cl.Message(content="Hi! I will assist you in finding documents based on your question. Let's get started!").send()

@cl.on_message
async def handle_user_message(message: cl.Message):
    user = cl.user_session.get("user")
    try:
        if user and user.identifier == "admin":
            # Admin-only commands: reindex, remove, or upload via attachments.
            if message.content.lower() == "reindex":
                await reindex()
            elif message.content.lower().startswith("remove:"):
                await remove_doc(message)
            elif message.elements:
                await process_uploaded_files(message)
            else:
                await perform_chatgpt_query(message)
        else:
            await perform_chatgpt_query(message)
    except Exception as e:
        await cl.Message(content=f"Error handling your request: {e}").send()

async def process_uploaded_files(message):
    global faiss_db
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    # Extract source links from the user's message
    url_pattern = r'https?://[^\s]+'
    source_links = re.findall(url_pattern, message.content)

    files = [file for file in message.elements if "pdf" in file.mime]
    if len(files) != len(source_links):
        await cl.Message(content=f"Error: Number of links ({len(source_links)}) and uploaded files ({len(files)}) do not match.").send()
        return

    all_chunks = []
    try:
        for file, source_link in zip(files, source_links):
            if not file.path or not os.path.exists(file.path):
                raise ValueError(f"Invalid file path for {file.name}.")

            with open(file.path, "rb") as f:
                file_content = f.read()

            # Write the upload to a temporary PDF so PyPDFLoader can read it
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_file:
                temp_file.write(file_content)
                temp_path = temp_file.name

            # Remove any existing copy of the document before updating
            faiss_db = remove_from_faiss(faiss_db, file.name)

            loader = PyPDFLoader(temp_path)
            document = loader.load()
            chunks = text_splitter.split_documents(document)

            for chunk in chunks:
                chunk.metadata = chunk.metadata or {}
                chunk.metadata.update({
                    "file_name": file.name,
                    "source_link": source_link,
                    "backup_link": backup_source_link(clean_name(file.name)),
                })
                # Remove 'page' and 'source'
                chunk.metadata.pop("page", None)
                chunk.metadata.pop("source", None)

            all_chunks.extend(chunks)
            os.remove(temp_path)

        # Update FAISS database
        faiss_db = add_to_faiss(faiss_db, embeddings, all_chunks)
        save_faiss(faiss_db, index_path, meta_path)
        await cl.Message(content=f"Added {len(all_chunks)} chunks to FAISS vector store.").send()

    except Exception as e:
        await cl.Message(content=f"Error during file processing: {e}").send()

async def perform_chatgpt_query(message: cl.Message):
    if not faiss_db:
        await cl.Message(content="FAISS vector store is not initialized.").send()
        return

    try:
        search_results = search_faiss(faiss_db, message.content)
        if not search_results:
            await cl.Message(content="No relevant documents found.").send()
            return

        # Adjacent string literals concatenate, so each instruction ends with an
        # explicit newline to keep the prompt legible to the model.
        messages = [
            {
                "role": "system",
                "content": (
                    "You are an assistant for Boston Public School policies. I will provide you with retrieved documents from the RAG model. Your task is to:\n"
                    "1. Refuse to engage with harmful, racist, or discriminatory questions. If such a query is detected, respond only with: I'm sorry, but I cannot assist with that request.\n"
                    "2. Evaluate the relevance of each document based on its content and the query provided.\n"
                    "3. Only summarize documents that are relevant to the query. A document is considered relevant if it contains policies or guidelines explicitly related to the query.\n"
                    "4. If a document is not at least somewhat relevant, exclude it from the summary.\n"
                    "5. Do not return anything other than reformatted or summarized documents.\n"
                    "6. Remember to include the link.\n"
                    "For relevant documents, reformat the content in this structured format:\n"
                    "Document: Policy file name here\n"
                    "Formatting Links: Google Drive Link (format Google Drive Link here) | Boston Public School Link (format Boston Public School Link here)\n"
                    "Summary: write a 2-3 sentence summary.\n"
                    "Key points (1 sentence each):\n"
                    "- [Bullet point 1]\n"
                    "- [Bullet point 2]\n"
                    "- [Bullet point 3]\n"
                    "NO ADDITIONAL TEXT AFTER THIS!"
                ),
            },
            {"role": "user", "content": search_results},
        ]

        response = await client.chat.completions.create(messages=messages, **settings)
        await cl.Message(content=response.choices[0].message.content).send()

    except Exception as e:
        await cl.Message(content=f"Error generating response: {e}").send()

async def reindex():
    global faiss_db
    try:
        await cl.Message(content="Reindexing in progress...").send()
        process_dataset(dataset_path=dataset_path, output_chunk_path=output_chunk_path)

        await cl.Message(content="Chunks created successfully. Now embedding...").send()
        documents = json_to_documents(output_chunk_path)
        faiss_db = FAISS.from_documents(documents, embeddings)
        save_faiss(faiss_db, index_path, meta_path)
    except Exception as e:
        await cl.Message(content=f"Error during reindexing: {e}").send()

async def remove_doc(message):
    global faiss_db

    try:
        # "remove: <file_name>" — strip the command prefix and append the .pdf extension
        file_name = message.content.split("remove:", 1)[1].strip()
        file_name = file_name + ".pdf"
        faiss_db = remove_from_faiss(faiss_db, file_name)
        save_faiss(faiss_db, index_path, meta_path)
        await cl.Message(content=f"Document '{file_name}' removed from FAISS vector store.").send()
    except Exception as e:
        await cl.Message(content=f"Error removing document: {e}").send()
14 changes: 14 additions & 0 deletions BPS_chatbot_with_ui/chainlit.md
@@ -0,0 +1,14 @@
# Welcome to Chainlit! 🚀🤖

Hi there, Developer! 👋 We're excited to have you on board. Chainlit is a powerful tool designed to help you prototype, debug and share applications built on top of LLMs.

## Useful Links 🔗

- **Documentation:** Get started with our comprehensive [Chainlit Documentation](https://docs.chainlit.io) 📚
- **Discord Community:** Join our friendly [Chainlit Discord](https://discord.gg/k73SQ3FyUh) to ask questions, share your projects, and connect with other developers! 💬

We can't wait to see what you create with Chainlit! Happy coding! 💻😊

## Welcome screen

To modify the welcome screen, edit the `chainlit.md` file at the root of your project. If you do not want a welcome screen, just leave this file empty.