Commit

Merge pull request #43 from dohuyduc2002/develop for adding Module 1 projects

Add all module 1 projects
dohuyduc2002 authored Jul 5, 2024
2 parents 23c678f + 7bf0e50 commit a1909a6
Showing 59 changed files with 2,653 additions and 1 deletion.
162 changes: 162 additions & 0 deletions .gitignore
@@ -0,0 +1,162 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
47 changes: 47 additions & 0 deletions Module_1/RAG_project/README.md
@@ -0,0 +1,47 @@
# RAG Project Overview

This project, located in the `RAG_project` folder, leverages the `vicuna-7b-v1.5` large language model (LLM) for a question-answering system designed to work with PDF files. Using the Retrieval-Augmented Generation (RAG) approach, the system answers questions by retrieving the relevant passages from the provided PDF documents.

## Key Features

- **Language Model**: Utilizes `vicuna-7b-v1.5`, a powerful language model, for generating answers.
- **File Processing**: Supports PDF file uploads for question-answering tasks.
- **RAG Approach**: Employs Retrieval-Augmented Generation to enhance answer accuracy by retrieving relevant information from the uploaded documents.

## Getting Started

To get started with this project, follow the steps below:

1. **Clone the Repository**: Ensure you have the project cloned to your local machine.
2. **Install Dependencies**: A `requirements.txt` file is provided. Install the necessary dependencies by running `pip install -r requirements.txt` in your terminal.
3. **Run the Application**: Navigate to the `RAG_project` directory and start the application with `chainlit run app.py` (see the sketch below).
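
As a sketch, these steps map to the following notebook cells; the clone URL is a placeholder, and launching through the Chainlit CLI is an assumption:

```python
# Setup steps as notebook cells; the repository URL is a placeholder.
!git clone <repository-url>
%cd <repository-name>/Module_1/RAG_project
!pip install -r requirements.txt
!chainlit run app.py  # serves the UI at http://localhost:8000
```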

## Usage

1. **Upload a PDF**: Upon starting the application, you will be prompted to upload a PDF file.
2. **Ask Questions**: After uploading, you can start asking questions related to the content of the PDF file.
3. **Receive Answers**: The system will process your questions and provide answers based on the content of the uploaded PDF.

## Requirements

The project requires the following main dependencies:

- `torch`
- `transformers`
- `chainlit`
- `langchain`

A detailed list of all dependencies is provided in the `requirements.txt` file.

To expose the demo without ngrok, you can instead install localtunnel with npm; the package name is listed in the `localtunnel.txt` file.
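
As a minimal sketch of the localtunnel route from notebook cells, assuming the app is already running on port 8000 (the port named in the ngrok steps below):

```python
# Install and start localtunnel; the package name comes from localtunnel.txt.
!npm install localtunnel
!npx localtunnel --port 8000  # prints a public URL forwarding to the local demo
```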

## To host a local run using ngrok, follow these instructions
1. Create an account on ngrok
2. Get your personal authtoken from the ngrok dashboard
3. Run all code blocks. **Do not cancel the notebook run during this process**
4. Click on `Your app is available at http://localhost:8000` to access your demo
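
The notebook's tunneling cell is not rendered in this diff; as a minimal sketch with `pyngrok` (listed at the end of `requirements.txt`), assuming the app listens on port 8000:

```python
from pyngrok import ngrok

# Placeholder: paste the authtoken from your ngrok dashboard.
ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")

# Open an HTTP tunnel to the locally running Chainlit app.
tunnel = ngrok.connect(8000)
print(f"Your app is available at {tunnel.public_url}")
```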

The following cell magic writes the Chainlit UI file `app.py` from the notebook:
```python
%%writefile app.py
```
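
Once that cell has written `app.py`, the server can be launched from the same notebook; a minimal sketch, assuming Chainlit's default port of 8000:

```python
# Start the Chainlit server in the background so a tunnel cell can follow.
!chainlit run app.py --port 8000 &
```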
132 changes: 132 additions & 0 deletions Module_1/RAG_project/app.py
@@ -0,0 +1,132 @@
import chainlit as cl
import torch

from chainlit.types import AskFileResponse

from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface.llms import HuggingFacePipeline

from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.chains import ConversationalRetrievalChain

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain import hub

# Shared chunker and embedding model used for every uploaded file.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
embedding = HuggingFaceEmbeddings()

def process_file(file: AskFileResponse):
    # Pick a loader class based on the MIME type of the uploaded file.
    if file.type == 'text/plain':
        loader = TextLoader
    elif file.type == 'application/pdf':
        loader = PyPDFLoader

    loader = loader(file.path)
    documents = loader.load()
    # Split into overlapping chunks and tag each chunk with a source id
    # so answers can cite where they came from.
    docs = text_splitter.split_documents(documents)
    for i, doc in enumerate(docs):
        doc.metadata['source'] = f'source_{i}'
    return docs

def get_vector_db(file: AskFileResponse):
    docs = process_file(file)
    cl.user_session.set('docs', docs)
    # Build an in-memory Chroma vector store over the chunk embeddings.
    vector_db = Chroma.from_documents(
        documents=docs,
        embedding=embedding
    )
    return vector_db

def get_huggingface_llm(model_name: str = 'lmsys/vicuna-7b-v1.5',
                        max_new_token: int = 512):
    # 4-bit NF4 quantization so the 7B model fits on a single consumer GPU.
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=nf4_config,
        low_cpu_mem_usage=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pipeline = pipeline(
        'text-generation',
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_token,
        pad_token_id=tokenizer.eos_token_id,
        device_map='auto'
    )
    # Wrap the transformers pipeline so LangChain can drive it.
    llm = HuggingFacePipeline(pipeline=model_pipeline)
    return llm

LLM = get_huggingface_llm()

welcome_message = """Welcome to the PDF QA! To get started:
1. Upload a PDF or text file
2. Ask a question about the file
"""

@cl.on_chat_start
async def on_chat_start():
    # Keep asking until the user uploads a supported file.
    files = None
    while files is None:
        files = await cl.AskFileMessage(
            content=welcome_message,
            accept=['text/plain', 'application/pdf'],
            max_size_mb=20,
            timeout=180
        ).send()
    file = files[0]

    msg = cl.Message(content=f'Processing file {file.name}...',
                     disable_feedback=True)
    await msg.send()

    # Indexing is blocking work, so run it off the event loop.
    vector_db = await cl.make_async(get_vector_db)(file)

    # Conversation memory so follow-up questions keep their context.
    message_history = ChatMessageHistory()
    memory = ConversationBufferMemory(memory_key='chat_history',
                                      output_key='answer',
                                      chat_memory=message_history,
                                      return_messages=True)
    # MMR retrieval returns 3 diverse chunks per question.
    retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={'k': 3})
    chain = ConversationalRetrievalChain.from_llm(llm=LLM,
                                                  chain_type='stuff',
                                                  retriever=retriever,
                                                  memory=memory,
                                                  return_source_documents=True)
    msg.content = f"`{file.name}` processed. You can now ask questions!"
    await msg.update()

    cl.user_session.set('chain', chain)

@cl.on_message
async def on_message(message: cl.Message):
    chain = cl.user_session.get("chain")
    cb = cl.AsyncLangchainCallbackHandler()
    # Callbacks go through the run config so Chainlit can stream chain steps.
    res = await chain.ainvoke(message.content, config={"callbacks": [cb]})
    answer = res["answer"]
    source_documents = res["source_documents"]
    text_elements = []

    # Attach each retrieved chunk as a named source element under the answer.
    if source_documents:
        for source_idx, source_doc in enumerate(source_documents):
            source_name = f"source_{source_idx}"
            text_elements.append(
                cl.Text(content=source_doc.page_content, name=source_name))
        source_names = [text_el.name for text_el in text_elements]
        if source_names:
            answer += f"\nSources: {', '.join(source_names)}"
        else:
            answer += "\nNo sources found"
    await cl.Message(content=answer, elements=text_elements).send()
14 changes: 14 additions & 0 deletions Module_1/RAG_project/chainlit.md
@@ -0,0 +1,14 @@
# Welcome to Chainlit! 🚀🤖

Hi there, Developer! 👋 We're excited to have you on board. Chainlit is a powerful tool designed to help you prototype, debug and share applications built on top of LLMs.

## Useful Links 🔗

- **Documentation:** Get started with our comprehensive [Chainlit Documentation](https://docs.chainlit.io) 📚
- **Discord Community:** Join our friendly [Chainlit Discord](https://discord.gg/k73SQ3FyUh) to ask questions, share your projects, and connect with other developers! 💬

We can't wait to see what you create with Chainlit! Happy coding! 💻😊

## Welcome screen

To modify the welcome screen, edit the `chainlit.md` file at the root of your project. If you do not want a welcome screen, just leave this file empty.
1 change: 1 addition & 0 deletions Module_1/RAG_project/localtunnel.txt
@@ -0,0 +1 @@
localtunnel
1 change: 1 addition & 0 deletions Module_1/RAG_project/rag-project.ipynb

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions Module_1/RAG_project/requirements.txt
@@ -0,0 +1,26 @@
transformers==4.41.2
bitsandbytes==0.43.1
accelerate==0.31.0
langchain==0.2.5
langchainhub==0.1.20
langchain-chroma==0.1.1
langchain-community==0.2.5
langchain-openai==0.1.9
langchain_huggingface==0.0.3
chainlit==1.1.304
python-dotenv==1.0.1
pypdf==4.2.0
pyngrok