Merge pull request #43 from dohuyduc2002/develop: Add all module 1 projects

Showing 59 changed files with 2,653 additions and 1 deletion.
#### `.gitignore` (new file, 162 lines)

```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
```
#### `README.md` (new file)

# RAG Project Overview

This project, located in the `RAG_project` folder, uses the Vicuna-7B large language model (LLM) in a question-answering system designed to work with PDF files. Following the Retrieval-Augmented Generation (RAG) approach, the system retrieves relevant passages from the uploaded PDF documents to ground its answers.
## Key Features

- **Language Model**: Uses `vicuna-7b-v1.5` to generate answers.
- **File Processing**: Supports PDF file uploads for question-answering tasks.
- **RAG Approach**: Employs Retrieval-Augmented Generation to improve answer accuracy by retrieving relevant passages from the uploaded documents.
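The retrieve-then-generate flow described above can be illustrated with a dependency-free toy sketch. Here word-overlap scoring stands in for embedding similarity, and the LLM call is stubbed out as a prompt string; all names are illustrative, not the project's actual API:

```python
def retrieve(chunks, question, k=2):
    """Return the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(chunks, question):
    """Augment the question with retrieved context before generation."""
    context = "\n".join(retrieve(chunks, question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "Vicuna is a chat model fine-tuned from LLaMA.",
    "PDF files are split into overlapping chunks before indexing.",
    "Paris is the capital of France.",
]
prompt = build_prompt(chunks, "How are PDF files indexed?")
```

A real system replaces `retrieve` with a vector-store lookup and feeds `prompt` to the LLM, which is what the project's `app.py` does with Chroma and `vicuna-7b-v1.5`.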
## Getting Started

To get started with this project, follow the steps below:

1. **Clone the repository**: Ensure you have the project cloned to your local machine.
2. **Install dependencies**: Run `pip install -r requirements.txt` in your terminal.
3. **Run the application**: Navigate to the `RAG_project` directory and start the app from `app.py` (Chainlit apps are launched with `chainlit run app.py`).
## Usage

1. **Upload a PDF**: On startup, the application prompts you to upload a PDF file.
2. **Ask questions**: After uploading, ask questions related to the content of the PDF file.
3. **Receive answers**: The system processes your questions and answers based on the content of the uploaded PDF.
## Requirements

The project's main dependencies are:

- `torch`
- `transformers`
- `chainlit`
- `langchain`

The full pinned list is provided in `requirements.txt`.
To expose a public demo, you can also install `localtunnel` via npm (the package name is listed in `localtunnel.txt`).

## Hosting a demo with ngrok

1. Create an account on ngrok.
2. Get your personal authentication token (keep it secret).
3. Run all code blocks. **Do not cancel the notebook run during this process.**
4. Click the `Your app is available at http://localhost:8000` link to access your demo.
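Step 2's token can also be wired up programmatically with `pyngrok`, which is listed in `requirements.txt`. A minimal sketch: the token string is a placeholder you must replace, and the port assumes Chainlit's default of 8000:

```python
from pyngrok import ngrok

# Placeholder: paste your personal authtoken from the ngrok dashboard.
ngrok.set_auth_token("YOUR_NGROK_AUTHTOKEN")

# Open an HTTP tunnel to the local Chainlit server.
tunnel = ngrok.connect(8000)
print(f"Your app is available at {tunnel.public_url}")
```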
This Jupyter cell magic writes the Chainlit UI code that follows it into `app.py`:

```python
%%writefile app.py
```
#### `app.py` (new file)
```python
import chainlit as cl
import torch

from chainlit.types import AskFileResponse

from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface.llms import HuggingFacePipeline

from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.chains import ConversationalRetrievalChain

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into 1000-character chunks with 100 characters of overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
embedding = HuggingFaceEmbeddings()


def process_file(file: AskFileResponse):
    """Load an uploaded text or PDF file and split it into chunks."""
    if file.type == 'text/plain':
        loader = TextLoader
    elif file.type == 'application/pdf':
        loader = PyPDFLoader
    else:
        raise ValueError(f'Unsupported file type: {file.type}')

    loader = loader(file.path)
    documents = loader.load()
    docs = text_splitter.split_documents(documents)
    for i, doc in enumerate(docs):
        doc.metadata['source'] = f'source_{i}'
    return docs


def get_vector_db(file: AskFileResponse):
    """Embed the chunks and index them in an in-memory Chroma store."""
    docs = process_file(file)
    cl.user_session.set('docs', docs)
    vector_db = Chroma.from_documents(documents=docs, embedding=embedding)
    return vector_db


def get_huggingface_llm(model_name: str = 'lmsys/vicuna-7b-v1.5',
                        max_new_token: int = 512):
    """Load the model with 4-bit NF4 quantization and wrap it for LangChain."""
    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=nf4_config,
        low_cpu_mem_usage=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model_pipeline = pipeline(
        'text-generation',
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_token,
        pad_token_id=tokenizer.eos_token_id,
        device_map='auto'
    )
    llm = HuggingFacePipeline(pipeline=model_pipeline)
    return llm
```
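`RecursiveCharacterTextSplitter` splits recursively on separators (paragraphs, then lines, then spaces) rather than at fixed offsets, but the effect of `chunk_size`/`chunk_overlap` can be illustrated with a naive fixed-window sketch (a simplification for illustration, not LangChain's algorithm):

```python
def window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Naive chunking: each chunk starts chunk_size - chunk_overlap characters
    after the previous one, so adjacent chunks share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

For a 250-character text with `chunk_size=100, chunk_overlap=20` this yields three chunks, each sharing its first 20 characters with the tail of the previous one.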
```python
LLM = get_huggingface_llm()

welcome_message = """Welcome to the PDF QA! To get started:
1. Upload a PDF or text file
2. Ask a question about the file
"""


@cl.on_chat_start
async def on_chat_start():
    files = None
    while files is None:
        files = await cl.AskFileMessage(
            content=welcome_message,
            accept=['text/plain', 'application/pdf'],
            max_size_mb=20,
            timeout=180
        ).send()
    file = files[0]

    msg = cl.Message(content=f'Processing file {file.name}...',
                     disable_feedback=True)
    await msg.send()

    # Building the vector store is blocking, so run it in a worker thread.
    vector_db = await cl.make_async(get_vector_db)(file)

    message_history = ChatMessageHistory()
    memory = ConversationBufferMemory(memory_key='chat_history',
                                      output_key='answer',
                                      chat_memory=message_history,
                                      return_messages=True)
    # Maximal-marginal-relevance retrieval of the top 3 chunks.
    retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={'k': 3})
    chain = ConversationalRetrievalChain.from_llm(llm=LLM,
                                                 chain_type='stuff',
                                                 retriever=retriever,
                                                 memory=memory,
                                                 return_source_documents=True)
    msg.content = f'`{file.name}` processed. You can now ask questions!'
    await msg.update()

    cl.user_session.set('chain', chain)


@cl.on_message
async def on_message(message: cl.Message):
    chain = cl.user_session.get('chain')
    cb = cl.AsyncLangchainCallbackHandler()
    res = await chain.ainvoke(message.content, callbacks=[cb])
    answer = res['answer']
    source_documents = res['source_documents']
    text_elements = []

    # Attach each retrieved chunk as a named text element and cite it.
    if source_documents:
        for source_idx, source_doc in enumerate(source_documents):
            source_name = f'source_{source_idx}'
            text_elements.append(
                cl.Text(content=source_doc.page_content, name=source_name))
        source_names = [text_el.name for text_el in text_elements]
        if source_names:
            answer += f"\nSources: {', '.join(source_names)}"
        else:
            answer += '\nNo sources found'
    await cl.Message(content=answer, elements=text_elements).send()
```
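The retriever above is configured with `search_type='mmr'` (maximal marginal relevance), which trades off query relevance against redundancy among the selected chunks. A dependency-free sketch of the greedy selection idea over precomputed similarity scores (illustrative only, not LangChain's implementation):

```python
def mmr_select(query_sims, doc_sims, k=3, lam=0.5):
    """Greedily pick k documents maximizing
    lam * sim(query, doc) - (1 - lam) * max(sim(doc, already_selected)).

    query_sims: query-document similarity per document.
    doc_sims:   document-document similarity matrix.
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates, so after picking 0,
# MMR prefers the diverse document 2 over the redundant document 1.
picked = mmr_select(
    query_sims=[0.9, 0.85, 0.3],
    doc_sims=[[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]],
    k=2,
)
```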
#### `chainlit.md` (new file)
# Welcome to Chainlit! 🚀🤖

Hi there, Developer! 👋 We're excited to have you on board. Chainlit is a powerful tool designed to help you prototype, debug and share applications built on top of LLMs.

## Useful Links 🔗

- **Documentation:** Get started with our comprehensive [Chainlit Documentation](https://docs.chainlit.io) 📚
- **Discord Community:** Join our friendly [Chainlit Discord](https://discord.gg/k73SQ3FyUh) to ask questions, share your projects, and connect with other developers! 💬

We can't wait to see what you create with Chainlit! Happy coding! 💻😊

## Welcome screen

To modify the welcome screen, edit the `chainlit.md` file at the root of your project. If you do not want a welcome screen, just leave this file empty.
#### `localtunnel.txt` (new file)

```
localtunnel
```
Large diffs are not rendered by default.
#### `requirements.txt` (new file)
```
transformers==4.41.2
bitsandbytes==0.43.1
accelerate==0.31.0
langchain==0.2.5
langchainhub==0.1.20
langchain-chroma==0.1.1
langchain-community==0.2.5
langchain-openai==0.1.9
langchain_huggingface==0.0.3
chainlit==1.1.304
python-dotenv==1.0.1
pypdf==4.2.0
pyngrok
```