Commit 5e379b7

first update

YuanJianhao508 committed May 29, 2024
1 parent 856f0bc commit 5e379b7
Showing 112 changed files with 120,523 additions and 1 deletion.
55 changes: 54 additions & 1 deletion README.md
@@ -12,4 +12,57 @@ Official GitHub repository for "RAG-Driver: Generalisable Driving Explanations w
**RAG-Driver** is a Multi-Modal Large Language Model with Retrieval-augmented In-context Learning capacity designed for generalisable and explainable end-to-end driving with strong zeroshot generalisation capacity.

## 📰 News <a name="highlight"></a>
* **[2024.05.27]** Code update is in progress; this repo is under active maintenance.


## TODO List
- [ ] Uploading the processed version of BDD-X.
- [ ] Uploading the model checkpoint.
- [ ] Releasing the Spoken-SAX dataset.
- [ ] Further cleaning of the retrieval engine codebase.

## Usage

### Requirements and Installation
* Python >= 3.10
* PyTorch == 2.0.1
* CUDA Version >= 11.7
* Install required packages:
```bash
git clone https://github.com/YuanJianhao508/RAG-Driver.git
cd RAG-Driver
conda create -n ragdriver python=3.10 -y
conda activate ragdriver
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
```
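
After installing, a quick sanity check can confirm that PyTorch sees a CUDA device and that the optional dependencies built correctly. The snippet below is a minimal, illustrative check and is not part of the repo:

```python
# Illustrative environment sanity check; adjust expectations to your setup.
import torch

print(torch.__version__)          # expect 2.0.1
print(torch.version.cuda)         # expect >= 11.7
print(torch.cuda.is_available())  # training requires a CUDA-capable GPU

# These imports fail fast if the optional dependencies did not build.
import flash_attn
import decord
import pytorchvideo
```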

### Instruction Tuning on BDD-X dataset

```bash
bash ./scripts/finetune.sh
```

- Download the pre-trained Video-LLaVA LLM and projector checkpoints from [here](https://huggingface.co/LanguageBind/Video-LLaVA-7B) and [here](https://huggingface.co/LanguageBind/Video-LLaVA-Pretrain-7B), and pass their paths via `--model_name_or_path` and `--pretrain_mm_mlp_adapter`.
- Download the pre-trained LanguageBind video encoder from [here](https://huggingface.co/LanguageBind/LanguageBind_Video_merge) and pass its path via `--video_tower`.
- Adjust the batch size `--per_device_train_batch_size` and gradient accumulation steps `--gradient_accumulation_steps` based on the number of GPUs available; ensure the effective batch size (i.e. `--per_device_train_batch_size` × `--gradient_accumulation_steps` × number of GPUs) equals 128. A sketch of how these flags fit together is shown after this list.
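
For reference, the sketch below shows how these flags might fit together inside `scripts/finetune.sh`. The `deepspeed` launcher, the `llava/train/train_mem.py` entry point, and the local checkpoint paths are assumptions modelled on the Video-LLaVA training setup, not taken from this commit:

```bash
# Illustrative excerpt only; see scripts/finetune.sh for the full flag set.
# Effective batch size: 16 per device x 1 accumulation step x 8 GPUs = 128.
deepspeed llava/train/train_mem.py \
    --model_name_or_path ./checkpoints/Video-LLaVA-7B \
    --pretrain_mm_mlp_adapter ./checkpoints/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --video_tower ./checkpoints/LanguageBind_Video_merge \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 1
```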


## Citations
If you find our paper and code useful in your research, please consider citing:
```BibTeX
@article{yuan2024rag,
  title={RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model},
  author={Yuan, Jianhao and Sun, Shuyang and Omeiza, Daniel and Zhao, Bo and Newman, Paul and Kunze, Lars and Gadd, Matthew},
  journal={arXiv preprint arXiv:2402.10828},
  year={2024}
}
```

## Acknowledgement
This repo is built on [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) and [ADAPT](https://github.com/jxbbb/ADAPT). We thank the authors for their open-source codebases.
1 change: 1 addition & 0 deletions llava/__init__.py
@@ -0,0 +1 @@
from .model import LlavaLlamaForCausalLM
18 changes: 18 additions & 0 deletions llava/constants.py
@@ -0,0 +1,18 @@
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15

LOGDIR = "."

# Model Constants
IGNORE_INDEX = -100  # standard PyTorch cross-entropy ignore index
# Negative placeholder IDs for multimodal tokens; negative values can never
# collide with real (non-negative) vocabulary token IDs.
X_TOKEN_INDEX = {'IMAGE': -200, 'VIDEO': -201, 'AUDIO': -202, 'THERMAL': -203, 'DEPTH': -204}
X_INDEX_TOKEN = {v: k for k, v in X_TOKEN_INDEX.items()}
# IMAGE_TOKEN_INDEX = -200
DEFAULT_X_TOKEN = {'IMAGE': "<image>", 'VIDEO': "<video>", 'AUDIO': "<audio>", 'THERMAL': "<thermal>", 'DEPTH': "<depth>"}
# DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_X_PATCH_TOKEN = {'IMAGE': "<im_patch>", 'VIDEO': "<vi_patch>", 'AUDIO': "<au_patch>", 'THERMAL': "<th_patch>", 'DEPTH': "<de_patch>"}
# DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_X_START_TOKEN = {'IMAGE': "<im_start>", 'VIDEO': "<vi_start>", 'AUDIO': "<au_start>", 'THERMAL': "<th_start>", 'DEPTH': "<de_start>"}
# DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_X_END_TOKEN = {'IMAGE': "<im_end>", 'VIDEO': "<vi_end>", 'AUDIO': "<au_end>", 'THERMAL': "<th_end>", 'DEPTH': "<de_end>"}
# DEFAULT_IM_END_TOKEN = "<im_end>"
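
As a hypothetical illustration of how these constants are typically consumed (the `build_prompt` helper below is not part of the repo; it only shows the placeholder-token pattern):

```python
# Hypothetical usage sketch; not part of the repository.
from llava.constants import DEFAULT_X_TOKEN, X_TOKEN_INDEX

def build_prompt(question: str) -> str:
    # Prepend the video placeholder so the tokenizer can later swap it
    # for the special index X_TOKEN_INDEX['VIDEO'] (-201).
    return f"{DEFAULT_X_TOKEN['VIDEO']}\n{question}"

print(build_prompt("What is the ego vehicle doing, and why?"))
# <video>
# What is the ego vehicle doing, and why?
```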
