Commit
first push
hui-po-wang committed Oct 27, 2024
0 parents commit c170e61
Showing 46 changed files with 2,917 additions and 0 deletions.
9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
*__pycache__*
*txt
*cache*
*.DS_Store*
tokenized_datasets/*
exps/*
probs/*
.ipynb_checkpoints/
tiny-imagenet-200/
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Hui-Po Wang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
136 changes: 136 additions & 0 deletions README.md
@@ -0,0 +1,136 @@

<br/>
<div align="center">

<br/>
<a href="https://github.com/ShaanCoding/ReadME-Generator">
<img src="images/lm-gc.png" alt="Logo" width="120" height="80">
</a>
<h3 align="center"></h3>
<p align="center">
The official implementation of "Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models", published at NeurIPS 2024.
<br/>
<br/>
<a href="https://arxiv.org/abs/2409.17836">[Preprint]</a>
</p>
</div>

## Overview

![Product Screenshot](images/teaser.png)

This project provides the source code of LM-GC, the first LLM-powered gradient compressor.

Here are the key takeaways:

- We demonstrate that large language models (LLMs) hold significant potential as prior models for gradients, a concept that has been widely applied to other modalities but not yet to gradients.
- We introduce a novel serialization method that converts IEEE 754 floating points into hexadecimal format, enabling LLMs to comprehend gradients and achieve state-of-the-art lossless compression (see the sketch below).
- Our LLM-based prior model could unlock new applications for gradients similar to those in other modalities, such as super-resolution, denoising, generation, and more.
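
For intuition, here is a minimal, hypothetical sketch of the serialization idea (the function name and defaults are ours, not the repository's exact implementation): gradients are reinterpreted as raw IEEE 754 bytes, rendered as hexadecimal text, and grouped before being handed to the LLM.

```python
import numpy as np

def serialize_gradients(grads: np.ndarray, bytes_per_group: int = 4, sep: str = "") -> str:
    """Hypothetical sketch: render raw IEEE 754 bytes as hexadecimal text.

    sep="" mimics a 'hex-none' style separator, sep=" " a 'hex-space' style one.
    """
    raw = grads.astype(np.float32).tobytes()    # IEEE 754 byte representation
    hex_str = raw.hex()                         # two hex characters per byte
    step = 2 * bytes_per_group                  # hex characters per group
    return sep.join(hex_str[i:i + step] for i in range(0, len(hex_str), step))

print(serialize_gradients(np.array([0.25, -1.5]), bytes_per_group=4, sep=" "))
# -> '0000803e 0000c0bf' on a little-endian machine
```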

<br/>

*If you find the project interesting, don't forget to star and cite our work:*

```bibtex
@article{wang2024language,
  title={Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models},
  author={Wang, Hui-Po and Fritz, Mario},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}
```
## Getting Started
### Prerequisites

- torch ≥ 2.12.0
- transformers ≥ 4.40.1
- [torchac](https://github.com/fab-jul/torchac)
- [flash attention](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) ≥ 2.5.8 via ```pip install flash-attn --no-build-isolation``` for NVIDIA GPUs

or

- install via ```pip```
```sh
pip install -r requirements.txt
```
**After setting up a Hugging Face access token, the codebase should download the language models automatically via Hugging Face, except for LLAMA2. See [More LLMs](#more-llms) for more information.**
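
One common way to register the token (an assumption on our part; any standard Hugging Face login method works) is via `huggingface_hub`:

```python
from huggingface_hub import login

# Paste a read-access token created at https://huggingface.co/settings/tokens;
# alternatively, run `huggingface-cli login` or export HF_TOKEN in your shell.
login(token="hf_your_token_here")
```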

### Quickstart
We provide a quick demo here. Please refer to [Usage](#usage) for detailed instructions.
```bash
cd scripts
# compress gradients of a ConvNet trained on TinyImageNet using TinyLLAMA
bash pipeline.sh
```
## Usage
Reproducing the experiments in the paper takes three steps: (1) train neural networks and collect their gradients; (2) serialize and tokenize the raw gradients; (3) run LLMs with arithmetic coding (LM-GC).

### 1. Gradient collection
This step trains a network (e.g., a ConvNet on TinyImageNet in the following example) and collects gradients to compress later. See ```scripts/run_collect.sh``` for more details.
```bash
DATASET='tinyimagenet' # cifar10 # mnist
ARCH="convnet" # vgg16 # resnet18 # vit
for i in 0 1 2
do
python -u train_and_collect_grad.py -cfg settings/gradient_collection/$DATASET-$ARCH.yaml --tag $i --grad-interval 400 --download
done
```
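
Conceptually (a minimal sketch with hypothetical names, not the repository's actual `train_and_collect_grad.py`), collecting gradients amounts to snapshotting the model's flattened gradients every `--grad-interval` steps during training:

```python
import torch

def maybe_save_gradients(model: torch.nn.Module, step: int, interval: int = 400,
                         out_dir: str = "exps/tinyimagenet-convnet/0/grads") -> None:
    """Hypothetical helper: call after loss.backward() and before optimizer.step()."""
    if step % interval != 0:
        return
    flat = torch.cat([p.grad.detach().flatten().cpu()
                      for p in model.parameters() if p.grad is not None])
    torch.save(flat, f"{out_dir}/grad_step{step}.pt")  # raw float32 gradients, serialized later
```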
### 2. Serialization and tokenization
For convenience, we preprocess the data before arithmetic encoding: the collected gradients are serialized and tokenized, producing three preprocessed datasets. See ```scripts/serialization.sh``` for more details.
```bash
NUM_SUBSAMPLE=10
DATASET='tinyimagenet' # cifar10 # mnist
ARCH="convnet" # vgg16 # resnet18 # vit
TYPE="grad"
COMPRESSOR="tinyllama" # llama2-7b # openllama3b
SEP="hex-none" # hex-space # hex-comma+space # iso # hex-semicolon
BPG=4 # 8
for i in 1 2 3
do
python -u tokenize_dataset.py --cfg settings/compression/cifar10-$SEP.yaml \
--data-path exps/$DATASET-$ARCH/0/grads/ --bytes-per-group $BPG \
--compressor $COMPRESSOR --exhaustive-listing --num-subsample $NUM_SUBSAMPLE \
--output-name $ARCH-$DATASET-$COMPRESSOR-$SEP-$NUM_SUBSAMPLE-$TYPE-$BPG-$i
done
```
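
To picture what this step produces, here is a hedged sketch (the model id and variable names are assumptions; the actual `tokenize_dataset.py` additionally handles subsampling, separators, and caching): the serialized hex text is tokenized with the compressor's own tokenizer.

```python
import numpy as np
from transformers import AutoTokenizer

# Assumed checkpoint for the TinyLLAMA compressor; substitute the one configured in your YAML.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

grads = np.random.randn(1024).astype(np.float32)   # stand-in for a collected gradient snapshot
hex_text = grads.tobytes().hex()                   # 'hex-none': no separator between byte groups
token_ids = tokenizer(hex_text, add_special_tokens=False)["input_ids"]
print(f"{grads.nbytes} raw bytes -> {len(token_ids)} tokens")
```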
### 3. Run compression
The processed data from the previous step is divided into several disjoint windows. By default, the LLM sees 2048 tokens (including 1 BOS token) at a time. The experiments are repeated three times. See ```scripts/compress.sh``` for more details.
```bash
export HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1  # use locally cached models and datasets
NUM_SUBSAMPLE=10
DATASET='tinyimagenet' # cifar10 # mnist
ARCH="convnet" # vgg16 # resnet18 # vit
TYPE="grad"
COMPRESSOR="tinyllama" # llama2-7b # openllama3b
SEP="hex-none" # hex-space # hex-comma+space # iso # hex-semicolon
BATCHSIZE=4 # depending on your GPUs
BPG=4 # 8
for i in 1 2 3
do
python -u compress.py -cfg settings/compression/cifar10-$SEP.yaml --compressor $COMPRESSOR --dataset tokenized_dataset \
--data-path ./tokenized_datasets/$ARCH-$DATASET-$COMPRESSOR-$SEP-$NUM_SUBSAMPLE-$TYPE-$BPG-$i.pkl --batch-size $BATCHSIZE
done
```
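
At its core, this step turns the LLM's next-token probabilities into code lengths via arithmetic coding. Below is a heavily simplified, hedged sketch using [torchac](https://github.com/fab-jul/torchac); window handling, batching, and exact BOS bookkeeping are omitted, and the model id is an assumption.

```python
import torch
import torchac
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"          # assumed compressor checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

ids = tokenizer("0000803e0000c0bf" * 16, return_tensors="pt")["input_ids"]  # one toy hex window
with torch.no_grad():
    logits = model(ids).logits                           # (1, T, vocab): next-token predictions
probs = torch.softmax(logits[:, :-1].float(), dim=-1)    # model's distribution over ids[:, 1:]

# Per-symbol CDFs (last dim = vocab + 1), then arithmetic-encode the actual tokens.
cdf = torch.cumsum(probs, dim=-1).clamp(max=1.0)
cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1).cpu()
byte_stream = torchac.encode_float_cdf(cdf, ids[:, 1:].to(torch.int16).cpu(), check_input_bounds=True)
print(f"{ids.shape[1]} tokens -> {len(byte_stream)} bytes")
```

Decoding would run the same model autoregressively to regenerate identical CDFs for `torchac.decode_float_cdf`, which is what makes the scheme lossless.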

## Options

### More LLMs

### More models to compress

### Ablation study
- Bytes per group
- Context window size

## TO-DO
- [x] prepare `pipeline.sh`
- [x] sanity check
- [ ] how to add more LLMs
- [ ] provide a runnable encode/decode example
- [ ] Baseline codec
## License

Distributed under the MIT License. See [MIT License](https://opensource.org/licenses/MIT) for more information.

## Acknowledgments
This project is partially built upon [Deepmind's work](), and the README template comes from [makeread.me](https://github.com/ShaanCoding/ReadME-Generator).
