Merge LoCo with Zero++ #6730
Conversation
@microsoft-github-policy-service agree
@XingyuXie thx for this effort.
Overall looks good to me. Just left a few comments.
As required by @GuanhuaWang, we added the …
Thx @XingyuXie for the PR updates on the unit tests. Overall, it looks good to me.
### Integration of LoCo Method into ZeRO++

#### Overview

This PR introduces the integration of the **LoCo** method, as outlined in [this paper](https://arxiv.org/abs/2407.04480), into the ZeRO++ framework of DeepSpeed. The key enhancement involves applying error feedback compensation to 4-bit gradients before communication. This approach ***improves pre-training loss outcomes without additional time overhead***, though it requires extra GPU memory. The extent of this memory increase depends on model size and training configuration.
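To make the error-feedback step concrete, here is a minimal, self-contained sketch of the idea: the accumulated quantization error is added back to the fresh gradient before it is quantized to 4 bits for communication. This is an illustration only, with a toy per-tensor quantizer and hypothetical function names; the actual ZeRO++/LoCo code path uses fused GPU kernels, group-wise quantization, and the error-update rule described in the paper.

```python
import torch

def quantize_4bit(x: torch.Tensor):
    """Toy symmetric per-tensor 4-bit quantization (illustration only)."""
    scale = x.abs().max().clamp(min=1e-8) / 7.0        # int4 value range is [-8, 7]
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

def loco_style_compensate(grad: torch.Tensor, error_buf: torch.Tensor):
    """Error-feedback step before low-bit gradient communication (sketch)."""
    compensated = grad + error_buf                 # fold the previously stored error back in
    q, scale = quantize_4bit(compensated)          # this is what would be communicated
    error_buf.copy_(compensated - q * scale)       # remember the new quantization error
    return q, scale

# Usage sketch for a single gradient bucket across training steps.
grad = torch.randn(1024)
error_buf = torch.zeros_like(grad)
q, scale = loco_style_compensate(grad, error_buf)
```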
#### Experimental Results

We conducted pre-training experiments using the Llama2 architecture, adjusting the number of layers and hidden size. The experiments included:

- **A smaller-scale model with 0.8B parameters trained on 30B tokens**.
- **A larger-scale model with 8B parameters trained on 5B tokens**.

The training data was sampled from **Redpajama-V2**.

<p align="center">
  <img src="https://github.com/user-attachments/assets/e7db9487-728c-4a17-9806-c15afa12f62e" width="49%" />
  <img src="https://github.com/user-attachments/assets/3efec895-b71d-43ab-b5ce-65468ba8b9f1" width="49%" />
</p>

**Findings**:

- **Smaller Models (0.8B parameters)**: Significant gains were observed when applying the LoCo method.
- **Larger Models (8B parameters)**: The gains were present but less pronounced. This could be due to:
  1. Relatively smaller data volume.
  2. Lower pre-training loss for larger models, making significant improvements harder to achieve.

However, even a smaller pre-training loss gap in larger models can translate to meaningful gains in downstream tasks.
#### Example Script

For reference, the [run.sh](https://github.com/user-attachments/files/17679552/zeroplus-7b3.zip) script used for the 8B-parameter, 5B-token experiment is attached. The experiment was conducted using the **DeepSpeed-Megatron** platform.
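For readers who want to try the 4-bit gradient path outside of the Megatron setup, the sketch below shows how a ZeRO++ run is typically configured through the DeepSpeed config. The `zero_quantized_weights`, `zero_hpz_partition_size`, and `zero_quantized_gradients` keys are the documented ZeRO++ options; the switch that enables the LoCo compensation added in this PR is intentionally not shown, since its exact name lives in the PR diff. All numeric values and the toy model are placeholders, and the script is meant to be launched with the `deepspeed` launcher.

```python
import torch
import deepspeed

# Documented ZeRO++ options; values here are illustrative placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,    # qwZ: quantized weight communication
        "zero_hpz_partition_size": 8,      # hpZ: hierarchical partitioning (GPUs per node)
        "zero_quantized_gradients": True,  # qgZ: low-bit (4-bit) gradient communication
    },
}

# Placeholder model standing in for the real Llama2-style network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```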
#### Acknowledgments

Special thanks to @GuanhuaWang for ongoing communication and guidance throughout this work.
We appreciate your consideration of this PR and welcome any feedback or questions!

---------

Co-authored-by: ChuanxinTang <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: siqi <[email protected]>