Autotp training #6922
base: master
Conversation
…-precision version before the rebase, but the grad norm differs (display issue)
@tjruwase @GuanhuaWang Let us know your plan for Domino integration. @inkcherry's memory data looks good. With Domino, we think it can have less impact on performance since TP communication can overlap with computation. @inkcherry, by design, should autotp training work with ZeRO3 as well?
For Zero3 + TP: the logic to combine the saving of HF weights for TP & DP has not been implemented yet, but it is entirely feasible and can be added in the future if needed.
Bravo @inkcherry, this is an excellent technology and a massive usability benefit for users. This is really exciting!
In terms of Domino integration, @GuanhuaWang will take the lead on that.
I would love to prioritize enabling UCP support sooner rather than later. @inkcherry, can you please share the work needed here?
def test(self):
    set_autotp_mode(training=True)
    tp_size = 4
Can you parametrize tp_size to improve coverage?
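A minimal sketch of the suggested parametrization, using standard pytest; the test name and body below are placeholders, not the PR's actual test harness:

import pytest

@pytest.mark.parametrize("tp_size", [2, 4])
def test_autotp_layer(tp_size):
    # The real test would call set_autotp_mode(training=True) and build a TP
    # group of size tp_size; only the parametrization pattern is shown here.
    assert tp_size in (2, 4)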
reuse_dist_env = True
def testRowParallel(self):
    tp_size = 4
Parametrize tp_size for coverage.
@@ -339,7 +340,10 @@ class DeepSpeedZeroConfig(DeepSpeedConfigModel):
    """
    Override nn.Module apply function, for Stage 3.
    """

    autotp_size: int = Field(0, ge=0, new_param="autotp_size")
Why is autotp_size defined as a subfield of zero, instead of a top-level field in ds_config? Is there a dependency on zero logic?
    Returns:
        OrderedDict: The consolidated state dictionary if the current process rank is 0, otherwise None.
    """
    #TODO: If we use both Zero3 and tensor parallel simultaneously
Can you clarify what is meant by the gather mechanism of tensor parallelism?
Same question. I can somewhat understand it, since it looks similar to _zero3_consolidated_16bit_state_dict (DeepSpeed/deepspeed/runtime/engine.py, line 3574 in f2cc809):
def _zero3_consolidated_16bit_state_dict(self, exclude_frozen_parameters=False):
@skyshine102, thanks for the comment. A key difference between ZeRO-3 and TP is that partitioned ZeRO-3 modules are materialized with an allgather before compute, whereas TP modules compute in a partitioned manner. So it is unclear to me what requires gathering for TP.
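For illustration, a minimal sketch of what gathering a TP-partitioned weight for a consolidated save could look like; the function name, the partition_dim argument, and the surrounding plumbing are assumptions for this sketch, not this PR's implementation:

import torch
from deepspeed import comm as dist

def gather_tp_weight(shard: torch.Tensor, tp_group, partition_dim: int) -> torch.Tensor:
    # All-gather each rank's shard and concatenate along the partitioned
    # dimension to rebuild the full (unpartitioned) weight for saving.
    world_size = dist.get_world_size(group=tp_group)
    shards = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(shards, shard.contiguous(), group=tp_group)
    return torch.cat(shards, dim=partition_dim)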
@@ -247,6 +248,11 @@ def _post_forward_hook(self, module, input, output):
        self._model_times.append(elapsed_time)

    def _create_model_parallel_group(self, config):

        if is_autotp_training_mode():
Conceptually, control flow for training should not reach here. I think some refactoring/restructuring is needed for code quality.
Thanks @inkcherry for this contribution. I have spent some time reading this PR and I'm happy to be involved in this discussion. (I'm not from the DeepSpeed team but a DeepSpeed user; my comments are relatively minor.)
        return Yuan_LinearALlreduce(child, self.mp_group)

    # For MLP including chunk layer.
    if 'gate_up_proj' in name or ('dense_h_to_4h' in name and 'GLM' in str(self.module)):
This additional code block is meant to handle the general "MLP including chunk layer" case, but the returned module/object carries a GLM prefix. It would be better to rename GLM_LinearLayer to something like GateUpPack_LinearLayer.
@@ -11,10 +11,12 @@
from typing import Optional
import torch
from deepspeed import comm as dist
from .layers import LinearAllreduce, LinearLayer, LmHeadLinearAllreduce
from .layers import LinearAllreduce, LinearLayer, LmHeadLinearAllreduce, Yuan_LinearALlreduce, Yuan_LinearLayer, GLM_LinearLayer, Conv_LinearALlreduce, fused_LinearLayer, conv_LinearLayer
Original coding style is LinearAllreduce instead of LinearALlreduce.
broadcast_and_check(args, bcast_rank, bcast_group)
broadcast_and_check(kwargs, bcast_rank, bcast_group)

print(f"RANK[{dist.get_rank()}]:The Dataloader has passed the TP group consistency check.")
maybe use the logger at rank 0 instead of print.
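One possible way to apply this suggestion, assuming DeepSpeed's log_dist helper (which logs only on the listed ranks):

from deepspeed.utils import log_dist

log_dist("The DataLoader has passed the TP group consistency check.", ranks=[0])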
@inkcherry I have the same question. Does this PR support a flow like https://pytorch.org/tutorials/intermediate/TP_tutorial.html#combine-tensor-parallel-with-fully-sharded-data-parallel-together ? (TP to shared weight …
FYI @tjruwase @GuanhuaWang @delock @skyshine102 context: #5445
Changes/support:
Saving checkpoints in HF format (with gather_16bit_weights_on_model_save=True in the ds config).
HF trainer dependency:
transformers: https://github.com/inkcherry/transformers/tree/ds_tp
accelerate: https://github.com/inkcherry/accelerate/tree/ds_tp
I can send these once DeepSpeed supports these APIs.
Usage:
Users do not need to modify the client code; they only need to configure the settings in the config file to achieve the desired functionality (a minimal config sketch follows below).
Below is an example of code for fine-tuning a LLaMA 2 model (SFT). It supports ZeRO-3/FSDP training, and TP training can be enabled by simply adjusting the configuration:
https://github.com/inkcherry/stanford_alpaca/commits/tp_demo_1127/
This branch contains three commits; the last two were added for quick experiments and logging purposes.
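A minimal sketch of the kind of ds config this enables. The autotp_size field is the one added under zero_optimization in this PR, and gather_16bit_weights_on_model_save is the setting mentioned above; the remaining keys and all values are illustrative only:

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                                   # ZeRO-1/2 compatible per the results below
        "autotp_size": 4,                             # TP degree; 0 disables autotp training
        "gather_16bit_weights_on_model_save": True,   # consolidate HF weights on save
    },
}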
Results:
Loss curves (gbs=16), shown for: zero3 (baseline), tp (this PR), and zero1 vs. zero1+tp (ZeRO compatibility). [curve images not reproduced here]
Performance (for your reference only):
zero3 (no acceleration enabled): 18 GB, 2.3 s/it
zero1: 38 GB, 1.30 s/it
zero1+tp: 24 GB, 1.66 s/it
Extension:
I think async-TP/Domino etc. can be implemented by inheriting a class and overriding the fwd/bwd methods; the gather/partition logic can be reused to achieve this (please correct me if I am wrong). A rough sketch of the idea follows below.
Complex sharding can also be achieved through independent partitioning and gathering. Partitioning is mandatory, while gathering is required for training.
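As a sketch of that extension point: the row-parallel all-reduce can live in an autograd function whose forward/backward an async-TP or Domino variant would override or reschedule. The class below is illustrative only and is not one of this PR's classes; it assumes deepspeed.comm's torch.distributed-style all_reduce:

import torch
from deepspeed import comm as dist

class _AllReduceForward(torch.autograd.Function):
    # Row-parallel pattern: sum the partial outputs across the TP group in the
    # forward pass; the backward pass is a pass-through (no communication).
    @staticmethod
    def forward(ctx, partial_out, mp_group):
        handle = dist.all_reduce(partial_out, group=mp_group, async_op=True)
        # An async-TP / Domino variant could return before waiting and overlap
        # this communication with independent compute, then wait() later.
        handle.wait()
        return partial_out

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient of a sum w.r.t. each partial output is the identity.
        return grad_output, None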
TODO:
Embedding vocab parallelism: currently, embedding parallelism is primarily hidden-dim parallel combined with allreduce. This approach takes advantage of efficient reduction kernels, and it is not forced on users.
In training, however, the more common method is vocab parallelism; enabling it by default can save a certain amount of GPU memory (see the sketch below).
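For reference, a minimal sketch of the Megatron-style vocab-parallel embedding pattern referred to here: each rank owns a contiguous slice of the vocabulary, out-of-range tokens contribute zeros, and the partial embeddings are summed across the TP group. Names and plumbing are illustrative, and in training the final all-reduce would normally be wrapped in an autograd function like the one sketched above:

import torch
from deepspeed import comm as dist

class VocabParallelEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim, tp_rank, tp_size, tp_group):
        super().__init__()
        self.part = vocab_size // tp_size        # vocab rows owned by this rank
        self.start = tp_rank * self.part
        self.end = self.start + self.part
        self.tp_group = tp_group
        self.weight = torch.nn.Parameter(torch.empty(self.part, hidden_dim))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, input_ids):
        # Tokens outside this rank's vocab slice produce zero vectors; the
        # all-reduce then sums the per-rank partial embeddings.
        mask = (input_ids < self.start) | (input_ids >= self.end)
        local_ids = (input_ids - self.start).clamp(0, self.part - 1)
        out = torch.nn.functional.embedding(local_ids, self.weight)
        out[mask] = 0.0
        dist.all_reduce(out, group=self.tp_group)
        return out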
Thanks to @delock for the guidance.
I also verified inference with CPU-inference workloads (Optimized Model List in https://github.com/intel/intel-extension-for-pytorch/tree/main).
Many thanks to @xuguangxin, @ikurtchen, @rogerxfeng8, @Yejing-Lai, @ys950902, etc. for helping review and address matters related to inference.