
Does the full fine-tuning script for HunyuanVideo support multi-node parallel training? #138

Open
QingQingS opened this issue Jan 9, 2025 · 4 comments

@QingQingS

I've recently been trying to fine-tune HunyuanVideo and would like to ask whether the code directly supports multi-node parallel training. Also, can the training data go up to 720p with more frames? I see that the data preprocessing code sets max_height=480, max_width=848, num_frames=93.

Thanks to the authors for open-sourcing this work, it has been very inspiring. Thank you!

@BrianChen1129
Collaborator

Our Hunyuan full fine-tune supports multi-GPU training, and you can use 720p (1280×720) and 125 frames.
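For reference, hitting that resolution and length means raising the preprocessing limits quoted in the question. A minimal sketch, assuming those settings are plain variables or arguments of the preprocessing script (the exact names should be checked against the repo):

```python
# Hedged example: the preprocessing limits quoted above, raised to 720p / 125 frames.
max_height = 720    # was 480
max_width = 1280    # was 848
num_frames = 125    # was 93
```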

@jzhang38
Collaborator

jzhang38 commented Jan 9, 2025

You can use more than 125 frames as long as: 1. you have long enough data; 2. you correctly set the number of frames during preprocessing; 3. you have enough cards to sufficiently shard the sequence with context parallelism.
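To make point 3 concrete, here is a rough back-of-the-envelope check, not code from this repo: the compression factors are assumptions based on HunyuanVideo's published design (×4 causal temporal VAE, ×8 spatial VAE, 2×2 patchify), and sp_size is a hypothetical sequence-parallel degree, so verify both against the actual preprocessing and parallel config.

```python
# Hedged sketch: estimate the latent sequence and check that it shards evenly.
# All compression factors below are assumptions; confirm them in the repo.

def latent_shape(num_frames, height, width):
    t = (num_frames - 1) // 4 + 1   # causal temporal compression x4 (assumed)
    h = height // 8 // 2            # spatial VAE x8, then 2x2 patchify (assumed)
    w = width // 8 // 2
    return t, h, w

t, h, w = latent_shape(num_frames=125, height=720, width=1280)
seq_len = t * h * w                  # 32 * 45 * 80 = 115,200 tokens
sp_size = 8                          # hypothetical context/sequence-parallel degree

# Whichever axis the repo actually shards (frames or the full token sequence),
# it has to divide evenly across the parallel group.
assert seq_len % sp_size == 0, "sequence does not shard evenly across sp_size ranks"
```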

@QingQingS
Author

Thanks! One more question: precomputing the VAE and text encoder outputs ahead of time takes up too much storage, so I'd like to move the VAE module back into the training loop. But that will consume part of the GPU memory, and FSDP alone may not be enough. So I'd like to ask whether I can follow the official PyTorch tensor-parallel tutorial and directly add TP operations such as ColwiseParallel and RowwiseParallel on top of this codebase. What should I pay attention to when doing so? The tutorial is rather brief, and I haven't found any better references.

https://pytorch.org/tutorials/intermediate/TP_tutorial.html
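For reference, the core pattern from that tutorial looks roughly like the sketch below. This is not code from this repo: MLP, fc1, and fc2 are placeholder names rather than the actual HunyuanVideo module names, and the TP degree of 4 is arbitrary.

```python
# Minimal sketch of PyTorch tensor parallelism (run under torchrun with 4 GPUs).
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class MLP(nn.Module):
    """Placeholder feed-forward standing in for one transformer block's MLP."""
    def __init__(self, dim=3072, hidden=12288):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

# 1D mesh over the GPUs that form one tensor-parallel group.
tp_mesh = init_device_mesh("cuda", (4,))

block = MLP().cuda()

# Shard fc1 column-wise and fc2 row-wise so the intermediate activation stays
# sharded and only fc2's output needs a single all-reduce.
parallelize_module(
    module=block,
    device_mesh=tp_mesh,
    parallelize_plan={
        "fc1": ColwiseParallel(),
        "fc2": RowwiseParallel(),
    },
)
```

When composing this with FSDP as in the tutorial's 2D example, the TP plan is usually applied first (typically within a node) and FSDP then wraps the model over the remaining data-parallel mesh dimension; for attention layers the head count also needs to be divisible by the TP degree, and any fused QKV projections need a matching sharding plan.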

@zhuhz22

zhuhz22 commented Jan 14, 2025


Hi @QingQingS, may I ask whether you have solved this issue, and whether you have already tried full fine-tuning with the code in this repo? I tried full fine-tuning with bs=8, but the model's performance declined severely within 200 steps. If you have tried fine-tuning, could you please share whether your training was successful, so that I can determine whether the issue is due to the small batch size or related to the code?
