Add Kandinsky 4.0 pipelines #181

Open · 2 of 5 tasks
seruva19 opened this issue Dec 13, 2024 · 2 comments
Labels: enhancement (New feature or request)

seruva19 commented Dec 13, 2024

seruva19 added the enhancement label and self-assigned this issue Dec 13, 2024
seruva19 added a commit that referenced this issue Dec 17, 2024
- added t2v-flash pipeline
- added option to render image instead of video
- updated GUI for settings
- Kandinsky 4.0 is now default model
- "low-hanging" optimizations applied (torchao quantization, vae slicing and tiling)
- default pytorch cuda version is now 12.4 (because of flash attention 2)
- 'torchao' and 'optimum-quanto' in main requirements.txt
- 'your gradio is too old' warning replaced
- regular t2i, i2i etc. tabs are hidden on k-4 activation (but still accessible via "send-to" buttons)
seruva19 commented Dec 17, 2024

Commit e9824a4 added support for the K4.0 T2V-Flash pipeline (currently the only x2V pipeline with open weights). It's not on the main branch yet; I need to do more testing/debugging, and to add the V2A pipeline, too.

Tested on Windows 11, 64 GB RAM, RTX 3090, SATA SSD.

examples of video generation

https://files.catbox.moe/kkznjl.mp4
https://files.catbox.moe/g2g2xd.mp4
https://files.catbox.moe/jzg1vm.mp4
https://files.catbox.moe/8652be.mp4

examples of image output mode

Screenshot 2024-12-17 130657

Screenshot 2024-12-17 142547

Optimizations

All currently implemented optimizations (8-bit quantization of the text encoder, VAE and transformer, plus VAE tiling/slicing) are enabled by default. They can be turned off, if desired, from the Settings -> Native tab, but without them VRAM usage quickly rises above 24 GB.
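
For reference, a minimal sketch of how these "low-hanging" optimizations can be wired together with torchao and diffusers-style components (attribute names and call sites are my assumptions, not the actual Kubin code):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

def apply_low_hanging_optimizations(pipe, device="cuda"):
    # Hypothetical pipeline object with CogVideoX-style components
    # (text_encoder, transformer, vae); attribute names are assumptions.
    pipe.to(device)

    # 8-bit weight-only quantization of the heavy modules via torchao.
    for module in (pipe.text_encoder, pipe.transformer, pipe.vae):
        quantize_(module, int8_weight_only())

    # VAE slicing/tiling: decode latents in chunks instead of one huge
    # tensor, trading a bit of speed for much lower peak VRAM.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```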

There is a checkbox for enabling model offloading, but it's currently not working 🦀

With all current optimizations enabled, peak VRAM usage for a 16:9 (672x384) 12-second clip is ~18 GB.
A fresh-start generation takes approx. 135 seconds; subsequent generations take ~60 seconds.

Concerning RAM: on load (before quantization and transfer to the GPU), the DiT (a 22.8 GB .pt file) peaks at 40+ GB of RAM. I tried to decrease this by converting the DiT weights to 8-bit FP .sft (they then occupy ~6 GB and peak RAM usage on load is ~22 GB), but this is not functional yet: during inference the K4 DiT pipeline uses CUDA operations for which torch (still 2.4.1) has limited or no FP8 support, so errors like "div_true_cuda" not implemented for 'Float8_e4m3fn' are inevitable.
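
The failure is easy to reproduce outside the pipeline; on torch 2.4.1 even a plain division on an FP8 tensor raises essentially the same error (a tiny repro, not Kubin code):

```python
import torch

# Casting a tensor to 8-bit float works fine on torch >= 2.1.
x = torch.randn(4, 4, device="cuda").to(torch.float8_e4m3fn)

# Elementwise arithmetic is not implemented for this dtype on CUDA in 2.4.1:
# RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn'
y = x / 2.0
```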

(I am also researching the possibility of using GGUF quants, but that's for another day 🐌)

Eventually, the system requirements should not exceed those of CogVideoX-5B, because K4 shares the same text encoder and VAE as CogVideoX, and its custom diffusion transformer also has 5B parameters.

FlashAttention

Another important thing to consider is that the Kandinsky 4 pipeline requires FlashAttention to be installed.
(There should be a way to make it work without FlashAttention, but I haven't found it yet.)
(I would also like to try replacing it with SageAttention; there is yet another non-working checkbox for that in the settings.)

But on Windows, unlike Linux, one does not simply install FlashAttention.

Either an appropriate prebuilt FlashAttention wheel, compatible with your Python/Torch/CUDA setup, has to be found, or you have to build it from source. Personally, with Python 3.10 / PyTorch 2.4.1 / CUDA 12.7, I was able to use this wheel, taken from here. You can also try prebuilt wheels from other sources, like this.

For building your own wheel, the procedure is well described here. I tested it myself and can confirm it works if you follow all the steps in exact order. The only change was that I had to install VS2019 Community, because the MSVC C++ build tools from VS2022 threw an error 🤷
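
Whichever way the wheel is obtained, a quick smoke test confirms the binary actually matches your Python/Torch/CUDA combination before running the full pipeline (a generic check, not part of Kubin):

```python
import torch
import flash_attn
from flash_attn import flash_attn_func

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flash_attn:", flash_attn.__version__)

# Tiny forward pass: (batch, seqlen, nheads, headdim) fp16 tensors on GPU.
q = k = v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v)
print("ok:", out.shape)
```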

upd. 02/01/2025
Converting the original DiT weights to bf16 and using them instead of the default transformer reduces peak RAM usage to ~22 GB. To convert, you may use the script by @kijai, or, if you have the 'Networks' extension installed, you can convert via the GUI: first convert the source .pt to .safetensors, then convert that to "torch.bfloat16".
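
If you prefer to skip the GUI, the same conversion can be done in a few lines of Python (file names are placeholders, and the .pt is assumed to be a flat state dict):

```python
import torch
from safetensors.torch import save_file

# Load the original ~22.8 GB .pt checkpoint on CPU.
state_dict = torch.load("kandinsky4_dit.pt", map_location="cpu")

# Cast floating-point weights to bf16, leave everything else untouched.
bf16_state = {
    k: v.to(torch.bfloat16) if v.is_floating_point() else v
    for k, v in state_dict.items()
}

save_file(bf16_state, "kandinsky4_dit_bf16.safetensors")
```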


Then, on the T2V tab, replace the path to the DiT (there is a field called 'DiT Checkpoint Path' at the bottom of the page) with the path to the converted weights file.

seruva19 pinned this issue Dec 17, 2024
seruva19 added a commit that referenced this issue Dec 23, 2024
- add pipeline and optimizations for running v2a on 24Gb VRAM GPU
- rework procedure of applying quantization to K4 pipelines
- adds multiple hacks to prevent errors
- adds tab for uploading and converting video to video+sound
- adds 'send to v2a' button to t2v tab
seruva19 commented Jan 2, 2025

Commit 5783ea5 added the v2a pipeline based on Kandinsky-4.
Because of some incompatibilities between libraries, it required some (hopefully temporary) monkey patching to fix errors, mostly originating from pytorchvideo and transformers.

examples of video-to-audio generation

https://files.catbox.moe/meslss.mp4
https://files.catbox.moe/priqkx.mp4
https://files.catbox.moe/698icg.mp4

Optimizations

CogVLM2-Video-Llama3-Chat, which the v2a pipeline uses, is a 12B model; fully loaded in fp32 it alone pushes VRAM usage to 20+ GB, so the whole pipeline won't fit into 24 GB without quantization. For some reason I had trouble applying optimum-quanto and torchao, so for the v2a pipeline bitsandbytes quantization is currently used. All required optimization flags are enabled by default. The CogVLM2 model is quantized with the nf4 method, while the VAE and UNet (a custom-trained Riffusion checkpoint) use int8 (though nf4 is possible for them too). With this, overall peak VRAM usage is ~20 GB (without model offloading). The current implementation apparently has a memory leak, because VRAM consumption rises with each subsequent run (it should not).
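
For illustration, this is roughly what the nf4 part of that setup looks like with bitsandbytes via transformers (the model id and loading details are illustrative, not the exact Kubin code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 for the 12B CogVLM2 captioner: 4-bit weights with bf16 compute.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

cogvlm2 = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-video-llama3-chat",  # illustrative Hub id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=nf4_config,
)

# The VAE and the Riffusion-based UNet are quantized to int8 the same way
# (load_in_8bit=True), which keeps overall peak VRAM around ~20 GB.
```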

Triton

While the t2v pipeline boasted a mandatory FlashAttention requirement, v2a holds another menace for Windows users: CogVLM2 won't work without Triton. Thankfully, the community has already gained experience installing Triton on Windows (probably thanks to Hunyuan Video), so check this repo (https://github.com/woct0rdho/triton-windows) for precompiled wheels and detailed instructions on installing and debugging installation errors.
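
After installing a wheel, a minimal kernel is a quick way to verify that Triton actually compiles on your machine (a generic smoke test, unrelated to Kubin itself):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance adds one BLOCK-sized slice of the vectors.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)
print("triton", triton.__version__, "ok:", torch.allclose(out, x + y))
```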
