Add Kandinsky 4.0 pipelines #181

Open · 2 of 5 tasks
seruva19 opened this issue Dec 13, 2024 · 2 comments
Labels: enhancement (New feature or request)

seruva19 commented Dec 13, 2024

seruva19 added the enhancement label and self-assigned this issue Dec 13, 2024
seruva19 added a commit that referenced this issue Dec 17, 2024
- added t2v-flash pipeline
- added option to render image instead of video
- updated GUI for settings
- Kandinsky 4.0 is now default model
- "low-hanging" optimizations applied (torchao quantization, vae slicing and tiling)
- default pytorch cuda version is now 12.4 (because of flash attention 2)
- 'torchao' and 'optimum-quanto' in main requirements.txt
- 'your gradio is too old' warning replaced
- regular t2i, i2i etc. tabs are hidden on k-4 activation (but still accessible via "send-to" buttons)
seruva19 commented Dec 17, 2024

Commit e9824a4 added support for the K4.0 T2V-Flash pipeline (currently the only x2V pipeline with open weights). It's not on the main branch yet; I need to do more testing/debugging, and to add the V2A pipeline, too.

Tested on Windows 11, 64 GB RAM, RTX 3090, SATA SSD.

examples of video generation

https://files.catbox.moe/kkznjl.mp4
https://files.catbox.moe/g2g2xd.mp4
https://files.catbox.moe/jzg1vm.mp4
https://files.catbox.moe/8652be.mp4

examples of image output mode

Screenshot 2024-12-17 130657

Screenshot 2024-12-17 142547

Optimizations

All currently implemented optimizations (8-bit quantization of the text encoder, VAE and transformer, plus VAE tiling/slicing) are enabled by default. They can be turned off, if desired, from the Settings -> Native tab, but without them VRAM usage quickly rises above 24 GB.
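
For reference, a minimal sketch of how these "low-hanging" optimizations can be wired together with torchao and diffusers-style components (attribute names and call sites are my assumptions, not the actual Kubin code):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

def apply_low_hanging_optimizations(pipe, device="cuda"):
    # Hypothetical pipeline object with CogVideoX-style components
    # (text_encoder, transformer, vae); attribute names are assumptions.
    pipe.to(device)

    # 8-bit weight-only quantization of the heavy modules via torchao.
    for module in (pipe.text_encoder, pipe.transformer, pipe.vae):
        quantize_(module, int8_weight_only())

    # VAE slicing/tiling: decode latents in chunks instead of one huge
    # tensor, trading a bit of speed for much lower peak VRAM.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```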

There is a checkbox for enabling model offloading, but it's currently not working 🦀

With all current optimizations enabled, peak VRAM usage for a 16:9 (672x384) 12-second clip is ~18 GB.
A fresh-start generation takes approx. 135 seconds; subsequent generations take ~60 seconds.

Concerning RAM: on load (before quantization and transfer to the GPU), the DiT (a 22.8 GB .pt file) peaks at 40+ GB of RAM. I tried to decrease this by converting the DiT weights to 8-bit FP .sft (they then occupy ~6 GB and peak RAM usage on load is ~22 GB), but this is not functional yet: during inference the K4 DiT pipeline uses CUDA operations for which torch (still 2.4.1) has limited or no FP8 support, so errors like "div_true_cuda" not implemented for 'Float8_e4m3fn' are inevitable.
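
The failure is easy to reproduce outside the pipeline; on torch 2.4.1 even a plain division on an FP8 tensor raises essentially the same error (a tiny repro, not Kubin code):

```python
import torch

# Casting a tensor to 8-bit float works fine on torch >= 2.1.
x = torch.randn(4, 4, device="cuda").to(torch.float8_e4m3fn)

# Elementwise arithmetic is not implemented for this dtype on CUDA in 2.4.1:
# RuntimeError: "div_true_cuda" not implemented for 'Float8_e4m3fn'
y = x / 2.0
```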

(I am also researching the possibility of using GGUF quants, but that's for another day 🐌)

Eventually, the system requirements should not exceed those of CogVideoX-5B, because K4 shares the same text encoder and VAE as CogVideoX, and its custom diffusion transformer also has 5B parameters.

FlashAttention

Another important thing to consider is that the Kandinsky 4 pipeline requires FlashAttention to be installed.
(There should be a way to make it work without FlashAttention, but I haven't found it yet.)
(I would also like to try replacing it with SageAttention; there is yet another non-working checkbox for that in the settings.)

But on Windows, unlike Linux, one does not simply install FlashAttention.

Either an appropriate prebuilt FlashAttention wheel, compatible with your Python/Torch/CUDA setup, has to be found, or you have to build it from source. Personally, with Python 3.10 / PyTorch 2.4.1 / CUDA 12.7, I was able to use this wheel, taken from here. You can also try prebuilt wheels from other sources, like this.

For building your own wheel, the procedure is well described here. I tested it myself and can confirm it works if you follow all the steps in exact order. The only change was that I had to install VS2019 Community, because the MSVC C++ build tools from VS2022 threw an error 🤷
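
Whichever way the wheel is obtained, a quick smoke test confirms the binary actually matches your Python/Torch/CUDA combination before running the full pipeline (a generic check, not part of Kubin):

```python
import torch
import flash_attn
from flash_attn import flash_attn_func

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("flash_attn:", flash_attn.__version__)

# Tiny forward pass: (batch, seqlen, nheads, headdim) fp16 tensors on GPU.
q = k = v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v)
print("ok:", out.shape)
```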

upd. 02/01/2025
Converting the original DiT weights to bf16 and using them instead of the default transformer reduces peak RAM usage to ~22 GB. To convert, you may use the script by @kijai, or, if you have the 'Networks' extension installed, you can convert via the GUI: first convert the source .pt to .safetensors, then convert that to "torch.bfloat16".
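
If you prefer to skip the GUI, the same conversion can be done in a few lines of Python (file names are placeholders, and the .pt is assumed to be a flat state dict):

```python
import torch
from safetensors.torch import save_file

# Load the original ~22.8 GB .pt checkpoint on CPU.
state_dict = torch.load("kandinsky4_dit.pt", map_location="cpu")

# Cast floating-point weights to bf16, leave everything else untouched.
bf16_state = {
    k: v.to(torch.bfloat16) if v.is_floating_point() else v
    for k, v in state_dict.items()
}

save_file(bf16_state, "kandinsky4_dit_bf16.safetensors")
```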


Then, on the T2V tab, replace the path to the DiT (there is a field called 'DiT Checkpoint Path' at the bottom of the page) with the path to the converted weights file.

seruva19 pinned this issue Dec 17, 2024
seruva19 added a commit that referenced this issue Dec 23, 2024
- add pipeline and optimizations for running v2a on 24Gb VRAM GPU
- rework procedure of applying quantization to K4 pipelines
- adds multiple hacks to prevent errors
- adds tab for uploading and converting video to video+sound
- adds 'send to v2a' button to t2v tab
seruva19 commented Jan 2, 2025

Commit 5783ea5 added the v2a pipeline based on Kandinsky-4.
Because of some incompatibilities between libraries, it required some (hopefully temporary) monkey patching to fix errors, mostly originating from pytorchvideo and transformers.

examples of video-to-audio generation

https://files.catbox.moe/meslss.mp4
https://files.catbox.moe/priqkx.mp4
https://files.catbox.moe/698icg.mp4

Optimizations

CogVLM2-Video-Llama3-Chat, which the v2a pipeline uses, is a 12B model; fully loaded in fp32 it alone pushes VRAM usage to 20+ GB, so the whole pipeline won't fit into 24 GB without quantization. For some reason I had trouble applying optimum-quanto and torchao, so for the v2a pipeline bitsandbytes quantization is currently used. All required optimization flags are enabled by default. The CogVLM2 model is quantized with the nf4 method, while the VAE and UNet (a custom-trained Riffusion checkpoint) use int8 (though nf4 is possible for them too). With this, overall peak VRAM usage is ~20 GB (without model offloading). The current implementation apparently has a memory leak, because VRAM consumption rises with each subsequent run (it should not).
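
For illustration, this is roughly what the nf4 part of that setup looks like with bitsandbytes via transformers (the model id and loading details are illustrative, not the exact Kubin code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 for the 12B CogVLM2 captioner: 4-bit weights with bf16 compute.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

cogvlm2 = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-video-llama3-chat",  # illustrative Hub id
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=nf4_config,
)

# The VAE and the Riffusion-based UNet are quantized to int8 the same way
# (load_in_8bit=True), which keeps overall peak VRAM around ~20 GB.
```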

Triton

While the t2v pipeline boasted a mandatory FlashAttention requirement, v2a holds another menace for Windows users: CogVLM2 won't work without Triton. Thankfully, the community has already gained experience installing Triton on Windows (probably thanks to Hunyuan Video), so check this repo (https://github.com/woct0rdho/triton-windows) for precompiled wheels and detailed instructions on installing and debugging installation errors.
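
After installing a wheel, a minimal kernel is a quick way to verify that Triton actually compiles on your machine (a generic smoke test, unrelated to Kubin itself):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance adds one BLOCK-sized slice of the vectors.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)
print("triton", triton.__version__, "ok:", torch.allclose(out, x + y))
```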
