Add Kandinsky 4.0 pipelines #181
- added t2v-flash pipeline
- added option to render image instead of video
- updated GUI for settings
- Kandinsky 4.0 is now default model
- "low-hanging" optimizations applied (torchao quantization, VAE slicing and tiling) (see the sketch below)
- default PyTorch CUDA version is now 12.4 (because of Flash Attention 2)
- 'torchao' and 'optimum-quanto' added to main requirements.txt
- 'your gradio is too old' warning replaced
- regular t2i, i2i etc. tabs are hidden on K-4 activation (but still accessible via "send-to" buttons)
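As a rough illustration of the "low-hanging" optimizations item above, this is the general pattern for torchao weight-only quantization plus VAE slicing/tiling on a diffusers-style pipeline. The pipeline and attribute names here are generic placeholders, not the exact K4 wiring in this repo:

```python
# Sketch: torchao int8 weight-only quantization + VAE slicing/tiling.
# `pipe` is a stand-in for a diffusers-style video pipeline; the attribute names
# (text_encoder, transformer, vae) follow the usual diffusers convention.
import torch
from torchao.quantization import quantize_, int8_weight_only

def apply_low_hanging_optimizations(pipe):
    # 8-bit weight-only quantization of the heavy submodules
    for module in (pipe.text_encoder, pipe.transformer, pipe.vae):
        quantize_(module, int8_weight_only())

    # Decode latents in slices/tiles to cap VAE memory spikes
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```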
Commit e9824a4 added support for the K4.0 T2V-Flash pipeline (currently the only x2V pipeline with open weights). It's not on the main branch yet - more testing/debugging is needed, and the V2A pipeline still has to be added. Tested on Windows 11, 64 GB RAM, RTX 3090, SATA SSD.

### Examples of video generation

https://files.catbox.moe/kkznjl.mp4

### Optimizations

All currently implemented optimizations (8-bit quantization of the text encoder, VAE and transformer; VAE tiling/slicing) are enabled by default. They can be turned off, if desired, from Settings -> Native tab, but without them VRAM usage quickly rises above 24 GB. There is a checkbox for enabling model offloading, but it's currently not working 🦀

With all current optimizations enabled, peak VRAM usage for a 16:9 [672x384] 12 sec. video is ~18 GB.

Concerning RAM: on load (before quantization and transferring to the GPU), the DiT (a 22.8 GB .pt file) peaks at 40+ GB of RAM. I tried to decrease this by converting the DiT weights to 8-bit FP .sft (in that case they occupy ~6 GB and peak RAM usage on load is ~22 GB), but it is not functional yet, because during inference the K4 DiT pipeline uses some CUDA operations for which torch (still 2.4.1) has limited or no FP8 support, so I am getting errors. (I am also researching the possibility of using GGUF quants, but that's for another day 🐌)

Eventually, the system requirements should not exceed those of CogVideoX-5B, because K4 shares the same text encoder and VAE as CogVideoX, and their custom diffusion transformer also has 5B parameters.

### FlashAttention

Another important thing to consider is that the Kandinsky 4 pipeline requires FlashAttention. But on Windows, unlike Linux, one does not simply install FlashAttention. Either an appropriate prebuilt FlashAttention wheel, compatible with your Python/Torch/CUDA setup, has to be found, or you have to build it from source. Personally, with Python 3.10 / PyTorch 2.4.1 / CUDA 12.7, I was able to use this wheel. For building your own wheel, the procedure is well described here. I tested it myself and can confirm it works if you follow all the steps in exact order. The only change is that I had to install VS2019 Community, because the MSVC C++ build tools from VS2022 threw an error 🤷

upd. 02/01/2025: Then replace the path to the DiT on the T2V tab (there is a field called 'DiT Checkpoint Path' at the bottom of the page) with the path to the converted weights file.
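For reference, a minimal sketch of the kind of conversion described above (casting the DiT .pt checkpoint to 8-bit float and saving it as safetensors). File names and checkpoint layout are placeholders, not the actual script from the repo, and as noted the FP8 weights are not yet usable at inference time:

```python
# Sketch: convert a large .pt DiT checkpoint to float8 safetensors to cut RAM on load.
# Paths and key layout are hypothetical; torch >= 2.1 is assumed for float8_e4m3fn.
import torch
from safetensors.torch import save_file

ckpt = torch.load("kandinsky4_dit.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # handle either a raw or wrapped state dict

fp8_state = {}
for name, tensor in state_dict.items():
    if tensor.is_floating_point():
        # Cast weights to FP8; torch 2.4.x still lacks kernels for many FP8 ops,
        # so these weights have to be upcast again before inference.
        fp8_state[name] = tensor.to(torch.float8_e4m3fn).contiguous()
    else:
        fp8_state[name] = tensor

save_file(fp8_state, "kandinsky4_dit_fp8.safetensors")
```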
- add pipeline and optimizations for running v2a on a 24 GB VRAM GPU
- rework procedure of applying quantization to K4 pipelines
- adds multiple hacks to prevent errors
- adds tab for uploading and converting video to video+sound (see the GUI sketch below)
- adds 'send to v2a' button to t2v tab
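A rough idea of what such a tab could look like in gradio. Component labels and the `run_v2a` entry point are hypothetical stand-ins, not the repo's actual code:

```python
# Sketch of a gradio tab for video -> video+sound conversion.
# `run_v2a` is a placeholder for the actual Kandinsky-4 V2A call in the repo.
import gradio as gr

def run_v2a(video_path: str, prompt: str) -> str:
    # Placeholder: would invoke the K4 V2A pipeline and return a path
    # to the video re-muxed with the generated audio track.
    raise NotImplementedError

with gr.Blocks() as demo:
    with gr.Tab("Video to Audio"):
        in_video = gr.Video(label="Source video")
        prompt = gr.Textbox(label="Audio prompt")
        out_video = gr.Video(label="Video with generated sound")
        gr.Button("Generate audio").click(run_v2a, inputs=[in_video, prompt], outputs=out_video)

# demo.launch()
```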
Commit 5783ea5 added the v2a pipeline based on Kandinsky-4.

### Examples of video-to-audio generation

https://files.catbox.moe/meslss.mp4

### Optimizations

CogVLM2-Video-Llama3-Chat, used in the v2a pipeline, is a 12B model; fully loaded in fp32 it alone pushes VRAM usage to 20+ GB, so the whole pipeline won't fit into 24 GB without quantization. For some reason I had trouble applying optimum-quanto and torchao, so for the v2a pipeline bitsandbytes quantization is currently used. All required optimization flags are enabled by default. The CogVLM2 model is quantized using the nf4 method, while the VAE and UNet (a custom-trained Riffusion checkpoint) are int8 (though nf4 is possible there too). With this, overall peak VRAM usage is ~20 GB (without model offloading). The current implementation apparently has a memory leak, because VRAM consumption rises on consecutive runs (it should not).

### Triton

While the t2v pipeline boasted a mandatory FlashAttention requirement, v2a has another menace for Windows users: CogVLM2 won't work without Triton. Thankfully, the community has already gained experience with installing Triton on Windows (probably thanks to Hunyuan Video), so check this repo (https://github.com/woct0rdho/triton-windows) for precompiled wheels and detailed instructions on installing and debugging installation errors.
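For illustration, this is roughly what nf4-quantized loading of CogVLM2 with bitsandbytes looks like via transformers. The model id and compute dtype are my assumptions, and the repo may wire the quantization up differently:

```python
# Sketch: load CogVLM2-Video in 4-bit (nf4) with bitsandbytes via transformers.
# Model id and compute dtype are assumptions, not necessarily what the repo uses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "THUDM/cogvlm2-video-llama3-chat"  # assumed HF id of the 12B model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```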