
Generation still taking an hour on 12 GB GPUs using quantized models #22

Open
halr9000 opened this issue Feb 8, 2025 · 5 comments


@halr9000

halr9000 commented Feb 8, 2025

Console output is below. Using defaults of 2 segments w/3000 tokens. This is repeatable using full, 12GB, and 10GB quantized models on my 3060 w/12GB VRAM.

Any ideas if the warnings here indicate an issue worth fixing, or is this normal?

Update: another user on Pinokio is seeing the same with a 4070s that also has 12GB VRAM. Discord link.

Had Gemini parse the console output to give me timing, which surpassed my expectations. :)

[Image: Gemini's timing breakdown of the console output]

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.14it/s]
E:\hal\pinokio\api\yue.git\app\env\Lib\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
E:\hal\pinokio\api\yue.git\app\inference\gradio_server.py:135: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  parameter_dict = torch.load(args.resume_path, map_location='cpu')
************ Memory Management for the GPU Poor (mmgp 3.1.4-15) by DeepBeepMeep ************
You have chosen a profile that requires at least 32 GB of RAM and 12 GB of VRAM. Some RAM is consumed to reduce VRAM consumption. 
Quantization of model 'transformer' started to format 'quanto.qint8'
Quantization of model 'transformer' done
Pinning data of 'transformer' to reserved RAM
The whole model was pinned to reserved RAM: 26 large blocks spread across 6266.83 MB
Hooked to model 'transformer' (LlamaForCausalLM)
Hooked to model 'stage2' (LlamaForCausalLM)
* Running on local URL:  http://localhost:42003

  0%|                                                                                                                                              | 0/2 [00:00<?, ?it/s]---Stage 1.1: Generating Sequence 1 out of 2
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
E:\hal\pinokio\api\yue.git\app\env\Lib\site-packages\transformers\generation\utils.py:2139: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
  warnings.warn(
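
For reference, the device and `torch.load` warnings above map to the usual transformers/torch remedies. A minimal sketch of what the libraries generically expect is below; it likely does not apply here, since mmgp appears to keep the model pinned in CPU RAM by design, and the model id and checkpoint path are placeholders, not names from this repo.

```python
# Sketch only: the generic fixes these warnings ask for in a plain transformers
# setup. With mmgp the model intentionally stays pinned in CPU RAM, so they may
# not apply here; model id and checkpoint path below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<stage1-checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
model.to("cuda")  # addresses "Flash Attention 2.0 with a model not initialized on GPU"

# Passing the full tokenizer output keeps input_ids and attention_mask together
# and on the same device as the model, which covers the other two warnings.
inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)

# The torch.load FutureWarning only asks for an explicit weights_only=True:
state_dict = torch.load("<resume_path>.pt", map_location="cpu", weights_only=True)
```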
halr9000 changed the title from "Generation taking an hour on a 3060 w/12 GB using quantized models" to "Generation still taking an hour on 12 GB GPUs using quantized models" on Feb 8, 2025
@deepbeepmeep
Owner

The warnings should be ignored.

I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).

On the other hand, there are probably more sub-stages in stage 2.
It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).
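
If it helps, something along these lines can produce that filtered paste. This is a rough sketch, assuming the progress lines literally start with "tokens:"; the script name is made up.

```python
# filter_log.py (hypothetical helper): drop "tokens:" progress lines from a
# console capture, but keep the last "tokens:" line of each consecutive run.
import sys

lines = sys.stdin.read().splitlines()
kept = []
for i, line in enumerate(lines):
    is_tokens = line.lstrip().startswith("tokens:")
    next_is_tokens = i + 1 < len(lines) and lines[i + 1].lstrip().startswith("tokens:")
    # Drop a "tokens:" line only when another "tokens:" line follows it.
    if not (is_tokens and next_is_tokens):
        kept.append(line)
print("\n".join(kept))
```

Usage would be something like `python filter_log.py < console.log > filtered.log`.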

@olilanz

olilanz commented Feb 9, 2025

I can confirm similar overall speeds on an RTX 3060 12 GB, using profile 3.

@joiemoie

> The warnings should be ignored.
>
> I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).
>
> On the other hand, there are probably more sub-stages in stage 2. It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).

Hi, general question: would this repo still get a speedup on beefy GPUs like an A100?

@halr9000
Author

> I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).

Oh I think it was hallucinating 😆 but I did get some value out of the analysis.

> It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).

--Stage 1.1: Generating Sequence 1 out of 4
---Stage 1.2: Generating Sequence 2 out of 4
---Stage 1.3: Generating Sequence 3 out of 4
---Stage 1.4: Generating Sequence 4 out of 4
---Stage 2.1: Sampling Vocal track
Segment 1 / 5
['./output\\stage2\\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_vtrack.npy', './output\\stage2\\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_itrack.npy']
Stage 2 DONE.

Processing ./output\stage2\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_vtrack.npy
Compressed shape: (8, 5398)
Decoded in 0.07s (1579.49x RTF)
Saved: ./output\vocoder\stems\vtrack.mp3
Processing ./output\stage2\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_itrack.npy
Compressed shape: (8, 5398)
Decoded in 0.07s (1636.89x RTF)
Saved: ./output\vocoder\stems\itrack.mp3
Created mix: ./output\vocoder\mix\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_mixed.mp3
Successfully created 'atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_mixed.mp3' with matched low-frequency energy.

@deepbeepmeep
Owner

Unfortunately, the information about the duration is missing (I need the last "tokens:" line of each sequence). In any case, according to your log you are generating 4 segments, which means a 2-minute song if each segment has 3000 tokens. If that's the case, you are probably running out of VRAM, which triggers swapping between the CPU and the VRAM and is quite slow. Please check that you generate only 2 segments of 3000 tokens.
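
For anyone who wants to confirm the swapping theory, a quick sanity check along these lines can help. This is a sketch only: it assumes a CUDA build of PyTorch, and the helper name is illustrative, not something from this repo.

```python
# Sketch: print VRAM headroom before/after a stage-1 generation step. If "free"
# sits near zero while tokens/s collapses, CPU<->VRAM swapping is the likely
# culprit rather than the warnings above. (Helper name is illustrative.)
import torch

def report_vram(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()      # bytes
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    gib = 2 ** 30
    print(f"[{tag}] total={total / gib:.1f} GiB  free={free / gib:.1f} GiB  "
          f"allocated={allocated / gib:.1f} GiB  reserved={reserved / gib:.1f} GiB")

report_vram("before stage 1")
# ... run one generation step here ...
report_vram("after stage 1")
```

Watching `nvidia-smi` while stage 1 runs gives roughly the same picture.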
