
Generation still taking an hour on 12 GB GPUs using quantized models #22

Open
halr9000 opened this issue Feb 8, 2025 · 5 comments


@halr9000

halr9000 commented Feb 8, 2025

Console output is below. Using defaults of 2 segments w/3000 tokens. This is repeatable using full, 12GB, and 10GB quantized models on my 3060 w/12GB VRAM.

Any ideas if the warnings here indicate an issue worth fixing, or is this normal?

Update: another user on Pinokio is seeing the same with a 4070s that also has 12GB VRAM. Discord link.

Had Gemini parse the console output to give me timing, which surpassed my expectations. :)

[Image: Gemini's timing breakdown of the console output]

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.14it/s]
E:\hal\pinokio\api\yue.git\app\env\Lib\site-packages\torch\nn\utils\weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
E:\hal\pinokio\api\yue.git\app\inference\gradio_server.py:135: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  parameter_dict = torch.load(args.resume_path, map_location='cpu')
************ Memory Management for the GPU Poor (mmgp 3.1.4-15) by DeepBeepMeep ************
You have chosen a profile that requires at least 32 GB of RAM and 12 GB of VRAM. Some RAM is consumed to reduce VRAM consumption. 
Quantization of model 'transformer' started to format 'quanto.qint8'
Quantization of model 'transformer' done
Pinning data of 'transformer' to reserved RAM
The whole model was pinned to reserved RAM: 26 large blocks spread across 6266.83 MB
Hooked to model 'transformer' (LlamaForCausalLM)
Hooked to model 'stage2' (LlamaForCausalLM)
* Running on local URL:  http://localhost:42003

  0%|                                                                                                                                              | 0/2 [00:00<?, ?it/s]---Stage 1.1: Generating Sequence 1 out of 2
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
E:\hal\pinokio\api\yue.git\app\env\Lib\site-packages\transformers\generation\utils.py:2139: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
  warnings.warn(
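
For reference, the device and `torch.load` warnings above map to the usual transformers/torch remedies. A minimal sketch of what the libraries generically expect is below; it likely does not apply here, since mmgp appears to keep the model pinned in CPU RAM by design, and the model id and checkpoint path are placeholders, not names from this repo.

```python
# Sketch only: the generic fixes these warnings ask for in a plain transformers
# setup. With mmgp the model intentionally stays pinned in CPU RAM, so they may
# not apply here; model id and checkpoint path below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<stage1-checkpoint>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
model.to("cuda")  # addresses "Flash Attention 2.0 with a model not initialized on GPU"

# Passing the full tokenizer output keeps input_ids and attention_mask together
# and on the same device as the model, which covers the other two warnings.
inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)

# The torch.load FutureWarning only asks for an explicit weights_only=True:
state_dict = torch.load("<resume_path>.pt", map_location="cpu", weights_only=True)
```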
halr9000 changed the title from "Generation taking an hour on a 3060 w/12 GB using quantized models" to "Generation still taking an hour on 12 GB GPUs using quantized models" on Feb 8, 2025
@deepbeepmeep
Owner

The warnings should be ignored.

I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).

On the other hand, there are probably more sub-stages in stage 2.
It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).
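
If it helps, something along these lines can produce that filtered paste. This is a rough sketch, assuming the progress lines literally start with "tokens:"; the script name is made up.

```python
# filter_log.py (hypothetical helper): drop "tokens:" progress lines from a
# console capture, but keep the last "tokens:" line of each consecutive run.
import sys

lines = sys.stdin.read().splitlines()
kept = []
for i, line in enumerate(lines):
    is_tokens = line.lstrip().startswith("tokens:")
    next_is_tokens = i + 1 < len(lines) and lines[i + 1].lstrip().startswith("tokens:")
    # Drop a "tokens:" line only when another "tokens:" line follows it.
    if not (is_tokens and next_is_tokens):
        kept.append(line)
print("\n".join(kept))
```

Usage would be something like `python filter_log.py < console.log > filtered.log`.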

@olilanz

olilanz commented Feb 9, 2025

I can confirm similar overall speeds on an RTX 3060 12 GB, using profile 3.

@joiemoie

> The warnings should be ignored.
>
> I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).
>
> On the other hand, there are probably more sub-stages in stage 2. It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).

Hi, general question: would this repo still get a speedup on beefy GPUs like an A100?

@halr9000
Author

> I am not sure how Gemini managed to extract this information, since there should be only two sub-stages for stage 1 if you generate only two segments (Gemini reports four).

Oh I think it was hallucinating 😆 but I did get some value out of the analysis.

> It would help if you could copy and paste the terminal output here, omitting the lines that start with "tokens:" (except the last one of each sub-stage).

--Stage 1.1: Generating Sequence 1 out of 4
---Stage 1.2: Generating Sequence 2 out of 4
---Stage 1.3: Generating Sequence 3 out of 4
---Stage 1.4: Generating Sequence 4 out of 4
---Stage 2.1: Sampling Vocal track
Segment 1 / 5
['./output\\stage2\\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_vtrack.npy', './output\\stage2\\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_itrack.npy']
Stage 2 DONE.

Processing ./output\stage2\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_vtrack.npy
Compressed shape: (8, 5398)
Decoded in 0.07s (1579.49x RTF)
Saved: ./output\vocoder\stems\vtrack.mp3
Processing ./output\stage2\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_itrack.npy
Compressed shape: (8, 5398)
Decoded in 0.07s (1636.89x RTF)
Saved: ./output\vocoder\stems\itrack.mp3
Created mix: ./output\vocoder\mix\atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_mixed.mp3
Successfully created 'atmospheric-space-rock-male-vocals_tp0@93_T1@0_rp1@2_maxtk3000_d95c6f5a-d656-40b0-866d-58127bed3ab0_mixed.mp3' with matched low-frequency energy.

@deepbeepmeep
Owner

Unfortunately, the information about the duration is missing (I need the last "tokens:" line of each sequence). In any case, according to your log you are generating 4 segments, which means a 2-minute song if each segment has 3000 tokens. If that's the case, you are probably running out of VRAM, which triggers swapping between the CPU and the VRAM and is quite slow. Please check that you generate only 2 segments of 3000 tokens.
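
For anyone who wants to confirm the swapping theory, a quick sanity check along these lines can help. This is a sketch only: it assumes a CUDA build of PyTorch, and the helper name is illustrative, not something from this repo.

```python
# Sketch: print VRAM headroom before/after a stage-1 generation step. If "free"
# sits near zero while tokens/s collapses, CPU<->VRAM swapping is the likely
# culprit rather than the warnings above. (Helper name is illustrative.)
import torch

def report_vram(tag: str) -> None:
    free, total = torch.cuda.mem_get_info()      # bytes
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    gib = 2 ** 30
    print(f"[{tag}] total={total / gib:.1f} GiB  free={free / gib:.1f} GiB  "
          f"allocated={allocated / gib:.1f} GiB  reserved={reserved / gib:.1f} GiB")

report_vram("before stage 1")
# ... run one generation step here ...
report_vram("after stage 1")
```

Watching `nvidia-smi` while stage 1 runs gives roughly the same picture.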
