4090 performance #6436
Replies: 12 comments 20 replies
-
Try a higher batch size and ask the user to test with the same one for a well-rounded comparison. Many users are doing 1 image at 512x512, and how fast it appears may be a system thing. Perform the same tests with the same sampler, on the same commit.
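To see why comparing raw it/s across different batch sizes is misleading, note that one sampler iteration advances the whole batch by one step, so images per second is the fairer metric. A minimal sketch with hypothetical numbers (the it/s figures below are made up for illustration):

```python
def throughput(its_per_sec: float, batch_size: int, steps: int = 20) -> float:
    """Images per second: a batch of N images finishes together after
    `steps` sampler iterations, however large the batch is."""
    seconds_per_batch = steps / its_per_sec
    return batch_size / seconds_per_batch

# Hypothetical: batch 1 at 25 it/s vs batch 8 at 7 it/s.
single = throughput(25.0, 1)   # 1.25 images/s
batched = throughput(7.0, 8)   # 2.8 images/s -- "slower" it/s, higher throughput
```

So a run that reports lower it/s at a larger batch can still produce images faster overall, which is why both sides of a comparison need the same batch size.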
-
Are you guys manually updating and installing the latest CUDA for this, or is it part of a git pull from Automatic1111?
-
My 4090's performance also isn't looking amazing. Depending on the batch, I might get 15-20 it/s. This is with the latest cudnn DLLs and CUDA 11.7, on Windows 10. I did the other tricks like disabling browser hardware acceleration, etc. One question I have for you @playlogitech: what are the temps like in your computer/GPU? I noticed that due to poor cooling, mine keeps bumping up against 88C. That might explain at least some of the performance lag due to throttling? I'm going to keep monitoring and comparing based on this factor. I couldn't quite figure out how to get PyTorch 2 to work, though I did find a CUDA 11.8 (Lovelace support, IIUC) wheel: https://download.pytorch.org/whl/nightly/cu118/torch-2.1.0.dev20230308%2Bcu118-cp310-cp310-win_amd64.whl
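For monitoring the throttling theory, one way is to poll `nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader,nounits` during a generation and flag samples near the throttle range. A sketch that parses that CSV output (the sample line and the 83C threshold are assumptions, not measured values; in real use you'd feed it live `nvidia-smi` output via `subprocess`):

```python
THROTTLE_SUSPECT_C = 83  # assumed threshold; cards start pulling clocks back somewhere in the 80s

def parse_gpu_status(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=temperature.gpu,clocks.sm
    --format=csv,noheader,nounits` output into temp/clock fields."""
    temp, sm_clock = (int(v.strip()) for v in csv_line.split(","))
    return {
        "temp_c": temp,
        "sm_clock_mhz": sm_clock,
        "throttle_suspect": temp >= THROTTLE_SUSPECT_C,
    }

# Hypothetical sample resembling the 88C case described above:
hot = parse_gpu_status("88, 2310")
cool = parse_gpu_status("70, 2745")
```

If the SM clock drops whenever `throttle_suspect` samples appear, throttling is a plausible explanation for the it/s gap.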
-
Are you both using --xformers? Edit: Can you also check whether your GPUs are overclocked?
-
Yea, I'm using
-
Me too, on Ubuntu 18.04, RTX 4090, libcudnn 8.7: 512x512, Euler A, 23-24.5 it/s.
-
OK, what the devil am I doing wrong here? I am on the latest drivers on Arch Linux and can't get anywhere near that many iterations. I even updated torch to a recent nightly build and am only getting <9 it/s (yes, on 512x512, Euler A, batch size 1). How can I verify which libcudnn is being utilized? My starting args are:
The driver in use is nvidia.
Sidenote: the GPU is a 4090 Suprim Liquid, so thermals can be ruled out; the fans barely even bother to turn on.
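On the "which libcudnn is being utilized" question: in a live session `torch.backends.cudnn.version()` reports the loaded version, and on Linux you can also grep the shared objects mapped into the running python process via `/proc/<pid>/maps`. A sketch of the latter approach, run here against a hypothetical maps excerpt rather than a real process:

```python
import re

def cudnn_libs(maps_text: str) -> set:
    """Extract libcudnn shared-object paths from /proc/<pid>/maps content.
    In real use: maps_text = open(f"/proc/{pid}/maps").read()"""
    return set(re.findall(r"\S*libcudnn\S*\.so[\w.]*", maps_text))

# Hypothetical /proc/self/maps excerpt from a running webui process:
sample = (
    "7f2a00000000-7f2a10000000 r-xp 00000000 08:01 123 /usr/lib/libcudnn.so.8.7.0\n"
    "7f2a20000000-7f2a30000000 r-xp 00000000 08:01 456 /usr/lib/libcudnn_ops_infer.so.8.7.0\n"
)
loaded = cudnn_libs(sample)
```

The version suffix on the mapped `.so` filename tells you which cudnn the process actually loaded, regardless of what else is installed on the system.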
-
Where do you guys download the latest cudnn DLLs?
-
Guys, if you are on the latest torch from the nightly builds, you can stop using xformers altogether and use the console argument --opt-sdp-attention instead.
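For context, `--opt-sdp-attention` switches the webui to PyTorch 2's built-in `torch.nn.functional.scaled_dot_product_attention`. As a rough illustration of what that fused kernel computes, here is the reference formula, softmax(QK^T / sqrt(d)) V, in plain Python (this is only the math, not the optimized implementation, and the tiny matrices are made up):

```python
import math

def sdp_attention(Q, K, V):
    """Reference scaled dot-product attention for 2-D lists of floats.
    The PyTorch 2 kernel computes the same result, fused and on-GPU."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled similarity of this query against every key.
        scores = [sum(q[i] * k[i] for i in range(d)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted mix of the value rows.
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return out

result = sdp_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Both xformers and SDP attention accelerate exactly this computation; the nightly-torch route just removes the extra dependency.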
-
I am getting 5 it/s at best, usually <2... on my 4090, and that's with the new PyTorch...
-
I've followed all the steps above and have an odd observation. I can get 28-35 it/s only if my console window is visible. If it's minimized, I see a drop of roughly 10 it/s. I don't think this is an illusion; there is a noticeable speed difference when watching the generated images pop up in the webui viewing window. Also, if I do a run with the console in view and the next one minimized, the first few generated images report the same top speeds, but by around the 4th or 5th image the speed falls off to the degraded level. Attached a snapshot to show this:
I'm doing the following test:
Here is my system information. How does this look?
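To make that "falls off by the 4th or 5th image" pattern measurable rather than eyeballed, one could log the reported it/s per image and find the first one that drops more than some tolerance below the starting speed. A sketch with hypothetical numbers resembling the run described above:

```python
def find_falloff(its_per_image, tolerance=0.15):
    """Return the 1-based index of the first image whose it/s falls more than
    `tolerance` (as a fraction) below the first image's speed, or None."""
    baseline = its_per_image[0]
    for i, v in enumerate(its_per_image, start=1):
        if v < baseline * (1 - tolerance):
            return i
    return None

# Hypothetical minimized-console run: fast start, degradation by image 5.
run = [31.0, 30.5, 30.8, 28.0, 22.5, 21.0]
drop_at = find_falloff(run)  # 5
```

Comparing `drop_at` between a console-visible run and a minimized run would confirm whether the degradation really correlates with window state.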
-
Should we still be installing the CUDA 11.8 DLLs? I'm confused why that would be necessary if CUDA 11.8 is installed along with Torch 2.
-
Before the cudnn libs update I got less than 18 it/s on 512x512 Euler A; after the update, 23-25.
Today I saw a guy with 30 it/s on the same settings, and he doesn't understand why there's such a big difference.
I asked a few more people to test their 4090s and all of them have the same results as me: 23-25 it/s.
The only difference between us is the Windows version (I use W10, he is on W11), but I'm not really sure that's the reason.