
[Bug]: CUDA errors on 2nd training #671

Open
dxqbYD opened this issue Feb 1, 2025 · 1 comment
Labels
bug Something isn't working

Comments

dxqbYD (Collaborator) commented Feb 1, 2025

What happened?

Whenever I start a 2nd training without closing OneTrainer (OT) first, I get CUDA errors as shown below.
Reproducible with:

  • NF4 without offloading
  • FP8 with 0.3 offloading
  • BF16 with 0.8 offloading

The error appears to happen when a torch tensor is moved to CUDA, even if CUDA_LAUNCH_BLOCKING is enabled.
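
As a rough illustration (a hypothetical standalone sketch, not OneTrainer code), the failing call in both LinearFp8.quantize and LinearNf4.quantize boils down to a plain tensor move onto the CUDA device, with CUDA_LAUNCH_BLOCKING set in the environment before CUDA is initialized:

# Minimal sketch of the failing pattern; assumes a CUDA-capable machine.
# Not OneTrainer code -- just the equivalent tensor move the traceback points at.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

device = torch.device("cuda")
weight = torch.randn(1024, 1024, dtype=torch.bfloat16)

# On the 2nd training started from the same process, the equivalent call
# raises "RuntimeError: CUDA error: invalid argument".
weight = weight.to(device=device)

The first training run completes this move without issue; only a second run from the same still-open process hits the error.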

What did you expect would happen?

Relevant log output

Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
    self.model_setup.setup_optimizations(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 109, in setup_optimizations
    quantize_layers(model.text_encoder_2, self.train_device, model.text_encoder_2_train_dtype)
  File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
    child_module.quantize(device)
  File "/home/.../OneTrainer/modules/module/quantized/LinearFp8.py", line 43, in quantize
    weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
    self.model_setup.setup_optimizations(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 111, in setup_optimizations
    quantize_layers(model.transformer, self.train_device, model.train_dtype)
  File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
    child_module.quantize(device)
  File "/home/.../OneTrainer/modules/module/quantized/LinearNf4.py", line 75, in quantize
    weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 125, in start
    self.model_setup.setup_train_device(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/FluxLoRASetup.py", line 198, in setup_train_device
    model.transformer_to(self.train_device)
  File "/home/.../OneTrainer/modules/model/FluxModel.py", line 146, in transformer_to
    self.transformer_offload_conductor.to(device)
  File "/home/.../OneTrainer/modules/util/LayerOffloadConductor.py", line 647, in to
    layer.to(self.__train_device)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
    return self._apply(convert)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
    param_applied = fn(param)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
    return t.to(
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Output of pip freeze

absl-py==2.1.0
accelerate==1.0.1
aiodns==3.2.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-retry==2.9.1
aiosignal==1.3.2
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
async-timeout==5.0.1
attrs==24.3.0
av==13.1.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes==0.44.1
boto3==1.36.5
botocore==1.36.5
Brotli==1.1.0
certifi==2024.12.14
cffi==1.17.1
cfgv==3.4.0
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.16
-e git+https://github.com/huggingface/diffusers.git@c944f0651f679728d4ec7b6488120ac49c2f1315#egg=diffusers
distlib==0.3.9
dnspython==2.7.0
email_validator==2.2.0
exceptiongroup==1.2.2
fabric==3.2.2
fastapi==0.115.7
fastapi-cli==0.0.7
filelock==3.17.0
flatbuffers==25.1.21
fonttools==4.55.5
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
grpcio==1.70.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
humanfriendly==10.0
identify==2.6.6
idna==3.10
importlib_metadata==8.6.1
inquirerpy==0.3.4
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.5
jmespath==1.0.1
kiwisolver==1.4.8
lightning-utilities==0.11.9
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@fcaec253ddff9dccd0f9644836fe87b0103f23f7#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
nodeenv==1.9.1
numpy==1.26.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu11==8.7.0.84
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.560.30
nvidia-nccl-cu11==2.20.5
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.4.127
omegaconf==2.3.0
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.15
packaging==24.2
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
pre_commit==4.1.0
prettytable==3.13.0
prodigyopt==1.1.1
prompt_toolkit==3.0.50
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pycares==4.5.0
pycparser==2.22
pydantic==2.10.6
pydantic-extra-types==2.10.2
pydantic-settings==2.7.1
pydantic_core==2.27.2
Pygments==2.19.1
PyNaCl==1.5.0
pynvml==11.5.0
pyparsing==3.2.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch_optimizer==3.3.0
PyWavelets==1.8.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rich-toolkit==0.13.2
runpod==1.7.4
s3transfer==0.11.2
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
scipy==1.14.1
sentencepiece==0.2.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.45.3
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.14
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.5.1+cu124
torchmetrics==1.6.1
torchvision==0.20.1+cu124
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
ujson==5.10.0
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.1
watchdog==6.0.0
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.2
Werkzeug==3.1.3
wrapt==1.17.2
xformers==0.0.28.post3
yarl==1.18.3
zipp==3.21.0
dxqbYD added the bug (Something isn't working) label on Feb 1, 2025
ivanpoli commented Feb 7, 2025

Same for me
