
[Bug]: CUDA errors on 2nd training #671

Open
dxqbYD opened this issue Feb 1, 2025 · 1 comment
Labels
bug Something isn't working

Comments

dxqbYD (Collaborator) commented Feb 1, 2025

What happened?

Whenever I start a 2nd training without closing OneTrainer (OT) first, I get CUDA errors as shown below.
Reproducible with:

  • NF4 without offloading
  • FP8 with 0.3 offloading
  • BF16 with 0.8 offloading

The error appears to happen when a torch tensor is moved to CUDA, even if CUDA_LAUNCH_BLOCKING is enabled.
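
As a rough illustration (a hypothetical standalone sketch, not OneTrainer code), the failing call in both LinearFp8.quantize and LinearNf4.quantize boils down to a plain tensor move onto the CUDA device, with CUDA_LAUNCH_BLOCKING set in the environment before CUDA is initialized:

# Minimal sketch of the failing pattern; assumes a CUDA-capable machine.
# Not OneTrainer code -- just the equivalent tensor move the traceback points at.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

device = torch.device("cuda")
weight = torch.randn(1024, 1024, dtype=torch.bfloat16)

# On the 2nd training started from the same process, the equivalent call
# raises "RuntimeError: CUDA error: invalid argument".
weight = weight.to(device=device)

The first training run completes this move without issue; only a second run from the same still-open process hits the error.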

What did you expect would happen?

Relevant log output

Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
    self.model_setup.setup_optimizations(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 109, in setup_optimizations
    quantize_layers(model.text_encoder_2, self.train_device, model.text_encoder_2_train_dtype)
  File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
    child_module.quantize(device)
  File "/home/.../OneTrainer/modules/module/quantized/LinearFp8.py", line 43, in quantize
    weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
    self.model_setup.setup_optimizations(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 111, in setup_optimizations
    quantize_layers(model.transformer, self.train_device, model.train_dtype)
  File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
    child_module.quantize(device)
  File "/home/.../OneTrainer/modules/module/quantized/LinearNf4.py", line 75, in quantize
    weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Traceback (most recent call last):
  File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
    trainer.start()
  File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 125, in start
    self.model_setup.setup_train_device(self.model, self.config)
  File "/home/.../OneTrainer/modules/modelSetup/FluxLoRASetup.py", line 198, in setup_train_device
    model.transformer_to(self.train_device)
  File "/home/.../OneTrainer/modules/model/FluxModel.py", line 146, in transformer_to
    self.transformer_offload_conductor.to(device)
  File "/home/.../OneTrainer/modules/util/LayerOffloadConductor.py", line 647, in to
    layer.to(self.__train_device)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
    return self._apply(convert)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
    module._apply(fn)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
    param_applied = fn(param)
  File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
    return t.to(
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Output of pip freeze

absl-py==2.1.0
accelerate==1.0.1
aiodns==3.2.0
aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-retry==2.9.1
aiosignal==1.3.2
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
async-timeout==5.0.1
attrs==24.3.0
av==13.1.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes==0.44.1
boto3==1.36.5
botocore==1.36.5
Brotli==1.1.0
certifi==2024.12.14
cffi==1.17.1
cfgv==3.4.0
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.16
-e git+https://github.com/huggingface/diffusers.git@c944f0651f679728d4ec7b6488120ac49c2f1315#egg=diffusers
distlib==0.3.9
dnspython==2.7.0
email_validator==2.2.0
exceptiongroup==1.2.2
fabric==3.2.2
fastapi==0.115.7
fastapi-cli==0.0.7
filelock==3.17.0
flatbuffers==25.1.21
fonttools==4.55.5
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
grpcio==1.70.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
humanfriendly==10.0
identify==2.6.6
idna==3.10
importlib_metadata==8.6.1
inquirerpy==0.3.4
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.5
jmespath==1.0.1
kiwisolver==1.4.8
lightning-utilities==0.11.9
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@fcaec253ddff9dccd0f9644836fe87b0103f23f7#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
nodeenv==1.9.1
numpy==1.26.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu11==8.7.0.84
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.560.30
nvidia-nccl-cu11==2.20.5
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.4.127
omegaconf==2.3.0
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.15
packaging==24.2
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
pre_commit==4.1.0
prettytable==3.13.0
prodigyopt==1.1.1
prompt_toolkit==3.0.50
propcache==0.2.1
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
pycares==4.5.0
pycparser==2.22
pydantic==2.10.6
pydantic-extra-types==2.10.2
pydantic-settings==2.7.1
pydantic_core==2.27.2
Pygments==2.19.1
PyNaCl==1.5.0
pynvml==11.5.0
pyparsing==3.2.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch_optimizer==3.3.0
PyWavelets==1.8.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rich-toolkit==0.13.2
runpod==1.7.4
s3transfer==0.11.2
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
scipy==1.14.1
sentencepiece==0.2.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.45.3
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.14
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.5.1+cu124
torchmetrics==1.6.1
torchvision==0.20.1+cu124
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
ujson==5.10.0
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.1
watchdog==6.0.0
watchfiles==1.0.4
wcwidth==0.2.13
websockets==14.2
Werkzeug==3.1.3
wrapt==1.17.2
xformers==0.0.28.post3
yarl==1.18.3
zipp==3.21.0
dxqbYD added the bug (Something isn't working) label on Feb 1, 2025
ivanpoli commented Feb 7, 2025

Same for me
