What happened?

Whenever I start a second training run without closing OneTrainer (OT) first, I get the CUDA errors shown below.
Reproducible with:
- NF4 without offloading
- FP8 with 0.3 offloading
- BF16 with 0.8 offloading
The error appears to happen when a torch tensor is moved to CUDA, even with CUDA_LAUNCH_BLOCKING=1 set.
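For context, the call that fails in both quantize() paths and in transformer_to() is an ordinary host-to-device copy. A minimal standalone sketch of the equivalent operation is below (this is not OneTrainer code, and on a fresh process / healthy CUDA context it succeeds; it only illustrates the kind of transfer that raises the error on the second run):

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before the CUDA context is created
# (i.e. before the first CUDA call) for kernel launches to run synchronously.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# The same kind of transfer that LinearFp8.quantize / LinearNf4.quantize
# and Module.to() perform: move a weight tensor from CPU to the GPU.
weight = torch.randn(1024, 1024, dtype=torch.bfloat16)
weight = weight.to(device="cuda")  # in the tracebacks below, this .to() raises "CUDA error: invalid argument"
print(weight.device)
```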
What did you expect would happen?
Relevant log output
Traceback (most recent call last):
File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
trainer.start()
File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
self.model_setup.setup_optimizations(self.model, self.config)
File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 109, in setup_optimizations
quantize_layers(model.text_encoder_2, self.train_device, model.text_encoder_2_train_dtype)
File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
child_module.quantize(device)
File "/home/.../OneTrainer/modules/module/quantized/LinearFp8.py", line 43, in quantize
weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
trainer.start()
File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 124, in start
self.model_setup.setup_optimizations(self.model, self.config)
File "/home/.../OneTrainer/modules/modelSetup/BaseFluxSetup.py", line 111, in setup_optimizations
quantize_layers(model.transformer, self.train_device, model.train_dtype)
File "/home/.../OneTrainer/modules/util/quantization_util.py", line 192, in quantize_layers
child_module.quantize(device)
File "/home/.../OneTrainer/modules/module/quantized/LinearNf4.py", line 75, in quantize
weight = weight.to(device=device)
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/home/.../OneTrainer/modules/ui/TrainUI.py", line 569, in __training_thread_function
trainer.start()
File "/home/.../OneTrainer/modules/trainer/GenericTrainer.py", line 125, in start
self.model_setup.setup_train_device(self.model, self.config)
File "/home/.../OneTrainer/modules/modelSetup/FluxLoRASetup.py", line 198, in setup_train_device
model.transformer_to(self.train_device)
File "/home/.../OneTrainer/modules/model/FluxModel.py", line 146, in transformer_to
self.transformer_offload_conductor.to(device)
File "/home/.../OneTrainer/modules/util/LayerOffloadConductor.py", line 647, in to
layer.to(self.__train_device)
File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1340, in to
return self._apply(convert)
File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
module._apply(fn)
File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 900, in _apply
module._apply(fn)
File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in _apply
param_applied = fn(param)
File "/home/.../OneTrainer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1326, in convert
return t.to(
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Output of pip freeze