What happened?
When training an SDXL model, I get a MemoryError when saving the final model. The intermediate saves during training complete normally, but the final model save fails. This did not happen with the training runs I did last week.
What did you expect would happen?
That the final model save works if the intermediate model saves work.
Relevant log output
step: 100%|██████████| 28/28 [00:48<00:00, 1.74s/it, loss=0.0256, smooth loss=0.0907]
Saving C:/Users/klrom/Desktop/Model Training Mats/amber_img/workspace\save\tier1_fast2025-02-10_13-02-43-save-560-20-0.safetensors
step: 100%|██████████| 28/28 [01:00<00:00, 2.17s/it, loss=0.0329, smooth loss=0.0858]
step: 100%|██████████| 28/28 [00:49<00:00, 1.76s/it, loss=0.0377, smooth loss=0.085]
step: 21%|██▏ | 6/28 [00:12<00:45, 2.05s/it, loss=0.107, smooth loss=0.0844]
epoch: 55%|█████▌ | 22/40 [19:12<15:42, 52.38s/it]
Saving C:/Users/klrom/Desktop/Model Training Mats/amber_img/workspace/model/amber.safetensors
Exception in thread Thread-3 (__training_thread_function):
Traceback (most recent call last):
  File "threading.py", line 1016, in _bootstrap_inner
  File "threading.py", line 953, in run
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\ui\TrainUI.py", line 579, in __training_thread_function
    trainer.end()
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\trainer\GenericTrainer.py", line 766, in end
    self.model_saver.save(
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\StableDiffusionXLFineTuneModelSaver.py", line 28, in save
    base_model_saver.save(model, output_model_format, output_model_destination, dtype)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\stableDiffusionXL\StableDiffusionXLModelSaver.py", line 109, in save
    self.__save_safetensors(model, output_model_destination, dtype)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\stableDiffusionXL\StableDiffusionXLModelSaver.py", line 83, in __save_safetensors
    save_file(save_state_dict, destination, self._create_safetensors_header(model, save_state_dict))
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 496, in _flatten
    return {
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 500, in <dictcomp>
    "data": _tobytes(v, k),
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 460, in _tobytes
    return data.tobytes()
MemoryError
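Going by the traceback, the failure happens inside the dict comprehension in `_flatten`, where `_tobytes` is called once per tensor, so the byte copies of all tensors appear to be held in host RAM at the same time before anything is written. Here is a rough sketch of how one might estimate that requirement before saving. The helper names (`estimate_flatten_bytes`, `fits_in_memory`), the dtype table, and the example shapes are all hypothetical and not part of OneTrainer or safetensors.

```python
import math

# Bytes per element for a few common dtypes (illustrative subset, not exhaustive).
DTYPE_SIZES = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_flatten_bytes(tensors):
    """Sum the bytes needed if every tensor is copied to a bytes object at once.

    `tensors` maps a tensor name to a (shape, dtype_name) pair.
    """
    total = 0
    for _name, (shape, dtype) in tensors.items():
        total += math.prod(shape) * DTYPE_SIZES[dtype]
    return total


def fits_in_memory(required_bytes, available_bytes, headroom=0.1):
    """Return True if the copy fits while leaving `headroom` (a fraction) free."""
    return required_bytes <= available_bytes * (1 - headroom)


# Hypothetical two-tensor state dict, only to show the arithmetic.
example = {
    "layer.weight": ((1280, 1280), "float32"),
    "layer.bias": ((1280,), "float32"),
}
needed = estimate_flatten_bytes(example)
print(needed, fits_in_memory(needed, 8 * 1024**3))
```

In a real run, the available figure could come from `psutil.virtual_memory().available` (psutil is in the freeze list below), and the shapes and dtypes from the state dict being saved. This only estimates the size of the byte copies and ignores whatever else the process already holds in RAM.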
Output of pip freeze
absl-py==2.1.0
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiohttp-retry==2.9.1
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
async-timeout==4.0.3
attrs==24.2.0
av==13.1.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes==0.44.1
boto3==1.35.94
botocore==1.35.94
Brotli==1.1.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.4.0
click==8.1.8
cloudpickle==3.1.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.0
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.15
-e git+https://github.com/huggingface/diffusers.git@c944f0651f679728d4ec7b6488120ac49c2f1315#egg=diffusers
dnspython==2.7.0
email_validator==2.2.0
exceptiongroup==1.2.2
fabric==3.2.2
fastapi==0.115.6
fastapi-cli==0.0.7
filelock==3.16.1
flatbuffers==24.3.25
fonttools==4.54.1
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.67.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
inquirerpy==0.3.4
intel-openmp==2021.4.0
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.4
jmespath==1.0.1
kiwisolver==1.4.7
lightning-utilities==0.11.8
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@fcaec253ddff9dccd0f9644836fe87b0103f23f7#egg=mgds
mkl==2021.4.0
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime==1.19.2
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.13
packaging==24.1
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
prettytable==3.12.0
prodigyopt==1.1.1
prompt_toolkit==3.0.48
propcache==0.2.0
protobuf==4.25.5
psutil==6.1.0
py-cpuinfo==9.0.0
pycparser==2.22
pydantic==2.9.2
pydantic-extra-types==2.10.1
pydantic-settings==2.7.1
pydantic_core==2.23.4
Pygments==2.18.0
PyNaCl==1.5.0
pynvml==11.5.0
pyparsing==3.2.0
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch_optimizer==3.3.0
PyWavelets==1.7.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
rich==13.9.3
rich-toolkit==0.12.0
runpod==1.7.4
s3transfer==0.10.4
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
sentencepiece==0.2.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
starlette==0.41.3
sympy==1.13.1
tbb==2021.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.11
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.5.1+cu124
torchmetrics==1.5.1
torchvision==0.20.1+cu124
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
typer==0.15.1
typing_extensions==4.12.2
ujson==5.10.0
urllib3==2.2.3
uvicorn==0.34.0
watchdog==6.0.0
watchfiles==1.0.3
wcwidth==0.2.13
websockets==14.1
Werkzeug==3.0.6
wrapt==1.17.0
xformers==0.0.28.post3
yarl==1.17.0
zipp==3.20.2