Final model save is failing #681

Open
klromans557 opened this issue Feb 10, 2025 · 0 comments
Labels
bug Something isn't working


What happened?

When training an SDXL model, I get a MemoryError when the final model is saved. The intermediate saves during training complete without issue; only the final save fails. This did not happen in my training runs last week.

What did you expect would happen?

I expected the final model save to succeed, since the intermediate saves do.

Relevant log output

step: 100%|██████████| 28/28 [00:48<00:00,  1.74s/it, loss=0.0256, smooth loss=0.0907]
Saving C:/Users/klrom/Desktop/Model Training Mats/amber_img/workspace\save\tier1_fast2025-02-10_13-02-43-save-560-20-0.safetensors
step: 100%|██████████| 28/28 [01:00<00:00,  2.17s/it, loss=0.0329, smooth loss=0.0858]
step: 100%|██████████| 28/28 [00:49<00:00,  1.76s/it, loss=0.0377, smooth loss=0.085]
step:  21%|██▏       | 6/28 [00:12<00:45,  2.05s/it, loss=0.107, smooth loss=0.0844]
epoch:  55%|█████▌    | 22/40 [19:12<15:42, 52.38s/it]
Saving C:/Users/klrom/Desktop/Model Training Mats/amber_img/workspace/model/amber.safetensors
Exception in thread Thread-3 (__training_thread_function):
Traceback (most recent call last):
  File "threading.py", line 1016, in _bootstrap_inner
  File "threading.py", line 953, in run
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\ui\TrainUI.py", line 579, in __training_thread_function
    trainer.end()
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\trainer\GenericTrainer.py", line 766, in end
    self.model_saver.save(
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\StableDiffusionXLFineTuneModelSaver.py", line 28, in save
    base_model_saver.save(model, output_model_format, output_model_destination, dtype)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\stableDiffusionXL\StableDiffusionXLModelSaver.py", line 109, in save
    self.__save_safetensors(model, output_model_destination, dtype)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\modules\modelSaver\stableDiffusionXL\StableDiffusionXLModelSaver.py", line 83, in __save_safetensors
    save_file(save_state_dict, destination, self._create_safetensors_header(model, save_state_dict))
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 496, in _flatten
    return {
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 500, in <dictcomp>
    "data": _tobytes(v, k),
  File "C:\Users\klrom\Desktop\StableMatrix\Data\Packages\OneTrainer\venv\lib\site-packages\safetensors\torch.py", line 460, in _tobytes
    return data.tobytes()
MemoryError
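
For anyone triaging this: the traceback shows `save_file` calling `serialize_file(_flatten(tensors), ...)`, and `_flatten` building a dict whose `"data"` entries come from `_tobytes`. That suggests every tensor is copied into a bytes object in host RAM before anything is written, so the save may need roughly the total size of the state dict as extra free memory. Below is a minimal sketch for estimating that figure. The function names and the example shapes are hypothetical, not part of OneTrainer or safetensors, and the numbers are not taken from the failing run.

```python
from math import prod


def estimate_flatten_bytes(tensor_specs):
    """Estimate the bytes needed to hold every tensor as a bytes object.

    tensor_specs maps a tensor name to (shape, itemsize_in_bytes).
    This is an assumption-based estimate, not a measurement.
    """
    return sum(prod(shape) * itemsize for shape, itemsize in tensor_specs.values())


def format_gib(num_bytes):
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


# Hypothetical example values for illustration only.
example_specs = {
    "layer.weight": ((1280, 1280), 2),  # 2 bytes per element, e.g. fp16
    "layer.bias": ((1280,), 2),
}

total = estimate_flatten_bytes(example_specs)
print(total, format_gib(total))
```

With a real model, the specs could be built from the state dict as `{k: (tuple(v.shape), v.element_size()) for k, v in state_dict.items()}` and the result compared against `psutil.virtual_memory().available` just before the final save (psutil is already in the pip freeze below). If the estimate is close to or above the available memory at that point, the MemoryError would be consistent with host RAM being exhausted rather than with a corrupt model.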

Output of pip freeze

absl-py==2.1.0
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiohttp-retry==2.9.1
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.8.0
async-timeout==4.0.3
attrs==24.2.0
av==13.1.0
backoff==2.2.1
bcrypt==4.2.1
bitsandbytes==0.44.1
boto3==1.35.94
botocore==1.35.94
Brotli==1.1.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.4.0
click==8.1.8
cloudpickle==3.1.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.0
cryptography==43.0.3
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
decorator==5.1.1
Deprecated==1.2.15
-e git+https://github.com/huggingface/diffusers.git@c944f0651f679728d4ec7b6488120ac49c2f1315#egg=diffusers
dnspython==2.7.0
email_validator==2.2.0
exceptiongroup==1.2.2
fabric==3.2.2
fastapi==0.115.6
fastapi-cli==0.0.7
filelock==3.16.1
flatbuffers==24.3.25
fonttools==4.54.1
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.67.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
inquirerpy==0.3.4
intel-openmp==2021.4.0
invisible-watermark==0.2.0
invoke==2.2.0
itsdangerous==2.2.0
Jinja2==3.1.4
jmespath==1.0.1
kiwisolver==1.4.7
lightning-utilities==0.11.8
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@fcaec253ddff9dccd0f9644836fe87b0103f23f7#egg=mgds
mkl==2021.4.0
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime==1.19.2
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
orjson==3.10.13
packaging==24.1
paramiko==3.5.0
pfzy==0.3.4
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
prettytable==3.12.0
prodigyopt==1.1.1
prompt_toolkit==3.0.48
propcache==0.2.0
protobuf==4.25.5
psutil==6.1.0
py-cpuinfo==9.0.0
pycparser==2.22
pydantic==2.9.2
pydantic-extra-types==2.10.1
pydantic-settings==2.7.1
pydantic_core==2.23.4
Pygments==2.18.0
PyNaCl==1.5.0
pynvml==11.5.0
pyparsing==3.2.0
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytorch-lightning==2.4.0
pytorch_optimizer==3.3.0
PyWavelets==1.7.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
rich==13.9.3
rich-toolkit==0.12.0
runpod==1.7.4
s3transfer==0.10.4
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.3
sentencepiece==0.2.0
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
starlette==0.41.3
sympy==1.13.1
tbb==2021.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.11
tokenizers==0.21.0
tomli==2.2.1
tomlkit==0.13.2
torch==2.5.1+cu124
torchmetrics==1.5.1
torchvision==0.20.1+cu124
tqdm==4.66.6
tqdm-loggable==0.2
transformers==4.47.0
typer==0.15.1
typing_extensions==4.12.2
ujson==5.10.0
urllib3==2.2.3
uvicorn==0.34.0
watchdog==6.0.0
watchfiles==1.0.3
wcwidth==0.2.13
websockets==14.1
Werkzeug==3.0.6
wrapt==1.17.0
xformers==0.0.28.post3
yarl==1.17.0
zipp==3.20.2

klromans557 added the bug label on Feb 10, 2025