
Train Error #1

Open
ShuangMa156 opened this issue Nov 7, 2023 · 8 comments

@ShuangMa156

When I run the training command `python main.py --yaml_path configs/evreds_train.yaml`, it breaks down at the start of the second stage with the error `ValueError: dictionary update sequence element #0 has length 1; 2 is required`.
Could you tell me which PyTorch version you used? I suspect one of my package versions is wrong.

@XiangZ-0
Owner

XiangZ-0 commented Nov 7, 2023

Hi, thanks for the question. I have tested the code again and everything seems fine on my side. The PyTorch version is 1.10.0 and the torchvision version is 0.11.0; you can find more details about the packages used in requirements.txt. Hope this helps!

@ShuangMa156
Author

Thank you for your reply. I have checked my environment and it meets the requirements, but I still cannot find the cause of the error. Could you give me some guidance to help me solve the problem?
The error message is as follows:

```
Traceback (most recent call last):
  File "main.py", line 165, in <module>
    best_model_path = second_stage_training(args)
  File "main.py", line 59, in second_stage_training
    trainer.fit(model, data_module)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1225, in _run
    self._log_hyperparams()
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1295, in _log_hyperparams
    logger.save()
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/loggers/tensorboard.py", line 264, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 402, in save_hparams_to_yaml
    yaml.dump(v)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/__init__.py", line 253, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/__init__.py", line 241, in dump_all
    dumper.represent(data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 199, in represent_list
    return self.represent_sequence('tag:yaml.org,2002:seq', data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 346, in represent_object
    return self.represent_sequence(tag+function_name, args)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 330, in represent_object
    dictitems = dict(dictitems)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
```
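The last frame shows PyYAML calling `dict(dictitems)` on something that is not a sequence of key/value pairs. A minimal stdlib-only sketch of why that raises exactly this `ValueError` (the names below are illustrative, not taken from the GEM code):

```python
# dict() accepts an iterable of (key, value) pairs. When PyYAML's
# represent_object reaches an object it cannot serialize as pairs,
# it can end up passing dict() an iterable of bare items instead,
# which raises the ValueError seen at the bottom of the traceback.
pairs = [("lr", 1e-4), ("epochs", 10)]  # well-formed: 2-tuples
assert dict(pairs) == {"lr": 1e-4, "epochs": 10}

try:
    dict(["x"])  # element #0 is the 1-char string "x", not a 2-tuple
except ValueError as err:
    message = str(err)

# The message matches the one in the traceback above.
assert "dictionary update sequence element #0 has length 1" in message
```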

@XiangZ-0
Owner

XiangZ-0 commented Nov 8, 2023

Thanks for your feedback. Actually, this is the first time I have seen an error like this. From the error message, it looks like something goes wrong when loading the first-stage model for the second-stage training. But the code works fine on my computer, so I don't have a clear picture of how to fix it right now. Would you mind providing more details about your environment so I can try to reproduce the error? Thanks.

@ShuangMa156
Author

Thank you for your continued attention. My Python version and the output of `pip list` are as follows. I used a 3090 GPU for training.
Python 3.7.16

```
Package Version
------- -------
absl-py 2.0.0
aiohttp 3.8.6
aiosignal 1.3.1
async-timeout 4.0.3
asynctest 0.13.0
attrs 23.1.0
cachetools 5.3.2
certifi 2022.12.7
charset-normalizer 3.3.2
cycler 0.11.0
DCN 1.0
fonttools 4.38.0
frozenlist 1.3.3
fsspec 2023.1.0
google-auth 2.23.4
google-auth-oauthlib 0.4.6
grpcio 1.59.2
h5py 3.8.0
hdf5storage 0.1.19
idna 3.4
imageio 2.31.2
importlib-metadata 6.7.0
kiwisolver 1.4.5
Markdown 3.4.4
MarkupSafe 2.1.3
matplotlib 3.5.0
multidict 6.0.4
networkx 2.6.3
numpy 1.21.6
oauthlib 3.2.2
opencv-python 4.7.0.72
packaging 23.2
pathlib2 2.3.7.post1
Pillow 9.3.0
pip 23.3.1
protobuf 3.20.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
pyDeprecate 0.3.2
pyparsing 3.1.1
python-dateutil 2.8.2
pytorch-lightning 1.6.0
PyWavelets 1.3.0
PyYAML 6.0.1
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scikit-image 0.19.3
scipy 1.7.3
setuptools 59.5.0
setuptools-scm 7.1.0
sewar 0.4.5
six 1.16.0
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tifffile 2021.11.2
tomli 2.0.1
torch 1.10.0+cu111
torchaudio 0.10.0+rocm4.1
torchmetrics 0.11.4
torchvision 0.11.0+cu111
tqdm 4.66.1
typing_extensions 4.7.1
urllib3 2.0.7
warmup-scheduler 0.3
Werkzeug 2.2.3
wheel 0.38.4
yarl 1.9.2
zipp 3.15.0
```

When I debugged the code, I printed the values of the variables at the position where it breaks down. The results are as follows.

[screenshot: the error message]
[screenshot: the file where the error occurs]
[screenshot: the printed variable values]

@XiangZ-0
Owner

Hi, thanks for sharing your environment. I have tried using exactly the same environment as yours, but everything still works well on my side, so weird. I will try other computers to see if I can reproduce the error, and I will let you know if I find something.

@ShuangMa156
Author

ShuangMa156 commented Nov 11, 2023

Thank you for your continuous attention to this problem. When I used two GPUs to train the model in the same environment, the error did not occur. When I went back to a single GPU, the error reappeared.
When I used two GPUs for training, there was a warning in stage 2. Could it have anything to do with the environment?

[screenshot: stage-2 warning]

When I used one GPU for training, I only modified codes/main.py. Do I need to modify any other code?

[screenshot: modifications to main.py]

@XiangZ-0
Owner

Hi, thanks for the useful information. I can now reproduce the error in the single-GPU setting. It seems to occur when PyTorch Lightning tries to save some unwanted parameters, as pointed out here. In our case, the bug occurs when saving "callbacks", so I ignore them by adding `self.save_hyperparameters(ignore=["callbacks"])` in ./codes/model/model_interface.py, and now it works in both single-GPU and multi-GPU settings. I have updated the code, and you can try it by replacing the model_interface.py file.

For training with different GPU settings, you only need to change the GPU ID in codes/main.py just like you did there. Then it should be OK to train :)
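For readers who hit the same traceback, the essence of the fix is to keep non-serializable objects such as callbacks out of the hyperparameters that get dumped to YAML. A minimal stdlib-only sketch of that idea (the helper name `filter_hparams` is illustrative; in the actual code this is handled by Lightning's `save_hyperparameters(ignore=["callbacks"])`):

```python
# Sketch: drop entries that PyYAML cannot round-trip (e.g. callback
# objects) before the hyperparameter dict is saved. This mirrors what
# passing ignore=["callbacks"] to save_hyperparameters achieves.
def filter_hparams(hparams, ignore=("callbacks",)):
    """Return a copy of hparams without the ignored keys."""
    return {k: v for k, v in hparams.items() if k not in ignore}

hparams = {"lr": 1e-4, "batch_size": 8, "callbacks": [object()]}
safe = filter_hparams(hparams)
assert "callbacks" not in safe   # the offending entry is gone
assert safe["lr"] == 1e-4        # ordinary hyperparameters survive
```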

@ShuangMa156
Author

Thank you very much for your solution. After I modified model_interface.py, the code runs correctly.

