
Train Error #1

Open
ShuangMa156 opened this issue Nov 7, 2023 · 8 comments

@ShuangMa156

When I run the training command `python main.py --yaml_path configs/evreds_train.yaml`, it breaks down at the start of the second stage with the error `ValueError: dictionary update sequence element #0 has length 1; 2 is required`.
Could you tell me which PyTorch version you used? I suspect one of my package versions is wrong.

@XiangZ-0
Owner

XiangZ-0 commented Nov 7, 2023

Hi, thanks for the question. I have tested the code again and everything seems fine on my side. The PyTorch version is 1.10.0 and the torchvision version is 0.11.0; you can find more details about the packages used in requirements.txt. Hope this helps!

@ShuangMa156
Author

Thank you for your reply. I have checked my environment and it meets the requirements, but I still cannot find the cause of the error. Could you give me some guidance to help me solve the problem?
The error message is as follows:

```
Traceback (most recent call last):
  File "main.py", line 165, in <module>
    best_model_path = second_stage_training(args)
  File "main.py", line 59, in second_stage_training
    trainer.fit(model, data_module)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1225, in _run
    self._log_hyperparams()
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1295, in _log_hyperparams
    logger.save()
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/loggers/tensorboard.py", line 264, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 402, in save_hparams_to_yaml
    yaml.dump(v)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/__init__.py", line 253, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/__init__.py", line 241, in dump_all
    dumper.represent(data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 199, in represent_list
    return self.represent_sequence('tag:yaml.org,2002:seq', data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 346, in represent_object
    return self.represent_sequence(tag+function_name, args)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 92, in represent_sequence
    node_item = self.represent_data(item)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 343, in represent_object
    'tag:yaml.org,2002:python/object:'+function_name, state)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/root/anaconda3/envs/gem/lib/python3.7/site-packages/yaml/representer.py", line 330, in represent_object
    dictitems = dict(dictitems)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
```
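The last frame shows PyYAML calling `dict(dictitems)` on something that is not a sequence of key/value pairs. A minimal stdlib-only sketch of why that raises exactly this `ValueError` (the names below are illustrative, not taken from the GEM code):

```python
# dict() accepts an iterable of (key, value) pairs. When PyYAML's
# represent_object reaches an object it cannot serialize as pairs,
# it can end up passing dict() an iterable of bare items instead,
# which raises the ValueError seen at the bottom of the traceback.
pairs = [("lr", 1e-4), ("epochs", 10)]  # well-formed: 2-tuples
assert dict(pairs) == {"lr": 1e-4, "epochs": 10}

try:
    dict(["x"])  # element #0 is the 1-char string "x", not a 2-tuple
except ValueError as err:
    message = str(err)

# The message matches the one in the traceback above.
assert "dictionary update sequence element #0 has length 1" in message
```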

@XiangZ-0
Owner

XiangZ-0 commented Nov 8, 2023

Thanks for your feedback. Actually, this is the first time I have seen an error like this. From the error message, it looks like something goes wrong when loading the first-stage model for the second-stage training. But the code works fine on my computer, so I don't have a clear picture of how to fix it right now. Would you mind providing more details about your environment so I can try to reproduce the error? Thanks.

@ShuangMa156
Author

Thank you for your continued attention. My Python version and the output of `pip list` are as follows. I used a 3090 GPU for training.
Python 3.7.16

```
Package Version
------- -------
absl-py 2.0.0
aiohttp 3.8.6
aiosignal 1.3.1
async-timeout 4.0.3
asynctest 0.13.0
attrs 23.1.0
cachetools 5.3.2
certifi 2022.12.7
charset-normalizer 3.3.2
cycler 0.11.0
DCN 1.0
fonttools 4.38.0
frozenlist 1.3.3
fsspec 2023.1.0
google-auth 2.23.4
google-auth-oauthlib 0.4.6
grpcio 1.59.2
h5py 3.8.0
hdf5storage 0.1.19
idna 3.4
imageio 2.31.2
importlib-metadata 6.7.0
kiwisolver 1.4.5
Markdown 3.4.4
MarkupSafe 2.1.3
matplotlib 3.5.0
multidict 6.0.4
networkx 2.6.3
numpy 1.21.6
oauthlib 3.2.2
opencv-python 4.7.0.72
packaging 23.2
pathlib2 2.3.7.post1
Pillow 9.3.0
pip 23.3.1
protobuf 3.20.3
pyasn1 0.5.0
pyasn1-modules 0.3.0
pyDeprecate 0.3.2
pyparsing 3.1.1
python-dateutil 2.8.2
pytorch-lightning 1.6.0
PyWavelets 1.3.0
PyYAML 6.0.1
requests 2.31.0
requests-oauthlib 1.3.1
rsa 4.9
scikit-image 0.19.3
scipy 1.7.3
setuptools 59.5.0
setuptools-scm 7.1.0
sewar 0.4.5
six 1.16.0
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tifffile 2021.11.2
tomli 2.0.1
torch 1.10.0+cu111
torchaudio 0.10.0+rocm4.1
torchmetrics 0.11.4
torchvision 0.11.0+cu111
tqdm 4.66.1
typing_extensions 4.7.1
urllib3 2.0.7
warmup-scheduler 0.3
Werkzeug 2.2.3
wheel 0.38.4
yarl 1.9.2
zipp 3.15.0
```

When I debugged the code, I printed the values of the variables at the position where it breaks down. The results are as follows.

[screenshot: the error message]
[screenshot: the file where the error occurs]
[screenshot: the printed variable values]

@XiangZ-0
Owner

Hi, thanks for sharing your environment. I have tried using exactly the same environment as yours, but everything still works well on my side, so weird. I will try other computers to see if I can reproduce the error, and I will let you know if I find something.

@ShuangMa156
Author

ShuangMa156 commented Nov 11, 2023

Thank you for your continuous attention to this problem. When I used two GPUs to train the model in the same environment, the error did not occur. When I went back to a single GPU, the error reappeared.
When I used two GPUs for training, there was a warning in stage 2. Could it have anything to do with the environment?

[screenshot: stage-2 warning]

When I used one GPU for training, I only modified codes/main.py. Do I need to modify any other code?

[screenshot: modifications to main.py]

@XiangZ-0
Owner

Hi, thanks for the useful information. I can now reproduce the error in the single-GPU setting. It seems to occur when PyTorch Lightning tries to save some unwanted parameters, as pointed out here. In our case, the bug occurs when saving "callbacks", so I ignore them by adding `self.save_hyperparameters(ignore=["callbacks"])` in ./codes/model/model_interface.py, and now it works in both single-GPU and multi-GPU settings. I have updated the code, and you can try it by replacing the model_interface.py file.

For training with different GPU settings, you only need to change the GPU ID in codes/main.py just like you did there. Then it should be OK to train :)
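For readers who hit the same traceback, the essence of the fix is to keep non-serializable objects such as callbacks out of the hyperparameters that get dumped to YAML. A minimal stdlib-only sketch of that idea (the helper name `filter_hparams` is illustrative; in the actual code this is handled by Lightning's `save_hyperparameters(ignore=["callbacks"])`):

```python
# Sketch: drop entries that PyYAML cannot round-trip (e.g. callback
# objects) before the hyperparameter dict is saved. This mirrors what
# passing ignore=["callbacks"] to save_hyperparameters achieves.
def filter_hparams(hparams, ignore=("callbacks",)):
    """Return a copy of hparams without the ignored keys."""
    return {k: v for k, v in hparams.items() if k not in ignore}

hparams = {"lr": 1e-4, "batch_size": 8, "callbacks": [object()]}
safe = filter_hparams(hparams)
assert "callbacks" not in safe   # the offending entry is gone
assert safe["lr"] == 1e-4        # ordinary hyperparameters survive
```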

@ShuangMa156
Author

Thank you very much for your solution. After I modified model_interface.py, the code runs correctly.

