Train Error #1
Comments
Hi, thanks for the question. I have tested the code again and everything seems OK on my side. The PyTorch version is 1.10.0 and the torchvision version is 0.11.0; you can find more details about the packages used in requirements.txt. Hope this helps!
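One way to compare a local environment against the versions quoted above is a small check like the following. It is only a sketch: the helper names `installed_version` and `compare_versions` are hypothetical, and the `EXPECTED` table contains just the two versions mentioned in this thread, not the full contents of requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in this thread; extend with entries from requirements.txt as needed.
EXPECTED = {"torch": "1.10.0", "torchvision": "0.11.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected):
    """Map each package name to a tuple (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Local build tags such as "+cu113" are ignored for the comparison.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for package, (wanted, found, ok) in compare_versions(EXPECTED).items():
        print(f"{package}: expected {wanted}, installed {found}, match={ok}")
```

A mismatch reported here would only be a hint; as the rest of the thread shows, matching versions did not rule out the error.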
Thank you for your reply. I have checked my environment and it meets the requirements, but I still have not found the cause of the error. Could you give me some guidance to help me solve the problem?
Thanks for your feedback. Actually, this is the first time I have seen an error like this. From the error message, it looks like the error occurs when loading the first-stage model for the second-stage training. But the code works fine on my computer, so I don't have a clear picture of how to fix it right now. Would you mind providing more details about your environment so I can try to reproduce the error? Thanks.
Thank you for your continued attention. In my environment, the Python version and the output of the command "pip list" are as follows. I used a 3090 GPU for training. absl-py 2.0.0 When I debugged the code, I printed the value of the variable at the position where it broke down. The result is as follows.
Hi, thanks for sharing your environment. I have tried using exactly the same environment as yours, but everything still works well on my side, which is weird. I will try on other computers to see if I can reproduce the error and let you know if I find something.
Thank you for continuing to look into this problem. When I used two GPUs to train the model, the error did not occur in the same environment. When I tried training on a single GPU again, the error reappeared. For single-GPU training, I only modified codes/main.py. Do I need to modify any other code?
Hi, thanks for the useful information. I can now reproduce the error in a single-GPU setting. It seems to be an error when PyTorch Lightning tries to save some unwanted parameters, as pointed out here. In our case, the bug occurs when saving "callbacks", so I ignore "callbacks" when saving. For training with different GPU settings, you only need to change the GPU ID in 'codes/main.py', just like you did. Then it should be OK to train :)
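The exact change is not preserved in this thread, so the following is only an illustration of the idea, not the maintainer's patch. It uses plain Python with a hypothetical helper `select_hparams` and made-up argument names (`lr`, `stage`) to show what "ignoring" an entry before saving means. In PyTorch Lightning itself, `LightningModule.save_hyperparameters` accepts an `ignore` argument, and a call such as `self.save_hyperparameters(ignore=["callbacks"])` in model_interface.py is an assumption about what the fix looked like.

```python
def select_hparams(init_args, ignore=()):
    """Return a copy of init_args without the keys listed in ignore."""
    skipped = set(ignore)
    return {key: value for key, value in init_args.items() if key not in skipped}


# Hypothetical constructor arguments; "callbacks" holds objects that
# should not be stored together with the plain hyperparameters.
init_args = {"lr": 1e-4, "stage": 2, "callbacks": [object()]}

# Only the plain settings remain after filtering.
hparams = select_hparams(init_args, ignore=["callbacks"])
print(sorted(hparams))
```

Filtering at save time leaves the callbacks themselves untouched for training; they are simply not written out with the other hyperparameters.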
Thank you very much for your solution. After I modified model_interface.py, the code runs correctly.
When I run the training command "python main.py --yaml_path configs/evreds_train.yaml", it breaks down at the start of the second stage with the error "ValueError: dictionary update sequence element #0 has length 1; 2 is required".
Could you tell me which PyTorch version you used? I wonder whether one of my package versions is wrong.
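For context, this message is a generic Python error raised when a dict is built or updated from a sequence whose elements are not key/value pairs. The sketch below reproduces the wording in isolation; the helper `build_dict` is hypothetical, and the example says nothing about where in the project the error is actually raised.

```python
def build_dict(pairs):
    """Build a dict from an iterable of (key, value) pairs."""
    return dict(pairs)


# A well-formed sequence of 2-item pairs works.
ok = build_dict([("lr", 0.001), ("callbacks", None)])

# An element of length 1 at index 0 is not a (key, value) pair,
# so dict() raises a ValueError with the wording quoted in the report.
try:
    build_dict(["x"])
except ValueError as err:
    message = str(err)
    print(message)
```

Seeing this message therefore suggests that some value expected to be a mapping or a list of pairs arrived in a different shape.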