Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make restart tests work #222

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

yakutovicha
Copy link
Contributor

fix #174

@yakutovicha yakutovicha requested a review from oschuett January 15, 2025 14:28
@yakutovicha
Copy link
Contributor Author

@oschuett, this temporarily fixes the issues in this repo, but probably won't fix things for the CP2K one. As mentioned in #214 (comment), the time-based restarts are tricky and we might need a better solution. Please let me know if you have a better idea.

@oschuett
Copy link
Collaborator

Thanks a lot for looking into this issue!

Since aiidateam/aiida-core#6307 is now fixed, couldn't we lower the walltime to a value that works reliably everywhere?

@yakutovicha
Copy link
Contributor Author

Thanks a lot for looking into this issue!

Sure!

Since aiidateam/aiida-core#6307 is now fixed, couldn't we lower the walltime to a value that works reliably everywhere?

I think we can try to play around with things, but there will always be a case where things are breaking if the machine is fast. That's why I would prefer a more predictable test.

@oschuett
Copy link
Collaborator

but there will always be a case where things are breaking if the machine is fast.

Couldn't we tune it such that it has to restart 3 times on a fast machine today? That should keep working for at least the next 10 years. On a slow machine it will obviously need more restarts, but overall the test should still finish within a reasonable time.

@yakutovicha
Copy link
Contributor Author

but there will always be a case where things are breaking if the machine is fast.

Couldn't we tune it such that it has to restart 3 times on a fast machine today? That should keep working for at least the next 10 years. On a slow machine it will obviously need more restarts, but overall the test should still finish within a reasonable time.

Ok, trying here: d72faf2. However, I believe there is a significant dispersion among different runners on GitHub, so that might not be a super reliable solution either.

@yakutovicha
Copy link
Contributor Author

yakutovicha commented Jan 15, 2025

Ok, seems that things have improved a little bit. The remaining test failure will most likely be fixed by #218. Shall we merge this one?

@mkrack
Copy link
Contributor

mkrack commented Jan 16, 2025

Ok, trying here: d72faf2. However, I believe there is a significant dispersion among different runners on GitHub, so that might not be a super reliable solution either.

Maybe, it is better to use MAX_SCF in the SCF section or MAX_ITER in the GEO_OPT section to enforce a restart. That approach would not be affected by runtime fluctuation due to different hardware or availability of cloud resources. We could run for instance two GEO_OPT steps (MAX_ITER 2) and then restart once with an increase MAX_ITER 4 to run two more steps. We have only to make sure that the GEO_OPT run needs 4 or more GEO_OPT steps to converge for the selected system.

@oschuett
Copy link
Collaborator

We could run for instance two GEO_OPT steps MAX_ITER 2 and then restart once with an increase MAX_ITER 4 to run two more steps.

AFAIK, the restart mechanism in AiiDA does not allow for modifications of the input in between runs.

@mkrack
Copy link
Contributor

mkrack commented Jan 16, 2025

We could run for instance two GEO_OPT steps MAX_ITER 2 and then restart once with an increase MAX_ITER 4 to run two more steps.

AFAIK, the restart mechanism in AiiDA does not allow for modifications of the input in between runs.

If that is true, then we can try a sufficiently small MAX_SCF with OUTER_SCF disabled which will require 2 or 3 restarts to converge the system.

@yakutovicha
Copy link
Contributor Author

Ok, trying here: d72faf2. However, I believe there is a significant dispersion among different runners on GitHub, so that might not be a super reliable solution either.

Maybe, it is better to use MAX_SCF in the SCF section or MAX_ITER in the GEO_OPT section to enforce a restart. That approach would not be affected by runtime fluctuation due to different hardware or availability of cloud resources. We could run for instance two GEO_OPT steps (MAX_ITER 2) and then restart once with an increase MAX_ITER 4 to run two more steps. We have only to make sure that the GEO_OPT run needs 4 or more GEO_OPT steps to converge for the selected system.

I tried to implement that in b79531c. Please let me know if you have better ideas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The restart_incomplete_calculation runs only one time
3 participants