Workchain restart tests are very brittle #214

Open
oschuett opened this issue Mar 23, 2024 · 4 comments

@oschuett
Collaborator

The workchain tests concerned with restarting have been failing on the CP2K Dashboard for a while.

I believe the problem is that the workload (e.g. the number of MD steps) and the wall time have to be tuned precisely so that the work finishes after exactly one restart. If the work finishes during the initial run, the test fails because no restart was triggered. If the work needs more than one restart, the Cp2kBaseWorkChain bails out:

Cp2kCalculation<25> failed and error was not handled for the second consecutive time, aborting

Since the time required for a workload depends on the hardware these tests are inherently brittle. I think the best solution would be to allow for multiple restarts. See also #174.
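
To make the failure modes concrete, here is a toy model of the situation described above. It is only a sketch: the function name, the outcome labels, and the linear "steps per second" assumption are all hypothetical and are not part of aiida-cp2k or CP2K.

```python
import math


def simulate_restart_test(total_steps, steps_per_second, walltime_seconds, max_restarts=1):
    """Toy model (hypothetical, not aiida-cp2k code) of a wall-time-limited test.

    Returns a tuple (outcome, restarts_needed), where outcome is one of:
      "no_restart": the work finished in the initial run, so the test fails
      "passed":     the work finished within the allowed number of restarts
      "aborted":    more restarts were needed than the work chain allows
    """
    steps_per_run = int(steps_per_second * walltime_seconds)
    if steps_per_run <= 0:
        raise ValueError("each run must complete at least one step")

    # Number of runs needed to finish the work; every run after the first is a restart.
    runs_needed = math.ceil(total_steps / steps_per_run)
    restarts_needed = runs_needed - 1

    if restarts_needed == 0:
        return ("no_restart", restarts_needed)
    if restarts_needed > max_restarts:
        return ("aborted", restarts_needed)
    return ("passed", restarts_needed)


# Same workload and wall time on three machines of different speed.
fast = simulate_restart_test(total_steps=100, steps_per_second=20, walltime_seconds=10)
tuned = simulate_restart_test(total_steps=100, steps_per_second=6, walltime_seconds=10)
slow = simulate_restart_test(total_steps=100, steps_per_second=3, walltime_seconds=10)
```

In this model only the machine the test was tuned for passes. Raising `max_restarts` rescues the slow machine but not the fast one, which still finishes without ever restarting.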

@yakutovicha
Contributor

> Since the time required for a workload depends on the hardware these tests are inherently brittle. I think the best solution would be to allow for multiple restarts. See also #174.

It is a bug in aiida-core, which should be fixed in the next release.

@oschuett
Collaborator Author

Since #170 the dashboard test has been failing even more often. Do you know when the next aiida-core release is planned?

@yakutovicha
Contributor

> Since #170 the dashboard test has been failing even more. Do you know for when the next aiida-core release is planned?

No idea, sorry 🤷‍♂️

@yakutovicha
Contributor

Hey @oschuett and @mkrack!

I now understand the problem. The cause is the time limit that I set when running the tests.

The tests failed in this repo because we still rely on aiida-core==2.5.3. The GitHub runner is sometimes slow, so the calculation needed more than one restart, which did not work because of the aiida-core bug mentioned above.

The tests on the CP2K dashboard fail for the opposite reason: the machine is too fast. The calculation finishes without any restart, so the tests fail again. See here for example: just search for "work chain completed after".

To fix them, I need a reliable way to stop CP2K before it completes the optimisation. Can you think of another approach that doesn't involve time? @mkrack mentioned that he had implemented a way to stop CP2K if the convergence criteria were not met. Please let me know.
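
One time-independent possibility, sketched here purely as an illustration, is to cap the number of optimisation steps rather than the wall time. CP2K's `MOTION/GEO_OPT` input section has a `MAX_ITER` keyword, so the first run could be limited to a few steps regardless of how fast the hardware is. The helper name `limit_geo_opt_steps` and the example input are hypothetical. Whether Cp2kBaseWorkChain treats an optimisation that ran out of iterations as restartable is an assumption that this thread does not confirm.

```python
import copy


def limit_geo_opt_steps(parameters, max_iter):
    """Return a copy of a nested CP2K input dict with MOTION/GEO_OPT/MAX_ITER set.

    Hypothetical helper for illustration only; it is not part of aiida-cp2k.
    """
    limited = copy.deepcopy(parameters)  # leave the caller's dict untouched
    geo_opt = limited.setdefault("MOTION", {}).setdefault("GEO_OPT", {})
    geo_opt["MAX_ITER"] = max_iter
    return limited


# Example input, in the nested-dict form used for CP2K parameters (values are made up).
base_parameters = {
    "GLOBAL": {"RUN_TYPE": "GEO_OPT"},
    "MOTION": {"GEO_OPT": {"MAX_ITER": 200}},
}

# The first run stops after two optimisation steps, independent of machine speed.
first_run_parameters = limit_geo_opt_steps(base_parameters, max_iter=2)
```

Because the cut-off is expressed in steps, the same test input should behave the same way on a slow GitHub runner and on a fast dashboard machine, provided the optimisation cannot converge in that many steps.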
