Ensure that children processes are properly killed when parent killed and clean up nodes #3776

Open
sphuber opened this issue Feb 20, 2020 · 4 comments


@sphuber
Contributor

sphuber commented Feb 20, 2020

There are various scenarios in which killing a parent process does not properly kill all of its child processes, or in which the children are killed but their nodes are not properly updated. Make sure that the process tasks are properly acknowledged so that they don't remain in the queue, and make sure that the nodes are wrapped up.

@yakutovicha
Contributor

yakutovicha commented Jan 21, 2021

One more scenario in which killing the top work chain doesn't kill its children:

I am running the following chain of processes: EquationOfState (EOS) -> Cp2kOpt -> Cp2kBase -> Cp2kCalc. I'm NOT submitting the calculation to the daemon, but running it.

When I try to kill the top work chain (EOS) using verdi process kill, the second one (Cp2kOpt) gets killed as well. However, both Cp2kBase and Cp2kCalc remain unreachable and can't be killed.

If I submit the work chain to the daemon instead, everything can be killed just fine.

aiida version: 1.5.0
machine: Quantum Mobile: 20.11.2a

@ltalirz
Member

ltalirz commented Jan 22, 2021

@yakutovicha Could you please check the daemon logs for

AttributeError: 'NoneType' object has no attribute 'set_result'

If that is found, then #4669 may fix this issue

Edit: Sorry @yakutovicha, I misread your comment. Since this is limited to engine.run, it's likely unrelated to #4669.

@chrisjsewell
Member

Firstly, note that now that #4669 has been merged, instead of the AttributeError: 'NoneType' object has no attribute 'set_result', there should be a logger warning: logger.warning(f'killed CalcJob<{node.pk}> but async future was None')
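
The behaviour described above, warning instead of raising when the future is missing, can be sketched as follows. This is only a minimal illustration and not the code merged in #4669: `resolve_kill_future` and its arguments are hypothetical names, and a plain `concurrent.futures.Future` stands in for whatever future type the engine actually uses.

```python
import logging
from concurrent.futures import Future
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_kill_future(pk: int, future: Optional[Future]) -> bool:
    """Mark the kill as done on ``future``; warn instead of raising if it is None.

    Hypothetical helper: returns True if the future was resolved, False otherwise.
    """
    if future is None:
        # Without this guard, ``future.set_result`` would raise
        # AttributeError: 'NoneType' object has no attribute 'set_result'.
        logger.warning(f'killed CalcJob<{pk}> but async future was None')
        return False
    future.set_result(True)
    return True
```

Calling it with a real future resolves that future, while calling it with `None` only emits the warning and carries on.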

Secondly, I have now created https://github.com/aiidateam/aiida-integration-tests.
The README explains what this is in detail, but basically the idea is to create a reproducible environment (with Docker Compose) to investigate these kinds of "production" issues.

(see also #4603 (comment))

I can reproduce the issue whereby killing a running workchain (not submitted to the daemon) leaves the children unreachable.

Here is the outcome of pressing CTRL-C on a running workchain with children:

root@2940cef2c10d:~# aiida-sleep workchain -nw 1 -nc 10 -t 120
^C01/26/2021 04:14:18 PM <294> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 13
Traceback (most recent call last):
  File "/opt/venv/bin/aiida-sleep", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/root/aiida-core/aiida/cmdline/utils/decorators.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/aiida_sleep/cli.py", line 109, in run_workchains_cli
    node = run_workchain(number_calc, code, time, payload, output, fail, submit)
  File "/opt/venv/lib/python3.8/site-packages/aiida_sleep/cli.py", line 139, in run_workchain
    node = run_get_node(builder).node
  File "/root/aiida-core/aiida/engine/launch.py", line 58, in run_get_node
    return runner.run_get_node(process, *args, **inputs)
  File "/root/aiida-core/aiida/engine/runners.py", line 268, in run_get_node
    result, node = self._run(process, *args, **inputs)
  File "/root/aiida-core/aiida/engine/runners.py", line 244, in _run
    process_inited.execute()
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 79, in func_wrapper
    return func(self, *args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1150, in execute
    return self.future().result()
plumpy.exceptions.KilledError: Process was killed because the runner received an interrupt
root@2940cef2c10d:~# verdi process list -a
  PK  Created    Process label     Process State    Process status
----  ---------  ----------------  ---------------  -----------------------------------------------------------
   5  35m ago    SleepCalculation  ⏹ Finished [0]
  13  26s ago    SleepWorkChain    ☠ Killed         Process was killed because the runner received an interrupt
  14  26s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  15  26s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  16  26s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  17  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  18  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  19  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  20  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  21  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  22  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  23  24s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit

Total results: 12

Info: last time an entry changed state: 13s ago (at 16:14:18 on 2021-01-26)

If you try to kill a child calculation it is unreachable:

root@2940cef2c10d:~# verdi process kill 14
Error: Process<14> is unreachable

Here is the outcome of starting then killing a submitted workchain with children (all good):

root@2940cef2c10d:~# aiida-sleep workchain -nw 1 -nc 10 -t 120 -s
uuid: f48405c3-53c4-4143-8ce6-5f751003eb4c (pk: 37) (aiida.workflows:sleep)
root@2940cef2c10d:~# verdi process list -a
  PK  Created    Process label     Process State    Process status
----  ---------  ----------------  ---------------  -------------------------------------------------------------------
   5  41m ago    SleepCalculation  ⏹ Finished [0]
  13  6m ago     SleepWorkChain    ☠ Killed         Process was killed because the runner received an interrupt
  14  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  15  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  16  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  17  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  18  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  19  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  20  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  21  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  22  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  23  6m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  37  17s ago    SleepWorkChain    ⏵ Waiting        Waiting for child processes: 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
  38  17s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  39  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  40  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  41  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  42  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  43  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  44  16s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  45  15s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  46  15s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  47  15s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: submit

Total results: 23

Info: last time an entry changed state: 5s ago (at 16:20:44 on 2021-01-26)
root@2940cef2c10d:~# verdi process kill 37
Success: killed Process<37>
root@2940cef2c10d:~# verdi process list -a
  PK  Created    Process label     Process State    Process status
----  ---------  ----------------  ---------------  -----------------------------------------------------------
   5  42m ago    SleepCalculation  ⏹ Finished [0]
  13  7m ago     SleepWorkChain    ☠ Killed         Process was killed because the runner received an interrupt
  14  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  15  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  16  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  17  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  18  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  19  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  20  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  21  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  22  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  23  7m ago     SleepCalculation  ⏵ Waiting        Waiting for transport task: submit
  37  42s ago    SleepWorkChain    ☠ Killed         Killed through `verdi process kill`
  38  41s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  39  41s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  40  41s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  41  41s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  42  40s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  43  40s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  44  40s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  45  40s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  46  39s ago    SleepCalculation  ☠ Killed         Killed by parent<37>
  47  39s ago    SleepCalculation  ☠ Killed         Killed by parent<37>

Total results: 23

Info: last time an entry changed state: 6s ago (at 16:21:08 on 2021-01-26)
root@2940cef2c10d:~# 

@chrisjsewell
Member

chrisjsewell commented Jan 26, 2021

Also, to copy over a comment from @unkcpz in the closed duplicate #4298:

More information about this issue.

  • If submit is used to launch the workchain and the parent process is killed, the child processes can be killed as well.
  • When the parent process tries to kill the child via controller.kill_process(child.pk, 'msg') in aiida.engine.processes.process, the child process is unreachable: controller.kill_process fails with the exception kiwipy.exceptions.UnroutableError: ('NO_ROUTE', '[rpc].102')
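
One possible way to handle the failure in the second bullet would be to catch the routing error per child and fall back to updating the node directly, so that no child is left in a non-terminal state. The sketch below is self-contained and purely illustrative: `UnroutableError` is a local stand-in for `kiwipy.exceptions.UnroutableError`, and `Node`, `FakeController` and `kill_children` are hypothetical names, not aiida-core API. Whether marking a node as killed without a live process is the right fix is an open question for this issue.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


class UnroutableError(Exception):
    """Local stand-in for kiwipy.exceptions.UnroutableError (assumption)."""


@dataclass
class Node:
    """Hypothetical minimal process node."""
    pk: int
    state: str = 'waiting'
    status: str = ''


class FakeController:
    """Hypothetical controller: only processes in ``live`` have an RPC listener."""

    def __init__(self, live: Dict[int, Node]):
        self.live = live

    def kill_process(self, pk: int, msg: str) -> None:
        if pk not in self.live:
            raise UnroutableError('NO_ROUTE', f'[rpc].{pk}')
        # A live process updates its own node when it is killed.
        self.live[pk].state = 'killed'
        self.live[pk].status = msg


def kill_children(controller: FakeController, children: Iterable[Node], parent_pk: int) -> List[int]:
    """Kill every child; return the pks that had to be cleaned up directly."""
    msg = f'Killed by parent<{parent_pk}>'
    unreachable = []
    for child in children:
        try:
            controller.kill_process(child.pk, msg)
        except UnroutableError:
            # Nothing is listening for this pk, so only the stored node can be updated.
            child.state = 'killed'
            child.status = msg
            unreachable.append(child.pk)
    return unreachable
```

With three children of which only two are reachable, `kill_children` returns the pk of the unreachable one and still leaves all three nodes in the killed state.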
