A lot of failed jobs in AiiDA 1.5 (DuplicateSubcriber) when machine is overloaded #4598

Closed
giovannipizzi opened this issue Nov 29, 2020 · 6 comments · Fixed by #5715

Comments

@giovannipizzi
Member

giovannipizzi commented Nov 29, 2020

Duplicate of #3973

When submitting a lot of jobs in AiiDA 1.5.0, I get many failed (excepted) jobs with reports like the following one:

Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/manage/external/rmq.py", line 201, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/process_comms.py", line 541, in _continue
    proc = saved_state.unbundle(self._load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/persistence.py", line 51, in unbundle
    return Savable.load(self, load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/persistence.py", line 442, in load
    return load_cls.recreate_from(saved_state, load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 234, in recreate_from
    base.call_with_super_check(process.init)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/utils.py", line 28, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/process.py", line 125, in init
    super().init()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/utils.py", line 15, in wrapper
    wrapped(self, *args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 293, in init
    self.broadcast_receive, identifier=str(self.pid)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 125, in add_broadcast_subscriber
    return self._communicator.add_broadcast_subscriber(converted, identifier)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 592, in add_broadcast_subscriber
    return self._run_task(coro)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 677, in _run_task
    return self.tornado_to_kiwi_future(self._create_task(coro)).result(timeout=self.TASK_TIMEOUT)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 646, in done
    result = done_future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/futures.py", line 54, in capture_exceptions
    yield
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/utils.py", line 146, in run_task
    future.set_result((yield coro()))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 393, in add_broadcast_subscriber
    identifier = yield self._message_subscriber.add_broadcast_subscriber(subscriber, identifier)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 146, in add_broadcast_subscriber
    raise kiwipy.DuplicateSubscriberIdentifier("Broadcast identifier '{}'".format(identifier))
kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '52632'

or

Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/manage/external/rmq.py", line 201, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/process_comms.py", line 541, in _continue
    proc = saved_state.unbundle(self._load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/persistence.py", line 51, in unbundle
    return Savable.load(self, load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/persistence.py", line 442, in load
    return load_cls.recreate_from(saved_state, load_context)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 234, in recreate_from
    base.call_with_super_check(process.init)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/utils.py", line 28, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/process.py", line 125, in init
    super().init()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/utils.py", line 15, in wrapper
    wrapped(self, *args, **kwargs)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 293, in init
    self.broadcast_receive, identifier=str(self.pid)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 125, in add_broadcast_subscriber
    return self._communicator.add_broadcast_subscriber(converted, identifier)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 592, in add_broadcast_subscriber
    return self._run_task(coro)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 677, in _run_task
    return self.tornado_to_kiwi_future(self._create_task(coro)).result(timeout=self.TASK_TIMEOUT)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 646, in done
    result = done_future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/futures.py", line 54, in capture_exceptions
    yield
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/utils.py", line 146, in run_task
    future.set_result((yield coro()))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 393, in add_broadcast_subscriber
    identifier = yield self._message_subscriber.add_broadcast_subscriber(subscriber, identifier)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 146, in add_broadcast_subscriber
    raise kiwipy.DuplicateSubscriberIdentifier("Broadcast identifier '{}'".format(identifier))
kiwipy.communications.DuplicateSubscriberIdentifier: Broadcast identifier '51047'

2020-11-29 11:22:39 [27009 |  ERROR]: Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/manage/external/rmq.py", line 201, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/process_comms.py", line 547, in _continue
    yield proc.step_until_terminated()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 1117, in step_until_terminated
    yield self.step()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 1108, in step
    self.transition_to(next_state)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 318, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 332, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 302, in transition_to
    self._enter_next_state(new_state)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 367, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 281, in _fire_state_event
    callback(self, hook, state)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 313, in <lambda>
    state_machine.StateEventHook.ENTERED_STATE, lambda _s, _h, from_state: self.on_entered(from_state)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/process.py", line 339, in on_entered
    super().on_entered(from_state)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/processes.py", line 639, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/plumpy/communications.py", line 137, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 619, in broadcast_send
    correlation_id=correlation_id))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 630, in _send_message
    return send_future.result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/futures.py", line 54, in capture_exceptions
    yield
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 627, in do_task
    send_future.set_result((yield coro()))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 417, in broadcast_send
    result = yield self._message_publisher.broadcast_send(body, sender, subject, correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 57, in broadcast_send
    result = yield self.publish(message, routing_key=defaults.BROADCAST_TOPIC)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/kiwipy/rmq/messages.py", line 207, in publish
    message, routing_key=routing_key, mandatory=mandatory)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/topika/common.py", line 174, in wrap
    raise gen.Return((yield func(self, *args, **kwargs)))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/topika/exchange.py", line 210, in publish
    mandatory=mandatory)))
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/topika/common.py", line 172, in wrap
    raise RuntimeError("The channel is closed")
RuntimeError: The channel is closed

I think other people observed the same.
The other thing I noticed (not sure if a cause or a consequence) is that now verdi process list gives an empty list, but the daemon workers are (while not busy) still using a lot of memory:

$ verdi daemon status
Profile: dispero2020
Daemon is running as PID 62212 since 2020-11-28 23:41:36
Active workers [8]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
62216    6.786        0  2020-11-28 23:41:37
62217    6.434        0  2020-11-28 23:41:37
62218    7.091        0  2020-11-28 23:41:37
62219    9.856        0  2020-11-28 23:41:37
62220   15.758        0  2020-11-28 23:41:37
62221    5.442        0  2020-11-28 23:41:37
62222    8.017        0  2020-11-28 23:41:37
62223    8.67         0  2020-11-28 23:41:37
Use verdi daemon [incr | decr] [num] to increase / decrease the amount of workers

I don't know if this is a memory leak. What I can say is that I was submitting quite a lot of processes/workflows and my machine was under heavy stress. Maybe everything became so slow that the two heartbeats were missed?
@sphuber @muhrin @unkcpz @chrisjsewell

@muhrin
Contributor

muhrin commented Nov 29, 2020

Hi Giovanni,

This does indeed look like a possible manifestation of #3973. Could you also provide the relevant portion of your RMQ log? I presume you'll see the heartbeat misses causing disconnections.

As a starting hypothesis, one thing that could be happening is that the communications thread (responsible for both sending the 'submit' messages and responding to heartbeats) is blocked while submitting and doesn't get a chance to respond to heartbeats. I would have to look at the code, but this seems a little strange because I'm pretty sure there are yields even when submitting, which should allow heartbeats to be interleaved.
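To make the hypothesis above concrete, here is a minimal sketch of how a submission loop without yield points can starve a heartbeat responder that shares its event loop. It uses the standard-library `asyncio` as a stand-in for the tornado coroutines visible in the tracebacks; `heartbeat`, `submit_all` and `run` are hypothetical names for illustration and are not part of kiwipy, plumpy or aiida-core.

```python
import asyncio


async def heartbeat(beats, stop):
    # Stand-in for the communicator answering broker heartbeats.
    while not stop.is_set():
        beats.append(1)
        await asyncio.sleep(0)  # hand control back to the event loop


async def submit_all(n_jobs, cooperative):
    # Stand-in for sending n_jobs 'submit' messages.
    for _ in range(n_jobs):
        if cooperative:
            # A yield point: other coroutines (the heartbeat) may run here.
            await asyncio.sleep(0)
        # Without the yield point the loop never gives up control.


async def run(cooperative, n_jobs=100):
    # Returns how many heartbeats were answered while submitting.
    beats = []
    stop = asyncio.Event()
    task = asyncio.create_task(heartbeat(beats, stop))
    await submit_all(n_jobs, cooperative)
    stop.set()
    await task
    return len(beats)


if __name__ == "__main__":
    print("heartbeats, blocking submit:   ", asyncio.run(run(cooperative=False)))
    print("heartbeats, cooperative submit:", asyncio.run(run(cooperative=True)))
```

In the blocking variant the heartbeat coroutine never gets scheduled before submission finishes; with a yield per submitted job the two interleave. Whether the real communicator thread behaves like the blocking variant under load is exactly the open question here.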

As for the verdi process list, does this not use the database as the source of its information rather than anything to do with workers? Is there any state attached to the corresponding calculation nodes? Are the nodes even there?

@sphuber
Contributor

sphuber commented Nov 30, 2020

@giovannipizzi this is a wild stab in the dark, but is there any chance that you are running multiple AiiDA profiles on this machine? Not necessarily even in the same virtual environment but just on the same machine? Are we sure that they all use different profile_uuids?

@giovannipizzi
Member Author

Yes, in different venvs, but I double-checked and they have different UUIDs:

$ cat .virtualenvs/aiida-prod/.aiida/config.json | grep UUID
            "PROFILE_UUID": "7edea2d17c9646b8b351767452828469",
            "PROFILE_UUID": "c8717bdf79f644a6a7e2a49f5adb3122",
            "PROFILE_UUID": "05a409a0c94f4635bf43e2ee9cab8084",
            "PROFILE_UUID": "e6e0ad3e082d4d24bcbd2527f70afb20",
$ cat .aiida/config.json | grep UUID
            "PROFILE_UUID": "78b6264edbed4bcfaf826b191b3c54bd",
            "PROFILE_UUID": "125ac5ce5356443ab0df5682f60c8736",
            "PROFILE_UUID": "2d8b41669b704347a67218cc64889a6e",
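The same check can be scripted instead of eyeballing the `grep` output. The sketch below only assumes that each `config.json` is valid JSON and stores profile UUIDs under a `PROFILE_UUID` key, as the output above shows; it searches at any nesting depth, so it does not rely on the exact layout of the file. The function names are hypothetical helpers, not part of AiiDA.

```python
import json
from collections import Counter


def find_profile_uuids(obj):
    """Yield every value stored under a 'PROFILE_UUID' key, at any depth."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "PROFILE_UUID":
                yield value
            else:
                yield from find_profile_uuids(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_profile_uuids(item)


def duplicated_uuids(configs):
    """Return the sorted UUIDs that occur more than once across parsed configs."""
    counts = Counter()
    for config in configs:
        counts.update(find_profile_uuids(config))
    return sorted(uuid for uuid, n in counts.items() if n > 1)


def load_configs(paths):
    """Parse each config file given by path into a Python object."""
    configs = []
    for path in paths:
        with open(path) as handle:
            configs.append(json.load(handle))
    return configs
```

For the two files above one would call `duplicated_uuids(load_configs([".virtualenvs/aiida-prod/.aiida/config.json", ".aiida/config.json"]))`; an empty list means no UUID is shared between profiles.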

Also, I'm quite sure at least some of the problems appeared when only one of the profiles was actually doing something.

@giovannipizzi
Member Author

@muhrin indeed there are a few missed heartbeats in there:
Missed heartbeats from client, timeout: 60s
[email protected]
To me they seem too few, but anyway that's definitely (one of) the problem(s).
Check the first half of the log, before I restarted the server, AiiDA, etc.

Let me know if you need more logs (and which ones).

As I mentioned, at some point I submitted a very large number of calculations and my computer got very slow (tens of seconds even just to run verdi process list). So this might indeed have been the cause (BTW, I'm running with 500 slots per worker, and ~8 workers).
By submitting more "gently", all failed calculations are now going through OK (AiiDA 1.5.0).
BTW, this might have been the same reason for the multiple failures in #4595? I'm not sure
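One way to submit more "gently" is to cap the number of active processes and wait before submitting more. With 500 slots per worker and ~8 workers the daemon can hold roughly 500 × 8 = 4000 processes at once, so a cap well below that leaves headroom. The sketch below is a generic throttle under stated assumptions: `submit_one` and `count_active` are hypothetical callables supplied by the caller (in AiiDA they could wrap `aiida.engine.submit` and a query for non-terminated process nodes, respectively), and the cap and polling interval are placeholders to tune.

```python
import time


def submit_throttled(jobs, submit_one, count_active, max_active,
                     poll_interval=30.0, sleep=time.sleep):
    """Submit jobs one by one, pausing while max_active or more are active.

    submit_one(job)  -> submits a single job and returns a handle (hypothetical)
    count_active()   -> number of processes currently active (hypothetical)
    sleep            -> injectable for testing; defaults to time.sleep
    """
    submitted = []
    for job in jobs:
        # Wait until there is room below the cap before submitting more.
        while count_active() >= max_active:
            sleep(poll_interval)
        submitted.append(submit_one(job))
    return submitted
```

This does not fix the underlying problem; it only keeps the machine from reaching the overload regime in which the heartbeats were missed.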

Anyway

  1. OK to mark this as duplicate (if you believe it is - maybe it isn't, in the other issue it was a problem of long-running blocking tasks, here it's probably more an issue of overload of the computer)
  2. It would be good to have some mitigation strategy? Not sure what, but even some information to the user if they are submitting "too much"?
  3. for verdi process list, and the large memory usage: I'm quite sure that all jobs were 'completed' in some sense. My guess is that
    • heartbeat was missed, a new worker started to work on the same process
    • one of the two managed to 'complete' (probably excepting the job)
    • the other one had some memory usage and remained in the RAM of the workers as nobody was telling it how to continue (since the actual messages were consumed by the one that excepted)
      and so workers remained "inactive" (0% CPU) but with GB of memory blocked?
      Indeed after the restart of the daemon, all went back to normal.
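The scenario guessed at above can be reduced to a toy model: a registry of broadcast subscribers keyed by process PID, which refuses a second registration under the same identifier. This is only a sketch that mirrors the `raise` statement at the bottom of the first two tracebacks; `BroadcastRegistry` and `continue_process` are hypothetical names, the exception class is a local stand-in for the kiwipy one, and it assumes the redelivered task reaches a communicator that still holds the earlier subscription.

```python
class DuplicateSubscriberIdentifier(Exception):
    """Local stand-in for the kiwipy exception of the same name."""


class BroadcastRegistry:
    """Toy registry that maps an identifier to a broadcast subscriber."""

    def __init__(self):
        self._subscribers = {}

    def add_broadcast_subscriber(self, subscriber, identifier):
        # Same message format as the raise in the tracebacks.
        if identifier in self._subscribers:
            raise DuplicateSubscriberIdentifier(
                "Broadcast identifier '{}'".format(identifier))
        self._subscribers[identifier] = subscriber
        return identifier

    def remove_broadcast_subscriber(self, identifier):
        self._subscribers.pop(identifier)


def continue_process(registry, pid):
    """Stand-in for a worker picking up a 'continue' task for process pid."""
    return registry.add_broadcast_subscriber(lambda *args: None, str(pid))


if __name__ == "__main__":
    registry = BroadcastRegistry()
    continue_process(registry, 52632)      # first delivery: registers fine
    try:
        continue_process(registry, 52632)  # redelivery after a missed heartbeat
    except DuplicateSubscriberIdentifier as exc:
        print("second delivery failed:", exc)
```

If the first copy of the process never unsubscribes (for instance because it is stuck waiting for messages that were consumed elsewhere), every redelivery of the same task fails in this way.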

BTW, now I restarted the daemon - but for next time, is there a way to know "live", before restarting the daemon, what listeners exist in running daemons? I.e., to discover who's using all that memory, which processes they're waiting upon, ...?

@giovannipizzi giovannipizzi changed the title from "A lot of failed jobs in AiiDA 1.5 (DuplicateSubcriber)" to "A lot of failed jobs in AiiDA 1.5 (DuplicateSubcriber) when machine is overloaded" Dec 1, 2020
@chrisjsewell
Member

I think this is related to my proposal in #4595 (comment), i.e. this exception should result in a TaskRejected exception that does not set an exception on the node.
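A rough sketch of what this proposal could look like: the task handler translates the duplicate-subscriber error into a rejection, so the task is handed back instead of the node being marked as excepted. Both exception classes are local stand-ins named after the ones mentioned in this thread, and `handle_task`, `load_process` and `node_state` are hypothetical; this is not the change that was eventually merged.

```python
class DuplicateSubscriberIdentifier(Exception):
    """Local stand-in for the kiwipy exception of the same name."""


class TaskRejected(Exception):
    """Local stand-in: the worker declines the task without failing it."""


def handle_task(load_process, pid, node_state):
    """Run a 'continue' task for pid, recording failures in node_state.

    load_process(pid) is a hypothetical callable that recreates and runs
    the process; node_state is a dict standing in for the stored node.
    """
    try:
        return load_process(pid)
    except DuplicateSubscriberIdentifier as exc:
        # The process is apparently already loaded somewhere: reject the
        # task and leave the node untouched.
        raise TaskRejected(
            "process {} is already being run".format(pid)) from exc
    except Exception:
        # Any other failure is a genuine exception of the process.
        node_state[pid] = "excepted"
        raise
```

The design choice is that only the specific duplicate-identifier error is treated as "someone else has this process"; all other errors keep their current behaviour.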

@sphuber sphuber self-assigned this Oct 22, 2022
@sphuber sphuber added the priority/critical-blocking label (must be resolved before next release) and removed the priority/important label Oct 22, 2022
@sphuber sphuber added this to the v2.1 milestone Oct 22, 2022
@sphuber
Contributor

sphuber commented Oct 27, 2022

I closed this through #5715 because it may solve at least part of these cases. Since these reports are very old, it is difficult to know to what extent the fix will work. It is very likely that the bug is still present but will simply occur less often.

If someone comes across this bug in aiida-core>=2.1 please just open a new issue and reference this one as being related.


4 participants