
Replica job stability 1.8 #375

Conversation


@fiberbit fiberbit commented Dec 4, 2023

The replica job in Yoda 1.8 has several operational issues.

Based on v1.8.12

This patch series aims to deal with the issues in a practical way: where possible, known issues are reported, but without flooding the rodsLog with large numbers of errors, and failures should not result in the same replication being attempted over and over again.

These are workarounds, implemented in the current "style".

Ad hoc tested only, on a development system with separate "producer" and "consumer" nodes, and with several resources on each.

Testing hints: after regular uploads (either with the Portal or iCommands), change access permissions and/or generate extra AVUs; this regularly happens in production on Yoda 1.8.12:

  • ichmod -M null rods testfile.06.txt # take away rights from "rods" after uploading but before running the replica job
  • imeta add -d /tempZone/home/research-alpha-02/testfile.06.txt org_replication_scheduled 'irodsRescRepl,irodsResc'
  • imeta add -d /tempZone/home/research-alpha-02/testfile.06.txt org_replication_scheduled 'la0123_02,irodsResc'
  • itrim -V -S irodsResc -N 1 testfile.06.txt && irepl -V -S irodsRescRepl -R irodsResc testfile.06.txt # after an earlier successful replication, simulates a GEO accident

Enable stopping when memory usage exceeds a configurable limit.

This prevents excessive memory use when large numbers of data objects
need to be replicated or need revisions while memory leaks exist.

After the limit is reached, the process stops in a normal way.
Presumably the jobs will be started again soon by the regular cron jobs
or whatever the operators use.

To facilitate testing, the job can now also run in "dry_run" mode,
where it logs which data objects would be processed, but does not
actually process them.

Note that the current memory leak seems directly proportional to the
number of ICAT queries processed, so in dry_run mode you will not see
the full effect.

The revisions job, which among other things creates and copies AVUs,
is especially prone to leaking memory very fast.

Ironically, especially in the replica job, checking whether a "stop"
data object exists is the major source of trouble.

In no_action mode the memory used is logged at each step.

This particular commit is more of a proof of concept, but aimed at being
ready to ad hoc deal with the memory issues. It specifically does not
address other issues that are present within these jobs.

Ad hoc:

    - limit set to 1.000.000.000 (1 GB), not configurable but easily
      temporarily adjusted with vi, as is the dry_run parameter
    - psutil code identical in both source files, not refactored to
      separate shared utility code yet
    - the "psutil" package is already present in current Yoda due to
      dependencies of the data access passwords code
    - [replica] and [revisions] tags in logging added to stay consistent
      with this version
    - manually tested, no unit or integration tests present
    - during testing several issues surfaced, some quite severe, but
      they all seemed already present and not related to the specific
      memory limiting in this patch
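
A minimal sketch of the memory guard and dry_run handling described
above, assuming psutil is available (as noted, it already is via the
data access passwords dependencies). The names MEMORY_LIMIT,
process_data_objects and replicate are illustrative placeholders, not
the actual Yoda 1.8 rule code:

    import psutil

    # Sketch only; the limit is hard-coded here, just like in the patch.
    MEMORY_LIMIT = 1000000000  # 1 GB

    def memory_rss():
        """Return the resident set size of the current process in bytes."""
        return psutil.Process().memory_info().rss

    def process_data_objects(data_objects, log, dry_run=False):
        """Process data objects until done or until the memory limit is hit."""
        for path in data_objects:
            rss = memory_rss()
            if rss > MEMORY_LIMIT:
                log("[replica] Memory limit reached ({} bytes in use), "
                    "stopping normally; remaining work is left for the "
                    "next scheduled run.".format(rss))
                return
            if dry_run:
                log("[replica] dry_run: would process {} (rss={})".format(path, rss))
                continue
            replicate(path)  # placeholder for the actual replication logic

    def replicate(path):
        pass  # the real job calls the iRODS microservices here

The same guard would apply to the revisions job, which as noted above
leaks memory fastest.
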
So that we can use it from multiple call sites.

When parsing the flag failed we did not remove it, so on the next run
we would run into the same situation again.
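
A hedged sketch of the parse-and-remove behaviour, assuming the flag
value has the "resc_from,resc_to" form shown in the testing hints
above. log_error, remove_replication_flag and replicate are
placeholder stubs, not the actual Yoda helpers:

    def log_error(message):
        print(message)  # stand-in for the job's rodsLog-based logging

    def remove_replication_flag(path, value):
        pass  # stand-in for removing the org_replication_scheduled AVU

    def replicate(path, resc_from, resc_to):
        pass  # stand-in for the actual replication

    def parse_replication_flag(value):
        """Parse a flag value of the form 'resc_from,resc_to', or return None."""
        parts = value.split(',')
        if len(parts) == 2 and all(parts):
            return parts[0], parts[1]
        return None

    def handle_flag(path, value):
        rescs = parse_replication_flag(value)
        if rescs is None:
            # Previously a malformed flag was left in place, so every run hit
            # the same error again; now it is reported once and removed.
            log_error("[replica] Invalid replication flag '{}' on {}, "
                      "removing it".format(value, path))
            remove_replication_flag(path, value)
            return
        replicate(path, rescs[0], rescs[1])
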
The message shown was 'Could not rename data object' and is now 'Could
not checksum data object'.

It could fail when the user (normally rods) had no permission to access
the data object.

Changes were done in the current style, so rather verbose and with
lots of repetition.

There is probably a "rodsAdmin=" flag somewhere which could be used in
this case for iRODS version 4.2.12 and above, but for now this is at
least a practical improvement.

Although rare in current Yoda, it is not always the case that a data
object has a replica with replNum 0, nor that such a replica always
has a "good" status.

Due to the logic in the current replica.py, at this stage the user may
not yet have sufficient access to the data object. And anyway the
"rodsAdmin=" flag seems more appropriate.
Due to multiple PEPs firing and sometimes not being handled correctly,
it is possible to have multiple "org_replication_scheduled" AVUs for
the same data object, see pep_resource_modified_post.

On the first occurrence in the replica job (which processes each AVU
one by one, alphabetically sorted due to the query) the first AVU is
processed (either successfully or not) and then *all* AVUs are deleted.

So on the next occurrence in the loop within the replica job the extra
AVUs for the same data object are no longer physically present.

Just ignore them, but do report them to keep track of the situation.

Once the (arguable) bug that creates these multiple AVUs has been
fixed, the workaround can be removed in a later commit.

In a general iRODS setup it could certainly be a valid scenario that
multiple replicas of the same data object have been touched while the
replica job is running, and that you would want to do something with
them. That is out of scope for now.

Instead of the "count" variable "len(data_objects_seen)" could now be
used, but I have left it as it is, only moving it after the "ignore"
part.
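
The "ignore but report" approach could look roughly like this;
scheduled_replications, log_info and process are placeholder stubs,
and only data_objects_seen and count correspond to names mentioned
above:

    def scheduled_replications():
        return []  # stand-in for the ICAT query yielding (path, AVU value) pairs

    def log_info(message):
        print(message)  # stand-in for the job's logging

    def process(path, avu_value):
        pass  # stand-in for handling a single scheduled replication

    data_objects_seen = set()
    count = 0

    for path, avu_value in scheduled_replications():
        if path in data_objects_seen:
            # The first occurrence already handled (and removed) all
            # org_replication_scheduled AVUs for this data object, so a
            # duplicate is only reported, not processed again.
            log_info("[replica] Duplicate replication AVU for {}, "
                     "ignoring".format(path))
            continue
        data_objects_seen.add(path)
        count += 1  # counted only after the "ignore" part, as noted above
        process(path, avu_value)
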
…ered

Introduce "num_copies" (default 2). If a data object already has at
least "num_copies" good replica's skip trying to create another replica.

With the current logic in Yoda it would likely fail anyway.

The main motive for now is being able to prevent extra unnecessary actions
while doing large-scale maintenance, e.g. moving all data objects from
one resource to another with an "irepl" followed (possibly much later)
by an "itrim".

This version does not check whether both (or more) replicas are stored
in different locations. It is assumed we will fix those situations
later, in separate jobs.
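
A sketch of the num_copies check, assuming the ruleset's genquery
helper is available and the iRODS convention that DATA_REPL_STATUS "1"
marks a good replica; the function name and query columns are
illustrative:

    NUM_COPIES = 2  # default, as described above

    # "genquery" below refers to the ruleset's query helper (assumption).
    def has_enough_good_replicas(callback, path, num_copies=NUM_COPIES):
        """Return True if the data object already has num_copies good replicas."""
        coll_name, data_name = path.rsplit('/', 1)
        good = 0
        for row in genquery.row_iterator(
                "DATA_REPL_NUM, DATA_REPL_STATUS",
                "COLL_NAME = '{}' AND DATA_NAME = '{}'".format(coll_name, data_name),
                genquery.AS_LIST, callback):
            if row[1] == "1":
                good += 1
        return good >= num_copies

The replica job would then only log a skip (and remove the flag) when
this returns True; whether the good replicas sit on different
resources is deliberately not checked, matching the note above.
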
@fiberbit fiberbit closed this Feb 5, 2024