
Replica job stability 1.8 #375

Conversation


@fiberbit fiberbit commented Dec 4, 2023

The replica job in Yoda 1.8 has several operational issues.

Based on v1.8.12

This patch series aims to deal with the issues in a practical way: where possible, known issues are reported, but without flooding the rodsLog with large numbers of errors, and failures should not result in the same replication being attempted over and over again.

These are workarounds, implemented in the current "style".

Ad hoc tested only, on a development system with separate "producer" and "consumer" nodes, and with several resources on each.

Testing hints: after regular uploads (either with the Portal or iCommands), change access permissions and/or generate extra AVUs; this regularly happens in production on Yoda 1.8.12:

  • ichmod -M null rods testfile.06.txt # take away rights from "rods" after uploading but before running the replica job
  • imeta add -d /tempZone/home/research-alpha-02/testfile.06.txt org_replication_scheduled 'irodsRescRepl,irodsResc'
  • imeta add -d /tempZone/home/research-alpha-02/testfile.06.txt org_replication_scheduled 'la0123_02,irodsResc'
  • itrim -V -S irodsResc -N 1 testfile.06.txt && irepl -V -S irodsRescRepl -R irodsResc testfile.06.txt # after an earlier successful replication, simulates a GEO accident

Enable stopping when memory usage exceeds a configurable limit.

This prevents excessive memory use when large numbers of data objects
need to be replicated or need revisions while memory leaks exist.

After the limit is reached, the process stops in a normal way.
Presumably the jobs will be started again soon by the regular cron jobs
or whatever the operators use.

To facilitate testing, the job can now also run in "dry_run" mode,
where it logs which data objects would be processed, but does not
actually process them.

Note that the current memory leak seems directly proportional to the
number of ICAT queries processed, so in dry_run mode you will not see
the full effect.

The revisions job, which among other things creates and copies AVUs,
is especially prone to leaking memory very fast.

Ironically, especially in the replica job, checking whether a "stop"
data object exists is the major source of trouble.

In no_action mode the memory used is logged at each step.

This particular commit is more of a proof of concept, but aimed at being
ready to ad hoc deal with the memory issues. It specifically does not
address other issues that are present within these jobs.

Ad hoc:

    - limit set to 1.000.000.000 (1 GB), not configurable but easily
      temporarily adjusted with vi, as is the dry_run parameter
    - psutil code identical in both source files, not refactored to
      separate shared utility code yet
    - the "psutil" package is already present in current Yoda due to
      dependencies of the data access passwords code
    - [replica] and [revisions] tags in logging added to stay consistent
      with this version
    - manually tested, no unit or integration tests present
    - during testing several issues surfaced, some quite severe, but
      they all seemed already present and not related to the specific
      memory limiting in this patch
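
A minimal sketch of the memory guard and dry_run handling described
above, assuming psutil is available (as noted, it already is via the
data access passwords dependencies). The names MEMORY_LIMIT,
process_data_objects and replicate are illustrative placeholders, not
the actual Yoda 1.8 rule code:

    import psutil

    # Sketch only; the limit is hard-coded here, just like in the patch.
    MEMORY_LIMIT = 1000000000  # 1 GB

    def memory_rss():
        """Return the resident set size of the current process in bytes."""
        return psutil.Process().memory_info().rss

    def process_data_objects(data_objects, log, dry_run=False):
        """Process data objects until done or until the memory limit is hit."""
        for path in data_objects:
            rss = memory_rss()
            if rss > MEMORY_LIMIT:
                log("[replica] Memory limit reached ({} bytes in use), "
                    "stopping normally; remaining work is left for the "
                    "next scheduled run.".format(rss))
                return
            if dry_run:
                log("[replica] dry_run: would process {} (rss={})".format(path, rss))
                continue
            replicate(path)  # placeholder for the actual replication logic

    def replicate(path):
        pass  # the real job calls the iRODS microservices here

The same guard would apply to the revisions job, which as noted above
leaks memory fastest.
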
So that we can use it from multiple call sites.

When parsing the flag failed we did not remove it, so on the next run
we would run into the same situation again.
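
A hedged sketch of the parse-and-remove behaviour, assuming the flag
value has the "resc_from,resc_to" form shown in the testing hints
above. log_error, remove_replication_flag and replicate are
placeholder stubs, not the actual Yoda helpers:

    def log_error(message):
        print(message)  # stand-in for the job's rodsLog-based logging

    def remove_replication_flag(path, value):
        pass  # stand-in for removing the org_replication_scheduled AVU

    def replicate(path, resc_from, resc_to):
        pass  # stand-in for the actual replication

    def parse_replication_flag(value):
        """Parse a flag value of the form 'resc_from,resc_to', or return None."""
        parts = value.split(',')
        if len(parts) == 2 and all(parts):
            return parts[0], parts[1]
        return None

    def handle_flag(path, value):
        rescs = parse_replication_flag(value)
        if rescs is None:
            # Previously a malformed flag was left in place, so every run hit
            # the same error again; now it is reported once and removed.
            log_error("[replica] Invalid replication flag '{}' on {}, "
                      "removing it".format(value, path))
            remove_replication_flag(path, value)
            return
        replicate(path, rescs[0], rescs[1])
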
The message shown was 'Could not rename data object' and is now 'Could
not checksum data object'.

It could fail when the user (normally rods) had no permission to access
the data object.

Changes were done in the current style, so rather verbose and with
lots of repetition.

There is probably a "rodsAdmin=" flag somewhere which could be used in
this case for iRODS version 4.2.12 and above, but for now this is at
least a practical improvement.

Although rare in current Yoda, it is not always the case that a data
object has a replica with replNum 0, nor that such a replica always
has a "good" status.

Due to the logic in the current replica.py, at this stage the user may
not yet have sufficient access to the data object. And anyway the
"rodsAdmin=" flag seems more appropriate.
Due to multiple PEPs firing and sometimes not being handled correctly,
it is possible to have multiple "org_replication_scheduled" AVUs for
the same data object, see pep_resource_modified_post.

On the first occurrence in the replica job (which processes each AVU
one by one, alphabetically sorted due to the query) the first AVU is
processed (either successfully or not) and then *all* AVUs are deleted.

So on the next occurrence in the loop within the replica job the extra
AVUs for the same data object are no longer physically present.

Just ignore them, but do report them to keep track of the situation.

Once the (arguable) bug that creates these multiple AVUs has been
fixed, the workaround can be removed in a later commit.

In a general iRODS setup it could certainly be a valid scenario that
multiple replicas of the same data object have been touched while the
replica job is running, and that you would want to do something with
them. That is out of scope for now.

Instead of the "count" variable "len(data_objects_seen)" could now be
used, but I have left it as it is, only moving it after the "ignore"
part.
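
The "ignore but report" approach could look roughly like this;
scheduled_replications, log_info and process are placeholder stubs,
and only data_objects_seen and count correspond to names mentioned
above:

    def scheduled_replications():
        return []  # stand-in for the ICAT query yielding (path, AVU value) pairs

    def log_info(message):
        print(message)  # stand-in for the job's logging

    def process(path, avu_value):
        pass  # stand-in for handling a single scheduled replication

    data_objects_seen = set()
    count = 0

    for path, avu_value in scheduled_replications():
        if path in data_objects_seen:
            # The first occurrence already handled (and removed) all
            # org_replication_scheduled AVUs for this data object, so a
            # duplicate is only reported, not processed again.
            log_info("[replica] Duplicate replication AVU for {}, "
                     "ignoring".format(path))
            continue
        data_objects_seen.add(path)
        count += 1  # counted only after the "ignore" part, as noted above
        process(path, avu_value)
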
…ered

Introduce "num_copies" (default 2). If a data object already has at
least "num_copies" good replica's skip trying to create another replica.

With the current logic in Yoda it would likely fail anyway.

The main motive for now is being able to prevent extra unnecessary actions
while doing large-scale maintenance, e.g. moving all data objects from
one resource to another with an "irepl" followed (possibly much later)
by an "itrim".

This version does not check whether both (or more) replicas are stored
in different locations. It is assumed we will fix those situations
later, in separate jobs.
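
A sketch of the num_copies check, assuming the ruleset's genquery
helper is available and the iRODS convention that DATA_REPL_STATUS "1"
marks a good replica; the function name and query columns are
illustrative:

    NUM_COPIES = 2  # default, as described above

    # "genquery" below refers to the ruleset's query helper (assumption).
    def has_enough_good_replicas(callback, path, num_copies=NUM_COPIES):
        """Return True if the data object already has num_copies good replicas."""
        coll_name, data_name = path.rsplit('/', 1)
        good = 0
        for row in genquery.row_iterator(
                "DATA_REPL_NUM, DATA_REPL_STATUS",
                "COLL_NAME = '{}' AND DATA_NAME = '{}'".format(coll_name, data_name),
                genquery.AS_LIST, callback):
            if row[1] == "1":
                good += 1
        return good >= num_copies

The replica job would then only log a skip (and remove the flag) when
this returns True; whether the good replicas sit on different
resources is deliberately not checked, matching the note above.
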
@fiberbit fiberbit closed this Feb 5, 2024