Replica job stability 1.8 #375
Closed: fiberbit wants to merge 8 commits into UtrechtUniversity:release-1.8 from fiberbit:replica-job-stability-1.8
Conversation
Enable stopping when memory usage exceeds a configurable limit. This prevents excessive memory use when large numbers of data objects need to be replicated or need revisions and memory leaks exist. After the limit is reached, the process stops in a normal way; presumably it will be started again soon by the regular cron jobs or whatever the operators use. To facilitate testing, the job can now also run in "dry_run" mode, where it logs which data objects would be processed but does not process them. Note that the current memory leaking seems directly proportional to the number of iCAT queries processed, so in dry_run mode you will not see everything. The revisions job, which among other things creates and copies AVUs, is especially prone to leaking memory very fast. Ironically, especially in the replica job, checking whether a "stop" data object exists is the major source of trouble. In no_action mode the memory used is logged at each step. This particular commit is more of a proof of concept, aimed at being ready to deal with the memory issues ad hoc. It specifically does not address other issues that are present within these jobs.

Ad hoc:
- limit set to 1.000.000.000 (1 GB), not configurable but easily adjusted temporarily with vi, as is the dry_run parameter
- psutil code is identical in both source files, not yet refactored into separate shared utility code
- the "psutil" package is already present in current Yoda due to dependencies of the data access passwords code
- [replica] and [revisions] tags added to the logging to stay consistent with this version
- manually tested, no unit or integration tests present
- during testing several issues surfaced, some quite severe, but they all seemed already present and not related to the specific memory limiting in this patch
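As a rough illustration (not the actual job code), a minimal sketch of such a guard using psutil; the MEMORY_LIMIT value, the DRY_RUN flag and process_data_object() are illustrative stand-ins:

```python
# Sketch only: guard a long-running job loop with a memory limit via psutil.
# The limit, dry_run flag and process_data_object() are illustrative assumptions.
import psutil

MEMORY_LIMIT = 1000000000  # 1 GB, hard-coded ad hoc; adjust with vi if needed
DRY_RUN = False


def memory_used():
    """Resident set size of the current process, in bytes."""
    return psutil.Process().memory_info().rss


def process_data_object(path):
    # Placeholder for the real replication / revision work.
    print("[replica] processing {}".format(path))


def run_job(data_objects):
    for count, path in enumerate(data_objects, start=1):
        if memory_used() > MEMORY_LIMIT:
            print("[replica] memory limit reached, stopping after {} objects".format(count - 1))
            break
        if DRY_RUN:
            print("[replica] dry_run: would process {}".format(path))
            continue
        process_data_object(path)


if __name__ == "__main__":
    run_job(["/tempZone/home/research-test/object{}.dat".format(i) for i in range(3)])
```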
So that we can use it from multiple call sites.
When parsing the flag failed, we did not remove it, so on the next run we ran into the same situation again.
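Roughly, the idea is the sketch below; parse_resources(), remove_replication_flag() and replicate() are hypothetical placeholders, and the "from,to" flag value format is an assumption:

```python
# Sketch only: always remove the scheduling flag, even when its value cannot be
# parsed, so the next run does not trip over the same broken AVU again.
def parse_resources(flag_value):
    parts = flag_value.split(",")
    if len(parts) != 2:
        raise ValueError("expected 'from,to', got '{}'".format(flag_value))
    return parts


def remove_replication_flag(path):
    print("[replica] removing scheduling flag from {}".format(path))


def replicate(path, resources):
    print("[replica] replicating {} from {} to {}".format(path, resources[0], resources[1]))


def handle_scheduled_object(path, flag_value):
    try:
        resources = parse_resources(flag_value)
    except ValueError as e:
        print("[replica] could not parse flag on {}: {}".format(path, e))
        remove_replication_flag(path)  # previously the bad flag stayed behind
        return
    replicate(path, resources)
    remove_replication_flag(path)


handle_scheduled_object("/tempZone/home/research-test/a.dat", "resc1,resc2")
handle_scheduled_object("/tempZone/home/research-test/b.dat", "broken-value")
```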
The message shown was 'Could not rename data object' and is now 'Could not checksum data object'.
It could fail when the user (normally rods) had no permission to access the data object. The changes were made in the current style, so rather verbose and with lots of repetition. There is probably a "rodsAdmin=" flag somewhere that could be used for this in iRODS 4.2.12 and above, but for now this is at least a practical improvement.
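One plausible shape of such a workaround is sketched below; whether the commit grants temporary access in exactly this way is an assumption, and checksum(), set_acl() and AccessError are hypothetical placeholders, not the actual msi calls:

```python
# Sketch only: try the checksum, and on a permission failure grant the service
# account temporary access, retry, and revoke the access again afterwards.
class AccessError(Exception):
    pass


def checksum(path):
    # Stand-in for the real checksum operation.
    if "no-access" in path:
        raise AccessError("rods has no access to {}".format(path))
    return "sha2:demo-checksum"


def set_acl(path, user, level):
    # Stand-in for an msiSetACL-style call.
    print("[replica] setting ACL {} for {} on {}".format(level, user, path))


def checksum_with_access_fallback(path, service_user="rods"):
    try:
        return checksum(path)
    except AccessError:
        print("[replica] could not checksum data object {}, granting access".format(path))
        set_acl(path, service_user, "read")
        try:
            return checksum(path)
        except AccessError:
            return None
        finally:
            set_acl(path, service_user, "null")  # revoke the temporary access again


print(checksum_with_access_fallback("/tempZone/home/research-test/ok.dat"))
```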
Although rare in current Yoda, it is not always the case that a data object has a replNum 0, nor that it always has a "good" status. Due to the logic in the current replica.py, at this stage the user may not yet have sufficient access to the data object. And in any case, the "rodsAdmin=" flag seems more appropriate.
Due to multiple PEPs firing and sometimes not dealing with them correctly, it is possible to have multiple "org_replication_scheduled" AVUs for the same data object; see pep_resource_modified_post. On the first occurrence in the replica job (which processes each AVU one by one, alphabetically sorted due to the query) the first AVU is processed (either successfully or not) and then *all* AVUs are deleted. So on the next occurrence in the loop within the replica job, the extra AVUs for the same data object are no longer physically present. Just ignore them, but do report them to keep track of the situation. Once the (arguably) bug that creates these multiple AVUs has been fixed, the workaround can be removed in a later commit.

Although in a general iRODS setup it could certainly be a valid scenario that multiple replicas of the same data object have been touched while the replica job is running, and you would then want to do something with them, that is out of scope for now. Instead of the "count" variable, "len(data_objects_seen)" could now be used, but I have left it as it is, only moving it after the "ignore" part.
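A minimal sketch of this workaround, with the tuples standing in for rows returned by the iCAT query; paths, values and log texts are illustrative only:

```python
# Sketch only: the query can return several "org_replication_scheduled" AVUs for
# one data object, but the first one handled already deletes them all, so later
# occurrences are only reported and skipped.
def run_replica_job(scheduled_rows):
    data_objects_seen = set()
    count = 0
    for path, resources in scheduled_rows:
        if path in data_objects_seen:
            print("[replica] extra AVU for {} ignored (already handled)".format(path))
            continue
        data_objects_seen.add(path)
        count += 1
        print("[replica] replicating {} to {}".format(path, resources))
        # ... perform the replication, then remove *all* scheduling AVUs ...


rows = [
    ("/tempZone/home/research-test/a.dat", "resc1,resc2"),
    ("/tempZone/home/research-test/a.dat", "resc1,resc2"),  # duplicate from multiple PEP firings
    ("/tempZone/home/research-test/b.dat", "resc1,resc2"),
]
run_replica_job(rows)
```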
…ered
Introduce "num_copies" (default 2). If a data object already has at least "num_copies" good replicas, skip trying to create another replica. With the current logic in Yoda it would likely fail anyway. The main motive for now is to prevent extra unnecessary actions while doing large-scale maintenance, e.g. moving all data objects from one resource to another with an "irepl" followed (possibly much later) by an "itrim". This version does not check whether both (or more) replicas are stored in different locations. It is assumed we will fix those situations later, in separate jobs.
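A small sketch of the check, assuming replica rows of (replica number, replica status) and the iRODS convention that status "1" marks a good replica:

```python
# Sketch only: skip scheduling another replica when a data object already has
# at least "num_copies" good replicas.
NUM_COPIES = 2  # default from the commit message


def good_replica_count(replicas):
    """Count replicas whose status marks them as good ("1" in iRODS)."""
    return sum(1 for repl_num, repl_status in replicas if repl_status == "1")


def needs_replication(replicas, num_copies=NUM_COPIES):
    return good_replica_count(replicas) < num_copies


# One good and one stale replica: another good copy is still needed.
print(needs_replication([("0", "1"), ("1", "0")]))  # True
# Two good replicas: skip.
print(needs_replication([("0", "1"), ("2", "1")]))  # False
```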
The replica job in Yoda 1.8 has several operational issues.
Based on v1.8.12
This patch series aims to deal with the issues in a practical way: where possible, known issues are reported, but in a way that prevents large numbers of errors in the rodsLog, and failures should not result in repeated attempts at the same replication over and over again.
These are workarounds, implemented in the current "style" of the code.
Tested ad hoc only, on a development system with separate "producer" and "consumer" nodes and with several resources on each.
Testing hint: after regular uploads (either with the Portal or iCommands), change access permissions and/or generate extra AVUs; this regularly happens in production on Yoda 1.8.12: