Svnloadsub skipping load of revisions #17

Open
omanikhi wants to merge 19 commits into master from svnloadsub-rev-skipping-fix

Conversation

@omanikhi omanikhi (Contributor) commented Mar 7, 2024

Fixes: https://pds-svn-gbg.pdsvision.net/trac/SubversionCMS/ticket/1783

Changes:

  • Verify youngest after loading a shard.
  • Verify previous changed paths against previous shard paths before loading a shard0 dump.
  • Clean up corresponding shard0 after successfully dumping a shard3.

@omanikhi omanikhi requested a review from takesson March 26, 2024 10:40
@omanikhi omanikhi (Contributor Author) commented Mar 26, 2024

The "maximum number of shards processed" limit was unreachable, presumably due to an indentation error (if not intentionally). I have revived it now, but I'm not sure how it's supposed to behave in this case:

[Screenshot: 2024-03-25 at 21:49:06]

@omanikhi omanikhi (Contributor Author) commented Mar 26, 2024

Regarding the lock on svnloadsub, the way to do it is to use a PID file, which is already partially implemented, but the --pidfile parameter passed to the script is blank, so it is not being used. With a PID file in place, any further instance of the script will exit with an error if an instance is already running. If this is the desired behavior, I can make it happen.
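
For illustration, a minimal sketch of the PID-file behavior (assuming a POSIX system; acquire_pidfile is a hypothetical name, not the partial implementation already in the script):

```python
# Illustrative sketch of PID-file locking, not the script's actual code.
import os
import sys

def acquire_pidfile(path):
    if os.path.exists(path):
        with open(path) as f:
            pid = int(f.read().strip())
        try:
            os.kill(pid, 0)  # signal 0 only probes whether the process exists
            sys.exit('svnloadsub is already running with pid %d' % pid)
        except OSError:
            pass  # stale PID file left by a dead instance; safe to continue
    with open(path, 'w') as f:
        f.write(str(os.getpid()))
```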

@takesson takesson left a comment

Sorry about the late review.

svndumpsub.py Outdated

 logging.info('Dumping and uploading rev: %s from repo: %s' % (self.rev, self.repo))
 self.dump_zip_upload(self._get_svn_dump_args(self.rev, self.rev), self.rev)

-def dump_zip_upload(self, dump_args, rev):
+def dump_zip_upload(self, dump_args, rev) -> bool:
takesson (Member)

Why are we modifying the dump_zip_upload code? I really like that it can succeed even if the disk is full...

omanikhi (Contributor Author)

I've been trying to use the common and improved execute() function instead of the subprocess calls. If you remember our discussion, communicate() buffers the entire output in memory, whereas execute() now reads from and writes to disk directly, or at least that's the goal.

As for the disk being full, that can easily be handled with an except clause. What do you think?
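
To illustrate the difference (a sketch assuming Python's subprocess module; dump_to_file is a hypothetical name): communicate() accumulates the whole stdout in memory, while copying the pipe in chunks streams it straight to disk.

```python
# Illustrative sketch: stream a subprocess's stdout to disk in chunks
# instead of buffering all of it in memory via communicate().
import shutil
import subprocess

def dump_to_file(dump_args, out_path):
    with open(out_path, 'wb') as out:
        p = subprocess.Popen(dump_args, stdout=subprocess.PIPE)
        shutil.copyfileobj(p.stdout, out, 1024 * 1024)  # 1 MiB chunks
        p.stdout.close()
        if p.wait() != 0:
            raise RuntimeError('dump exited with code %d' % p.returncode)
```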

takesson (Member)

I've been trying to use the common and improved execute() function instead of the subprocess calls. If you remember our discussion, communicate() buffers the entire output in memory, whereas execute() now reads from and writes to disk directly, or at least that's the goal.

Are you saying that the previous version of dump_zip_upload placed the dump file in RAM before uploading to S3? So, committing a 1GB file would never be able to dump to S3 on our current production VMs (since all our VMs are quite low on RAM)?

As for the disk being full, that can easily be handled with an except clause. What do you think?

That would still fail, right? I would like it to stream to S3 and get the dump backed up.

Consider the scenario:

  • A VM is low on disk space and we have failed to notice the alarms.
  • A user commits successfully but consumes almost all the remaining disk space.
  • svndumpsub should not use the disk and should successfully stream the commit to S3.
  • The VM fails unrecoverably.
  • We can restore to a new VM, including the last commit.

omanikhi (Contributor Author)

Are you saying that the previous version of dump_zip_upload placed the dump file in RAM before uploading to S3? So, committing a 1GB file would never be able to dump to S3 on our current production VMs (since all our VMs are quite low on RAM)?

Yes, that's my conclusion from reading the documentation and similar reports. As to why it still works in production, couldn't it be the OS swapping the excess to disk? Note also that it is the compressed gzip archive that is buffered.

That would still fail, right? I would like it to stream to S3 and get the dump backed up.

Consider the scenario:

* A VM is low on disk space and we have failed to notice the alarms.
* A user commits successfully but consumes almost all the remaining disk space.
* svndumpsub should not use the disk and should successfully stream the commit to S3.
* The VM fails unrecoverably.
* We can restore to a new VM, including the last commit.

In this scenario the previous approach is unpredictable at best. Perhaps it would be safest to avoid the disk altogether and dump, compress, and upload the data in memory, chunk by chunk?

omanikhi (Contributor Author)

OK, I've thought some more about this, and I think what happens here is that s3client.upload_fileobj() reads, and thus drains, all the data from the stdout buffer, so p2.communicate()[0] will not return any (or much) data. I will revert this function to its original form with some improvements.
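
For context, a hedged sketch of the pipeline as discussed (assuming boto3; the repository path, revision range, bucket, and key are illustrative, and p1/p2 follow the naming in the comment above):

```python
# Illustrative sketch only, not the actual svndumpsub.py code.
import subprocess
import boto3

s3 = boto3.client('s3')

# svnadmin dump writes the dump to stdout; gzip compresses it in-stream.
p1 = subprocess.Popen(['svnadmin', 'dump', '/srv/cms/svn/documentation',
                       '-r', '100:100', '--incremental'],
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(['gzip', '-c'], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # allow p1 to receive SIGPIPE if p2 exits early

# upload_fileobj() reads p2's stdout in chunks and streams it to S3, so a
# later p2.communicate()[0] would find the pipe already drained.
s3.upload_fileobj(p2.stdout, 'example-bucket',
                  'v1/example/documentation/shard0/documentation-0000000100.svndump.gz')
p2.stdout.close()
p1.wait()
p2.wait()
```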

svndumpsub.py Outdated
logging.error('No shards were loaded')
return start, end
youngest = self._get_head(self.repo)
if youngest != end:
takesson (Member)

I think this validation should simply compare youngest with to_rev.

omanikhi (Contributor Author)

Makes sense.

svndumpsub.py Outdated
change = str(match.group(2).rstrip().rstrip('/'))
changed_paths_after.add(change)
logging.debug(change)
if changed_paths_before != changed_paths_after:
takesson (Member)

Looks like this validates the paths that were just loaded. Correct?

The ticket specified:
"Verify previous changed paths against previous shard paths before loading a shard0 dump"

It is quite important to do the validation before loading, because it must block any additional loading. Otherwise it will keep loading one revision at a time, with the risk that the issue goes unnoticed.

omanikhi (Contributor Author)

Verify previous changed paths against previous shard paths before loading a shard0 dump [if the revision does not end with 001 because then we might not have a shard0 dump].

  1. This is somewhat confusing to me. So if we're loading shard 91005 I need to also retrieve shard 91004 and compare its changed paths against svnlook changes -r 91004?
  2. Also, 91005 does not end with 001 but it is a shard0. Could you clarify what you mean?

takesson (Member)

Verify previous changed paths against previous shard paths before loading a shard0 dump [if the revision does not end with 001 because then we might not have a shard0 dump].

  1. This is somewhat confusing to me. So if we're loading shard 91005 I need to also retrieve shard 91004 and compare its changed paths against svnlook changes -r 91004?

Correct. This is the only way I can see to block further (incorrect) loading.

  2. Also, 91005 does not end with 001 but it is a shard0. Could you clarify what you mean?

Yes, so we should validate 91004.

But when loading 91000, the 90999 dump might not exist, so we skip validation. (I got it wrong with 001 in the ticket.)
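
A hedged sketch of the agreed validation (the helpers get_shard0_changed_paths and svnlook_changed_paths are hypothetical stand-ins, not the actual svndumpsub.py functions):

```python
import logging

def validate_previous_rev(repo, rev, shard_div=1000):
    """Before loading the shard0 dump for `rev`, verify that the changed
    paths recorded in the shard0 dump of rev - 1 match what `svnlook
    changes -r <rev-1>` reports for the local repository."""
    prev = rev - 1
    if prev % shard_div == shard_div - 1:
        # e.g. loading 91000: a 90999 shard0 dump may not exist, so skip.
        return True
    dumped = get_shard0_changed_paths(repo, prev)  # hypothetical helper
    local = svnlook_changed_paths(repo, prev)      # hypothetical helper
    if dumped != local:
        logging.error('Changed paths mismatch at rev %d; blocking further loading', prev)
        return False
    return True
```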

omanikhi (Contributor Author)

Done?

gz_args = [gz, '-c']

from_rev = shard
to_rev = ((int(shard / self.shard_div) + 1) * self.shard_div) - 1
takesson (Member)

Are we ever validating that youngest is NNNN000 (cleanly divisible by 1000) before loading a shard3?

That would be a great validation.

omanikhi (Contributor Author)

Now we are :)

omanikhi (Contributor Author)

There seems to be a logical issue with this check. Isn't it sufficient to check whether youngest == from_rev - 1? Although this has one exception: when youngest is 0.

2024-05-17 14:20:12 [INFO] Repository: documentation youngest: 999
2024-05-17 14:20:12 [INFO] Loaded revs: (0-999) from shard: 0 to repo: documentation
2024-05-17 14:20:12 [DEBUG] Shard key exists: v1/travelonium/documentation/shard3/0000001000/documentation-0000001000.svndump.gz
2024-05-17 14:20:12 [INFO] Shard exists, will load shard: 1000
2024-05-17 14:20:12 [INFO] Loading shard: 1000 to repo: documentation
2024-05-17 14:20:12 [DEBUG] Running: /usr/bin/svnlook youngest /srv/cms/svn/documentation
2024-05-17 14:20:12 [INFO] Repository: documentation youngest: 999
2024-05-17 14:20:12 [ERROR] Unable to load shard3 as the youngest revision: 999 is not a multiple of 1000
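
A minimal sketch of the condition being proposed (names are illustrative):

```python
def can_load_shard(youngest, from_rev):
    """Loading a shard covering from_rev..to_rev is only valid when the
    repository head is exactly the revision preceding the shard. The one
    exception is a fresh repository, where youngest is 0 and the first
    shard starts at revision 0."""
    if from_rev == 0:
        return youngest == 0  # rev 0 exists even in an empty repository
    return youngest == from_rev - 1
```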

omanikhi (Contributor Author)

How does it look?

@takesson takesson (Member) commented May 6, 2024

Regarding the lock on svnloadsub, the way to do it is to use a PID file, which is already partially implemented, but the --pidfile parameter passed to the script is blank, so it is not being used. With a PID file in place, any further instance of the script will exit with an error if an instance is already running. If this is the desired behavior, I can make it happen.

Yes, we should probably do that. I was under the incorrect assumption that cron did not start multiple instances.

omanikhi added 16 commits May 7, 2024 11:06
…te large

Even the previous approach, using the communicate() function, could potentially cause memory issues, as the function buffers the entire stdout before returning.
@omanikhi omanikhi force-pushed the svnloadsub-rev-skipping-fix branch from 6c03fd2 to c2d04fe on May 8, 2024 12:08
@omanikhi omanikhi (Contributor Author) commented:

Regarding the lock on svnloadsub, the way to do it is to use a PID file, which is already partially implemented, but the --pidfile parameter passed to the script is blank, so it is not being used. With a PID file in place, any further instance of the script will exit with an error if an instance is already running. If this is the desired behavior, I can make it happen.

Yes, we should probably do that. I was under the incorrect assumption that cron did not start multiple instances.

Done.
