Replies: 9 comments 11 replies
-
Actually, there haven't been substantial changes to restore between 0.5.1 and 0.5.3.
-
OK, we did not clean out the restore dir when switching versions of rustic. Do you think the system is spending time trying to resume? We will try the following steps: stop the 0.5.3 restore now, reboot, remove the restore dir content, and try the restore again.
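Roughly, the steps we have in mind look like this; the paths, repository URL and snapshot ID are placeholders, not our real layout:

```sh
# Stop the running 0.5.3 restore (or stop the Docker container it runs in).
pkill -INT rustic

# After the reboot, remove the leftover content of the aborted restore
# ("/mnt/restore-target" is a placeholder path).
rm -rf /mnt/restore-target/*

# Then start the restore again into the now-empty target directory.
rustic -r rest:http://rest-server:8000/repo restore <SNAPSHOT_ID> /mnt/restore-target
```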
-
About the memory usage: first of all, thanks a lot for using rustic with such a big repository! I think there are things that can be improved to better support this scale, so I'm very happy to get your feedback. As a first step, you could try out #624, which increases performance but should also reduce memory usage quite a bit, because it stops saving the filenames for each blob used in a file.
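If you prefer building from source rather than grabbing the CI artifacts, here is a rough sketch of trying out the PR build (this assumes a Rust toolchain on the build machine; the local branch name "pr624" is just an arbitrary label):

```sh
# Fetch the PR #624 branch from GitHub and build a release binary.
git clone https://github.com/rustic-rs/rustic.git
cd rustic
git fetch origin pull/624/head:pr624
git checkout pr624
cargo build --release

# The resulting binary:
./target/release/rustic --version
```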
-
We cleaned out the restore directory and are now trying the PR624 build: ./rustic --version
Running the restore from the REST server now. Will let you know soon about the results. If this fails with "Out of memory" errors... we're just planning next steps.
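One thing we may add is running the PR build under GNU time, so that if it does get OOM-killed again we at least capture the peak memory use. The repository URL, snapshot ID and target path below are placeholders, and GNU time (the "time" package) has to be installed in the container:

```sh
# Run the restore under GNU time; "Maximum resident set size" in the
# report is the peak RSS of the rustic process.
/usr/bin/time -v ./rustic -r rest:http://rest-server:8000/repo \
    restore <SNAPSHOT_ID> /mnt/restore-target
```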
-
At this moment, rustic PR624 is running at a hell of a speedy pace. This is much better than the earlier production releases.
-
The restore is still running... thanks for your help in getting us started. We are seeing a drop in restore performance now, which reminds me of this thread on the restic forum:
With this restore, I don't believe we have leveled out at a bottom limit as yet. We did not see this behavior with the earlier versions of rustic during our in-house testing about two weeks ago. We are currently still getting better numbers than 400 GiB per hour, but the continual drop in speed will probably affect the ETA. All this strange activity is somewhat of an ongoing mystery. Thanks for your help again.
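To narrow down whether the slowdown is on the network side or on the target disks, we can watch throughput on both while the restore runs, for example with the sysstat tools (tool availability and device naming vary; this is just an illustration, not output from our system):

```sh
# Per-disk throughput, utilization and latency on the restore target,
# sampled every 5 seconds (requires the sysstat package).
iostat -xm 5

# Per-interface network throughput, also sampled every 5 seconds.
sar -n DEV 5
```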
-
Over the past couple of weeks we started to run new tests with restic/rustic on a different set of hardware. This time the restore was performed on a system fitted with NVMe drives instead of spinning disks. One in-house theory we proposed was that the reduced restore performance over time could be related to some sort of write-amplification effect on the spinning target disks. Even if write amplification was not the main issue, we hoped the NVMe drives would at least give us much lower latency.

First we tried a restore with restic 0.15.2. The restic restore started off in the 420 GiB/hr range and then dropped over time to the 200 GiB/hr range.

Deleted dirs/files on target. The next run was terrible, starting off at 60 MiB/sec speeds... far less than even 1 Gb/sec, and we were using 10 GbE. So we quit that test.

Deleted dirs/files on target. The following run was so much better: it started off in the 140 MiB/sec range, stayed quite consistent in the 460 GiB/hr range, and completed to the end with no sign of performance degradation.

Deleted dirs/files on target. The last run also worked out well: consistent at about 450 GiB/hr throughout, with no performance degradation right to the end.

When do you think PR624 will be ready to be merged into the main branch of the rustic project? https://github.com/rustic-rs/rustic/actions/runs/4952216304?pr=624 Our snapshots are so large that any performance increase in restores would help tremendously.

I'm thinking about moving away from ZFS on the target system and going back to older XFS/LVM next, and even considering bcache after that. Let us know your thoughts. Thanks.
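One rough way we are sanity-checking the write-amplification theory is to compare what the ZFS target pool reports writing with the amount of data the restore has actually produced; the pool name and target path below are just examples, not our real layout:

```sh
# Ongoing write bandwidth per vdev/device on the ZFS target pool
# (pool name "tank" is an example).
zpool iostat -v tank 10

# Logical size of the data restored so far, for comparison
# ("/mnt/restore-target" is an example path).
du -sh /mnt/restore-target
```

If the pool's writes grow much faster than the restored data, that would point at amplification on the target side rather than at the repository or network.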
-
Sorry, I see the artifacts now. I probably was not logged into GitHub earlier.
-
BTW, #624 is already merged into main and will be included in the next release.
-
Good day.
This may or may not be directly related to: #629
We are trying to do a "rustic restore" at a customer location and are seeing some strange restore behavior.
Based on some internal testing, we decided to run rustic inside Docker on a NAS OS (based on Ubuntu), because rustic on the host would show GLIBC-related errors: `GLIBC_2.35' not found. Using Docker with an Ubuntu 22.04 container for rustic fixed those GLIBC errors.
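For reference, the container invocation looks roughly like this; the image tag is real, but the mount paths, repository URL and snapshot ID are illustrative rather than our exact setup:

```sh
# Run the rustic binary inside an Ubuntu 22.04 container so it links
# against a new enough glibc (2.35 ships with jammy).
docker run --rm -it \
    -v /path/to/rustic:/usr/local/bin/rustic:ro \
    -v /mnt/restore-target:/restore \
    ubuntu:22.04 \
    rustic -r rest:http://rest-server:8000/repo restore <SNAPSHOT_ID> /restore
```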
We started the "rustic restore" with version v0.5.1 to restore a snapshot from a REST server.
Within a few minutes after "rustic restore" started to show an ETA, the process was killed and we saw this in dmesg:
[ 3022.219218] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=a856fc8fc32b5a2732b5485844300d5c664de7fa097c4cd87f02097b087be311,mems_allowed=0,global_oom,task_memcg=/docker/a856fc8fc32b5a2732b5485844300d5c664de7fa097c4cd87f02097b087be311,task=rustic,pid=22730,uid=0
[ 3022.219262] Out of memory: Killed process 22730 (rustic) total-vm:77912012kB, anon-rss:61486824kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:127528kB oom_score_adj:0
[ 3024.740622] oom_reaper: reaped process 22730 (rustic), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The system only has 64 GB of RAM, which we thought would be enough. The snapshot has a restore size of about 72.784 TiB.
We then upgraded rustic to the latest version, 0.5.3. Now when we try to restore the same snapshot... rustic seems to be spending a lot of time at the "01:24:22 collecting file information..." prompt, and the timer keeps incrementing for over an hour. "rustic restore" is not moving a lot of data over the network during this time.
So at this stage we are kinda stuck. We assume this "rustic restore" is doing some heavier analysis that the earlier 0.5.1 did not do... but we cannot say for sure. We also don't know whether RAM is still a factor, since we believe that restic (written in Go) uses memory differently than rustic (written in Rust).
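One thing we can check while it sits in that phase is the actual resident memory of the rustic process, e.g. (assuming a single rustic process is running):

```sh
# Resident and virtual memory of the rustic process, seen from the host.
ps -o pid,rss,vsz,cmd -C rustic

# Or read it straight from /proc (VmRSS is reported in kB).
grep VmRSS /proc/$(pidof rustic)/status
```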
Any quick advice on any of this?
Thanks.