Replies: 9 comments 11 replies
-
Actually, there haven't been substantial changes to restore between 0.5.1 and 0.5.3.
-
OK, we did not clean out the restore dir when switching versions of rustic. Do you think the system is spending time trying to resume? We will try the following steps: stop the 0.5.3 restore now, reboot, remove the restore dir content, and try the restore again.
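Roughly, the steps we have in mind look like this; the paths, repository URL and snapshot ID are placeholders, not our real layout:

```sh
# Stop the running 0.5.3 restore (or stop the Docker container it runs in).
pkill -INT rustic

# After the reboot, remove the leftover content of the aborted restore
# ("/mnt/restore-target" is a placeholder path).
rm -rf /mnt/restore-target/*

# Then start the restore again into the now-empty target directory.
rustic -r rest:http://rest-server:8000/repo restore <SNAPSHOT_ID> /mnt/restore-target
```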
-
About the memory usage: first of all, thanks a lot for using rustic with such a big repository! I think there are things that can be improved to better support this scale, so I'm very happy to get your feedback. As a first step, you could try out #624, which increases performance but should also reduce memory usage quite a bit, because it stops saving the filenames for each blob used in a file.
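If you prefer building from source rather than grabbing the CI artifacts, here is a rough sketch of trying out the PR build (this assumes a Rust toolchain on the build machine; the local branch name "pr624" is just an arbitrary label):

```sh
# Fetch the PR #624 branch from GitHub and build a release binary.
git clone https://github.com/rustic-rs/rustic.git
cd rustic
git fetch origin pull/624/head:pr624
git checkout pr624
cargo build --release

# The resulting binary:
./target/release/rustic --version
```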
-
We cleaned out the restore directory and are now trying the PR624 build: ./rustic --version
Running the restore from the REST server now. Will let you know soon about the results. If this fails with "Out of memory" errors... we're just planning next steps.
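One thing we may add is running the PR build under GNU time, so that if it does get OOM-killed again we at least capture the peak memory use. The repository URL, snapshot ID and target path below are placeholders, and GNU time (the "time" package) has to be installed in the container:

```sh
# Run the restore under GNU time; "Maximum resident set size" in the
# report is the peak RSS of the rustic process.
/usr/bin/time -v ./rustic -r rest:http://rest-server:8000/repo \
    restore <SNAPSHOT_ID> /mnt/restore-target
```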
-
At this moment, rustic PR624 is running at a hell of a speedy pace. This is much better than the earlier production releases.
-
The restore is still running... thanks for your help in getting us started. We are seeing a drop in restore performance now, which reminds me of this thread on the restic forum:
With this restore, I don't believe we have leveled out at a bottom limit as yet. We did not see this behavior with the earlier versions of rustic during our in-house testing about two weeks ago. We are currently still getting better numbers than 400 GiB per hour, but the continual drop in speed will probably affect the ETA. All this strange activity is somewhat of an ongoing mystery. Thanks for your help again.
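To narrow down whether the slowdown is on the network side or on the target disks, we can watch throughput on both while the restore runs, for example with the sysstat tools (tool availability and device naming vary; this is just an illustration, not output from our system):

```sh
# Per-disk throughput, utilization and latency on the restore target,
# sampled every 5 seconds (requires the sysstat package).
iostat -xm 5

# Per-interface network throughput, also sampled every 5 seconds.
sar -n DEV 5
```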
-
Over the past couple of weeks we started to run new tests with restic/rustic on a different set of hardware. This time the restore was performed on a system fitted with NVMe drives instead of spinning disks. One in-house theory we proposed was that the reduced restore performance over time could be related to some sort of write-amplification effect on the spinning target disks. Even if write amplification was not the main issue, we hoped the NVMe drives would at least give us much lower latency.

First we tried a restore with restic 0.15.2. The restic restore started off in the 420 GiB/hr range and then dropped over time to the 200 GiB/hr range.

Deleted dirs/files on target. The next run was terrible, starting off at 60 MiB/sec speeds... far less than even 1 Gb/sec, and we were using 10 GbE. So we quit that test.

Deleted dirs/files on target. The following run was so much better: it started off in the 140 MiB/sec range, stayed quite consistent in the 460 GiB/hr range, and completed to the end with no sign of performance degradation.

Deleted dirs/files on target. The last run also worked out well: consistent at about 450 GiB/hr throughout, with no performance degradation right to the end.

When do you think PR624 will be ready to be merged into the main branch of the rustic project? https://github.com/rustic-rs/rustic/actions/runs/4952216304?pr=624 Our snapshots are so large that any performance increase in restores would help tremendously.

I'm thinking about moving away from ZFS on the target system and going back to older XFS/LVM next, and even considering bcache after that. Let us know your thoughts. Thanks.
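One rough way we are sanity-checking the write-amplification theory is to compare what the ZFS target pool reports writing with the amount of data the restore has actually produced; the pool name and target path below are just examples, not our real layout:

```sh
# Ongoing write bandwidth per vdev/device on the ZFS target pool
# (pool name "tank" is an example).
zpool iostat -v tank 10

# Logical size of the data restored so far, for comparison
# ("/mnt/restore-target" is an example path).
du -sh /mnt/restore-target
```

If the pool's writes grow much faster than the restored data, that would point at amplification on the target side rather than at the repository or network.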
-
Sorry, I see the artifacts now. I probably was not logged into GitHub earlier.
-
BTW, #624 is already merged into main and will be included in the next release.
-
Good day.
This may or may not be directly related to: #629
We are trying to do a "rustic restore" at a customer location and are seeing some strange restore behavior.
Based on some internal testing, we decided to run rustic inside Docker on a NAS OS (based on Ubuntu), because rustic on the host would show GLIBC-related errors: `GLIBC_2.35' not found. Using Docker with an Ubuntu 22.04 container for rustic fixed those GLIBC errors.
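For reference, the container invocation looks roughly like this; the image tag is real, but the mount paths, repository URL and snapshot ID are illustrative rather than our exact setup:

```sh
# Run the rustic binary inside an Ubuntu 22.04 container so it links
# against a new enough glibc (2.35 ships with jammy).
docker run --rm -it \
    -v /path/to/rustic:/usr/local/bin/rustic:ro \
    -v /mnt/restore-target:/restore \
    ubuntu:22.04 \
    rustic -r rest:http://rest-server:8000/repo restore <SNAPSHOT_ID> /restore
```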
We started the "rustic restore" with version v0.5.1 to restore a snapshot from a REST server.
Within a few minutes after "rustic restore" started to show an ETA, the process was killed and we saw this in dmesg:
[ 3022.219218] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=a856fc8fc32b5a2732b5485844300d5c664de7fa097c4cd87f02097b087be311,mems_allowed=0,global_oom,task_memcg=/docker/a856fc8fc32b5a2732b5485844300d5c664de7fa097c4cd87f02097b087be311,task=rustic,pid=22730,uid=0
[ 3022.219262] Out of memory: Killed process 22730 (rustic) total-vm:77912012kB, anon-rss:61486824kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:127528kB oom_score_adj:0
[ 3024.740622] oom_reaper: reaped process 22730 (rustic), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The system only has 64 GB of RAM, which we thought would be enough. The snapshot has a restore size of about 72.784 TiB.
We then upgraded rustic to the latest version, 0.5.3. Now when we try to restore the same snapshot... rustic seems to be spending a lot of time at the "01:24:22 collecting file information..." prompt, and the timer keeps incrementing for over an hour. "rustic restore" is not moving a lot of data over the network during this time.
So at this stage we are kinda stuck. We assume this "rustic restore" is doing some heavier analysis that the earlier 0.5.1 did not do... but we cannot say for sure. We also don't know whether RAM is still a factor, since we believe that restic (written in Go) uses memory differently than rustic (written in Rust).
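One thing we can check while it sits in that phase is the actual resident memory of the rustic process, e.g. (assuming a single rustic process is running):

```sh
# Resident and virtual memory of the rustic process, seen from the host.
ps -o pid,rss,vsz,cmd -C rustic

# Or read it straight from /proc (VmRSS is reported in kB).
grep VmRSS /proc/$(pidof rustic)/status
```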
Any quick advice on any of this?
Thanks.