Discussion: parallelism for borgstore #97

Open
alexandru-bagu opened this issue Feb 9, 2025 · 1 comment


alexandru-bagu commented Feb 9, 2025

Hello!

Since borgstore allows borg to work with remote repositories, I believe borg/borgstore needs to be able to handle actions in parallel to improve performance when working with remote repositories, given the inherent latency of anything that is not local.

From my tests I see that, with one exception, the backup process can be helped a lot by processing all writes in parallel at the store level.
For storing information, the main bottleneck is here:

self.backend.store(self.find(name), value)

Due to the nesting level checks, this effectively locks you down to one get and one put per chunk, sequentially.
By getting rid of the nesting check, I was able to parallelize all the sequential writes. With that change, the storage repository is no longer the bottleneck when making backups.
The code I used is this:
https://github.com/alexandru-bagu/borgstore/blob/parallel-s3-store/src/borgstore/backends/s3.py
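
To illustrate the approach, here is a minimal sketch of fanning writes out through a thread pool so the caller never blocks on an individual PUT. This is not the actual borgstore API; ParallelStore and flush are illustrative names:

    from concurrent.futures import ThreadPoolExecutor

    class ParallelStore:
        """Illustrative wrapper: queue store() calls instead of blocking on each PUT."""
        def __init__(self, backend, max_workers=16):
            self.backend = backend
            self._pool = ThreadPoolExecutor(max_workers=max_workers)
            self._pending = []

        def store(self, name, value):
            # Queue the upload; the caller keeps producing chunks immediately.
            self._pending.append(self._pool.submit(self.backend.store, name, value))

        def flush(self):
            # Surface any upload errors before committing repository state.
            for future in self._pending:
                future.result()
            self._pending.clear()

With something like this, the latency of each PUT overlaps with the work on the next chunk instead of adding up.
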
The idea behind the implementation is that all writes can be parallelized, but while a read is in progress, nothing else is allowed to happen. This can obviously be improved with a read/write lock: any number of concurrent reads, any number of concurrent writes, but never reads and writes at the same time, so the store stays consistent (see the sketch below).
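
A minimal sketch of such a lock (my own illustration, not borgstore code): any number of threads may hold it in "read" mode or in "write" mode, but the two modes never overlap:

    import threading

    class ModeLock:
        """Allow concurrent readers OR concurrent writers, never both at once."""
        def __init__(self):
            self._cond = threading.Condition()
            self._mode = None    # None, "read" or "write"
            self._active = 0     # threads currently holding the lock

        def acquire(self, mode):
            with self._cond:
                # Wait until the lock is free or already held in the same mode.
                while self._mode not in (None, mode):
                    self._cond.wait()
                self._mode = mode
                self._active += 1

        def release(self):
            with self._cond:
                self._active -= 1
                if self._active == 0:
                    self._mode = None
                    self._cond.notify_all()
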
At the end of this, I am able to easily upload at 400 Mbps to S3. The issue now is extraction, because those operations are all sequential. This is even more of a problem since borg itself is not multi-threaded: my extract network bandwidth is at most 40 Mbps because it reads one chunk at a time.
One solution would be to let the backend handle parallelism on its own by giving it an ordered list of the files that will be requested at some point. The backend can then prefetch whatever chunks it can (based on its own logic) and simply return them when load is called.
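
As a sketch of what I mean (the prefetch()/load() names follow the proposal below, not any existing borgstore interface), the backend could start downloads in request order and serve load() from the results:

    from concurrent.futures import ThreadPoolExecutor

    class PrefetchingBackend:
        """Illustrative wrapper: fetch announced chunks ahead of load() calls."""
        def __init__(self, backend, max_workers=8):
            self.backend = backend
            self._pool = ThreadPoolExecutor(max_workers=max_workers)
            self._futures = {}

        def prefetch(self, names):
            # Start downloads in the order extract will request them.
            for name in names:
                self._futures[name] = self._pool.submit(self.backend.load, name)

        def load(self, name):
            future = self._futures.pop(name, None)
            if future is not None:
                return future.result()      # already done or in flight
            return self.backend.load(name)  # not announced: plain direct read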

TLDR:

  1. Can we make nesting a configuration option so that we can effectively skip the nesting checks if we want to? Or make it optional in general?
  2. Can we update the backend interface to accept something like an iterator of all the chunks that will be downloaded? For example, an optional method each store can implement: prefetch(paths: iterator)
  3. Can borgstore have its own thread pool for parallelism that any backend can make use of?* This way there is one structure to follow for parallelism, instead of each backend having its own implementation (see the sketch after this list).
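
For item 3, a minimal sketch of what I have in mind (set_pool() is a hypothetical hook, not an existing borgstore method): the store owns one pool, hands it to the backend, and disposes of it exactly once at shutdown:

    from concurrent.futures import ThreadPoolExecutor

    class Store:
        """Illustrative: one shared pool owned by the store, used by any backend."""
        def __init__(self, backend, max_workers=16):
            self._pool = ThreadPoolExecutor(max_workers=max_workers)
            backend.set_pool(self._pool)  # hypothetical hook: backend submits I/O here
            self.backend = backend

        def close(self):
            # Single disposal point at the very end of execution; see the
            # Cygwin note below for why that matters.
            self._pool.shutdown(wait=True)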

*Cygwin has issues with stopping threads; however, if a thread pool is used, this can be a non-issue, because the pool can be disposed of only at the very end of execution. That still generates a set of warnings/errors, but they can be ignored.
Example of the warnings (roughly 1000 lines of this at the end):

      0 [] python3.9 721 sig_send: error sending signal 11, pid 721, pipe handle 0x15C, nb 0, packsize 192, Win32 error 6
   2912 [] python3.9 721 sig_send: error sending signal 11, pid 721, pipe handle 0x15C, nb 0, packsize 192, Win32 error 6
   3120 [] python3.9 721 sig_send: error sending signal 11, pid 721, pipe handle 0x15C, nb 0, packsize 192, Win32 error 6
@alexandru-bagu (Contributor, Author) commented:

I see that borg already has something in place to handle preloading of chunks, but I don't believe it is actively used with borgstore, is it?
Also, is this the wrong place for the discussion? Should it be in https://github.com/borgbackup/borgstore/discussions?
