refactor: async writer + multi-part #3255
base: main
Conversation
Signed-off-by: Ion Koutsouris <[email protected]>
Force-pushed from fc1ae86 to a1787c9
Force-pushed from a1787c9 to 2353a21
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3255      +/-   ##
==========================================
- Coverage   72.18%   72.17%   -0.01%
==========================================
  Files         143      144       +1
  Lines       45629    45709      +80
  Branches    45629    45709      +80
==========================================
+ Hits        32936    32992      +56
- Misses      10621    10640      +19
- Partials     2072     2077       +5
☔ View full report in Codecov by Sentry.
self.object_store.put(&path, buffer.into()).await?;
let mut multi_part_upload = self.object_store.put_multipart(&path).await?;
let part_size = upload_part_size();
let mut tasks = JoinSet::new();
let max_concurrent_tasks = 10; // TODO: make configurable
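For context, here is a minimal sketch (not the PR's actual implementation) of how buffered bytes could be uploaded as parts with bounded concurrency, assuming object_store 0.10+ where `put_multipart` returns a `MultipartUpload`, plus tokio's `JoinSet`. The function name `upload_in_parts`, the error type, and the `part_size`/`max_concurrent_tasks` values are illustrative assumptions.

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore, PutPayload};
use tokio::task::JoinSet;

// Sketch only: split `buffer` into `part_size` chunks and upload them as
// multipart parts, keeping at most `max_concurrent_tasks` uploads in flight.
async fn upload_in_parts(
    store: &dyn ObjectStore,
    path: &Path,
    buffer: Bytes,
    part_size: usize,
    max_concurrent_tasks: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut upload = store.put_multipart(path).await?;
    let mut tasks = JoinSet::new();

    for part in buffer.chunks(part_size) {
        // `put_part` returns a future that performs the actual network upload.
        let part_fut = upload.put_part(PutPayload::from(Bytes::copy_from_slice(part)));
        tasks.spawn(part_fut);

        // Apply backpressure so memory and connections stay bounded.
        while tasks.len() >= max_concurrent_tasks {
            tasks.join_next().await.expect("JoinSet is non-empty")??;
        }
    }

    // Wait for the remaining part uploads, then finalize the multipart upload.
    while let Some(result) = tasks.join_next().await {
        result??;
    }
    upload.complete().await?;
    Ok(())
}
```

The key difference from a single `put` is that `complete` is only called after every part future has resolved, while the `JoinSet` cap keeps the fan-out from holding too many parts in flight at once.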
This is really such a major behavior change. I am not terribly familiar with the maturity level of multipart uploads in object_store.
I don't think this is necessarily a bad change, but I am doubtful that it addresses the originally linked issue.
As best as I can tell, the buffers are still going to fill up memory until the flush, and then the flush is going to fan out into parallel uploads:
flowchart LR
write --> buffer_batch;
write --> buffer_batch;
write --> buffer_batch;
buffer_batch --> flush;
flush --> p1;
flush --> p2;
flush --> p3;
flush --> p4;
p1 --> close;
p2 --> close;
p3 --> close;
p4 --> close;
Good point. In its current form we indeed still buffer longer, until the flush, but the buffering and writing now have more parallelism, so it should be faster.
We could do two things here, by the way:
- release this as is for people to experiment with (maybe after 1.0)
- iterate on this to also flush once the minimum part size is available, and also create some benchmark tests (see the sketch below)
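As a rough illustration of the second option, below is a hedged sketch of flushing once the minimum part size is available, assuming object_store's `WriteMultipart` helper (which chunks writes and uploads parts in the background). `write_incrementally`, `MIN_PART_SIZE`, and `MAX_CONCURRENT_PARTS` are hypothetical names for this example, not code from the PR.

```rust
use object_store::{path::Path, ObjectStore, WriteMultipart};

// Assumed constants for the sketch: S3-style 5 MiB minimum part size and a
// fixed cap on concurrent part uploads.
const MIN_PART_SIZE: usize = 5 * 1024 * 1024;
const MAX_CONCURRENT_PARTS: usize = 10;

// Sketch only: stream already-encoded byte chunks (e.g. flushed row groups)
// into a multipart upload, uploading a part as soon as MIN_PART_SIZE bytes
// have accumulated instead of buffering the whole file.
async fn write_incrementally(
    store: &dyn ObjectStore,
    path: &Path,
    batches: impl Iterator<Item = Vec<u8>>,
) -> object_store::Result<()> {
    let upload = store.put_multipart(path).await?;
    let mut writer = WriteMultipart::new_with_chunk_size(upload, MIN_PART_SIZE);

    for bytes in batches {
        // Backpressure: wait until fewer than MAX_CONCURRENT_PARTS uploads are in flight.
        writer.wait_for_capacity(MAX_CONCURRENT_PARTS).await?;
        // Each write may trigger background part uploads once a chunk fills up.
        writer.write(&bytes);
    }

    // Upload any trailing partial part and finalize the multipart upload.
    writer.finish().await?;
    Ok(())
}
```

Compared with buffering everything until flush, this keeps memory roughly bounded by `MIN_PART_SIZE * MAX_CONCURRENT_PARTS`, which is the kind of behavior the proposed benchmark tests could verify.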
Description
Changed to use Arrow's async writer and multi-part uploads.
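As a hedged, self-contained illustration of what "Arrow's async writer" refers to here, the sketch below writes a `RecordBatch` with parquet's `AsyncArrowWriter` (assuming the parquet crate's `async` feature). It targets a local tokio file and an invented one-column schema purely for illustration; the PR itself streams the bytes to object storage via multipart uploads instead.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::AsyncArrowWriter;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Invented schema and data, just to have something to write.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef],
    )?;

    // The async writer encodes and flushes without blocking the tokio runtime.
    let file = tokio::fs::File::create("example.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch).await?;
    writer.close().await?;
    Ok(())
}
```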
Related Issue(s)