-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Partial Checksum Extension #172
base: 1.1
Are you sure you want to change the base?
Conversation
- The base64 encoded parts of the example requests and responses is invalid fake data at that point
- add context to the example requests/responses
…artial-Checksum and Upload-Partial-Checksum-Range to simplify the implementation and specification - edit and partially rewrite for clarity and consistency with the rest of the specification - include POST requests into requests that can initiate Partial Checksum usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! IMO this nicely extends the core resumable upload of TUS with checksum support.
|
||
This extension allows Clients that use the Checksum extension to verify and confirm the | ||
integrity of chunks that were not received in full by the Server - thereby avoiding that | ||
intact data needs to be sent again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be changed to also allow for requests that are received in full but where the checksum does not match? E.g. if the file has a Upload-Length of 100 bytes with the first 50 already written and the client uploads the next 50 but with an invalid checksum, should we include the headers to let the client re-calculate the checksum and not have to resend the 50? I would argue that this makes the server side implementation easier as we do not have to differentiate between if the client disconnected or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Partial Checksum extension builds on top of the Checksum extension, which stays intact for completed requests.
In your example the PATCH
request for bytes 50-100 would - to satisfy the Checksum extension requirements - already include an Upload-Checksum
header with the checksum of those 50 bytes.
Per the Checksum extension, the Server would check if the checksum for those 50 bytes would match what the Client sent in the Upload-Checksum
header. If it does not, the Server would respond with a 460 Checksum Mismatch
error and discard the 50 bytes. Which I'd argue is the right thing to do when the Checksum for the fully received data doesn't match: the Client should still come up with the same checksum as the one it sent originally - and if it is tolerant to checksum errors, it could probably skip using the Checksum and Partial Checksum extensions altogether.
From the server-side, I'd think differentiating between partial/interrupted and fully received requests should be relatively simple, as the server can assume a partial/interrupted request when the received request body's byte count is less than the number in the request's Content-Length
header - and a fully received request if byte count and header value match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do now how the checksum extension work. ;) My comment was more to highlight a different angle on how it could be implemented. As long as it's clear how this extension behaves when both this and checksum is supported I don't think there is an issue here. The use case proposed (for invalid checksum) is very niche and just an idea.
|
||
If a Server receives an incomplete `PATCH` or `POST` request with `Upload-Checksum` header, it | ||
MAY store the incompletely received chunk's data (hereafter referred to as *partial data*) in | ||
a separate location instead of discarding it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea for implementation here (this is nitpicking but anyway) is that the server stores the data in the actual file but keeps some kind of marker which bytes are "partial" or "not validated" or similar. This would reduce reduce the number of writes needed for copying the data from one location to another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text only says store … in a separate location
, which could be the same file, or a separate file.
Thinking about it for a bit, the way to implement this with the least filesystem / copy overhead would likely be to simply keep writing all received data to a file, use file truncation as needed to discard data or partial data with bad checksums, and keep track of offsets and partial segments through a separate (JSON) file.
Writing that metadata to a chunk at the end of the file could be even more efficient, but could leave you with invalid data if there's a power outage or crash before the metadata chunk could be written to disk. An outdated JSON file, OTOH, would just prompt the client to upload the possibly imcompletely written part of the data again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I said "nitpicking" :) It was meant to just bring other solutions for the record in case people read this and interpret is as "store in a different file on disk" which was my first interpretation.
protocol.md
Outdated
|
||
If the Server receives a `HEAD` request on the upload thereafter, it MAY compute a checksum | ||
on the partial data and then provide the result as well as the range of the partial data within | ||
the upload through the `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the partial data within the upload" reads a bit hard to me. Not sure on a better copy though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some text massaging, I arrived at this:
If the Server receives a
HEAD
request on the upload thereafter, it SHOULD compute a checksum
on the partial data and then return it, alongside the partial data's start and end offsets relative to
the beginning of the file, in theUpload-Partial-Checksum
andUpload-Partial-Checksum-Range
response headers respectively.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much clearer! Nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy you like it!
|
||
If the checksums don't match, the Client MUST continue the upload from the `Upload-Offset` | ||
returned by the `HEAD` request, at which point the Server MAY discard any partial data it may | ||
still be holding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section here is very clear. Nicely done! Would this be correct from a servers perspective just to clarify that I understand this correctly? The pretense is that the client started a PATCH request with a Upload-Checksum header and then disconnected:
- The HEAD request will contain
Upload-Partial-Checksum
and theUpload-Partial-Checksum-Range
headers. - In the next PATCH request, if the
Upload-Partial-Checksum
matches the bytes for the file specified inUpload-Partial-Checksum-Range
then the new Upload-Offset is the upload offset before the disconnected PATCH request + the range of the range header. If theUpload-Partial-Checksum
header is not the correct checksum (or is missing) then the Upload-Offset is the offset from before the disconnected PATCH request and the server could safely remove the partial data. - Do we wish to just discard the partial data if the
Upload-Partial-Checksum
is invalid or do we wish to return an error to the client? Scratch this, it is answered below. We will return a 409 Conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The HEAD request will contain
Upload-Partial-Checksum
and theUpload-Partial-Checksum-Range
headers.
If you mean that these headers would be contained in the response to the HEAD request, then: yes, that's exactly how it is meant to work.
(The HEAD request itself would not contain those headers.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sorry! Obviously meant the HEAD response :)
protocol.md
Outdated
If a Server receives `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range` | ||
headers as part of a `PATCH` request and they match its own values for partial data stored | ||
for the upload, the Server MUST append it to the upload and move its internal upload offset | ||
forward accordingly, before processing the request any further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this part a bit hard to read. Basically what you are saying is that if the Upload-Partial-Checksum
header matches for the range specified in Upload-Partial-Checksum-Range
then the server should no longer count this as partial data but instead as part of the file (either by copying it to the actual file or removing a marker or similar)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly that.
I tried to make it a bit easier to read and arrived at this (in the next commit):
If a Server receives
Upload-Partial-Checksum
andUpload-Partial-Checksum-Range
headers as part of aPATCH
request and they match the Server's computed results for its stored
partial data for the upload, the Server MUST append the partial data to the upload and move its
internal upload offset forward accordingly, before processing the request any further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better, thank you.
the partial data's range, it SHOULD dispose of the partial data. | ||
|
||
Clients receiving a `409 Conflict` status in response SHOULD send a new `HEAD` request to the | ||
Server to determine the upload's status before sending the next `PATCH` request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify: This second HEAD request would also include the Upload-Partial-Checksum-X
headers so that the client can try again? Is there a time limit here or is it forever for a specific file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the server would already have to have disposed of the partial data at that point:
If the Server has disposed of the partial data or if the Client sends range and checksum values
that do not match those of the Server for the partial data, the Server MUST dispose of the
partial data (if any) and MUST respond with a409 Conflict
status.
With partial data gone, the Server also shouldn't respond with any Upload-Partial-Checksum-X
headers anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Makes sense. Thanks for clarifying!
I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now. |
First of all, thank you for this PR! I have currently a lot on my plate and am a bit overworked, so I don't know yet when I have time to look into this. |
Happy you like it! 😀 As with the last extensions I'm completely open to alternative name suggestions. As the extension adds kind of a commit/rollback mechanism, I originally thought of "Checksum Commit", but then found it hard to use the same headers for |
Thanks for letting me know - and I hope you can find some time to rest and recharge soon. |
I've been thinking about the name for a bit and think that first of all we should switch the name around to have "checksum" first. This follows the pattern for other sub extensions (e.g. creation/creation-with-upload and concatenation/concatenation-unfinished). Not sure if "checksum-partial" is a good name though as it's not the checksum in itself that is partial but rather the data. I like the idea of "checksum-commit" but I feel like it might be a bit confusing. Some suggestions (and why they might be OK):
What do you think @felix-schwarz ? |
Very good point!
My biggest issue with checksum-commit was that "commit" only really works for the client to server request that "commits" the partial data, but is confusing when returned by the server in a
Thanks! I think checksum-unverified has the same issue as checksum-commit, just reversed: it really only works well for the Another one I could think of is Depending on choice, that'd result in these header names:
I'm fine with all of them, but if I had to pick one, it'd likely be Partial Data as it works well for both header names and extension name(s). What do you think @smatsson? |
Pardon the late reply! I've been swamped the last couple of weeks. Naming is truly hard :D I think most of these are OK and just to make it more complicated I will also throw in I'm wondering if we have to call the extension the same name as the headers? Probably not? I mean we could call the extension "Checksum with partial data" (or "Checksum with segments" or whatever) while the headers are still called I've been meaning to write a POC on this extension (as with Client-Tag and Challenge) but just haven't gotten around to it. |
When using just the Checksum extension, the Server is required to discard the data of
PATCH
requests that ended prematurely (f.ex. because of a loss of connection), so that Clients need to re-send the entire chunk.This extension allows Clients that use the Checksum extension to verify and confirm the integrity of chunks that were not received in full by the Server - thereby avoiding that data that was already transferred intact needs to be sent again.
Feedback welcome!