Partial Checksum Extension #172

felix-schwarz · 2021-02-26T15:11:47Z

When using just the Checksum extension, the Server is required to discard the data of PATCH requests that ended prematurely (f.ex. because of a loss of connection), so that Clients need to re-send the entire chunk.

This extension allows Clients that use the Checksum extension to verify and confirm the integrity of chunks that were not received in full by the Server - thereby avoiding that data that was already transferred intact needs to be sent again.

Feedback welcome!

- The base64 encoded parts of the example requests and responses is invalid fake data at that point

- add context to the example requests/responses

…artial-Checksum and Upload-Partial-Checksum-Range to simplify the implementation and specification - edit and partially rewrite for clarity and consistency with the rest of the specification - include POST requests into requests that can initiate Partial Checksum usage

butonic

Great! IMO this nicely extends the core resumable upload of TUS with checksum support.

protocol.md

smatsson · 2021-03-11T07:56:15Z

protocol.md

+
+This extension allows Clients that use the Checksum extension to verify and confirm the
+integrity of chunks that were not received in full by the Server - thereby avoiding that
+intact data needs to be sent again.


Could this be changed to also allow for requests that are received in full but where the checksum does not match? E.g. if the file has a Upload-Length of 100 bytes with the first 50 already written and the client uploads the next 50 but with an invalid checksum, should we include the headers to let the client re-calculate the checksum and not have to resend the 50? I would argue that this makes the server side implementation easier as we do not have to differentiate between if the client disconnected or not.

The Partial Checksum extension builds on top of the Checksum extension, which stays intact for completed requests.

In your example the PATCH request for bytes 50-100 would - to satisfy the Checksum extension requirements - already include an Upload-Checksum header with the checksum of those 50 bytes.

Per the Checksum extension, the Server would check if the checksum for those 50 bytes would match what the Client sent in the Upload-Checksum header. If it does not, the Server would respond with a 460 Checksum Mismatch error and discard the 50 bytes. Which I'd argue is the right thing to do when the Checksum for the fully received data doesn't match: the Client should still come up with the same checksum as the one it sent originally - and if it is tolerant to checksum errors, it could probably skip using the Checksum and Partial Checksum extensions altogether.

From the server-side, I'd think differentiating between partial/interrupted and fully received requests should be relatively simple, as the server can assume a partial/interrupted request when the received request body's byte count is less than the number in the request's Content-Length header - and a fully received request if byte count and header value match.

I do now how the checksum extension work. ;) My comment was more to highlight a different angle on how it could be implemented. As long as it's clear how this extension behaves when both this and checksum is supported I don't think there is an issue here. The use case proposed (for invalid checksum) is very niche and just an idea.

protocol.md

smatsson · 2021-03-11T07:59:09Z

protocol.md

+
+If a Server receives an incomplete `PATCH` or `POST` request with `Upload-Checksum` header, it
+MAY store the incompletely received chunk's data (hereafter referred to as *partial data*) in
+a separate location instead of discarding it.


Another idea for implementation here (this is nitpicking but anyway) is that the server stores the data in the actual file but keeps some kind of marker which bytes are "partial" or "not validated" or similar. This would reduce reduce the number of writes needed for copying the data from one location to another.

The text only says store … in a separate location, which could be the same file, or a separate file.

Thinking about it for a bit, the way to implement this with the least filesystem / copy overhead would likely be to simply keep writing all received data to a file, use file truncation as needed to discard data or partial data with bad checksums, and keep track of offsets and partial segments through a separate (JSON) file.

Writing that metadata to a chunk at the end of the file could be even more efficient, but could leave you with invalid data if there's a power outage or crash before the metadata chunk could be written to disk. An outdated JSON file, OTOH, would just prompt the client to upload the possibly imcompletely written part of the data again.

That's why I said "nitpicking" :) It was meant to just bring other solutions for the record in case people read this and interpret is as "store in a different file on disk" which was my first interpretation.

protocol.md

smatsson · 2021-03-11T07:59:53Z

protocol.md

+
+If the Server receives a `HEAD` request on the upload thereafter, it MAY compute a checksum
+on the partial data and then provide the result as well as the range of the partial data within
+the upload through the `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`


"the partial data within the upload" reads a bit hard to me. Not sure on a better copy though.

After some text massaging, I arrived at this:

If the Server receives a HEAD request on the upload thereafter, it SHOULD compute a checksum
on the partial data and then return it, alongside the partial data's start and end offsets relative to
the beginning of the file, in the Upload-Partial-Checksum and Upload-Partial-Checksum-Range
response headers respectively.

What do you think?

Much clearer! Nice.

Happy you like it!

smatsson · 2021-03-11T08:42:35Z

protocol.md

+
+If the checksums don't match, the Client MUST continue the upload from the `Upload-Offset`
+returned by the `HEAD` request, at which point the Server MAY discard any partial data it may
+still be holding.


I think this section here is very clear. Nicely done! Would this be correct from a servers perspective just to clarify that I understand this correctly? The pretense is that the client started a PATCH request with a Upload-Checksum header and then disconnected:

The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.

In the next PATCH request, if the Upload-Partial-Checksum matches the bytes for the file specified in Upload-Partial-Checksum-Range then the new Upload-Offset is the upload offset before the disconnected PATCH request + the range of the range header. If the Upload-Partial-Checksum header is not the correct checksum (or is missing) then the Upload-Offset is the offset from before the disconnected PATCH request and the server could safely remove the partial data.

Do we wish to just discard the partial data if the Upload-Partial-Checksum is invalid or do we wish to return an error to the client? Scratch this, it is answered below. We will return a 409 Conflict.

The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.

If you mean that these headers would be contained in the response to the HEAD request, then: yes, that's exactly how it is meant to work.

(The HEAD request itself would not contain those headers.)

Yes sorry! Obviously meant the HEAD response :)

smatsson · 2021-03-11T08:44:16Z

protocol.md

+If a Server receives  `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`
+headers as part of a `PATCH` request and they match its own values for partial data stored
+for the upload, the Server MUST append it to the upload and move its internal upload offset
+forward accordingly, before processing the request any further.


I find this part a bit hard to read. Basically what you are saying is that if the Upload-Partial-Checksum header matches for the range specified in Upload-Partial-Checksum-Range then the server should no longer count this as partial data but instead as part of the file (either by copying it to the actual file or removing a marker or similar)?

Yes, exactly that.

I tried to make it a bit easier to read and arrived at this (in the next commit):

If a Server receives Upload-Partial-Checksum and Upload-Partial-Checksum-Range
headers as part of a PATCH request and they match the Server's computed results for its stored
partial data for the upload, the Server MUST append the partial data to the upload and move its
internal upload offset forward accordingly, before processing the request any further.

Much better, thank you.

smatsson · 2021-03-11T08:46:02Z

protocol.md

+the partial data's range, it SHOULD dispose of the partial data.
+
+Clients receiving a `409 Conflict` status in response SHOULD send a new `HEAD` request to the
+Server to determine the upload's status before sending the next `PATCH` request.


Just to clarify: This second HEAD request would also include the Upload-Partial-Checksum-X headers so that the client can try again? Is there a time limit here or is it forever for a specific file?

No, the server would already have to have disposed of the partial data at that point:

If the Server has disposed of the partial data or if the Client sends range and checksum values
that do not match those of the Server for the partial data, the Server MUST dispose of the
partial data (if any) and MUST respond with a 409 Conflict status.

With partial data gone, the Server also shouldn't respond with any Upload-Partial-Checksum-X headers anymore.

👍 Makes sense. Thanks for clarifying!

protocol.md

smatsson · 2021-03-11T08:49:49Z

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

Acconut · 2021-03-11T09:18:31Z

First of all, thank you for this PR! I have currently a lot on my plate and am a bit overworked, so I don't know yet when I have time to look into this.

felix-schwarz · 2021-03-11T15:20:29Z

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

Happy you like it! 😀

As with the last extensions I'm completely open to alternative name suggestions. As the extension adds kind of a commit/rollback mechanism, I originally thought of "Checksum Commit", but then found it hard to use the same headers for HEAD response and PATCH request, since only the latter would actually "commit" the partial data.

felix-schwarz · 2021-03-11T15:23:08Z

First of all, thank you for this PR! I have currently a lot on my plate and am a bit overworked, so I don't know yet when I have time to look into this.

Thanks for letting me know - and I hope you can find some time to rest and recharge soon.

- massaging of several paragraphs to make them easier to read

smatsson · 2021-03-16T10:05:23Z

I like this extension! To be honest I think this is what the original checksum extension should have looked like. Hindsight is 20/20 I guess :) Naming is hard and I found the name of the extension and headers a bit confusing. Will have to think about other names though as nothing comes to mind right now.

Happy you like it! 😀

As with the last extensions I'm completely open to alternative name suggestions. As the extension adds kind of a commit/rollback mechanism, I originally thought of "Checksum Commit", but then found it hard to use the same headers for HEAD response and PATCH request, since only the latter would actually "commit" the partial data.

I've been thinking about the name for a bit and think that first of all we should switch the name around to have "checksum" first. This follows the pattern for other sub extensions (e.g. creation/creation-with-upload and concatenation/concatenation-unfinished). Not sure if "checksum-partial" is a good name though as it's not the checksum in itself that is partial but rather the data. I like the idea of "checksum-commit" but I feel like it might be a bit confusing.

Some suggestions (and why they might be OK):

checksum-unfinished - There is data written that have not been finalized yet, since we cannot validate the checksum.
checksum-unverified - Same reason as above
checksum-partial-data - It's the checksum extension but with support for partial data on disconnects.

What do you think @felix-schwarz ?

felix-schwarz · 2021-03-31T11:22:39Z

I've been thinking about the name for a bit and think that first of all we should switch the name around to have "checksum" first. This follows the pattern for other sub extensions (e.g. creation/creation-with-upload and concatenation/concatenation-unfinished).

Very good point!

Not sure if "checksum-partial" is a good name though as it's not the checksum in itself that is partial but rather the data. I like the idea of "checksum-commit" but I feel like it might be a bit confusing.

My biggest issue with checksum-commit was that "commit" only really works for the client to server request that "commits" the partial data, but is confusing when returned by the server in a HEAD response. And I didn't want to have two different header names whose content would be identical.

Some suggestions (and why they might be OK):

checksum-unfinished - There is data written that have not been finalized yet, since we cannot validate the checksum.

checksum-unverified - Same reason as above

checksum-partial-data - It's the checksum extension but with support for partial data on disconnects.

Thanks! I think checksum-unverified has the same issue as checksum-commit, just reversed: it really only works well for the HEAD response, but not for the request confirming it.

Another one I could think of is checksum-segment, because the checksum targets only a segment of the file.

Depending on choice, that'd result in these header names:

Upload-Checksum-Unfinished + Upload-Checksum-Unfinished-Range: summarizes the use case of this extension quite well, but not sure about the "Checksum Unfinished" name for the extension.
Upload-Checksum-Partial-Data + Upload-Checksum-Partial-Data-Range: nicely reflects the use of "partial data" in the proposed spec. Extension could be called "Checksum for Partial Data" or "Checksum with Partial Data".
Upload-Checksum-Segment + Upload-Checksum-Segment-Range: similar to "partial data", but in a single word. Extension name not so obvious, however. Maybe "Checksum for Segment".

I'm fine with all of them, but if I had to pick one, it'd likely be Partial Data as it works well for both header names and extension name(s).

What do you think @smatsson?

smatsson · 2021-04-09T09:44:19Z

Pardon the late reply! I've been swamped the last couple of weeks.

Naming is truly hard :D I think most of these are OK and just to make it more complicated I will also throw in checksum-resumable in the mix. I'm thinking that the original checksum extension is not resumable as the data will be discarded if the checksum mismatches (I found this quite confusing when adding support to tusdotnet, #143).

I'm wondering if we have to call the extension the same name as the headers? Probably not? I mean we could call the extension "Checksum with partial data" (or "Checksum with segments" or whatever) while the headers are still called Upload-Checksum-Partial-Data-Range etc. Segment has a nice ring to it but might be confusing in regards to chunks (the words being similar, not the concept)? I agree with you that partial data is my favorite so far.

I've been meaning to write a POC on this extension (as with Client-Tag and Challenge) but just haven't gotten around to it.

felix-schwarz added 5 commits February 23, 2021 11:55

- First draft

36e8ad8

- The base64 encoded parts of the example requests and responses is invalid fake data at that point

- clarify when partial data needs to be committed

e968566

- add context to the example requests/responses

- further clarifications

73e4f7e

- Clarifications

c5ef4f3

felix-schwarz mentioned this pull request Feb 26, 2021

implement oc chunking/checksum using tus metadata owncloud/ocis#1279

Closed

butonic approved these changes Mar 3, 2021

View reviewed changes

protocol.md Outdated Show resolved Hide resolved

smatsson reviewed Mar 11, 2021

View reviewed changes

- fixing typos (thanks @butonic and @smatsson)

32a812f

- massaging of several paragraphs to make them easier to read

exalate-issue-sync bot mentioned this pull request Sep 13, 2022

Async File uploads owncloud/ocis#4572

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial Checksum Extension #172

Partial Checksum Extension #172

felix-schwarz commented Feb 26, 2021

butonic left a comment

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021

smatsson Mar 16, 2021

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021

smatsson Mar 16, 2021

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021 •

edited

Loading

smatsson Mar 16, 2021

felix-schwarz Mar 31, 2021

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021 •

edited

Loading

smatsson Mar 16, 2021

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021

smatsson Mar 16, 2021

smatsson Mar 11, 2021

felix-schwarz Mar 11, 2021

smatsson Mar 16, 2021

smatsson commented Mar 11, 2021

Acconut commented Mar 11, 2021

felix-schwarz commented Mar 11, 2021

felix-schwarz commented Mar 11, 2021

smatsson commented Mar 16, 2021

felix-schwarz commented Mar 31, 2021 •

edited

Loading

smatsson commented Apr 9, 2021

Partial Checksum Extension #172

Are you sure you want to change the base?

Partial Checksum Extension #172

Conversation

felix-schwarz commented Feb 26, 2021

butonic left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felix-schwarz Mar 11, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felix-schwarz Mar 11, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

smatsson commented Mar 11, 2021

Acconut commented Mar 11, 2021

felix-schwarz commented Mar 11, 2021

felix-schwarz commented Mar 11, 2021

smatsson commented Mar 16, 2021

felix-schwarz commented Mar 31, 2021 • edited Loading

smatsson commented Apr 9, 2021

felix-schwarz Mar 11, 2021 •

edited

Loading

felix-schwarz Mar 11, 2021 •

edited

Loading

felix-schwarz commented Mar 31, 2021 •

edited

Loading