Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial Checksum Extension #172

Open
wants to merge 6 commits into
base: 1.1
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions protocol.md
Original file line number Diff line number Diff line change
Expand Up @@ -568,6 +568,127 @@ Tus-Resumable: 1.0.0
Upload-Offset: 11
```

### Partial Checksum

This extension allows Clients that use the Checksum extension to verify and confirm the
integrity of chunks that were not received in full by the Server - thereby avoiding that
intact data needs to be sent again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be changed to also allow for requests that are received in full but where the checksum does not match? E.g. if the file has a Upload-Length of 100 bytes with the first 50 already written and the client uploads the next 50 but with an invalid checksum, should we include the headers to let the client re-calculate the checksum and not have to resend the 50? I would argue that this makes the server side implementation easier as we do not have to differentiate between if the client disconnected or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Partial Checksum extension builds on top of the Checksum extension, which stays intact for completed requests.

In your example the PATCH request for bytes 50-100 would - to satisfy the Checksum extension requirements - already include an Upload-Checksum header with the checksum of those 50 bytes.

Per the Checksum extension, the Server would check if the checksum for those 50 bytes would match what the Client sent in the Upload-Checksum header. If it does not, the Server would respond with a 460 Checksum Mismatch error and discard the 50 bytes. Which I'd argue is the right thing to do when the Checksum for the fully received data doesn't match: the Client should still come up with the same checksum as the one it sent originally - and if it is tolerant to checksum errors, it could probably skip using the Checksum and Partial Checksum extensions altogether.

From the server-side, I'd think differentiating between partial/interrupted and fully received requests should be relatively simple, as the server can assume a partial/interrupted request when the received request body's byte count is less than the number in the request's Content-Length header - and a fully received request if byte count and header value match.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do now how the checksum extension work. ;) My comment was more to highlight a different angle on how it could be implemented. As long as it's clear how this extension behaves when both this and checksum is supported I don't think there is an issue here. The use case proposed (for invalid checksum) is very niche and just an idea.


Servers that support this extension MUST add `partial-checksum` to the `Tus-Extension` header.

If a Server receives an incomplete `PATCH` or `POST` request with `Upload-Checksum` header, it
SHOULD store the incompletely received chunk's data (hereafter referred to as *partial data*) in
a separate location instead of discarding it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea for implementation here (this is nitpicking but anyway) is that the server stores the data in the actual file but keeps some kind of marker which bytes are "partial" or "not validated" or similar. This would reduce reduce the number of writes needed for copying the data from one location to another.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text only says store … in a separate location, which could be the same file, or a separate file.

Thinking about it for a bit, the way to implement this with the least filesystem / copy overhead would likely be to simply keep writing all received data to a file, use file truncation as needed to discard data or partial data with bad checksums, and keep track of offsets and partial segments through a separate (JSON) file.

Writing that metadata to a chunk at the end of the file could be even more efficient, but could leave you with invalid data if there's a power outage or crash before the metadata chunk could be written to disk. An outdated JSON file, OTOH, would just prompt the client to upload the possibly imcompletely written part of the data again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I said "nitpicking" :) It was meant to just bring other solutions for the record in case people read this and interpret is as "store in a different file on disk" which was my first interpretation.


If the Server receives a `HEAD` request on the upload thereafter, it SHOULD compute a checksum
on the partial data and then return it, alongside the partial data's start and end offsets relative to
the beginning of the file, in the `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`
response headers respectively.

A Client can then use the range returned by the Server, compute its own checksum for the
same range and compare it against the checksum returned by the Server.

If the checksums are identical, the Client SHOULD confirm the partial data's integrity to the
Server by sending the checksum and range in the `Upload-Partial-Checksum` and
`Upload-Partial-Checksum-Range` headers of the next `PATCH` request. The `Upload-Offset`
header of the `PATCH` request MUST take the confirmed range into account and represent
the actual offset from where the upload should be resumed.

If the checksums don't match, the Client MUST continue the upload from the `Upload-Offset`
returned by the `HEAD` request, at which point the Server MAY discard any partial data it may
still be holding.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section here is very clear. Nicely done! Would this be correct from a servers perspective just to clarify that I understand this correctly? The pretense is that the client started a PATCH request with a Upload-Checksum header and then disconnected:

  • The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.
  • In the next PATCH request, if the Upload-Partial-Checksum matches the bytes for the file specified in Upload-Partial-Checksum-Range then the new Upload-Offset is the upload offset before the disconnected PATCH request + the range of the range header. If the Upload-Partial-Checksum header is not the correct checksum (or is missing) then the Upload-Offset is the offset from before the disconnected PATCH request and the server could safely remove the partial data.
  • Do we wish to just discard the partial data if the Upload-Partial-Checksum is invalid or do we wish to return an error to the client? Scratch this, it is answered below. We will return a 409 Conflict.

Copy link
Author

@felix-schwarz felix-schwarz Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The HEAD request will contain Upload-Partial-Checksum and the Upload-Partial-Checksum-Range headers.

If you mean that these headers would be contained in the response to the HEAD request, then: yes, that's exactly how it is meant to work.

(The HEAD request itself would not contain those headers.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry! Obviously meant the HEAD response :)


If a Server receives `Upload-Partial-Checksum` and `Upload-Partial-Checksum-Range`
headers as part of a `PATCH` request and they match the Server's computed results for its stored
partial data for the upload, the Server MUST append the partial data to the upload and move its
internal upload offset forward accordingly, before processing the request any further.

If the Server has disposed of the partial data or if the Client sends range and checksum values
that do not match those of the Server for the partial data, the Server MUST dispose of the
partial data (if any) and MUST respond with a `409 Conflict` status.

If the Server receives a `PATCH` request with an `Upload-Offset` other than the end of
the partial data's range, it SHOULD dispose of the partial data.

Clients receiving a `409 Conflict` status in response SHOULD send a new `HEAD` request to the
Server to determine the upload's status before sending the next `PATCH` request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: This second HEAD request would also include the Upload-Partial-Checksum-X headers so that the client can try again? Is there a time limit here or is it forever for a specific file?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the server would already have to have disposed of the partial data at that point:

If the Server has disposed of the partial data or if the Client sends range and checksum values
that do not match those of the Server
for the partial data, the Server MUST dispose of the
partial data
(if any) and MUST respond with a 409 Conflict status.

With partial data gone, the Server also shouldn't respond with any Upload-Partial-Checksum-X headers anymore.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Makes sense. Thanks for clarifying!


#### Headers

##### Upload-Partial-Checksum-Range

The `Upload-Partial-Checksum-Range` header MUST describe the byte range of the partial
data within the upload, relative to the start of the upload, using this format: `[first byte offset]-[last byte offset]`

For example `0-499` for the first 500 bytes, `500-999` for the second 500 bytes.

##### Upload-Partial-Checksum

The `Upload-Partial-Checksum` header uses the same format as the `Upload-Checksum`
header, but MUST contain the checksum of the partial data.

#### Example

The Client tries to append 11 bytes using a `PATCH` request:

**Request**

```
PATCH /files/17f44dbe1c4bace0e18ab850cf2b3a83 HTTP/1.1
Content-Length: 11
Upload-Offset: 0
Tus-Resumable: 1.0.0
Upload-Checksum: sha1 Kq5sNclPz7QV2+lfQIuc6R7oRu0=

hello w[connection breaks down]
```

Since the connection broke down before the request completed, the Client sends a `HEAD` request to determine from which offset it should resume:

**Request**

```
HEAD /files/17f44dbe1c4bace0e18ab850cf2b3a83 HTTP/1.1
Tus-Resumable: 1.0.0
```

The Server has kept the partially received data and adds range and checksum information about it to the response:

**Response**

```
HTTP/1.1 200 OK
Upload-Offset: 0
Upload-Partial-Checksum-Range: 0-6
Upload-Partial-Checksum: sha1 V2uc6R7+lKq5sfQINclPz7QoRu0=
Tus-Resumable: 1.0.0
```

The Client verifies that the checksum provided by the Server is valid for the returned range, confirms they are identical and resumes the upload from the end of the confirmed range:

**Request**

```
PATCH /files/17f44dbe1c4bace0e18ab850cf2b3a83 HTTP/1.1
Upload-Offset: 7
Upload-Partial-Checksum-Range: 0-7
Upload-Partial-Checksum: sha1 V2uc6R7+lKq5sfQINclPz7QoRu0=
Content-Length: 4
Tus-Resumable: 1.0.0

orld
```

The request completes successfully and the Server responds with the new `Upload-Offset`:

**Response**

```
HTTP/1.1 204 No Content
Tus-Resumable: 1.0.0
Upload-Offset: 11
```

### Termination

This extension defines a way for the Client to terminate completed and unfinished
Expand Down