Merge pull request #8778 from shirady/chunked-content-decoder-docs
Docs | Add `ChunkedContentDecoder` Documentation with State Machine Diagram
shirady authored Feb 10, 2025
2 parents 8668b3c + ea9d1e8 commit b89f7bc
Showing 2 changed files with 203 additions and 0 deletions.
174 changes: 174 additions & 0 deletions docs/design/ChunkedContentDecoder.md
@@ -0,0 +1,174 @@
# Chunked Content Decoder

HTTP Chunked Encoding is a streaming data transfer mechanism that breaks down the data stream into a series of non-overlapping segments called "chunks".
Each chunk is sent with its own size header, which tells the receiver how much data to expect in that chunk.
To indicate that the data is being sent in chunks, the header `Transfer-Encoding: chunked` is included.
Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)

The `ChunkedContentDecoder` class is a [Transform stream](https://nodejs.org/api/stream.html#class-streamtransform) that takes a chunk-encoded stream and streams out only the data. It removes everything else from the output: the chunk size headers, the optional chunk extension (chunk-signature), the trailers, and the CR NL delimiters.

## Basic Encoding Structure:
### Chunks (Without Optional Extension and Trailers)
Each chunk consists of two parts:
1. a header
2. the actual data

The header is a hexadecimal number that indicates the size of the chunk data in bytes, followed by a carriage return (CR) and a line feed (NL).
The data that follows this header is exactly the size specified in the header.
After the data, another carriage return and line feed signify the end of the chunk.
Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)

```
<size of data in hex>\r\n
<data>\r\n
...
the end of the content (last chunk):
0\r\n
\r\n
```

Example:
```
7\r\n
Mozilla\r\n
11\r\n
Developer Network\r\n
0\r\n
\r\n
```
Source of the example: [Mozilla Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
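
As an illustration of the layout above, here is a minimal sketch that produces the same bytes as the Mozilla example. The helper `encode_chunked` is hypothetical and is not part of NooBaa; it only covers the basic structure, without extensions or trailers.

```javascript
'use strict';

// Hypothetical helper (not part of NooBaa): encode an array of strings
// using the basic layout <hex size>\r\n<data>\r\n, followed by the last chunk.
function encode_chunked(parts) {
    const CRLF = '\r\n';
    let out = '';
    for (const part of parts) {
        // the size header counts bytes, not characters
        const size = Buffer.byteLength(part, 'utf8');
        out += size.toString(16) + CRLF + part + CRLF;
    }
    // zero-size last chunk, then the empty line that ends the content
    return out + '0' + CRLF + CRLF;
}

const encoded = encode_chunked(['Mozilla', 'Developer Network']);
console.log(JSON.stringify(encoded));
// "7\r\nMozilla\r\n11\r\nDeveloper Network\r\n0\r\n\r\n"
```

Note that `11` is hexadecimal: "Developer Network" is 17 bytes long.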


### Chunks With Optional Extension and Trailers
(combined with example)
```
1fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
<1fff bytes of data>\r\n - chunk data
2fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
<2fff bytes of data>\r\n - chunk data
0\r\n - last chunk
<trailer>\r\n - optional trailer
<trailer>\r\n - optional trailer
\r\n - end of content
```
Notes:
- `1fff` and `2fff` are examples of the size in hex
- trailer example: `x-amz-checksum-crc32:uOMGCw==\r\n` (key - the algorithm `crc32`, value in base64 and `\r\n` as CR NL ending of the trailer)

More info in [Wikipedia](https://en.wikipedia.org/wiki/Chunked_transfer_encoding) and in [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1).

### Chunk Extension (chunk-signature)
In HTTP there is an option for adding chunk extensions, immediately following the chunk size.
```
chunk-ext = *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
```
Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.1)

In AWS SigV4, the chunk extension carries a chunk signature, and each chunk has the following structure:
```
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
```
Source: [AWS Documentation Defining the Chunk Body](https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html)
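
To make the header format concrete, here is a minimal sketch that splits a chunk header line into its size and its extensions. The helper `parse_chunk_header` is hypothetical (it is not the parsing code of `ChunkedContentDecoder`), and it assumes extension values are plain tokens: a quoted value containing `;` would need a real tokenizer.

```javascript
'use strict';

// Hypothetical helper: parse the text of a chunk header (without the trailing \r\n).
function parse_chunk_header(header) {
    const [size_part, ...ext_parts] = header.split(';');
    const size_hex = size_part.trim();
    if (!/^[0-9a-fA-F]+$/.test(size_hex)) {
        throw new Error(`invalid chunk size: ${size_part}`);
    }
    const extensions = {};
    for (const ext of ext_parts) {
        const eq = ext.indexOf('=');
        // an extension may come without a value (chunk-ext-val is optional)
        const name = (eq === -1 ? ext : ext.slice(0, eq)).trim();
        const value = eq === -1 ? '' : ext.slice(eq + 1).trim();
        if (name) extensions[name] = value;
    }
    return { chunk_size: parseInt(size_hex, 16), extensions };
}

console.log(parse_chunk_header('1fff;chunk-signature=1a2b'));
// { chunk_size: 8191, extensions: { 'chunk-signature': '1a2b' } }
```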

The chunk extension appears in the following lines of the example encoding structure:
```
2fff;chunk-signature=1a2b\r\n - chunk header (optional extension)
<2fff bytes of data>\r\n - chunk data
```


### Trailers
In HTTP there is an option for a trailer. A trailer allows the sender to include additional fields at the end of a chunked message in order to supply metadata that might be dynamically generated while the message body is sent.
Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.2)


The name of the trailer is passed as a header, and the trailer itself (a key-value pair) is passed after the chunked body. An HTTP body can carry zero or more trailers.
Source: [Stack Overflow](https://stackoverflow.com/questions/5590791/http-chunked-encoding-need-an-example-of-trailer-mentioned-in-spec)

Amazon S3 supports chunked uploads that use `aws-chunked` content encoding for `PutObject` and `UploadPart` requests with trailing checksums.

When a request has the header `x-amz-trailer`, its value is the name of the trailing header in the request. For trailing checksums, the `x-amz-trailer` value starts with the `x-amz-checksum-` prefix and ends with the algorithm name. The following `x-amz-trailer` values are currently supported:
- x-amz-checksum-crc32
- x-amz-checksum-crc32c
- x-amz-checksum-crc64nvme
- x-amz-checksum-sha1
- x-amz-checksum-sha256

Source: [AWS Trailing Checksum Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)
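
The relation between the `x-amz-trailer` header and the trailer line can be sketched as follows. Both helpers are hypothetical and only illustrate the format; as the Notice section of this document states, the server does not currently perform this kind of check.

```javascript
'use strict';

// the x-amz-trailer values listed above
const SUPPORTED_TRAILERS = [
    'x-amz-checksum-crc32',
    'x-amz-checksum-crc32c',
    'x-amz-checksum-crc64nvme',
    'x-amz-checksum-sha1',
    'x-amz-checksum-sha256',
];

// Hypothetical helper: split one trailer line (without \r\n) into name and value.
function parse_trailer(line) {
    // split on the first colon only, the base64 value may end with '='
    const colon = line.indexOf(':');
    if (colon <= 0) throw new Error(`invalid trailer: ${line}`);
    return {
        name: line.slice(0, colon).trim().toLowerCase(),
        value: line.slice(colon + 1).trim(),
    };
}

// Hypothetical check: the trailer name is the one announced in x-amz-trailer.
function trailer_matches(x_amz_trailer, trailer) {
    const announced = x_amz_trailer.trim().toLowerCase();
    return SUPPORTED_TRAILERS.includes(announced) && trailer.name === announced;
}

const trailer = parse_trailer('x-amz-checksum-crc32:uOMGCw==');
console.log(trailer); // { name: 'x-amz-checksum-crc32', value: 'uOMGCw==' }
console.log(trailer_matches('x-amz-checksum-crc32', trailer)); // true
console.log(trailer_matches('x-amz-checksum-sha1', trailer)); // false
```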

Trailers are the following lines in the example encoding structure:
```
<trailer>\r\n - optional trailer
<trailer>\r\n - optional trailer
```

## State Machine
The chunks arrive in buffers. Each buffer is parsed in a **loop**, to handle multiple chunks in the same buffer and to handle the case where the buffer ends in the middle of a chunk.

The `ChunkedContentDecoder` uses a state machine, which is updated according to the current state and the buffer content.
The states and their rules are:
1. **STATE_READ_CHUNK_HEADER** - read the chunk header until CR and parse it.
2. **STATE_WAIT_NL_HEADER** - wait for NL after the chunk header.
3. **STATE_SEND_DATA** - send chunk data to the stream until chunk size bytes sent.
4. **STATE_WAIT_CR_DATA** - wait for CR after the chunk data.
5. **STATE_WAIT_NL_DATA** - wait for NL after the chunk data.
6. **STATE_READ_TRAILER** - read optional trailer until CR and save it.
7. **STATE_WAIT_NL_TRAILER** - wait for NL after non empty trailer.
8. **STATE_WAIT_NL_END** - wait for NL after the last empty trailer.
9. **STATE_CONTENT_END** - the stream is done.
10. **STATE_ERROR** - an error occurred.

The following diagram describes the changes of the state machine:
![State Machine Diagram](https://github.com/user-attachments/assets/727faf34-887a-4ad8-814c-134585618d8b)

#### Dry run for example:
An updated AWS SDK client sends a `PutObject` request with the body "body for example".
On the server side we get the following buffers (shown as strings for readability):
1. "10\r\n" <- 10 hex in decimal is 16 (this is the length of the data "body for example")
2. "body for example" <- the data (we want to save as content and pipe it)
3. "\r\n0\r\n" <- CR NL of the data and "0\r\n" as completion chunk (end of data - the final object chunk)
4. "x-amz-checksum-crc32:uOMGCw==\r\n"
5. "\r\n"

In this example there are 5 calls to the parse function. The whole stream has 1 data chunk and its header, 1 trailer and no chunk-signature.
Although in this example the buffers happen to align with the parts of the encoding, this is not guaranteed: a chunk might be split across several buffers.
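
The dry run above can be reproduced with a simplified sketch of the state machine. This is not the code in `src/util/chunked_content_decoder.js`: the class name `SketchChunkedDecoder` and its fields are hypothetical, and the sketch assumes that the NL after the last chunk header (`0\r\n`) is consumed in `STATE_WAIT_NL_HEADER` before moving to `STATE_READ_TRAILER`.

```javascript
'use strict';
const { Transform } = require('stream');

const CR = 0x0d;
const NL = 0x0a;

// Sketch only: a simplified decoder that follows the states listed above.
class SketchChunkedDecoder extends Transform {
    constructor(options) {
        super(options);
        this.state = 'STATE_READ_CHUNK_HEADER';
        this.line = ''; // header or trailer text collected so far
        this.chunk_size = 0; // data bytes still expected in the current chunk
        this.trailers = [];
    }
    _transform(buf, encoding, callback) {
        try {
            this.parse(buf);
            callback();
        } catch (err) {
            this.state = 'STATE_ERROR';
            callback(err);
        }
    }
    _flush(callback) {
        const done = this.state === 'STATE_CONTENT_END';
        callback(done ? null : new Error(`stream ended in ${this.state}`));
    }
    parse(buf) {
        let i = 0;
        while (i < buf.length) {
            if (this.state === 'STATE_SEND_DATA') {
                // pass through as much of the chunk data as this buffer holds
                const n = Math.min(this.chunk_size, buf.length - i);
                this.push(buf.subarray(i, i + n));
                this.chunk_size -= n;
                i += n;
                if (this.chunk_size === 0) this.state = 'STATE_WAIT_CR_DATA';
            } else {
                this.parse_byte(buf[i]);
                i += 1;
            }
        }
    }
    expect(byte, expected, next_state) {
        if (byte !== expected) throw new Error(`unexpected byte in ${this.state}`);
        this.state = next_state;
    }
    parse_byte(byte) {
        switch (this.state) {
            case 'STATE_READ_CHUNK_HEADER': {
                if (byte !== CR) {
                    this.line += String.fromCharCode(byte);
                    break;
                }
                // the size comes before an optional extension such as ;chunk-signature=...
                const size_hex = this.line.split(';')[0].trim();
                if (!/^[0-9a-fA-F]+$/.test(size_hex)) throw new Error('bad chunk header');
                this.chunk_size = parseInt(size_hex, 16);
                this.line = '';
                this.state = 'STATE_WAIT_NL_HEADER';
                break;
            }
            case 'STATE_WAIT_NL_HEADER':
                // assumption of this sketch: size 0 (last chunk) leads to the trailers
                this.expect(byte, NL, this.chunk_size === 0 ? 'STATE_READ_TRAILER' : 'STATE_SEND_DATA');
                break;
            case 'STATE_WAIT_CR_DATA':
                this.expect(byte, CR, 'STATE_WAIT_NL_DATA');
                break;
            case 'STATE_WAIT_NL_DATA':
                this.expect(byte, NL, 'STATE_READ_CHUNK_HEADER');
                break;
            case 'STATE_READ_TRAILER':
                if (byte !== CR) {
                    this.line += String.fromCharCode(byte);
                } else if (this.line) {
                    this.trailers.push(this.line); // keep the non-empty trailer
                    this.line = '';
                    this.state = 'STATE_WAIT_NL_TRAILER';
                } else {
                    this.state = 'STATE_WAIT_NL_END'; // empty trailer ends the content
                }
                break;
            case 'STATE_WAIT_NL_TRAILER':
                this.expect(byte, NL, 'STATE_READ_TRAILER');
                break;
            case 'STATE_WAIT_NL_END':
                this.expect(byte, NL, 'STATE_CONTENT_END');
                break;
            default: // STATE_CONTENT_END or STATE_ERROR: no more bytes are accepted
                throw new Error(`unexpected data in ${this.state}`);
        }
    }
}

// the 5 buffers of the dry run
const decoder = new SketchChunkedDecoder();
const out = [];
decoder.on('data', data => out.push(data));
decoder.on('end', () => {
    console.log(Buffer.concat(out).toString()); // body for example
    console.log(decoder.trailers); // [ 'x-amz-checksum-crc32:uOMGCw==' ]
});
for (const s of ['10\r\n', 'body for example', '\r\n0\r\n', 'x-amz-checksum-crc32:uOMGCw==\r\n', '\r\n']) {
    decoder.write(s);
}
decoder.end();
```

Because the state is kept on the instance between calls, the same input gives the same result when it is split differently, for example one byte per buffer.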

### Notice
Currently, we haven’t implemented checksum validation on the server side, so if the request contains `x-amz-trailer: x-amz-checksum-crc32` and the trailing chunk has the header name `x-amz-checksum-sha1` (instead of `x-amz-checksum-crc32`), the request does not fail.
Example source: [AWS Trailer Chunks Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)

## Policy Change in AWS
In the past, AWS supported the data integrity check as an opt-in; it has since changed the check to be enabled by default.
Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)

Useful links to the announcement as posted in the AWS clients' repositories:
1. [AWS CLI](https://github.com/aws/aws-cli/issues/9214)
2. [AWS SDK JS V3](https://github.com/aws/aws-sdk-js-v3/issues/6810)
3. [AWS SDK GO V2](https://github.com/aws/aws-sdk-go-v2/discussions/2960)

AWS also published a full table of the SDKs and versions in which this change was implemented: [Compatibility with AWS SDKs](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html).
It was also announced in the [AWS blog](https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/).

#### Workaround
In the past, the state machine described above did not include states for trailers. If a request had a trailer in its body, the decoder would reach `STATE_ERROR`, because the previous state machine expected the body to end with:
```
0\r\n
\r\n
```
When using an updated AWS SDK client directly against a NooBaa version that predates the mentioned change, please run the AWS client (CLI / SDK) with the following environment variables:
```
AWS_RESPONSE_CHECKSUM_VALIDATION=WHEN_REQUIRED
AWS_REQUEST_CHECKSUM_CALCULATION=WHEN_REQUIRED
```
Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)

## Code References:
### Files:
- `src/util/chunked_content_decoder.js` - the class `ChunkedContentDecoder`
- `src/test/unit_tests/jest_tests/test_chunked_content_decoder.test.js` - unit test of the class, please run with `npx jest test_chunked_content_decoder.test.js`
- `docs/design/uml/ChunkedContentDecoder_State_Machine.md` - the source of the diagram (kept as a file and not only as a link, in case we need to modify it).

### Related PRs
1. https://github.com/noobaa/noobaa-core/pull/5397 - created the stream transformer `ChunkedContentDecoder` (the original state machine, built with the states that handle the optional extension of aws-chunked).
2. https://github.com/noobaa/noobaa-core/pull/8753 - mainly added the trailers to the `ChunkedContentDecoder` state machine.
29 changes: 29 additions & 0 deletions docs/design/uml/ChunkedContentDecoder_State_Machine.md
@@ -0,0 +1,29 @@
```mermaid
---
config:
  theme: dark
  look: classic
  layout: elk
---
flowchart TD
A -- not CR, append string --> A["STATE_READ_CHUNK_HEADER <br> read the chunk header until CR and parse it"]
A -. CR, header parse problem .-> J["STATE_ERROR <br> an error occurred"]
A -- "CR, chunk_size!=0" --> B["STATE_WAIT_NL_HEADER <br> wait for NL after the chunk header"]
A -- "CR, chunk_size==0" --> F["STATE_READ_TRAILER <br> read optional trailer until CR and save it"]
C -- data --> C["STATE_SEND_DATA <br> send chunk data to the stream until chunk size bytes sent"]
C -- done size bytes --> D["STATE_WAIT_CR_DATA <br> wait for CR after the chunk data"]
F -- not CR, append string --> F
F -- CR, keep trailer --> G["STATE_WAIT_NL_TRAILER <br> wait for NL after non empty trailer"]
F -- CR, empty trailer --> H["STATE_WAIT_NL_END <br> wait for NL after the last empty trailer"]
D -- CR --> E["STATE_WAIT_NL_DATA <br> wait for NL after the chunk data"]
H -- NL --> I["STATE_CONTENT_END <br> the stream is done"]
B -- NL --> C
E -- NL --> A
G -- NL --> F
B -. not NL .-> J
E -. not NL .-> J
G -. not NL .-> J
H -. not NL .-> J
D -. not CR .-> J
I -. any .-> J
```
