Merge pull request #8778 from shirady/chunked-content-decoder-docs

Docs | Add `ChunkedContentDecoder` Documentation with State Machine Diagram
noobaa · Feb 10, 2025 · b89f7bc · b89f7bc
2 parents 8668b3c + ea9d1e8
commit b89f7bc
Show file tree

Hide file tree

Showing 2 changed files with 203 additions and 0 deletions.
diff --git a/docs/design/ChunkedContentDecoder.md b/docs/design/ChunkedContentDecoder.md
@@ -0,0 +1,174 @@
+# Chunked Content Decoder
+
+HTTP Chunked Encoding is a streaming data transfer mechanism that breaks down the data stream into a series of non-overlapping segments called "chunks".  
+Each chunk is sent with its own size header, which tells the receiver how much data to expect in that chunk.  
+To indicate that the data is being sent in chunks a header of `Transfer-Encoding: chunked` is included.  
+Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)
+
+The `ChunkedContentDecoder` class is a [Transform stream](https://nodejs.org/api/stream.html#class-streamtransform), which takes a stream that it received as chunks and streams only the data - it removes the size of the data, chunk headers of chunk-signature (optional extension), trailers, etc.
+
+## Basic Encoding Structure:
+### Chunks (Without Optional Extension and Trailers)
+Each chunk consists of two parts: 
+1. a header
+2. the actual data.
+The header is a hexadecimal number that indicates the size of the chunk in bytes, followed by a carriage return (CR) and a line feed (NL).  
+The data that follows this header is exactly the size specified in the header.  
+After the data, another carriage return and line feed signify the end of the chunk.  
+Source: [HTTP Chunked Encoding](https://www.ioriver.io/terms/http-chunked-encoding)
+
+```
+<hex bytes of data>\r\n
+<data>
+...
+the end of the chunk:
+0\r\n
+\r\n
+```
+
+Example:
+```
+7\r\n
+Mozilla\r\n
+11\r\n
+Developer Network\r\n
+0\r\n
+\r\n
+```
+Source of the example: [Mozilla Transfer-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding)
+
+
+### Chunks With Optional Extension and Trailers
+(combined with example)
+```
+1fff;chunk-signature=1a2b\r\n   - chunk header (optional extension)  
+<1fff bytes of data>\r\n        - chunk data  
+2fff;chunk-signature=1a2b\r\n   - chunk header (optional extension)  
+<2fff bytes of data>\r\n        - chunk data  
+0\r\n                           - last chunk  
+<trailer>\r\n                   - optional trailer  
+<trailer>\r\n                   - optional trailer  
+\r\n                            - end of content  
+```
+Notes:
+- `1fff` and `2fff` are examples of the size in hex
+- trailer example: `x-amz-checksum-crc32:uOMGCw==\r\n` (key - the algorithm `crc32`, value in base64 and `\r\n` as CR NL ending of the trailer)
+
+More info in [Wikipedia](https://en.wikipedia.org/wiki/Chunked_transfer_encoding)
+And also in [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1)
+
+### Chunk Extension (chunk-signature)
+In HTTP there is an option for adding chunk extensions, immediately following the chunk size.
+```
+chunk-ext      = *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
+```
+Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.1)
+
+You can see in AWS SigV4 that when AWS uses the chunk extension as a chunk signature it uses it with the following structure:
+```
+string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n                    
+```
+Source: [AWS Documentation Defining the Chunk Body](https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html)
+
+Chunk Extension are the following lines in the example encoding structure:
+```
+2fff;chunk-signature=1a2b\r\n   - chunk header (optional extension)  
+<2fff bytes of data>\r\n        - chunk data  
+```
+
+
+### Trailers
+In HTTP there is a option for a trailer. A trailer allows the sender to include additional fields at the end of a chunked message in order to supply metadata that might be dynamically generated while the message body is sent.  
+Source: [RFC 7230](https://www.rfc-editor.org/rfc/rfc7230#section-4.1.2)
+
+
+The name of the trailing is passed as a header and the trailer (key-value) passed after the chunked body. There can be added 0 or more trailers headers in a HTTP body.  
+Source: [Stack OverFlow](https://stackoverflow.com/questions/5590791/http-chunked-encoding-need-an-example-of-trailer-mentioned-in-spec)
+
+Amazon S3 supports chunked uploads that use `aws-chunked` content encoding for `PutObject` and `UploadPart` requests with trailing checksums.  
+
+When a request has the header `x-amz-trailer` it indicates the name of the trailing header in the request. If trailing checksums exist the `x-amz-trailer` header value includes the `x-amz-checksum-` prefix and ends with the algorithm name. The following `x-amz-trailer` values are currently supported:
+- x-amz-checksum-crc32
+- x-amz-checksum-crc32c
+- x-amz-checksum-crc64nvme
+- x-amz-checksum-sha1
+- x-amz-checksum-sha256
+
+Source: [AWS Trailing Checksum Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)
+
+Trailers are the following lines in the example encoding structure:
+```
+<trailer>\r\n                   - optional trailer  
+<trailer>\r\n                   - optional trailer  
+```
+
+## State Machine
+The chunks are passing in the buffer and the buffer is parsed in a **loop** to handle multiple chunks in the same buffer, and to handle the case where the buffer ends in the middle of a chunk.  
+
+The `ChunkedContentDecoder` is using a state machine:
+The state machine is updated according to the current state and the buffer content.
+The state machine is updated by the following rules:
+1. **STATE_READ_CHUNK_HEADER** - read the chunk header until CR and parse it.
+2. **STATE_WAIT_NL_HEADER** - wait for NL after the chunk header.
+3. **STATE_SEND_DATA** - send chunk data to the stream until chunk size bytes sent.
+4. **STATE_WAIT_CR_DATA** - wait for CR after the chunk data.
+5. **STATE_WAIT_NL_DATA** - wait for NL after the chunk data.
+6. **STATE_READ_TRAILER** - read optional trailer until CR and save it.
+7. **STATE_WAIT_NL_TRAILER** - wait for NL after non empty trailer.
+8. **STATE_WAIT_NL_END** - wait for NL after the last empty trailer.
+9. **STATE_CONTENT_END** - the stream is done.
+10. **STATE_ERROR** - an error occurred.
+
+The following diagram describes the changes of the state machine:
+![State Machine Diagram](https://github.com/user-attachments/assets/727faf34-887a-4ad8-814c-134585618d8b)
+
+#### Dry run for example:
+An updated AWS SDK client operate `PutObject` with body: "body for example".  
+On the server side we get the following buffers (showing as strings for readability):
+1. "10\r\n" <- 10 hex in decimal is 16 (this is the length of the data "body for example")
+2. "body for example" <- the data (we want to save as content and pipe it)
+3. "\r\n0\r\n" <- CR NL of the data and "0\r\n" as completion chunk (end of data - the final object chunk)
+4. "x-amz-checksum-crc32:uOMGCw==\r\n"
+5. "\r\n"
+
+In this example there are 5 calls to the parse, the whole stream has 1 data chunk and its header, 1 trailer and no chunk-signature.  
+Although in this example the buffer includes the chunks inside, it doesn’t have to be like that a chunk might be split into a couple of buffers.
+
+### Notice
+Currently, we haven’t implemented the checksum on the server side, so if the request contains `x-amz-trailer: x-amz-checksum-crc32` and the railing chunk has the header name `x-amz-checksum-sha1` (instead of `x-amz-checksum-crc32`) this request would not fail.  
+Example source: [AWS Trailer Chunks Documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html#trailing-checksums)
+
+## Policy Change in AWS
+In the past AWS supported data integrity check as an opt-in, and changed it to be by default.  
+Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)
+
+Useful link about the posted message in AWS Clients:
+1. [AWS CLI](https://github.com/aws/aws-cli/issues/9214)
+2. [AWS SDK JS V3](https://github.com/aws/aws-sdk-js-v3/issues/6810)
+3. [AWS SDK GO V2](https://github.com/aws/aws-sdk-go-v2/discussions/2960)  
+
+They also posted a full table where this change was implemented by SDK and version [Compatibility with AWS SDKs](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)  
+It was also announced in [AWS blog](https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/)
+
+#### WorkAround
+In the past the mentioned state machine did not include states for trailers, if a request had a trailer in its body it would get to `STATE_ERROR` as the previous state machine expected the body to end with:
+```
+0\r\n
+\r\n
+```
+When using an updated AWS SDK Client directly against NooBaa before the mentioned change, please run the AWS client (CLI / SDK) with the  following environment variables:
+```
+AWS_RESPONSE_CHECKSUM_VALIDATION=WHEN_REQUIRED
+AWS_REQUEST_CHECKSUM_CALCULATION=WHEN_REQUIRED
+```
+Source: [Data Integrity Protections for Amazon S3](https://docs.aws.amazon.com/sdkref/latest/guide/feature-dataintegrity.html)
+
+## Code References:
+### Files:
+- `src/util/chunked_content_decoder.js` - the class `ChunkedContentDecoder`
+- `src/test/unit_tests/jest_tests/test_chunked_content_decoder.test.js` - unit test of the class, please run with `npx jest test_chunked_content_decoder.test.js`
+- `ChunkedContentDecoder_State_Machine.md` - the diagram (not as link, in case we need to modify it).
+
+### Related PRs
+1. https://github.com/noobaa/noobaa-core/pull/5397 which created the stream transformer `ChunkedContentDecoder` (the original state machine - was build with the states that handled the optional extension of aws-chunk)
+2. https://github.com/noobaa/noobaa-core/pull/8753 - mainly added the trailers to the `ChunkedContentDecoder` state machine.
diff --git a/docs/design/uml/ChunkedContentDecoder_State_Machine.md b/docs/design/uml/ChunkedContentDecoder_State_Machine.md
@@ -0,0 +1,29 @@
+```mermaid
+---
+config:
+  theme: dark
+  look: classic
+  layout: elk
+---
+flowchart TD
+    A -- not CR, append string --> A["STATE_READ_CHUNK_HEADER <br> read the chunk header until CR and parse it"]
+    A -. CR, header parse problem .-> J["STATE_ERROR <br> an error occurred"]
+    A -- "CR, chunk_size!=0" --> B["STATE_WAIT_NL_HEADER <br> wait for NL after the chunk header"]
+    A -- "CR, chunk_size==0" --> F["STATE_READ_TRAILER <br> read optional trailer until CR and save it"]
+    C -- data --> C["STATE_SEND_DATA <br> send chunk data to the stream until chunk size bytes sent"]
+    C -- done size bytes --> D["STATE_WAIT_CR_DATA <br> wait for CR after the chunk data"]
+    F -- not CR, append string --> F
+    F -- CR, keep trailer --> G["STATE_WAIT_NL_TRAILER <br> wait for NL after non empty trailer"]
+    F -- CR, empty trailer --> H["STATE_WAIT_NL_END <br> wait for NL after the last empty trailer"]
+    D -- CR --> E["STATE_WAIT_NL_DATA <br> wait for NL after the chunk data"]
+    H -- NL --> I["STATE_CONTENT_END <br> the stream is done"]
+    B -- NL --> C
+    E -- NL --> A
+    G -- NL --> F
+    B -. not NL .-> J
+    E -. not NL .-> J
+    G -. not NL .-> J
+    H -. not NL .-> J
+    D -. not CR .-> J
+    I -. any .-> J
+```