docs: update dryrun docs #1388

251 changes: 186 additions & 65 deletions docs/user-guide/logs/manage-pipelines.md

First, please refer to the [Quick Start example](/user-guide/logs/quick-start.md#write-logs-by-pipeline) to see the correct execution of the Pipeline.

### Debug Pipeline Configuration

Because the Pipeline configuration and the data to be processed must be uploaded together, and the two use different formats, debugging from the command line can be cumbersome. For that reason, the Dashboard provides a graphical debugging interface in its log view. If you still want to debug from the command line, you can use the following script:

**This script only tests the processing results of the Pipeline; it does not write any logs into GreptimeDB.**

```bash
#!/bin/bash

PIPELINE_TYPE=$1
P1=$2
P2=$3
P3=$4
HOST="http://localhost:4000"

function usage {
  echo "Usage: $0 <pipeline_type> <param1> <param2> [param3]"
  echo "pipeline_type: 'name' or 'content'"
  echo "  - 'name': requires param1 (pipeline_name), param2 (data), and optionally param3 (pipeline_version)"
  echo "  - 'content': requires param1 (pipeline_content_file) and param2 (data)"
  exit 1
}

function test_pipeline_name {
  local PIPELINE_NAME=$P1
  local DATA="$P2"
  local PIPELINE_VERSION="$P3"
  if [[ -z "$PIPELINE_VERSION" ]]; then
    PIPELINE_VERSION="null"
  else
    PIPELINE_VERSION="\"$PIPELINE_VERSION\""
  fi

  local BODY=$(jq -nc --argjson data "$DATA" --arg pipeline_name "$PIPELINE_NAME" --argjson pipeline_version "$PIPELINE_VERSION" '{"pipeline_name": $pipeline_name, "pipeline_version": $pipeline_version, "data": $data}')
  local result=$(curl --silent -X POST -H "Content-Type: application/json" "$HOST/v1/events/pipelines/dryrun" --data "$BODY")
  echo "$result" | jq
}

function test_pipeline_content {
  local PIPELINE_CONTENT=$(cat "$P1")
  local DATA="$P2"
  local BODY=$(jq -nc --arg pipeline_content "$PIPELINE_CONTENT" --argjson data "$DATA" '{"pipeline": $pipeline_content, "data": $data}')
  local result=$(curl --silent -X POST -H "Content-Type: application/json" "$HOST/v1/events/pipelines/dryrun" --data "$BODY")
  echo "$result" | jq
}

if [[ -z "$PIPELINE_TYPE" || -z "$P1" || -z "$P2" ]]; then
  usage
fi

case $PIPELINE_TYPE in
  "name")
    test_pipeline_name
    ;;
  "content")
    test_pipeline_content
    ;;
  *)
    usage
    ;;
esac
```

We will use this script for the debugging below. First, save it as `test_pipeline.sh`, then run `chmod +x test_pipeline.sh` to make it executable. The script depends on `curl` and `jq`, so make sure both tools are installed. If you need to change the endpoint, set the `HOST` variable in the script to the address of your GreptimeDB instance.
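As a quick sanity check, you can run the same `jq` invocation the script's `name` mode uses to build the dryrun request body, without any running GreptimeDB instance; the pipeline name and data below are just placeholders:

```shell
# Build the dryrun request body exactly as test_pipeline.sh does in "name" mode.
# --argjson parses its value as JSON, so the data array and the null version
# are embedded as real JSON values rather than quoted strings.
DATA='[{"message": "1998.08", "time": "2024-05-25 20:16:37.217"}]'
jq -nc --argjson data "$DATA" \
  --arg pipeline_name "test_pipeline" \
  --argjson pipeline_version 'null' \
  '{"pipeline_name": $pipeline_name, "pipeline_version": $pipeline_version, "data": $data}'
```

Printing the body this way makes it easy to spot malformed `data` arguments before they ever reach the server.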

#### Debug Pipeline

Suppose we need to replace the `.` in the `message` field of the following data with `-`, and parse the `time` field as a timestamp.

```json
{
  "message": "1998.08",
  "time": "2024-05-25 20:16:37.217"
}
```

So we wrote the following Pipeline configuration file and saved it as `pipeline.yaml`:

```yaml
processors:
  - date:
      field: time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
  - gsub:
      fields:
        - message
      pattern: "\\."
      replacement:
        - "-"
      ignore_missing: true

transform:
  - fields:
      - message
    type: string
  - field: time
    type: time
    index: timestamp
```

Then use the `test_pipeline.sh` script to test it:

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": 1998.08,"time":"2024-05-25 20:16:37.217"}]'
```

The output is as follows:

```json
{
  "error": "Failed to build pipeline: Field replacement must be a string"
}
```

According to the error message, we need to modify the configuration of the `gsub` processor and change the value of the `replacement` field to a string type.

```yaml
processors:
  - date:
      field: time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
  - gsub:
      fields:
        - message
      pattern: "\\."
      # Change to string type
      replacement: "-"
      ignore_missing: true

transform:
  - fields:
      - message
    type: string
  - field: time
    type: time
    index: timestamp
```

Now test the Pipeline again:

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": 1998.08,"time":"2024-05-25 20:16:37.217"}]'
```

The output is as follows:

```json
{
  "error": "Failed to exec pipeline: Processor gsub: expect string value, but got Float64(1998.08)"
}
```

According to the error message, we need to fix the format of the test data and change the value of the `message` field to a string type.

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": "1998.08","time":"2024-05-25 20:16:37.217"}]'
```

At this point, the Pipeline processing is successful, and the output is as follows:

```json
{
  "rows": [
    [
      {
        "data_type": "STRING",
        "key": "message",
        "semantic_type": "FIELD",
        "value": "1998-08"
      },
      {
        "data_type": "TIMESTAMP_NANOSECOND",
        "key": "time",
        "semantic_type": "TIMESTAMP",
        "value": "2024-05-25 20:16:37.217+0000"
      }
    ]
  ],
  "schema": [
    {
      "colume_type": "FIELD",
      "data_type": "STRING",
      "fulltext": false,
      "name": "message"
    },
    {
      "colume_type": "TIMESTAMP",
      "data_type": "TIMESTAMP_NANOSECOND",
      "fulltext": false,
      "name": "time"
    }
  ]
}
```

You can see that the `.` in the string `1998.08` has been replaced with `-`, which means the Pipeline processed the data successfully.
#### Debug Pipeline API Information

**This API is only for testing the processing results of the Pipeline and will not write logs into GreptimeDB.**

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d 'data here'
```

The request body is a JSON object containing the fields `pipeline`, `pipeline_name`, `pipeline_version`, and `data`.
`pipeline` and `pipeline_name` are mutually exclusive ways to specify the Pipeline; when both are present, the `pipeline` field takes precedence.
`pipeline_name` refers to a Pipeline already stored in GreptimeDB, and `pipeline_version` (optional) selects a specific version of it.

* `pipeline`: String, the configuration content of the Pipeline.
* `pipeline_name`: String, the name of a Pipeline stored in GreptimeDB.
* `pipeline_version`: String, the version of the Pipeline; only valid together with `pipeline_name`.
* `data`: A JSON array; each element is a JSON object or a string, the data to be processed.

For example, using the debug Pipeline configuration from `pipeline.yaml` above with the same test data, the curl command is:

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d '{
  "pipeline": "processors:\n  - date:\n      field: time\n      formats:\n        - \"%Y-%m-%d %H:%M:%S%.3f\"\n      ignore_missing: true\n  - gsub:\n      fields:\n        - message\n      pattern: \"\\\\.\"\n      # Change to string type\n      replacement: \"-\"\n      ignore_missing: true\n\ntransform:\n  - fields:\n      - message\n    type: string\n  - field: time\n    type: time\n    index: timestamp",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}'
```
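Hand-escaping a multi-line YAML document into the `pipeline` string like this is error-prone. As a sketch, `jq --arg` can do the JSON escaping for you (it depends on `jq`, which the debugging script already requires); the shortened pipeline below is only an illustration, not the full `pipeline.yaml`:

```shell
# --arg JSON-escapes the raw YAML (newlines become \n) so it can be sent
# as the "pipeline" field of the dryrun request body.
PIPELINE_YAML='processors:
  - gsub:
      fields:
        - message
      pattern: "\\."
      replacement: "-"
      ignore_missing: true'

jq -nc --arg pipeline "$PIPELINE_YAML" \
  --argjson data '[{"message": "1998.08", "time": "2024-05-25 20:16:37.217"}]' \
  '{"pipeline": $pipeline, "data": $data}'
```

Piping the result straight into `curl --data @-` avoids copy-pasting the escaped string by hand.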

If we use the Pipeline name `test_pipeline` and do not provide the `pipeline_version` parameter, the curl command is:

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d '{
  "pipeline_name": "test_pipeline",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}'
```

This will use the latest version of the Pipeline named `test_pipeline` to process the data.
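To pin a specific version instead, add `pipeline_version` to the body. The version string below is a hypothetical example; use the actual creation timestamp reported for your Pipeline:

```json
{
  "pipeline_name": "test_pipeline",
  "pipeline_version": "2024-06-27 12:02:34.257312110",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}
```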