docs: update dryrun docs #1388

251 changes: 186 additions & 65 deletions docs/user-guide/logs/manage-pipelines.md

First, please refer to the [Quick Start example](/user-guide/logs/quick-start.md#write-logs-by-pipeline) to see the correct execution of the Pipeline.

### Debug Pipeline Configuration

Because the Pipeline configuration and the data to be processed must be uploaded together, and the two use different formats, debugging from the command line can be cumbersome. For that reason, the Dashboard provides a graphical debugging interface in its log view. If you still want to debug from the command line, you can use the following script:

**This script only tests the processing results of the Pipeline; it does not write any logs into GreptimeDB.**

```bash
#!/bin/bash

PIPELINE_TYPE=$1
P1=$2
P2=$3
P3=$4
HOST="http://localhost:4000"

function usage {
  echo "Usage: $0 <pipeline_type> <param1> <param2> [param3]"
  echo "pipeline_type: 'name' or 'content'"
  echo "  - 'name': requires param1 (pipeline_name), param2 (data), and optionally param3 (pipeline_version)"
  echo "  - 'content': requires param1 (pipeline_content_file) and param2 (data)"
  exit 1
}

function test_pipeline_name {
  local PIPELINE_NAME=$P1
  local DATA="$P2"
  local PIPELINE_VERSION="$P3"
  if [[ -z "$PIPELINE_VERSION" ]]; then
    PIPELINE_VERSION="null"
  else
    PIPELINE_VERSION="\"$PIPELINE_VERSION\""
  fi

  local BODY=$(jq -nc --argjson data "$DATA" --arg pipeline_name "$PIPELINE_NAME" --argjson pipeline_version "$PIPELINE_VERSION" '{"pipeline_name": $pipeline_name, "pipeline_version": $pipeline_version, "data": $data}')
  local result=$(curl --silent -X POST -H "Content-Type: application/json" "$HOST/v1/events/pipelines/dryrun" --data "$BODY")
  echo "$result" | jq
}

function test_pipeline_content {
  local PIPELINE_CONTENT=$(cat "$P1")
  local DATA="$P2"
  local BODY=$(jq -nc --arg pipeline_content "$PIPELINE_CONTENT" --argjson data "$DATA" '{"pipeline": $pipeline_content, "data": $data}')
  local result=$(curl --silent -X POST -H "Content-Type: application/json" "$HOST/v1/events/pipelines/dryrun" --data "$BODY")
  echo "$result" | jq
}

if [[ -z "$PIPELINE_TYPE" || -z "$P1" || -z "$P2" ]]; then
  usage
fi

case $PIPELINE_TYPE in
  "name")
    test_pipeline_name
    ;;
  "content")
    test_pipeline_content
    ;;
  *)
    usage
    ;;
esac
```

We will use this script for the debugging below. First, save it as `test_pipeline.sh`, then run `chmod +x test_pipeline.sh` to make it executable. The script depends on `curl` and `jq`, so make sure both tools are installed. If you need to change the endpoint, set the `HOST` variable in the script to the address of your GreptimeDB instance.
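As a quick sanity check, you can run the same `jq` invocation the script's `name` mode uses to build the dryrun request body, without any running GreptimeDB instance; the pipeline name and data below are just placeholders:

```shell
# Build the dryrun request body exactly as test_pipeline.sh does in "name" mode.
# --argjson parses its value as JSON, so the data array and the null version
# are embedded as real JSON values rather than quoted strings.
DATA='[{"message": "1998.08", "time": "2024-05-25 20:16:37.217"}]'
jq -nc --argjson data "$DATA" \
  --arg pipeline_name "test_pipeline" \
  --argjson pipeline_version 'null' \
  '{"pipeline_name": $pipeline_name, "pipeline_version": $pipeline_version, "data": $data}'
```

Printing the body this way makes it easy to spot malformed `data` arguments before they ever reach the server.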

#### Debug Pipeline

Suppose we need to replace the `.` in the `message` field of the following data with `-`, and parse the `time` field as a timestamp.

```json
{
  "message": "1998.08",
  "time": "2024-05-25 20:16:37.217"
}
```

So we wrote the following Pipeline configuration file and saved it as `pipeline.yaml`:

```yaml
processors:
  - date:
      field: time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
  - gsub:
      fields:
        - message
      pattern: "\\."
      replacement:
        - "-"
      ignore_missing: true

transform:
  - fields:
      - message
    type: string
  - field: time
    type: time
    index: timestamp
```

Then use the `test_pipeline.sh` script to test it:

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": 1998.08,"time":"2024-05-25 20:16:37.217"}]'
```

The output is as follows:

```json
{
  "error": "Failed to build pipeline: Field replacement must be a string"
}
```

According to the error message, we need to modify the configuration of the `gsub` processor and change the value of the `replacement` field to a string type.

```yaml
processors:
  - date:
      field: time
      formats:
        - "%Y-%m-%d %H:%M:%S%.3f"
      ignore_missing: true
  - gsub:
      fields:
        - message
      pattern: "\\."
      # Change to string type
      replacement: "-"
      ignore_missing: true

transform:
  - fields:
      - message
    type: string
  - field: time
    type: time
    index: timestamp
```

Now test the Pipeline again:

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": 1998.08,"time":"2024-05-25 20:16:37.217"}]'
```

The output is as follows:

```json
{
  "error": "Failed to exec pipeline: Processor gsub: expect string value, but got Float64(1998.08)"
}
```

According to the error message, we need to fix the format of the test data and change the value of the `message` field to a string type.

```bash
./test_pipeline.sh content pipeline.yaml '[{"message": "1998.08","time":"2024-05-25 20:16:37.217"}]'
```

At this point, the Pipeline processing is successful, and the output is as follows:

```json
{
  "rows": [
    [
      {
        "data_type": "STRING",
        "key": "message",
        "semantic_type": "FIELD",
        "value": "1998-08"
      },
      {
        "data_type": "TIMESTAMP_NANOSECOND",
        "key": "time",
        "semantic_type": "TIMESTAMP",
        "value": "2024-05-25 20:16:37.217+0000"
      }
    ]
  ],
  "schema": [
    {
      "colume_type": "FIELD",
      "data_type": "STRING",
      "fulltext": false,
      "name": "message"
    },
    {
      "colume_type": "TIMESTAMP",
      "data_type": "TIMESTAMP_NANOSECOND",
      "fulltext": false,
      "name": "time"
    }
  ]
}
```

You can see that the `.` in the string `1998.08` has been replaced with `-`, which means the Pipeline processed the data successfully.
#### Debug Pipeline API Information

**This API is only for testing the processing results of the Pipeline and will not write logs into GreptimeDB.**

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d 'data here'
```

The request body is a JSON object containing the fields `pipeline`, `pipeline_name`, `pipeline_version`, and `data`.
`pipeline` and `pipeline_name` are mutually exclusive ways to specify the Pipeline; when both are present, the `pipeline` field takes precedence.
`pipeline_name` refers to a Pipeline already stored in GreptimeDB, and `pipeline_version` (optional) selects a specific version of it.

* `pipeline`: String, the configuration content of the Pipeline.
* `pipeline_name`: String, the name of a Pipeline stored in GreptimeDB.
* `pipeline_version`: String, the version of the Pipeline; only valid together with `pipeline_name`.
* `data`: A JSON array; each element is a JSON object or a string, the data to be processed.

For example, using the debug Pipeline configuration from `pipeline.yaml` above with the same test data, the curl command is:

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d '{
  "pipeline": "processors:\n  - date:\n      field: time\n      formats:\n        - \"%Y-%m-%d %H:%M:%S%.3f\"\n      ignore_missing: true\n  - gsub:\n      fields:\n        - message\n      pattern: \"\\\\.\"\n      # Change to string type\n      replacement: \"-\"\n      ignore_missing: true\n\ntransform:\n  - fields:\n      - message\n    type: string\n  - field: time\n    type: time\n    index: timestamp",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}'
```
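Hand-escaping a multi-line YAML document into the `pipeline` string like this is error-prone. As a sketch, `jq --arg` can do the JSON escaping for you (it depends on `jq`, which the debugging script already requires); the shortened pipeline below is only an illustration, not the full `pipeline.yaml`:

```shell
# --arg JSON-escapes the raw YAML (newlines become \n) so it can be sent
# as the "pipeline" field of the dryrun request body.
PIPELINE_YAML='processors:
  - gsub:
      fields:
        - message
      pattern: "\\."
      replacement: "-"
      ignore_missing: true'

jq -nc --arg pipeline "$PIPELINE_YAML" \
  --argjson data '[{"message": "1998.08", "time": "2024-05-25 20:16:37.217"}]' \
  '{"pipeline": $pipeline, "data": $data}'
```

Piping the result straight into `curl --data @-` avoids copy-pasting the escaped string by hand.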

If we use the Pipeline name `test_pipeline` and do not provide the `pipeline_version` parameter, the curl command is:

```bash
curl -X "POST" "http://localhost:4000/v1/events/pipelines/dryrun" \
     -H 'Content-Type: application/json' \
     -d '{
  "pipeline_name": "test_pipeline",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}'
```

This will use the latest version of the Pipeline named `test_pipeline` to process the data.
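To pin a specific version instead, add `pipeline_version` to the body. The version string below is a hypothetical example; use the actual creation timestamp reported for your Pipeline:

```json
{
  "pipeline_name": "test_pipeline",
  "pipeline_version": "2024-06-27 12:02:34.257312110",
  "data": [
    {
      "message": "1998.08",
      "time": "2024-05-25 20:16:37.217"
    }
  ]
}
```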