From 04d2283c9b77c7f9a5edaff6c06f44a289666354 Mon Sep 17 00:00:00 2001 From: Alexey Milovidov Date: Sun, 16 Jun 2024 06:44:16 +0200 Subject: [PATCH] Add S3 Select --- README.md | 2 +- brytlytdb/README.md | 2 ++ s3select/README.md | 59 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+), 1 deletion(-) create mode 100644 s3select/README.md diff --git a/README.md b/README.md index b64efad7f..6a079a673 100644 --- a/README.md +++ b/README.md @@ -269,7 +269,7 @@ We also introduced the [Hardware Benchmark](https://benchmark.clickhouse.com/har - [ ] Apache Drill - [ ] Apache Kudu - [ ] Apache Kylin -- [ ] S3 select command in AWS +- [x] S3 select command in AWS - [x] Kinetica - [ ] YDB - [ ] OceanBase diff --git a/brytlytdb/README.md b/brytlytdb/README.md index 4d7f8dd55..9ac1f88d8 100644 --- a/brytlytdb/README.md +++ b/brytlytdb/README.md @@ -3,3 +3,5 @@ An attempt to use their service resulted in a failure. It showed "Error: cannot (Update after 4 months) It did not happen. (Update after 8 months) It did not happen. + +(Update after 2 years) It did not happen. diff --git a/s3select/README.md b/s3select/README.md new file mode 100644 index 000000000..8bec70361 --- /dev/null +++ b/s3select/README.md @@ -0,0 +1,59 @@ +## Introduction + +AWS S3 has support for `SelectObjectContent` method, which allows to run SQL queries directly on S3 objects, if they contain data in CSV, JSON or Parquet formats. + +Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference-select.html + +Unfortunately, the language is very primitive. It does not support ORDER BY or GROUP BY. Only filtering by WHERE with a primitive set of functions and the LIMIT clause. + +That's why it cannot complete ClickBench. + +The performance is atrocious, and the usability is dubious. It is pointless to use even if you want to pre-filter data by some conditions before further processing. + +## Comparison + +AWS S3 Select: + +``` +time aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, SearchPhrase FROM S3Object WHERE SearchPhrase LIKE '%трешбоксарский%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' /dev/stdout +1634,прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель + +real 0m33.796s +user 0m0.842s +sys 0m0.091s +``` + +ClickHouse: + +``` +time ch -q "SELECT CounterID, SearchPhrase FROM s3('s3://clickhouse-public-datasets/hits_compatible/hits.parquet') WHERE SearchPhrase LIKE '%трешбоксарский%'" +1634 прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель + +real 0m3.526s +user 0m7.248s +sys 0m1.314s +``` + +We can see that ClickHouse is ten times faster despite the need for client-side processing. + +## Caveats + +Some invalid queries just hang instead of returning an error: + +``` +aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object WHERE SearchPhrase LIKE '%test%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' - +``` + +When they do return an error, the error message is below reasonable: + +``` +aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object GROUP BY CounterID ORDER BY count(*) DESC LIMIT 10" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' - + +An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found KEYWORD:UNKNOWN at line 1, column 61. +``` + +## Alternatives + +You can use ClickHouse in AWS Lambda: https://github.com/aws-samples/aws-lambda-clickhouse + +This project is made by AWS engineers.