-
Notifications
You must be signed in to change notification settings - Fork 174
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #200 from ClickHouse/s3select
Add S3 Select
- Loading branch information
Showing
3 changed files
with
62 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
## Introduction | ||
|
||
AWS S3 has support for `SelectObjectContent` method, which allows to run SQL queries directly on S3 objects, if they contain data in CSV, JSON or Parquet formats. | ||
|
||
Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference-select.html | ||
|
||
Unfortunately, the language is very primitive. It does not support ORDER BY or GROUP BY. Only filtering by WHERE with a primitive set of functions and the LIMIT clause. | ||
|
||
That's why it cannot complete ClickBench. | ||
|
||
The performance is atrocious, and the usability is dubious. It is pointless to use even if you want to pre-filter data by some conditions before further processing. | ||
|
||
## Comparison | ||
|
||
AWS S3 Select: | ||
|
||
``` | ||
time aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, SearchPhrase FROM S3Object WHERE SearchPhrase LIKE '%трешбоксарский%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' /dev/stdout | ||
1634,прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель | ||
real 0m33.796s | ||
user 0m0.842s | ||
sys 0m0.091s | ||
``` | ||
|
||
ClickHouse: | ||
|
||
``` | ||
time ch -q "SELECT CounterID, SearchPhrase FROM s3('s3://clickhouse-public-datasets/hits_compatible/hits.parquet') WHERE SearchPhrase LIKE '%трешбоксарский%'" | ||
1634 прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель | ||
real 0m3.526s | ||
user 0m7.248s | ||
sys 0m1.314s | ||
``` | ||
|
||
We can see that ClickHouse is ten times faster despite the need for client-side processing. | ||
|
||
## Caveats | ||
|
||
Some invalid queries just hang instead of returning an error: | ||
|
||
``` | ||
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object WHERE SearchPhrase LIKE '%test%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' - | ||
``` | ||
|
||
When they do return an error, the error message is below reasonable: | ||
|
||
``` | ||
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object GROUP BY CounterID ORDER BY count(*) DESC LIMIT 10" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' - | ||
An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found KEYWORD:UNKNOWN at line 1, column 61. | ||
``` | ||
|
||
## Alternatives | ||
|
||
You can use ClickHouse in AWS Lambda: https://github.com/aws-samples/aws-lambda-clickhouse | ||
|
||
This project is made by AWS engineers. |