Commit

Merge pull request #200 from ClickHouse/s3select
Add S3 Select
alexey-milovidov authored Jun 16, 2024
2 parents 01ebb8b + 04d2283 commit 01dc48e
Showing 3 changed files with 62 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -269,7 +269,7 @@ We also introduced the [Hardware Benchmark](https://benchmark.clickhouse.com/har
- [ ] Apache Drill
- [ ] Apache Kudu
- [ ] Apache Kylin
- [ ] S3 select command in AWS
- [x] S3 select command in AWS
- [x] Kinetica
- [ ] YDB
- [ ] OceanBase
2 changes: 2 additions & 0 deletions brytlytdb/README.md
@@ -3,3 +3,5 @@ An attempt to use their service resulted in a failure. It showed "Error: cannot
(Update after 4 months) It did not happen.

(Update after 8 months) It did not happen.

(Update after 2 years) It did not happen.
59 changes: 59 additions & 0 deletions s3select/README.md
@@ -0,0 +1,59 @@
## Introduction

AWS S3 supports the `SelectObjectContent` method, which lets you run SQL queries directly on S3 objects that contain data in CSV, JSON, or Parquet format.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-select-sql-reference-select.html

Unfortunately, the language is very primitive. It does not support ORDER BY or GROUP BY. It supports only filtering with WHERE, using a primitive set of functions, and the LIMIT clause.

That's why it cannot complete ClickBench.
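
Without GROUP BY and ORDER BY, any aggregation has to be done on the client after the filtered rows come back. A minimal sketch of what that could look like, assuming the rows arrive as CSV on stdin with `CounterID` as the first field; the helper name `top_counters` is hypothetical and not part of the benchmark:

```shell
# Hypothetical helper: emulate
#   SELECT CounterID, count(*) ... GROUP BY CounterID ORDER BY count(*) DESC LIMIT n
# on CSV rows read from stdin. Assumes the first field is CounterID.
top_counters() {
    cut -d, -f1 |                  # keep only the CounterID column
        sort | uniq -c |           # count occurrences of each CounterID
        sort -k1,1nr -k2,2n |      # highest count first, ties by CounterID
        head -n "${1:-10}" |       # apply the LIMIT (default 10)
        awk '{ print $2 "," $1 }'  # print as CounterID,count
}
```

The filtered output of an S3 Select query could be piped into `top_counters 10`, but every matching row still has to be transferred to the client first.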

The performance is atrocious, and the usability is dubious. It is pointless to use, even if you only want to pre-filter data by some conditions before further processing.

## Comparison

AWS S3 Select:

```
time aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, SearchPhrase FROM S3Object WHERE SearchPhrase LIKE '%трешбоксарский%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' /dev/stdout
1634,прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель
real 0m33.796s
user 0m0.842s
sys 0m0.091s
```

ClickHouse:

```
time ch -q "SELECT CounterID, SearchPhrase FROM s3('s3://clickhouse-public-datasets/hits_compatible/hits.parquet') WHERE SearchPhrase LIKE '%трешбоксарский%'"
1634 прировочный счёт трешбоксарский лабор для железневые в гаражных расписатель
real 0m3.526s
user 0m7.248s
sys 0m1.314s
```

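ClickHouse is about ten times faster in wall-clock time (3.5 s versus 33.8 s), even though it has to process the data on the client side, which shows up as higher user CPU time (7.2 s versus 0.8 s).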
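
The ratio can be checked from the `real` timings reported in the two runs above. A small sketch; the function name `speedup` is made up for illustration:

```shell
# Hypothetical helper: ratio of two wall-clock times, rounded to one decimal.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
speedup() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1fx\n", slow / fast }'
}

# "real" times from the S3 Select and ClickHouse runs above, in seconds.
speedup 33.796 3.526   # prints 9.6x
```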

## Caveats

Some invalid queries just hang instead of returning an error:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object WHERE SearchPhrase LIKE '%test%'" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -
```
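
Because a query like the one above may never return, it can help to put a deadline on the call. A minimal sketch, assuming the `timeout` utility from GNU coreutils is available (it is not installed by default on every system); the wrapper name `run_with_deadline` is hypothetical:

```shell
# Hypothetical wrapper: run a command, but give up after a deadline.
# GNU `timeout` exits with status 124 when the deadline is reached.
run_with_deadline() {
    deadline="$1"
    shift
    timeout "$deadline" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "command timed out after $deadline" >&2
    fi
    return "$status"
}
```

For example, prefixing the `aws s3api select-object-content` command above with `run_with_deadline 120s` would abort it after two minutes instead of letting it hang indefinitely.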

When they do return an error, the error message is unhelpful:

```
aws s3api select-object-content --bucket clickhouse-public-datasets --key 'hits_compatible/hits.parquet' --expression "SELECT CounterID, count(*) FROM S3Object GROUP BY CounterID ORDER BY count(*) DESC LIMIT 10" --expression-type SQL --input-serialization '{"Parquet": {}}' --output-serialization '{"CSV": {}}' -
An error occurred (ParseUnexpectedToken) when calling the SelectObjectContent operation: Unexpected token found KEYWORD:UNKNOWN at line 1, column 61.
```

## Alternatives

You can use ClickHouse in AWS Lambda: https://github.com/aws-samples/aws-lambda-clickhouse

This project was created by AWS engineers.
