Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement bitmap_distinct function using bitmap #1823

Open
Ted-Jiang opened this issue Feb 14, 2022 · 6 comments
Open

implement bitmap_distinct function using bitmap #1823

Ted-Jiang opened this issue Feb 14, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@Ted-Jiang
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Like #1115 implement u8, i8, u16, i16, u32, i32 by using bitmap called bitmap_distinct

Describe the solution you'd like
I will use arrow-bitmap and roaring-bitmap and check the performance of both

@Ted-Jiang Ted-Jiang added the enhancement New feature or request label Feb 14, 2022
@Ted-Jiang
Copy link
Member Author

i have implement a initial version get below result:
1million_rows_10thousand_distinct.parquet

1. count distint
+----------------------------+
| COUNT(DISTINCT test.value) |
+----------------------------+
| 10000                      |
+----------------------------+
1 row in set. Query took 0.237 seconds.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. bitmap distinct (roaring-rs)

+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 10000                           |
+---------------------------------+
1 row in set. Query took 0.052 seconds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3. approx distinct (hll)

+----------------------------+
| APPROXDISTINCT(test.value) |
+----------------------------+
| 9943                       |
+----------------------------+
1 row in set. Query took 0.047 seconds.

the bitmap used is this
i have checked influx_iox use croating-rs
@alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍 !

@liukun4515
Copy link
Contributor

i have implement a initial version get below result: 1million_rows_10thousand_distinct.parquet

1. count distint
+----------------------------+
| COUNT(DISTINCT test.value) |
+----------------------------+
| 10000                      |
+----------------------------+
1 row in set. Query took 0.237 seconds.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2. bitmap distinct (roaring-rs)

+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 10000                           |
+---------------------------------+
1 row in set. Query took 0.052 seconds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3. approx distinct (hll)

+----------------------------+
| APPROXDISTINCT(test.value) |
+----------------------------+
| 9943                       |
+----------------------------+
1 row in set. Query took 0.047 seconds.

the bitmap used is this i have checked influx_iox use croating-rs @alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍 !

Could you please file the draft of the pull request?
@Ted-Jiang

@Ted-Jiang
Copy link
Member Author

Test result of use roaring-rs and croaring-rs (use same logic: insert one value one time)

1million_rows_10thousand_distinct.parquet

bitmap distinct 

(roaring-rs)
+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 10000                           |
+---------------------------------+
1 row in set. Query took 0.052 seconds

(croaring-rs)
+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 10000                           |
+---------------------------------+
1 row in set. Query took 0.038 seconds.

1million_1million.parquet

roaring-rs
+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 631504                          |
+---------------------------------+
1 row in set. Query took 0.175 seconds(roaring-rs).

croaring-rs
+---------------------------------+
| BITMAPCOUNTDISTINCT(test.value) |
+---------------------------------+
| 631504                          |
+---------------------------------+
1 row in set. Query took 0.052 seconds (croaring-rs).

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

@alamb Sorry to bother you 😂, could you share some info why use croating-rs, if you have a bench result that would be fantastic 👍 !

IOx is using croaring for its "ReadBuffer" (an optimized in memory format we use rather than straight up RecordBatches).
@e-dard is the expert. There is more information from the Feb 2 2021 tech talk: https://github.com/influxdata/influxdb_iox/tree/main/docs#iox-tech-talks

Specifically:
Recording: https://www.youtube.com/watch?v=KslD31VNqPU
Slides: https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-intro-to-the-influxdb-iox-read-buffer-a-readoptimized-inmemory-query-execution-engine

Sorry for the delayed response

@e-dard
Copy link
Contributor

e-dard commented Feb 16, 2022

Hey @Ted-Jiang!

Nice to see some of these ideas making there way into Datafusion! I developed some of these ideas for IOx's Read Buffer happened in 2020.

At the time I chose croaring-rs for a couple of reasons:

  • performance: I did some benchmarking and it was faster than the pure rust crate (sadly I can't find these benchmarks on my machine now).
  • reliability: croaring-rs wraps the officially maintained C/C++ version, which generally means it's a lower risk choice.

The TLDR of how I use bitmaps in the Read Buffer is as follows:

  • constant time row identification for predicates that match column op literal (which is the vast majority for InfluxData's use-cases). When a user specifies one of these we already have a compressed bitmap of all matching rows available.
  • (very) late materialisation. After all predicates are applied to all columns in memory (generally only working on the compressed representations) then the bitsets are combined appropriately (intersected/unioned etc). Only then does the Read Buffer begin materialising rows into output record batches based on the ordinal offsets in the final bitmap.

@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Feb 17, 2022

Hi @e-dard Thanks a lot for your info! What an admirable work in IOx.
I have done some test above,there is 2x boost for croaring-rs than roaring-rs.
and according to reliability you mentioned, they are strong evidences for using it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants