Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The `create_hash` function is responsible for hashing values in arrays. At the moment, however, it (effectively) hashes NULL values to `0` for all types, which likely leads to suboptimal behavior: as @Dandandan observed in #812 (comment), the rows `NULL,1` and `1,NULL` hash to the same value.
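For illustration only, here is a minimal sketch of why a NULL slot that contributes nothing to the row hash makes `NULL,1` and `1,NULL` collide. This is not the actual `create_hash` code; `hash_value`, `hash_row`, and the XOR combiner are stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in per-value hash; the real kernel uses a different hasher,
// but any per-value hash exhibits the same issue.
fn hash_value(v: i64) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// If a NULL slot contributes 0 (equivalently, is skipped), the row hash
// depends only on the non-null values, not on which column held the NULL.
// XOR here is just a simple stand-in combiner for illustration.
fn hash_row(row: &[Option<i64>]) -> u64 {
    row.iter()
        .map(|v| v.map(hash_value).unwrap_or(0)) // NULL -> 0
        .fold(0u64, |acc, h| acc ^ h)
}

fn main() {
    assert_eq!(
        hash_row(&[None, Some(1)]), // row (NULL, 1)
        hash_row(&[Some(1), None])  // row (1, NULL)
    );
    println!("(NULL, 1) and (1, NULL) collide");
}
```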
Describe the solution you'd like
TBD
Describe alternatives you've considered
@jorgecarleitao's comment (copied below) from #790 (comment) offers a few alternatives:
From the hashing side, an open question to me at the moment is how to efficiently hash values + validity. I.e. given `V = ["a", "", "c"]` and `N = [true, false, true]`, I see some options:
1. `hash(V) ^ !N + unique * N`, where `unique` is a unique sentinel value exclusive for null values. If `hash` is vectorized, this operation is vectorized. (A scalar sketch of this option appears after the quoted comment.)
2. `concat(hash(value), is_valid) for value, is_valid in zip(V, N)`
3. Split the array between nulls and not nulls, i.e. `N -> (non-null indices, null indices)`, perform hashing over valid indices only, and then, at the very end, append all values for the nulls. We do this in the sort kernel, to reduce the number of slots to perform comparisons over.
If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances.
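As a starting point for that kind of benchmarking, here is a minimal scalar sketch of the first option above. The names `hash_with_validity`, `hash_one`, and `NULL_SENTINEL` are hypothetical, not an existing DataFusion or Arrow API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Arbitrary fixed sentinel hash reserved for null slots (hypothetical constant).
const NULL_SENTINEL: u64 = 0x9e37_79b9_7f4a_7c15;

// Stand-in per-value hash.
fn hash_one<T: Hash + ?Sized>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// Scalar form of the "unique sentinel for nulls" idea: hash valid slots
// normally, assign the sentinel to null slots. A vectorized version could
// mask the hashed values with the validity bitmap instead of branching
// per element.
fn hash_with_validity<T: Hash>(values: &[T], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &is_valid)| if is_valid { hash_one(v) } else { NULL_SENTINEL })
        .collect()
}

fn main() {
    let v = ["a", "", "c"];
    let n = [true, false, true];
    let hashes = hash_with_validity(&v, &n);
    assert_eq!(hashes[1], NULL_SENTINEL); // the null slot gets the sentinel
    // A valid empty string hashes on its contents, so NULL and "" no longer
    // need to share a hash value.
    println!("{:?}", hashes);
}
```

Keeping the per-slot rule behind a small function like this would make it easy to swap in the other two options during development and bench them against each other, as the quoted comment suggests.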
Additional context
Add any other context or screenshots about the feature request here.