Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The `create_hash` function is responsible for hashing values in arrays. At the moment, however, it (effectively) hashes NULL values to `0` for all types, which likely leads to suboptimal behavior: as @Dandandan observed in #812 (comment), the rows `NULL,1` and `1,NULL` hash to the same value.
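For illustration only, here is a minimal sketch of why a NULL slot that contributes nothing to the row hash makes `NULL,1` and `1,NULL` collide. This is not the actual `create_hash` code; `hash_value`, `hash_row`, and the XOR combiner are stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in per-value hash; the real kernel uses a different hasher,
// but any per-value hash exhibits the same issue.
fn hash_value(v: i64) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// If a NULL slot contributes 0 (equivalently, is skipped), the row hash
// depends only on the non-null values, not on which column held the NULL.
// XOR here is just a simple stand-in combiner for illustration.
fn hash_row(row: &[Option<i64>]) -> u64 {
    row.iter()
        .map(|v| v.map(hash_value).unwrap_or(0)) // NULL -> 0
        .fold(0u64, |acc, h| acc ^ h)
}

fn main() {
    assert_eq!(
        hash_row(&[None, Some(1)]), // row (NULL, 1)
        hash_row(&[Some(1), None])  // row (1, NULL)
    );
    println!("(NULL, 1) and (1, NULL) collide");
}
```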
Describe the solution you'd like
TBD
Describe alternatives you've considered
@jorgecarleitao's comment (copied below) from #790 (comment) offers a few alternatives:
From the hashing side, an open question to me at the moment is how to efficiently hash values + validity. I.e. given `V = ["a", "", "c"]` and `N = [true, false, true]`, I see some options:
1. `hash(V) ^ !N + unique * N`, where `unique` is a unique sentinel value exclusive for null values. If `hash` is vectorized, this operation is vectorized. (A scalar sketch of this option appears after the quoted comment.)
2. `concat(hash(value), is_valid) for value, is_valid in zip(V, N)`
3. Split the array between nulls and not nulls, i.e. `N -> (non-null indices, null indices)`, perform hashing over valid indices only, and then, at the very end, append all values for the nulls. We do this in the sort kernel, to reduce the number of slots to perform comparisons over.
If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances.
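As a starting point for that kind of benchmarking, here is a minimal scalar sketch of the first option above. The names `hash_with_validity`, `hash_one`, and `NULL_SENTINEL` are hypothetical, not an existing DataFusion or Arrow API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Arbitrary fixed sentinel hash reserved for null slots (hypothetical constant).
const NULL_SENTINEL: u64 = 0x9e37_79b9_7f4a_7c15;

// Stand-in per-value hash.
fn hash_one<T: Hash + ?Sized>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// Scalar form of the "unique sentinel for nulls" idea: hash valid slots
// normally, assign the sentinel to null slots. A vectorized version could
// mask the hashed values with the validity bitmap instead of branching
// per element.
fn hash_with_validity<T: Hash>(values: &[T], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &is_valid)| if is_valid { hash_one(v) } else { NULL_SENTINEL })
        .collect()
}

fn main() {
    let v = ["a", "", "c"];
    let n = [true, false, true];
    let hashes = hash_with_validity(&v, &n);
    assert_eq!(hashes[1], NULL_SENTINEL); // the null slot gets the sentinel
    // A valid empty string hashes on its contents, so NULL and "" no longer
    // need to share a hash value.
    println!("{:?}", hashes);
}
```

Keeping the per-slot rule behind a small function like this would make it easy to swap in the other two options during development and bench them against each other, as the quoted comment suggests.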
Additional context
Add any other context or screenshots about the feature request here.