-
Notifications
You must be signed in to change notification settings - Fork 882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a CardinalityAwareRowConverter
#4736
Conversation
Currently, I hardcode the dictionary key type to be |
@JayjeetAtGithub -- I think after some thought we should put this code into DataFusion (it will need to be changed). The reason I think this belongs in DataFusion is that the arrow row converter already has a way to control if interning is used for dictionaries. I seems somewhat confusing to then also have something that automatically picks an interning strategy based on cardinality -- I think until we see others wanting to use this, let's leave this in DataFusion |
Dictionaries can only be one of https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowDictionaryKeyType.html (so Int/UInt is the right list of types) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @JayjeetAtGithub -- As I commented before I suggest we start with this code in the datafusion repo and then we can move it upstream in the future if it turns out to be more commonly useful
Let's move the conversation to apache/datafusion#7401
_ => unreachable!(), | ||
}; | ||
|
||
if cardinality >= LOW_CARDINALITY_THRESHOLD { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will effectively switch the encoding mid-steam, I think -- which will mean that the output can't be compared with previously created rows, which is not correct.
I think the decision has to be made based on the first batch and then that decision used for encoding all rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I am sorry, I didn't quite get it. I thought what I was doing was, tapping into the first batch of the stream, looking into it, setting the right codec (whether to use the interner or not) to encode the batches, and then let the conversion going for all the batches (including the first one).
for (i, col) in columns.iter().enumerate() { | ||
if let DataType::Dictionary(k, _) = col.data_type() { | ||
// let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len(); | ||
let cardinality = match k.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally thought that we should base the decision of "is this a high cardinality column" on the "in use" cardinality of the dictionary (aka how may distinct key values there were -- as suggested on apache/datafusion#7200 (comment))
However, I now realize that maybe the number of potential key values (aka the length of the values array) is actually a more robust predictor of being "high cardinality" (as the other values in the dictionary could be used in subsequent batches, perhaps)
Do you have any opinion @tustvold ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the RowConverter blindly generates a mapping for all values, regardless of if they appear in the keys, I think we should just use the length of the values. Whilst an argument could be made for doing something more sophisticated, this would only really make sense if the dictionary interner itself followed a similar approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Thank you
Closing this PR in lieu of apache/datafusion#7401. Shall continue the discussion there. |
Which issue does this PR close?
This PR adds a
CardinalityAwareRowConverter
(a wrapper aroundRowConverter
) toarrow-row
. Basically, when the cardinality of dict-encoded sort fields is>= 10
, we don't preserve dictionary encoding any more and fall back to using string encoding.Closes apache/datafusion#7200.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?