
Add a CardinalityAwareRowConverter #4736

Closed · 7 commits

Conversation

JayjeetAtGithub (Contributor)

Which issue does this PR close?

This PR adds a CardinalityAwareRowConverter (a wrapper around RowConverter) to arrow-row. When the cardinality of a dictionary-encoded sort field is >= 10, we stop preserving dictionary encoding and fall back to string encoding.

Closes apache/datafusion#7200.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?
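
For orientation, here is a minimal sketch of the mechanism described above, not the PR's actual code: the wrapper decides on the first batch whether each dictionary field should be demoted to its value type, then reuses that decision for every subsequent batch. Names such as `dict_values_len` and the `demoted` field are illustrative, and the sketch assumes the arrow-row API of this era (`RowConverter`, `SortField`, `convert_columns`).

```rust
use std::sync::Arc;

use arrow::array::{downcast_dictionary_array, Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, Rows, SortField};

const LOW_CARDINALITY_THRESHOLD: usize = 10;

/// Length of a dictionary column's values array (0 for non-dictionary columns).
fn dict_values_len(col: &dyn Array) -> usize {
    downcast_dictionary_array!(
        col => col.values().len(),
        _ => 0
    )
}

struct CardinalityAwareRowConverter {
    inner: Option<RowConverter>,
    /// Columns to cast to their value type before each conversion.
    demoted: Vec<bool>,
}

impl CardinalityAwareRowConverter {
    fn new(num_columns: usize) -> Self {
        Self {
            inner: None,
            demoted: vec![false; num_columns],
        }
    }

    fn convert_columns(&mut self, columns: &[ArrayRef]) -> Result<Rows, ArrowError> {
        if self.inner.is_none() {
            // Decide once, on the first batch, and reuse that decision for all
            // later batches so that every produced row stays comparable.
            let mut fields = Vec::with_capacity(columns.len());
            for (i, col) in columns.iter().enumerate() {
                match col.data_type() {
                    DataType::Dictionary(_, value_type)
                        if dict_values_len(col.as_ref()) >= LOW_CARDINALITY_THRESHOLD =>
                    {
                        // High cardinality: encode the values directly (string
                        // encoding for Utf8 dictionaries) instead of interning.
                        self.demoted[i] = true;
                        fields.push(SortField::new(value_type.as_ref().clone()));
                    }
                    other => fields.push(SortField::new(other.clone())),
                }
            }
            self.inner = Some(RowConverter::new(fields)?);
        }
        // Cast demoted dictionary columns to their value type.
        let columns = columns
            .iter()
            .zip(&self.demoted)
            .map(|(col, demote)| match (demote, col.data_type()) {
                (true, DataType::Dictionary(_, value_type)) => cast(col, value_type),
                _ => Ok(Arc::clone(col)),
            })
            .collect::<Result<Vec<_>, _>>()?;
        self.inner.as_mut().unwrap().convert_columns(&columns)
    }
}
```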

github-actions bot added the `arrow` (Changes to the arrow crate) label Aug 24, 2023
JayjeetAtGithub changed the title from "Sort fix" to "Add a CardinalityAwareRowConverter" Aug 24, 2023
JayjeetAtGithub (Contributor, Author) commented Aug 25, 2023

Currently, I hardcode the dictionary key type to the Int and UInt types; I am working on fixing that. In general, any ideas on how to count unique items in a dictionary-encoded column efficiently would be super handy. Thanks.

alamb (Contributor) commented Aug 25, 2023

@JayjeetAtGithub -- I think after some thought we should put this code into DataFusion (it will need to be changed).

The reason I think this belongs in DataFusion is that the arrow row converter already has a way to control whether interning is used for dictionaries. It seems somewhat confusing to also have something that automatically picks an interning strategy based on cardinality -- I think until we see others wanting to use this, let's leave it in DataFusion.

alamb (Contributor) commented Aug 25, 2023

> Currently, I hardcode the dictionary key type to the Int and UInt types; I am working on fixing that. In general, any ideas on how to count unique items in a dictionary-encoded column efficiently would be super handy. Thanks.

Dictionary keys can only be one of the types implementing https://docs.rs/arrow/latest/arrow/datatypes/trait.ArrowDictionaryKeyType.html (so Int/UInt is the right list of types)
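
As for counting unique items efficiently: one possible approach (a sketch; `in_use_cardinality` is an illustrative name) is a single pass over the keys with a seen-vector sized to the values array, which avoids hashing and handles every key type via arrow's `downcast_dictionary_array!` macro:

```rust
use arrow::array::{downcast_dictionary_array, Array};
use arrow::datatypes::ArrowNativeType;

/// Count the distinct key values actually used by a dictionary column:
/// O(len) time, O(values_len) space. Returns None for non-dictionary columns.
fn in_use_cardinality(col: &dyn Array) -> Option<usize> {
    downcast_dictionary_array!(
        col => {
            let mut seen = vec![false; col.values().len()];
            let mut count = 0;
            for key in col.keys().iter().flatten() {
                let idx = key.as_usize();
                if !seen[idx] {
                    seen[idx] = true;
                    count += 1;
                }
            }
            Some(count)
        },
        _ => None
    )
}
```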

alamb (Contributor) left a comment


Thanks again @JayjeetAtGithub -- as I commented before, I suggest we start with this code in the datafusion repo; we can then move it upstream in the future if it turns out to be more commonly useful.

Let's move the conversation to apache/datafusion#7401

```rust
    _ => unreachable!(),
};

if cardinality >= LOW_CARDINALITY_THRESHOLD {
```
Contributor:


This will effectively switch the encoding mid-stream, I think -- which will mean that the output can't be compared with previously created rows, which is not correct.

I think the decision has to be made based on the first batch, and then that decision used for encoding all rows.

Contributor (Author):


@alamb I am sorry, I didn't quite get it. I thought what I was doing was tapping into the first batch of the stream, looking at it, setting the right codec (whether to use the interner or not) for encoding the batches, and then letting the conversion run for all the batches (including the first one).
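
In code terms, the pattern being described might look like this (a hypothetical driver reusing the `CardinalityAwareRowConverter` sketched in the PR description; `convert_stream` is an illustrative name):

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// The first call to convert_columns inspects the batch and fixes the codec;
/// every later call reuses that decision, so rows from all batches stay
/// mutually comparable.
fn convert_stream(batches: &[RecordBatch]) -> Result<(), ArrowError> {
    let num_columns = batches.first().map(|b| b.num_columns()).unwrap_or(0);
    let mut converter = CardinalityAwareRowConverter::new(num_columns);
    for batch in batches {
        let _rows = converter.convert_columns(batch.columns())?;
        // ...hand the rows to whatever sort / merge consumes them...
    }
    Ok(())
}
```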

```rust
for (i, col) in columns.iter().enumerate() {
    if let DataType::Dictionary(k, _) = col.data_type() {
        // let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len();
        let cardinality = match k.as_ref() {
```
Contributor:


I originally thought that we should base the decision of "is this a high-cardinality column" on the "in use" cardinality of the dictionary (aka how many distinct key values there are -- as suggested on apache/datafusion#7200 (comment)).

However, I now realize that the number of potential key values (aka the length of the values array) may actually be a more robust predictor of being "high cardinality" (as the other values in the dictionary could be used in subsequent batches, perhaps).

Do you have any opinion @tustvold ?

Contributor:


Given the RowConverter blindly generates a mapping for all values, regardless of whether they appear in the keys, I think we should just use the length of the values. While an argument could be made for doing something more sophisticated, that would only really make sense if the dictionary interner itself followed a similar approach.

Contributor:


Makes sense. Thank you

tustvold marked this pull request as draft August 25, 2023 16:34
JayjeetAtGithub (Contributor, Author)

Closing this PR in favor of apache/datafusion#7401; we'll continue the discussion there.

Labels: arrow (Changes to the arrow crate)
Linked issue: RowConverter keeps growing in size while merging streams on high-cardinality dictionary fields