Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Decimal128 averaging code to be vectorizable (and easier to read) #6810

Merged
merged 2 commits into from
Jul 2, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 30, 2023

Which issue does this PR close?

Related to #4973

Rationale for this change

While working on a POC for new grouping code, I found I needed to call the decimal averaging code, but could not do so in a performant (vectorized) way because it created a ScalarValue

What changes are included in this PR?

  1. Pull out the decimal average code into its own structure
  2. Add comments that explain what is happening
  3. Refactor calculate_result_decimal_for_avg

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the physical-expr Physical Expressions label Jun 30, 2023
@@ -37,45 +37,107 @@ pub fn get_accum_scalar_values_as_arrays(
.collect::<Vec<_>>())
}

pub fn calculate_result_decimal_for_avg(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called once per output row in the current code. I want to be able to perform the setup calculations once and then call the minimal code per row in a loop. Thus this refactor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it called per output row? Isn't it called in evaluate? I think evaluate is called only to get final aggregate value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it called in evaluate?

Yes

I think evaluate is called only to get final aggregate value?

Correct -- it is called for each group to get the final aggregate value. The ASCII art on this comment #4973 (comment) might help to understand why

/// (passed to `Self::try_new`)
/// * count: total count, stored as a i128 (*NOT* a Decimal128 value)
#[inline(always)]
pub fn avg(&self, sum: i128, count: i128) -> Result<i128> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal is to call this function in a loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that you mean you will call this in a loop in new grouping code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly -- #6810 (comment)

@alamb alamb marked this pull request as ready for review June 30, 2023 14:20
@alamb alamb requested review from viirya and mingmwang June 30, 2023 14:20
@alamb alamb changed the title Refactor Decimal128 averaging code to be vectorizable (and easier to … Refactor Decimal128 averaging code to be vectorizable (and easier to read) Jun 30, 2023
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review this soon or later in weekend.

@alamb
Copy link
Contributor Author

alamb commented Jun 30, 2023

I'll review this soon or later in weekend.

Thanks @viirya -- no rush from my perspective

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring looks good to me. I think the purpose is to call refactored function (avg) in other code (not this change) in vectorizable way?

@alamb
Copy link
Contributor Author

alamb commented Jul 2, 2023

Refactoring looks good to me. I think the purpose is to call refactored function (avg) in other code (not this change) in vectorizable way?

That is right -- in case you are interested, the location is here (and it shows very promising results)

https://github.com/alamb/arrow-datafusion/blob/35b413239a1fdad1663231cddc73501748404835/datafusion/physical-expr/src/aggregate/average.rs#L172-L181

@alamb alamb merged commit 9c46b1f into apache:main Jul 2, 2023
@alamb alamb deleted the alamb/decimal_sum branch July 2, 2023 12:00
2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023
…read) (apache#6810)

* Refactor Decimal128 averaging code to be vectorizable (and easier to read)

* Update datafusion/physical-expr/src/aggregate/utils.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants