GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray #45622

Open: mapleFU wants to merge 2 commits into main from optimize-decode-delta-length-byte-array

mapleFU (Member) commented Feb 25, 2025

Rationale for this change

See #45594

What changes are included in this PR?

  1. Add a hacky interface to the binary builder
  2. Optimize decoding in DeltaLengthByteArray
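
As a rough illustration of the second item (a standalone sketch, not the code in this PR): in DELTA_LENGTH_BYTE_ARRAY the value lengths are stored up front and the value bytes follow back to back, so a decoder can copy all bytes in one bulk append and derive the offsets as a running sum of the lengths. `BinaryColumn` and `AppendConcatenated` are hypothetical names that use only the standard library, not Arrow's builders.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a binary array under construction.
// Value i occupies data[offsets[i], offsets[i + 1]).
struct BinaryColumn {
  std::vector<int32_t> offsets{0};
  std::string data;
};

// Appends lengths.size() values whose bytes sit back to back in `bytes`.
// Returns false, leaving `out` untouched, if a length is negative, the
// lengths do not sum to bytes.size(), or the result would not fit in
// 32-bit offsets.
bool AppendConcatenated(std::string_view bytes, const std::vector<int32_t>& lengths,
                        BinaryColumn* out) {
  int64_t total = 0;
  for (int32_t len : lengths) {
    if (len < 0) return false;
    total += len;
  }
  if (total != static_cast<int64_t>(bytes.size())) return false;
  const int64_t initial_offset = static_cast<int64_t>(out->data.size());
  if (initial_offset + total > std::numeric_limits<int32_t>::max()) return false;

  // One bulk copy for all value bytes instead of one append per value.
  out->data.append(bytes.data(), bytes.size());

  // Offsets are a running sum of the lengths, shifted by the initial offset.
  out->offsets.reserve(out->offsets.size() + lengths.size());
  int64_t accum = 0;
  for (int32_t len : lengths) {
    accum += len;
    out->offsets.push_back(static_cast<int32_t>(initial_offset + accum));
  }
  return true;
}
```

Calling `AppendConcatenated("abcde", {1, 0, 4}, &col)` on an empty column would store the three values "a", "" and "bcde".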

Are these changes tested?

Covered by existing tests.

Are there any user-facing changes?

No.

@mapleFU requested a review from wgtmac as a code owner February 25, 2025 09:55

⚠️ GitHub issue #45594 has been automatically assigned in GitHub to PR creator.

mapleFU (Member, Author) commented Feb 25, 2025

On my macOS machine:

After:

BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/1024        1803 ns         1794 ns       366634 bytes_per_second=3.27674G/s items_per_second=570.698M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/4096        5502 ns         5453 ns       129056 bytes_per_second=4.21124G/s items_per_second=751.186M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/32768      39024 ns        38960 ns        17944 bytes_per_second=4.68932G/s items_per_second=841.076M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/65536      76109 ns        76037 ns         9217 bytes_per_second=4.79637G/s items_per_second=861.897M/s

Before:

BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/1024       11620 ns        10801 ns        55154 bytes_per_second=557.395M/s items_per_second=94.8042M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/4096       51641 ns        51339 ns        13196 bytes_per_second=458.007M/s items_per_second=79.7829M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/32768     479142 ns       459910 ns         1550 bytes_per_second=406.772M/s items_per_second=71.2488M/s
BM_ArrowBinaryDeltaLength/DL_DecodeArrow_Dense/65536     963371 ns       929585 ns          759 bytes_per_second=401.743M/s items_per_second=70.5003M/s

@mapleFU changed the title from "GH-45594: [C++][Parquet] Optimize Parquet DecodeArrow in DeltaLengthByteArray" to "GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray" Feb 25, 2025
@mapleFU force-pushed the optimize-decode-delta-length-byte-array branch from 153b065 to aa25e2e February 25, 2025 11:05
@mapleFU force-pushed the optimize-decode-delta-length-byte-array branch from aa25e2e to 4a4847f February 25, 2025 11:12
mapleFU (Member, Author) commented Feb 25, 2025

cc @pitrou: this interface is a bit ugly, but I don't know whether we have a better way to do this. Would you mind taking a look?

Comment on lines +307 to +308
Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
int64_t length) {
Member

Why not call this AppendValuesWithLengths?
Also, it should probably take the right offset type?
And you need to add a docstring...

Suggested change
- Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
- int64_t length) {
+ /// XXX docstring
+ Status AppendValuesWithLengths(std::string_view binary, util::span<const offset_type> lengths) {

Comment on lines +316 to +318
if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
return Status::Invalid("Binary data is too short");
}
Member

Why not check equality? It's trivial to call substr on a std::string_view.

Suggested change
- if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
- return Status::Invalid("Binary data is too short");
- }
+ if (ARROW_PREDICT_FALSE(binary.size() != static_cast<size_t>(accum_length))) {
+ return Status::Invalid("Binary size does not match lengths array");
+ }
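
To illustrate the point about `substr`, here is a minimal standalone sketch of how a caller holding a larger buffer could hand over exactly the bytes that the lengths describe, so that the callee can insist on equality. `SizesMatch` and `TakeValues` are hypothetical helpers, not functions from Arrow or from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <vector>

// Callee side: strict check that the view is exactly as long as the lengths say.
bool SizesMatch(std::string_view binary, const std::vector<int32_t>& lengths) {
  const int64_t total = std::accumulate(lengths.begin(), lengths.end(), int64_t{0});
  return static_cast<int64_t>(binary.size()) == total;
}

// Caller side: carve the exact slice off the front of a larger buffer.
// Returns false and leaves both views untouched if `remaining` is too short
// or the lengths sum to a negative number.
bool TakeValues(std::string_view* remaining, const std::vector<int32_t>& lengths,
                std::string_view* slice) {
  const int64_t total = std::accumulate(lengths.begin(), lengths.end(), int64_t{0});
  if (total < 0 || static_cast<int64_t>(remaining->size()) < total) return false;
  *slice = remaining->substr(0, static_cast<std::size_t>(total));
  remaining->remove_prefix(static_cast<std::size_t>(total));
  return true;
}
```

With this split, a "too short" buffer is detected by the caller, and a mismatch seen by the callee always indicates a caller bug rather than trailing data.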

Comment on lines +319 to +322
if (ARROW_PREDICT_FALSE(binary.size() + value_data_builder_.length() >
std::numeric_limits<int32_t>::max())) {
return Status::Invalid("Append binary data too long");
}
Member

Let's call ValidateOverflow instead?
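
The thread does not show what ValidateOverflow does internally. As a hedged sketch of the kind of check involved, assuming the goal is to reject appends whose total value bytes would no longer fit in a 32-bit offset, the comparison can be done entirely in signed 64-bit arithmetic. `FitsInInt32Offsets` is a hypothetical free function, not an Arrow API, and the exact limit Arrow enforces may differ.

```cpp
#include <cstdint>
#include <limits>

// `current_bytes` is how many value bytes are already stored (assumed to be
// non-negative), `new_bytes` is how many are about to be appended.
bool FitsInInt32Offsets(int64_t current_bytes, int64_t new_bytes) {
  constexpr int64_t kLimit = std::numeric_limits<int32_t>::max();
  // Rearranged as a subtraction so no intermediate sum can wrap around.
  return new_bytes >= 0 && current_bytes <= kLimit - new_bytes;
}
```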

ARROW_RETURN_NOT_OK(value_data_builder_.Append(
reinterpret_cast<const uint8_t*>(sub_data.data()), sub_data.size()));
accum_length = 0;
const int64_t initialize_offset = value_data_builder_.length();
Member

Suggested change
- const int64_t initialize_offset = value_data_builder_.length();
+ const int64_t initial_offset = value_data_builder_.length();

return Status::OK();
}

Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
Member

Same comments below...

Comment on lines +309 to +310
ARROW_RETURN_NOT_OK(Reserve(length));
UnsafeAppendToBitmap(/*valid_bytes=*/NULLPTR, length);
Member

Can you call these after the error checks below?
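
One plausible reading of this request (an assumption, since the comment does not spell it out) is that mutating the builder before validating the input can leave it half-updated when a check fails. The toy below contrasts the two orderings. `ToyBuilder` is a hypothetical stand-in built on the standard library, and its `validity` vector only plays the role of the null bitmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Invariant: validity always has exactly offsets.size() - 1 entries.
struct ToyBuilder {
  std::vector<bool> validity;
  std::vector<int32_t> offsets{0};

  bool Consistent() const { return validity.size() + 1 == offsets.size(); }

  // Mutate-then-check: on failure the bitmap has already grown.
  bool AppendCheckLast(const std::vector<int32_t>& lengths, std::size_t data_size) {
    validity.insert(validity.end(), lengths.size(), true);
    if (!LengthsMatch(lengths, data_size)) return false;  // invariant broken here
    AppendOffsets(lengths);
    return true;
  }

  // Check-then-mutate: on failure nothing has changed.
  bool AppendCheckFirst(const std::vector<int32_t>& lengths, std::size_t data_size) {
    if (!LengthsMatch(lengths, data_size)) return false;
    validity.insert(validity.end(), lengths.size(), true);
    AppendOffsets(lengths);
    return true;
  }

  static bool LengthsMatch(const std::vector<int32_t>& lengths, std::size_t data_size) {
    int64_t total = 0;
    for (int32_t len : lengths) {
      if (len < 0) return false;
      total += len;
    }
    return total == static_cast<int64_t>(data_size);
  }

  // Overflow of the 32-bit offsets is ignored to keep the toy short.
  void AppendOffsets(const std::vector<int32_t>& lengths) {
    int32_t next = offsets.back();
    for (int32_t len : lengths) {
      next += len;
      offsets.push_back(next);
    }
  }
};
```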

if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
return Status::Invalid("Binary data is too short");
}
const int64_t original_offset = value_data_builder_.length();
Member

Can you use the same naming as above? e.g. initial_offset

[&]() {
offsets_builder_.UnsafeAppend(
static_cast<int32_t>(original_offset + accum_length));
return Status::OK();
Member

Hmm, we're not incrementing length_idx here? This is not usual in Builder APIs, I don't think this is a good idea.

@github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Feb 27, 2025
pitrou (Member) commented Feb 27, 2025

Hmm, I really don't like the new BinaryBuilder API that this is adding.

Perhaps we should instead add these APIs and let Parquet use those builders?

diff --git a/cpp/src/arrow/array/builder_binary.h b/cpp/src/arrow/array/builder_binary.h
index 442e4a2632..e568279508 100644
--- a/cpp/src/arrow/array/builder_binary.h
+++ b/cpp/src/arrow/array/builder_binary.h
@@ -359,6 +359,9 @@ class BaseBinaryBuilder
   /// \return data pointer of the value date builder
   const offset_type* offsets_data() const { return offsets_builder_.data(); }
 
+  TypedBufferBuilder<offset_type>* offsets_builder() { return &offsets_builder_; }
+  TypedBufferBuilder<uint8_t>* value_data_builder() { return &value_data_builder_; }
+
   /// Temporary access to a value.
   ///
   /// This pointer becomes invalid on the next modifying operation.

mapleFU (Member, Author) commented Feb 27, 2025

> Perhaps we should instead add these APIs and let Parquet use those builders?

I previously used a POC like this. Maybe an "unsafe" prefix (or something similar) would make this better?

diff --git a/cpp/src/arrow/array/builder_binary.h b/cpp/src/arrow/array/builder_binary.h
index 442e4a2632..e568279508 100644
--- a/cpp/src/arrow/array/builder_binary.h
+++ b/cpp/src/arrow/array/builder_binary.h
@@ -359,6 +359,9 @@ class BaseBinaryBuilder
   /// \return data pointer of the value date builder
   const offset_type* offsets_data() const { return offsets_builder_.data(); }
 
+  TypedBufferBuilder<offset_type>* unsafe_offsets_builder() { return &offsets_builder_; }
+  TypedBufferBuilder<uint8_t>* unsafe_value_data_builder() { return &value_data_builder_; }
+
   /// Temporary access to a value.
   ///
   /// This pointer becomes invalid on the next modifying operation.

pitrou (Member) commented Feb 27, 2025

I don't think adding "unsafe" would really bring anything (there is no risk of crashing, for instance). However, we should add a docstring explaining the caveats when using these methods.
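
The thread does not list the caveats such a docstring would cover. One candidate (an assumption here) is that writing through the inner buffer builders bypasses the outer builder's own bookkeeping, such as its logical length. The toy below illustrates that with standard containers. `ToyBinaryBuilder` and `AdvanceLength` are hypothetical, and the offsets layout is simplified compared with Arrow's BaseBinaryBuilder.

```cpp
#include <cstdint>
#include <string>
#include <vector>

class ToyBinaryBuilder {
 public:
  // Regular append: keeps data, offsets and length in sync.
  void Append(const std::string& value) {
    data_.append(value);
    offsets_.push_back(static_cast<int32_t>(data_.size()));
    ++length_;
  }

  int64_t length() const { return length_; }

  // Raw access in the spirit of the accessors proposed in the diff above.
  std::vector<int32_t>* offsets_builder() { return &offsets_; }
  std::string* value_data_builder() { return &data_; }

  // Hypothetical hook: a caller that wrote values directly must report them.
  void AdvanceLength(int64_t n) { length_ += n; }

  bool Consistent() const {
    return static_cast<int64_t>(offsets_.size()) == length_ + 1;
  }

 private:
  std::vector<int32_t> offsets_{0};
  std::string data_;
  int64_t length_ = 0;
};
```

After a direct write through `value_data_builder()` and `offsets_builder()`, `length()` still reports the old count until the caller updates it, which is the kind of behaviour a docstring would need to call out.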

2 participants