GH-45594: [C++][Parquet] POC: Optimize Parquet DecodeArrow in DeltaLengthByteArray #45622
On my MacOS: After:
Before:
cc @pitrou this interface is a bit ugly, but I don't know whether we have a better way to do this. Would you mind taking a look?
Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
                               int64_t length) {
Why not call this AppendValuesWithLengths?
Also, it should probably take the right offset type?
And you need to add a docstring...
Suggested change:
- Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
-                                int64_t length) {
+ /// XXX docstring
+ Status AppendValuesWithLengths(std::string_view binary, util::span<const offset_type> lengths) {
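For readers following the thread, here is a minimal sketch of what a method with the suggested shape could do. It is a toy written against the standard library only: `ToyBinaryBuilder`, its `bool` return value, and the use of `std::vector` in place of `util::span` are assumptions made for illustration, not Arrow's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a binary builder; this is not Arrow's BaseBinaryBuilder.
class ToyBinaryBuilder {
 public:
  using offset_type = int32_t;

  /// Append lengths.size() values whose bytes are stored back to back in `binary`.
  /// Returns false and leaves the builder untouched if a length is negative
  /// or the lengths do not add up to binary.size().
  bool AppendValuesWithLengths(std::string_view binary,
                               const std::vector<offset_type>& lengths) {
    int64_t total = 0;
    for (offset_type len : lengths) {
      if (len < 0) return false;
      total += len;
    }
    if (static_cast<int64_t>(binary.size()) != total) return false;

    int64_t offset = static_cast<int64_t>(data_.size());
    data_.append(binary);  // one bulk copy of all value bytes
    for (offset_type len : lengths) {
      // Record the start offset of each value (overflow check omitted here).
      offsets_.push_back(static_cast<offset_type>(offset));
      offset += len;
    }
    return true;
  }

  int64_t length() const { return static_cast<int64_t>(offsets_.size()); }

  // View of the i-th value: it ends where the next value starts, or at the
  // end of the data for the last value.
  std::string_view GetView(int64_t i) const {
    assert(i >= 0 && i < length());
    const auto idx = static_cast<std::size_t>(i);
    const auto start = static_cast<std::size_t>(offsets_[idx]);
    const auto end = idx + 1 < offsets_.size()
                         ? static_cast<std::size_t>(offsets_[idx + 1])
                         : data_.size();
    return std::string_view(data_).substr(start, end - start);
  }

 private:
  std::vector<offset_type> offsets_;
  std::string data_;
};
```

The point of the shape is that the value bytes are copied once in bulk and only the offsets are computed per value, instead of one append call per value.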
if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
  return Status::Invalid("Binary data is too short");
}
Why not check equality? It's trivial to call substr on a std::string_view.
Suggested change:
- if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
-   return Status::Invalid("Binary data is too short");
- }
+ if (ARROW_PREDICT_FALSE(binary.size() != static_cast<size_t>(accum_length))) {
+   return Status::Invalid("Binary size does not match lengths array");
+ }
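A small sketch of the equality idea, under the assumption that the caller may hold a buffer with trailing bytes: the callee insists on an exact match, and the caller trims its view with `substr` first. The helper names `LengthsMatch` and `TrimToLengths` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string_view>
#include <vector>

// Hypothetical strict check: true only when the lengths account for every byte.
bool LengthsMatch(std::string_view binary, const std::vector<int32_t>& lengths) {
  const int64_t total =
      std::accumulate(lengths.begin(), lengths.end(), int64_t{0});
  return static_cast<int64_t>(binary.size()) == total;
}

// Hypothetical caller-side helper: keep only the bytes described by `lengths`.
// substr clamps the count, so a buffer that is too short is returned as is
// and will then fail the strict check.
std::string_view TrimToLengths(std::string_view buffer,
                               const std::vector<int32_t>& lengths) {
  const int64_t total =
      std::accumulate(lengths.begin(), lengths.end(), int64_t{0});
  return buffer.substr(0, static_cast<std::size_t>(total));
}
```

With this split, a mismatch in either direction (too few or too many bytes) is reported by the callee, while callers that legitimately have extra bytes pay only for a cheap view adjustment.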
if (ARROW_PREDICT_FALSE(binary.size() + value_data_builder_.length() >
                        std::numeric_limits<int32_t>::max())) {
  return Status::Invalid("Append binary data too long");
}
Let's call ValidateOverflow instead?
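The kind of check under discussion can be sketched as a standalone function. This is not the body of Arrow's `ValidateOverflow`; `WouldOverflow` is a hypothetical name, and the limit used here (the maximum of the offset type) is an assumption for illustration.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical check: would appending `new_bytes` to `current_bytes` of value
// data exceed what OffsetType can address? Assumes both arguments are
// non-negative and current_bytes is already within the limit.
template <typename OffsetType>
bool WouldOverflow(int64_t current_bytes, int64_t new_bytes) {
  constexpr int64_t kLimit = std::numeric_limits<OffsetType>::max();
  // Written as a subtraction so the comparison itself cannot overflow int64_t.
  return new_bytes > kLimit - current_bytes;
}
```

Templating on the offset type is what makes a shared helper preferable to a check hard-coded against `int32_t`: the same call stays correct for builders with 64-bit offsets.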
ARROW_RETURN_NOT_OK(value_data_builder_.Append(
    reinterpret_cast<const uint8_t*>(sub_data.data()), sub_data.size()));
accum_length = 0;
const int64_t initialize_offset = value_data_builder_.length();
Suggested change:
- const int64_t initialize_offset = value_data_builder_.length();
+ const int64_t initial_offset = value_data_builder_.length();
  return Status::OK();
}

Status AppendBinaryWithLengths(std::string_view binary, const int32_t* value_lengths,
Same comments below...
ARROW_RETURN_NOT_OK(Reserve(length));
UnsafeAppendToBitmap(/*valid_bytes=*/NULLPTR, length);
Can you call these after the error checks below?
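To illustrate why the ordering matters, here is a toy comparison of "mutate, then validate" against "validate, then mutate". `ToyState` and both functions are hypothetical and only model a validity vector and an offsets vector; they are not Arrow code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical builder state: one validity entry and one offset per value.
struct ToyState {
  std::vector<bool> validity;
  std::vector<int32_t> offsets;
  int64_t data_size = 0;
};

static int64_t SumLengths(const std::vector<int32_t>& lengths) {
  int64_t total = 0;
  for (int32_t len : lengths) total += len;
  return total;
}

static void AppendOffsets(ToyState& s, const std::vector<int32_t>& lengths) {
  for (int32_t len : lengths) {
    s.offsets.push_back(static_cast<int32_t>(s.data_size));
    s.data_size += len;
  }
}

// Mutates first: if validation fails, the validity entries were already
// appended and no longer line up with the offsets.
bool AppendMutateFirst(ToyState& s, std::size_t available_bytes,
                       const std::vector<int32_t>& lengths) {
  s.validity.insert(s.validity.end(), lengths.size(), true);
  if (static_cast<int64_t>(available_bytes) < SumLengths(lengths)) return false;
  AppendOffsets(s, lengths);
  return true;
}

// Validates first: a failed call leaves the state exactly as it was.
bool AppendCheckFirst(ToyState& s, std::size_t available_bytes,
                      const std::vector<int32_t>& lengths) {
  if (static_cast<int64_t>(available_bytes) < SumLengths(lengths)) return false;
  s.validity.insert(s.validity.end(), lengths.size(), true);
  AppendOffsets(s, lengths);
  return true;
}
```

After a failed `AppendMutateFirst`, the toy state holds validity entries with no matching offsets, which is the kind of inconsistency that moving the checks first avoids.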
if (ARROW_PREDICT_FALSE(binary.size() < static_cast<size_t>(accum_length))) {
  return Status::Invalid("Binary data is too short");
}
const int64_t original_offset = value_data_builder_.length();
Can you use the same naming as above? e.g. initial_offset
[&]() {
  offsets_builder_.UnsafeAppend(
      static_cast<int32_t>(original_offset + accum_length));
  return Status::OK();
Hmm, we're not incrementing length_idx here? This is not usual in Builder APIs, I don't think this is a good idea.
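The concern appears to be about two conventions for a lengths array when some slots are null. The sketch below contrasts them under that reading: a "dense" array with entries only for valid slots (where a null slot does not advance the index) and a "spaced" array with one entry per slot. Function names are hypothetical and the code is a standard-library toy, not Arrow's visitor-based implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Dense convention: `lengths` has one entry per valid slot only.
std::vector<int32_t> OffsetsFromDenseLengths(const std::vector<bool>& valid,
                                             const std::vector<int32_t>& lengths) {
  std::vector<int32_t> offsets;
  int32_t offset = 0;
  std::size_t length_idx = 0;
  for (bool is_valid : valid) {
    offsets.push_back(offset);
    // A null slot occupies no bytes and does not consume a length entry.
    if (is_valid) offset += lengths.at(length_idx++);
  }
  return offsets;
}

// Spaced convention: `lengths` has one entry per slot; entries at null
// slots are ignored.
std::vector<int32_t> OffsetsFromSpacedLengths(const std::vector<bool>& valid,
                                              const std::vector<int32_t>& lengths) {
  std::vector<int32_t> offsets;
  int32_t offset = 0;
  for (std::size_t i = 0; i < valid.size(); ++i) {
    offsets.push_back(offset);
    if (valid[i]) offset += lengths.at(i);
  }
  return offsets;
}
```

Both produce the same offsets for equivalent inputs; the difference is only in what the caller must pass, which is why mixing the dense convention into an API whose other methods are spaced can surprise users.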
Hmm, I really don't like the new API. Perhaps we should instead add these APIs and let Parquet use those builders?
diff --git a/cpp/src/arrow/array/builder_binary.h b/cpp/src/arrow/array/builder_binary.h
index 442e4a2632..e568279508 100644
--- a/cpp/src/arrow/array/builder_binary.h
+++ b/cpp/src/arrow/array/builder_binary.h
@@ -359,6 +359,9 @@ class BaseBinaryBuilder
   /// \return data pointer of the value date builder
   const offset_type* offsets_data() const { return offsets_builder_.data(); }
 
+  TypedBufferBuilder<offset_type>* offsets_builder() { return &offsets_builder_; }
+  TypedBufferBuilder<uint8_t>* value_data_builder() { return &value_data_builder_; }
+
   /// Temporary access to a value.
   ///
   /// This pointer becomes invalid on the next modifying operation.
Previously I was using a POC like this. Maybe an "unsafe" or other prefix would make this better?
I don't think adding "unsafe" would really bring anything (there is no risk of crashing, for instance). However, we should add a docstring explaining the caveats when using these methods.
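As an illustration of the kind of caveat such a docstring might spell out, here is a toy builder that exposes its internal buffers. Everything in it is an assumption for the sake of the example, including the `AdvanceLength` method and the exact bookkeeping; it does not describe how Arrow's builders track their length.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy builder, not Arrow's BaseBinaryBuilder.
class ToyBinaryBuilder {
 public:
  void Append(std::string_view value) {
    offsets_.push_back(static_cast<int32_t>(data_.size()));
    data_.append(value);
    ++length_;
  }

  /// Direct access to the internal buffers.
  ///
  /// Caveat: writes through these pointers bypass the builder's bookkeeping.
  /// The caller must append exactly one offset per value and then call
  /// AdvanceLength() so that length() matches the offsets again.
  std::vector<int32_t>* offsets_builder() { return &offsets_; }
  std::string* value_data_builder() { return &data_; }

  // Hypothetical hook for callers that wrote to the buffers directly.
  void AdvanceLength(int64_t n) { length_ += n; }

  int64_t length() const { return length_; }
  bool IsConsistent() const {
    return length_ == static_cast<int64_t>(offsets_.size());
  }

 private:
  std::vector<int32_t> offsets_;
  std::string data_;
  int64_t length_ = 0;
};
```

In this toy, nothing crashes when a caller forgets the follow-up step, but the builder silently reports a stale length, which matches the observation that the risk is inconsistency rather than memory unsafety.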
Rationale for this change
See #45594
What changes are included in this PR?
Are these changes tested?
Covered by existing tests.
Are there any user-facing changes?
no