Skip to content

Commit

Permalink
Remove extraneous padding in plain encoder
Browse files Browse the repository at this point in the history
This fixes two bugs with the padding added by the bytes_to_array
function in the plain encoder. This function is used for reading indexes
in LanceDB cloud.

Problem 1:
Prior to this commit, space was reserved based on an incorrect
multiplication of the user-supplied buffer's byte length and the
bytewidth of the item type. Instead, we should multiply the bytewidth
by the element count. The effect of this was previously to pad too
much space at the end of parsed arrays.

Problem 2:
We previously had some logic that was padding a buffer by extending it
with 0-valued uint32s, instead of 0-valued uint8. This resulted in a
multiplication of the padding added by four.
  • Loading branch information
wkalt committed Feb 7, 2025
1 parent d62ddb0 commit 0b8bcdb
Show file tree
Hide file tree
Showing 2 changed files with 21 additions and 2 deletions.
2 changes: 1 addition & 1 deletion rust/lance-arrow/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -822,7 +822,7 @@ impl BufferExt for arrow_buffer::Buffer {
let mut buf = MutableBuffer::with_capacity(size_bytes);
let to_fill = size_bytes - bytes.len();
buf.extend(bytes);
buf.extend(std::iter::repeat(0).take(to_fill));
buf.extend(std::iter::repeat(0 as u8).take(to_fill));
Self::from(buf)
}
}
Expand Down
21 changes: 20 additions & 1 deletion rust/lance-io/src/encodings/plain.rs
Original file line number Diff line number Diff line change
Expand Up @@ -199,7 +199,7 @@ pub fn bytes_to_array(
{
// this code is taken from
// https://github.com/apache/arrow-rs/blob/master/arrow-data/src/data.rs#L748-L768
let len_plus_offset = bytes.len() + offset;
let len_plus_offset = len + offset;
let min_buffer_size = len_plus_offset.saturating_mul(*byte_width);

// alignment or size isn't right -- just make a copy
Expand Down Expand Up @@ -634,6 +634,25 @@ mod tests {
test_round_trip(arrs.as_slice(), t).await;
}

#[tokio::test]
async fn test_bytes_to_array_padding() {
let bytes = Bytes::from_static(&[0x01, 0x00, 0x02, 0x00, 0x03]);
let arr = bytes_to_array(&DataType::UInt16, bytes, 3, 0).unwrap();

let expected = UInt16Array::from(vec![1, 2, 3]);
assert_eq!(arr.as_ref(), &expected);

// Underlying data is padded to the nearest multiple of two bytes (for u16).
let data = arr.to_data();
let buf = &data.buffers()[0];
let repr = format!("{:?}", buf);
assert!(
repr.contains("[1, 0, 2, 0, 3, 0]"),
"Underlying buffer contains unexpected data: {}",
repr
);
}

#[tokio::test]
async fn test_encode_decode_nested_fixed_size_list() {
// FixedSizeList of FixedSizeList
Expand Down

0 comments on commit 0b8bcdb

Please sign in to comment.