Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: convert binary logical encoding/decoding to physical array encoding/page decoding #2426

Merged
merged 21 commits into from
Jun 20, 2024

Conversation

raunaks13
Copy link
Contributor

@raunaks13 raunaks13 commented May 31, 2024

Should enable adding a dictionary array encoder (which will help add dictionary encoding functionality - ref #2409 )

@github-actions github-actions bot added the enhancement New feature or request label May 31, 2024
@raunaks13 raunaks13 requested a review from westonpace June 3, 2024 15:15
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
Comment on lines 46 to 54
let mut int_arr = Vec::new();

let mut cum_sz: u32 = 0;
for i in 0..string_arr.len() {
let s = string_arr.value(i);
let sz = s.len() as u32;
cum_sz += sz;
int_arr.push(cum_sz);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a way we can do this without making a copy of the data (not entirely true, a copy is made during cast but we can rely on cast being as optimized as possible):

  • Get offsets from StringArray to get OffsetBuffer<i32>
  • Get nulls from StringArray
  • Convert nulls + offsets into array (from OffsetBuffer<i32> -> Int32Array)
  • Cast from Int32Array -> UInt64Array (can use cast)

Actually, once we need to support nulls, we will need a loop here anyway and we can convert the offsets from i32 to u64 as we are inserting the null_adjustment_offset.

rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical.rs Outdated Show resolved Hide resolved
@raunaks13 raunaks13 changed the title feat: convert BinaryFieldEncoder to an array encoder feat: convert binary field encoding/decoding to array encoding/page decoding Jun 4, 2024
@raunaks13 raunaks13 requested a review from westonpace June 7, 2024 01:52
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along well. I can see a few reasons that "decode range doesn't start at beginning of page", "multiple decode ranges", and "multiple encode batches" might fail and left some notes.

rust/lance-encoding/src/encodings/logical/primitive.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduling and decoding look correct to me. We will still need to add null support and there are some potential performance improvements. We can handle those in follow-ups if you'd like. However, we will need to avoid using this encoder until its ready.

rust/lance-file/src/v2/reader.rs Outdated Show resolved Hide resolved
let decoded_part = indices_decode_task.decode()?;

let indices_array = decoded_part.as_primitive::<UInt32Type>();
let mut indices_vec = indices_array.values().to_vec();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's handle this in a future PR (for perf optimization) but it looks like we make several copies of the indices. There is one here. Then another one at normalized_indices.values().to_vec() and then another one at builder.append_slice. I think, before this loop, we should know how many indices will be in our final output, and so we should create a vec and preallocate capacity. All manipulations should be doable in-place from that point onwards.

Comment on lines +265 to +267
let target_vec = target_offsets.values();
let normalized_array: PrimitiveArray<UInt32Type> =
target_vec.iter().map(|x| x - target_vec[0]).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good thinking.

However, there is an interesting peculiarity about list/string arrays in Arrow. You are technically allowed to have the first offset be non-zero. This is because Arrow arrays really want to support zero-copy slicing. So this normalization is not strictly neccessary.

This can lead to quite a few subtle bugs (and it can make unit testing harder) so I'm not sure this normalization is altogether bad. Let's leave it in for now and consider removing later.

rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
// Zero offset is removed from the start of the offsets array
// The indices array is computed across all arrays in the vector
fn get_indices_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
let mut indices_builder = Int32Builder::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to preallocate the capacity here.

rust/lance-encoding/src/encodings/physical/binary.rs Outdated Show resolved Hide resolved
Comment on lines 360 to 373
fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
let mut bytes_builder = UInt8Builder::new();
arrays.iter().for_each(|arr| {
let string_arr = arrow_array::cast::as_string_array(arr);
let values = string_arr.values().to_vec();
bytes_builder.append_slice(&values);
// let bytes_arr = Arc::new(UInt8Array::from(values)) as ArrayRef;
// Some(bytes_arr)
});

let final_array = Arc::new(bytes_builder.finish()) as ArrayRef;

vec![final_array]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we aren't manipulating the bytes at all. It would be great if we could avoid any copies. It's not obvious but there is a way to go from StringArray to PrimitiveArray<u8> without making a copy. So we should be able to do something like...

fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    arrays.iter().map(|arr| {
      // zero-copy conversion from arr to uint8 array
    }).collect::<Vec<_>>()
}

@@ -234,6 +237,15 @@ impl CoreArrayEncodingStrategy {
*dimension as u32,
)))))
}
DataType::Utf8 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to be careful about these changes until we have null support in the string encoder / decoder. We don't want to switch over before that. I think it would be good to support nulls in a follow-up PR. Can we revert these changes (the changes picking the new array decoder/encoder over the old field decoder/encoder) for now?

Or you can guard them with an environment variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guarding for now. There are a couple locations where I had to add the check

@raunaks13 raunaks13 merged commit 2febf2d into lancedb:main Jun 20, 2024
17 of 19 checks passed
@raunaks13 raunaks13 changed the title feat: convert binary field encoding/decoding to array encoding/page decoding feat: convert binary logical encoding/decoding to physical array encoding/page decoding Jun 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants