feat: convert binary logical encoding/decoding to physical array encoding/page decoding #2426
Conversation
```rust
let mut int_arr = Vec::new();

let mut cum_sz: u32 = 0;
for i in 0..string_arr.len() {
    let s = string_arr.value(i);
    let sz = s.len() as u32;
    cum_sz += sz;
    int_arr.push(cum_sz);
}
```
There is a way we can do this without making a copy of the data (not entirely true, a copy is made during `cast`, but we can rely on `cast` being as optimized as possible):

- Get the offsets from the `StringArray` as an `OffsetBuffer<i32>`
- Get the nulls from the `StringArray`
- Convert the nulls + offsets into an array (`OffsetBuffer<i32>` -> `Int32Array`)
- Cast from `Int32Array` -> `UInt64Array` (can use `cast`)

Actually, once we need to support nulls, we will need a loop here anyway, and we can convert the offsets from i32 to u64 as we insert the `null_adjustment_offset`.
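The single-pass conversion suggested above could be sketched as follows in plain Rust. This is illustrative only: the real loop would also consult the null buffer and apply the `null_adjustment_offset` mentioned in the comment, which is elided here.

```rust
// Sketch: widen i32 string offsets to u64 in one pass.
// A real implementation would also handle nulls at this point, applying
// the null_adjustment_offset described in the review comment (elided).
fn widen_offsets(offsets: &[i32]) -> Vec<u64> {
    let mut out = Vec::with_capacity(offsets.len());
    for &off in offsets {
        out.push(off as u64);
    }
    out
}
```

Going through arrow's `cast` kernel (`Int32Array` -> `UInt64Array`) should achieve the same widening without a hand-written loop, at the cost of losing the chance to fold in the null adjustment in the same pass.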
This is coming along well. I can see a few reasons that "decode range doesn't start at beginning of page", "multiple decode ranges", and "multiple encode batches" might fail, and I've left some notes.
The scheduling and decoding look correct to me. We will still need to add null support, and there are some potential performance improvements; we can handle those in follow-ups if you'd like. However, we will need to avoid using this encoder until it's ready.
```rust
let decoded_part = indices_decode_task.decode()?;

let indices_array = decoded_part.as_primitive::<UInt32Type>();
let mut indices_vec = indices_array.values().to_vec();
```
Let's handle this in a future PR (as a perf optimization), but it looks like we make several copies of the indices: one here, another at `normalized_indices.values().to_vec()`, and another at `builder.append_slice`. Before this loop we should already know how many indices will be in the final output, so we can create a vec with preallocated capacity; all manipulations should be doable in-place from that point onward.
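The preallocation idea can be sketched in plain Rust; `gather_indices` and the `chunks` shape here are hypothetical stand-ins for the decoder's per-page index slices:

```rust
// Sketch: compute the final index count up front, allocate once, then
// fill in place instead of building intermediate copies per chunk.
fn gather_indices(chunks: &[Vec<u32>]) -> Vec<u32> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total);
    for chunk in chunks {
        // In the real decoder, normalization/adjustment would happen here,
        // writing directly into `out` rather than into a temporary vec.
        out.extend_from_slice(chunk);
    }
    debug_assert_eq!(out.len(), total);
    out
}
```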
```rust
let target_vec = target_offsets.values();
let normalized_array: PrimitiveArray<UInt32Type> =
    target_vec.iter().map(|x| x - target_vec[0]).collect();
```
This is good thinking.

However, there is an interesting peculiarity about list/string arrays in Arrow: you are technically allowed to have the first offset be non-zero. This is because Arrow arrays really want to support zero-copy slicing, so this normalization is not strictly necessary. On the other hand, non-zero starting offsets can lead to quite a few subtle bugs (and can make unit testing harder), so I'm not sure the normalization is altogether bad. Let's leave it in for now and consider removing it later.
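To illustrate the peculiarity: slicing a string array zero-copy just narrows the offsets window, so the first visible offset can be non-zero, and the normalization above re-bases it. A plain-Rust sketch (the offset values are made up for illustration):

```rust
// Sketch: offsets of a sliced string array may start at a non-zero value.
// Normalizing subtracts the first offset so values become page-relative.
fn normalize_offsets(offsets: &[u32]) -> Vec<u32> {
    let first = offsets[0];
    offsets.iter().map(|x| x - first).collect()
}

// e.g. ["aa", "bbb", "c"] has offsets [0, 2, 5, 6]; a zero-copy slice that
// skips the first string carries the window [2, 5, 6], which normalizes
// to [0, 3, 4].
```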
```rust
// Zero offset is removed from the start of the offsets array
// The indices array is computed across all arrays in the vector
fn get_indices_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    let mut indices_builder = Int32Builder::new();
```
We should be able to preallocate the capacity here.
```rust
fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    let mut bytes_builder = UInt8Builder::new();
    arrays.iter().for_each(|arr| {
        let string_arr = arrow_array::cast::as_string_array(arr);
        let values = string_arr.values().to_vec();
        bytes_builder.append_slice(&values);
        // let bytes_arr = Arc::new(UInt8Array::from(values)) as ArrayRef;
        // Some(bytes_arr)
    });

    let final_array = Arc::new(bytes_builder.finish()) as ArrayRef;

    vec![final_array]
}
```
In this case we aren't manipulating the bytes at all, so it would be great if we could avoid any copies. It's not obvious, but there is a way to go from `StringArray` to `PrimitiveArray<u8>` without making a copy. So we should be able to do something like...
```rust
fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    arrays.iter().map(|arr| {
        // zero-copy conversion from arr to uint8 array
    }).collect::<Vec<_>>()
}
```
```diff
@@ -234,6 +237,15 @@ impl CoreArrayEncodingStrategy {
             *dimension as u32,
         )))))
     }
     DataType::Utf8 => {
```
We will need to be careful about these changes until we have null support in the string encoder/decoder; we don't want to switch over before that. I think it would be good to add null support in a follow-up PR. Can we revert these changes (the ones picking the new array decoder/encoder over the old field decoder/encoder) for now? Or you can guard them with an environment variable.
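An env-var guard could look roughly like the following. The variable name `LANCE_USE_EXPERIMENTAL_STRING_ENCODER` is purely illustrative, not an actual flag in the codebase:

```rust
// Sketch: opt in to the new encoder only when an (illustrative,
// hypothetical) environment variable is set by the user.
fn use_new_string_encoder() -> bool {
    std::env::var("LANCE_USE_EXPERIMENTAL_STRING_ENCODER")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}
```

The check would then gate the `DataType::Utf8` arm in the encoding strategy, falling back to the old field encoder when the variable is unset.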
Guarding for now. There are a couple of locations where I had to add the check.

This should also enable adding a dictionary array encoder (which will help add dictionary encoding functionality, ref #2409).