feat: convert binary logical encoding/decoding to physical array encoding/page decoding #2426
Conversation
```rust
let mut int_arr = Vec::new();

let mut cum_sz: u32 = 0;
for i in 0..string_arr.len() {
    let s = string_arr.value(i);
    let sz = s.len() as u32;
    cum_sz += sz;
    int_arr.push(cum_sz);
}
```
There is a way we can do this without making a copy of the data (not entirely true, a copy is made during `cast`, but we can rely on `cast` being as optimized as possible):

- Get the offsets from the `StringArray` as an `OffsetBuffer<i32>`
- Get the nulls from the `StringArray`
- Convert the nulls + offsets into an array (`OffsetBuffer<i32>` -> `Int32Array`)
- Cast from `Int32Array` -> `UInt64Array` (can use `cast`)

Actually, once we need to support nulls, we will need a loop here anyway, and we can convert the offsets from i32 to u64 as we insert the `null_adjustment_offset`.
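The single-pass conversion suggested above could be sketched as follows in plain Rust. This is illustrative only: the real loop would also consult the null buffer and apply the `null_adjustment_offset` mentioned in the comment, which is elided here.

```rust
// Sketch: widen i32 string offsets to u64 in one pass.
// A real implementation would also handle nulls at this point, applying
// the null_adjustment_offset described in the review comment (elided).
fn widen_offsets(offsets: &[i32]) -> Vec<u64> {
    let mut out = Vec::with_capacity(offsets.len());
    for &off in offsets {
        out.push(off as u64);
    }
    out
}
```

Going through arrow's `cast` kernel (`Int32Array` -> `UInt64Array`) should achieve the same widening without a hand-written loop, at the cost of losing the chance to fold in the null adjustment in the same pass.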
This is coming along well. I can see a few reasons that "decode range doesn't start at beginning of page", "multiple decode ranges", and "multiple encode batches" might fail, and I've left some notes.
The scheduling and decoding look correct to me. We will still need to add null support, and there are some potential performance improvements; we can handle those in follow-ups if you'd like. However, we will need to avoid using this encoder until it's ready.
```rust
let decoded_part = indices_decode_task.decode()?;

let indices_array = decoded_part.as_primitive::<UInt32Type>();
let mut indices_vec = indices_array.values().to_vec();
```
Let's handle this in a future PR (as a perf optimization), but it looks like we make several copies of the indices: one here, another at `normalized_indices.values().to_vec()`, and another at `builder.append_slice`. Before this loop we should already know how many indices will be in the final output, so we can create a vec with preallocated capacity; all manipulations should be doable in-place from that point onward.
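The preallocation idea can be sketched in plain Rust; `gather_indices` and the `chunks` shape here are hypothetical stand-ins for the decoder's per-page index slices:

```rust
// Sketch: compute the final index count up front, allocate once, then
// fill in place instead of building intermediate copies per chunk.
fn gather_indices(chunks: &[Vec<u32>]) -> Vec<u32> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total);
    for chunk in chunks {
        // In the real decoder, normalization/adjustment would happen here,
        // writing directly into `out` rather than into a temporary vec.
        out.extend_from_slice(chunk);
    }
    debug_assert_eq!(out.len(), total);
    out
}
```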
```rust
let target_vec = target_offsets.values();
let normalized_array: PrimitiveArray<UInt32Type> =
    target_vec.iter().map(|x| x - target_vec[0]).collect();
```
This is good thinking.

However, there is an interesting peculiarity about list/string arrays in Arrow: you are technically allowed to have the first offset be non-zero. This is because Arrow arrays really want to support zero-copy slicing, so this normalization is not strictly necessary. On the other hand, non-zero starting offsets can lead to quite a few subtle bugs (and can make unit testing harder), so I'm not sure the normalization is altogether bad. Let's leave it in for now and consider removing it later.
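To illustrate the peculiarity: slicing a string array zero-copy just narrows the offsets window, so the first visible offset can be non-zero, and the normalization above re-bases it. A plain-Rust sketch (the offset values are made up for illustration):

```rust
// Sketch: offsets of a sliced string array may start at a non-zero value.
// Normalizing subtracts the first offset so values become page-relative.
fn normalize_offsets(offsets: &[u32]) -> Vec<u32> {
    let first = offsets[0];
    offsets.iter().map(|x| x - first).collect()
}

// e.g. ["aa", "bbb", "c"] has offsets [0, 2, 5, 6]; a zero-copy slice that
// skips the first string carries the window [2, 5, 6], which normalizes
// to [0, 3, 4].
```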
```rust
// Zero offset is removed from the start of the offsets array
// The indices array is computed across all arrays in the vector
fn get_indices_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    let mut indices_builder = Int32Builder::new();
```
We should be able to preallocate the capacity here.
```rust
fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    let mut bytes_builder = UInt8Builder::new();
    arrays.iter().for_each(|arr| {
        let string_arr = arrow_array::cast::as_string_array(arr);
        let values = string_arr.values().to_vec();
        bytes_builder.append_slice(&values);
        // let bytes_arr = Arc::new(UInt8Array::from(values)) as ArrayRef;
        // Some(bytes_arr)
    });

    let final_array = Arc::new(bytes_builder.finish()) as ArrayRef;

    vec![final_array]
}
```
In this case we aren't manipulating the bytes at all, so it would be great if we could avoid any copies. It's not obvious, but there is a way to go from `StringArray` to `PrimitiveArray<u8>` without making a copy. So we should be able to do something like...
```rust
fn get_bytes_from_string_arrays(arrays: &[ArrayRef]) -> Vec<ArrayRef> {
    arrays.iter().map(|arr| {
        // zero-copy conversion from arr to uint8 array
    }).collect::<Vec<_>>()
}
```
```diff
@@ -234,6 +237,15 @@ impl CoreArrayEncodingStrategy {
             *dimension as u32,
         )))))
     }
     DataType::Utf8 => {
```
We will need to be careful about these changes until we have null support in the string encoder/decoder; we don't want to switch over before that. I think it would be good to add null support in a follow-up PR. Can we revert these changes (the ones picking the new array decoder/encoder over the old field decoder/encoder) for now? Or you can guard them with an environment variable.
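An env-var guard could look roughly like the following. The variable name `LANCE_USE_EXPERIMENTAL_STRING_ENCODER` is purely illustrative, not an actual flag in the codebase:

```rust
// Sketch: opt in to the new encoder only when an (illustrative,
// hypothetical) environment variable is set by the user.
fn use_new_string_encoder() -> bool {
    std::env::var("LANCE_USE_EXPERIMENTAL_STRING_ENCODER")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}
```

The check would then gate the `DataType::Utf8` arm in the encoding strategy, falling back to the old field encoder when the variable is unset.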
Guarding for now. There are a couple of locations where I had to add the check.

This should also enable adding a dictionary array encoder (which will help add dictionary encoding functionality, ref #2409).