-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: move IVF_HNSW_SQ & IVF_FLAT to new buliding & search path #2469
Conversation
BubbleCal
commented
Jun 14, 2024
•
edited
Loading
edited
- IVF_HNSW_SQ for new search/build path
- IVF_FLAT e2e pass
- support to train quantizer with new index builder
- fix partition order broken after building
- clean IVF related types
- index builder method chaining for customizing
- impl merging deltas for new IVF_HNSW_SQ & IVF_FLAT
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2469 +/- ##
==========================================
+ Coverage 79.55% 79.63% +0.07%
==========================================
Files 208 207 -1
Lines 59351 59187 -164
Branches 59351 59187 -164
==========================================
- Hits 47217 47131 -86
+ Misses 9371 9268 -103
- Partials 2763 2788 +25
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big refactor! Thanks for working on this. I have a few minor comments but it looks good overall.
_: &Query, | ||
_: Arc<dyn PreFilter>, | ||
) -> Result<RecordBatch> { | ||
todo!("panic") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these should all be unimplemented!()
?
let row_ids = result | ||
.column_by_name(ROW_ID) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, this can be simplified to
let row_ids = result | |
.column_by_name(ROW_ID) | |
.unwrap() | |
let row_ids = result[ROW_ID] |
rust/lance/src/index/vector/ivf.rs
Outdated
existing_indices: &[Arc<dyn Index>], | ||
options: &OptimizeOptions, | ||
) -> Result<(Uuid, usize)> { | ||
// Senity check the indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Senity check the indices | |
// Sanity check the indices |
match index_type { | ||
(SubIndexType::Flat, QuantizationType::Flat) => { | ||
IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new_incremental( | ||
dataset.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method take an Arc<Dataset>
instead if we need to clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, will do it next PR cause this requires to modify the whole path
let partition_build_order = (0..ivf.num_partitions()) | ||
.map(|partition_id| reader.partiton_size(partition_id)) | ||
.collect::<Result<Vec<_>>>()? | ||
// sort by partition size in descending order | ||
.into_iter() | ||
.enumerate() | ||
.map(|(idx, x)| (x, idx)) | ||
.sorted() | ||
.rev() | ||
.map(|(_, idx)| idx) | ||
.collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How expensive is reader.partition_size()
? I feel like this could simply be:
let partition_build_order = (0..ivf.num_partitions()) | |
.map(|partition_id| reader.partiton_size(partition_id)) | |
.collect::<Result<Vec<_>>>()? | |
// sort by partition size in descending order | |
.into_iter() | |
.enumerate() | |
.map(|(idx, x)| (x, idx)) | |
.sorted() | |
.rev() | |
.map(|(_, idx)| idx) | |
.collect::<Vec<_>>(); | |
let mut partition_build_order = (0..ivf.num_partitions()).collect::<Vec<_>>(); | |
partition_build_order.sort_by_key(|part_id| -reader.partition_size(part_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can't be done because:
partition_size
returnsResult<usize>
- can'y apply "-" on usize
but i simplify this a little
); | ||
} | ||
|
||
match reader.partiton_size(partition)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling
match reader.partiton_size(partition)? { | |
match reader.partition_size(partition)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is large so I think the best path forward is just to go ahead and merge it and we can work through the details as we test it.
How do I use the new path today? Is it always used for HNSW indices (and the old path still always used for IVF_PQ indices)?
Do you anticipate using the new path for IVF_PQ indices? Is there some way I can try that out today? Or do I need to hack in some code changes to test that?
This new path can be used for IVF_FLAT and IVF_HNSW_SQ only now. The current PQ implementation is a little complicated so it needs more work to make it work with the new path. So far there's no simple way to get IVF_PQ work on the new path |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…to hnsw-refactor
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…ancedb#2469) - IVF_HNSW_SQ for new search/build path - IVF_FLAT e2e pass - support to train quantizer with new index builder - fix partition order broken after building - clean IVF related types - index builder method chaining for customizing - impl merging deltas for new IVF_HNSW_SQ & IVF_FLAT --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>