Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: move IVF_HNSW_SQ & IVF_FLAT to new buliding & search path #2469

Merged
merged 30 commits into from
Jun 25, 2024

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Jun 14, 2024

  • IVF_HNSW_SQ for new search/build path
  • IVF_FLAT e2e pass
  • support to train quantizer with new index builder
  • fix partition order broken after building
  • clean IVF related types
  • index builder method chaining for customizing
  • impl merging deltas for new IVF_HNSW_SQ & IVF_FLAT

Copy link

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@BubbleCal BubbleCal changed the title refactor: IVF_HNSW_SQ & IVF_FLAT refactor: support IVF_HNSW_SQ & IVF_FLAT e2e Jun 14, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal changed the title refactor: support IVF_HNSW_SQ & IVF_FLAT e2e refactor: move IVF_HNSW_SQ & IVF_FLAT to new buliding & search path Jun 18, 2024
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 66.92708% with 381 lines in your changes missing coverage. Please review.

Project coverage is 79.63%. Comparing base (7cd11b8) to head (12d7a90).
Report is 6 commits behind head on main.

Files Patch % Lines
rust/lance/src/index/vector/builder.rs 70.00% 43 Missing and 56 partials ⚠️
rust/lance/src/index/vector/ivf.rs 61.93% 48 Missing and 11 partials ⚠️
rust/lance-index/src/vector/flat/index.rs 43.47% 26 Missing ⚠️
rust/lance/src/index.rs 62.31% 0 Missing and 26 partials ⚠️
rust/lance-index/src/vector/quantizer.rs 45.23% 20 Missing and 3 partials ⚠️
rust/lance-index/src/vector/hnsw/index.rs 5.00% 19 Missing ⚠️
rust/lance/src/index/vector/ivf/v2.rs 89.59% 9 Missing and 9 partials ⚠️
rust/lance-index/src/vector/ivf/storage.rs 77.92% 14 Missing and 3 partials ⚠️
rust/lance/src/session/index_extension.rs 0.00% 17 Missing ⚠️
rust/lance/src/index/vector/pq.rs 44.44% 15 Missing ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2469      +/-   ##
==========================================
+ Coverage   79.55%   79.63%   +0.07%     
==========================================
  Files         208      207       -1     
  Lines       59351    59187     -164     
  Branches    59351    59187     -164     
==========================================
- Hits        47217    47131      -86     
+ Misses       9371     9268     -103     
- Partials     2763     2788      +25     
Flag Coverage Δ
unittests 79.63% <66.92%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big refactor! Thanks for working on this. I have a few minor comments but it looks good overall.

_: &Query,
_: Arc<dyn PreFilter>,
) -> Result<RecordBatch> {
todo!("panic")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these should all be unimplemented!()?

Comment on lines 558 to 560
let row_ids = result
.column_by_name(ROW_ID)
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this can be simplified to

Suggested change
let row_ids = result
.column_by_name(ROW_ID)
.unwrap()
let row_ids = result[ROW_ID]

existing_indices: &[Arc<dyn Index>],
options: &OptimizeOptions,
) -> Result<(Uuid, usize)> {
// Senity check the indices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Senity check the indices
// Sanity check the indices

match index_type {
(SubIndexType::Flat, QuantizationType::Flat) => {
IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new_incremental(
dataset.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method take an Arc<Dataset> instead if we need to clone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, will do it next PR cause this requires to modify the whole path

Comment on lines 321 to 331
let partition_build_order = (0..ivf.num_partitions())
.map(|partition_id| reader.partiton_size(partition_id))
.collect::<Result<Vec<_>>>()?
// sort by partition size in descending order
.into_iter()
.enumerate()
.map(|(idx, x)| (x, idx))
.sorted()
.rev()
.map(|(_, idx)| idx)
.collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive is reader.partition_size()? I feel like this could simply be:

Suggested change
let partition_build_order = (0..ivf.num_partitions())
.map(|partition_id| reader.partiton_size(partition_id))
.collect::<Result<Vec<_>>>()?
// sort by partition size in descending order
.into_iter()
.enumerate()
.map(|(idx, x)| (x, idx))
.sorted()
.rev()
.map(|(_, idx)| idx)
.collect::<Vec<_>>();
let mut partition_build_order = (0..ivf.num_partitions()).collect::<Vec<_>>();
partition_build_order.sort_by_key(|part_id| -reader.partition_size(part_id));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can't be done because:

  • partition_size returns Result<usize>
  • can'y apply "-" on usize
    but i simplify this a little

);
}

match reader.partiton_size(partition)? {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling

Suggested change
match reader.partiton_size(partition)? {
match reader.partition_size(partition)? {

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is large so I think the best path forward is just to go ahead and merge it and we can work through the details as we test it.

How do I use the new path today? Is it always used for HNSW indices (and the old path still always used for IVF_PQ indices)?

Do you anticipate using the new path for IVF_PQ indices? Is there some way I can try that out today? Or do I need to hack in some code changes to test that?

@BubbleCal
Copy link
Contributor Author

This PR is large so I think the best path forward is just to go ahead and merge it and we can work through the details as we test it.

How do I use the new path today? Is it always used for HNSW indices (and the old path still always used for IVF_PQ indices)?

Do you anticipate using the new path for IVF_PQ indices? Is there some way I can try that out today? Or do I need to hack in some code changes to test that?

This new path can be used for IVF_FLAT and IVF_HNSW_SQ only now. The current PQ implementation is a little complicated so it needs more work to make it work with the new path.

So far there's no simple way to get IVF_PQ work on the new path

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal merged commit f51c5f0 into lancedb:main Jun 25, 2024
21 of 22 checks passed
triandco pushed a commit to triandco/lance that referenced this pull request Jun 25, 2024
…ancedb#2469)

- IVF_HNSW_SQ for new search/build path
- IVF_FLAT e2e pass
- support to train quantizer with new index builder
- fix partition order broken after building
- clean IVF related types
- index builder method chaining for customizing
- impl merging deltas for new IVF_HNSW_SQ & IVF_FLAT

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants