-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: stable row id support in queries #2452
Conversation
7ccdf9b
to
009efcd
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2452 +/- ##
==========================================
+ Coverage 79.63% 79.78% +0.14%
==========================================
Files 207 207
Lines 59179 59590 +411
Branches 59179 59590 +411
==========================================
+ Hits 47129 47545 +416
+ Misses 9260 9253 -7
- Partials 2790 2792 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
rust/lance/src/io/exec/knn.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes aren't related necessarily. I just noticed while debugging that the output schema was a lie.
2850161
to
3616503
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I can't say I understand all the internals enough to know what the perf impacts will be to prefiltering but, if there are any, we can address those later I think.
fn without_column(&self, column_name: &str) -> Schema { | ||
let fields: Vec<FieldRef> = self | ||
.fields() | ||
.iter() | ||
.filter(|f| f.name() != column_name) | ||
.cloned() | ||
.collect(); | ||
Self::new_with_metadata(fields, self.metadata.clone()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we document the contract with field ids (this method leaves them unchanged it appears)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operates on Arrow schemas, so no implications for field ids.
|
||
/// Insert a range of values into the set | ||
pub fn insert_range<R: RangeBounds<u64>>(&mut self, range: R) -> u64 { | ||
let (mut start_high, mut start_low) = match range.start_bound() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused here. start_high
/ start_low
are u32
. Are they fragment IDs? Or row offsets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we move to generic u64 row ids, the high and low bits no longer consistently have the semantics of fragment ids and row offsets. Thus I rename them "high" and "low".
/// These row ids may either be stable-style (where they can be an incrementing | ||
/// u64 sequence) or address style, where they are a fragment id and a row offset. | ||
/// When address style, this supports setting entire fragments as selected, | ||
/// without needing to enumerate all the ids in the fragment. | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a minute to figure this out. Even if these are stable-style we store them in the same nested 32-bit map structure? (though we will have far fewer entries in the outer map)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. To be fair, the roaring crate has a RoaringTreeMap it uses to store u64 values, so it’s not all different from that. The vast majority of tables will just use a single RoaringBitmap here, since most won’t have over 4 billions rows I expect.
// We don't test removing from a full fragment, because that would take | ||
// a lot of memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been making large tests and just marking them ignore
. It's helpful during refactoring to manually make sure I didn't botch anything but I don't know how maintainable it is.
@@ -269,6 +269,39 @@ impl RowIdSequence { | |||
} | |||
} | |||
|
|||
impl From<&RowIdSequence> for RowIdTreeMap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These utilities we are building are very cool.
@@ -551,7 +551,7 @@ impl Transaction { | |||
manifest.tag.clone_from(&self.tag); | |||
|
|||
if config.auto_set_feature_flags { | |||
apply_feature_flags(&mut manifest)?; | |||
apply_feature_flags(&mut manifest, config.use_move_stable_row_ids)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what are the rules on this config option? Is it the same as the legacy format rules (only applies to new datasets) or can you adjust the row id style on demand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, it should only be sett-able on new datasets. And once set it cannot be unset.
/// Sometimes this will be a block list of row ids that are deleted, based | ||
/// on the deletion files in the fragments. If stable row ids are used and | ||
/// there are missing fragments, this may instead be an allow list, since | ||
/// we can't easily compute the block list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...if it is an allow list I think this may have some performance impact on prefiltering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is possible. I will be doing further benchmarking to understand this later.
@@ -325,14 +334,14 @@ impl DisplayAs for MaterializeIndexExec { | |||
} | |||
} | |||
|
|||
struct FragIdIter { | |||
src: Arc<Vec<Fragment>>, | |||
struct FragIdIter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for overcoming my early fear of lifetime variables 😆
retain_fragments(&mut allow_list, fragments, dataset).await?; | ||
|
||
if let Some(allow_list_iter) = allow_list.row_ids() { | ||
Ok(allow_list_iter.map(u64::from).collect::<Vec<_>>()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is map(u64::from)
needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.row_ids()
returns RowAddress
while the return type is a u64. Should be a no-op when compiled.
async fn row_ids_for_mask( | ||
mask: RowIdMask, | ||
dataset: &Dataset, | ||
fragments: &[Fragment], | ||
) -> Result<Vec<u64>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are potentially materializing a huge list here but I see now we always were. In the future this path should definitely be avoided when we have only a block list but I think we already have a TODO for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I'll definetely be examining the performance here closely soon.
c813c59
to
f2a070d
Compare
Part of #2307