perf: improve inverted index performance #2574
Conversation
- add a stopword filter, to avoid words that occur everywhere causing very bad performance
- store tokens, the inverted list, and docs in `HashMap`
- store docs in `LargeStringArray`, because a single doc could be large, so the total length could exceed `i32::MAX`
- more efficient updating of the inverted list

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov Report — Attention: Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##               main    #2574      +/-   ##
==========================================
+ Coverage     79.91%   79.95%   +0.04%
==========================================
  Files           212      212
  Lines         61639    61658      +19
  Branches      61639    61658      +19
==========================================
+ Hits          49256    49298      +42
+ Misses         9448     9436      -12
+ Partials       2935     2924      -11
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
let freq = *freq as f32;
let bm25 = bm25.entry(row_id).or_insert(0.0);
*bm25 += self.idf(row_freq.len()) * freq * (K1 + 1.0)
    / (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));
Is this formula correct? I would think `- B + B` is zero, so this looks suspicious. What is this based on? (Perhaps you should have a link to this in the method?)
Suggested change:
-    / (freq + K1 * (1.0 - B + B * self.docs.num_tokens(row_id) as f32 / avgdl));
+    / (freq + K1 * (1.0 * self.docs.num_tokens(row_id) as f32 / avgdl));
Sure, just added a reference link for this method.
It's correct: it's `1.0 - B + (B * nq as f32 / avgdl)`, but the `*` operator has higher precedence, so I omitted the parentheses.
let mut token_id_builder = UInt32Builder::with_capacity(self.tokens.len());
let mut frequency_builder = UInt64Builder::with_capacity(self.tokens.len());
Since these aren't nullable, I think the faster approach would be to use a `Vec`, and then convert from the vec into the appropriate array at the end. This could save some cycles involved in handling the null buffer.
Can arrow reuse the vector's data?
Here, with the array builder, I think it doesn't need to copy the data from the builder to the array again.
But for a vector it does need to, right?
Arrays can reuse the vector's data without copying.
    .zip(token_id_col.iter())
    .zip(frequency_col.iter())
{
    let token = token.unwrap();
    let token_id = token_id.unwrap();
    let frequency = frequency.unwrap();
You can avoid the null checks / unwraps by iterating over the values buffer:
Suggested change:
-    .zip(token_id_col.iter())
-    .zip(frequency_col.iter())
-{
-    let token = token.unwrap();
-    let token_id = token_id.unwrap();
-    let frequency = frequency.unwrap();
+    .zip(token_id_col.values().iter())
+    .zip(frequency_col.values().iter())
+{
+    let token = token.unwrap();
for ((token_id, row_ids), frequencies) in token_col
    .iter()
    .zip(row_ids_col.iter())
    .zip(frequencies_col.iter())
{
    let token_id = token_id.unwrap();
Same thing here, if it's non-null:
Suggested change:
-for ((token_id, row_ids), frequencies) in token_col
-    .iter()
-    .zip(row_ids_col.iter())
-    .zip(frequencies_col.iter())
-{
-    let token_id = token_id.unwrap();
+for ((token_id, row_ids), frequencies) in token_col
+    .values()
+    .iter()
+    .zip(row_ids_col.iter())
+    .zip(frequencies_col.iter())
+{
for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
    let row_id = row_id.unwrap();
    let num_tokens = num_tokens.unwrap();
Suggested change:
-for (row_id, num_tokens) in row_id_col.iter().zip(num_tokens_col.iter()) {
-    let row_id = row_id.unwrap();
-    let num_tokens = num_tokens.unwrap();
+for (row_id, num_tokens) in row_id_col.values().iter().zip(num_tokens_col.values().iter()) {
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
let stopword_filter =
    tantivy::tokenizer::StopWordFilter::new(tantivy::tokenizer::Language::English).unwrap();
let mut tokenizer = tantivy::tokenizer::TextAnalyzer::builder(
-    tantivy::tokenizer::SimpleTokenizer::default(),
+    stopword_filter.transform(tantivy::tokenizer::SimpleTokenizer::default()),
)
.build();
I assume we'll make this all configurable later, right?
Yes, just making it work for now.