Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid eagerly loading StoredFieldsReader in fetch phase #83693

Merged
merged 5 commits into from
Feb 9, 2022

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Feb 8, 2022

Every time we create a hit document, we create a new SourceLookup and call
setSegmentAndDocument. This in turn creates a new StoredFieldsReader, which is
pretty expensive. In scenarios where you are retrieving a lot of hits, this can
add significant overhead. Prior to version 7.11, we did not create a new
SourceLookup per hit, so this is a performance regression.

This PR updates setSegmentAndDocument to avoid eagerly creating a
new StoredFieldsReader (through StoredFieldsReader#getMergeInstance).

Closes #82777.

Every time we create a hit document, we create a new SourceLookup and call
setSegmentAndDocument. This in turn creates a new StoredFieldsReader, which is
pretty expensive. In scenarios where you are retrieving a lot of hits, this can
add significant overhead. Prior to version 7.11, we did not create a new
SourceLookup per hit, so this is a performance regression.

This PR updates setSegmentAndDocument to avoid eagerly creating a
new StoredFieldsReader (through StoredFieldsReader#getMergeInstance).
@jtibshirani jtibshirani added >bug :Search/Search Search-related issues that do not fall into other categories v8.1.0 v7.17.1 v8.0.1 labels Feb 8, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @jtibshirani, I've created a changelog YAML for you.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Feb 8, 2022

Some notes:

  • An alternative would be to introduce SourceLookup#setDoc, which sets the document without the leaf reader context. This felt less solid, and I wanted to avoid touching the HitContext -- source handling in the fetch phase is very complex and doesn't have strong test coverage!
  • This gives more motivation for adding fetch benchmarks (rally-tracks#199). I'm also guessing that EQL tests might have caught it, which didn't exist back in 7.11.


import java.util.function.Supplier;

public class MemoizedSupplier<T> implements Supplier<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restored this class, which was only very recently deleted because it was unused.

// get better sequential access.
if (context.reader() instanceof SequentialStoredFieldsLeafReader lf) {
// Avoid eagerly loading the stored fields reader, since this can be expensive
Supplier<StoredFieldsReader> supplier = new MemoizedSupplier<>(lf::getSequentialStoredFieldsReader);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue is that getSequentialStoredFieldsReader calls StoredFieldsReader#getMergeInstance, which creates a new stored fields reader. I wonder if this could be optimized in Lucene to avoid recreating it every time 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we'd like to push as much of this up into Lucene as possible, as it is all hacks at the moment and if anything changes in how stored fields are merged then we are in trouble.

@jtibshirani jtibshirani marked this pull request as ready for review February 9, 2022 05:39
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Feb 9, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// get better sequential access.
if (context.reader() instanceof SequentialStoredFieldsLeafReader lf) {
// Avoid eagerly loading the stored fields reader, since this can be expensive
Supplier<StoredFieldsReader> supplier = new MemoizedSupplier<>(lf::getSequentialStoredFieldsReader);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we'd like to push as much of this up into Lucene as possible, as it is all hacks at the moment and if anything changes in how stored fields are merged then we are in trouble.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @jtibshirani.

@elasticsearchmachine
Copy link
Collaborator

Hi @jtibshirani, I've updated the changelog YAML for you.

@jtibshirani
Copy link
Contributor Author

Thanks for the reviews! I forgot to link the original issue: #82777.

@jtibshirani
Copy link
Contributor Author

@elasticmachine ok to test

@jtibshirani jtibshirani merged commit 4e28da4 into elastic:master Feb 9, 2022
@jtibshirani jtibshirani deleted the stored-fields-reader branch February 9, 2022 19:20
jtibshirani added a commit that referenced this pull request Feb 9, 2022
Every time we create a hit document, we create a new SourceLookup and call
setSegmentAndDocument. This in turn creates a new StoredFieldsReader, which is
pretty expensive. In scenarios where you are retrieving a lot of hits, this can
add significant overhead. Prior to version 7.11, we did not create a new
SourceLookup per hit, so this is a performance regression.

This PR updates setSegmentAndDocument to avoid eagerly creating a
new StoredFieldsReader (through StoredFieldsReader#getMergeInstance).
jtibshirani added a commit that referenced this pull request Feb 17, 2022
Every time we create a hit document, we create a new SourceLookup and call
setSegmentAndDocument. This in turn creates a new StoredFieldsReader, which is
pretty expensive. In scenarios where you are retrieving a lot of hits, this can
add significant overhead. Prior to version 7.11, we did not create a new
SourceLookup per hit, so this is a performance regression.

This PR updates setSegmentAndDocument to avoid eagerly creating a
new StoredFieldsReader (through StoredFieldsReader#getMergeInstance).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.17.1 v8.0.1 v8.1.0 v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance degradation for new HitContext constructor when only pulling docvalues
5 participants