Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle ignored fields directly in SourceValueFetcher #68738

Merged
merged 9 commits into from
Feb 16, 2021

Conversation

romseygeek
Copy link
Contributor

Currently, the value fetcher framework handles ignored fields by reading
the stored values of the _ignored metadata field, and passing these through
on calls to fetchValues(). However, this means that if a document has multiple
values indexed for a field, and one malformed value, then the fields API will
ignore everything, including the valid values, and return an empty list for this
document.

If a document source contains a malformed value, then it must have been
ignored at index time. Therefore, we can safely assume that if we get an
exception parsing values from source at fetch time, they were also ignored
at index time and they can be skipped. This commit moves this exception
handling directly into SourceValueFetcher and ArraySourceValueFetcher,
removing the need to inspect the _ignored metadata and fixing the case
of mixed valid and invalid values.

@romseygeek romseygeek added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.12.0 labels Feb 9, 2021
@romseygeek romseygeek self-assigned this Feb 9, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Feb 9, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a good direction to me. It addresses two issues that I've been concerned about:

  1. The fields option tries to reconstruct the document parsing logic. This logic is really complex and we sometimes hit edge cases. The current error mode for these cases is to fail the entire request. This means that Kibana Discover, which uses "fields": ["*"] can have a whole search fail because of one bad value in one document. (We should certainly aim to make this more robust, perhaps by sharing field loading logic with document parsing or working to simplify parsing behavior. But those efforts will take some time.)
  2. We don't handle ignored fields correctly in when fetching fields inner_hits. This is because the list of ignored fields is stored on the root document, and is not easily available in the inner hits context.

Related to point 1, I have one major open question about this approach: can we avoid swallowing legitimate errors? Otherwise we may have some broken parsing logic, but never realize and just silently drop values. Maybe we could be more specific about what types of exceptions we swallow. We could also consider returning a list of fields/ values that were ignored as part of the response.

@romseygeek
Copy link
Contributor Author

Maybe we could be more specific about what types of exceptions we swallow. We could also consider returning a list of fields/ values that were ignored as part of the response.

The first option is tricky because we're doing the Exception handling in SourceValueFetcher directly, and it doesn't know about any of the exceptions that may be thrown by the various Mapper implementations. I like the idea of the second option though - perhaps a new fetch phase option called show_malformed_values or something that would return each ignored value along with its parsing exception?

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the second option though - perhaps a new fetch phase option called show_malformed_values or something

I will give this some thought but am happy to move forward with this PR without an alternative ready for returning ignored fields/ values. It's a clear improvement over the current approach.

return fetcher.fetchValues(lookup);
}

public static List<?> fetchSourceValues(MappedFieldType fieldType, Object... values) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One motivation for this PR is to return well-formed values for a document field, even if its some of its values were ignored. So it'd be good to add a test for this case explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a yaml test for this

@@ -83,7 +83,7 @@ public TokenCountFieldMapper build(ContentPath contentPath) {
@Override
public ValueFetcher valueFetcher(SearchExecutionContext context, String format) {
if (hasDocValues() == false) {
return (lookup, ignoredFields) -> List.of();
return (lookup) -> List.of();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, could be lookup -> List.of().

try {
values.addAll((List<?>) parseSourceValue(sourceValue));
}
catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, we always put catch on same line as previous brace.

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

Copy link
Contributor

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @romseygeek ! It'd be great to get this into 7.12 since it fixes an edge case.

- 2
- match:
hits.hits.0.fields.products.0: { "manufacturer" : ["Supersoft"]}
- match:
hits.hits.0.fields.products.1: { "manufacturer" : ["HyperSmart"]}

---
"Test ignores malformed values while returning valid ones":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to move this to FieldFetcherTests, to prefer unit testing as much as possible.

@romseygeek
Copy link
Contributor Author

@elasticmachine update branch

@romseygeek
Copy link
Contributor Author

@elasticmachine update branch

@romseygeek romseygeek merged commit 8fba6e4 into elastic:master Feb 16, 2021
@romseygeek romseygeek deleted the fetch/ignored-fields branch February 16, 2021 15:19
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Feb 16, 2021
Currently, the value fetcher framework handles ignored fields by reading
the stored values of the _ignored metadata field, and passing these through
on calls to fetchValues(). However, this means that if a document has multiple
values indexed for a field, and one malformed value, then the fields API will
ignore everything, including the valid values, and return an empty list for this
document.

If a document source contains a malformed value, then it must have been
ignored at index time. Therefore, we can safely assume that if we get an
exception parsing values from source at fetch time, they were also ignored
at index time and they can be skipped. This commit moves this exception
handling directly into SourceValueFetcher and ArraySourceValueFetcher,
removing the need to inspect the _ignored metadata and fixing the case
of mixed valid and invalid values.
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Feb 17, 2021
Currently, the value fetcher framework handles ignored fields by reading
the stored values of the _ignored metadata field, and passing these through
on calls to fetchValues(). However, this means that if a document has multiple
values indexed for a field, and one malformed value, then the fields API will
ignore everything, including the valid values, and return an empty list for this
document.

If a document source contains a malformed value, then it must have been
ignored at index time. Therefore, we can safely assume that if we get an
exception parsing values from source at fetch time, they were also ignored
at index time and they can be skipped. This commit moves this exception
handling directly into SourceValueFetcher and ArraySourceValueFetcher,
removing the need to inspect the _ignored metadata and fixing the case
of mixed valid and invalid values.
romseygeek added a commit that referenced this pull request Feb 17, 2021
Currently, the value fetcher framework handles ignored fields by reading
the stored values of the _ignored metadata field, and passing these through
on calls to fetchValues(). However, this means that if a document has multiple
values indexed for a field, and one malformed value, then the fields API will
ignore everything, including the valid values, and return an empty list for this
document.

If a document source contains a malformed value, then it must have been
ignored at index time. Therefore, we can safely assume that if we get an
exception parsing values from source at fetch time, they were also ignored
at index time and they can be skipped. This commit moves this exception
handling directly into SourceValueFetcher and ArraySourceValueFetcher,
removing the need to inspect the _ignored metadata and fixing the case
of mixed valid and invalid values.
romseygeek added a commit that referenced this pull request Feb 17, 2021
romseygeek added a commit that referenced this pull request Mar 1, 2021
The ValueFetcher for geo_shape will shortcut the validation of its
source value if it detects that the source format and the requested
format are the same. This worked fine when malformed values were
dealt with by checking the _ignored metadata, but since #68738
we need to always validate source values at fetch time.

This commit removes this special shortcut logic, and adds tests
to check that geo_shape value fetchers do not return malformed
source inputs.

Fixes #69071
romseygeek added a commit to romseygeek/elasticsearch that referenced this pull request Mar 1, 2021
The ValueFetcher for geo_shape will shortcut the validation of its
source value if it detects that the source format and the requested
format are the same. This worked fine when malformed values were
dealt with by checking the _ignored metadata, but since elastic#68738
we need to always validate source values at fetch time.

This commit removes this special shortcut logic, and adds tests
to check that geo_shape value fetchers do not return malformed
source inputs.

Fixes elastic#69071
romseygeek added a commit that referenced this pull request Mar 1, 2021
The ValueFetcher for geo_shape will shortcut the validation of its
source value if it detects that the source format and the requested
format are the same. This worked fine when malformed values were
dealt with by checking the _ignored metadata, but since #68738
we need to always validate source values at fetch time.

This commit removes this special shortcut logic, and adds tests
to check that geo_shape value fetchers do not return malformed
source inputs.

Fixes #69071
romseygeek added a commit that referenced this pull request Mar 1, 2021
The ValueFetcher for geo_shape will shortcut the validation of its
source value if it detects that the source format and the requested
format are the same. This worked fine when malformed values were
dealt with by checking the _ignored metadata, but since #68738
we need to always validate source values at fetch time.

This commit removes this special shortcut logic, and adds tests
to check that geo_shape value fetchers do not return malformed
source inputs.

Fixes #69071
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants