Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "all field" execution mode to query_string query #20925

Merged
merged 1 commit into from
Nov 4, 2016

Conversation

dakrone
Copy link
Member

@dakrone dakrone commented Oct 13, 2016

This commit introduces a new execution mode for the query_string query, which
is intended down the road to be a replacement for the current _all field.

It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:

  • The _all field is disabled
  • No default_field has been set in the index settings
  • No default_field has been set in the request
  • No fields are specified in the request

Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.

When executing in all field mode, the query_string query will look at all the
fields in the mapping that are not metafields and can be searched, and
automatically expand the list of fields that are going to be queried.

Relates to #19784

@dakrone
Copy link
Member Author

dakrone commented Oct 13, 2016

I am opening this PR so we can discuss the direction we'd like the "all" query
to go. That's why this PR currently doesn't have super-exhaustive tests or
documentation (though it does have some tests).

For more back story, see: #19784 (comment)

I'd like to get feedback about whether this is the way we want to build this
query, after which I can iterate, make changes, and then polish the
documentation. So any feedback is welcome!

@dakrone
Copy link
Member Author

dakrone commented Oct 13, 2016

Other features off the top of my head that would be good to eventually incorporate:

  • changing operator of the top-level bool query (currently hard-coded to OR but it would be trivial to add)
  • minimum_should_match
  • perhaps scoring?

@dadoonet
Copy link
Member

dadoonet commented Oct 13, 2016

I did not read the code but only the description of the PR and it looks great. I like it!
Just wondering about the naming of this query while we are at it. Does all mean what we are meaning?
To me all means match_all so I think we should try to find another name although it's a way to replace _all field.

Something like everywhere, all_fields or similar...

@dakrone
Copy link
Member Author

dakrone commented Oct 13, 2016

Just wondering about the naming of this query while we are at it.

The naming is totally up for grabs! I named it "all" because I was planning on punting to get feedback, so other suggestions are totally welcome!

@markharwood
Copy link
Contributor

Thanks for the work on this @dakrone

Working from your description alone I have a couple of points:

I think the use of "cross_fields" is irrelevant here if you're just going to "constant_score" everything. at the end. In this scenario the only thing cross_fields is doing for you is adding a pointless limitation that all fields must share a common analyzer (I think it silently reverts to a "normal" query if this is the case rather than erroring)

I can see we might want to ignore IDF in scoring but I still think coord should be a ranking factor (how many of the user's search terms matched). Presumably this just means pushing the constant_score wrapper down to the word-level clauses not the root

Would be nice for us to know more about which of the query string operators (phrase, field, AND/OR) tend to get used by folks in practice.

@jimczi
Copy link
Contributor

jimczi commented Oct 14, 2016

I wonder if we can avoid the new query path and provides this functionality in the existing parser.
For instance it should be possible to do:

{
  "query": {
    "query_string": {
      "query": "foo bar title:baz body:eggplant"
    }
  }
}

... and apply your logic on the "foo bar" part only. I am saying this because it would be difficult to handle query_string (and simple_query_string) without a default field which seems to provide a nice OOTB.

production 200 10.29.2.1
Expecting to find documents for their production environment where a 200
response code is returned from the given IP address. Since the tokens
are treated individually, all numeric fields (int, long) can be searched
for 200, all the IP address fields can be searched for 10.29.2.1, and
all text-like (string, text, keyword) fields can be searched for
"production". This can closely mimic search the _all field using the
given terms.

I am not sure about the numerics for numerics fields and ip for ip fields restriction. I would prefer another approach, something like create a query on this field if the input is compatible and ignore the part that are not parseable by this field. In such case 200 would also be added to the text-like fields (and the ip as well).

Internally, what happens is that the "all" query analyzes the text using
a Lucene analyzer (by default, the whitespace analyzer) to split it
into tokens, so in this case ["foo", "Bar", "baz"] are the tokens
generated.

Since you can group each field depending on the analyzer I think that you could use the analyzer directly for the text fields instead of relying on the whitespace analyzer upfront. For numerics and others you could split on whitespace but it's really counter intuitive to have this split before the analysis takes place (see #20841)

I think the use of "cross_fields" is irrelevant here if you're just going to "constant_score" everything. at the end. In this scenario the only thing cross_fields is doing for you is adding a pointless limitation that all fields must share a common analyzer (I think it silently reverts to a "normal" query if this is the case rather than erroring)

The cross_fields can be useful but if you split on whitespace before the analysis you already have some kind of cross_fields behavior (the same than what query_stringdoes). The query foo bar on two fields would produce:
(field1:foo OR field2:foo) AND (field1:bar OR field2:bar)
... with AND as the default operator.

@dakrone
Copy link
Member Author

dakrone commented Oct 16, 2016

Thanks for the feedback @markharwood & @jimferenczi!

Would be nice for us to know more about which of the query string operators (phrase, field, AND/OR) tend to get used by folks in practice.

That's a great idea, I am hoping some Kibana users (and others!) chime in here with any information about use cases!

I wonder if we can avoid the new query path and provides this functionality in the existing parser. For instance it should be possible to ... I am saying this because it would be difficult to handle query_string (and simple_query_string) without a default field which seems to provide a nice OOTB.

I think that is the eventual goal of this, however, I think right now it's more complex, and it'd be good to nail down what we want the "all-ish" query to be before changing the way query_string behaves (which may only happen in a major version, while this is additive and could go into 5.1 eventually)

I am not sure about the numerics for numerics fields and ip for ip fields restriction. I would prefer another approach, something like create a query on this field if the input is compatible and ignore the part that are not parseable by this field. In such case 200 would also be added to the text-like fields (and the ip as well).

This is what it already does, I apologize, I simplified in my explanation, but it will still search text/keyword fields for "200" and "127.0.0.1", not only numeric and IP fields. It only ignores fields that it cannot parse the input text as.

Since you can group each field depending on the analyzer I think that you could use the analyzer directly for the text fields instead of relying on the whitespace analyzer upfront. For numerics and others you could split on whitespace but it's really counter intuitive to have this split before the analysis takes place (see #20841)

I experimented with doing this at first, but ran into the problem where someone sends the query foo 200, which without analysis would try to parse the entire text "foo 200" as a numeric to search on a numeric field. This would then not be parseable (it can't parse "foo 200" as a number). As you said though, doing this only for numerics might be a better idea that works around this!

Regarding cross_fields ...

I think re-evaluating this (using cross_fields) makes sense. I'd like to get more feedback on whether it'd be useful to have scoring (when we talked about it last we talked about doing everything with constant_score). Also, determining how this will change if the analysis suggestion Jim made gets applied.

@jimczi
Copy link
Contributor

jimczi commented Oct 17, 2016

I really like how this PR reused the MultiMatchQuery to create the inner queries.
I don't know if the analyzer option is required though. It seems misleading since it is used to parse the query before the analysis so it's more a split analyzer that is not used to create the final field query. Do you have use cases in mind where this analyzer option could be useful ? I think it could be simplified to a simple boolean split_on_whitespace or we could simply removed this option. I think it should be removed if the default operator of this query (OR/SHOULD) cannot be changed.

Regarding scoring I don't think this query should produce a constant score query. Since it uses the cross_fields types of the MultiMatchQuery the scoring should be consistent thanks to the blending of the term queries. It would give a nice OOTB experience as opposed to the first match ranked first scenario. Of course some queries will fail regarding the scoring expectation but it should work for simple ones seamlessly.

The code looks good to me, it's simple and well contained. Tough in order to be able to push a first version of this I think we should discuss the following points:

  • decide if this query should produce a score or not
  • maybe a better name like david proposed
  • decide which options should be exposed (default operator, min should match, ...). This point is not required for the first version since we can add the options later.

@clintongormley
Copy link

To reiterate the conclusions in Prague:

If a user wants to do real full text search against a single field which contains the contents of multiple other fields, they should use copy_to.

This leaves the "discovery" use case: find these search elements regardless of which fields they're in, or what type the field is. For this use case, we spoke about adding a new _all query which could query all fields in the mapping. We said it was OK to be slower than a query against the _all field, and that scoring wasn't important (and should probably be disabled to discourage use of this query for full text search).

The main use case for this _all query is to support the query string query. Today, the query string query tokenizes the query string on whitespace and then passes each token to the underlying field separately (where it might undergo further analysis). However, this approach is flawed (it breaks n-grams, shingles, and multi-word synonyms) and has been fixed in Lucene 6.2 (see https://issues.apache.org/jira/browse/LUCENE-2605 for details). There is a PR to expose this in Elasticsearch (#20965).

That said, the change in LUCENE-2605 is not without problems. It works well when there is a single default field, but breaks when there are multiple fields. Imagine a query string like (quick brown fox)~2. In the current tokenize-on-whitespace approach, two of the three tokens must be present in any of the listed fields. With the new approach, two of the three tokens would need to be present in the SAME field (unless we complicate things further by applying the cross-field matching that exists in the multi_match query).

The gist of all of the above is that these same issues are shared by the query string query and by the proposed _all query. I'm beginning to think that we shouldn't add the _all query, but instead change the behaviour of the query string query and, when the default_field is set to _all, expand fields to all fields.

There are three types of queries to deal with:

  • Analyzed text field - you want to pass as much contiguous text as possible to the underlying field to be dealt with by the field's analyzer. Min-should-match logic could be applied as we do with cross_fields matching in multi_match.
  • Not analyzed keyword field - Contiguous text should probably be split on white space so that each token can be considered separately. Keywords can of course contain spaces. This could be worked around by either (a) splitting on whitespace and using a phrase query for "keyword with space" or (b) not splitting on whitespace and relying on operators foo OR bar OR keyword with space
  • Numeric fields - Should be split on whitespace and only legal numbers, dates, IP addresses, etc should be passed to the underlying field (won't work for date formats like yyyy-mm-dd hh:mm:ss but could be worked around with phrase queries).

Honestly, the more I write here, the less I like the idea of removing the _all field. There is a helluva lot of complexity involved in trying to do the right thing at query time, and there are lots of cases where we will fail. Explaining the failure is even more complicated.

I'm leaning towards going back to the simpler change of excluding numerics from the _all field and being done with it. It is easy to disable the _all field if you don't need/want it.

@dakrone
Copy link
Member Author

dakrone commented Oct 17, 2016

change the behaviour of the query string query and, when the default_field is set to _all, expand fields to all fields.

@clintongormley something that concerns me about this is that the query_string query can have very expensive queries in it - fuzzy, phrase with large slops, regexp, wildcards (including leading wildcards), etc. If instead of performing these expensive queries on a single field we expanded the query to perform them on dozens (or potentially hundreds) of fields, I worry that it may be something that takes a cluster down.

Today, the query string query tokenizes the query string on whitespace and then passes each token to the underlying field separately (where it might undergo further analysis). However, this approach is flawed (it breaks n-grams, shingles, and multi-word synonyms) and has been fixed in Lucene 6.2

I believe this can be handled by Jim's suggestion to have two versions of the query text, an non-analyzed version that's used for all the string-ish fields, and a version split on whitespace to be used for all the numeric-ish fields.

Imagine a query string like (quick brown fox)~2. In the current tokenize-on-whitespace approach, two of the three tokens must be present in any of the listed fields. With the new approach, two of the three tokens would need to be present in the SAME field

Do we think that's the intended use case for this query? To me this is more of a "match-like" query in that it is for matching, not doing more complicated things. In that case, a dedicated field and potentially using copy_to is a better use case. I don't have any hard evidence to back this up though, until more people chime in :)

@dakrone
Copy link
Member Author

dakrone commented Oct 17, 2016

Regarding the three different types you mentioned @clintongormley, those make sense (I was only thinking about two, but splitting on whitespace for keyword fields makes sense also). If this was done, would you still not want to have this as a separate query?

@clintongormley
Copy link

@clintongormley something that concerns me about this is that the query_string query can have very expensive queries in it - fuzzy, phrase with large slops, regexp, wildcards (including leading wildcards), etc. If instead of performing these expensive queries on a single field we expanded the query to perform them on dozens (or potentially hundreds) of fields, I worry that it may be something that takes a cluster down.

But that's essentially what we're talking about doing at the moment.

If this was done, would you still not want to have this as a separate query?

I'm not getting how we would support the query string syntax with the _all query, without making the _all query do everything that the query string query does today (incl fuzzy, min should match etc).

The _all query feels like the wrong answer here. My fear is it is going to become a big complex mess, while the existing alternative (the _all field) is pretty simple.

@rjernst
Copy link
Member

rjernst commented Oct 17, 2016

I'm leaning towards going back to the simpler change of excluding numerics from the _all field and being done with it. It is easy to disable the _all field if you don't need/want it.

This does not solve these issues (and which is why we have discussed this so much!):

  • Kibana would still not be able to do search on "foo 200"
  • everyone still pays the cost of _all out of the box (essentially doubling the size of their index), when everyone does not actually need this.
  • Mappings stay complex (we must keep the include_in_all setting, which is invasive internal mapping code)

And there are probably some more things I've already forgotten. We started this discussion because of all the problems with _all. Please, let's not just drop this because it is a "big" change. _all is broken in many ways right now. Let's fix it.

Regarding some points made above:

  • Why can we not keep this simple and leave scoring out of it? I thought that was the consensus before.
  • We could handle the numerics by wrapping the analyzer for numeric fields, when used with the all query, with one that splits on whitespace (and discards non-numeric terms before even making it to the numeric parser).
  • For (simple_)query_string, can we just document how to create a catch all field with copy_to, and require the fields parameter (and keep a separate all query or whatever we want to call it, which does not have any special operators)?

@clintongormley
Copy link

@rjernst what gets exposed to the user? A simple field whose value gets passed to just the _all query? If so, then it won't be enough - users use the syntax like AND, OR, NOT, ranges, field:, prefix, wildcard etc. But query string query also supports grouping, fuzzy, min should match, etc. How do we handle those with the _all query? Build in support for all these features?

I could imagine changing the query string to auto-expand _all to all fields, and adding some of the things I've mentioned, but the _all query doesn't feel like the answer here.

@dakrone
Copy link
Member Author

dakrone commented Oct 17, 2016

A simple field whose value gets passed to just the _all query? If so, then it won't be enough - users use the syntax like AND, OR, NOT, ranges, field:, prefix, wildcard etc. But query string query also supports grouping, fuzzy, min should match, etc. How do we handle those with the _all query?

I think we're conflating two different use cases. I would argue that for the users that are advanced enough to use prefixes, wildcards, regexp, fuzzy, etc that the "all" query is not for them. These users can use the : syntax to specify which fields they want to match on in their query.

To me, the "all" query is for the users that have no idea what that search box in Kibana is backed by, and simply want easy-to-use filtering functionality that works without knowing anything about their data, the mappings, and without fear of query_string's error messages.

I have pushed a change that separates the query text into an analyzed and non-analyzed version (as people have recommended) and uses the appropriate version depending on the field type. This allows a keyword field with whitespace to match.

@dakrone
Copy link
Member Author

dakrone commented Oct 17, 2016

maybe a better name like david proposed

If we get to this point, maybe "all_fields" is a better query name?

@clintongormley
Copy link

I think we're conflating two different use cases. I would argue that for the users that are advanced enough to use prefixes, wildcards, regexp, fuzzy, etc that the "all" query is not for them. These users can use the : syntax to specify which fields they want to match on in their query.

Think about Kibana users. Sure, some use the search bar just for filtering. Others use it with at least ranges, wildcards, AND|OR|NOT etc. Others use a lot more. Either way, there's only one search field exposed in Kibana. So either we're exposing the query string query, or we're coming up with a new query parser syntax that can be backed by this _all query.

@jimczi
Copy link
Contributor

jimczi commented Oct 18, 2016

Either way, there's only one search field exposed in Kibana.

I am confused here, Kibana exposes the query_string query as is and the default field is the one defined in the targeted ES, right ?

So either we're exposing the query string query, or we're coming up with a new query parser syntax that can be backed by this _all query.

I agree with Clinton, why can't we just change the logic when _all is the default field for a query_string query ? This parser is easily modifiable and we can tweak everything we want in it.

In fact this simple query:

GET t/t/_validate/query?explain
{
    "query": {
        "query_string": {
           "query": "this is a test",
           "fields": ["*"],
           "lenient": true
        }
    }
}

... is really close to what the proposed _all query is doing. The only difference I see is that the query_string does not filter the metadata fields. This would be much simpler than adding the required syntax to a new _all query parser.

@dadoonet
Copy link
Member

I agree with @jimferenczi and @clintongormley:

I wonder if we can avoid the new query path and provides this functionality in the existing parser.
For instance it should be possible to do:

{
  "query": {
    "query_string": {
      "query": "foo bar title:baz body:eggplant"
    }
  }
}

... and apply your logic on the "foo bar" part only. I am saying this because it would be difficult to handle query_string (and simple_query_string) without a default field which seems to provide a nice OOTB.

So no need for another query type if we can get a nice OOTB without the need of generating a _all field at index time.

@dakrone
Copy link
Member Author

dakrone commented Oct 18, 2016

In fact this simple query ... is really close to what the proposed _all query is doing. The only difference I see is that the query_string does not filter the metadata fields. This would be much simpler than adding the required syntax to a new _all query parser.

I wish it did, however, if what @clintongormley said about users using AND and OR is correct, the AND functionality is broken in this query: https://gist.github.com/dakrone/022cf20c8e390c4063aa263cd9370e5a which is an unfortunate thing about the query_string query.

@dakrone
Copy link
Member Author

dakrone commented Oct 18, 2016

(By the way I'm not against re-using the query_string query for this functionality, I'm still looking into it. I don't want anyone to think I'm against it and trying to stop the conversation)

@dakrone
Copy link
Member Author

dakrone commented Oct 18, 2016

Found it. The unintended behavior is related to bools having an implicit boolfield:true clause in the query. I'll look at how hard it would be to change this for query_string.

@dakrone
Copy link
Member Author

dakrone commented Oct 20, 2016

Okay, I have made some changes to the query_string query to get this sort of behavior. It now does auto-field-expansion and auto-leniency when the following criteria are ALL met:

  • The _all field is disabled
  • No default_field has been set in the index settings
  • No default_field has been set in the request
  • No fields are specified in the request

It works great so far! What do we think of this as a solution?

This would allow Kibana to retain almost all of its behavior without any changes other than disabling the _all field by default on the ES side.

@jpountz
Copy link
Contributor

jpountz commented Nov 7, 2016

Should we try to consolidate the all_fields: true and fields: [ "*" ] options? From a user perspective, they look like they are achieving the same thing?

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Nov 9, 2016
This commit introduces a new execution mode for the
`simple_query_string` query, which is intended down the road to be a
replacement for the current _all field.

It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:

    The _all field is disabled
    No default_field has been set in the index settings
    No fields are specified in the request

Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.

When executing in all field mode, the `simple_query_string` query will
look at all the fields in the mapping that are not metafields and can be
searched, and automatically expand the list of fields that are going to
be queried.

Relates to elastic#20925, which is the `query_string` version of this work.
This is basically the same behavior, but for the `simple_query_string`
query.

Relates to elastic#19784
dakrone added a commit that referenced this pull request Nov 9, 2016
This commit introduces a new execution mode for the
`simple_query_string` query, which is intended down the road to be a
replacement for the current _all field.

It now does auto-field-expansion and auto-leniency when the following criteria
are ALL met:

    The _all field is disabled
    No default_field has been set in the index settings
    No fields are specified in the request

Additionally, a user can force the "all-like" execution by setting the
all_fields parameter to true.

When executing in all field mode, the `simple_query_string` query will
look at all the fields in the mapping that are not metafields and can be
searched, and automatically expand the list of fields that are going to
be queried.

Relates to #20925, which is the `query_string` version of this work.
This is basically the same behavior, but for the `simple_query_string`
query.

Relates to #19784
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Nov 22, 2016
As part of elastic#20925 and elastic#21341 we added an "all-fields" mode to the
`query_string` and `simple_query_string`. This would expand the query to
all fields and automatically set `lenient` to true.

However, we should still allow a user to override the `lenient` flag to
whichever value they desire, should they add it in the request. This
commit does that.
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jan 11, 2017
This change disables the _all meta field by default.

Now that we have the "all-fields" method of query execution, we can save both
indexing time and disk space by disabling it.

_all can no longer be configured for indices created after 6.0.

Relates to elastic#20925 and elastic#21341
Resolves elastic#19784
@dakrone dakrone deleted the add-all-query branch January 23, 2017 17:23
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 27, 2017
Add `quote_field_suffix` to simple_query_string query
Deprecate `lowercase_expanded_terms` and `locale`: elastic/elasticsearch#20208
Add "all field" execution mode to query_string and simple_query_string query: elastic/elasticsearch#20925

Fixes #2792
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2017
Add `quote_field_suffix` to simple_query_string query
Deprecate `lowercase_expanded_terms` and `locale`: elastic/elasticsearch#20208
Add "all field" execution mode to query_string and simple_query_string query: elastic/elasticsearch#20925

Fixes #2792
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2017
* Add missing fields on simple query string query

Add `quote_field_suffix` to simple_query_string query
Deprecate `lowercase_expanded_terms` and `locale`: elastic/elasticsearch#20208
Add "all field" execution mode to query_string and simple_query_string query: elastic/elasticsearch#20925

Fixes #2792

* Convert to property expressions
russcam added a commit to elastic/elasticsearch-net that referenced this pull request Jun 29, 2017
* Add missing fields on simple query string query

Add `quote_field_suffix` to simple_query_string query
Deprecate `lowercase_expanded_terms` and `locale`: elastic/elasticsearch#20208
Add "all field" execution mode to query_string and simple_query_string query: elastic/elasticsearch#20925

Fixes #2792

* Convert to property expressions

(cherry picked from commit bef4b9d)
awelburn pushed a commit to Artesian/elasticsearch-net that referenced this pull request Nov 6, 2017
* Add missing fields on simple query string query

Add `quote_field_suffix` to simple_query_string query
Deprecate `lowercase_expanded_terms` and `locale`: elastic/elasticsearch#20208
Add "all field" execution mode to query_string and simple_query_string query: elastic/elasticsearch#20925

Fixes elastic#2792

* Convert to property expressions

(cherry picked from commit bef4b9d)
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@anhlqn
Copy link

anhlqn commented Mar 9, 2018

I've just upgraded a test cluster from ES and Kibana from 2.4 to 5.6.8 recently, and I am grateful for this change in Log Analysis use case. In ES 2.4, to provide end users with an easy Discovery mode when searching syslog events, I set include_in_all: false at index level and set include_in_all: true for some fields. That approach allows users to make initial full text searches without me blowing up index size for leaving _all field enabled for all fields.

The biggest field for syslog events in my case is the message field, and to save storage, I've set it to

"message": {
  "type": "string",
  "index": "no",
  "include_in_all": "true"
}

With the discontinue of _all field since ES v6, it seems that to preserve the same search experience, I have two choices:

  1. Change message field type to text and reindex existing data to prevent mapping conflict (1 year of data at tens of TB)
  2. Make use of copy_to to copy message to another field, perhaps to the all_fields
"message": {
  "type": "text",
  "index": false,
  "copy_to": "all_fields"
}

My question is that is number 2 a good approach and is there any other ways?

@dakrone
Copy link
Member Author

dakrone commented Mar 9, 2018

Hi @anhlqn, this sounds like a good questions for the discuss forums at https://discuss.elastic.co as we reserve Github issues for feature requests and bug reports. If you want to ask your question there it is a much better place to discuss things like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories v5.1.1 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants