Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search - add case insensitive flag for "term" family of queries #61596

Merged
merged 2 commits into from
Sep 18, 2020

Conversation

markharwood
Copy link
Contributor

Adds case insensitive option to term, terms, terms_in_set, prefix, wildcard queries
First cut.
Closes #61546

@markharwood markharwood added WIP :Search/Search Search-related issues that do not fall into other categories labels Aug 26, 2020
@markharwood markharwood self-assigned this Aug 26, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Aug 26, 2020
@markharwood markharwood force-pushed the fix/61546 branch 3 times, most recently from 5b9a7bf to cc70b86 Compare August 27, 2020 18:00
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add new functions rather than parameters ?
Something like termQueryCI, prefixQueryCI and wildcardCI with a default implementation that throws a QueryShardException ?

@markharwood
Copy link
Contributor Author

markharwood commented Sep 1, 2020

I wonder if we should add new functions rather than parameters ?

I was following the approach of using parameters that we introduced in regexpQuery.
We could have a version of termQuery etc without the param that delegates to the with-param version using the default of false. I originally made a hard break with the method signature to make sure the compiler pointed me at all uses of these functions. Having reviewed all the uses maybe we could look at a less invasive change (your termsQueryCI or my suggestion of a parameter-less default). Any preference before I start a refactor @jimczi?

@jimczi
Copy link
Contributor

jimczi commented Sep 4, 2020

I was following the approach of using parameters that we introduced in regexpQuery.

For regexQuery I think it's different since it's only relevant for text and keyword field. Term queries are implemented by almost all field types so I think we can have at least a differentiation there ?

@markharwood markharwood force-pushed the fix/61546 branch 3 times, most recently from ab4bdfe to 053aee1 Compare September 7, 2020 12:45
@markharwood
Copy link
Contributor Author

I'm going to copy the regexp docs for this flag:

(Optional, boolean) allows case insensitive matching of the regular expression value with the indexed field values when set to true. Setting to false is disallowed.

I'll not include the flag in the example query though (despite this helping improve test coverage). The reason is that I want to avoid promoting the use of this query-time flag if an index-time choice (eg using a normalizer) is typically the better option. I'm not clear if the above is the best wording or how to present the choices between index-time and query-time case sensitivity options. It seems a pain to dive into that detailed discussion on every type of query that supports the new case insensitive flag.

@markharwood
Copy link
Contributor Author

I made the changes you suggested, @jimczi
As mentioned, I'd appreciate your eyes on this @nik9000 for the changes to scripted fields.

I feel uneasy about query docs for the flag - it would be nice to point to somewhere central that outlines the trade offs for case insensitivity at query-time Vs index-time

@nik9000 nik9000 requested a review from javanna September 8, 2020 12:38
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime fields stuff seems right to me. I've added @javanna as a reviewer to make sure he gets a look at it too. We're trying to stay pretty synchronized on this code.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime fields stuff seems right to me. I've added @javanna as a reviewer to make sure he gets a look at it too. We're trying to stay pretty synchronized on this code.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a naming concern, can we not use the CI acronym in the new method names? I find them hard to read, and makes it hard for readers to figure out what CI means in the context. I personally think that a slightly longer name is not a big problem.

@markharwood
Copy link
Contributor Author

Thanks for the review, @javanna . I changed the name.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, thanks for adding the distinction for term queries



// Case insensitive form of terms query
public Query termsQueryCaseInsensitive(List<?> values, @Nullable QueryShardContext context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what we should do here ? We'd need to add the support for case insensitive matching in TermInSetQuery if we want to avoid building giant boolean query in the keyword field. For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi terms query is currently used by eql for more obvious queries like file where file_name in ("wininit.exe", "lsass.exe") and (less obvious) for cidrMatch function for matching an IP address in a list of CIDR blocks. Would be useful to have case insensitive support for it as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(less obvious) for cidrMatch function for matching an IP address in a list of CIDR blocks.

Neat! I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to make sure you aren't expanding the cidr blocks or anything - we natively support the cidr match in the term and terms query.

We're not, we just pass the block to ES.
When dealing with an expression that forces us to do scripting, we use the underlying Lucene class for matching.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

Hopefully you'll be able to use runtime field before too long!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I think it's ok to not provide this option on terms query since we cannot support them efficiently ?

If we don't support it won't people just be forced to do what we do internally and create a bool should array of term queries?
Is there a query-complexity circuit-breaker we're somehow missing here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a +1 to have this supported in terms query. The alternative in EQL, in case we decide to not support this, would probably be (as @markharwood mentioned) to create a bool query with a bunch of term queries in it. CC @jimczi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry is that we'd internally create an array of should term queries because the TermsInSet query doesn't handle case insensitive queries. It's maybe not a big issue but that wouldn't be consistent since a case insensitive terms query would fail with more than 1024 terms (the max boolean clause limit) while the normal one would use the optimized TermsInSet query. So my take on this is that we should implement a proper support in TermInSet if we really want to handle terms case insensitive query. For EQL and SQL I don't think there's a real need to handle terms though. The in operator doesn't need to handle thousands of terms so imo that would be acceptable to translate it into an array of term query. Same for the cidrMatch function that also work on term queries.

Copy link
Contributor Author

@markharwood markharwood Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I'll pull the termsQueryCaseInsensitive method for now. We can always add in another PR later. This PR probably has enough changes to consider already

@markharwood
Copy link
Contributor Author

I ran some benchmarks to see what effects this flag has on response times.
I used an index of 2m weblogs, indexing as wildcard and keyword and indexing the whole log record string (high cardinality) and just the user agent (low cardinality).
I then ran a variety of queries on these fields which helps show some of the performance characteristics:

Response Time (secs) for 1,000 queries on different fields in 2m doc index

Field type/query Low cardinality (12k unique strings) Hi cardinality (1.3m unique strings)
Keyword/ci term 8 57
Keyword/term 1 1
Keyword/ci wildcard 13 1098
Keyword/wildcard 11 1100
Wildcard/ci term 24 2.7
Wildcard/term 14 1.3
Wildcard/ci wildcard 56 129
Wildcard/wildcard 60 133

low_hi_cardinality_chart

Some of the observations this helped me draw out:

  • A case insensitive (CI) term query is slower than a case sensitive one but nowhere near as expensive as any partial-value search on a high cardinality keyword field.
  • Keyword fields experience up to a 7x slowdown when adding CI but wildcard fields only have a ~2x slow down.
  • For CI or case sensitive term queries, the higher the cardinality, the faster the wildcard field searches (presumably less postings per unique term to be verified).

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. Thanks for iterating on this. I'd prefer that we consider terms queries in a follow up since the current matching on keyword would not be able to leverage the TermsInSetQuery but I don't have strong feelings either. We can also optimize later so I let you decide on that aspect ;).

return simpleMatchWithNormalizedStrings(pattern, str);
}

private static boolean simpleMatchWithNormalizedStrings(String pattern, String str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simpleMatchCaseInsensitive ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That name might suggest the case insensitivity is built into the method (it's not).
I wanted something less explicit ("post-normalization"?) that suggested the strings were assumed to have already been normalised (which for argument's sake could have been uppercasing not lowercasing).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

@markharwood markharwood force-pushed the fix/61546 branch 2 times, most recently from de8c3d2 to 8f9ad73 Compare September 18, 2020 13:16
@markharwood
Copy link
Contributor Author

test this please

@andreidan
Copy link
Contributor

Removed backport pending as this has been backported via #62661

@jypan0115
Copy link

In server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function, for now it only works with ASCII characters. May I know why not support foreign characters like Vietnamese? It is not consist with keyword.

@cbuescher
Copy link
Member

@jypan0115 your reading of the code is correct, as the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Your you mind opening an enhancement issue pointing out where this fails e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this.

@jypan0115
Copy link

@jypan0115 your reading of the code is correct, as the comment in the code points out, the "insensitive option" currently only works for ASCII characters. Your you mind opening an enhancement issue pointing out where this fails e.g. for Vietnamese? I'm not sure how casing is handled in that language, but having an example will help us determine the implementation effort and prioritize this.

@cbuescher I opened an issue about this https://github.com/elastic/elasticsearch/issues/95120 And here is a fail example, we have a field called name. If we store it as keyword field and use case insensitive term query to search for Ngô Đức(Uppercase) or ngô đức(lowercase), it works fine. But if it is stored as wildcard field, it failed when I use case insensitive term query. This can be fixed by removing these 3 lines in server/src/main/java/org/elasticsearch/common/lucene/search/AutomatonQueries.java toCaseInsensitiveChar function.

 if (codepoint > 128) {
            return case1;
        }

And that's why I am asking why we only support ASCII characters in wildcard field. Hope this helps.

@cbuescher
Copy link
Member

Thanks, I'll copy parts of this short note over to the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search - add case insensitive flag to term, prefix and wildcard queries