modify equality and typed in filter behavior for numeric match values on string columns #16593

Conversation

@clintropolis clintropolis commented Jun 13, 2024

Description

This PR changes EqualityFilter and TypedInFilter so that numeric match values against string columns now cast the strings to numbers, instead of converting the numeric values directly to strings for pure string equality. This makes these filters (the default in SQL-compatible mode) behave consistently with 'default' value mode, which uses BoundFilter for numeric comparison of string values.

Tests were added to cover numeric equality matching. Double match values in particular would previously fail to match string values, since 1.0 would become '1.0', which does not match '1'.
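For illustration, a minimal standalone sketch of the mismatch (plain Java, not the Druid implementation; names are made up):

```java
public class NumericMatchSketch
{
  public static void main(String[] args)
  {
    String stored = "1";      // value in the string column
    double matchValue = 1.0;  // numeric match value from the filter

    // old behavior: stringify the numeric match value, compare as strings
    boolean oldMatch = String.valueOf(matchValue).equals(stored);  // "1.0".equals("1") -> false

    // new behavior: cast the string to the match value's numeric type
    boolean newMatch = Double.parseDouble(stored) == matchValue;   // 1.0 == 1.0 -> true

    System.out.println(oldMatch + " -> " + newMatch);  // false -> true
  }
}
```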

I investigated an alternative of having the SQL planner not eat CAST operators as it currently does, since a query like ... WHERE stringColumn = 1.0 ends up as ... WHERE CAST(stringColumn AS DOUBLE) = 1.0 before we eat the cast, which would make the native layer a lot more explicit. However, this change does basically the same thing and is a lot less disruptive.

It may still be worth investigating not eating casts during SQL planning, but I'll save that for follow-up work.

Release note

Modified the behavior of EqualityFilter and TypedInFilter when matching numeric-typed values (particularly DOUBLE) against string columns: the strings are now effectively cast so that numeric comparison is used, making Druid's behavior consistent regardless of the sqlUseBoundAndSelectors context flag.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

… string columns

changes:
* EqualityFilter and TypedInFilter numeric match values against string columns will now use StringComparators.NUMERIC instead of converting the numeric values directly to string for pure string equality. This is effectively an implicit cast of the STRING values to the numeric match value type, which is consistent with the casts which are eaten in the SQL layer, as well as classic Druid behavior
* added tests to cover numeric equality matching. Double match values in particular would fail to match the string values since `1.0` would become `'1.0'` which does not match `'1'`.
@clintropolis clintropolis changed the title fix equality and typed in filter behavior for numeric match values on string columns modify equality and typed in filter behavior for numeric match values on string columns Jun 13, 2024
@kgyrtkirk (Member) commented:

> It may still be worth investigating not eating casts during SQL planning, but I'll save that for follow-up work.

I think the functions in Calcite which make it easy to discard casts (like RexUtil#isLiteral) could have made it too easy to discard such things :)

```java
final Set<String> stringSet;
// when matching strings to numeric match values, use numeric comparator to implicitly cast the string to number
if (matchValueType.isNumeric()) {
  stringSet = new ObjectAVLTreeSet<>(StringComparators.NUMERIC);
```
@kgyrtkirk (Member) commented Jun 19, 2024:

I feel like regardless of the fact that the values arrive as sortedValues, the build cost of this type of TreeSet will still be O(N log N), and I guess lookup will also be O(log N).

The old approach of a hash map: build O(N), lookup O(1).

I feel like delaying the normalization until the comparison is a bit too much; why not introduce normalization of the input values instead?
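To make the trade-off concrete, a standalone sketch (java.util.TreeSet stands in for the fastutil ObjectAVLTreeSet, and a naive parse-to-double comparator stands in for StringComparators.NUMERIC, which also handles nulls and non-numeric strings):

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetLookupSketch
{
  public static void main(String[] args)
  {
    // numeric comparator: strings compare by parsed value, so "2" equals "2.0"
    Comparator<String> numeric = Comparator.comparingDouble(Double::parseDouble);

    // tree set: O(N log N) build, O(log N) lookup, comparator runs on every probe
    Set<String> treeSet = new TreeSet<>(numeric);
    treeSet.add("1.0");
    treeSet.add("2.0");
    System.out.println(treeSet.contains("2"));  // true: numerically equal to "2.0"

    // plain hash set: O(N) build, O(1) lookup, but exact string matches only,
    // which is why the input values would need normalization first
    Set<String> hashSet = new HashSet<>(treeSet);
    System.out.println(hashSet.contains("2"));  // false: "2" is not the string "2.0"
  }
}
```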

@clintropolis (Member, Author) replied:

See the other comments about trying to have consistent behavior no matter how sqlUseBoundAndSelectors is set. If sqlUseBoundAndSelectors is set to true (which was always the behavior before these new filters and the context flag were introduced), something like stringCol IN (1.0, 2.0, 3.0) would be planned into an OrFilter of BoundFilters, e.g. (1.0 <= stringCol <= 1.0) OR (2.0 <= stringCol <= 2.0) OR (3.0 <= stringCol <= 3.0), which is even less efficient than this sorted set.

The goal here is to get this to behave as if stringCol were cast to DOUBLE, which means we cannot convert the match-value set to strings and use a plain set. We either need to use this numeric comparator, or actually cast the entire string column to a double externally so that this filter uses a double predicate instead of a string predicate (i.e. not eating the cast in SQL, in which case this filter would be using a double predicate because the column would be a double virtual column instead of a regular string column).

Since using this comparator is basically guaranteed not to be worse than the older behavior (and is in fact slightly better, since we don't have to do the more expensive BoundFilter check and can use the set instead of the loop that the OR would need), this is the 'easier' approach.

I agree it's potentially not the best approach, but the possibly-best approach requires quite a lot more work and validation, which is what I really meant by saving 'not eating the cast' for a follow-up PR.

```java
if (value == null) {
  return DruidPredicateMatch.UNKNOWN;
}
return DruidPredicateMatch.of(StringComparators.NUMERIC.compare(value, castForComparison.asString()) == 0);
```
A member commented:
StringComparators.NUMERIC is full of interesting calls, like convertStringToBigDecimal; it seems like something which could generate some garbage-collector pressure, especially if it gets used inside a for-each-row comparator.
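A hypothetical standalone illustration of the allocation concern (not Druid's code; the real comparator also has cheaper paths before falling back to BigDecimal):

```java
import java.math.BigDecimal;
import java.util.Comparator;

public class AllocationPressureSketch
{
  // BigDecimal-backed numeric comparison, similar in spirit to the
  // convertStringToBigDecimal fallback inside StringComparators.NUMERIC
  static final Comparator<String> NUMERIC_LIKE = Comparator.comparing(BigDecimal::new);

  public static void main(String[] args)
  {
    // inside a per-row predicate, every comparison allocates fresh BigDecimals,
    // so a scan over millions of rows creates millions of short-lived objects
    int matches = 0;
    for (int row = 0; row < 1_000_000; row++) {
      if (NUMERIC_LIKE.compare((row % 10) + ".5", "1.5") == 0) {
        matches++;
      }
    }
    System.out.println(matches);  // 100000
  }
}
```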

@clintropolis (Member, Author) replied:

This change is trying to make this filter, when comparing strings to numerics, consistent with the behavior of the old BoundFilter, which was previously always used for comparing strings to numeric values until these new filters were added; equality was planned to something like 1.0 <= stringCol <= 1.0. BoundFilter uses this comparator when comparing strings to numeric values: https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/filter/BoundFilter.java#L265. The goal here is to make the behavior the same regardless of how sqlUseBoundAndSelectors (the flag that controls whether we use the new or old filters) is set.

Casting the string column to a double column is also relatively expensive; currently we use https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/math/expr/ExprEval.java#L627, where Doubles.tryParse involves a regex. This could probably be done more efficiently, but again, that requires more dramatic changes in the native layer, which I was trying to avoid for now.
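For reference, a minimal sketch of the Guava call in question (Doubles.tryParse validates input against a floating-point regex before parsing, and returns null on failure):

```java
import com.google.common.primitives.Doubles;

public class TryParseSketch
{
  public static void main(String[] args)
  {
    // regex validation happens on every call, even for well-formed input
    Double ok = Doubles.tryParse("1.0");   // 1.0
    Double bad = Doubles.tryParse("abc");  // null (no exception thrown)
    System.out.println(ok + " " + bad);
  }
}
```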

@kgyrtkirk (Member) commented Jun 20, 2024:

the TypedInFilter class:

  • knows in which type the match should happen (matchValueType)
  • the values it has should already be prepared to be used with that

> Casting the string column to a double column is also relatively expensive

with a virtual column that might be true, but I think that casting will happen quite a lot with the current approach as well. Let me try to explain what I see right now:

  • load an AVL tree
  • do comparisons with a comparator
    • which will create a BigDecimal when
      • not comparing against null
      • the number(s) are not longs
    • that's kinda like all the interesting cases...
    • due to the AVL tree lookup there will be an O(log N) factor
    • the BigDecimal creation will also happen in cases where the comparator returns false

I wonder why not:

  • remove the AVL tree stuff
  • the sorted values should be doubles if not already... but could populate the set with BigDecimals for simplicity
  • add decimalValue = convertStringToBigDecimal(value)
  • use decimalValue to look up in the map

Couldn't this avoid the tree-build penalty and the multiple comparisons/normalizations during every comparison?
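A minimal standalone sketch of this suggestion (toDecimal is a hypothetical stand-in for convertStringToBigDecimal; note that a HashSet of BigDecimal needs stripTrailingZeros, since BigDecimal.equals and hashCode are scale-sensitive and "2" would otherwise miss "2.0"):

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NormalizedLookupSketch
{
  // hypothetical stand-in for convertStringToBigDecimal: null on bad input
  static BigDecimal toDecimal(String s)
  {
    try {
      return s == null ? null : new BigDecimal(s);
    }
    catch (NumberFormatException e) {
      return null;
    }
  }

  public static void main(String[] args)
  {
    // build once, O(N): normalize the match values up front
    Set<BigDecimal> matchValues = new HashSet<>();
    for (String v : List.of("1.0", "2.0", "3.0")) {
      matchValues.add(toDecimal(v).stripTrailingZeros());
    }

    // per row: one normalization plus an O(1) hash lookup, no tree walk
    BigDecimal rowValue = toDecimal("2");
    System.out.println(rowValue != null
        && matchValues.contains(rowValue.stripTrailingZeros()));  // true
  }
}
```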

@clintropolis (Member, Author) replied:

The set is built once per query, not for every comparison. However, fair point about lookup speed using a plain HashSet for predicates, though there is a fair bit of complexity here that needs careful consideration, and probably a bunch of measurement I was looking to avoid, at least until the bug is fixed.

TypedInFilter is currently constructed with a sorted List of match values; the sortedness is very important for computing indexes. We definitely optimized for that use case over the predicate-filtering use case.

Right now, when the column type (predicate type) matches the filter's matchValueType, we just use the sorted list directly with Collections.binarySearch, so no set is constructed at all, though that has the same O(log N) lookup speed you're calling out here, which is again a fair complaint.
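Roughly, the same-type predicate path looks like this (a simplified standalone sketch, not the actual TypedInFilter code):

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class BinarySearchSketch
{
  public static void main(String[] args)
  {
    // the filter is constructed with an already-sorted list of match values,
    // so lookups can reuse it directly instead of materializing a set
    List<Long> sortedMatchValues = List.of(1L, 5L, 9L);
    Comparator<Long> matchTypeComparator = Comparator.naturalOrder();

    // O(log N) per probe; a non-negative index means the value is present
    boolean matches = Collections.binarySearch(sortedMatchValues, 5L, matchTypeComparator) >= 0;
    System.out.println(matches);  // true
  }
}
```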

As mentioned before, though, we do need the sorted list, so it can't really go away; at best we can also build a hash set, so the next question is where and how to do that.

It's really not that awesome to be computing any sets in the makePredicate functions, because technically, if several segments are in the processing pool, they would be blocked while one of them builds the sets, which is kind of sad.

We could consider building the "default" set (meaning a set of the matchValueType) outside of the predicate factory, as part of the TypedInFilter constructor, though the downside is that this guarantees a doubled memory footprint, instead of paying it only when matchValueType doesn't match the column predicate type, along with the added up-front cost of constructing it.

When building the set we also need to coerce the values to handle any ugliness that happens during JSON deserialization, such as longs coming in as ints or doubles/floats or whatever. That isn't really a problem when computing indexes, or with the current predicates using Collections.binarySearch, because the match-type comparator handles it (not to be confused with the StringComparators.NUMERIC usage added in this PR, which is for the string predicate + number match value case).

Anyway, this would let the same-predicate-and-matchValueType case, plus these string-to-numeric comparisons, use that default set computed up front, and then we would only have to deal with the more blocking set construction for the stranger cases where matchValueType doesn't match the predicate type.

The predicate-maker functions currently build sets 'backwards', trying to cast the match values from matchValueType to the predicate type, on the assumption that the number of values the predicate will be evaluated against is a lot bigger than the match-value set size (millions of rows vs. however big the IN filter is). I feel like there is still some benefit to converting the set when safe, rather than converting every row value (as with number predicates and string match values), so I don't think we would always want to use a hash set of the match values and convert the row values either, because that means casting every row value the predicate processes. I also don't think we want to build all of these combinations up front, since the expense would be too much if they are never used.

Anyway, all of this is why I was suggesting that longer term it makes sense to make virtual columns more efficient and also not eat the casts in the SQL layer, so that the predicate chosen would always match the matchValueType. Still, we have an open question of whether we should build a hash set up front or not, and the answer is I don't really know yet; maybe?

I guess all of this is why I was suggesting comparing against the behavior of "sqlUseBoundAndSelectors": true, which makes an OrFilter of BoundFilters for IN, and a single BoundFilter using StringComparators.NUMERIC for numeric equality, so that this bug can be fixed without answering all of the other hard questions right now, because the current behavior is simply incorrect. However, I can at least try to explore building a same-type hash set to cut down on the sets built inside the supplier, and take some measurements, if you view it as a blocker to fixing this bug.

@clintropolis (Member, Author) commented:

> I think the functions in Calcite which make it easy to discard casts (like RexUtil#isLiteral) could have made it too easy to discard such things :)

It's basically a one-line change to not discard the casts in the SQL layer: https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/expression/builtin/CastOperatorConversion.java#L110.

The additional work is all at the native layer, to make sure there is no performance penalty to having the casts be explicit (lots of things implicitly perform the cast at the native layer, which is why we eat them in the first place), and UNNEST also needs some modification to allow additional virtual columns to be defined while continuing to allow filter pushdown of casts on the unnest column.

```java
if (matchValue.type().isNumeric()) {
  return value -> {
    if (value == null) {
      return DruidPredicateMatch.UNKNOWN;
```
@LakshSingla (Contributor) commented Jun 20, 2024:

Checking if this change was a regression from the original code: earlier there was DruidObjectPredicate.alwaysFalseWithNullUnknown, whereas now we return UNKNOWN. Since the match value is null, the modified code looks right.

@clintropolis (Member, Author) replied:

DruidObjectPredicate.alwaysFalseWithNullUnknown is still there; it is the call to DruidObjectPredicate.equalTo that is modified. equalTo returns UNKNOWN if the input value is null.

DruidObjectPredicate.alwaysFalseWithNullUnknown is for when ExprEval.castForEqualityComparison returns null, which is a short circuit for things that can never match, such as matching the double value 1.1 against a long column.
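A hypothetical standalone model of the two cases described above (the enum and factory method are made up for illustration; the real code uses DruidObjectPredicate and DruidPredicateMatch):

```java
import java.util.function.Function;

public class ShortCircuitSketch
{
  enum Match { TRUE, FALSE, UNKNOWN }

  // builds a LONG-column equality predicate for a numeric match value
  static Function<Long, Match> longEquality(double matchValue)
  {
    long asLong = (long) matchValue;
    if ((double) asLong != matchValue) {
      // castForEqualityComparison-style short circuit: 1.1 can never equal
      // any LONG value, so everything is FALSE except null, which is UNKNOWN
      return value -> value == null ? Match.UNKNOWN : Match.FALSE;
    }
    // equalTo-style predicate: null input is UNKNOWN, otherwise compare
    return value -> value == null ? Match.UNKNOWN
                                  : (value == asLong ? Match.TRUE : Match.FALSE);
  }

  public static void main(String[] args)
  {
    System.out.println(longEquality(1.1).apply(1L));    // FALSE
    System.out.println(longEquality(1.1).apply(null));  // UNKNOWN
    System.out.println(longEquality(1.0).apply(1L));    // TRUE
  }
}
```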

@kgyrtkirk (Member) left a review comment:
thank you for the updates!

@clintropolis clintropolis merged commit 09e0eef into apache:master Jul 8, 2024
88 checks passed
@clintropolis clintropolis deleted the equality-and-typed-in-numeric-match-on-strings branch July 8, 2024 17:58
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
… on string columns (apache#16593)

* fix equality and typed in filter behavior for numeric match values on string columns
changes:
* EqualityFilter and TypedInFilter numeric match values against string columns will now cast strings to numeric values instead of converting the numeric values directly to string for pure string equality, which is consistent with the casts which are eaten in the SQL layer, as well as classic Druid behavior
* added tests to cover numeric equality matching. Double match values in particular would fail to match the string values since `1.0` would become `'1.0'` which does not match `'1'`.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024