
Druid Multi-Value String Columns, Expressions, and SQL #7525

clintropolis opened this issue Apr 22, 2019 · 5 comments

clintropolis commented Apr 22, 2019

Motivation

Druid SQL is the likely way for new users to interact with Druid, but despite many advancements since its introduction, it is still not as expressive as native Druid queries in some regards. One of the largest remaining gaps in functionality is that Druid SQL only partially supports multi-value dimensions, which is captured in #4638.

There are, however, a number of non-trivial challenges in rectifying this situation, the largest barrier being the behavior of Druid multi-value dimensions themselves. The native behavior of these columns is not directly compatible with standard SQL array or multi-set types because they were never designed or equipped to actually be array or multi-set types. Every string dimension is opportunistically a multi-value dimension, since they share the same underlying string dictionary column type. Whether or not a column has multiple values may vary from segment to segment, with just a column capabilities flag to distinguish them at processing time.

This is not itself a problem for Druid, because natively they are just string dimensions. Druid assigns no real significance to the array of values which make up a row: when aggregating with a top-n or group-by query, aggregation is done by the individual string values in the array, not the array itself; when filtering, any individual value in the array which satisfies the filter counts as a match, and the entire array's worth of values is included in the result set. The original array can only be retrieved with a scan or select query, which largely diminishes its utility.

On top of this, multi-value dimensions are something of a second-class citizen in Druid: they are not supported by the expression language used in virtual columns and expression filters, which is also used heavily by the SQL planner.

Proposed changes

This proposal aims to lay out a way in which multi-value string dimensions can be operated on either as complete arrays or as individual values, near invisibly, at the user's discretion. This serves a number of purposes, but perhaps the primary one is to avoid breaking backwards compatibility for SQL queries that happen to work today. Allowing operations on multi-value dimensions with single-value semantics is likely most intuitive to existing users of these dimensions who are using native Druid queries or have constructed a working SQL query. Likewise, allowing array operations is perhaps more naturally intuitive with how a user might expect these dimensions to behave: an array is ingested and an array is stored, so it should be possible to operate on arrays on a per-row basis and to preserve the structure as it was ingested during aggregation and filtering. Additionally, by providing this array functionality, we can vastly increase the power of existing multi-value string dimensions to allow expressiveness that is currently just not possible.

The changes will be focused in two areas: the Druid expression language used for virtual columns and expression filters, and wiring the expanded functionality up to the Druid SQL planner.

Druid Expressions

Druid SQL relies heavily on expression virtual columns and expression filters, so introducing expression support for multi-value string dimensions will be necessary. Currently the Druid expression language has no real support for multi-value dimensions; null is the value bound for rows which have multiple values. The objective of expanding the expression language is both to support expressions against existing multi-value dimensions and to provide the machinery for any fully SQL-compatible array types we might wish to introduce in future work.

To this end, this proposal introduces array types for doubles, longs, and strings to the Druid expression language, as well as a handful of function expressions to operate on them. Array constants will parse into DoubleArrayExpr, LongArrayExpr, and StringArrayExpr, which extend ConstantArrayExpr, which itself extends ConstantExpr. A new type of function expression, ApplyFunctionExpr, will also be defined; it takes a newly defined LambdaExpr fragment and one or more Expr arguments which are applied to the LambdaExpr. This enables defining some of the functions necessary to fully support all aspects of this proposal, and increases the expressiveness of what is possible with these new array types.

New top level grammar:

expr : 'null'                                         # null
     | ('-'|'!') expr                                 # unaryOpExpr
     |<assoc=right> expr '^' expr                     # powOpExpr
     | expr ('*'|'/'|'%') expr                        # mulDivModuloExpr
     | expr ('+'|'-') expr                            # addSubExpr
     | expr ('<'|'<='|'>'|'>='|'=='|'!=') expr        # logicalOpExpr
     | expr ('&&'|'||') expr                          # logicalAndOrExpr
     | '(' expr ')'                                   # nestedExpr
     | IDENTIFIER '(' lambda ',' fnArgs ')'           # applyFunctionExpr
     | IDENTIFIER '(' fnArgs? ')'                     # functionExpr
     | IDENTIFIER                                     # identifierExpr
     | DOUBLE                                         # doubleExpr
     | LONG                                           # longExpr
     | STRING                                         # string
     | '[' DOUBLE  (','? DOUBLE)* ']'                 # doubleArray
     | '[' LONG (','? LONG)* ']'                      # longArray
     | '[' STRING (','? STRING)* ']'                  # stringArray
     ;

lambda : (IDENTIFIER | '(' IDENTIFIER (','? IDENTIFIER)* ')') '->' expr
       ;
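
For illustration, lambda fragments under this grammar take either a bare identifier or a parenthesized identifier list; the function names used here (like upper) come from the existing expression library, and the identifiers are arbitrary:

```
x -> x + 1
(x) -> upper(x)
(x, acc) -> x + acc
```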

...
'Apply' Functions

The ApplyFunctionExpr is added not directly for SQL compatibility, but rather to lay the groundwork for 'magical' transformations that seamlessly allow SQL interactions with multi-value dimensions using either single-value or array semantics.

| function | description |
| --- | --- |
| `map(lambda,arr)` | applies a transform specified by a single-argument lambda expression to all elements of `arr`, returning a new array |
| `cartesian_map(lambda,arr1,arr2,...)` | applies a transform specified by a multi-argument lambda expression to all elements of the cartesian product of all input arrays, returning a new array; the number of lambda arguments and array inputs must be the same |
| `filter(lambda,arr)` | filters `arr` by a single-argument lambda, returning a new array with all matching elements, or null if no elements match |
| `fold(lambda,arr)` | folds a 2-argument lambda across `arr`; the first argument of the lambda is the array element and the second the accumulator, returning a single accumulated value |
| `cartesian_fold(lambda,arr1,arr2,...)` | folds a multi-argument lambda across the cartesian product of all input arrays; the first arguments of the lambda are the array elements and the last is the accumulator, returning a single accumulated value |
| `any(lambda,arr)` | returns true if any element in the array matches the lambda expression |
| `all(lambda,arr)` | returns true if all elements in the array match the lambda expression |
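
A sketch of how these might evaluate over constant arrays; results after `=>` are illustrative, not the output of a finished implementation:

```
map((x) -> concat(x, '!'), ['a', 'b'])      => ['a!', 'b!']
filter((x) -> x != 'b', ['a', 'b', 'c'])    => ['a', 'c']
any((x) -> x == 'b', ['a', 'b'])            => true
all((x) -> x == 'b', ['a', 'b'])            => false
```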
Array Functions

To unlock the potential of multi-value dimensions as array types, as well as to support future array types, a number of array-specific functions will also be defined.

| function | description |
| --- | --- |
| `array_length(arr)` | returns the length of the array expression |
| `array_offset(arr,long)` | returns the array element at the 0-based index supplied, or null for an out of range index |
| `array_ordinal(arr,long)` | returns the array element at the 1-based index supplied, or null for an out of range index |
| `array_contains(arr,expr)` | returns true if the array contains the element specified by `expr`, or contains all elements specified by `expr` if `expr` is an array |
| `array_overlap(arr1,arr2)` | returns true if `arr1` and `arr2` have any elements in common |
| `array_offset_of(arr,expr)` | returns the 0-based index of the first occurrence of `expr` in the array, or null if no matching elements exist in the array |
| `array_ordinal_of(arr,expr)` | returns the 1-based index of the first occurrence of `expr` in the array, or null if no matching elements exist in the array |
| `array_append(arr,expr)` | appends `expr` to `arr` |
| `array_concat(arr1,arr2)` | concatenates 2 arrays |
| `array_to_string(arr,str)` | joins all elements of `arr` with the delimiter specified by `str` |
| `string_to_array(str1,str2)` | splits `str1` into an array on the delimiter specified by `str2` |
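
Again illustratively, over constant arrays:

```
array_length(['a', 'b', 'c'])                 => 3
array_offset(['a', 'b', 'c'], 1)              => 'b'
array_ordinal(['a', 'b', 'c'], 1)             => 'a'
array_contains(['a', 'b', 'c'], ['a', 'b'])   => true
array_to_string(['a', 'b', 'c'], ',')         => 'a,b,c'
string_to_array('a,b,c', ',')                 => ['a', 'b', 'c']
```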
Automatic Expression Transformations and Expression Value Selector Behavior

One of the less straightforward problems to solve is that, when operating on an individual segment, it is impossible to tell that a dimension is multi-valued if that specific segment does not have multiple values but other segments in the query set do. Likewise, since native Druid facilities for handling multi-value dimensions do not approach them as arrays, it might be more natural to allow users to write expressions against them as though they are single valued, since that is how they aggregate. Since multi-value dimensions natively have this opportunistic nature, so too will the expression language handling of these dimensions.

When creating the expression virtual column selector, the expression is currently examined to collect the required column identifiers whose values are read to create the bindings for the expression to operate on. An additional bit of logic will be included to check whether an identifier takes part in an array function, to detect which columns the expression expects to have multiple values. Likewise, the underlying columns that take part in these bindings will be examined to detect whether multiple values are present.

  • Expressions which have array functions, but with single valued columns as input to those functions, will see those values implicitly cast to single element arrays containing the value.
  • Expressions which do not have array functions, but have multi-value columns as input, will be implicitly wrapped in a map expression created using the column identifier as the lambda identifier, the actual expression as the body, and the column as the array input; e.g. an Expr with a multi-value identifier x becomes map((x) -> Expr, x). Expression filters which do not contain array functions but have multi-value columns as input will use the any function instead of map, to mimic native Druid filter behavior. (See the sketch after this list.)
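
A sketch of these implicit rewrites, assuming a hypothetical multi-value string column tags with the row value ['a', 'b'] (the annotation lines are commentary, not expression syntax):

```
concat(tags, 'foo')
  -- as a virtual column, implicitly becomes:
map((tags) -> concat(tags, 'foo'), tags)            => ['afoo', 'bfoo']
  -- as an expression filter, e.g. concat(tags, 'foo') == 'afoo' becomes:
any((tags) -> concat(tags, 'foo') == 'afoo', tags)  => true
```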

These translations resolve how to handle the input side, allowing operation on multi-value dimensions either as individual values or as the complete array, but the introduction of array types as the potential output of expressions does present a bit of a problem. Since no new ValueType is being introduced to correspond to the array expression types, the expression system will also need to automatically coerce these output types back to String and fabricate a multi-value selector, to make use of all of the existing native infrastructure for processing multi-value dimensions.
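
Concretely, an expression virtual column whose expression produces an array output would still be declared with a STRING output type and present itself as a multi-value string selector. A hypothetical spec, following the existing expression virtual column shape (the column name tags is illustrative):

```json
{
  "type": "expression",
  "name": "v0",
  "expression": "map((x) -> concat(x, 'foo'), tags)",
  "outputType": "STRING"
}
```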

SQL Support

Druid SQL will continue to represent multi-value string dimensions as the VARCHAR type as far as Calcite is concerned, saving true SQL array types for future work. The Druid INFORMATION_SCHEMA COLUMNS table should be expanded to include a new column, IS_MULTIVALUE, to indicate that a string dimension has multiple values, connected to the segment metadata query data on the column.
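
The proposed column could then be inspected with a query along these lines (the datasource name is hypothetical; IS_MULTIVALUE is the new column proposed above):

```sql
SELECT COLUMN_NAME, DATA_TYPE, IS_MULTIVALUE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'my_datasource'
```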

Array Functions

All of the expression language array functions will be mirrored in SQL. Future work will add a NEST function, the opposite of the UNNEST that many SQL dialects provide, to allow Druid multi-value string columns to be treated as ARRAY typed in SQL.
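
Assuming the SQL functions mirror the expression names directly, usage might look like this sketch (the multi-value VARCHAR column tags and the datasource name are hypothetical):

```sql
SELECT ARRAY_LENGTH(tags), ARRAY_OFFSET(tags, 0), ARRAY_TO_STRING(tags, '|')
FROM my_datasource
```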

Filtering

Druid SQL will still have some limitations when filtering on multi-value string dimensions compared to its native counterpart, namely the 'contradictory' filter problem illustrated in #4638: a = 'x' AND a = 'y'. Calcite will correctly realize that this is effectively false for a legitimate VARCHAR, and the query will be a no-op. This behavior will continue since we are still going to represent multi-value string dimensions as VARCHAR, so to express this type of filter the user must use the array_contains or array_overlap functions. When simple enough, these SQL expressions can be translated directly into native Druid selector filters, falling back to Druid expressions if the planner is unable to perform that optimization.
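
For example, matching rows whose multi-value column a contains both 'x' and 'y' might be expressed as follows (a sketch; the exact array literal syntax is subject to the Calcite integration):

```sql
SELECT COUNT(*)
FROM my_datasource
WHERE ARRAY_CONTAINS(a, ARRAY['x', 'y'])
```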

Odds and Ends

There are potentially some discrepancies in the behavior of native handling of multi-value dimensions if issues #4195 and #5897 are still valid, which indicate that during aggregation top-n ignores null values while group-by does not, producing different results, which is likely not expected. If this is still an issue it should be addressed to make multi-value dimensions more self-consistent, especially since in SQL a group-by can become a top-n if a limit is added and approximate results are allowed.

Rationale

While machinery could have been put in place to make multi-value dimensions masquerade solely as true SQL-compatible array types and represent them as such, they are already partially supported by Druid SQL, so forcing that change and breaking backwards compatibility does not seem a good option. Furthermore, having wildly different default behavior between native queries and SQL would not be intuitive.

Instead, this proposal argues that the best path forward is for multi-value dimensions to continue to behave as string dimensions in SQL, but also introduce a handful of mechanisms to Druid SQL and the virtual column expression language to interact with them as array types as well. The resulting enhancements will make both native and SQL queries against multi-value dimensions much more powerful than they currently are.

Future work

NEST Function

The opposite of the UNNEST function that many SQL dialects support, to allow aggregation on single values like the default Druid behavior.

| function | description |
| --- | --- |
| `nest(arr,[delimiter])` | converts a multi-value string dimension into a single, combined value for the purposes of aggregation with group-by or top-n queries. Internally this will aggregate as an expression virtual column which is joined into a string with `array_to_string` and transformed back into an array prior to returning the results. `delimiter` is optional, with a default of `','` |
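
Usage might then look like the following sketch, grouping on the complete array rather than its individual values (datasource and column names are hypothetical):

```sql
SELECT NEST(tags, ','), COUNT(*)
FROM my_datasource
GROUP BY NEST(tags, ',')
```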

SQL Sub-Query Support for Multi-Value String Dimensions

Absent from the SQL portion of this proposal is an analog to map, filter, fold, and similar 'lambda' functions, because SQL doesn't have this class of functions. However, subqueries against a multi-value column, which a handful of implementations support, are a rough analog of the behavior that map and filter provide to the expression language, so providing support in SQL to map array column subqueries to these functions will be needed for complete parity between SQL queries and native queries.
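
Purely as an illustrative sketch (no concrete syntax is proposed here, and UNNEST/ARRAY subquery support does not yet exist in Druid SQL), such a mapping might take a shape like:

```sql
-- roughly the behavior of filter((x) -> x LIKE 'a%', tags),
-- expressed as a per-row subquery over the values of the column
SELECT ARRAY(SELECT tag FROM UNNEST(tags) AS t (tag) WHERE tag LIKE 'a%')
FROM my_datasource
```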

Native Array Typed Columns

The introduction of array types and multi-value support to the expression language will lay the foundation to introduce true SQL ARRAY types in the future, should we choose. Much of the existing multi-value string dimension machinery could be re-used and expanded to add native array-typed columns. Many of the functions that the improved multi-value string dimensions will support could likely be optimized greatly if explicit array-typed columns are created, and native representations of long/float/double arrays would likely be much more space efficient than hijacking multi-value string dimensions for this purpose.


gianm commented Apr 23, 2019

I buy the general idea (especially trying to make this an extension of existing multi-value dimension functionality). The fact that multi-value dimensions are created opportunistically (as arrays are detected), and not through any specific user-driven configuration, I think means it is best to make it possible for expressions written assuming singly-valued strings to be automatically lifted to work on multi-value strings. The idea around automatically wrapping in a map is similar to how extractionFns work on multi-value dimensions, so there is a symmetry there.

> Expressions which do not have array functions, but have multi-value columns as input, will be implicitly wrapped in a map expression created using the column identifier as the lambda identifier, the actual expression as the body, and the column as the array input; e.g. an Expr with a multi-value identifier x becomes map((x) -> Expr, x). Expression filters which do not contain array functions but have multi-value columns as input will use the any function instead of map, to mimic native Druid filter behavior.

  1. What happens if an expression has array functions, but a multi-valued input x is sometimes used by array functions and sometimes used by non-array functions? (We could disallow this, I suppose, on the grounds that the writer of the expression should be able to decide if they want to treat x as an array or not.)
  2. What happens if an expression has no array functions, but there are two multi-valued inputs x and y? (Do we map Expr over all possible pairings of values from x and y? Do we return… null? Do we return… some N-dimensional matrix?)
  3. What happens if an expression uses array functions for a multi-valued input x but doesn't for multi-valued inputs y and z?

> Druid SQL will still have some limitations of when filtering on multi-value string dimensions compared to its native counterpart, namely the 'contradictory' filter problem illustrated in #4638, a = 'x' AND a = 'y'. Calcite will correctly realize that this is invalid syntax, and the query will result in a validation failure.

I think what will really happen is Calcite will 'optimize' this filter into FALSE and plan the query into a no-op.

> All of the expression language array functions will be mirrored in SQL. In addition, to provide SQL the ability to aggregate with multi-value dimensions as though they are arrays, a nest function, the opposite of unnest that many SQL dialects support to allow aggregations of single values like the default Druid behavior.

Will there be issues if we decide to introduce real SQL array types in the future? Presumably, these expression array functions will be written such that they apply to VARCHAR (and numeric types as well?). Will they conflict with honest-to-goodness SQL array functions or is there a path to supporting both?

clintropolis (Member, Author) commented:

These are very good questions, and they are missing from the proposal because for cases like these I am not quite certain how to handle them yet, so thanks for bringing it up. 🤘

> What happens if an expression has array functions, but a multi-valued input x is sometimes used by array functions and sometimes used by non-array functions? (We could disallow this, I suppose, on the grounds that the writer of the expression should be able to decide if they want to treat x as an array or not.)

My current prototype doesn't have anything that explicitly checks for this, so the behavior is probably unexpected and strange from the user's perspective. What it does right now is notice the outer x that was used in a non-array function without a map, and then wrap the whole expression in a map, which means the array functions would be applied to auto-cast single element arrays of the mapped value instead of the source array. This is probably not what a user would expect, so I think we should check for these mixed usages and consider it a malformed expression and some sort of error.

And while we could probably handle this by rewriting the expression tree appropriately to only map the sub-expressions where that makes sense, or in such a way that it doesn't overload the array identifier, I'm not sure that we should allow this since it seems a bit too magical for my taste. I think it's fair that if the user explicitly expresses an input as having multiple values then it must be treated as multi-valued for all usages in the same expression, and it's invalid otherwise.

> What happens if an expression has no array functions, but there are two multi-valued inputs x and y? (Do we map Expr over all possible pairings of values from x and y? Do we return… null? Do we return… some N-dimensional matrix?)

At the time I opened the proposal my prototype would throw an exception with explicit language in it, because I wasn't sure what to do yet. I think the first 2 suggestions you have are the only ones which make sense to me. I haven't implemented expression arrays to support any sort of nesting, only flat constant arrays, so the matrix idea is out of the picture for me at least; I also see nothing obvious to do with such a matrix, even if we could compute it, if it made it out as the final evaluated result.

After sleeping on it and doing some experiments, I think the most useful thing would be to do the cartesian product of all multi-value inputs which are not part of an array function, done by either extending the map function to allow it to just implicitly do this if given more than one argument, or maybe better an explicit cartesian_map function.
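
For example (hypothetical multi-value inputs x = ['a', 'b'] and y = ['1', '2']), an expression like concat(x, y) might be rewritten to:

```
cartesian_map((x, y) -> concat(x, y), x, y)   => ['a1', 'a2', 'b1', 'b2']
```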

> What happens if an expression uses array functions for a multi-valued input x but doesn't for multi-valued inputs y and z?

I think we could handle this case in the same manner as more than one multi-value input without any array function use (question 2); at least the way I have things set up currently, I would have to add an explicit check to stop this from working as you might expect, so maybe it would be nice to leave it that way? I feel less strongly about asserting that, because a user acknowledged one column as an array, they must be aware of all other columns that are array typed in the expression.

> I think what will really happen is Calcite will 'optimize' this filter into FALSE and plan the query into a no-op.

Yes, you're correct; I've updated the proposal to reflect this behavior.

> Will there be issues if we decide to introduce real SQL array types in the future? Presumably, these expression array functions will be written such that they apply to VARCHAR (and numeric types as well?). Will they conflict with honest-to-goodness SQL array functions or is there a path to supporting both?

I do not believe there needs to be a conflict. I think it should be possible to write the SQL functions such that they accept either VARCHAR or ARRAY, and then validate that any VARCHAR inputs are marked as multi-value in the information schema. At the expression layer the array functions are intended to handle both multi-value strings and true array-typed columns as arrays internally, so that shouldn't be an issue. Ideally, at the SQL level I'd definitely like these array functions to support both our multi-value string dimensions and real SQL arrays.

Admittedly, I need to do a bit more experimentation with the SQL layer to be certain that this is possible, and that there are no validation or syntax gotchas from treating them as VARCHAR that would make working with ARRAY through the same functions painful, so we might need to reconsider if there happens to be a conflict. In that case, I would expect to rename the functions for multi-value VARCHAR at the SQL layer, to reserve the suggested names for true SQL array types since some of them are standard function names, and then plumb them both to the same set of Druid expression array functions internally.


gianm commented Apr 28, 2019

> After sleeping on it and doing some experiments, I think the most useful thing would be to do the cartesian product of all multi-value inputs which are not part of an array function, done by either extending the map function to allow it to just implicitly do this if given more than one argument, or maybe better an explicit cartesian_map function.

This seems reasonable to me. Basically it's saying that the user has the ability to declare that they want to treat an input as an array, in which case they use array functions. Or they can use scalar functions on it, in which case we map over all the inputs (possibly the cartesian product of all multi-value inputs if there are more than one).


himanshug commented Apr 30, 2019

#7574 may be slightly related in that it could allow implementing a virtual column that gives a view of an existing multi-value string column as a String[] type column and works as efficiently as if String[] were a native Druid column type.

clintropolis (Member, Author) commented:

> #7574 may be slightly related in that it could allow implementing a virtual column that gives a view of an existing multi-value string column as a String[] type column and works as efficiently as if String[] were a native Druid column type.

I think that would be a useful optimization for the cases where the user explicitly acknowledges a column as a multi-value string dimension, and think this should probably be explored in the future. If we ever add native handling at the processing layer of array types (i.e. group by entire array) and/or native array column types, then I think this could also be used to adapt multi-value dimensions into native array handling as well and just bypass the expressions entirely in some cases?

Beyond arrays, I find the changes in #7574/#7618 and the semi related #7633 very rad in general, since it seems to me you can now effectively make any sort of custom column type with a complex type serde, and then expose it as native Druid types via virtual columns.
