Add syntax support for PARTITIONED BY/CLUSTERED BY in INSERT queries #12163
Conversation
LGTM failed due to
LGTM post review comments.
sql/pom.xml (Outdated)
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.7.25</version>
Can this come from the parent pom?
Hi, I remembered why I didn't use the version from the parent pom. There is a parent class variable `<slf4j.version>1.7.12</slf4j.version>`; however, the slf4j-api is using a different version in the parent pom, which is hardcoded.
We shouldn't use different versions of slf4j-api in the sql module vs. the rest of the modules: it should be `<scope>provided</scope>` with no version specified. That way, we'll only ship one copy of the jar.
If you do that change, does it work out OK or does something bad happen?
sql/pom.xml (Outdated)
<plugin>
  <groupId>com.googlecode.fmpp-maven-plugin</groupId>
  <artifactId>fmpp-maven-plugin</artifactId>
  <version>1.0</version>
Pushing this to the parent pom would be helpful.
Hey, I searched the usages and it is a one-off plugin only used here. That's why I'm refraining from pushing it to the parent pom. WDYT?
For hygiene I think it's good to define all the plugin version numbers in pluginManagement of the main pom. The actual configuration should stay in this module, since it's specific to this particular module.
PTAL at the poms once again, updated with the relevant changes.
# List of new keywords. Example: "DATABASES", "TABLES". If the keyword is not a reserved
# keyword add it to 'nonReservedKeywords' section.
keywords: [
Make it non-reserved?
Yeah, probably makes sense for this to be a non-reserved keyword.
It would be ideal for "CLUSTER" to be non-reserved, since then it won't break people's queries that use CLUSTER/cluster as an identifier (datasource name, column name, etc.). But on testing it out, the test case `INSERT INTO dst SELECT * FROM foo CLUSTER BY dim1` breaks. This is because the parser assumes it is written like `INSERT INTO dst SELECT * FROM foo AS CLUSTER ...`, since CLUSTER can be used to name things. On the other hand, `INSERT INTO dst SELECT * FROM foo PARTITION BY 'day' CLUSTER BY dim1` won't fail.
FWIW, `PARTITION` is also a reserved keyword in Druid.
Reverting the change for now so that the build passes.
Got it. In that case let's keep it as a reserved keyword.
One thing that @abhishekagarwal87 suggested is to have a flag (in PlannerContext or PlannerConfig) that can be used to switch between the custom and the default parser. That would make sense in case we don't want existing queries with `cluster` identifiers to fail.
IMO, we should always use our own version of the parser. It gives us more control over our dialect, which we'll find useful for other things in the future.
I'm not sure if parser flags are a good idea or not, but if we do have them, IMO the best approach is to have "versions" of the entire parser, i.e., the flag would be `druid.sql.parser.version=2` or `druid.sql.parser.version=latest` rather than controlling specific features.
But, again, I'm not sure if we need this or not.
New keywords are added from time to time in other databases' SQL dialects. How do they typically handle this?
Having two parsers creates significant usability and testing issues: all tests would have to be run against both. Users would need a per-query way to switch parsers, which means the poor user has to understand which features are in each one, a hard task even for developers. Automated tools that generate Druid SQL would not know what to do.
So it's better to get the solution right within a single parser.
Yeah, that makes sense. I agree we should go with a single dialect; and in that context I like going with PARTITIONED BY and CLUSTERED BY instead of PARTITION BY and CLUSTER BY, like you mentioned in #12163 (comment), because it will be less likely to collide with existing column names.
We'd need to add something to the release notes about the new keywords.
The parser flag is supposed to be at a Druid cluster level. It's definitely not great, but it's a workaround if we don't want to use a new reserved keyword. But if we do, using `CLUSTERED BY` instead of `CLUSTER BY` will definitely make the chance of affecting queries very low, so +1 to that.
Yep, I might have been unclear in the original comment. Its general usage is not to be overridden per query, but to act as a fallback in case there is some constraint due to which these changes would break existing queries. But going the `PARTITIONED BY`, `CLUSTERED BY` route, it makes sense to not consider that case, since maintaining and deprecating the older parser would be unnecessary overhead.
  }
}
// Creates a new SqlOrderBy query, which may have our CLUSTER BY overwritten
query = new SqlOrderBy(
I am wondering if the SQL layer is the right place to do this transformation, or whether we should just leave it to the native layer to use orderBy and clusterBy together. My assumption for this transformation is that we would like the query results to be ordered on the same dimensions that we want to use to arrange data in segments. Maybe it's something that QueryMaker should decide?
Currently, whatever you enter in the ORDER BY is getting converted to an ordering spec. I think the aim of the clause is to add syntactic sugar rather than add different semantics.
(I am unfamiliar with Hive.) I looked up Hive, which implements both of these clauses (stackoverflow), and it seems that both of them try to achieve similar functionality, albeit with different results and different semantics.
But yeah, then it makes sense to implement both the CLUSTER BY and ORDER BY clauses and punt the actual logic to QueryMaker or a DruidRel.
The trouble I see implementing it is:
a) We will need to convert it to a RelNode in SqlToRelConverter, and then pass it down the query stack. (This is because `CLUSTER BY` can be a reference to a column in the statement, and we probably need to use it.) I am unsure if this is possible without making changes to the SqlToRelConverter itself.
b) A new Query of type Query.INSERT would need to be created to accommodate a "ClusteringSpec" in addition to the "OrderingSpec".
I think it's OK for now to continue using the native ordering specs to represent cluster by, and to disallow having ORDER BY and CLUSTER BY at the same time. Also, IMO, we should not allow ORDER BY on the SELECT for an INSERT. The error could be something like "Cannot have ORDER BY on an INSERT query; use CLUSTER BY instead."
That way, we can keep discussing if we should allow ORDER BY and CLUSTER BY to be used together, and if so, what should happen. Without blocking this PR.
Thanks everyone for reviewing the PR. The changes have been made per the newer proposal for the supported clauses.
Looking mostly good to me; a few remaining comments about unparse and also about the error messages.
{
  super.unparse(writer, leftPrec, rightPrec);
  writer.keyword("PARTITIONED BY");
  writer.keyword(getPartitionedBy().toString()); // TODO: Make it cleaner by directly unparsing the SqlNode
Does this `unparse` method get used anywhere? If so, we need to fix this, because it's going to be incorrect in a bunch of cases. Or, if it doesn't get used anywhere, IMO it's better to throw UnsupportedOperationException rather than have a broken implementation. At least that way it's clear that the method isn't reliable.
Updated the method with the proper implementation. Here are a few unparse examples after the change:

```sql
INSERT INTO `dst`
(SELECT *
FROM `foo`) PARTITIONED BY ALL TIME

INSERT INTO `dst`
(SELECT *
FROM `view`.`aview`) PARTITIONED BY `TIME_FLOOR`(`__TIME`, 'PT2H')

INSERT INTO `druid`.`dst`
(SELECT `__time`, FLOOR(`m1`) AS `floor_m1`, `dim1`, CEIL(`m2`)
FROM `foo`) PARTITIONED BY FLOOR(`__TIME` TO DAY) CLUSTERED BY 2 `dim1` DESC CEIL(`m2`)
```
The `unparse` method is important since it is used in `SqlNode`'s `toString` method, so it gets used while debugging to view the original node as a SQL string. (Also, it gets used in other unparse methods; that's not relevant to DruidSqlInsert currently since it's a top-level node, but it might cause, say, EXPLAIN on a DruidSqlInsert to fail, which is going to be implemented.) Therefore I would say it is important to get it right.
writer.keyword("PARTITIONED BY"); | ||
writer.keyword(getPartitionedBy().toString()); // TODO: Make it cleaner by directly unparsing the SqlNode | ||
if (getClusteredBy() != null) { | ||
writer.sep("CLUSTERED BY"); |
Why `sep` rather than `keyword`? `sep` is supposed to be for list separators, which doesn't seem right here.
Fixed, oversight on my part
public static Granularity convertSqlNodeToGranularity(SqlNode sqlNode) throws ParseException
{
  ...
  final String genericParseFailedMessageFormatString = "Unable to parse the granularity from %s. Expected HOUR, DAY, MONTH, YEAR, ALL TIME, FLOOR function or %s function";
"parse the granularity" sounds strange to me, grammatically. How about this:
Encountered <%s> after PARTITIONED BY. Expected HOUR, DAY, MONTH, YEAR, ALL TIME, FLOOR(__time TO <timeUnit>) or %s(__time, '<period>').
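For reference, examples of the granularity forms this message describes, assuming the `%s` placeholder resolves to TIME_FLOOR per the check below (the period in the last example is illustrative):

```sql
INSERT INTO dst SELECT * FROM foo PARTITIONED BY DAY
INSERT INTO dst SELECT * FROM foo PARTITIONED BY ALL TIME
INSERT INTO dst SELECT * FROM foo PARTITIONED BY FLOOR(__time TO DAY)
INSERT INTO dst SELECT * FROM foo PARTITIONED BY TIME_FLOOR(__time, 'PT2H')
```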
"FLOOR".equalsIgnoreCase(operatorName) | ||
|| TimeFloorOperatorConversion.SQL_FUNCTION_NAME.equalsIgnoreCase(operatorName), | ||
StringUtils.format( | ||
"PARTITIONED BY clause can only parse FLOOR and %s functions.", |
The verb "parse" here seems out of place. The clause doesn't do the parsing; the parser does the parsing.
How about including the same error message as above? I don't think this message needs to be different; the earlier one makes total sense here.
List<SqlNode> operandList = sqlCall.getOperandList();
Preconditions.checkArgument(
    operandList.size() == 2,
    StringUtils.format("Invalid number of arguments passed to %s in PARTIITONED BY clause", operatorName)
How about:
%s in PARTITIONED BY must have two arguments.
(Even if you don't change this, PARTIITONED is misspelled, so please fix that at least.)
private ParsedNodes(
    @Nullable SqlExplain explain,
    @Nullable SqlInsert insert,
I see. But we'll need EXPLAIN on INSERT to work eventually, so we'll need to clean this up at some point.
@@ -765,13 +786,65 @@ static ParsedNodes create(final SqlNode node) throws ValidationException
if (query.getKind() == SqlKind.INSERT) {
  insert = (SqlInsert) query;
  query = insert.getSource();

  // Processing to be done when the original query has either of the PARTITION BY or CLUSTER BY clause
  if (insert instanceof DruidSqlInsert) {
Makes sense to me to do it in a follow up. We do need to do it, though, so EXPLAIN on INSERT works properly.
  period = new Period(granularityString);
}
catch (IllegalArgumentException e) {
  throw new ParseException(StringUtils.format("Unable to create period from %s", granularitySqlNode.toString()));
Suggested change:

```diff
- throw new ParseException(StringUtils.format("Unable to create period from %s", granularitySqlNode.toString()));
+ throw new ParseException(StringUtils.format("Unable to create period from %s", granularityString));
```
I used the other form because then I won't need to explicitly put the string in quotes. Plus it represents exactly what the user typed, not the one curated for our use.
  }
}
catch (JsonProcessingException e) {
  throw new ValidationException("Unable to serialize partition granularity.");
we should log the exception stack trace.
Great work @LakshSingla. Almost there :) I had some comments on the error messages. My reasoning for the rephrasing is that users don't know about the conversion or translation of the parameters they are passing. So if a parameter X cannot be converted to Y, the error message can simply be "X is an invalid input. Please refer to the documentation." instead of "Unable to convert X to Y".
Thanks @abhishekagarwal87 @gianm for another pass at the PR. I have updated the clumsy error messages in the DruidSqlParserUtils to be more affirmative, and to hide the 'parse' that happens behind the scenes. I hope I have addressed the latest and the unresolved comments. PTAL at the updated PR. Thanks.
}
catch (Exception e) {
  log.error(e, StringUtils.format("Unable to convert %s to a valid granularity.", sqlNode.toString()));
If this error occurs, it's due to user error on one particular query, so it shouldn't be an ERROR that appears in the server logs. That should be reserved for really important stuff that cluster operators need to worry about.
However, there should be some way to get the stack trace, and we generally don't want to return it directly to users of the SQL API.
So in a situation like this I'd do `log.debug` instead of `log.error`, so server admins can see the stack trace if they really need to, but won't see it by default. That balances all of the relevant desires: server admins shouldn't see "user error" stuff by default, and users shouldn't see stack traces as part of normal error messages.
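A sketch of the suggested pattern, assuming Druid's Logger exposes a Throwable-first `debug` overload mirroring the `error` one visible in the diff above (the rethrow shown here is illustrative):

```java
catch (Exception e) {
  // Stack trace goes to the server log at DEBUG level only, for admins who need it.
  log.debug(e, "Unable to convert %s to a valid granularity.", sqlNode.toString());
  // The user gets a clean error message with no stack trace attached.
  throw new ValidationException(e.getMessage());
}
```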
Updated with the correct level, thanks for the explanation.
LGTM - thanks @LakshSingla!
@LakshSingla Could you please take a look at the "intellij inspections" check, and the CalciteInsertDmlTest failure in the "other modules test"? The issues there look legit.
#12163 makes PARTITIONED BY a required clause in INSERT queries. While this is required, if a user accidentally omits the clause, it emits a JavaCC/Calcite error, since it's syntactically incorrect. The error message is cryptic. Since it's a custom clause, this PR aims to make the clause optional on the syntactic side, but move the validation to DruidSqlInsert where we can surface a friendlier error.
@LakshSingla The INSERT-SELECT query is only supported in the tests, right? I tried to execute an INSERT-SELECT query on the latest master branch, and it said: Cannot execute INSERT queries. When I checked the code, I found where the message was thrown from.
@FrankChen021 This PR aims to add only the support for INSERT statements with the custom PARTITIONED/CLUSTERED syntax. To actually run an INSERT query, a supported QueryMaker/QueryMakerFactory is required, which is only present in the tests.
@LakshSingla Thanks for the clarification. I thought the INSERT query was supported; looks like my memory is wrong.
NOTE: This PR and the description were originally written while keeping the PARTITION BY and CLUSTER BY clauses in mind. The changes might not be reflected in the description and the comments.
This PR aims to add parser changes for supporting `PARTITIONED BY` and `CLUSTERED BY`, as proposed in issue #11929.

Description
This PR adds the ability to execute queries of the form shown below.
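(The original example block did not survive the page extraction; based on the unparse examples and clauses discussed in this thread, the supported shape is along these lines, with granularity and column names being illustrative:)

```sql
INSERT INTO dst
(SELECT * FROM foo)
PARTITIONED BY FLOOR(__time TO DAY)
CLUSTERED BY dim1
```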
Since `PARTITION BY` and `CLUSTER BY` are non-standard SQL keywords whose support is not provided in Apache Calcite (which is used for parsing the SQL queries), some changes are made to the `pom.xml` of `druid-sql` to accommodate the changes made to the parser.

Proposal
Calcite is a general-purpose library, and while it does support parsing of SQL that should cover most of the dialects of SQL, one can also provide one's own SQL parser to it, to parse custom SQL statements.
Pre Read
(This is my current understanding and might have some flaws in it, please feel free to point out where I am wrong)
Provision in Calcite for extending the grammar
Calcite provides the option of setting the parser factory in `SqlParser.Config`, which can be anything that implements `SqlParserImplFactory`. Therefore, as far as Calcite is concerned, one can rewrite the whole SQL parser from scratch in Java and pass it to the `SqlParser.Config` (which is not ideal, and tedious). Following is a minimal code snippet which achieves this:
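(The original snippet was lost in extraction; this is a reconstruction sketch using Calcite's ConfigBuilder API, and it assumes the generated parser exposes the conventional static `FACTORY` field.)

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class CustomParserDemo
{
  public static SqlNode parse(String sql) throws SqlParseException
  {
    // Plug a custom SqlParserImplFactory into the parser config.
    // DruidSqlParserImpl is the class JavaCC generates from the combined Parser.jj;
    // the FACTORY field is the convention used by Calcite's generated parsers.
    SqlParser.Config config = SqlParser.configBuilder()
        .setParserFactory(DruidSqlParserImpl.FACTORY)
        .build();
    return SqlParser.create(sql, config).parseQuery();
  }
}
```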
For providing the default parser, Calcite uses JavaCC, a parser generator: the grammar is specified in EBNF (`Parser.jj`), which gets compiled to a Java class. Ideally, we would like to add or extend our custom grammar production rules in this `Parser.jj`. Luckily, Calcite does support extending this Parser.jj via the use of Apache FreeMarker (FTL) templates.

Therefore, in a nutshell, we would like to recompile the default parser by providing our custom rules (through templates), and then pass it to the `SqlParser.Config`.

Requirements
We should not copy the `Parser.jj` manually and modify it directly. This is because a lot of changes happen on Calcite's main branch, and we would ideally want to keep in line with those changes without having to copy and merge the `Parser.jj` from upstream every time. Therefore, we want to reuse the templates as much as possible.

Modifications to the POM
The stages added to the build process look a lot similar to the ones used by the dask-sql project, which also extends the grammar for its own purpose. Therefore, we can use other "druidesque" methods/plugins to achieve the following, but the overall steps remain the same:

1. Copy the required files to the `target/codegen` directory:
   a. Copy the default `Parser.jj` from Calcite's jar to the build directory.
   b. Copy our FTL files, which contain our custom rules, and the `config.fmpp` files, which contain the information to be plugged into the parser.
2. Generate the combined `Parser.jj`, which contains the default Calcite Parser.jj along with our custom rules, using the fmpp plugin.
3. Compile the combined `Parser.jj` into `DruidSqlParserImpl.java` (see the sketch after this list).
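A rough sketch of the plugin wiring for steps 2 and 3 (the fmpp coordinates appear in the diff earlier in this thread; the javacc plugin, goal names, and directory parameters here are assumptions, not the PR's exact configuration):

```xml
<!-- Step 2: merge Calcite's Parser.jj with our FTL templates (illustrative config) -->
<plugin>
  <groupId>com.googlecode.fmpp-maven-plugin</groupId>
  <artifactId>fmpp-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals><goal>generate</goal></goals>
      <configuration>
        <cfgFile>src/main/codegen/config.fmpp</cfgFile>
        <templateDirectory>${project.build.directory}/codegen/templates</templateDirectory>
        <outputDirectory>${project.build.directory}/generated-sources/fmpp</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
<!-- Step 3: compile the combined Parser.jj into DruidSqlParserImpl.java (illustrative config) -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>javacc-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals><goal>javacc</goal></goals>
      <configuration>
        <sourceDirectory>${project.build.directory}/generated-sources/fmpp</sourceDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```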
The generated `DruidSqlParserImpl` would be passed to the `SqlParser.Config` for parsing the SQL strings.

Production Grammar added
Class Design
For the custom insert statement, we are creating a class `DruidSqlInsert`, which inherits from `SqlInsert`. The custom class holds the extra information (like the clause attached to the PARTITION BY) while passing the standard information from the main INSERT node to the parent. Therefore, from the custom class, we can extract the extra information as and when required, while passing the same object down for the further conversion to Rel/DruidRel objects. (NOTE: those objects, however, won't have any knowledge of the extra clauses added; that would be manually plugged in.)
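A minimal sketch of that shape (the accessor names appear elsewhere in this thread, but treat the exact constructor and field layout as illustrative):

```java
import org.apache.calcite.sql.SqlInsert;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.SqlNodeList;

public class DruidSqlInsert extends SqlInsert
{
  private final SqlNode partitionedBy;   // granularity, e.g. FLOOR(__time TO DAY)
  private final SqlNodeList clusteredBy; // ordering expressions; null when absent

  public DruidSqlInsert(SqlInsert insertNode, SqlNode partitionedBy, SqlNodeList clusteredBy)
  {
    // Forward the standard INSERT operands to the parent, so the rest of the
    // planner can keep treating this node as a plain SqlInsert.
    super(
        insertNode.getParserPosition(),
        (SqlNodeList) insertNode.getOperandList().get(0), // keywords
        insertNode.getTargetTable(),
        insertNode.getSource(),
        insertNode.getTargetColumnList()
    );
    this.partitionedBy = partitionedBy;
    this.clusteredBy = clusteredBy;
  }

  public SqlNode getPartitionedBy()
  {
    return partitionedBy;
  }

  public SqlNodeList getClusteredBy()
  {
    return clusteredBy;
  }
}
```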
Integration with the DruidPlanner
Once the `String` gets parsed as a SqlNode, we get an instance of `DruidSqlInsert` (if the statement is an INSERT statement with either of the PARTITION BY or CLUSTER BY clauses). The extraction of the custom clauses in the planner looks roughly like the sketch below.
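(A sketch assembled from the diff hunks quoted earlier in this thread; the surrounding method is `ParsedNodes.create`, and the granularity variable name is illustrative.)

```java
if (query.getKind() == SqlKind.INSERT) {
  insert = (SqlInsert) query;
  query = insert.getSource();

  // Processing to be done when the original query has either of the custom clauses
  if (insert instanceof DruidSqlInsert) {
    DruidSqlInsert druidSqlInsert = (DruidSqlInsert) insert;
    // PARTITIONED BY becomes the segment granularity of the ingested data.
    Granularity ingestionGranularity =
        DruidSqlParserUtils.convertSqlNodeToGranularity(druidSqlInsert.getPartitionedBy());
    // CLUSTERED BY columns are folded into the query's ordering (ORDER BY) spec.
  }
}
```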
Caveats to the Design

(Most of the following arise due to how we are extending Apache Calcite's grammar, and the limited number of regions where we can actually make the changes.)
We are not modifying the `INSERT` production rule itself. Instead, we are extending the top-level `statement` production rule to look like the sketch below. This is because in the original Parser.jj, there are a limited number of places wherein we can insert our custom statements.
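(The original grammar block was lost in extraction; this schematic shows the shape. `customInsertStatement` is taken from the discussion below, while the other rule name is a placeholder, not the literal Parser.jj text.)

```
statement :=
    customInsertStatement     // INSERT ... [PARTITIONED BY ...] [CLUSTERED BY ...]
  | defaultCalciteStatement   // SELECT, EXPLAIN, plain DML, and the other stock rules
```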
The direct implication of this is that `EXPLAIN PLAN FOR SELECT * FROM druid.foo CLUSTER BY dim1` doesn't work, since there is no `customExplainStatement` or its equivalent. To make this work, we would have to add our own custom EXPLAIN rule as well.

As a direct consequence of the above, `customInsertStatement` actually overshadows the `insert` statement. That is, any statement that begins with `INSERT` is going to be interpreted as a `customInsertStatement`, irrespective of whether it uses custom clauses or not. Therefore, we have to account for simple `insert` statements in our custom rule.

There is no support for our custom syntax at the `RelNode` level. Conversion from a SQL string to Druid native queries looks like this: `String` -> `SqlNode` -> `RelNode` -> `DruidRel`. The custom syntax is only present till the conversion from String to SqlNode. After that, the custom parameters are extracted in the `ParsedNodes` present in `DruidPlanner`, and not sent further down to the RelNodes/DruidRels. This is because there is no rule for parsing the custom nodes in `SqlToRelConverter`. This possibly implies that once the native query is created, parameters passed in, say, `PARTITION BY` would need to be manually injected before executing the query.

The `CLUSTER BY` and `PARTITION BY` clauses are linked to the `INSERT` statement, as opposed to the `SELECT` statements used inside them. Therefore the query looks like `INSERT INTO dst (SELECT * FROM foo) PARTITION BY 'day' CLUSTER BY dim1` rather than `INSERT INTO dst (SELECT * FROM foo PARTITION BY 'day' CLUSTER BY dim1)`.

Alternative design ideas
If `DruidSqlInsert` were not inheriting from `SqlInsert`, the planner could not process it as a regular insert node. To make the query work, we would then need to replace the query node (in ParsedNodes). The current class design ensures that no replacement of nodes is required.

Note
Newer versions of Calcite have a `default_config.fmpp` which can be used to supply the boilerplate code in `config.fmpp` ([CALCITE-4238] Create a default parser configuration, to reduce redundant information in sub-parsers, calcite#2167). On migrating to a newer version, that file can be used to reduce the code in `config.fmpp` significantly.

PS: Currently the build process is failing due to spotbugs/forbiddenapis being run on the generated parser files, and due to a failing `mvn dependency:analyze` check.

Key changed/added classes in this PR
This PR has: