[SPARK-31372][SQL][TEST] Display expression schema for double check. #28194
Conversation
Test build #121141 has finished for PR 28194 at commit
Test build #121140 has finished for PR 28194 at commit
retest this please
Test build #121145 has finished for PR 28194 at commit
-- !query
SELECT forall(array(2, null, 8), x -> x % 2 == 0)
-- !query schema
struct<forall(array(2, CAST(NULL AS INT), 8), lambdafunction(((namedlambdavariable() % 2) = 0), namedlambdavariable())):boolean>
Actually, we do not need all of them. How about just showing the first one?
OK, we only need the first one; the other SQL queries and schemas add no more value. Thanks for your reminder.
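The idea above can be sketched as follows. The `examples` text mimics the format of an `ExpressionInfo` examples block, but the parsing below is purely illustrative and is not the actual suite code:

```scala
// Illustrative sketch (not the actual suite code): keep only the first
// "> SELECT ..." query from an ExpressionInfo-style examples block.
val examples =
  """
    Examples:
      > SELECT abs(-1);
       1
      > SELECT abs(2);
       2
  """

val firstExample: Option[String] = examples
  .split("\n")
  .map(_.trim)
  .find(_.startsWith("> SELECT"))           // take the first example only
  .map(_.stripPrefix("> ").stripSuffix(";"))

assert(firstExample.contains("SELECT abs(-1)"))
```

Only that first query would then be run and recorded in the golden file.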
Test build #121165 has finished for PR 28194 at commit
retest this please
Test build #121194 has finished for PR 28194 at commit
retest this please
Test build #121177 has finished for PR 28194 at commit
Test build #121205 has finished for PR 28194 at commit
Test build #121245 has finished for PR 28194 at commit
retest this please
Test build #121258 has finished for PR 28194 at commit
Discussed this with @beliefer offline. Let's generate an MD file as the golden file. We can build an MD table to present the schema info.
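The proposal might render each expression as one Markdown table row. The sketch below is illustrative; the exact column layout is an assumption modeled on the golden-file excerpt shown later in this thread:

```scala
// Illustrative sketch: one Markdown table row per expression.
// The column layout (class, function, SQL, schema) is an assumption.
case class QueryOutput(className: String, funcName: String, sql: String, schema: String)

def toMdRow(q: QueryOutput): String =
  s"| ${q.className} | ${q.funcName} | ${q.sql} | ${q.schema} |"

val abs = QueryOutput(
  "org.apache.spark.sql.catalyst.expressions.Abs",
  "abs",
  "SELECT abs(-1)",
  "struct<abs(-1):int>")

assert(toMdRow(abs) ==
  "| org.apache.spark.sql.catalyst.expressions.Abs | abs | SELECT abs(-1) | struct<abs(-1):int> |")
```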
Test build #121359 has finished for PR 28194 at commit
retest this please
sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
}

java.nio.file.Paths.get(sparkHome,
In BenchmarkBase.main, we generate the path by simply doing
val file = new File(s"benchmarks/$resultFileName")
if (!file.exists()) {
file.createNewFile()
}
Does it work here?
It works too.
Then why do we write such complex code here to generate the path?
We need to create the parent dir sql-functions first.
I can replace the code below
java.nio.file.Paths.get(sparkHome,
"sql", "core", "src", "test", "resources", "sql-functions").toFile
with
val file = new File(s"$sparkHome/sql/core/src/test/resources/sql-functions")
if (!file.exists()) {
  file.mkdir()
}
But I am neutral about this change.
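As a sanity check of the two alternatives discussed here, the sketch below builds the same directory path both ways, using a temp dir as a stand-in for `sparkHome`. One design note: `mkdirs()` also creates missing parent directories, while `mkdir()` requires the parent to already exist.

```scala
import java.io.File
import java.nio.file.{Files, Paths}

// Stand-in for sparkHome so the sketch is self-contained.
val sparkHome = Files.createTempDirectory("spark-home").toFile.getAbsolutePath

// Alternative 1: Paths.get with one segment per path component.
val viaPaths: File = Paths.get(
  sparkHome, "sql", "core", "src", "test", "resources", "sql-functions").toFile

// Alternative 2: string interpolation; mkdirs() creates missing parents too.
val viaFile = new File(s"$sparkHome/sql/core/src/test/resources/sql-functions")
if (!viaFile.exists()) {
  viaFile.mkdirs()
}

// Both spell out the same location.
assert(viaPaths.getAbsolutePath == viaFile.getAbsolutePath)
assert(viaFile.isDirectory)
```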
/** A single SQL query's SQL and schema. */
protected case class QueryOutput(
    className: String,
nit: 4-space indentation for parameters.
OK.
sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala
}

// Compare results.
assertResult(expectedOutputs.size, s"Number of queries should be ${expectedOutputs.size}") { |
nit: isn't it simply assert(expectedOutputs.size == outputs.size, error message ...)?
OK
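The simplification suggested above, with stand-in sequences so the snippet is self-contained (the real suite compares its parsed query outputs):

```scala
// Stand-ins for the suite's parsed expected and actual outputs.
val expectedOutputs = Seq("q1", "q2", "q3")
val outputs = Seq("q1", "q2", "q3")

// assertResult(expectedOutputs.size, msg) { outputs.size } becomes:
assert(expectedOutputs.size == outputs.size,
  s"Number of queries should be ${expectedOutputs.size}")
```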
Test build #122044 has finished for PR 28194 at commit
Test build #122049 has finished for PR 28194 at commit
Test build #122065 has finished for PR 28194 at commit
Test build #122063 has finished for PR 28194 at commit
| org.apache.spark.sql.catalyst.expressions.Abs | abs | SELECT abs(-1) | struct<abs(-1):int> |
| org.apache.spark.sql.catalyst.expressions.Acos | acos | SELECT acos(1) | struct<ACOS(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Acosh | acosh | SELECT acosh(1) | struct<ACOSH(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Add | + | SELECT 1 + 2 | struct<(1 + 2):int> |
One case where running multiple examples is useful: date + interval will be replaced by DateAddInterval during analysis, and it's better to test it as well. We can fix it in a follow-up.
I checked the code of DateAddInterval:
override def sql: String = s"${left.sql} + ${right.sql}"
The alias of DateAddInterval is consistent with Add.
If we treat DateAddInterval as an internal implementation of Add, this does not affect what users see.
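The point about consistent aliases can be illustrated with simplified stand-ins (these are not the real Spark classes): both expressions render their SQL the same way, so swapping one for the other during analysis does not change the user-visible column name.

```scala
// Simplified stand-ins for Spark expressions (illustrative only).
trait Expr { def sql: String }
case class Literal(s: String) extends Expr { def sql: String = s }
case class Add(left: Expr, right: Expr) extends Expr {
  def sql: String = s"${left.sql} + ${right.sql}"
}
case class DateAddInterval(left: Expr, right: Expr) extends Expr {
  // Same rendering as Add, so the alias stays consistent.
  def sql: String = s"${left.sql} + ${right.sql}"
}

val d = Literal("DATE '2020-01-01'")
val i = Literal("INTERVAL '1' DAY")

// Replacing Add with DateAddInterval leaves the alias unchanged.
assert(Add(d, i).sql == DateAddInterval(d, i).sql)
```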
thanks, merging to master/3.0!
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers always forget to use this improvement. We need stronger guarantees that the aliases output by built-in functions are correct. This PR extracts the SQL from the examples of expressions and outputs the SQL and its schema into one golden file. By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.

### Why are the changes needed?
Ensure that the output alias is correct.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28194 from beliefer/check-expression-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1d1bb79)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan @gatorsmile @maropu Thanks for all your help!
@beliefer could you do a follow-up? It seems we just need to update the golden file for 3.0.
To recover 3.0 asap, I opened a PR to fix it.
if (regenerateGoldenFiles) {
  val missingExampleStr = missingExamples.mkString(",")
  val goldenOutput = {
    s"<!-- Automatically generated by${getClass.getSimpleName} -->\n" +
nit: A space is needed before ${getClass.getSimpleName}. We can fix it when updating this file next time.
Let me do it.
#28164
Ah, I see. Looks okay.
@@ -0,0 +1,341 @@
<!-- Automatically generated byExpressionsSchemaSuite --> |
Why is this file md specifically?
See #28194 (comment). I guess it's easier for humans to read the table.
Okay, but it seems more difficult to track the diff. See https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-sbt-hadoop-2.7-hive-2.3/431/testReport/junit/org.apache.spark.sql/ExpressionsSchemaSuite/Check_schemas_for_expression_examples/
Hopefully we can improve the diff here ...
This should be a row-by-row diff, @beliefer can you help to fix it?
@cloud-fan @HyukjinKwon
This checked exception is thrown when the number of expected SQL statements and the number of actual SQL statements are not equal.
Could I avoid outputting all the SQL in the checked exception?
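One way to keep the failure concise is to compare only the counts and put the hint in the message, instead of letting the full outputs be dumped. This is a sketch with stand-in sequences, not the actual suite code:

```scala
// Stand-ins for the parsed golden-file blocks and the newest blocks.
val expectedOutputs = Seq.fill(332)("block")
val outputs = Seq.fill(332)("block")

// Comparing only the sizes keeps the failure message short
// instead of dumping both full outputs.
assert(outputs.size == expectedOutputs.size,
  s"Expected ${expectedOutputs.size} blocks in result file but got ${outputs.size}. " +
    "Try regenerating the result files.")
```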
… that easy to track the diff

### What changes were proposed in this pull request?
This PR follows up #28194, as discussed at https://github.com/apache/spark/pull/28194/files#r418418796. It improves `ExpressionsSchemaSuite` so that the diff is easy to track. Although `ExpressionsSchemaSuite` at line https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala#L165 only wants to compare the total size of the expected output with that of the newest output, the scalatest framework outputs extra information containing the full content of both the expected and the newest output. This PR avoids that issue. After this PR, the exception looks like below:

```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info]   340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```

### Why are the changes needed?
Make the exception more concise and clear.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(The same commit was cherry-picked to branch-3.0 as b949420, signed off by HyukjinKwon.)
…ment in output schema

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):

```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```

After this PR (a column name fixed):

```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from apache#28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

### Why are the changes needed?
To make the output schema deterministic.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added unit tests.

Closes apache#28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>