-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema #28392
Conversation
Minor fix though, how about this? @dongjoon-hyun @HyukjinKwon |
@@ -102,6 +102,8 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed { | |||
} | |||
|
|||
override def freshCopy(): Rand = Rand(child) | |||
|
|||
override def sql: String = "rand()" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm .. shouldn't we print out the seed when it's explicitly given?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, I see... I just want to reomve a non-deterministic number in a column name when no argument given. I need more time to think of it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe the current output is useful. Yes the seed was randomly chosen, but you might want to know what it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I also think it's important that we can check seeds, but df.explain
is not enough for checking it? Actually, the other two expression with random seeds (shuffle
and uuid
) don't display it in column names;
scala> sql("select shuffle(array(1, 2))").show()
+--------------------+
|shuffle(array(1, 2))|
+--------------------+
| [2, 1]|
+--------------------+
scala> sql("select shuffle(array(1, 2))").explain()
== Physical Plan ==
*(1) Project [shuffle([1,2], Some(894779230406706679)) AS shuffle(array(1, 2))#14]
+- *(1) Scan OneRowRelation[]
scala> sql("select uuid()").show()
+--------------------+
| uuid()|
+--------------------+
|dde93891-8a95-4e9...|
+--------------------+
scala> sql("select uuid()").explain()
== Physical Plan ==
*(1) Project [uuid(Some(4613707233104825008)) AS uuid()#23]
+- *(1) Scan OneRowRelation[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, consistency is good. I don't feel strongly either way.
Test build #121990 has finished for PR 28392 at commit
|
@@ -102,6 +105,11 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed { | |||
} | |||
|
|||
override def freshCopy(): Rand = Rand(child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when we do freshCopy
? Do we need to propagate useRandSeed
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, yes. Nice catch!
@@ -3425,6 +3425,28 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark | |||
assert(SQLConf.get.getConf(SQLConf.CODEGEN_FALLBACK) === true) | |||
} | |||
} | |||
|
|||
test("Do not display the seed of rand/randn with no argument in output schema") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't mind, SPARK-31594:
prefix please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, of course not!
|
||
override def flatArguments: Iterator[Any] = Iterator(child) | ||
override def sql: String = { | ||
s"randn(${if (useRandSeed) "" else child.sql})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming useRandSeed
might be a little mismatched. This is used only here to hide random seed. Maybe, something like hideSeed
is more direct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, Randn(child = expr, useRandSeed = true)
seems to be possible in program. It might look weird because it will not use rand seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, the name looks reasonable.
Test build #122007 has finished for PR 28392 at commit
|
Test build #122021 has finished for PR 28392 at commit
|
Console.withOut(output) { | ||
df.explain() | ||
} | ||
output.toString.matches("""randn?\(-?[0-9]+\)""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want assert(...)
?
} | ||
val df1 = sql("SELECT rand()") | ||
assert(df1.schema.head.name === "rand()") | ||
checkIfSeedExistsInExplain(df1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add assert
at line 3435, this test will fail.
Test build #122042 has finished for PR 28392 at commit
|
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @maropu, @HyukjinKwon , @srowen .
The last commit is only test case update and I verified that locally.
Merged to master.
Thanks, @dongjoon-hyun ! |
Test build #122045 has finished for PR 28392 at commit
|
…ment in output schema This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic. Before this PR (a column name changes run-by-run): ``` scala> sql("select rand()").show() +-------------------------+ |rand(7986133828002692830)| +-------------------------+ | 0.9524061403696937| +-------------------------+ ``` After this PR (a column name fixed): ``` scala> sql("select rand()").show() +------------------+ | rand()| +------------------+ |0.7137935639522275| +------------------+ // If a seed given, it is still shown in a column name // (the same with the current behaviour) scala> sql("select rand(1)").show() +------------------+ | rand(1)| +------------------+ |0.6363787615254752| +------------------+ // We can still check a seed in explain output: scala> sql("select rand()").explain() == Physical Plan == *(1) Project [rand(-2282124938778456838) AS rand()#0] +- *(1) Scan OneRowRelation[] ``` Note: This fix comes from apache#28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests. To make output schema deterministic. No. Added unit tests. Closes apache#28392 from maropu/SPARK-31594. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR intends to update
sql
inRand
/Randn
with no argument to make a column name deterministic.Before this PR (a column name changes run-by-run):
After this PR (a column name fixed):
Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.
Why are the changes needed?
To make output schema deterministic.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added unit tests.