Fall back to CPU for try_cast in Spark 3.4.0 [databricks] #8179
Conversation
Signed-off-by: Andy Grove <andygrove@nvidia.com>
build
build
1 similar comment
build
build
Co-authored-by: Navin Kumar <97137715+NVnavkumar@users.noreply.github.com>
build
build
```scala
val ansiEnabled = evalMode == GpuEvalMode.ANSI

def withToTypeOverride(newToType: DecimalType): CastExprMeta[INPUT] = {
  val evalMode = if (ansiEnabled) {
```
nit: Why use `evalMode` to compute `ansiEnabled` and then turn around and use `ansiEnabled` to compute `evalMode`?
I was trying to minimize changes to existing code, but I can revisit this.
It is just a nit so you can decide what to do. But if you delete lines 58 to 62 I would not complain about it.
Fixed
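For reference, a self-contained sketch of the shape the fix converges on; `GpuEvalMode` and `Meta` below are simplified stand-ins for the plugin's `GpuEvalMode` and `CastExprMeta`, not the actual code:

```scala
// Simplified stand-ins for the plugin's GpuEvalMode and CastExprMeta.
object EvalModeSketch {
  sealed trait GpuEvalMode
  case object LEGACY extends GpuEvalMode
  case object ANSI extends GpuEvalMode
  case object TRY extends GpuEvalMode

  final case class Meta(evalMode: GpuEvalMode) {
    // ansiEnabled is derived once from evalMode...
    val ansiEnabled: Boolean = evalMode == ANSI

    // ...and evalMode is passed through directly, rather than being rebuilt
    // from ansiEnabled (the round trip flagged in the review comment).
    def withToTypeOverride: Meta = Meta(evalMode)
  }
}
```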
```diff
@@ -22,8 +22,7 @@ package com.nvidia.spark.rapids.shims
 import com.nvidia.spark.rapids._

 import org.apache.spark.rapids.shims.GpuShuffleExchangeExec
-import org.apache.spark.sql.catalyst.expressions.{Expression, KnownNullable}
-import org.apache.spark.sql.catalyst.expressions.Empty2Null
+import org.apache.spark.sql.catalyst.expressions.{Empty2Null, Expression, KnownNullable}
 import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
 import org.apache.spark.sql.execution.{CollectLimitExec, GlobalLimitExec, SparkPlan}
 import org.apache.spark.sql.execution.command.{CreateDataSourceTableAsSelectCommand, DataWritingCommand, RunnableCommand}
```
I am confused; it is not clear how the Spark340Plus shims are going to know if the cast is ANSI, TRY, or LEGACY. This inherits from Spark331PlusShims, but 331 does not do this. Only Databricks 330 does, and I don't think this inherits from that.
We provide overrides for `Cast` in Spark320PlusShims and Spark31XShims. Spark340PlusShims indirectly extends Spark320PlusShims (via Spark331PlusShims, Spark330PlusNonDBShims, Spark330PlusShims, and Spark321PlusShims). These shims delegate to `AnsiCastShim`, which is shimmed for 311+ and for 330db/340 as follows:
sql-plugin/src/main/spark311/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (311+)
sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (330db + 340)
It is all very confusing, for sure.
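To make that concrete, here is a rough sketch of what the 330db/340 flavor of `AnsiCastShim` has to provide. `EvalMode` is Spark 3.4's catalyst enumeration; `GpuEvalMode` and the method shape are simplified stand-ins for the plugin code, not its exact contents:

```scala
import org.apache.spark.sql.catalyst.expressions.EvalMode

object AnsiCastShimSketch {
  // Simplified stand-in for the plugin's GpuEvalMode.
  sealed trait GpuEvalMode
  case object LEGACY extends GpuEvalMode
  case object ANSI extends GpuEvalMode
  case object TRY extends GpuEvalMode

  // 330db/340: Cast carries an explicit three-valued eval mode, so TRY casts
  // are visible here. On 311+ Cast only has a boolean ansiEnabled flag, so
  // that shim flavor can only ever report LEGACY or ANSI.
  def toGpuEvalMode(mode: EvalMode.Value): GpuEvalMode = mode match {
    case EvalMode.LEGACY => LEGACY
    case EvalMode.ANSI   => ANSI
    case EvalMode.TRY    => TRY
  }
}
```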
For me, it was especially confusing that we have a db shim that shims for non-db. I understand that it makes sense, but I think this is not a pattern we are used to seeing so far.
Yes it works, because the tests pass, but we should fix this. Can you file an issue so that we don't have 340 depend on 330db, unless we have one to do it already?
There was a discussion about this in #8169 (comment) and it seems that it is correct that we have both 330db/340 in the same shim.
That is right. Not that it makes it any less confusing.
I filed #8188 for removing the dependency from 340 to 330db
I think the best way to make the code clearer is to break up the trait inheritance and stop encoding version ranges in the class/trait/object names.
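Purely as an illustration of that suggestion (the trait names here are hypothetical, not anything proposed in the PR):

```scala
// Hypothetical naming sketch: traits describe the Spark behavior they adapt
// to, instead of encoding a version range, and each release's shim mixes in
// whichever one applies.
trait BooleanAnsiFlagCastShim // for releases where Cast exposes ansiEnabled
trait EvalModeCastShim        // for releases where Cast exposes evalMode
```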
build
build
LGTM
I think you need to use the `*` operator in Python here. Looks like I forgot to put it in the initial suggestion. It's a nit, so I don't think it's a big deal to fix at the moment.
@NVnavkumar my bad: I thought it was a typo and edited it; I should have just made a comment instead.
build
1 similar comment
build
The build has failed twice here:
That looks like CI lost its SSH connection to the Databricks instance in the middle of testing. Did it hit an idle timeout?
That seems likely. The timestamps in the log file span a period of 4 hours and ~7 minutes. @GaryShen2008 @pxLi could we increase the timeout for these tests?
build
With the timeout increased to 6.5 hours, I found that the CI actually got stuck at the test below (~3 hrs in) and made no progress after that.
Please check rapids_premerge-github, build ID 7061. The previous failures were actually the same: stuck, then timeout. cc @andygrove @NVnavkumar
[2023-04-27T16:43:47.223Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT]
The INJECT_OOM in the test looks like it might be causing some issues. It would be good to have someone look at it and see if we can reproduce it.
This PR might also close #7046, btw.
I tried to reproduce on DB 11.3 by modifying the
Let me try re-triggering to see if this is still reproducible in CI.
build
Passed CI cleanly in the latest run with the latest nightly JNI. I guess the hanging might be related to a side effect of the cudf changes last week.
Is
The seed changes each time, and is printed in the logs:
In order to repro the same injection order, we can go to the failed build, look for the seed that was printed, and then re-run with that same seed.
Note that on this same day, April 28, a fix was merged for a bug that could have caused segfaults in the executors. I wonder if this is actually another instance of rapidsai/cudf#13238.
@andygrove it might be worth trying with the cuDF nightly of April 27, if you already have an environment set up for this.
I went back over the history. The CI failures occurred before the fix rapidsai/cudf#13240 was merged, and later runs were fine, so I think this was likely the issue and that we should go ahead and merge.
Thanks for the help with this @pxLi
Closes #5807

Spark 3.2.0 introduces the `TryCast` expression. We do not have any explicit handling for this, so we fall back to CPU.

Spark 3.4.0 removes the `TryCast` expression, and `Cast` now has an `evalMode` of `LEGACY`, `ANSI`, or `TRY`. We did not have any logic to inspect the eval mode in the 340 shims, so we would execute `try_cast` as a regular `cast`, with incorrect behavior.

This PR refactors `GpuCast` to use `GpuEvalMode` and adds a test to confirm that we now fall back to CPU for `try_cast` in Spark 3.2.0+.
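For context, a minimal sketch of the fallback decision described above, written against Spark 3.4's `Cast` and `EvalMode`; the `willNotWorkOnGpu` callback is a stand-in for the plugin's expression-meta tagging, and the message text is illustrative:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, EvalMode}

object TryCastFallbackSketch {
  // Tag a Spark 3.4 Cast: a TRY-mode cast (try_cast) is reported as
  // unsupported, which keeps that expression on the CPU.
  def tagCast(cast: Cast, willNotWorkOnGpu: String => Unit): Unit = {
    if (cast.evalMode == EvalMode.TRY) {
      willNotWorkOnGpu("try_cast is not supported on the GPU")
    }
  }
}
```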