
Fall back to CPU for try_cast in Spark 3.4.0 [databricks] #8179

Merged: 19 commits into NVIDIA:branch-23.06 on May 4, 2023

Conversation

@andygrove (Contributor) commented Apr 25, 2023:

Closes #5807

Spark 3.2.0 introduced the TryCast expression. We have no explicit handling for it, so we already fall back to the CPU.

Spark 3.4.0 removed the TryCast expression; Cast now has an evalMode of LEGACY, ANSI, or TRY. We had no logic to inspect the eval mode in the 340 shims, so we executed try_cast as a regular cast, with incorrect behavior.

This PR refactors GpuCast to use GpuEvalMode and adds a test confirming that we now fall back to the CPU for try_cast in Spark 3.2.0+.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
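For illustration, here is a minimal Scala sketch of the eval-mode check the description implies. The GpuEvalMode name and its LEGACY/ANSI/TRY values come from this PR; the tagging function and fallback callback are hypothetical, not the plugin's actual planning API:

```scala
// Minimal sketch, assuming GpuEvalMode mirrors Spark 3.4's Cast eval modes.
// tagCast and fallbackToCpu are illustrative names, not plugin API.
object EvalModeSketch {
  object GpuEvalMode extends Enumeration {
    val LEGACY, ANSI, TRY = Value
  }

  // TRY casts (try_cast) fall back to the CPU; LEGACY and ANSI casts
  // continue to be planned on the GPU.
  def tagCast(evalMode: GpuEvalMode.Value, fallbackToCpu: String => Unit): Unit =
    evalMode match {
      case GpuEvalMode.TRY =>
        fallbackToCpu("try_cast is not supported on the GPU")
      case GpuEvalMode.LEGACY | GpuEvalMode.ANSI =>
        () // supported: proceed with the GPU cast
    }
}
```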
@andygrove andygrove self-assigned this Apr 25, 2023
@andygrove andygrove added the Spark 3.4+ Spark 3.4+ issues label Apr 25, 2023
@andygrove (Contributor Author):

build

@andygrove andygrove changed the title Fall back to CPU for try_cast in Spark 3.4.0 Fall back to CPU for try_cast in Spark 3.4.0 [databricks] Apr 25, 2023
@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@andygrove andygrove marked this pull request as ready for review April 25, 2023 22:18
@andygrove (Contributor Author):

build

andygrove and others added 2 commits April 26, 2023 07:17
Co-authored-by: Navin Kumar <97137715+NVnavkumar@users.noreply.github.com>
@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

Review comment thread on this diff hunk:

```scala
    val ansiEnabled = evalMode == GpuEvalMode.ANSI

    def withToTypeOverride(newToType: DecimalType): CastExprMeta[INPUT] = {
      val evalMode = if (ansiEnabled) {
```
Collaborator:

nit: Why use evalMode to compute ansiEnabled and then turn around and use ansiEnabled to compute evalMode?

Contributor Author:

I was trying to minimize changes to existing code, but I can revisit this.

Collaborator:

It is just a nit so you can decide what to do. But if you delete lines 58 to 62 I would not complain about it.

Contributor Author:

Fixed
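To make the nit concrete, here is a hypothetical, self-contained simplification of the snippet above, showing why round-tripping the eval mode through a Boolean is fragile (GpuEvalMode is stubbed so the example stands alone):

```scala
// Hypothetical sketch, not the real CastExprMeta code.
object NitSketch extends App {
  object GpuEvalMode extends Enumeration {
    val LEGACY, ANSI, TRY = Value
  }

  val evalMode = GpuEvalMode.TRY

  // Deriving a Boolean from the eval mode...
  val ansiEnabled = evalMode == GpuEvalMode.ANSI

  // ...and reconstructing the eval mode from that Boolean collapses
  // TRY into LEGACY, which is exactly the class of bug this PR fixes.
  val reconstructed = if (ansiEnabled) GpuEvalMode.ANSI else GpuEvalMode.LEGACY

  assert(reconstructed != evalMode) // the TRY mode was silently lost
}
```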

```diff
@@ -22,8 +22,7 @@ package com.nvidia.spark.rapids.shims
 import com.nvidia.spark.rapids._
 
 import org.apache.spark.rapids.shims.GpuShuffleExchangeExec
-import org.apache.spark.sql.catalyst.expressions.{Expression, KnownNullable}
-import org.apache.spark.sql.catalyst.expressions.Empty2Null
+import org.apache.spark.sql.catalyst.expressions.{Empty2Null, Expression, KnownNullable}
 import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
 import org.apache.spark.sql.execution.{CollectLimitExec, GlobalLimitExec, SparkPlan}
 import org.apache.spark.sql.execution.command.{CreateDataSourceTableAsSelectCommand, DataWritingCommand, RunnableCommand}
```
Collaborator:

I am confused; it is not clear how the Spark340Plus shims are going to know whether the cast is ANSI, TRY, or LEGACY. They inherit from Spark331PlusShims, but 331 does not do this. Only Databricks 330 does, and I don't think 340 inherits from that.

@andygrove (Contributor Author) commented Apr 26, 2023:

We provide overrides for Cast in Spark320PlusShims and Spark31XShims.

Spark340PlusShims indirectly extends Spark320PlusShims (via Spark331PlusShims, Spark330PlusNonDBShims, Spark330PlusShims, and Spark321PlusShims).

These shims delegate to AnsiCastShim, which is shimmed for 311+ and for 330db/340 as follows:

sql-plugin/src/main/spark311/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (311+)
sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (330db + 340)

It is all very confusing, for sure.
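A rough, self-contained Scala sketch of why two AnsiCastShim variants exist. The shape of the two APIs follows this thread (a separate ANSI cast expression before 3.4 versus an evalMode field on Cast in 330db/340); the stub types and function names are illustrative reconstructions, not the shims' real code:

```scala
// Illustrative stubs standing in for Spark's classes so the sketch
// compiles on its own; the real shims match on Spark catalyst expressions.
object AnsiCastShimSketch {
  sealed trait Expression
  case class AnsiCast(child: String) extends Expression // pre-3.4 shape
  sealed trait EvalMode
  case object LEGACY extends EvalMode
  case object ANSI extends EvalMode
  case object TRY extends EvalMode
  case class Cast(child: String, evalMode: EvalMode) extends Expression // 330db/340 shape

  // spark311 variant: ANSI-ness is encoded in the expression's class.
  def isAnsiCast311(e: Expression): Boolean = e.isInstanceOf[AnsiCast]

  // spark330db/340 variant: there is no separate AnsiCast class; ANSI-ness
  // (and the new TRY mode) live in Cast.evalMode, which is why 340 shares
  // this shim with Databricks 330.
  def isAnsiCast340(e: Expression): Boolean = e match {
    case Cast(_, ANSI) => true
    case _             => false
  }
}
```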

Contributor Author:

For me, it was especially confusing that we have a Databricks shim that also shims non-Databricks versions. I understand why it makes sense, but it is not a pattern we have been used to seeing so far.

Collaborator:

Yes, it works, because the tests pass, but we should fix this. Can you file an issue so that 340 does not depend on 330db, unless we already have one for that?

Contributor Author:

There was a discussion about this in #8169 (comment), and it seems it is correct to have both 330db and 340 in the same shim.

Collaborator:

That is right. Not that it makes it any less confusing.

Contributor Author:

I filed #8188 to remove the dependency of 340 on 330db.

Collaborator:

I think the best way to make the code clearer is to break up the trait inheritance and stop encoding version ranges in the class/trait/object names.

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@gerashegalov (Collaborator) left a review:

LGTM

@NVnavkumar (Collaborator) commented Apr 26, 2023:

> @NVnavkumar I ran into problems with the test refactor that you suggested:
>
> 2023-04-26T20:19:23.6118186Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E                 extras.append('ALLOW_NON_GPU(' + ','.join(non_gpu.args) + ')')
> 2023-04-26T20:19:23.6118491Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E             TypeError: sequence item 0: expected str instance, list found
> 2023-04-26T20:19:23.6118701Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E           assert False
>
> I have gone back to two tests for now.

I think you need to use the * unpacking operator in Python, e.g. allow_non_gpu(*execs_to_allow) where execs_to_allow = ["Exec1", "Exec2"]. Passing the list itself makes non_gpu.args contain a list instead of strings, which is why ','.join raised the TypeError above.

Looks like I forgot to put it in the initial suggestion. It's a nit, so I don't think it's a big deal to fix at the moment.

@gerashegalov (Collaborator):

> Looks like I forgot to put it in the initial suggestion. It's a nit, so I don't think it's a big deal to fix at the moment.

@NVnavkumar my bad: I thought it was a typo and edited it; I should have just made a comment instead.

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

The build has failed twice here:

2023-04-27T17:59:00.3281611Z [2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
2023-04-27T17:59:00.3281967Z [2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

@NVnavkumar (Collaborator):

> The build has failed twice here:
>
> 2023-04-27T17:59:00.3281611Z [2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
> 2023-04-27T17:59:00.3281967Z [2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

That looks like CI lost its SSH connection to the Databricks instance in the middle of testing. Did it hit an idle timeout?

@andygrove (Contributor Author):

> That looks like CI lost its SSH connection to the Databricks instance in the middle of testing. Did it hit an idle timeout?

That seems likely. The timestamps in the log file span a period of roughly 4 hours and 7 minutes.

@GaryShen2008 @pxLi could we increase the timeout for these tests?

@pxLi (Collaborator) commented Apr 28, 2023:

build

@pxLi (Collaborator) commented Apr 28, 2023:

With the timeout increased to 6.5 hours, I found that CI actually got stuck at the point below (~3 hours in) and made no further progress.

[2023-04-28T04:47:44.612Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Decimal(38,1)-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] 

[2023-04-28T04:47:44.613Z] [gw1] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Decimal(38,1)-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] 

[2023-04-28T04:47:46.489Z] ../../src/main/python/window_function_test.py::test_percent_rank_no_part_multiple_batches

Please check rapids_premerge-github, build ID 7061.

The previous failures were actually the same: stuck, then timed out. cc @andygrove @NVnavkumar

[2023-04-27T16:43:47.223Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT]
[2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
[2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

@revans2 (Collaborator) commented May 1, 2023:

The INJECT_OOM in the test looks like it might be causing some issues. It would be good to have someone look at it and see if we can reproduce it.

@NVnavkumar (Collaborator):

This PR might also close #7046, btw.

@andygrove (Contributor Author):

I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

@pxLi (Collaborator) commented May 3, 2023:

> I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

The log for window_function_test has already been printed out, so the test that is actually hanging may not be in window_function_test. I would suggest running the full test suite first and checking the executor logs while it is stuck (2.5 to 3 hours in).

Let me try re-triggering to see whether this is still reproducible in CI.

@pxLi (Collaborator) commented May 3, 2023:

build

@pxLi (Collaborator) commented May 3, 2023:

CI passed cleanly in the latest run with the latest nightly JNI.

I guess the hang might be related to a side effect of the cudf changes last week.

@andygrove (Contributor Author):

Is SPARK_RAPIDS_TEST_INJECT_OOM_SEED set to a constant value when running the build in blossom? Are we sure that this issue is resolved and that we didn't just run with a different seed this time?

@abellina (Collaborator) commented May 3, 2023:

The seed changes each time, and is printed in the logs:

SPARK_RAPIDS_TEST_INJECT_OOM_SEED used: X

In order to repro the same injection order, we can go to the failed build and look for the seed that was printed (and then export SPARK_RAPIDS_TEST_INJECT_OOM_SEED=X). We can also use: --test_oom_injection_mode=always to always inject.

@abellina (Collaborator) commented May 3, 2023:

Note that on this same day, April 28, a fix was merged for a bug that could have caused segfaults in the executors. I wonder if this is actually another instance of rapidsai/cudf#13238.

@abellina (Collaborator) commented May 3, 2023:

> I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

@andygrove it might be worth trying with the cuDF nightly of April 27, if you already have an environment set up for this.

@andygrove (Contributor Author):

> Note that on this same day, April 28, a fix was merged for a bug that could have caused segfaults in the executors. I wonder if this is actually another instance of rapidsai/cudf#13238.

I went back over the history. The CI failures occurred before the fix rapidsai/cudf#13240 was merged, and later runs were fine, so I think this was likely the issue and that we should go ahead and merge.

@andygrove andygrove merged commit 8703289 into NVIDIA:branch-23.06 May 4, 2023
@andygrove andygrove deleted the try-cast-340 branch May 4, 2023 14:34
@andygrove (Contributor Author):

> I guess the hang might be related to a side effect of the cudf changes last week.

Thanks for the help with this @pxLi

Labels: Spark 3.4+ issues
Linked issue: Spark 3.4 changes to cast broke build