
Fall back to CPU for try_cast in Spark 3.4.0 [databricks] #8179

Merged: 19 commits into NVIDIA:branch-23.06 on May 4, 2023

Conversation

@andygrove (Contributor) commented Apr 25, 2023:

Closes #5807

Spark 3.2.0 introduced the TryCast expression. We have no explicit handling for it, so we already fall back to the CPU.

Spark 3.4.0 removed the TryCast expression; Cast now has an evalMode of LEGACY, ANSI, or TRY. We had no logic to inspect the eval mode in the 340 shims, so we executed try_cast as a regular cast, with incorrect behavior.

This PR refactors GpuCast to use GpuEvalMode and adds a test confirming that we now fall back to the CPU for try_cast in Spark 3.2.0+.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
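For illustration, here is a minimal Scala sketch of the eval-mode check the description implies. The GpuEvalMode name and its LEGACY/ANSI/TRY values come from this PR; the tagging function and fallback callback are hypothetical, not the plugin's actual planning API:

```scala
// Minimal sketch, assuming GpuEvalMode mirrors Spark 3.4's Cast eval modes.
// tagCast and fallbackToCpu are illustrative names, not plugin API.
object EvalModeSketch {
  object GpuEvalMode extends Enumeration {
    val LEGACY, ANSI, TRY = Value
  }

  // TRY casts (try_cast) fall back to the CPU; LEGACY and ANSI casts
  // continue to be planned on the GPU.
  def tagCast(evalMode: GpuEvalMode.Value, fallbackToCpu: String => Unit): Unit =
    evalMode match {
      case GpuEvalMode.TRY =>
        fallbackToCpu("try_cast is not supported on the GPU")
      case GpuEvalMode.LEGACY | GpuEvalMode.ANSI =>
        () // supported: proceed with the GPU cast
    }
}
```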
@andygrove andygrove self-assigned this Apr 25, 2023
@andygrove andygrove added the Spark 3.4+ Spark 3.4+ issues label Apr 25, 2023
@andygrove (Contributor Author):

build

@andygrove andygrove changed the title Fall back to CPU for try_cast in Spark 3.4.0 Fall back to CPU for try_cast in Spark 3.4.0 [databricks] Apr 25, 2023
@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@andygrove andygrove marked this pull request as ready for review April 25, 2023 22:18
@andygrove (Contributor Author):

build

andygrove and others added 2 commits April 26, 2023 07:17
Co-authored-by: Navin Kumar <97137715+NVnavkumar@users.noreply.github.com>
@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

Review comment thread on this diff hunk:

```scala
    val ansiEnabled = evalMode == GpuEvalMode.ANSI

    def withToTypeOverride(newToType: DecimalType): CastExprMeta[INPUT] = {
      val evalMode = if (ansiEnabled) {
```
Collaborator:

nit: Why use evalMode to compute ansiEnabled and then turn around and use ansiEnabled to compute evalMode?

Contributor Author:

I was trying to minimize changes to existing code, but I can revisit this.

Collaborator:

It is just a nit so you can decide what to do. But if you delete lines 58 to 62 I would not complain about it.

Contributor Author:

Fixed
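To make the nit concrete, here is a hypothetical, self-contained simplification of the snippet above, showing why round-tripping the eval mode through a Boolean is fragile (GpuEvalMode is stubbed so the example stands alone):

```scala
// Hypothetical sketch, not the real CastExprMeta code.
object NitSketch extends App {
  object GpuEvalMode extends Enumeration {
    val LEGACY, ANSI, TRY = Value
  }

  val evalMode = GpuEvalMode.TRY

  // Deriving a Boolean from the eval mode...
  val ansiEnabled = evalMode == GpuEvalMode.ANSI

  // ...and reconstructing the eval mode from that Boolean collapses
  // TRY into LEGACY, which is exactly the class of bug this PR fixes.
  val reconstructed = if (ansiEnabled) GpuEvalMode.ANSI else GpuEvalMode.LEGACY

  assert(reconstructed != evalMode) // the TRY mode was silently lost
}
```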

```diff
@@ -22,8 +22,7 @@ package com.nvidia.spark.rapids.shims
 import com.nvidia.spark.rapids._
 
 import org.apache.spark.rapids.shims.GpuShuffleExchangeExec
-import org.apache.spark.sql.catalyst.expressions.{Expression, KnownNullable}
-import org.apache.spark.sql.catalyst.expressions.Empty2Null
+import org.apache.spark.sql.catalyst.expressions.{Empty2Null, Expression, KnownNullable}
 import org.apache.spark.sql.catalyst.plans.physical.SinglePartition
 import org.apache.spark.sql.execution.{CollectLimitExec, GlobalLimitExec, SparkPlan}
 import org.apache.spark.sql.execution.command.{CreateDataSourceTableAsSelectCommand, DataWritingCommand, RunnableCommand}
```
Collaborator:

I am confused; it is not clear how the Spark340Plus shims are going to know whether the cast is ANSI, TRY, or LEGACY. They inherit from Spark331PlusShims, but 331 does not do this. Only Databricks 330 does, and I don't think 340 inherits from that.

@andygrove (Contributor Author) commented Apr 26, 2023:

We provide overrides for Cast in Spark320PlusShims and Spark31XShims.

Spark340PlusShims indirectly extends Spark320PlusShims (via Spark331PlusShims, Spark330PlusNonDBShims, Spark330PlusShims, and Spark321PlusShims).

These shims delegate to AnsiCastShim, which is shimmed for 311+ and for 330db/340 as follows:

sql-plugin/src/main/spark311/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (311+)
sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/AnsiCastShim.scala (330db + 340)

It is all very confusing, for sure.
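A rough, self-contained Scala sketch of why two AnsiCastShim variants exist. The shape of the two APIs follows this thread (a separate ANSI cast expression before 3.4 versus an evalMode field on Cast in 330db/340); the stub types and function names are illustrative reconstructions, not the shims' real code:

```scala
// Illustrative stubs standing in for Spark's classes so the sketch
// compiles on its own; the real shims match on Spark catalyst expressions.
object AnsiCastShimSketch {
  sealed trait Expression
  case class AnsiCast(child: String) extends Expression // pre-3.4 shape
  sealed trait EvalMode
  case object LEGACY extends EvalMode
  case object ANSI extends EvalMode
  case object TRY extends EvalMode
  case class Cast(child: String, evalMode: EvalMode) extends Expression // 330db/340 shape

  // spark311 variant: ANSI-ness is encoded in the expression's class.
  def isAnsiCast311(e: Expression): Boolean = e.isInstanceOf[AnsiCast]

  // spark330db/340 variant: there is no separate AnsiCast class; ANSI-ness
  // (and the new TRY mode) live in Cast.evalMode, which is why 340 shares
  // this shim with Databricks 330.
  def isAnsiCast340(e: Expression): Boolean = e match {
    case Cast(_, ANSI) => true
    case _             => false
  }
}
```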

Contributor Author:

For me, it was especially confusing that we have a Databricks shim that also shims non-Databricks versions. I understand why it makes sense, but it is not a pattern we have been used to seeing so far.

Collaborator:

Yes, it works, because the tests pass, but we should fix this. Can you file an issue so that 340 does not depend on 330db, unless we already have one for that?

Contributor Author:

There was a discussion about this in #8169 (comment), and it seems it is correct to have both 330db and 340 in the same shim.

Collaborator:

That is right. Not that it makes it any less confusing.

Contributor Author:

I filed #8188 to remove the dependency of 340 on 330db.

Collaborator:

I think the best way to make the code clearer is to break up the trait inheritance and stop encoding version ranges in the class/trait/object names.

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@gerashegalov (Collaborator) left a review:

LGTM

@NVnavkumar (Collaborator) commented Apr 26, 2023:

> @NVnavkumar I ran into problems with the test refactor that you suggested:
>
> 2023-04-26T20:19:23.6118186Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E                 extras.append('ALLOW_NON_GPU(' + ','.join(non_gpu.args) + ')')
> 2023-04-26T20:19:23.6118491Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E             TypeError: sequence item 0: expected str instance, list found
> 2023-04-26T20:19:23.6118701Z [2023-04-26T20:18:52.350Z] INTERNALERROR> E           assert False
>
> I have gone back to two tests for now.

I think you need to use the * unpacking operator in Python, e.g. allow_non_gpu(*execs_to_allow) where execs_to_allow = ["Exec1", "Exec2"]. Passing the list itself makes non_gpu.args contain a list instead of strings, which is why ','.join raised the TypeError above.

Looks like I forgot to put it in the initial suggestion. It's a nit, so I don't think it's a big deal to fix at the moment.

@gerashegalov (Collaborator):

> Looks like I forgot to put it in the initial suggestion. It's a nit, so I don't think it's a big deal to fix at the moment.

@NVnavkumar my bad: I thought it was a typo and edited it; I should have just made a comment instead.

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

build

@andygrove (Contributor Author):

The build has failed twice here:

2023-04-27T17:59:00.3281611Z [2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
2023-04-27T17:59:00.3281967Z [2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

@NVnavkumar (Collaborator):

> The build has failed twice here:
>
> 2023-04-27T17:59:00.3281611Z [2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
> 2023-04-27T17:59:00.3281967Z [2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

That looks like CI lost its SSH connection to the Databricks instance in the middle of testing. Did it hit an idle timeout?

@andygrove (Contributor Author):

> That looks like CI lost its SSH connection to the Databricks instance in the middle of testing. Did it hit an idle timeout?

That seems likely. The timestamps in the log file span a period of roughly 4 hours and 7 minutes.

@GaryShen2008 @pxLi could we increase the timeout for these tests?

@pxLi (Collaborator) commented Apr 28, 2023:

build

@pxLi (Collaborator) commented Apr 28, 2023:

With the timeout increased to 6.5 hours, I found that CI actually got stuck at the point below (~3 hours in) and made no further progress.

[2023-04-28T04:47:44.612Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Decimal(38,1)-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] 

[2023-04-28T04:47:44.613Z] [gw1] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Decimal(38,1)-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] 

[2023-04-28T04:47:46.489Z] ../../src/main/python/window_function_test.py::test_percent_rank_no_part_multiple_batches

Please check rapids_premerge-github, build ID 7061.

The previous failures were actually the same: stuck, then timed out. cc @andygrove @NVnavkumar

[2023-04-27T16:43:47.223Z] ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT]
[2023-04-27T17:58:39.881Z] [gw0] [ 93%] PASSED ../../src/main/python/window_function_test.py::test_multi_types_window_aggs_for_rows[partAndOrderBy:Timestamp-String][INJECT_OOM, IGNORE_ORDER({'local': True}), APPROXIMATE_FLOAT] Connection to ec2-34-217-117-108.us-west-2.compute.amazonaws.com closed by remote host.
[2023-04-27T17:58:39.883Z] ssh: connect to host ec2-34-217-117-108.us-west-2.compute.amazonaws.com port 2200: Connection refused

@revans2 (Collaborator) commented May 1, 2023:

The INJECT_OOM in the test looks like it might be causing some issues. It would be good to have someone look at it and see if we can reproduce it.

@NVnavkumar (Collaborator):

This PR might also close #7046, btw.

@andygrove (Contributor Author):

I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

@pxLi (Collaborator) commented May 3, 2023:

> I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

The log for window_function_test has already been printed out, so the test that is actually hanging may not be in window_function_test. I would suggest running the full test suite first and checking the executor logs while it is stuck (2.5 to 3 hours in).

Let me try re-triggering to see whether this is still reproducible in CI.

@pxLi (Collaborator) commented May 3, 2023:

build

@pxLi (Collaborator) commented May 3, 2023:

CI passed cleanly in the latest run with the latest nightly JNI.

I guess the hang might be related to a side effect of the cudf changes last week.

@andygrove (Contributor Author):

Is SPARK_RAPIDS_TEST_INJECT_OOM_SEED set to a constant value when running the build in blossom? Are we sure that this issue is resolved and that we didn't just run with a different seed this time?

@abellina (Collaborator) commented May 3, 2023:

The seed changes each time, and is printed in the logs:

SPARK_RAPIDS_TEST_INJECT_OOM_SEED used: X

In order to repro the same injection order, we can go to the failed build and look for the seed that was printed (and then export SPARK_RAPIDS_TEST_INJECT_OOM_SEED=X). We can also use: --test_oom_injection_mode=always to always inject.

@abellina (Collaborator) commented May 3, 2023:

Note that on this same day, April 28, a fix was merged for a bug that could have caused segfaults in the executors. I wonder if this is actually another instance of rapidsai/cudf#13238.

@abellina (Collaborator) commented May 3, 2023:

> I tried to reproduce on DB 11.3 by modifying the jenkins/databricks/test.sh script to add -k window_function_test.py but could not reproduce.

@andygrove it might be worth trying with the cuDF nightly of April 27, if you already have an environment set up for this.

@andygrove (Contributor Author):

> Note that on this same day, April 28, a fix was merged for a bug that could have caused segfaults in the executors. I wonder if this is actually another instance of rapidsai/cudf#13238.

I went back over the history. The CI failures occurred before the fix rapidsai/cudf#13240 was merged, and later runs were fine, so I think this was likely the issue and that we should go ahead and merge.

@andygrove andygrove merged commit 8703289 into NVIDIA:branch-23.06 May 4, 2023
@andygrove andygrove deleted the try-cast-340 branch May 4, 2023 14:34
@andygrove (Contributor Author):

> I guess the hang might be related to a side effect of the cudf changes last week.

Thanks for the help with this @pxLi

Labels: Spark 3.4+ issues
Linked issue: Spark 3.4 changes to cast broke build