
[FEA] Please consider supporting spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin #479

Closed
wbo4958 opened this issue Jul 31, 2020 · 4 comments
Labels: feature request (New feature or request), wontfix (This will not be worked on)

Comments

@wbo4958 (Collaborator) commented Jul 31, 2020

Is your feature request related to a problem? Please describe.
This issue is related to #460. When a user sets spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin instead of spark.plugins=com.nvidia.spark.SQLPlugin, it results in a NullPointerException:

20/07/29 15:51:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.19.183.93, executor 0): java.lang.NullPointerException
	at org.apache.spark.sql.rapids.GpuShuffleEnv$.isRapidsShuffleEnabled(GpuShuffleEnv.scala:128)
	at com.nvidia.spark.rapids.GpuPartitioning.sliceInternalGpuOrCpu(GpuPartitioning.scala:99)
	at com.nvidia.spark.rapids.GpuPartitioning.sliceInternalGpuOrCpu$(GpuPartitioning.scala:97)
	at com.nvidia.spark.rapids.GpuRoundRobinPartitioning.sliceInternalGpuOrCpu(GpuRoundRobinPartitioning.scala:34)
	at com.nvidia.spark.rapids.GpuRoundRobinPartitioning.columnarEval(GpuRoundRobinPartitioning.scala:81)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$.$anonfun$prepareBatchShuffleDependency$2(GpuShuffleExchangeExec.scala:156)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:177)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:188)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

That happens because RapidsExecutorPlugin is never initialized, so GpuShuffleEnv is also never initialized. Any code that calls a GpuShuffleEnv method may then throw a NullPointerException, because that code assumes GpuShuffleEnv has already been initialized.

Describe the solution you'd like
It would be better to also support spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin.
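For reference, a hedged sketch of the two launch configurations being contrasted here (the class names come from the issue text; jar and application names are placeholders):

```shell
# Supported: spark.plugins initializes both the driver and executor plugins,
# which set up GpuShuffleEnv before any shuffle runs.
# (rapids-4-spark.jar, cudf.jar, and app.jar are placeholder names.)
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --jars rapids-4-spark.jar,cudf.jar \
  app.jar

# Unsupported at the time of this issue: the SQL extension alone rewrites
# plans to GPU operators, but RapidsExecutorPlugin never starts, so
# GpuShuffleEnv stays uninitialized and shuffle tasks hit the NPE above.
# spark-submit --conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin ...
```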

@abellina (Collaborator)

Sorry, I thought this issue was about having a friendlier exception thrown here. I am going to file a separate issue for that; it is a nuisance, but distinct from the main problem discussed here.

@abellina (Collaborator) commented Jul 31, 2020

@wbo4958 fyi, we are changing this exception to be an IllegalStateException for now.

Specifically, we need the correct GPU to be acquired in the plugin. One approach today that would let you run is to a) acquire a GPU, and b) initialize GpuShuffleEnv with the memory info it requires from the device. As far as I can tell, this means spark.sql.extensions would need to be prefixed with an extension of your own that does this initialization.

How do you solve for the correct GPU in your case? That seems to be the main issue we have concerns about. Let us know what you think; perhaps there's more we can do to make it easier.

@wbo4958 (Collaborator, Author) commented Aug 4, 2020

Thx @abellina.

@wbo4958 closed this as completed Aug 4, 2020
@revans2 (Collaborator) commented Aug 4, 2020

@wbo4958 is what @abellina described an acceptable long-term solution? If not, we should reopen this. We could probably make it work with lazy initialization, but it is a matter of time and prioritization.
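The lazy-initialization idea mentioned above can be sketched generically: instead of assuming a plugin has already populated shared state (the assumption that produced the NullPointerException in this issue), the accessor initializes a safe default on first use. This is an illustrative stand-in, not RAPIDS code; the class and method names are hypothetical.

```java
// Hypothetical sketch of lazy initialization for plugin-managed state.
// In the failing path, callers assumed init() had already run and hit an NPE;
// here, get() falls back to a safe default when the plugin never ran
// (e.g. only spark.sql.extensions was set, so the executor plugin was skipped).
class ShuffleEnvHolder {
    private static volatile ShuffleEnvHolder instance;

    private final boolean rapidsShuffleEnabled;

    private ShuffleEnvHolder(boolean enabled) {
        this.rapidsShuffleEnabled = enabled;
    }

    // Called by the executor plugin when it is loaded via spark.plugins.
    static void init(boolean enabled) {
        instance = new ShuffleEnvHolder(enabled);
    }

    // Double-checked locking: lazily create a disabled default instead of
    // returning null and letting callers crash.
    static ShuffleEnvHolder get() {
        ShuffleEnvHolder local = instance;
        if (local == null) {
            synchronized (ShuffleEnvHolder.class) {
                local = instance;
                if (local == null) {
                    local = new ShuffleEnvHolder(false); // safe default
                    instance = local;
                }
            }
        }
        return local;
    }

    boolean isRapidsShuffleEnabled() {
        return rapidsShuffleEnabled;
    }
}
```

With this pattern, a caller like isRapidsShuffleEnabled never dereferences a null environment; it merely reports the feature as disabled until the plugin runs.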

@jlowe added the wontfix label Sep 14, 2020
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022