
[BUG] java.lang.NullPointerException when using spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin #460

Closed
wbo4958 opened this issue Jul 29, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@wbo4958
Collaborator

wbo4958 commented Jul 29, 2020

Describe the bug

If the user is using the configs as follows,

--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin

Then RapidsExecutorPlugin will not be initialized, so GpuShuffleEnv will not be initialized either. Any code that calls a GpuShuffleEnv method may then hit a NullPointerException, because that code assumes GpuShuffleEnv has already been initialized. The exception looks like this:

20/07/29 15:51:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.19.183.93, executor 0): java.lang.NullPointerException
	at org.apache.spark.sql.rapids.GpuShuffleEnv$.isRapidsShuffleEnabled(GpuShuffleEnv.scala:128)
	at com.nvidia.spark.rapids.GpuPartitioning.sliceInternalGpuOrCpu(GpuPartitioning.scala:99)
	at com.nvidia.spark.rapids.GpuPartitioning.sliceInternalGpuOrCpu$(GpuPartitioning.scala:97)
	at com.nvidia.spark.rapids.GpuRoundRobinPartitioning.sliceInternalGpuOrCpu(GpuRoundRobinPartitioning.scala:34)
	at com.nvidia.spark.rapids.GpuRoundRobinPartitioning.columnarEval(GpuRoundRobinPartitioning.scala:81)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$.$anonfun$prepareBatchShuffleDependency$2(GpuShuffleExchangeExec.scala:156)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:177)
	at com.nvidia.spark.rapids.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:188)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
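The failure mode can be summarized as follows: the shared shuffle state is only populated by the executor plugin's init path, so when only the SQL extension is loaded, the accessor dereferences a null reference. A minimal sketch of the pattern (class and member names here are simplified stand-ins, not the actual plugin source):

```scala
// Illustrative sketch only; names are hypothetical simplifications.
class RapidsConf(val shuffleEnabled: Boolean)

class GpuShuffleEnv(conf: RapidsConf) {
  def rapidsShuffleEnabled: Boolean = conf.shuffleEnabled
}

object GpuShuffleEnv {
  // Set by the executor plugin's init(); stays null when only
  // spark.sql.extensions=...SQLExecPlugin is configured.
  private var env: GpuShuffleEnv = _

  def init(conf: RapidsConf): Unit = {
    env = new GpuShuffleEnv(conf)
  }

  // Dereferences `env` unconditionally, so this throws
  // NullPointerException at task time if init() never ran.
  def isRapidsShuffleEnabled: Boolean = env.rapidsShuffleEnabled
}
```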

Steps/Code to reproduce bug
Use

--conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin
@wbo4958 wbo4958 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 29, 2020
@jlowe
Member

jlowe commented Jul 29, 2020

I don't believe our intention is to support specifying the SQLExecPlugin exclusively. The getting started guide and other documentation specify to use: --conf spark.plugins=com.nvidia.spark.SQLPlugin. Besides the shuffle environment not being set up properly with this config, the RMM pool, pinned memory pool, and GPU semaphore won't be initialized either.

If there is documentation stating to configure Spark with --conf spark.sql.extensions=com.nvidia.spark.rapids.SQLExecPlugin then we need to update it.
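For reference, the documented invocation looks like the following (the jar path is a placeholder for whatever your deployment uses):

```shell
# Recommended configuration per the getting started guide.
# The jar path/name below is a placeholder, not an exact artifact name.
spark-submit \
  --jars /path/to/rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  ...
```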

@jlowe jlowe removed the ? - Needs Triage Need team to review and classify label Jul 29, 2020
@abellina
Collaborator

Agree with @jlowe.

I think if this case could be detected and a user-friendly error thrown, that would be a good small fix. Thoughts?

For this particular NPE, some other, more user-friendly exception could be thrown instead.
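One possible shape for such a check (a sketch under assumed names, not the actual fix): have the accessor fail fast with a descriptive message instead of dereferencing null state:

```scala
// Sketch only; GpuShuffleEnv internals are simplified here.
class GpuShuffleEnv(val rapidsShuffleEnabled: Boolean)

object GpuShuffleEnv {
  // Set by the executor plugin's init(); null if the plugin never ran.
  private var env: GpuShuffleEnv = _

  def init(shuffleEnabled: Boolean): Unit = {
    env = new GpuShuffleEnv(shuffleEnabled)
  }

  private def checkInitialized(): Unit = {
    if (env == null) {
      throw new IllegalStateException(
        "GpuShuffleEnv is not initialized. The RAPIDS executor plugin " +
        "did not run; configure spark.plugins=com.nvidia.spark.SQLPlugin " +
        "instead of only spark.sql.extensions.")
    }
  }

  def isRapidsShuffleEnabled: Boolean = {
    checkInitialized()
    env.rapidsShuffleEnabled
  }
}
```

This replaces the opaque NullPointerException with an error that names the misconfiguration and the fix.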

@wbo4958
Collaborator Author

wbo4958 commented Jul 29, 2020

I also agree with @jlowe. But if we use spark.plugins in Spark Standalone mode, it seems we need to copy the rapids jar to every node, which is tedious.

Also, the ML use case may not need the RMM/GPU shuffle pieces, and the GPU semaphore could seemingly be initialized lazily when first used.
BTW, it worked well using spark.sql.extensions previously. Anyway, I'm OK if we require users to use spark.plugins.

@revans2
Collaborator

revans2 commented Jul 30, 2020

@wbo4958

If you want to be able to support that use case then file a feature request with what you want to support. We can then prioritize it on the backlog and try to figure out what is the right way to support it.

@wbo4958
Collaborator Author

wbo4958 commented Jul 31, 2020

Closing this issue and filing FEA #479.

@wbo4958 wbo4958 closed this as completed Jul 31, 2020
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…#460)

Add ability to provide a seed value in config. Closes NVIDIA#452  
    
Signed-off-by: Gera Shegalov <gera@apache.org>