
[BUG] ThreadPool size is deduced incorrectly in MultiFileReaderThreadPool on YARN clusters #9271

Closed · Opened by mythrocks on Sep 19, 2023 · Fixed by #9381
Labels: bug (Something isn't working)

mythrocks commented Sep 19, 2023

On YARN clusters, when a Spark cluster is spun up without explicitly setting spark.executor.cores, the MultiFileReaderThreadPool is initialized to use all the cores on the executor host. The per-thread allocations then overwhelm the executor's memory allocation, and queries fail. (One example was observed in #9135.)

Part of the per-thread allocation problem will be tackled in #9269. But the core issue appears to be in RapidsPluginUtils.estimateCoresOnExec():

```scala
def estimateCoresOnExec(conf: SparkConf): Int = {
  conf.getOption(RapidsPluginUtils.EXECUTOR_CORES_KEY)
      .map(_.toInt)
      .getOrElse(Runtime.getRuntime.availableProcessors)
}
```

On YARN setups (and in local mode), this code falls back to Runtime.getRuntime.availableProcessors instead of using the default value for EXECUTOR_CORES_KEY: conf.getOption(RapidsPluginUtils.EXECUTOR_CORES_KEY) does not return the key's documented default when the key is unset.
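That matches how SparkConf behaves in general: it stores only explicitly set values, and documented defaults live on Spark's internal ConfigEntry objects. A minimal sketch against a plain SparkConf, for illustration only (not the plugin code):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf(loadDefaults = false)
// getOption reads only explicitly set values; the documented default
// for spark.executor.cores is not consulted.
assert(conf.getOption("spark.executor.cores").isEmpty)  // None, not Some("1")
conf.set("spark.executor.cores", "4")
assert(conf.getOption("spark.executor.cores").contains("4"))
```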

The right thing to do might be to fetch the option via a ConfigEntry object, instead of using the conf key string.
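For example, a minimal sketch of that approach, assuming the plugin can reach Spark's internal org.apache.spark.internal.config.EXECUTOR_CORES entry (these entries are private[spark], so this may need to live in a shim; the entry-based SparkConf.get is likewise package-private, so the sketch reads the key and default off the entry instead):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.internal.config.EXECUTOR_CORES

def estimateCoresOnExec(conf: SparkConf): Int = {
  // EXECUTOR_CORES is the ConfigEntry behind "spark.executor.cores";
  // its default (1) matches YARN/Kubernetes semantics. An unset key
  // then yields that default instead of the host's physical core count.
  conf.getInt(EXECUTOR_CORES.key, EXECUTOR_CORES.defaultValue.getOrElse(1))
}
```

One caveat: Spark's documented default for spark.executor.cores is 1 only on YARN and Kubernetes; standalone mode gives an executor all available cores, so a complete fix may still need cluster-manager-specific handling.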
