
Set `spark.executor.cores` for integration tests. #9177

Closed

Conversation

mythrocks (Collaborator)

Fixes #9135. (By workaround.)

This change sets `spark.executor.cores` to `10`, if it is unset. This allows integration tests to work around the failure seen in `parquet_test.py::test_small_file_memory`, where the `COALESCING` Parquet reader's thread pool accidentally uses 128 threads with 8 MB of memory each, thus consuming the entire heap.

Note that this is a bit of a workaround. A more robust solution would be to scale the Parquet reader's buffers based on the amount of available memory and the number of threads.

Signed-off-by: MithunR <mythrocks@gmail.com>
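The arithmetic behind the failure described above is straightforward (the thread count and per-thread buffer size come from the report; the totals are simply their product):

```python
# Rough arithmetic for the failure described above.
threads_unset = 128   # pool size observed when spark.executor.cores is unset
buffer_mb = 8         # per-thread Parquet read-buffer size, in MB
print(threads_unset * buffer_mb)   # 1024 MB claimed by read buffers alone

threads_capped = 10   # pool size once spark.executor.cores=10 takes effect
print(threads_capped * buffer_mb)  # 80 MB
```

A gigabyte of read buffers alone can exhaust a modestly sized test-executor heap, which is consistent with the OOM seen in the linked issue.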
@mythrocks (Collaborator, Author)

Build

mythrocks self-assigned this Sep 5, 2023
@mythrocks (Collaborator, Author)

Build


# Set per-executor cores, if unspecified.
# This prevents per-thread allocations (like Parquet read buffers) from overwhelming the heap.
export PYSP_TEST_spark_executor_cores=${PYSP_TEST_spark_executor_cores:-'10'}
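The `${VAR:-default}` expansion in the export above only substitutes the default when the variable is unset (or empty), so a value supplied by the environment is never overridden. A quick sketch of that behaviour (values here are hypothetical):

```shell
# Sketch of the ${VAR:-default} expansion used in the export above.
unset PYSP_TEST_spark_executor_cores
cores=${PYSP_TEST_spark_executor_cores:-10}
echo "$cores"    # 10: the default applies when the variable is unset

PYSP_TEST_spark_executor_cores=4
cores=${PYSP_TEST_spark_executor_cores:-10}
echo "$cores"    # 4: an explicit setting wins
```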
Collaborator

Why 10? We already have a few other places where we try to configure things for local mode; why is the number of executor cores out of sync with `LOCAL_PARALLEL` or `NUM_LOCAL_EXECS`?

LOCAL_PARALLEL=$(( $CPU_CORES > 4 ? 4 : $CPU_CORES ))

On a side note, are the Databricks tests being run in local mode and configured badly? Will we also run into this type of problem on a regular Databricks cluster? If so, this workaround feels very much like it is going in the wrong direction; we need to really fix the underlying problem ASAP instead of trying to work around it.
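The `LOCAL_PARALLEL` line quoted above uses shell arithmetic's C-style ternary to cap parallelism at 4. A minimal sketch of that expression (the `CPU_CORES` values are made up):

```shell
# $(( cond ? a : b )) caps the value at 4, as in the LOCAL_PARALLEL line above.
CPU_CORES=16
echo $(( CPU_CORES > 4 ? 4 : CPU_CORES ))   # 4

CPU_CORES=2
echo $(( CPU_CORES > 4 ? 4 : CPU_CORES ))   # 2
```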

Collaborator Author

I did see that. I figured we might want lower parallelism in local mode than in cluster mode, and that a more appropriate number might be suggested during review. I have verified that this works with 4.

@mythrocks (Collaborator, Author)

> we need to really fix the underlying problem ASAP instead of trying to work around it.

This was an attempt to get a clean build on CDH, as quickly as possible. But I'm supportive of closing this in favour of a proper fix.

Development

Successfully merging this pull request may close these issues.

[BUG] GC/OOM on parquet_test.py::test_small_file_memory
3 participants