
Avoid requiring single batch when using out-of-core sort #5903

Merged
merged 5 commits into NVIDIA:branch-22.08 on Jul 18, 2022

Conversation

@res-life (Collaborator) commented Jun 24, 2022

Closes #5448

Problem

The root cause is that the input to the out-of-core sort is a single large batch, and building that single large batch causes the OOM.

The out-of-core sort should not require a single batch; it can pull all of the input batches and then sort them. Before the out-of-core sort feature was added, a single batch was genuinely needed for the in-memory sort.

Solution

When executing partitioned writes that require sorting, use the out-of-core sort and drop the single-batch requirement.
Note: if the stable sort configuration is enabled, a single batch is still required, as before.
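As a rough sketch (not the exact patch), the selection between the two modes looks like the snippet below; the FullSortSingleBatch and OutOfCoreSort sort types and the stable-sort flag are the ones quoted in the review discussion further down:

// Stable sort needs to see the whole input at once, so it keeps the old
// single-batch requirement; otherwise we sort out of core across batches.
val sortType = if (useStableSort) {
  FullSortSingleBatch // requires the entire input as one batch
} else {
  OutOfCoreSort       // pulls input batches incrementally and can spill
}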

Signed-off-by: Chong Gao <res_life@163.com>
@res-life (Collaborator, Author) commented Jul 1, 2022

build

@sameerz added the "task" label (Work required that improves the product but is not user facing) on Jul 1, 2022
@res-life (Collaborator, Author) commented Jul 8, 2022

build

@res-life (Collaborator, Author) commented Jul 8, 2022

Verified with a large data frame: OOM did not occur after disabling the single-batch requirement.

I will file a follow-on issue to explore support for something like DynamicPartitionDataConcurrentWriter.

@res-life res-life marked this pull request as ready for review July 8, 2022 10:42
@res-life res-life requested a review from revans2 July 8, 2022 10:43
@res-life (Collaborator, Author):
Thanks! @wjxiz1992 verified this PR against the corresponding NV bug: NDS 2.0 converting CSV to Parquet failed with OOM.

@res-life (Collaborator, Author):

Filed a follow-on issue for DynamicPartitionDataConcurrentWriter: #5999

@res-life (Collaborator, Author):

@revans2 Could you help review?

@@ -465,3 +465,14 @@ def test_write_daytime_interval(spark_tmp_path):
lambda spark, path: spark.read.parquet(path),
data_path,
conf=writer_confs)

# TODO need to test large DF
Collaborator:

We simulate a large DF by setting the batch size to be very small. This lets us send multiple batches.
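For illustration only (the actual integration tests are Python and are not shown here in full), here is a hedged Scala sketch of the same trick; the DataFrame and output path are hypothetical, and spark.rapids.sql.batchSizeBytes is the plugin setting that controls the target GPU batch size:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// Shrink the target GPU batch size so even a modest DataFrame is split into
// many batches, exercising the multi-batch out-of-core sort path.
spark.conf.set("spark.rapids.sql.batchSizeBytes", "1024")
// Hypothetical DataFrame and output path, for illustration only.
val df = spark.range(0, 1000000L).selectExpr("id % 10 as key", "id as value")
// A partitioned write triggers the pre-write sort that this PR changes.
df.write.partitionBy("key").parquet("/tmp/partitioned_out")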

@@ -36,7 +36,8 @@ case class GpuCreateDataSourceTableAsSelectCommand(
     query: LogicalPlan,
     outputColumnNames: Seq[String],
     origProvider: Class[_],
-    gpuFileFormat: ColumnarFileFormat)
+    gpuFileFormat: ColumnarFileFormat,
+    useStableSort: Boolean)
Collaborator:

My only nit is that we pass useStableSort around this code a lot, but in the final part when we do the sort we get it from a different location.

val sortType = if (RapidsConf.STABLE_SORT.get(plan.conf)) {
FullSortSingleBatch
} else {
OutOfCoreSort
}

Could we please make it consistent? Either we pass it all the way down all the time, or we go off of the plan.conf all the time.
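A minimal sketch of one consistent option, assuming the flag added to the command in the diff above is threaded through to the sort site:

// Resolve the stable-sort config once, where the plan is converted...
val useStableSort = RapidsConf.STABLE_SORT.get(plan.conf)

// ...then every downstream decision relies on the threaded-through flag
// instead of re-reading the conf, giving a single source of truth.
val sortType = if (useStableSort) FullSortSingleBatch else OutOfCoreSort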

@res-life (Collaborator, Author):

Updated; see the comment below.

@res-life (Collaborator, Author):

Premerge is blocked by #6003

@res-life (Collaborator, Author):

build

@revans2 revans2 merged commit 6286d05 into NVIDIA:branch-22.08 Jul 18, 2022
Labels
task Work required that improves the product but is not user facing
Development

Successfully merging this pull request may close these issues:

[BUG] partitioned writes require single batches and sorting, causing gpu OOM in some cases