Optimize sample perf #4159

Merged: 5 commits merged into NVIDIA:branch-22.02 on Dec 3, 2021
Conversation

res-life
Collaborator

This fixes #4096

Samples data on the GPU via JNI to improve performance.
Added the "spark.rapids.sql.fast.sample" config to switch it on and off.

Signed-off-by: Chong Gao <res_life@163.com>
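
A minimal usage sketch (the config name comes from this PR; the dataset and the assumption that the config is off by default are illustrative):

// Hypothetical usage sketch: enable the fast GPU sample path added by this PR.
// Assumes "spark.rapids.sql.fast.sample" defaults to false, so it must be enabled explicitly.
spark.conf.set("spark.rapids.sql.fast.sample", "true")

// Sample ~1% of the rows; results may differ from Spark's CPU sampler.
val sampled = spark.range(10L * 1000 * 1000).sample(0.01, 0)
println(sampled.count())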
@res-life
Collaborator Author

res-life commented Nov 19, 2021

The bottlenecks are XORShiftRandom.sample and ArrayBuffer append.
	It's hard to optimize XORShiftRandom.sample: XORShiftRandom is a sequential algorithm (each value depends on the previous one), so it can't be parallelized (a sketch of the xorshift update appears at the end of this comment).
	The ArrayBuffer append is also hard to avoid, because we have to collect the sampled row numbers.
Optimization
	Added a pure GPU solution that invokes JNI; it is turned on via the config "spark.rapids.sql.fast.sample".
	This solution is not consistent with the CPU exec, although it's faster.
	It is about 3x faster than the previous one.
	Test details
		Data set: sample 1% from 70,000,000 rows
		Result:
			1. CPU samples and GPU gathers (previous solution): avg 275 ms
			2. GPU sample via JNI: avg 81 ms
Others
	Added metrics for all the scenarios.
	Fixed a bug: when a ColumnarBatch has no columns, an exception was thrown.

Note: depends on CUDF issue: rapidsai/cudf#9728
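
To illustrate why that random stream is inherently sequential, here is a rough sketch of an xorshift-style generator (the shift constants follow a common 64-bit xorshift variant; this is an illustration, not the actual Spark XORShiftRandom source):

final class XorShiftSketch(initSeed: Long) {
  // Non-zero state is required, otherwise the generator gets stuck at 0.
  private var state: Long = if (initSeed != 0L) initSeed else 1L

  // Every call reads and rewrites the single internal state, so value N
  // cannot be computed without first computing values 1..N-1.
  def nextLong(): Long = {
    var x = state
    x ^= (x << 21)
    x ^= (x >>> 35)
    x ^= (x << 4)
    state = x
    x
  }
}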

@res-life
Collaborator Author

build

@res-life
Collaborator Author

Build failed because the CUDF PR is not merged yet.

revans2 added the cudf_dependency label (An issue or PR with this label depends on a new feature in cudf) on Nov 19, 2021
revans2 marked this pull request as draft on November 19, 2021 13:09
@revans2
Collaborator

revans2 commented Nov 19, 2021

Marking as draft because the CUDF JNI dependency is not in yet.

// copy row indexes to the host buffer
var idx = 0
while (idx < rows.length) {
  hostBuffer.setInt(idx * intBytes, rows(idx))
  idx += 1
}
Collaborator

This is mostly for my own curiosity. You could have also done this with

withResource(cudf.ColumnVector.fromInts(rows: _*)) { gatherCv =>
  withResource(GpuColumnVector.from(cb)) { table =>
    // GPU gather
    withResource(table.gather(gatherCv)) { gatheredTable =>
      GpuColumnVector.from(gatheredTable, colTypes)
    }
  }
}

Did you try this? If so, was your current way more performant? I realize it might involve an extra memory copy because the ArrayBuffer needs to be transformed into an Array so it can be passed to fromInts. So if you didn't, I don't think it is worth spending too much time on it.

Collaborator Author

Updated; the code is neater and more efficient.

preservesPartitioning = true,
seed)
} else {
val useGpuToSample = new RapidsConf(conf).isFastSampleEnabled
Collaborator

nit: I think it would be cleaner if we had a GpuFastSampleExec and a GpuSampleExec; then we can select which one to use when we replace the plan. Part of this is because creating a RapidsConf is not cheap.
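
A rough sketch of that idea at plan-replacement time (the constructor parameters are assumptions based on Spark's SampleExec, not the final code):

// Hypothetical sketch: decide once, while replacing the plan, which exec to build,
// so a RapidsConf does not have to be constructed per partition.
val gpuSample =
  if (rapidsConf.isFastSampleEnabled) {
    GpuFastSampleExec(lowerBound, upperBound, withReplacement, seed, gpuChild)
  } else {
    GpuSampleExec(lowerBound, upperBound, withReplacement, seed, gpuChild)
  }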

Collaborator Author

Yes, done.

c
} else {
withResource(GpuColumnVector.from(cb)) { table =>
withResource(table.sample(numSampleRows, withReplacement, seed)) { sampled =>
Collaborator

nit: Because we use the same seed every time, I think all of the batches will be sampled the same way. I am not sure mathematically how that all works out. Could we at least set the seed to seed + index so each task does it slightly differently?

Collaborator Author

Updated; the seed is now set per partition, following the CPU approach.
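
For reference, a minimal sketch of per-partition seeding in that CPU style (rdd, seed, and fraction are placeholder names, and java.util.Random stands in for Spark's internal XORShiftRandom; the PR's actual derivation may differ):

rdd.mapPartitionsWithIndex { (partIndex, iter) =>
  // Derive a distinct random stream for each partition from the user seed,
  // so every partition is sampled differently but reproducibly.
  val rng = new java.util.Random(seed + partIndex)
  iter.filter(_ => rng.nextDouble() < fraction)
}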

Chong Gao added 2 commits November 23, 2021 15:00
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

The test results are as follows:
data set
	a parquet file with only one int column, one partition, 2,100,000,000 rows
config
	spark.executor.cores=1
	spark.rapids.sql.concurrentGpuTasks=1
	// limit to one thread
sql
	spark.read.parquet("/home/chong/data/parquet").sample(0.01, 0).agg(f.sum("salary")).show()
result
	CPU: 4.6s
	GPU: 5.1s               // spark.conf.set("spark.rapids.sql.fast.sample", True)
	CPU/GPU hybrid: 11.41s  // spark.conf.set("spark.rapids.sql.fast.sample", False)
conclusion
	The GPU is a little slower than the CPU; maybe we need to file a CUDF issue to improve the performance of the sample JNI.

@res-life
Collaborator Author

res-life commented Dec 1, 2021

build

@res-life
Collaborator Author

res-life commented Dec 1, 2021

@revans2 Please see the test results above and help review.

@revans2
Collaborator

revans2 commented Dec 1, 2021

I get very different results, and I think your issue is because of Parquet reading.

I ran spark.time(spark.range(Int.MaxValue).sample(0.01, 0).count)

First, a single task/thread. This is the median of 3 runs, measured in ms.

CPU     GPU     FAST GPU
10633   11108   83

It is also clear from the op time metrics on the UI for the runs.

When I run with 12 tasks/threads I get

CPU     GPU     FAST GPU
1236    2952    100

The regular GPU run is slower because my GPU semaphore is set to allow only 4 tasks in parallel, so it cannot fully utilize the CPU when running.

If I change the query to include replacement, spark.time(spark.range(Int.MaxValue).sample(true, 0.01, 0).count), I get:

CPU     GPU     FAST GPU
11135   22556   116

So at this point the only thing that is lacking here is some documentation to let people know that there is a crazy fast version that is not the same as the Spark version.

@res-life
Collaborator Author

res-life commented Dec 2, 2021

The following optimization makes counting a data frame very fast.

GpuFastSampleExec
  } else if (cb.numCols() == 0) {
    // for count agg, num of cols is 0
    val c = GpuColumnVector.emptyBatchFromTypes(colTypes)
    c.setNumRows(numSampleRows.toInt) // can skip GPU sample
    c
  } else {

When querying a sum, the CPU is a little faster.

res-life marked this pull request as ready for review on December 2, 2021 13:22
@revans2
Collaborator

revans2 commented Dec 2, 2021

Yup, I got ahead of myself and did things too quickly. You are 100% correct. Could you file a follow-on issue to look at what we can do to possibly speed up sampling on the GPU?

revans2 previously approved these changes Dec 2, 2021
@revans2
Collaborator

revans2 commented Dec 2, 2021

To be clear, the issue appears to be related to sampling without replacement. With replacement is really fast. This indicates that thrust::shuffle_copy is the reason for the slowness.

spark.time(spark.range(Int.MaxValue * 12L).sample(true, 0.01, 0).selectExpr("SUM(id)", "COUNT(id)").show())
+-------------------+---------+
|            sum(id)|count(id)|
+-------------------+---------+
|3320410258800588221|257697960|
+-------------------+---------+

Time taken: 670 ms

@res-life
Collaborator Author

res-life commented Dec 3, 2021

build

@res-life
Collaborator Author

res-life commented Dec 3, 2021

build

@res-life
Collaborator Author

res-life commented Dec 3, 2021

Filed the follow-on issue: rapidsai/cudf#9834

@res-life
Collaborator Author

res-life commented Dec 3, 2021

Updated the doc; please review again, @revans2.

revans2 merged commit 34994dd into NVIDIA:branch-22.02 on Dec 3, 2021
res-life deleted the sample-perf branch on December 4, 2021 07:35
res-life pushed a commit to res-life/spark-rapids that referenced this pull request Dec 6, 2021
Signed-off-by: Chong Gao <res_life@163.com>
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf
Development

Successfully merging this pull request may close these issues.

[BUG] GpuSampleExec is very slow, and missing opTime in some cases