Increase row limit when doing count() for HostColumnarToGpu #1868

tgravescs · 2021-03-04T18:13:59Z

When a count() goes through HostColumnarToGpu and ends up doing a coalesce, the row limit is currently set at 512. This can be very inefficient for larger data. So when no columns are specified, just set the row limit to the max integer.

This improved one query from 2 minutes down to 6 seconds.

I also added a new testing function assert_gpu_and_cpu_are_equal_count to be able to easily test count() calls.

fixes #1864

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

tgravescs · 2021-03-04T18:14:11Z

build

revans2 · 2021-03-04T18:24:00Z

integration_tests/src/main/python/asserts.py

-    _assert_gpu_and_cpu_are_equal(func, True, conf=conf)
+    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf)
+
+def assert_gpu_and_cpu_are_equal_count(func, conf={}):


I am not a huge fan of this API. It makes it really easy to accidentally only compare the number of rows for a query instead of the actual data of the query. I understand why you are trying to do this, but the same thing can be achieved with assert_gpu_and_cpu_are_equal_collect and inserting a groupBy().count() in there.

This is not a show stopper. If you want to keep it, that is fine; I would just like a warning in the docstring about this.

OK, I thought the _count suffix would imply that you are only comparing counts. As far as I can see, we have no tests in general that exercise count(), and we have hit lots of issues with it, so I thought it would be a convenient API for easily adding tests without having to change the operation or a bunch of code.

I'm definitely open to changing it if we think it will cause problems, but I don't think groupBy().count() is as intuitive or convenient.

Perhaps just change the name to assert_gpu_and_cpu_row_counts_equal and add more documentation? Or let me know if you can think of another convenience API.

That is fine with me. I mostly want to be sure that no one uses it by accident. Using it on purpose is fine.
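To make the concern in this thread concrete, here is a minimal sketch of how a count-only comparison can pass while the data differs. It uses plain Python lists as stand-ins for collected CPU and GPU results; the helper names `counts_match` and `collected_rows_match` are hypothetical and are not part of asserts.py.

```python
# Hypothetical stand-ins for the rows a CPU run and a GPU run would return.
# These are plain Python lists, not Spark DataFrames.
cpu_rows = [(1, "a"), (2, "b"), (3, "c")]
gpu_rows = [(1, "a"), (2, "x"), (3, "c")]  # same length, one value differs


def counts_match(cpu, gpu):
    """Compare only the number of rows.

    WARNING: this ignores the row contents entirely.
    """
    return len(cpu) == len(gpu)


def collected_rows_match(cpu, gpu):
    """Compare the actual rows, ignoring their order."""
    return sorted(cpu) == sorted(gpu)


count_only_result = counts_match(cpu_rows, gpu_rows)
full_result = collected_rows_match(cpu_rows, gpu_rows)
```

The count-only check accepts these two results even though one value differs, which is why a prominent docstring warning (or a more explicit name) seems worthwhile for a count-only assert.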

gerashegalov · 2021-03-04T18:29:27Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/HostColumnarToGpu.scala

@@ -279,6 +279,11 @@ class HostToGpuCoalesceIterator(iter: Iterator[ColumnarBatch],
    // schema and desired batch size
    batchRowLimit = GpuBatchUtils.estimateRowCount(goal.targetSizeBytes,
      GpuBatchUtils.estimateGpuMemory(schema, 512), 512)
+    // when there aren't any columns, it generally means user is doing a count() and we don't


nice find.

Let us not call estimateRowCount/estimateGpuMemory when there are no columns:

batchRowLimit = if (batch.numCols() > 0) {
  GpuBatchUtils.estimateRowCount(goal.targetSizeBytes,
    GpuBatchUtils.estimateGpuMemory(schema, 512), 512)
} else {
  Integer.MAX_VALUE
}
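For readers less familiar with the Scala side, the branching suggested above can be modeled roughly in Python. This is only a sketch: `choose_batch_row_limit` is a hypothetical name, and the proportional estimate in the non-empty branch is an assumed stand-in for GpuBatchUtils.estimateRowCount, whose real implementation is not shown in this thread.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


def choose_batch_row_limit(num_cols, target_size_bytes, sample_bytes, sample_rows=512):
    """Hypothetical model of the row-limit choice; not the plugin's code."""
    if num_cols == 0:
        # No columns generally means a count(): rows carry no data,
        # so there is no reason to cap the batch at a small row count.
        return INT_MAX
    # Assumed stand-in for the estimate: derive bytes per row from the sample,
    # then see how many such rows fit in the target batch size.
    bytes_per_row = max(sample_bytes / sample_rows, 1.0)
    return min(int(target_size_bytes / bytes_per_row), INT_MAX)


# Example: no columns versus two columns at an assumed 16 bytes per row.
limit_for_count = choose_batch_row_limit(0, 1 << 30, 0)
limit_for_data = choose_batch_row_limit(2, 1024 * 1024, 512 * 16)
```

Skipping the estimate entirely when there are no columns also avoids relying on what the estimator returns for an empty schema.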

tgravescs · 2021-03-04T22:38:41Z

build

gerashegalov

LGTM, 🚀

tgravescs added 5 commits March 4, 2021 10:15

Increase row limit when doing count() for HostColumnarToGpu

f2f5d46

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

put test back in

8900062

comment

edd1763

update test comment

5968107

update comment

8b8ca9b

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

tgravescs added the performance label Mar 4, 2021

tgravescs added this to the Mar 1 - Mar 12 milestone Mar 4, 2021

tgravescs self-assigned this Mar 4, 2021

revans2 reviewed Mar 4, 2021

gerashegalov reviewed Mar 4, 2021

Update count tests function, fix missed calls, and review comments

6d76e40

gerashegalov approved these changes Mar 5, 2021

tgravescs merged commit dc66f03 into NVIDIA:branch-0.5 Mar 5, 2021

tgravescs deleted the HostColumnCount branch March 5, 2021 22:57
