Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase row limit when doing count() for HostColumnarToGpu #1868

Merged
merged 6 commits into from
Mar 5, 2021

Conversation

tgravescs
Copy link
Collaborator

If doing a count() with HostcolumnToGPU which ends up doing a coalesce, the row limit is currently set at 512. This can be very inefficient for larger data. So when there aren't any columns specified just set the row limit to max integer.

This improved one query from 2 minutes down to 6 seconds.

I also added a new testing function assert_gpu_and_cpu_are_equal_count to be able to easily test count() calls.

fixes #1864

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs tgravescs added the performance A performance related task/issue label Mar 4, 2021
@tgravescs tgravescs added this to the Mar 1 - Mar 12 milestone Mar 4, 2021
@tgravescs tgravescs self-assigned this Mar 4, 2021
@tgravescs
Copy link
Collaborator Author

build

_assert_gpu_and_cpu_are_equal(func, True, conf=conf)
_assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf)

def assert_gpu_and_cpu_are_equal_count(func, conf={}):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a huge fan of this API. It makes it really easy to accidentally only compare the number of rows for a query instead of the actual data of the query. I understand why you are trying to do this, but the same thing can be achieved with assert_gpu_and_cpu_are_equal_collect and inserting a groupBy().count() in there.

This is not a show stopper. If you want to keep it, that is fine, I would just like a warning in the doc string about this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I thought the _count would imply that you are only comparing counts. We in general have no tests that test count() that I see and we have hit lots of issues with it so I thought it would be a convenient api to easily add tests without having to change the operation or a bunch of code.

I'm definitely open to changing if we think it will cause problems but I don't think groupBy().count() is as intuitive or convenient.

Perhaps just changing the name to assert_gpu_and_cpu_row_counts_equal and add more documentation? Or if you can think of another convenience api.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine with me. I mostly want to be sure that no one uses it on accident. Using it on purpose is fine.

@@ -279,6 +279,11 @@ class HostToGpuCoalesceIterator(iter: Iterator[ColumnarBatch],
// schema and desired batch size
batchRowLimit = GpuBatchUtils.estimateRowCount(goal.targetSizeBytes,
GpuBatchUtils.estimateGpuMemory(schema, 512), 512)
// when there aren't any columns, it generally means user is doing a count() and we don't
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find.

Let us not call estimateRowCount/estimateGpuMemory when there no columns:

batchRowLimit = if (batch.numCols() > 0 ) {
  GpuBatchUtils.estimateRowCount(goal.targetSizeBytes,
      GpuBatchUtils.estimateGpuMemory(schema, 512), 512)
} else {
  Integer.MAX_VALUE
}      

@tgravescs
Copy link
Collaborator Author

build

Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, 🚀

@tgravescs tgravescs merged commit dc66f03 into NVIDIA:branch-0.5 Mar 5, 2021
@tgravescs tgravescs deleted the HostColumnCount branch March 5, 2021 22:57
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
)

* Increase row limit when doing count() for HostColumnarToGpu

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* put test back in

* comment

* update test comment

* update comment

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update count tests function, fix missed calls, and review comments
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
)

* Increase row limit when doing count() for HostColumnarToGpu

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* put test back in

* comment

* update test comment

* update comment

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Update count tests function, fix missed calls, and review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]HostColumnarToGPU inefficient when only doing count()
3 participants