Use contiguous table when deserializing columnar batch #851

Merged (2 commits) on Sep 25, 2020

@jlowe jlowe commented Sep 24, 2020

Fixes #849.

This was straightforward to implement, as @revans2 detailed it so excellently in the feature request.

I profiled a query and verified that the contiguous splits are gone after a shuffle. This cut the wall-clock time of a test query in local mode from over 32 seconds to under 25 seconds.
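
As a rough illustration of why this can help, here is a toy Python model. It is not the plugin's Scala code or the cuDF API, and every name in it (`ColumnView`, `pack_columns`, `deserialize_contiguous`, `read_column`) is hypothetical. It assumes the serialized payload already holds the columns back to back, so deserialization only needs to describe each column as a window into one buffer, whereas columns that live in separate allocations have to be copied into a single buffer before they can be treated as a contiguous table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnView:
    """A column described as an (offset, length) window into a shared buffer."""
    offset: int
    length: int


def pack_columns(columns: list[bytes]) -> tuple[bytearray, list[ColumnView]]:
    """Copy separately allocated columns into one buffer (the extra packing step)."""
    buffer = bytearray()
    views = []
    for col in columns:
        views.append(ColumnView(offset=len(buffer), length=len(col)))
        buffer.extend(col)  # one copy per column
    return buffer, views


def deserialize_contiguous(
    payload: bytes, lengths: list[int]
) -> tuple[memoryview, list[ColumnView]]:
    """Use the already-contiguous payload as the table: build views, copy nothing."""
    views = []
    offset = 0
    for length in lengths:
        views.append(ColumnView(offset, length))
        offset += length
    if offset != len(payload):
        raise ValueError("column lengths do not cover the payload")
    return memoryview(payload), views


def read_column(buffer, view: ColumnView) -> bytes:
    """Materialize one column's bytes from the shared buffer."""
    return bytes(buffer[view.offset:view.offset + view.length])
```

In this model both paths end up with the same layout and the same views, but only `pack_columns` has to copy column data to get there.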

This also removes the unused numFields parameter from GpuColumnarBatchSerializer since I stumbled across it while doing this PR. Happy to move that to a separate PR if desired.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added the performance A performance related task/issue label Sep 24, 2020
@jlowe jlowe added this to the Sep 14 - Sep 25 milestone Sep 24, 2020
@jlowe jlowe self-assigned this Sep 24, 2020
jlowe commented Sep 24, 2020

build

revans2 commented Sep 25, 2020

Just FYI: numFields came from the original Spark serializer code, where an UnsafeRow needs to know how many fields there will be when it initializes so that it can allocate that amount of memory. I kept it when I first started porting the code over, but I should have dropped it a long time ago. Thanks.

@revans2 revans2 merged commit 14a9192 into NVIDIA:branch-0.3 Sep 25, 2020
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
* Use contiguous table when deserializing a table

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Remove unused numFields parameter from GpuColumnarBatchSerializer

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Use contiguous table when deserializing a table

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Remove unused numFields parameter from GpuColumnarBatchSerializer

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe deleted the deserialize-contig branch September 10, 2021 15:41
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#851)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[FEA] Have GpuColumnarBatchSerializer return GpuColumnVectorFromBuffer instances