
[FEA] Consider creating combined GpuCoalesceBatches and GpuShuffleExchange operator #719

Closed
andygrove opened this issue Sep 10, 2020 · 3 comments
Labels
feature request New feature or request P2 Not required for release performance A performance related task/issue

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.
When AQE is enabled and we are planning a new query stage, we must return an operator that implements ShuffleExchangeLike (since Spark 3.0.1), so we remove any GpuCoalesceBatches operator and later insert it around the GpuCustomShuffleReader that reads the shuffle output.

I think it is worth exploring an alternate approach: instead of removing the GpuCoalesceBatches operator, we create a new operator that combines GpuCoalesceBatches and GpuShuffleExchangeExec and return that as the new query stage.

The benefit of this approach, if it works, is that it makes the AQE and non-AQE plans more consistent and removes some complexity. It may also improve performance if it means the shuffle reader now reads coalesced batches, but I'm not 100% sure I am understanding this correctly, so a second opinion would be welcome.
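To make the two plan shapes concrete, here is a toy model of the rewrite described above, in plain Python with no Spark dependency. The `Node` class and the two planning functions are invented stand-ins for illustration only; the operator names (GpuCoalesceBatches, GpuShuffleExchangeExec, GpuCustomShuffleReader, and the hypothetical combined node) come from this issue, not from actual spark-rapids code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Toy stand-in for a Spark physical plan node."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def render(self) -> str:
        # Flatten the single-child chain top-down into "A -> B -> C".
        out, n = [self.name], self
        while n.children:
            n = n.children[0]
            out.append(n.name)
        return " -> ".join(out)

def plan_query_stage_current(plan: Node) -> Node:
    """Current approach: the new query stage must itself be a
    ShuffleExchangeLike, so strip GpuCoalesceBatches below the shuffle
    and re-insert it later, above GpuCustomShuffleReader."""
    assert plan.name == "GpuShuffleExchangeExec"
    child = plan.children[0]
    if child.name == "GpuCoalesceBatches":
        plan = Node("GpuShuffleExchangeExec", child.children)  # strip coalesce
    reader = Node("GpuCustomShuffleReader", [plan])
    return Node("GpuCoalesceBatches", [reader])  # re-insert above the reader

def plan_query_stage_combined(plan: Node) -> Node:
    """Proposed approach: a single combined node already implements
    ShuffleExchangeLike, so nothing is stripped or re-inserted."""
    assert plan.name == "GpuShuffleExchangeExec"
    child = plan.children[0]
    inner = child.children if child.name == "GpuCoalesceBatches" else [child]
    combined = Node("GpuCoalescedShuffleExchange", inner)  # hypothetical node
    return Node("GpuCustomShuffleReader", [combined])

scan = Node("GpuScan")
original = Node("GpuShuffleExchangeExec", [Node("GpuCoalesceBatches", [scan])])
print(plan_query_stage_current(original).render())
print(plan_query_stage_combined(original).render())
```

Running this prints the two chains side by side, showing that the combined approach yields one fewer rewrite step and no separate coalesce node to track through AQE planning.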

Describe the solution you'd like
See the previous section.

Describe alternatives you've considered
The alternative is the current design of coalescing after the shuffle reader.

Additional context
N/A

@andygrove andygrove added feature request New feature or request ? - Needs Triage Need team to review and classify labels Sep 10, 2020
@JustPlay

JustPlay commented Sep 11, 2020

It may also result in improved performance if it means that the shuffle reader is now reading coalesced batches

Would this be helpful for #679 (the GPU semaphore limiting CPU concurrency)?

@revans2
Collaborator

revans2 commented Sep 11, 2020

Would this be helpful for #679 (the GPU semaphore limiting CPU concurrency)?

I personally don't think the shuffle reader will be able to read coalesced batches. The reason the code is doing a shuffle is that the data needs to move from one location to another so that the desired data all ends up in the same place for processing. Even with AQE, where on the reading side we might in some cases end up reading more batches than we originally thought, the data has already been partitioned N ways on the assumption that it will be read N ways. Given how the data is laid out, we will still need to combine it at some point.

The only real advantage of putting them together for #679 would be in batching and signalling, like I talked about. It would let us bypass some of the complexity in trying to coordinate between the shuffle and coalesce.

@sameerz sameerz added P2 Not required for release performance A performance related task/issue and removed ? - Needs Triage Need team to review and classify labels Sep 15, 2020
@andygrove
Contributor Author

I am closing this now that I understand it better, and I agree with @revans2's comment that the shuffle reader would not be able to read coalesced batches.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#719)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
