Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Shim for BatchScanExec to Support Spark 4.0 [databricks] #10944

Merged
merged 2 commits into from
May 30, 2024

Conversation

razajafri
Copy link
Collaborator

Spark made a change to change BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys (previously grouped only by partition values). Do same for all auxiliary data structure, like commonPartValues.

This PR makes changes in the plugin to support that change.

fixes #10712

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri razajafri changed the base branch from branch-24.06 to branch-24.08 May 29, 2024 17:43
@razajafri razajafri added the Spark 4.0+ Spark 4.0+ issues label May 29, 2024
@mythrocks
Copy link
Collaborator

I can see how this addresses #10711 as well. I was hoping to address the same with less code repetition.

I'm taking a crack at #10711 right now. If there isn't a clean way to do this, I'll defer to the approach in this PR.

@razajafri
Copy link
Collaborator Author

I can see how this addresses #10711 as well. I was hoping to address the same with less code repetition.

I'm taking a crack at #10711 right now. If there isn't a clean way to do this, I'll defer to the approach in this PR.

I thought about refactoring but considering the previous GpuBatchScanExec versions I thought it is probably OK to keep it this way. Let's connect offline to see if we can merge your effort with this one

@mythrocks
Copy link
Collaborator

If there isn't a clean way to do this, I'll defer to the approach in this PR.

I can confirm that simply refactoring the call to groupPartitions() isn't going to solve the problem; the Spark 4.0 implementation differs drastically from the 3.5 version. Pulling out the commonality will only reduce the readability of this class.

I'm suspending my efforts on this front. Let's keep the approach we have on this PR.

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

build

Copy link
Collaborator

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, LGTM.

@razajafri razajafri merged commit a7cdaa9 into NVIDIA:branch-24.08 May 30, 2024
44 checks passed
SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024
…A#10944)

* Added shim for BatchScanExec to support Spark 4.0

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* fixed the failing shim

---------

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Spark 4.0+ Spark 4.0+ issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[AUDIT] BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys
2 participants