Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable UCX + AQE #613

Merged
merged 13 commits into from
Sep 8, 2020
Merged

Enable UCX + AQE #613

merged 13 commits into from
Sep 8, 2020

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Aug 26, 2020

Enables UCX + AQE.

This closes #455

@andygrove andygrove added the feature request New feature or request label Aug 26, 2020
@andygrove andygrove added this to the Aug 17 - Aug 28 milestone Aug 26, 2020
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove linked an issue Sep 2, 2020 that may be closed by this pull request
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove changed the title [WIP] Enable UCX + AQE Enable UCX + AQE Sep 2, 2020
…supported

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

exchanges.foreach(_.willNotWorkOnGpu(consistentExchangeMessage))
}
} else if (exchanges.isEmpty) {
// we only have query stages, and we cannot do anything to modify them at this point
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the side effect of not being able to tag them if they already started executing? I like the comments here, but they almost point at some indeterminate state where work already started.

// since they already started to execute
} else {
// there is a mix of ShuffleExchangeExec and ShuffleQueryStageExec so we need to tag any
// ShuffleExchangeExec nodes based on the other nodes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if any exchange or query exchange is not on the GPU, all exchanges and query exchanges must fallback to the CPU, that feed to the join? Just making sure that's the end result.

Should that happen also for queryStages? E.g. if exchanges are on the cpu, but queryStages are on the gpu, the following code will just tag the exchanges again for the cpu (which seems unnecessary, but I could be misreading it). Wonder if we need to also tell the queryStages that they can't work because of the consistent hashing issue.

@abellina
Copy link
Collaborator

abellina commented Sep 6, 2020

Oh @andygrove so there are changes to the joins here too. Should we change the title of the PR to reflect?

@andygrove
Copy link
Contributor Author

Oh @andygrove so there are changes to the joins here too. Should we change the title of the PR to reflect?

Yes, the scope definitely increased here. I should probably split this up into two separate PRs.

@abellina
Copy link
Collaborator

abellina commented Sep 6, 2020

I am ok with it going in one, just think it could be renamed. Especially if you think it’s good to go. Maybe “shuffle fixes for AQE”?

@andygrove
Copy link
Contributor Author

I am ok with it going in one, just think it could be renamed. Especially if you think it’s good to go. Maybe “shuffle fixes for AQE”?

I went ahead and split it out anyway, so this PR now builds on #676

I will try and address your comments over there.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

@abellina This PR now only has the necessary changes for UCX support.

@abellina
Copy link
Collaborator

abellina commented Sep 7, 2020

build

1 similar comment
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit 9157870 into NVIDIA:branch-0.2 Sep 8, 2020
@andygrove andygrove deleted the ucx-aqe branch September 8, 2020 20:58
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Support UCX when AQE is enabled

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Bug fix: Shuffle exchange needs to look at tagged plan

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* re-implement makeShuffleConsistent to support AQE

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* update comments

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* revert tpcds bench changes

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* make comments more consistent

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* unit test

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* code refinement

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* remove redundant repartitioning from test

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix incorrect logic in test for determining which spark versions are supported

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix scalastyle

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Remove changes unrelated to supporting UCX

Signed-off-by: Andy Grove <andygrove@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Support UCX when AQE is enabled

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Bug fix: Shuffle exchange needs to look at tagged plan

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* re-implement makeShuffleConsistent to support AQE

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* update comments

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* revert tpcds bench changes

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* make comments more consistent

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* unit test

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* code refinement

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* remove redundant repartitioning from test

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix incorrect logic in test for determining which spark versions are supported

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix scalastyle

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Remove changes unrelated to supporting UCX

Signed-off-by: Andy Grove <andygrove@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
[auto-merge] bot-auto-merge-branch-22.10 to branch-22.12 [skip ci] [bot]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support UCX shuffle with optimized AQE
3 participants