
Shim Layer to support multiple Spark versions #414

Merged — 62 commits, Jul 23, 2020

Conversation

tgravescs (Collaborator)

fixes #355

This adds a shim layer to support multiple Spark versions. This PR adds the framework and support for Apache Spark 3.0.0, Apache Spark 3.0.1, and Apache Spark 3.1.0. All the tests pass for Spark 3.0.0 and 3.0.1, but there are still a few failures for 3.1.0; we can finish resolving those after this is merged. The problem I keep running into is that upmerging to Spark 3.1, as well as changes in the plugin itself, keeps producing conflicts that require continuous retesting and fixing. So if we can get this part reviewed and in, I think it will be easier to finish the remaining things and allow multiple people to look at it.

Note that to shim the ShuffleManager, it currently requires the user to specify the Spark package version.
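For illustration, specifying the version-specific shuffle manager would look something like the following (the exact shim package and class name here are an assumption, not taken from this PR):

```shell
# Hypothetical example: the user selects the shuffle manager shimmed for
# their Spark version; the package segment (spark300) would need to match
# the Spark release in use.
spark-submit \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
  ...
```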

A lot of the changes are around joins. HashJoin had changes that require a lot of code to be copied into the shim layer. There may be ways to improve this and share code, but I would like to look more at that in a followup; if you have ideas, let me know. If you diff those files there should be very few differences. The other issue there is that BuildSide, BuildRight, and BuildLeft all moved packages.
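One common way to insulate a plugin from classes that moved packages is to define a plugin-owned type and let each version shim translate to the version-specific Spark class. A minimal sketch of that idea (the `GpuBuildSide` name and the shim interface are illustrative assumptions, not necessarily what this PR does; the package names reflect where `BuildSide` lived in Spark 3.0.x versus 3.1.x):

```java
// Sketch: a plugin-owned enum that stands in for Spark's BuildSide,
// which moved packages between Spark 3.0 and Spark 3.1.
public class BuildSideShim {
    // Plugin-side representation, independent of any Spark version.
    public enum GpuBuildSide { GPU_BUILD_LEFT, GPU_BUILD_RIGHT }

    // Each version shim implements this to map the plugin-side type
    // to the version-specific Spark class (hypothetical interface).
    public interface SparkShim {
        String buildSideClassName(GpuBuildSide side);
    }

    // Spark 3.0.x: BuildSide lived in org.apache.spark.sql.execution.joins.
    public static class Spark30Shim implements SparkShim {
        public String buildSideClassName(GpuBuildSide side) {
            return "org.apache.spark.sql.execution.joins."
                + (side == GpuBuildSide.GPU_BUILD_LEFT ? "BuildLeft" : "BuildRight");
        }
    }

    // Spark 3.1.x: it moved to org.apache.spark.sql.catalyst.optimizer.
    public static class Spark31Shim implements SparkShim {
        public String buildSideClassName(GpuBuildSide side) {
            return "org.apache.spark.sql.catalyst.optimizer."
                + (side == GpuBuildSide.GPU_BUILD_LEFT ? "BuildLeft" : "BuildRight");
        }
    }
}
```

Callers only ever see `GpuBuildSide`, so the package move is contained entirely inside the per-version shims.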

Spark 3.1.0 also had API changes in other areas: TimeSub, ScalaUDF, MapOutputTracker, ShuffleManager, FileSourceScan, First, and Last.

For the shim layer itself we use service loaders for each of the versions. There is a lightweight loader class that first determines whether the loader applies to the running Spark version; it then has a buildShim function to load the entire shim. This keeps us from loading a bunch of classes we don't really need.
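A minimal sketch of that two-step pattern (the provider interface name and version strings are illustrative assumptions; in a real plugin the providers would be discovered via `java.util.ServiceLoader` from `META-INF/services` entries rather than a hardcoded list):

```java
import java.util.List;

public class ShimLoaderSketch {
    // Lightweight provider: cheap to instantiate, only answers
    // "does this shim apply to the given Spark version?"
    public interface SparkShimServiceProvider {
        boolean matchesVersion(String sparkVersion);
        // Only called on the matching provider, so the heavyweight shim
        // classes for all the other versions are never loaded.
        Object buildShim();
    }

    static class Spark300Provider implements SparkShimServiceProvider {
        public boolean matchesVersion(String v) { return v.equals("3.0.0"); }
        public Object buildShim() { return "Spark300Shims"; }
    }

    static class Spark310Provider implements SparkShimServiceProvider {
        public boolean matchesVersion(String v) { return v.equals("3.1.0"); }
        public Object buildShim() { return "Spark310Shims"; }
    }

    // In the real plugin the list would come from
    // ServiceLoader.load(SparkShimServiceProvider.class).
    static Object loadShim(String sparkVersion,
                           List<SparkShimServiceProvider> providers) {
        for (SparkShimServiceProvider p : providers) {
            if (p.matchesVersion(sparkVersion)) {
                return p.buildShim();
            }
        }
        throw new IllegalArgumentException("No shim for Spark " + sparkVersion);
    }
}
```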

I added spark30tests and spark31tests profiles that you can specify to run the tests against the different versions. Once we have things working we can hook that up to CI, along with running the integration tests on both versions. slf4j in the tests had conflicts with the cudf version, so for now I made the tests pull in the newer version; I want to update cudf to use the same version as well.
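Assuming standard Maven profile activation, running the test suites against each version would look something like this (the profile names come from this PR; the goal shown is illustrative):

```shell
# Run the tests against Spark 3.0.x
mvn test -Pspark30tests

# Run the tests against Spark 3.1.0
mvn test -Pspark31tests
```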

Note that I tested the core sql-plugin code by building against both Spark 3.0 and Spark 3.1 to ensure I didn't miss anything there. Tests were run against both versions: there is one unit test failure on 3.1, and 3.1 has some integration test failures — some FullOuter join tests, a log test, and the TimeSub tests.

I filed a few followup issues to cover the remaining items: commonizing the join code, changing the loaders to use priority, investigating whether we want version matching to be strict or not, and investigating a common base class for shuffle.

@tgravescs (Collaborator, Author)

build

@tgravescs (Collaborator, Author)

Note: I'm running a final pass over the integration tests now.

dist/pom.xml
@tgravescs (Collaborator, Author)

Ran integration tests on 3.0.0 and 3.0.1, and all pass; 3.1.0 has a few failures that I expected and have filed an issue to investigate: #416

@tgravescs tgravescs merged commit d2383d2 into NVIDIA:branch-0.2 Jul 23, 2020
@jlowe jlowe added this to the Jul 20 - Jul 31 milestone Jul 24, 2020
@sameerz sameerz added the build Related to CI / CD or cleanly building label Jul 27, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Shim Layer to support multiple Spark versions - adds Spark 3.0.0, 3.0.1, and 3.1.0

Signed-off-by: Thomas Graves <tgraves@nvidia.com>
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
build Related to CI / CD or cleanly building
Development

Successfully merging this pull request may close these issues.

[FEA] Support Multiple Spark versions in the same jar