Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shim for Databricks 10.4 [databricks] #4974

Merged
merged 9 commits into from
Mar 18, 2022
Merged

Conversation

jlowe
Copy link
Member

@jlowe jlowe commented Mar 17, 2022

Closes #4914.

Adds a shim for the Databricks 10.4 runtime. The dist pom has not been updated to reference this version, as we need to setup build pipelines for these new Databricks-based jars. Once those pipelines are setup, we can followup with a PR to update the dist pom and build scripts.

Some code from 301until330-all has been refactored into a 301until330-nondb directory to allow some reuse of files in the original 301until3300-all directory with the new 321db shim.

Once hiccup with this shim is that it appears First and possibly other aggregation expressions have an updated intermediate data format which lead to some aggregation test failures. See #4963. To be on the safe side until this is further investigated, aggregations are not replaced on Databricks 10.4 unless we can replace both sides of the aggregation.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added this to the Feb 28 - Mar 18 milestone Mar 17, 2022
@jlowe jlowe requested a review from revans2 as a code owner March 17, 2022 15:53
@jlowe jlowe self-assigned this Mar 17, 2022
@jlowe
Copy link
Member Author

jlowe commented Mar 17, 2022

build

@jlowe
Copy link
Member Author

jlowe commented Mar 17, 2022

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this looks good to me, but I am not ready to approve it yet. There is just so much code I don't know what changes were copy/paste and what are specific to the new version of databricks. If you want to sit down and have us go over it together we can, or I can apply the patch and manually find the files with the same names and diff them. Either way works fine with me.

jenkins/databricks/build.sh Outdated Show resolved Hide resolved
@jlowe
Copy link
Member Author

jlowe commented Mar 17, 2022

build

@jlowe
Copy link
Member Author

jlowe commented Mar 17, 2022

I don't know what changes were copy/paste and what are specific to the new version of databricks.

I can give a quick overview of the majority of the changes. They can be summed up in the following high-level changes:

  • Moving some files in until330-all to until330-nondb, triggering copies in the other two existing Databricks shims
  • New files in 321db (this new shim). Most of these are copies of what the 321 shim was using, but there are exceptions.
  • New code in AggregationTagging and aggregate.scala to support avoiding replacing only one half of an aggregate when using the 321db shim.

As expected, Databricks 10.4 is not just Apache Spark 3.2.1 plus custom changes, as it also has changes from Apache Spark 3.3.0 mixed in. As such, there were files in '*until330-all' directories that no longer applied to all once this shim appeared. To reconcile that, I moved the files incompatible with the new Databricks shim into until330-nondb directories which then triggered a copy of those files into the two existing Databricks shims that do not use the directory. The 301db shim is going away very soon, so it's net one extra copy for the 312db shim. Everything that was added to 301db and 31xdb should be the same as the new files added to corresponding until330-nondb directory.

Similarly, there was a 30Xuntil33X shim class that, once this shim was added, no longer applied to all shims. I split out the incompatible methods into a new nondb shim and updated all the original users to use that as well. The existing Databricks shims got copies of these separated methods since they don't use nondb dirs/code.

@pxLi
Copy link
Collaborator

pxLi commented Mar 18, 2022

@NvTimLiu can you help take a look at the CICD requirement for db 10.4 shims? thanks!

@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

@tgravescs @revans2 I think I have addressed your concerns.

@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

build

@jlowe jlowe marked this pull request as draft March 18, 2022 16:19
@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

Converting this to draft while I manually run the 321db tests

@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

build

@tgravescs
Copy link
Collaborator

want to change out of draft?

@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

want to change out of draft?

Still waiting for the manual run of the integration tests on Datarbricks 10.4 on the latest changes to complete. I expect it to pass as it did before, but I don't want this to be merged until we have confirmation those tests are passing.

@jlowe jlowe marked this pull request as ready for review March 18, 2022 19:30
@jlowe
Copy link
Member Author

jlowe commented Mar 18, 2022

Databricks 10.4 test run passed, so just waiting for the clean CI run at this point.

@jlowe jlowe merged commit 5818905 into NVIDIA:branch-22.04 Mar 18, 2022
@jlowe jlowe deleted the shim-321db branch March 18, 2022 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support for Databricks 10.4 ML LTS
4 participants