
Consolidate Spark vendor shim dependency management [databricks] #9182

Merged

Conversation


@gerashegalov gerashegalov commented Sep 5, 2023

Replaces numerous instances of duplicated dependency definitions for the Cloudera and Databricks shims with aggregated definitions.

Verification along these lines: run `buildall`, unjar all jars into separate dirs, then diff:

```shell
cd before
find . -path '*/target/*.jar' | grep -v 'dist/target/deps' | xargs -n 1 bash -c 'jar_dir=.jars/$(basename $1); mkdir -p $jar_dir; unzip -d $jar_dir $1 \*.class' _
...
diff -r before/spark-rapids/.jars after/spark-rapids/.jars
Only in before/spark-rapids/.jars/rapids-4-spark-integration-tests_2.12-23.10.0-SNAPSHOT-spark330db-jar-with-dependencies.jar/org/apache: arrow
```

The diff is because of the previous special-case compile scope for arrow in integration_tests, used only in the databricks profile. I think it may no longer be necessary. If a post-merge test breaks, I will fix it in a follow-up PR.
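The consolidation described in this PR can be sketched roughly as follows (a sketch only; the coordinates and module name below are illustrative, not copied from the repo): instead of each vendor shim module repeating the same dependency list, a pom-packaged aggregator module declares the vendor dependencies once, and shim modules pick them up through it:

```xml
<!-- Illustrative aggregator module (packaging pom): the vendor-specific
     dependencies, with their shared scope, are declared once here instead
     of being repeated in every shim module. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.example</groupId>
    <artifactId>example-parent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>
  <artifactId>example-vendor-deps</artifactId>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
```

A shim module then inherits or depends on this module rather than restating each vendor artifact, so a scope or exclusion change is made in one place.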

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov self-assigned this Sep 5, 2023
@gerashegalov gerashegalov added the build Related to CI / CD or cleanly building label Sep 5, 2023
@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov changed the title Consolidate Spark vendor shim dependency management Consolidate Spark vendor shim dependency management [databricks] Sep 5, 2023
@gerashegalov
Collaborator Author

build

```xml
<version>23.10.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
</parent>
<artifactId>rapids-4-spark-cdh-bom</artifactId>
```
what does bom stand for?

@gerashegalov gerashegalov Sep 5, 2023
I can rename it if needed. My original intention was to use a pure BOM pattern I found recommended somewhere on Stack Overflow. However, dependencyManagement does not support the exclusions used in the CDH context, and even if it did, we would still need to do some consolidation to avoid repeating <dependencies>. I found that the latter makes the former redundant. So the definition of a BOM in our project is not exactly what the Maven docs describe, but it is the same in spirit.
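For context, a minimal sketch of the "pure BOM" pattern mentioned here (illustrative coordinates, not from this repo): a pom-packaged module pins versions via dependencyManagement, and consumers import it with scope import. As noted above, this only pins versions; each consumer still has to repeat its own <dependencies> entries, which is what the aggregated-dependencies approach in this PR avoids:

```xml
<!-- Illustrative pure-BOM module (packaging pom): pins versions only. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>example-bom</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>${spark.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

A consumer would then import the BOM under dependencyManagement with `<scope>import</scope>` and `<type>pom</type>`, but would still list each artifact it actually uses under its own <dependencies>.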

It's fine to leave it as is.

@revans2 revans2 left a comment

I assume that nothing needs to change with how we package and publish the resulting poms/jars?

@revans2
Collaborator

revans2 commented Sep 5, 2023

build

@gerashegalov
Collaborator Author

> I assume that nothing needs to change with how we package and publish the resulting poms/jars?

At least, that is the goal.

@gerashegalov
Collaborator Author

I ran another round of verification:

```shell
cd before
find . -path '*/target/*.jar' | grep -v 'dist/target/deps' | xargs -n 1 bash -c 'jar_dir=.jars/$(basename $1); mkdir -p $jar_dir; unzip -d $jar_dir $1 \*.class' _
...
diff -r before/spark-rapids/.jars after/spark-rapids/.jars
Only in before/spark-rapids/.jars/rapids-4-spark-integration-tests_2.12-23.10.0-SNAPSHOT-spark330db-jar-with-dependencies.jar/org/apache: arrow
```

revealing a difference in the presence of arrow classes in the before integration_tests assembly jar, which is attributable to the databricks profile listing those artifacts as compile scope:

https://github.com/NVIDIA/spark-rapids/pull/9182/files#diff-4b412ae5512cbf72928bdd8bfe6aa562a04f709fcbb7d97b04b6781e338ab987L255-L272

It looks to me that the provided scope is correct, but it appears it was intentionally set to compile: https://github.com/NVIDIA/spark-rapids/pull/3335/files#diff-4b412ae5512cbf72928bdd8bfe6aa562a04f709fcbb7d97b04b6781e338ab987R213-R230
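If a post-merge test does break, the override mentioned above might look roughly like this (a sketch only; the profile id and arrow artifact coordinates here are illustrative, not taken from the repo):

```xml
<!-- Illustrative profile override: force arrow back to compile scope for
     the databricks build only, leaving the consolidated default as provided. -->
<profiles>
  <profile>
    <id>databricks</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-vector</artifactId>
        <version>${arrow.version}</version>
        <scope>compile</scope>
      </dependency>
    </dependencies>
  </profile>
</profiles>
```

A profile-level <dependencies> entry like this takes effect only when the profile is active, so the compile-scope special case stays isolated from the aggregated definitions.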
@tgravescs do you remember why we needed compile for arrow in db shims?

@tgravescs
Collaborator

I do not. We have some arrow test classes for datasource v2 testing in the integration tests; maybe it was for that.

@gerashegalov
Collaborator Author

> I do not, we have some arrow test classes for datasource v2 testing in the integration tests, maybe around that.

Thanks, merging as is; I will put in an override if it breaks tests.

@gerashegalov gerashegalov merged commit 0e2fc80 into NVIDIA:branch-23.10 Sep 6, 2023
31 checks passed
@gerashegalov gerashegalov deleted the consolidateDependencyManagement branch September 6, 2023 14:06
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this pull request Oct 24, 2023
replace dbdeps not enabled using profiles, this case is missed in NVIDIA#9182

Signed-off-by: Gera Shegalov <gera@apache.org>
gerashegalov added a commit that referenced this pull request Oct 24, 2023
- replace dbdeps not enabled using profiles, this case is missed in #9182
- remove outdated comments

Signed-off-by: Gera Shegalov <gera@apache.org>