
Consolidate Spark vendor shim dependency management [databricks] #9182

Merged

Conversation


@gerashegalov gerashegalov commented Sep 5, 2023

Replaces numerous instances of duplicated dependency definitions for the Cloudera and Databricks shims with aggregated definitions.

Verification along these lines: run `buildall`, unjar all jars into separate dirs, then diff:

```shell
cd before
find . -path '*/target/*.jar' | grep -v 'dist/target/deps' | xargs -n 1 bash -c 'jar_dir=.jars/$(basename $1); mkdir -p $jar_dir; unzip -d $jar_dir $1 \*.class' _
...
diff -r before/spark-rapids/.jars after/spark-rapids/.jars
Only in before/spark-rapids/.jars/rapids-4-spark-integration-tests_2.12-23.10.0-SNAPSHOT-spark330db-jar-with-dependencies.jar/org/apache: arrow
```

The diff is because of the previous special-case compile scope for arrow in integration_tests, used only in the databricks profile. I think it may no longer be necessary. If a post-merge test breaks, I will fix it in a follow-up PR.
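The consolidation described in this PR can be sketched roughly as follows (a sketch only; the coordinates and module name below are illustrative, not copied from the repo): instead of each vendor shim module repeating the same dependency list, a pom-packaged aggregator module declares the vendor dependencies once, and shim modules pick them up through it:

```xml
<!-- Illustrative aggregator module (packaging pom): the vendor-specific
     dependencies, with their shared scope, are declared once here instead
     of being repeated in every shim module. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>com.example</groupId>
    <artifactId>example-parent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>
  <artifactId>example-vendor-deps</artifactId>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
```

A shim module then inherits or depends on this module rather than restating each vendor artifact, so a scope or exclusion change is made in one place.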

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov self-assigned this Sep 5, 2023
@gerashegalov gerashegalov added the build Related to CI / CD or cleanly building label Sep 5, 2023
@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov changed the title Consolidate Spark vendor shim dependency management Consolidate Spark vendor shim dependency management [databricks] Sep 5, 2023
@gerashegalov
Collaborator Author

build

```xml
<version>23.10.0-SNAPSHOT</version>
<relativePath>../../pom.xml</relativePath>
</parent>
<artifactId>rapids-4-spark-cdh-bom</artifactId>
```
what does bom stand for?

@gerashegalov gerashegalov Sep 5, 2023
I can rename it if needed. My original intention was to use a pure BOM pattern I found recommended somewhere on Stack Overflow. However, dependencyManagement does not support the exclusions used in the CDH context, and even if it did, we would still need to do some consolidation to avoid repeating <dependencies>. I found that the latter makes the former redundant. So the definition of a BOM in our project is not exactly what the Maven docs describe, but it is the same in spirit.
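For context, a minimal sketch of the "pure BOM" pattern mentioned here (illustrative coordinates, not from this repo): a pom-packaged module pins versions via dependencyManagement, and consumers import it with scope import. As noted above, this only pins versions; each consumer still has to repeat its own <dependencies> entries, which is what the aggregated-dependencies approach in this PR avoids:

```xml
<!-- Illustrative pure-BOM module (packaging pom): pins versions only. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>example-bom</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>${spark.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</project>
```

A consumer would then import the BOM under dependencyManagement with `<scope>import</scope>` and `<type>pom</type>`, but would still list each artifact it actually uses under its own <dependencies>.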

It's fine to leave it as is.

@revans2 revans2 left a comment

I assume that nothing needs to change with how we package and publish the resulting poms/jars?

@revans2
Collaborator

revans2 commented Sep 5, 2023

build

@gerashegalov
Collaborator Author

> I assume that nothing needs to change with how we package and publish the resulting poms/jars?

At least, that is the goal.

@gerashegalov
Collaborator Author

I ran another round of verification:

```shell
cd before
find . -path '*/target/*.jar' | grep -v 'dist/target/deps' | xargs -n 1 bash -c 'jar_dir=.jars/$(basename $1); mkdir -p $jar_dir; unzip -d $jar_dir $1 \*.class' _
...
diff -r before/spark-rapids/.jars after/spark-rapids/.jars
Only in before/spark-rapids/.jars/rapids-4-spark-integration-tests_2.12-23.10.0-SNAPSHOT-spark330db-jar-with-dependencies.jar/org/apache: arrow
```

revealing a difference in the presence of arrow classes in the before integration_tests assembly jar, which is attributable to the databricks profile listing those artifacts as compile scope:

https://github.com/NVIDIA/spark-rapids/pull/9182/files#diff-4b412ae5512cbf72928bdd8bfe6aa562a04f709fcbb7d97b04b6781e338ab987L255-L272

It looks to me that the provided scope is correct, but it appears it was intentionally set to compile: https://github.com/NVIDIA/spark-rapids/pull/3335/files#diff-4b412ae5512cbf72928bdd8bfe6aa562a04f709fcbb7d97b04b6781e338ab987R213-R230
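If a post-merge test does break, the override mentioned above might look roughly like this (a sketch only; the profile id and arrow artifact coordinates here are illustrative, not taken from the repo):

```xml
<!-- Illustrative profile override: force arrow back to compile scope for
     the databricks build only, leaving the consolidated default as provided. -->
<profiles>
  <profile>
    <id>databricks</id>
    <dependencies>
      <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-vector</artifactId>
        <version>${arrow.version}</version>
        <scope>compile</scope>
      </dependency>
    </dependencies>
  </profile>
</profiles>
```

A profile-level <dependencies> entry like this takes effect only when the profile is active, so the compile-scope special case stays isolated from the aggregated definitions.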
@tgravescs do you remember why we needed compile for arrow in db shims?

@tgravescs
Collaborator

I do not. We have some arrow test classes for datasource v2 testing in the integration tests; maybe it was for that.

@gerashegalov
Collaborator Author

> I do not, we have some arrow test classes for datasource v2 testing in the integration tests, maybe around that.

Thanks, merging as is; I will put in an override if it breaks tests.

@gerashegalov gerashegalov merged commit 0e2fc80 into NVIDIA:branch-23.10 Sep 6, 2023
31 checks passed
@gerashegalov gerashegalov deleted the consolidateDependencyManagement branch September 6, 2023 14:06
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this pull request Oct 24, 2023
replace dbdeps not enabled using profiles, this case is missed in NVIDIA#9182

Signed-off-by: Gera Shegalov <gera@apache.org>
gerashegalov added a commit that referenced this pull request Oct 24, 2023
- replace dbdeps not enabled using profiles, this case is missed in #9182
- remove outdated comments

Signed-off-by: Gera Shegalov <gera@apache.org>