Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move the udf-examples module from spark-rapids repository to here #94

Merged
merged 13 commits into from
Feb 11, 2022

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Jan 26, 2022

[BUG] Dependencies missing of spark-rapids v21.12.0 release jars NVIDIA/spark-rapids#4253
Move the udf-examples module from spark-rapids repository to here

Move the udf-examples to the external repository spark-rapids-examples and the spark-rapids will no longer contain the udf-examples module.
The moved udf-examples depends on the previous release dist jar, can't depends on the Snapshot jar, because of the Snapshot jar is not published to the Maven Central.
e.g.: udf-examples 22.04 Snapshot depends on dist 22.02.0 version.

Signed-off-by: Chong Gao res_life@163.com

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life marked this pull request as ready for review January 27, 2022 11:20
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Spark-Rapids directory seems redundant, as this repository is already named spark-rapids-examples. I suggest just dropping the Spark-Rapids directory and using udf-examples directly or possibly renaming it something like RAPIDS-accelerated-UDFs.

Comment on lines 17 to 18
<!-- Note: rapids-4-spark release version is not published in the Maven Central -->
<!-- Depends on its release version 21.12.0 -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This comment is redundant with the comment next to to the <version> tag of the dependency below, and this comment has the drawback that it needs to be kept in sync with the version used below. I suggest removing this and just keeping the comment below (if any comment at all).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


<parent>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-parent</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the need for this project to depend on the parent project of a separate repository. We should just copy the few relevant sections here (i.e.: license and developers), add its own scm section, etc. All the profile shenanigans and dependencyMgmt should not apply here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the dependency to Rapids parent project.
Removed the databricks and cdh311 profiles. The pre-merge and nightly build will be tracked by NVIDIA/spark-rapids#4704.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This examples are target to Spark 3.0.1, is it necessary to add profiles for other Spark versions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be fine. The UDF examples are using public, stable APIs of Spark and thus should not need to be recompiled per version of Spark.

If we had to compile per Spark verssion then we'd need to deploy multiple versions of the UDF examples jar with a Spark version classifier.

@nvliyuan
Copy link
Collaborator

nvliyuan commented Feb 7, 2022

Can we change this pr target to 22.02, the udf-examples can be found more easily. Since we will update 22.04 as the default branch two months later(the date when rapids 22.04 released)

@res-life res-life changed the base branch from branch-22.04 to branch-22.02 February 7, 2022 10:35
@res-life
Copy link
Collaborator Author

res-life commented Feb 7, 2022

Rebased to branch 22.02

user defined functions for use with the RAPIDS Accelerator
for Apache Spark
</description>
<version>22.04.0-SNAPSHOT</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're targeting branch-22.02 but marking it as 22.04-SNAPSHOT?

Changing this to 22.02-SNAPSHOT probably is not the right fix. I don't think it makes sense to target branch-22.02 because for version 22.02 of the plugin repo, the "official" artifact for 22.02 still comes from the plugin repository, and thus we'd have two separate codebases that claim to publish version 22.02 of the artifact.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to version 0.1

</parent>
<artifactId>rapids-4-spark-udf-examples_2.12</artifactId>
<groupId>com.nvidia</groupId>
<artifactId>RAPIDS-accelerated-UDFs_2.12</artifactId>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment about RAPIDS-accelerated-UDFs referred to the directory name in the repo, not the artifact name.

I think we should prefer the previous artifact name. The previous name leverages the same rapids-4-spark name of the plugin, so it's clearly tied to and related to that project, and also notes that these jars contain UDF examples.

@res-life res-life self-assigned this Feb 8, 2022
user defined functions for use with the RAPIDS Accelerator
for Apache Spark
</description>
<version>0.1</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind picking version 0.1? We're using calendar versioning, and this is currently targeting branch-22.02 (but it should target branch-22.04 per my previous comment, as the RAPIDS Accelerator version 22.02 has the "official" source for this artifact for 22.02).

This should be version 22.04.0-SNAPSHOT and updated to 22.04.0 just before deploying the official artifact when we ship version 22.04 across our repos.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw the version of Spark-ETL+XGBoost is 0.2.2.
Updated to version 22.04.0-SNAPSHOT.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we also need to change Spark-ETL+XGBoost version to calendar version to keep in sync, I will update it in the coming up PR.

@res-life res-life changed the base branch from branch-22.02 to branch-22.04 February 9, 2022 01:24
jlowe
jlowe previously approved these changes Feb 9, 2022
README.md Outdated Show resolved Hide resolved
@nvliyuan nvliyuan self-requested a review February 11, 2022 07:49
@nvliyuan nvliyuan merged commit c5783c2 into NVIDIA:branch-22.04 Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants