Move the udf-examples module to the external repository spark-rapids-examples [databricks] #4619

Merged: 12 commits, Feb 23, 2022
Changes from 8 commits
40 changes: 3 additions & 37 deletions docs/additional-functionality/rapids-udfs.md
Expand Up @@ -134,44 +134,10 @@ type `DECIMAL64(scale=-2)`.

## RAPIDS Accelerated UDF Examples

Source code for examples of RAPIDS accelerated Hive UDFs is provided
in the [udf-examples](../../udf-examples) project.

### Spark Scala UDF Examples

- [URLDecode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLDecode.scala)
decodes URL-encoded strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [URLEncode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLEncode.scala)
URL-encodes strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)

### Spark Java UDF Examples

- [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java)
decodes URL-encoded strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java)
URL-encodes strings using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
- [CosineSimilarity](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java)
computes the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
between two float vectors using [native code](../../udf-examples/src/main/cpp/src)

### Hive UDF Examples

- [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java)
implements a Hive simple UDF using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
to decode URL-encoded strings
- [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java)
implements a Hive generic UDF using the
[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
to URL-encode strings
- [StringWordCount](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java)
implements a Hive simple UDF using
[native code](../../udf-examples/src/main/cpp/src) to count words in strings
Source code for examples of RAPIDS accelerated UDFs is provided

Member

The extra whitespace ends up disconnecting this sentence when it's rendered.

Suggested change

<!-- Note: should update the branch name to tag when releasing-->
in the [udf-examples](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.04/examples/Spark-Rapids/udf-examples) project.
Member

It's unfortunate there isn't a main branch in the spark-rapids-examples repository we can point to so we're always pointing to the latest stable release code. Otherwise we're going to forget about this and then point to possibly very stale examples, and the user would miss any new examples that were added in later versions of the spark-rapids-examples code. @GaryShen2008 should we be doing a merge to main on the spark-rapids-examples repository as we do for this one?

Collaborator

Sure, we can create a main branch for the latest release version.


## GPU Support for Pandas UDF

Expand Down
8 changes: 4 additions & 4 deletions integration_tests/README.md
Expand Up @@ -126,15 +126,15 @@ The test files are everything under `./integration_tests/src/test/resources/` B
where you placed them because you will need to tell the tests where they are.

When running these tests you will need to include the test jar, the integration test jar,
the udf-examples jar, scala-test and scalactic. You can find scala-test and scalactic under
the scala-test and scalactic. You can find scala-test and scalactic under
`~/.m2/repository`.

It is recommended that you use `spark-shell` and the scalatest shell to run each test
individually, so you don't risk running unit tests along with the integration tests.
http://www.scalatest.org/user_guide/using_the_scalatest_shell

```shell
spark-shell --jars rapids-4-spark-tests_2.12-22.04.0-SNAPSHOT-tests.jar,rapids-4-spark-udf-examples_2.12-22.04.0-SNAPSHOT.jar,rapids-4-spark-integration-tests_2.12-22.04.0-SNAPSHOT-tests.jar,scalatest_2.12-3.0.5.jar,scalactic_2.12-3.0.5.jar
spark-shell --jars rapids-4-spark-tests_2.12-22.04.0-SNAPSHOT-tests.jar,rapids-4-spark-integration-tests_2.12-22.04.0-SNAPSHOT-tests.jar,scalatest_2.12-3.0.5.jar,scalactic_2.12-3.0.5.jar
```

First you import the `scalatest_shell` and tell the tests where they can find the test files you
Expand All @@ -157,7 +157,7 @@ If you just want to verify the SQL replacement is working you will need to add t
example assumes CUDA 11.0 is being used.

```
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar,rapids-4-spark-udf-examples_2.12-22.04.0-SNAPSHOT.jar,cudf-22.04.0-SNAPSHOT-cuda11.jar" ./runtests.py
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar,cudf-22.04.0-SNAPSHOT-cuda11.jar" ./runtests.py
```

You don't have to enable the plugin for this to work, the test framework will do that for you.
Expand Down Expand Up @@ -256,7 +256,7 @@ To run cudf_udf tests, need following configuration changes:
As an example, here is the `spark-submit` command with the cudf_udf parameter on CUDA 11.0:

```
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar,rapids-4-spark-udf-examples_2.12-22.04.0-SNAPSHOT.jar,cudf-22.04.0-SNAPSHOT-cuda11.jar,rapids-4-spark-tests_2.12-22.04.0-SNAPSHOT.jar" --conf spark.rapids.memory.gpu.allocFraction=0.3 --conf spark.rapids.python.memory.gpu.allocFraction=0.3 --conf spark.rapids.python.concurrentPythonWorkers=2 --py-files "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar" --conf spark.executorEnv.PYTHONPATH="rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar" ./runtests.py --cudf_udf
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar,cudf-22.04.0-SNAPSHOT-cuda11.jar,rapids-4-spark-tests_2.12-22.04.0-SNAPSHOT.jar" --conf spark.rapids.memory.gpu.allocFraction=0.3 --conf spark.rapids.python.memory.gpu.allocFraction=0.3 --conf spark.rapids.python.concurrentPythonWorkers=2 --py-files "rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar" --conf spark.executorEnv.PYTHONPATH="rapids-4-spark_2.12-22.04.0-SNAPSHOT.jar" ./runtests.py --cudf_udf
```
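The long `spark-submit` line above is easier to audit when assembled from parts. The following is a minimal Python sketch, not part of the repository: the helper name `spark_submit_args` and its parameters are hypothetical, and the jar names and configuration values are copied from the updated commands in this README. It shows that the udf-examples jar is no longer on the `--jars` list in either mode.

```python
# Hypothetical helper, for illustration only; not part of the repository.
VERSION = "22.04.0-SNAPSHOT"  # version string copied from the commands above
PLUGIN_JAR = f"rapids-4-spark_2.12-{VERSION}.jar"


def spark_submit_args(cudf_udf=False, spark_submit="spark-submit"):
    """Return the argument list for running runtests.py."""
    # After this change only the plugin jar and the cudf jar are always needed.
    jars = [PLUGIN_JAR, f"cudf-{VERSION}-cuda11.jar"]
    extra = []
    if cudf_udf:
        # The cudf_udf run also needs the tests jar and GPU memory settings.
        jars.append(f"rapids-4-spark-tests_2.12-{VERSION}.jar")
        confs = {
            "spark.rapids.memory.gpu.allocFraction": "0.3",
            "spark.rapids.python.memory.gpu.allocFraction": "0.3",
            "spark.rapids.python.concurrentPythonWorkers": "2",
            "spark.executorEnv.PYTHONPATH": PLUGIN_JAR,
        }
        for key, value in confs.items():
            extra += ["--conf", f"{key}={value}"]
        extra += ["--py-files", PLUGIN_JAR]
    tail = ["./runtests.py"] + (["--cudf_udf"] if cudf_udf else [])
    return [spark_submit, "--jars", ",".join(jars)] + extra + tail


if __name__ == "__main__":
    print(" ".join(spark_submit_args(cudf_udf=True)))
```

Passing the result to `subprocess.run` as a list would avoid shell quoting issues; in that case `spark_submit` should be set to the full path under `$SPARK_HOME/bin`.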

## Writing tests
Expand Down
6 changes: 1 addition & 5 deletions integration_tests/conftest.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -35,10 +35,6 @@ def pytest_addoption(parser):
parser.addoption(
"--cudf_udf", action='store_true', default=False, help="if true enable cudf_udf test"
)
parser.addoption(
"--rapids_udf_example_native", action='store_true', default=False,
help="if true enable tests for RAPIDS UDF examples with native code"
)
parser.addoption(
"--test_type", action='store', default="developer",
help="the type of tests that are being run to help check all the correct tests are run - developer, pre-commit, or nightly"
Expand Down
46 changes: 40 additions & 6 deletions integration_tests/pom.xml
Expand Up @@ -60,17 +60,16 @@
<version>${project.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-udf-examples_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.test.version}</version>
</dependency>
<dependency>
<!-- for hive udf test cases -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
</dependency>
</dependencies>

<profiles>
Expand Down Expand Up @@ -108,6 +107,17 @@
<artifactId>curator-recipes</artifactId>
<version>4.3.0.7.2.7.0-184</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${spark311cdh.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
</exclusion>
</exclusions>
</dependency>
</dependencies>
</profile>
<profile>
Expand Down Expand Up @@ -178,6 +188,30 @@
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-serde</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
</profile>
</profiles>
Expand Down
3 changes: 1 addition & 2 deletions integration_tests/pytest.ini
@@ -1,4 +1,4 @@
; Copyright (c) 2020-2021, NVIDIA CORPORATION.
; Copyright (c) 2020-2022, NVIDIA CORPORATION.
;
; Licensed under the Apache License, Version 2.0 (the "License");
; you may not use this file except in compliance with the License.
Expand All @@ -22,7 +22,6 @@ markers =
limit(num_rows): Limit the number of rows that will be check in a result
qarun: Mark qa test
cudf_udf: Mark udf cudf test
rapids_udf_example_native: test UDFs that require custom cuda compilation
validate_execs_in_gpu_plan([execs]): Exec class names to validate they exist in the GPU plan.
shuffle_test: Mark to include test in the RAPIDS Shuffle Manager
premerge_ci_1: Mark test that will run in first k8s pod in case of parallel build premerge job
Expand Down
6 changes: 2 additions & 4 deletions integration_tests/run_pyspark_from_build.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2020-2021, NVIDIA CORPORATION.
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -42,14 +42,12 @@ else
CUDF_JARS=$(echo "$LOCAL_JAR_PATH"/cudf-*.jar)
PLUGIN_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark_*.jar)
TEST_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-integration-tests*-$SPARK_SHIM_VER*.jar)
UDF_EXAMPLE_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-udf-examples*.jar)
else
CUDF_JARS=$(echo "$SCRIPTPATH"/target/dependency/cudf-*.jar)
PLUGIN_JARS=$(echo "$SCRIPTPATH"/../dist/target/rapids-4-spark_*.jar)
TEST_JARS=$(echo "$SCRIPTPATH"/target/rapids-4-spark-integration-tests*-$SPARK_SHIM_VER*.jar)
UDF_EXAMPLE_JARS=$(echo "$SCRIPTPATH"/../udf-examples/target/rapids-4-spark-udf-examples*.jar)
fi
ALL_JARS="$CUDF_JARS $PLUGIN_JARS $TEST_JARS $UDF_EXAMPLE_JARS"
ALL_JARS="$CUDF_JARS $PLUGIN_JARS $TEST_JARS"
echo "AND PLUGIN JARS: $ALL_JARS"
if [[ "${TEST}" != "" ]];
then
Expand Down
9 changes: 1 addition & 8 deletions integration_tests/src/main/python/conftest.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -321,10 +321,3 @@ def enable_cudf_udf(request):
if not enable_udf_cudf:
# cudf_udf tests are not required for any test runs
pytest.skip("cudf_udf not configured to run")

@pytest.fixture(scope="session")
def enable_rapids_udf_example_native(request):
native_enabled = request.config.getoption("rapids_udf_example_native")
if not native_enabled:
# udf_example_native tests are not required for any test runs
pytest.skip("rapids_udf_example_native is not configured to run")
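The removed `enable_rapids_udf_example_native` fixture followed the same gating pattern as `enable_cudf_udf`, which remains: a `store_true` option that defaults to `False`, and a check that skips the tests when the flag is unset. The following is a minimal stdlib-only sketch of that pattern, using `argparse` as a stand-in for pytest's option parser (pytest's `parser.addoption` accepts argparse-style arguments). The names `build_parser` and `skip_reason` are hypothetical and do not appear in the repository.

```python
import argparse


def build_parser():
    # Stand-in for pytest_addoption: after this PR the
    # --rapids_udf_example_native flag is gone and --cudf_udf remains.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cudf_udf", action="store_true", default=False,
        help="if true enable cudf_udf test")
    parser.add_argument(
        "--test_type", action="store", default="developer",
        help="developer, pre-commit, or nightly")
    return parser


def skip_reason(options):
    # Mirrors the fixture logic: return a skip message when the flag is
    # unset, or None when the gated tests should run.
    if not options.cudf_udf:
        return "cudf_udf not configured to run"
    return None


if __name__ == "__main__":
    print(skip_reason(build_parser().parse_args([])))
    print(skip_reason(build_parser().parse_args(["--cudf_udf"])))
```

In the real `conftest.py` the message is passed to `pytest.skip` inside a session-scoped fixture rather than returned.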
3 changes: 1 addition & 2 deletions integration_tests/src/main/python/marks.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2021, NVIDIA CORPORATION.
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -23,7 +23,6 @@
limit = pytest.mark.limit
qarun = pytest.mark.qarun
cudf_udf = pytest.mark.cudf_udf
rapids_udf_example_native = pytest.mark.rapids_udf_example_native
shuffle_test = pytest.mark.shuffle_test
nightly_gpu_mem_consuming_case = pytest.mark.nightly_gpu_mem_consuming_case
nightly_host_mem_consuming_case = pytest.mark.nightly_host_mem_consuming_case
133 changes: 0 additions & 133 deletions integration_tests/src/main/python/rapids_udf_test.py

This file was deleted.
