
Commit

Merge pull request NVIDIA#1346 from NVIDIA/branch-0.3
[auto-merge] branch-0.3 to branch-0.4 [skip ci] [bot]
nvauto authored Dec 9, 2020
2 parents e833b2d + 8b7afd3 commit 38be2eb
Showing 6 changed files with 158 additions and 133 deletions.
4 changes: 3 additions & 1 deletion CONTRIBUTING.md
@@ -28,7 +28,7 @@ Contributions to RAPIDS Accelerator for Apache Spark fall into the following thr
or [help wanted](https://github.com/NVIDIA/spark-rapids/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22)
labels.
3. Comment on the issue stating that you are going to work on it.
4. Code! Make sure to update unit tests and integration tests if needed! [refer to test section](#testing-your-code)
5. When done, [create your pull request](https://github.com/NVIDIA/spark-rapids/compare).
6. Verify that CI passes all [status checks](https://help.github.com/articles/about-status-checks/).
Fix if needed.
@@ -119,6 +119,8 @@ By making a contribution to this project, I certify that:
this project or the open source license(s) involved.
```

### Testing Your Code
Please visit the [testing doc](tests/README.md) for details about how to run tests.

## Attribution
Portions adopted from https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md, https://github.com/NVIDIA/nvidia-docker/blob/main/CONTRIBUTING.md, and https://github.com/NVIDIA/DALI/blob/main/CONTRIBUTING.md
4 changes: 3 additions & 1 deletion README.md
@@ -55,7 +55,9 @@ the "verify" phase of maven. We recommend when building at least running to the
mvn verify
```

## Testing

Tests are described [here](tests/README.md).

## Integration
The RAPIDS Accelerator For Apache Spark does provide some APIs for doing zero copy data
119 changes: 0 additions & 119 deletions docs/testing.md

This file was deleted.

89 changes: 77 additions & 12 deletions integration_tests/README.md
@@ -5,7 +5,7 @@ are intended to be able to be run against any Spark-compatible cluster/release t
verify that the plugin is doing the right thing in as many cases as possible.

There are two sets of tests here. The pyspark tests are described here. The scala tests
are described [here](../tests/README.md).

## Dependencies

@@ -43,24 +43,89 @@ to test improved transfer performance to pandas based user defined functions.

## Running

Tests will run as a part of the maven build if you have the environment variable `SPARK_HOME` set.
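
For example (a sketch; the Spark install path below is just a placeholder):

```shell
# Point the build at a local Spark install (the path is only an example) so
# the tests run as part of the maven build.
export SPARK_HOME=/path/to/spark
mvn verify
```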

The suggested way to run these tests is to use the shell script
[run_pyspark_from_build.sh](run_pyspark_from_build.sh) located in the `integration_tests` folder.
This script takes care of some of the flags that must be set for the plugin to work. It is worth
reading the contents of [run_pyspark_from_build.sh](run_pyspark_from_build.sh) to get better
insight into what is needed, as we are constantly working to improve and expand plugin support.

The python tests run with pytest and the script honors pytest parameters (see the example invocation after this list). Some handy flags are:
- `-k <pytest-file-name>`. This will run all the tests in that test file.
- `-k <test-name>`. This will also run an individual test.
- `-s` Doesn't capture the output and instead prints to the screen.
- `-v` Increase the verbosity of the tests
- `-r fExXs` Show extra test summary info as specified by chars: (f)ailed, (E)rror, (x)failed, (X)passed, (s)kipped
- For other options and more details please visit [pytest-usage](https://docs.pytest.org/en/stable/usage.html) or type `pytest --help`
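
For example, a typical invocation through the script might look like this (the test name is only
an illustration):

```shell
cd integration_tests
# Run only the tests whose names match "cast_test", print their output to the
# screen, and show a summary of failed/errored/skipped tests.
./run_pyspark_from_build.sh -k cast_test -s -r fExXs
```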

By default the tests try to use the python packages `pytest-xdist` and `findspark` to oversubscribe
your GPU and run the tests in Spark local mode. This can speed up these tests significantly as all
of the tests that run by default process relatively small amounts of data. Be careful: if you also
have `SPARK_CONF_DIR` set, the tests will try to use whatever cluster you have configured.
If you do want to run the tests in parallel on an existing cluster it is recommended that you set
`-Dpytest.TEST_PARALLEL` to one less than the number of worker applications that will be
running on the cluster. This is because `pytest-xdist` will launch one control application that
is not included in that number. All it does is farm out work to the other applications, but because
it needs to know about the Spark cluster to determine which tests to run and how, it still shows up
as a Spark application.
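
As a rough sketch, if your cluster can run four worker applications at once you might use
something like the following (the number is only an illustration):

```shell
# Reserve one slot for the pytest-xdist control application:
# 4 applications total on the cluster -> 3 of them do test work
mvn verify -Dpytest.TEST_PARALLEL=3
```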

To run the tests separately from the build, go to the `integration_tests` directory. You can submit
`runtests.py` through `spark-submit`, but if you want to run the tests in parallel with
`pytest-xdist` you will need to submit it as a regular python application and have `findspark`
installed. Be sure to include the necessary jars for the RAPIDS plugin either with
`spark-submit` or with the cluster when it is
[set up](../docs/get-started/getting-started-on-prem.md).
The command line arguments to `runtests.py` are the same as for
[pytest](https://docs.pytest.org/en/latest/usage.html). The only reason we have a separate script
is that `spark-submit` uses python if the file name ends with `.py`.
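
For example, from the `integration_tests` directory (remember to include the plugin jars as
described above):

```shell
# Submit through spark-submit (no pytest-xdist parallelism) ...
$SPARK_HOME/bin/spark-submit ./runtests.py
# ... or run as a regular python process, which requires findspark and allows pytest-xdist.
python ./runtests.py
# Both accept the usual pytest options:
python ./runtests.py -h
```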

If you want to configure the Spark cluster you may also set environment variables for the tests.
The name of the env var should be in the form `"PYSP_TEST_" + conf_key.replace('.', '_')`. Linux
does not allow `.` in the name of an environment variable, so we replace it with an underscore. Since
Spark config keys avoid this character, no other special processing is needed.
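
For example, to set `spark.sql.session.timeZone` for the tests (the config key here is only an
illustration):

```shell
# "PYSP_TEST_" + the Spark conf key with '.' replaced by '_'
export PYSP_TEST_spark_sql_session_timeZone=UTC
```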

We also have a large number of integration tests that currently run as a part of the unit tests
using scala test. Those are in the `src/test/scala` sub-directory and depend on the testing
framework from the `rapids-4-spark-tests_2.12` test jar.

You can run these tests against a cluster similarly to how you can run the pytest tests against an
existing cluster. To do this you need to launch a cluster with the plugin jars on the
classpath. The tests will enable and disable the plugin as they run.

Next you need to copy over some test files to whatever distributed file system you are using.
The test files are everything under `./integration_tests/src/test/resources/`. Be sure to note
where you placed them because you will need to tell the tests where they are.
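
For example, with HDFS something along these lines would work (the destination path is up to you):

```shell
# Copy the test resources to the distributed file system and note the path,
# because the tests need to be told where the files live.
hadoop fs -mkdir -p /tmp/spark-rapids-test-files
hadoop fs -put ./integration_tests/src/test/resources/* /tmp/spark-rapids-test-files/
```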

When running these tests you will need to include the test jar, the integration test jar,
scalatest and scalactic. You can find scalatest and scalactic under `~/.m2/repository`.

It is recommended that you use `spark-shell` and the scalatest shell to run each test
individually, so you don't risk running unit tests along with the integration tests.
http://www.scalatest.org/user_guide/using_the_scalatest_shell

```shell
spark-shell --jars rapids-4-spark-tests_2.12-0.3.0-SNAPSHOT-tests.jar,rapids-4-spark-integration-tests_2.12-0.3.0-SNAPSHOT-tests.jar,scalatest_2.12-3.0.5.jar,scalactic_2.12-3.0.5.jar
```

First you import the `scalatest_shell` and tell the tests where they can find the test files you
just copied over.

```scala
import org.scalatest._
com.nvidia.spark.rapids.TestResourceFinder.setPrefix(PATH_TO_TEST_FILES)
```

Next you can start to run the tests.

```scala
durations.run(new com.nvidia.spark.rapids.JoinsSuite)
...
```

Most clusters probably will not have the RAPIDS plugin installed yet.
If you just want to verify the SQL replacement is working, you will need to add the `rapids-4-spark` and `cudf` jars to your `spark-submit` command.

```
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-0.3.0-SNAPSHOT.jar,cudf-0.17-SNAPSHOT.jar" ./runtests.py
```
74 changes: 74 additions & 0 deletions tests/README.md
@@ -0,0 +1,74 @@
---
layout: page
title: Testing
nav_order: 1
parent: Developer Overview
---
# RAPIDS Accelerator for Apache Spark Testing

We have several stand alone examples that you can run in the [integration tests](../integration_tests).

One set is based on the mortgage dataset you can download
[here](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html)
and is in the `com.nvidia.spark.rapids.tests.mortgage` package.

The other is based on TPCH. You can use the TPCH `dbgen` tool to generate data for these tests, which
are in the `com.nvidia.spark.rapids.tests.tpch` package. `dbgen` has various options to
generate the data. Please refer to the documentation that comes with `dbgen` for how to use it, but
we typically run with the default options and only increase the scale factor depending on the test.
```shell
dbgen -b dists.dss -s 10
```

You can include the test jar `rapids-4-spark-integration-tests_2.12-0.3.0-SNAPSHOT.jar` with the
Spark `--jars` option to get the TPCH tests. To set up for the queries you can run
`TpchLikeSpark.setupAllCSV` for CSV formatted data or `TpchLikeSpark.setupAllParquet`
for Parquet formatted data. Both of those take the Spark session and a path to the dbgen
generated data. After that each query has its own object.

So you can make a call like:
```scala
import com.nvidia.spark.rapids.tests.tpch._
val pathTodbgenoutput = "/path/to/dbgen/output" // replace with the path to your dbgen output
TpchLikeSpark.setupAllCSV(spark, pathTodbgenoutput)
Q1Like(spark).count()
```

They generally follow TPCH but are not guaranteed to be the same.
`Q1Like(spark)` will return a DataFrame that can be executed to run the corresponding query.

## Unit Tests

Unit tests exist in this `tests` directory. This is unconventional and is done so we can run the
tests on the final shaded version of the plugin. It also helps with how we collect code coverage.

Use Maven to run the unit tests via `mvn test`.

To run targeted Scala tests append `-DwildcardSuites=<comma separated list of wildcard suite
names to execute>` to the above command.

For more information about using scalatest with Maven please refer to the
[scalatest documentation](https://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin).

#### Running Unit Tests Against Specific Apache Spark Versions
You can run the unit tests against different versions of Spark using different profiles. The
default version runs against Spark 3.0.0; to run against other versions use one of the following
profiles:
- `-Pspark301tests` (spark 3.0.1)
- `-Pspark302tests` (spark 3.0.2)
- `-Pspark310tests` (spark 3.1.0)

Please refer to the [tests project POM](pom.xml) to see the list of test profiles supported.
Apache Spark specific configurations can be passed in by setting the `SPARK_CONF` environment
variable.

Examples:
- To run tests against Apache Spark 3.1.0,
`mvn -P spark310tests test`
- To pass Apache Spark configs `--conf spark.dynamicAllocation.enabled=false --conf spark.task.cpus=1` do something like:
`SPARK_CONF="spark.dynamicAllocation.enabled=false,spark.task.cpus=1" mvn ...`
- To run test ParquetWriterSuite in package com.nvidia.spark.rapids, issue `mvn test -DwildcardSuites="com.nvidia.spark.rapids.ParquetWriterSuite"`

## Integration Tests

Please refer to the integration tests [README](../integration_tests/README.md).
1 change: 1 addition & 0 deletions tests/pom.xml
@@ -138,6 +138,7 @@
<configuration>
<excludes>
<exclude>src/test/resources/**</exclude>
<exclude>**/*.md</exclude>
<exclude>rmm_log.txt</exclude>
</excludes>
</configuration>

