Benchmarking guide (NVIDIA#820)
* Benchmarking guide

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* add notes on spark shell configuration

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* formatting

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* add link from README to benchmark guide

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* minor docs update

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
andygrove and sameerz authored Oct 9, 2020
1 parent e7ba59f commit 0d129ed
Showing 2 changed files with 134 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -15,6 +15,8 @@ GPU memory.

To get started and try the plugin out use the [getting started guide](./docs/get-started/getting-started.md).

For more information about the benchmarks included with the plugin, see the [benchmark guide](./docs/benchmarks.md).

## Compatibility

The SQL plugin tries to produce results that are bit for bit identical with Apache Spark.
132 changes: 132 additions & 0 deletions docs/benchmarks.md
@@ -0,0 +1,132 @@
# Benchmarks

The `integration_tests` module contains benchmarks derived from the
[TPC-DS](http://www.tpc.org/tpcds/), [TPC-H](http://www.tpc.org/tpch/), and
[TPCx-BB](http://www.tpc.org/tpcx-bb/default5.asp) benchmarks. These are not official TPC
benchmarks and are only intended to be used to compare relative performance between CPU and GPU
and to help catch performance regressions in the plugin.

## Data Generation

For each of these benchmarks, source data must be generated by a utility that can generate data at
any scale factor. The scale factor is an integer that approximates the size of the generated data
set in gigabytes: for example, scale factor 1 produces roughly 1 GB of data and scale factor 1000
produces roughly 1 TB.

Further information on data generation can be found using the following links:

- [TPC-DS Data Generator](https://github.com/databricks/tpcds-kit)
- [TPC-H Data Generator](https://github.com/electrum/tpch-dbgen)
- [TPCx-BB Data Generator](http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp)
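
For example, the TPC-DS kit's `dsdgen` tool takes a scale factor and an output directory and can
split generation across parallel child processes. The flags below are only a hedged sketch and may
differ between versions of the kit, so consult the generator's own documentation:

```bash
# Hypothetical sketch: generate TPC-DS data at scale factor 1000 (~1 TB) using
# 8 parallel child processes. Flag names may differ between tpcds-kit versions.
for i in $(seq 1 8); do
  ./dsdgen -SCALE 1000 -DIR /path/to/csv/output -PARALLEL 8 -CHILD "$i" -FORCE &
done
wait
```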

The remainder of this document is based on the TPC-DS benchmark, but the steps are very similar for
the other benchmarks. The main difference is that the package and class names differ for each
benchmark.

| Benchmark | Package | Class Names |
|-----------|--------------------------------------|----------------------------------|
| TPC-DS | com.nvidia.spark.rapids.tests.tpcds | TpcdsLikeSpark, TpcdsLikeBench |
| TPCx-BB   | com.nvidia.spark.rapids.tests.tpcxbb | TpcxbbLikeSpark, TpcxbbLikeBench |
| TPC-H | com.nvidia.spark.rapids.tests.tpch | TpchLikeSpark, TpchLikeBench |
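
For example, assuming the TPC-H classes expose the same method names as the TPC-DS classes used in
the rest of this guide (an assumption based on the table above, not something this guide verifies),
a TPC-H run would look like:

```scala
// Hypothetical sketch: TPC-H equivalent of the TPC-DS commands shown later in
// this guide, assuming TpchLikeSpark and TpchLikeBench mirror the TPC-DS API.
import com.nvidia.spark.rapids.tests.tpch._
TpchLikeSpark.setupAllParquet(spark, "/path/to/tpch")
TpchLikeBench.collect(spark, "q5", iterations=3)
```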

## Spark Shell

The integration test jar needs to be added to the `--jars` configuration option when launching the
Spark shell. This jar can be found in the `integration_tests/target` directory after running
`mvn package`, with a filename matching `rapids-4-spark-integration-tests_2.12-*-SNAPSHOT.jar`.

To run benchmarks on the GPU, the RAPIDS Accelerator for Apache Spark must also be installed,
following the instructions provided in the [Getting Started](get-started/getting-started.md) guide.
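
For example, a GPU benchmark run needs the integration tests jar on the classpath along with the
plugin and cuDF jars, with the plugin enabled via `spark.plugins`. This is only a hedged sketch;
the jar names, versions, and paths are placeholders to be replaced with the jars from your own
build and the jars described in the getting started guide:

```bash
# Hedged example: launch spark-shell with the benchmark code on the classpath.
# Jar names, versions, and paths are placeholders.
$SPARK_HOME/bin/spark-shell \
  --jars /path/to/rapids-4-spark-integration-tests_2.12-<version>-SNAPSHOT.jar,\
/path/to/rapids-4-spark_2.12-<version>.jar,\
/path/to/cudf-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
```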

## Converting to Parquet

Although it is possible to run benchmarks directly against the CSV data generated by the TPC data
generators, it is common to convert the data to Parquet format and run benchmarks against the
Parquet files instead.

The `integration_tests` module contains code for converting the CSV data sets to Parquet.

The following commands can be entered into spark-shell to perform the conversion.

```scala
import com.nvidia.spark.rapids.tests.tpcds._
TpcdsLikeSpark.csvToParquet(spark, "/path/to/input", "/path/to/output")
```

Note that the code for converting CSV to Parquet does not explicitly specify the number of
partitions to write, so the size of the resulting Parquet files will vary depending on the value
of `spark.default.parallelism`, which by default is based on the number of available executor
cores. Setting this value explicitly gives better control over the size of the output files.
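
As a quick sanity check before converting, the parallelism that will be used can be inspected from
spark-shell. It can be overridden by passing `--conf spark.default.parallelism=<n>` when launching
the shell; the right value depends on the scale factor and cluster, so treat any specific number as
an illustration only.

```scala
// Print the parallelism that the Parquet writes will default to. Override it
// with --conf spark.default.parallelism=<n> when launching spark-shell if the
// resulting file sizes are too small or too large.
println(s"default parallelism: ${spark.sparkContext.defaultParallelism}")
```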

Note also that no decimal types will be output. The conversion code uses explicit schemas to ensure
that decimal types are converted to floating-point types instead.

## Running Benchmarks

The benchmarks can currently be executed in two modes:

- Execute the query and collect the results to the driver
- Execute the query and write the results to disk (in Parquet or CSV format)

The following commands can be entered into spark-shell to register the data files that the
benchmark will query.

```scala
import com.nvidia.spark.rapids.tests.tpcds._
TpcdsLikeSpark.setupAllParquet(spark, "/path/to/tpcds")
```

The following syntax executes the query and collects the results to the driver.

```scala
TpcdsLikeBench.collect(spark, "q5", iterations=3)
```

The following syntax executes the query and writes the results to Parquet. There is also a
`writeCsv` method for writing the output to CSV files.

```scala
TpcdsLikeBench.writeParquet(spark, "q5", "/data/output/tpcds/q5", iterations=3)
```
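
The `writeCsv` call is assumed here to take the same arguments as `writeParquet`; a hypothetical
invocation would be:

```scala
TpcdsLikeBench.writeCsv(spark, "q5", "/data/output/tpcds/q5-csv", iterations=3)
```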

## Benchmark JSON Output

Each benchmark run produces a JSON file containing information about the environment and the query,
including the following items:

- Spark version
- Spark configuration
- Environment variables
- Logical and physical query plan
- SQL metrics for the executed plan
- Timing information for each query iteration

Care should be taken to ensure that no sensitive information is captured from the environment
before sharing these JSON files. Environment variables with names containing the words `PASSWORD`,
`TOKEN`, or `SECRET` are filtered out, but this may not be sufficient to prevent leaking secrets.
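
One simple extra precaution, which is not part of the benchmark utilities, is to scan a report for
other suspicious terms before sharing it. The file name below is hypothetical:

```scala
// Hypothetical check: look for terms that often indicate credentials in a
// benchmark JSON report before sharing it outside your organization.
import scala.io.Source
val source = Source.fromFile("/data/output/tpcds/q5-report.json")
val report = try source.mkString.toUpperCase finally source.close()
val terms = Seq("PASSWORD", "TOKEN", "SECRET", "KEY", "CREDENTIAL")
terms.filter(t => report.contains(t)).foreach(t => println(s"report mentions: $t"))
```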

## Verifying Results

It is important to verify that queries actually produced the correct output, especially when
comparing between CPU and GPU benchmarks. A utility is provided to help with this.

This is a simple utility that pulls results down to the driver for comparison, so it will only work
for data sets that fit in the driver's memory.

If data needs sorting before comparison, this is delegated to Spark before collecting the results.

Example usage from spark-shell:

```scala
val cpu = spark.read.parquet("/data/tpcxbb/q5-cpu")
val gpu = spark.read.parquet("/data/tpcxbb/q5-gpu")
import com.nvidia.spark.rapids.tests.common._
BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0001)
```

This will report any differences between the two DataFrames.

## Performance Tuning

Please refer to the [Tuning Guide](tuning-guide.md) for information on performance tuning.
