Benchmarking guide (NVIDIA#820)
* Benchmarking guide

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* add notes on spark shell configuration

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* formatting

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* add link from README to benchmark guide

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* Update docs/benchmarks.md

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>

* minor docs update

Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
andygrove and sameerz authored Oct 9, 2020
1 parent e7ba59f commit 0d129ed
Showing 2 changed files with 134 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -15,6 +15,8 @@ GPU memory.

To get started and try the plugin out use the [getting started guide](./docs/get-started/getting-started.md).

For more information about the benchmarks included with the plugin, see the [benchmark guide](./docs/benchmarks.md).

## Compatibility

The SQL plugin tries to produce results that are bit for bit identical with Apache Spark.
132 changes: 132 additions & 0 deletions docs/benchmarks.md
@@ -0,0 +1,132 @@
# Benchmarks

The `integration_tests` module contains benchmarks derived from the
[TPC-DS](http://www.tpc.org/tpcds/), [TPC-H](http://www.tpc.org/tpch/), and
[TPCx-BB](http://www.tpc.org/tpcx-bb/default5.asp) benchmarks. These are not official TPC
benchmarks and are only intended to be used to compare relative performance between CPU and GPU
and to help catch performance regressions in the plugin.

## Data Generation

For each of these benchmarks, source data must be generated by a utility that can generate data at
any scale factor. The scale factor is an integer that approximates the size of the generated data
set in gigabytes: for example, scale factor 1 produces roughly 1 GB of data and scale factor 1000
produces roughly 1 TB.

Further information on data generation can be found using the following links:

- [TPC-DS Data Generator](https://github.com/databricks/tpcds-kit)
- [TPC-H Data Generator](https://github.com/electrum/tpch-dbgen)
- [TPCx-BB Data Generator](http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp)
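
For example, the TPC-DS kit's `dsdgen` tool takes a scale factor and an output directory and can
split generation across parallel child processes. The flags below are only a hedged sketch and may
differ between versions of the kit, so consult the generator's own documentation:

```bash
# Hypothetical sketch: generate TPC-DS data at scale factor 1000 (~1 TB) using
# 8 parallel child processes. Flag names may differ between tpcds-kit versions.
for i in $(seq 1 8); do
  ./dsdgen -SCALE 1000 -DIR /path/to/csv/output -PARALLEL 8 -CHILD "$i" -FORCE &
done
wait
```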

The remainder of this document is based on the TPC-DS benchmark, but the steps are very similar for
the other benchmarks. The main difference is that the package and class names differ for each
benchmark.

| Benchmark | Package | Class Names |
|-----------|--------------------------------------|----------------------------------|
| TPC-DS | com.nvidia.spark.rapids.tests.tpcds | TpcdsLikeSpark, TpcdsLikeBench |
| TPCx-BB   | com.nvidia.spark.rapids.tests.tpcxbb | TpcxbbLikeSpark, TpcxbbLikeBench |
| TPC-H | com.nvidia.spark.rapids.tests.tpch | TpchLikeSpark, TpchLikeBench |
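
For example, assuming the TPC-H classes expose the same method names as the TPC-DS classes used in
the rest of this guide (an assumption based on the table above, not something this guide verifies),
a TPC-H run would look like:

```scala
// Hypothetical sketch: TPC-H equivalent of the TPC-DS commands shown later in
// this guide, assuming TpchLikeSpark and TpchLikeBench mirror the TPC-DS API.
import com.nvidia.spark.rapids.tests.tpch._
TpchLikeSpark.setupAllParquet(spark, "/path/to/tpch")
TpchLikeBench.collect(spark, "q5", iterations=3)
```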

## Spark Shell

The integration test jar needs to be added to the `--jars` configuration option when launching the
Spark shell. This jar can be found in the `integration_tests/target` directory after running
`mvn package`, with a filename matching `rapids-4-spark-integration-tests_2.12-*-SNAPSHOT.jar`.

To run benchmarks on the GPU, the RAPIDS Accelerator for Apache Spark must also be installed,
following the instructions provided in the [Getting Started](get-started/getting-started.md) guide.
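
For example, a GPU benchmark run needs the integration tests jar on the classpath along with the
plugin and cuDF jars, with the plugin enabled via `spark.plugins`. This is only a hedged sketch;
the jar names, versions, and paths are placeholders to be replaced with the jars from your own
build and the jars described in the getting started guide:

```bash
# Hedged example: launch spark-shell with the benchmark code on the classpath.
# Jar names, versions, and paths are placeholders.
$SPARK_HOME/bin/spark-shell \
  --jars /path/to/rapids-4-spark-integration-tests_2.12-<version>-SNAPSHOT.jar,\
/path/to/rapids-4-spark_2.12-<version>.jar,\
/path/to/cudf-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
```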

## Converting to Parquet

Although it is possible to run benchmarks directly against the CSV data generated by the TPC data
generators, it is common to convert the data to Parquet format and run benchmarks against the
Parquet files instead.

The `integration_tests` module contains code for converting the CSV data sets to Parquet.

The following commands can be entered into spark-shell to perform the conversion.

```scala
import com.nvidia.spark.rapids.tests.tpcds._
TpcdsLikeSpark.csvToParquet(spark, "/path/to/input", "/path/to/output")
```

Note that the code for converting CSV to Parquet does not explicitly specify the number of
partitions to write, so the size of the resulting Parquet files will vary depending on the value
of `spark.default.parallelism`, which by default is based on the number of available executor
cores. Setting this value explicitly gives better control over the size of the output files.
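
As a quick sanity check before converting, the parallelism that will be used can be inspected from
spark-shell. It can be overridden by passing `--conf spark.default.parallelism=<n>` when launching
the shell; the right value depends on the scale factor and cluster, so treat any specific number as
an illustration only.

```scala
// Print the parallelism that the Parquet writes will default to. Override it
// with --conf spark.default.parallelism=<n> when launching spark-shell if the
// resulting file sizes are too small or too large.
println(s"default parallelism: ${spark.sparkContext.defaultParallelism}")
```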

Note also that no decimal types will be output. The conversion code uses explicit schemas to ensure
that decimal types are converted to floating-point types instead.

## Running Benchmarks

The benchmarks can currently be executed in two modes:

- Execute the query and collect the results to the driver
- Execute the query and write the results to disk (in Parquet or CSV format)

The following commands can be entered into spark-shell to register the data files that the
benchmark will query.

```scala
import com.nvidia.spark.rapids.tests.tpcds._
TpcdsLikeSpark.setupAllParquet(spark, "/path/to/tpcds")
```

The following syntax executes the query and collects the results to the driver.

```scala
TpcdsLikeBench.collect(spark, "q5", iterations=3)
```

The following syntax executes the query and writes the results to Parquet. There is also a
`writeCsv` method for writing the output to CSV files.

```scala
TpcdsLikeBench.writeParquet(spark, "q5", "/data/output/tpcds/q5", iterations=3)
```
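
The `writeCsv` call is assumed here to take the same arguments as `writeParquet`; a hypothetical
invocation would be:

```scala
TpcdsLikeBench.writeCsv(spark, "q5", "/data/output/tpcds/q5-csv", iterations=3)
```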

## Benchmark JSON Output

Each benchmark run produces a JSON file containing information about the environment and the query,
including the following items:

- Spark version
- Spark configuration
- Environment variables
- Logical and physical query plan
- SQL metrics for the executed plan
- Timing information for each query iteration

Care should be taken to ensure that no sensitive information is captured from the environment
before sharing these JSON files. Environment variables with names containing the words `PASSWORD`,
`TOKEN`, or `SECRET` are filtered out, but this may not be sufficient to prevent leaking secrets.
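
One simple extra precaution, which is not part of the benchmark utilities, is to scan a report for
other suspicious terms before sharing it. The file name below is hypothetical:

```scala
// Hypothetical check: look for terms that often indicate credentials in a
// benchmark JSON report before sharing it outside your organization.
import scala.io.Source
val source = Source.fromFile("/data/output/tpcds/q5-report.json")
val report = try source.mkString.toUpperCase finally source.close()
val terms = Seq("PASSWORD", "TOKEN", "SECRET", "KEY", "CREDENTIAL")
terms.filter(t => report.contains(t)).foreach(t => println(s"report mentions: $t"))
```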

## Verifying Results

It is important to verify that queries actually produced the correct output, especially when
comparing between CPU and GPU benchmarks. A utility is provided to help with this.

This is a simple utility that pulls results down to the driver for comparison, so it will only work
for data sets that fit in the driver's memory.

If data needs sorting before comparison, this is delegated to Spark before collecting the results.

Example usage from spark-shell:

```scala
val cpu = spark.read.parquet("/data/tpcxbb/q5-cpu")
val gpu = spark.read.parquet("/data/tpcxbb/q5-gpu")
import com.nvidia.spark.rapids.tests.common._
BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0001)
```

This will report any differences between the two DataFrames.

## Performance Tuning

Please refer to the [Tuning Guide](tuning-guide.md) for information on performance tuning.
