forked from NVIDIA/spark-rapids
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Benchmarking guide Signed-off-by: Andy Grove <andygrove@nvidia.com> * add notes on spark shell configuration Signed-off-by: Andy Grove <andygrove@nvidia.com> * formatting Signed-off-by: Andy Grove <andygrove@nvidia.com> * add link from README to benchmark guide * Update docs/benchmarks.md Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com> * Update docs/benchmarks.md Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com> * Update docs/benchmarks.md Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com> * Update docs/benchmarks.md Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com> * Update docs/benchmarks.md Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com> * minor docs update Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
- Loading branch information
Showing
2 changed files
with
134 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
# Benchmarks | ||
|
||
The `integration_test` module contains benchmarks derived from the | ||
[TPC-DS](http://www.tpc.org/tpcds/), [TPC-H](http://www.tpc.org/tpch/), and | ||
[TPCx-BB](http://www.tpc.org/tpcx-bb/default5.asp) benchmarks. These are not official TPC | ||
benchmarks and are only intended to be used to compare relative performance between CPU and GPU | ||
and to help catch performance regressions in the plugin. | ||
|
||
## Data Generation | ||
|
||
For each of these benchmarks, source data must be generated by a utility that can generate data at | ||
any scale factor, where the scale factor is an integer representing approximately how many | ||
gigabytes of data will be generated, with scale factor 1 meaning ~1 GB, and scale factor 1000 meaning ~1 TB, | ||
for example. | ||
|
||
Further information on data generation can be found using the following links: | ||
|
||
- [TPC-DS Data Generator](https://github.com/databricks/tpcds-kit) | ||
- [TPC-H Data Generator](https://github.com/electrum/tpch-dbgen) | ||
- [TPCx-BB Data Generator](http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) | ||
|
||
The remainder of this document is based on the TPC-DS benchmark but the steps are very similar for | ||
the other benchmarks. The main difference is that the package and class name is different for each | ||
benchmark. | ||
|
||
| Benchmark | Package | Class Names | | ||
|-----------|--------------------------------------|----------------------------------| | ||
| TPC-DS | com.nvidia.spark.rapids.tests.tpcds | TpcdsLikeSpark, TpcdsLikeBench | | ||
| TPC-xBB | com.nvidia.spark.rapids.tests.tpcxbb | TpcxbbLikeSpark, TpcxbbLikeBench | | ||
| TPC-H | com.nvidia.spark.rapids.tests.tpch | TpchLikeSpark, TpchLikeBench | | ||
|
||
## Spark Shell | ||
|
||
The integration test jar needs to be added to the `--jars` configuration option when launching the | ||
Spark shell. This jar can be found in the `integration_tests/target` directory after running | ||
`mvn package`, with a filename matching `rapids-4-spark-integration-tests_2.12-*-SNAPSHOT.jar`. | ||
|
||
To run benchmarks on the GPU, the RAPIDS Accelerator for Apache Spark must also be installed, | ||
following the instructions provided in the [Getting Started](get-started/getting-started.md) guide. | ||
|
||
## Converting to Parquet | ||
|
||
Although it is possible to run benchmarks directly against the CSV data generated by the TPC data | ||
generators, it is common to convert the data to Parquet format and run benchmarks against the | ||
Parquet files instead. | ||
|
||
The `integration_test` module contains code for converting the CSV data sets to Parquet. | ||
|
||
The following commands can be entered into spark-shell to perform the conversion. | ||
|
||
```scala | ||
import com.nvidia.spark.rapids.tests.tpcds._ | ||
TpcdsLikeSpark.csvToParquet(spark, "/path/to/input", "/path/to/output") | ||
``` | ||
|
||
Note that the code for converting CSV to Parquet does not explicitly specify the number of | ||
partitions to write, so the size of the resulting parquet files will vary depending on the value | ||
for `spark.default.parallelism`, which by default is based on the number of available executor | ||
cores. This value can be set explicitly to better control the size of the output files. | ||
|
||
It should also be noted that no decimal types will be output. The conversion code uses explicit | ||
schemas to ensure that decimal types are converted to floating-point types instead. | ||
|
||
## Running Benchmarks | ||
|
||
The benchmarks can be executed in two modes currently: | ||
|
||
- Execute the query and collect the results to the driver | ||
- Execute the query and write the results to disk (in Parquet or CSV format) | ||
|
||
The following commands can be entered into spark-shell to register the data files that the | ||
benchmark will query. | ||
|
||
```scala | ||
import com.nvidia.spark.rapids.tests.tpcds._ | ||
TpcdsLikeSpark.setupAllParquet(spark, "/path/to/tpcds") | ||
``` | ||
|
||
The benchmark can be executed with the following syntax to execute the query and collect the | ||
results to the driver. | ||
|
||
```scala | ||
TpcdsLikeBench.collect(spark, "q5", iterations=3) | ||
``` | ||
|
||
The benchmark can be executed with the following syntax to execute the query and write the results | ||
to Parquet. There is also a `writeCsv` method for writing the output to CSV files. | ||
|
||
```scala | ||
TpcdsLikeBench.writeParquet(spark, "q5", "/data/output/tpcds/q5", iterations=3) | ||
``` | ||
|
||
## Benchmark JSON Output | ||
|
||
Each benchmark run produces a JSON file containing information about the environment and the query, | ||
including the following items: | ||
|
||
- Spark version | ||
- Spark configuration | ||
- Environment variables | ||
- Logical and physical query plan | ||
- SQL metrics for the executed plan | ||
- Timing information for each query iteration | ||
|
||
Care should be taken to ensure that no sensitive information is captured from the environment | ||
before sharing these JSON files. Environment variables with names containing the words `PASSWORD`, | ||
`TOKEN`, or `SECRET` are filtered out, but this may not be sufficient to prevent leaking secrets. | ||
|
||
## Verifying Results | ||
|
||
It is important to verify that queries actually produced the correct output, especially when | ||
comparing between CPU and GPU benchmarks. A utility is provided to help with this. | ||
|
||
This is a simple utility that pulls results down to the driver for comparison so will only work for | ||
data sets that can fit in the driver's memory. | ||
|
||
If data needs sorting before comparison, this is delegated to Spark before collecting the results. | ||
|
||
Example usage from spark-shell: | ||
|
||
```scala | ||
val cpu = spark.read.parquet("/data/tpcxbb/q5-cpu") | ||
val gpu = spark.read.parquet("/data/tpcxbb/q5-gpu") | ||
import com.nvidia.spark.rapids.tests.common._ | ||
BenchUtils.compareResults(cpu, gpu, ignoreOrdering=true, epsilon=0.0001) | ||
``` | ||
|
||
This will report on any differences between the two dataframes. | ||
|
||
## Performance Tuning | ||
|
||
Please refer to the [Tuning Guide](tuning-guide.md) for information on performance tuning. |