Merge pull request #1835 from jlowe/fix-merge
Fix merge conflict with branch-0.4
jlowe authored Mar 1, 2021
2 parents bb03535 + c40ec37 commit 50fd165
Showing 32 changed files with 9 additions and 10,818 deletions.
10 changes: 0 additions & 10 deletions README.md
@@ -5,18 +5,8 @@ The RAPIDS Accelerator for Apache Spark provides a set of plugins for
[Apache Spark](https://spark.apache.org) that leverage GPUs to accelerate processing
via the [RAPIDS](https://rapids.ai) libraries and [UCX](https://www.openucx.org/).

![TPCxBB Like query results](./docs/img/tpcxbb-like-results.png "TPCxBB Like Query Results")

The chart above shows results from running ETL queries based on the
[TPCxBB benchmark](http://www.tpc.org/tpcx-bb/default.asp). These are **not** official results in
any way. The queries ran against a 10 TB dataset (scale factor 10,000) stored in Parquet, on a
two-node DGX-2 cluster. Each node has 96 CPU cores, 1.5 TB of host memory, 16 V100 GPUs, and 512 GB
of GPU memory.

To get started and try the plugin out use the [getting started guide](./docs/get-started/getting-started.md).

For more information about these benchmarks, see the [benchmark guide](./docs/benchmarks.md).

## Compatibility

The SQL plugin tries to produce results that are bit-for-bit identical with Apache Spark.
212 changes: 0 additions & 212 deletions docs/benchmarks.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/get-started/Dockerfile.cuda
@@ -35,7 +35,6 @@ RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/jars && \
    mkdir -p /opt/tpch && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    mkdir -p /opt/sparkRapidsPlugin && \
20 changes: 0 additions & 20 deletions integration_tests/README.md
@@ -171,26 +171,6 @@ any GPU resources on the cluster. For standalone, Mesos, and Kubernetes you can control the number
of executors you want to use per application. The extra core is for the driver. Dynamic allocation
can interfere with these settings under YARN, and even though it is off by default you probably want
to be sure it is disabled (`spark.dynamicAllocation.enabled=false`).
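
For illustration only, a minimal PySpark sketch of pinning these settings might look like the
following; the application name and counts are placeholders, not values used by this framework:

```python
from pyspark.sql import SparkSession

# Sketch: fix the executor count and disable dynamic allocation so the
# one-GPU-per-executor assumption above holds for the whole test run.
spark = (
    SparkSession.builder
    .appName("rapids-integration-tests")                 # placeholder name
    .config("spark.dynamicAllocation.enabled", "false")  # keep YARN from resizing the app
    .config("spark.executor.instances", "4")             # fixed executor count on YARN/Kubernetes
    .config("spark.cores.max", "4")                      # standalone/Mesos: cap total executor cores
    .getOrCreate()
)
```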

### Enabling TPCxBB/TPCH/TPCDS/Mortgage Tests

The TPCxBB, TPCH, TPCDS, and Mortgage tests in this framework can be enabled by providing a couple of options:

* TPCxBB `tpcxbb_format` (optional, defaults to "parquet"), and `tpcxbb_path` (required, path to the TPCxBB data).
* TPCH `tpch_format` (optional, defaults to "parquet"), and `tpch_path` (required, path to the TPCH data).
* TPCDS `tpcds_format` (optional, defaults to "parquet"), and `tpcds_path` (required, path to the TPCDS data).
* Mortgage `mortgage_format` (optional, defaults to "parquet"), and `mortgage_path` (required, path to the Mortgage data).

As an example, here is the `spark-submit` command with the TPCxBB parameters on CUDA 10.1:

```shell
$SPARK_HOME/bin/spark-submit --jars "rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,cudf-0.19-SNAPSHOT-cuda10-1.jar,rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" ./runtests.py --tpcxbb_format="csv" --tpcxbb_path="/path/to/tpcxbb/csv"
```

Be aware that running these tests with real data requires at least an entire GPU, and preferably several GPUs/executors
in your cluster, so please be careful when enabling these tests. Also, some of these tests produce non-deterministic
results when run in a real cluster. If you see failures when running these tests, please contact us so we can investigate
them and possibly tag the tests appropriately for runs on an actual cluster.

### Enabling cudf_udf Tests

The cudf_udf tests in this framework exercise Pandas UDFs (user-defined functions) with cuDF. They are disabled by default, not only because of the complicated environment setup, but also because GPU resource scheduling for Pandas UDFs is still an experimental feature and performance may not always be better.
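
For context, here is a minimal sketch of the kind of Pandas UDF these tests exercise; the function
itself is illustrative, not one of the actual test UDFs:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Illustrative Pandas UDF: it receives whole pandas Series batches, which is
# the columnar exchange that cuDF can accelerate on the GPU.
@pandas_udf("long")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

# Usage sketch: df.select(plus_one(df["value"]))
```
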
19 changes: 0 additions & 19 deletions integration_tests/conftest.py
@@ -14,24 +14,6 @@

def pytest_addoption(parser):
    """Pytest hook to define command line options for pytest"""
    parser.addoption(
        "--tpcxbb_format", action="store", default="parquet", help="format of TPCxBB data"
    )
    parser.addoption(
        "--tpcxbb_path", action="store", default=None, help="path to TPCxBB data"
    )
    parser.addoption(
        "--tpcds_format", action="store", default="parquet", help="format of TPC-DS data"
    )
    parser.addoption(
        "--tpcds_path", action="store", default=None, help="path to TPC-DS data"
    )
    parser.addoption(
        "--tpch_format", action="store", default="parquet", help="format of TPCH data"
    )
    parser.addoption(
        "--tpch_path", action="store", default=None, help="path to TPCH data"
    )
    parser.addoption(
        "--mortgage_format", action="store", default="parquet", help="format of Mortgage data"
    )
@@ -61,4 +43,3 @@ def pytest_addoption(parser):
"--test_type", action='store', default="developer",
help="the type of tests that are being run to help check all the correct tests are run - developer, pre-commit, or nightly"
)

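As a usage sketch, a test could consume one of the options registered above like this; the fixture
name and skip message are hypothetical, not part of this repository:

```python
import pytest

# Hypothetical fixture: reads the mortgage option registered in
# pytest_addoption and skips the test when no path was provided.
@pytest.fixture
def mortgage_path(request):
    path = request.config.getoption("--mortgage_path")
    if path is None:
        pytest.skip("Mortgage tests are disabled; pass --mortgage_path to enable them")
    return path
```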
