Commit

Merge pull request NVIDIA#1495 from pxLi/fix-merge-conflict
Fix merge conflict, merge branch 0.3 to branch 0.4 [skip ci]
pxLi authored Jan 12, 2021
2 parents 935a1c4 + acca222 commit d870a58
Showing 16 changed files with 168 additions and 93 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -37,6 +37,11 @@ The plugin has a set of Spark configs that control its behavior and are document
We use github issues to track bugs, feature requests, and to try and answer questions. You
may file one [here](https://github.com/NVIDIA/spark-rapids/issues/new/choose).

## Download

The jar files for the most recent release can be retrieved from the [download](docs/download.md)
page.

## Build

There are two types of branches in this repository:
8 changes: 4 additions & 4 deletions docs/FAQ.md
@@ -1,7 +1,7 @@
---
layout: page
title: Frequently Asked Questions
nav_order: 8
nav_order: 9
---
# Frequently Asked Questions

@@ -37,13 +37,13 @@ with most cloud service providers to set up testing and validation on their dist

### What is the right hardware setup to run GPU accelerated Spark?

Reference architectures should be available around Q4 2020.
Reference architectures should be available around Q1 2021.

### What CUDA versions are supported?

CUDA 10.1, 10.2 and 11.0 are currently supported, but you need to download the cudf jar that
corresponds to the version you are using. Please look [here][version/stable-release.md] for download
links for the stable release.
corresponds to the version you are using. Please look [here](download.md) for download
links for the latest release.

### What parts of Apache Spark are accelerated?

5 changes: 5 additions & 0 deletions docs/benchmarks.md
@@ -1,3 +1,8 @@
---
layout: page
title: Benchmarks
nav_exclude: true
---
# Benchmarks

The `integration_test` module contains benchmarks derived from the
2 changes: 1 addition & 1 deletion docs/compatibility.md
@@ -1,7 +1,7 @@
---
layout: page
title: Compatibility
nav_order: 4
nav_order: 5
---

# RAPIDS Accelerator for Apache Spark Compatibility with Apache Spark
4 changes: 2 additions & 2 deletions docs/configs.md
@@ -1,7 +1,7 @@
---
layout: page
title: Configuration
nav_order: 3
nav_order: 4
---
<!-- Generated by RapidsConf.help. DO NOT EDIT! -->
# RAPIDS Accelerator for Apache Spark Configuration
@@ -68,7 +68,7 @@ Name | Description | Default Value
<a name="sql.format.parquet.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type|2147483647
<a name="sql.format.parquet.multiThreadedRead.numThreads"></a>spark.rapids.sql.format.parquet.multiThreadedRead.numThreads|The maximum number of threads, on the executor, to use for reading small parquet files in parallel. This can not be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type.|20
<a name="sql.format.parquet.read.enabled"></a>spark.rapids.sql.format.parquet.read.enabled|When set to false disables parquet input acceleration|true
<a name="sql.format.parquet.reader.type"></a>spark.rapids.sql.format.parquet.reader.type|Sets the parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.parquet.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.parquet.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.|AUTO
<a name="sql.format.parquet.reader.type"></a>spark.rapids.sql.format.parquet.reader.type|Sets the parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.parquet.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.parquet.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. This can be set to AUTO to select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes. The default is currently set to MULTITHREADED because the COALESCING reader does not handle partitioned data efficiently. If you aren't using partitioned data in a non cloud environment, the COALESCING reader would be a good choice.|MULTITHREADED
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true
<a name="sql.hasNans"></a>spark.rapids.sql.hasNans|Config to indicate if your data has NaN's. Cudf doesn't currently support NaN's properly so you can get corrupt data if you have NaN's in your data and it runs on the GPU.|true
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false
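
The parquet reader settings above are ordinary Spark configs. As a minimal, hedged sketch (the values are illustrative, not recommendations), they can be set on a `SparkSession` like this:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune for your environment.
val spark = SparkSession.builder()
  .appName("rapids-parquet-reader-config")
  // PERFILE, COALESCING, MULTITHREADED or AUTO, as described above.
  .config("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
  // Threads per executor used to read small parquet files in parallel.
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.numThreads", "20")
  // Limit on files processed in parallel on the CPU side before being sent to the GPU.
  .config("spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel", "1000")
  .getOrCreate()
```
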
2 changes: 1 addition & 1 deletion docs/dev/README.md
@@ -1,7 +1,7 @@
---
layout: page
title: Developer Overview
nav_order: 9
nav_order: 10
has_children: true
permalink: /developer-overview/
---
2 changes: 1 addition & 1 deletion docs/dev/idea-code-style-settings.md
@@ -1,7 +1,7 @@
---
layout: page
title: IDEA Code Style Settings
nav_order: 3
nav_order: 4
parent: Developer Overview
---
```xml
10 changes: 10 additions & 0 deletions docs/dev/testing.md
@@ -0,0 +1,10 @@
---
layout: page
title: Testing
nav_order: 1
parent: Developer Overview
---
An overview of testing can be found within the repository at:
* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-0.3/tests)
* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-0.3/integration_tests)
* [Benchmarks](../benchmarks.md)
122 changes: 122 additions & 0 deletions docs/download.md
@@ -0,0 +1,122 @@
---
layout: page
title: Download
nav_order: 3
---

## Release v0.3.0
This release includes additional performance improvements, including
* Use of per thread default stream to make more efficient use of the GPU
* Further supporting Spark's adaptive query execution, with more rewritten query plans now able to
run on the GPU
* Performance improvements for reading small Parquet files
* RAPIDS Shuffle with UCX updated to UCX 1.9.0

New functionality for the release includes
* Parquet reading for lists and structs,
* Lead/lag for windows, and
* Greatest/least operators (see the short example below)

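As a hedged illustration of the new window and greatest/least support (the data and column names below are made up), the corresponding Spark APIs look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, greatest, lag, lead, least}

val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
import spark.implicits._

// Made-up data: (key, ts, a, b)
val df = Seq(("x", 1, 10.0, 12.0), ("x", 2, 11.0, 9.0), ("y", 1, 3.0, 4.0))
  .toDF("key", "ts", "a", "b")

val w = Window.partitionBy("key").orderBy("ts")
df.select(
  col("key"), col("ts"),
  lag(col("a"), 1).over(w).as("prev_a"),   // lag over a window
  lead(col("a"), 1).over(w).as("next_a"),  // lead over a window
  greatest(col("a"), col("b")).as("hi"),   // greatest of two columns
  least(col("a"), col("b")).as("lo")       // least of two columns
).show()
```
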
The release is supported on Apache Spark 3.0.0, 3.0.1, Databricks 7.3 ML LTS and Google Cloud
Platform Dataproc 2.0.

The list of all supported operations is provided [here](supported_ops.md).

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

Hardware Requirements:

GPU Architecture: NVIDIA Pascal™ or better (Tested on V100, T4 and A100 GPU)

Software Requirements:

OS: Ubuntu 16.04, Ubuntu 18.04 or CentOS 7

CUDA & Nvidia Drivers: 10.1.2 & v418.87+, 10.2 & v440.33+ or 11.0 & v450.36+

Apache Spark 3.0, 3.0.1, Databricks 7.3 ML LTS Runtime, or GCP Dataproc 2.0

Apache Hadoop 2.10+ or 3.1.1+ (3.1.1 for nvidia-docker version 2)

Python 3.6+, Scala 2.12, Java 8

### Download v0.3.0
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.3.0/rapids-4-spark_2.12-0.3.0.jar)
* [cuDF 11.0 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.17/cudf-0.17-cuda11.jar)
* [cuDF 10.2 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.17/cudf-0.17-cuda10-2.jar)
* [cuDF 10.1 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.17/cudf-0.17-cuda10-1.jar)

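In practice the jars above are placed on the Spark classpath and the plugin is enabled through standard Spark configs (see the [getting started guide](get-started/getting-started-on-prem.md) for the full steps). A hedged sketch with placeholder paths follows; in real deployments these settings are usually supplied via `spark-submit`/`spark-shell` or `spark-defaults.conf` so the jars are on the driver classpath at startup:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder paths -- point these at wherever you saved the downloaded jars.
val spark = SparkSession.builder()
  .appName("rapids-0.3.0-sketch")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // enable the RAPIDS Accelerator
  .config("spark.jars",
    "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.3.0.jar," +
    "/opt/sparkRapidsPlugin/cudf-0.17-cuda10-2.jar")      // use the cudf jar matching your CUDA version
  .getOrCreate()
```
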
## Release v0.2.0
This is the second release of the RAPIDS Accelerator for Apache Spark. Adaptive Query Execution
[SPARK-31412](https://issues.apache.org/jira/browse/SPARK-31412) is a new enhancement that was
included in Spark 3.0 that alters the physical execution plan dynamically to improve the performance
of the query. The RAPIDS Accelerator v0.2 introduces Adaptive Query Execution (AQE) for GPUs and
leverages columnar processing [SPARK-32332](https://issues.apache.org/jira/browse/SPARK-32332)
starting from Spark 3.0.1.

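As a small, hedged sketch (the table name is made up): AQE is controlled by a standard Spark config and can be turned on per session alongside the plugin:

```scala
// Assumes `spark` is a SparkSession with the RAPIDS plugin already enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")  // enable Adaptive Query Execution
spark.sql("SELECT key, COUNT(*) AS n FROM events GROUP BY key").show()  // `events` is a made-up table
```
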
Another enhancement in v0.2 is improvement in reading small Parquet files. This feature takes into
account the scenario where input data can be stored across many small files. By leveraging multiple
CPU threads v0.2 delivers up to 6x performance improvement over the previous release for small
Parquet file reads.

The RAPIDS Accelerator introduces a beta feature that accelerates [Spark shuffle for
GPUs](get-started/getting-started-on-prem.md#enabling-rapidsshufflemanager). Accelerated
shuffle makes use of high bandwidth transfers between GPUs (NVLink or p2p over PCIe) and leverages
RDMA (RoCE or Infiniband) for remote transfers.

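The exact configuration lives in the linked getting-started section; as a rough, hedged sketch, the accelerated shuffle is enabled by swapping in the RAPIDS shuffle manager via standard Spark configs. The shuffle manager class name below is illustrative and version specific, so copy the exact value from the linked guide:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative class name -- the RapidsShuffleManager package is tied to your Spark/plugin version.
val spark = SparkSession.builder()
  .appName("rapids-shuffle-sketch")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark301.RapidsShuffleManager")
  .getOrCreate()
```
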
The list of all supported operations is provided
[here](configs.md#supported-gpu-operators-and-fine-tuning).

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

Hardware Requirements:

GPU Architecture: NVIDIA Pascal™ or better (Tested on V100, T4 and A100 GPU)

Software Requirements:

OS: Ubuntu 16.04, Ubuntu 18.04 or CentOS 7

CUDA & Nvidia Drivers: 10.1.2 & v418.87+, 10.2 & v440.33+ or 11.0 & v450.36+

Apache Spark 3.0, 3.0.1

Apache Hadoop 2.10+ or 3.1.1+ (3.1.1 for nvidia-docker version 2)

Python 3.x, Scala 2.12, Java 8

### Download v0.2.0
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0/rapids-4-spark_2.12-0.2.0.jar)
* [cuDF 11.0 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda11.jar)
* [cuDF 10.2 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-2.jar)
* [cuDF 10.1 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-1.jar)

## Release v0.1.0

Hardware Requirements:

GPU Architecture: NVIDIA Pascal™ or better (Tested on V100 and T4 GPU)

Software Requirements:

OS: Ubuntu 16.04, Ubuntu 18.04 or CentOS 7
(RHEL 7 support is provided through CentOS 7 builds/installs)

CUDA & NVIDIA Drivers: 10.1.2 & v418.87+ or 10.2 & v440.33+

Apache Spark 3.0

Apache Hadoop 2.10+ or 3.1.1+ (3.1.1 for nvidia-docker version 2)

Python 3.x, Scala 2.12, Java 8


### Download v0.1.0
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar)
* [cuDF 10.2 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-2.jar)
* [cuDF 10.1 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar)

2 changes: 1 addition & 1 deletion docs/examples.md
@@ -1,7 +1,7 @@
---
layout: page
title: Demos
nav_order: 4
nav_order: 11
---
# Demos

4 changes: 2 additions & 2 deletions docs/get-started/getting-started-on-prem.md
@@ -38,13 +38,13 @@ to read the deployment method sections before doing any installations.
## Install Spark
To install Apache Spark please follow the official
[instructions](https://spark.apache.org/docs/latest/#launching-on-a-cluster). Supported versions of
Spark are listed on the [stable release](../version/stable-release.md) page. Please note that only
Spark are listed on the [download](../download.md) page. Please note that only
scala version 2.12 is currently supported by the accelerator.

## Download the RAPIDS jars
The [accelerator](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark_2.12) and
[cudf](https://mvnrepository.com/artifact/ai.rapids/cudf) jars are available in the
[download](../version/stable-release.md) section.
[download](../download.md) section.

Download the RAPIDS Accelerator for Apache Spark plugin jar. Then download the version of the cudf
jar that your version of the accelerator depends on. Each cudf jar is for a specific version of
2 changes: 1 addition & 1 deletion docs/ml-integration.md
@@ -1,7 +1,7 @@
---
layout: page
title: ML Integration
nav_order: 7
nav_order: 8
---
# RAPIDS Accelerator for Apache Spark ML Library Integration

22 changes: 12 additions & 10 deletions docs/supported_ops.md
@@ -1,13 +1,13 @@
---
layout: page
title: Operators
nav_order: 4
title: Supported Operators
nav_order: 6
---
<!-- Generated by SupportedOpsDocs.help. DO NOT EDIT! -->
Apache Spark supports processing various types of data. Not all expressions
support all data types. The RAPIDS Accelerator for Apache Spark has further
restrictions on what types are supported for processing. This tries
to document what operations are supported and what data types they support.
to document what operations are supported and what data types each operation supports.
Apache Spark is itself under active development, and this document was generated
against version 3.0.0 of Spark. Most of this should still
apply to other versions of Spark, but there may be slight changes.
@@ -28,7 +28,7 @@ The RAPIDS Accelerator only supports UTC as the time zone for timestamps.

## `CalendarInterval`
In Spark `CalendarInterval`s store three values, months, days, and microseconds.
Support for this types is still very limited in the accelerator. In some cases
Support for this type is still very limited in the accelerator. In some cases
only a subset of the type is supported; for example, window ranges currently only support days.

## Configuration
@@ -44,6 +44,7 @@ the reasons why this particular operator or expression is on the CPU or GPU.

# Key
## Types

|Type Name|Type Description|
|---------|----------------|
|BOOLEAN|Holds true or false values.|
@@ -53,19 +54,20 @@ the reasons why this particular operator or expression is on the CPU or GPU.
|LONG|Signed 64-bit integer value.|
|FLOAT|32-bit floating point value.|
|DOUBLE|64-bit floating point value.|
|DATE|A date with no time component. Stored as 32-bit integer with days since Jan 1, 1970|
|DATE|A date with no time component. Stored as 32-bit integer with days since Jan 1, 1970.|
|TIMESTAMP|A date and time. Stored as 64-bit integer with microseconds since Jan 1, 1970 in the current time zone.|
|STRING|A text string. Stored as UTF-8 encoded bytes.|
|DECIMAL|A fixed point decimal value with configurable precision and scale.|
|NULL|Only stores null values and is typically only used when no other type can be determined from the SQL.|
|BINARY|An array of non-nullable bytes|
|BINARY|An array of non-nullable bytes.|
|CALENDAR|Represents a period of time. Stored as months, days and microseconds.|
|ARRAY|A sequence of elements.|
|MAP|A set of key value pairs, the keys cannot be null.|
|STRUCT|A series of named fields.|
|UDT|User defined types and java Objects. These are not standard SQL types|
|UDT|User defined types and java Objects. These are not standard SQL types.|

## Support

|Value|Description|
|---------|----------------|
|S| (Supported) Both Apache Spark and the RAPIDS Accelerator support this type.|
@@ -75,7 +77,7 @@ the reasons why this particular operator or expression is on the CPU or GPU.
|_PS*_| (Partial Support with limitations) Like regular Partial Support but with general limitations on `Timestamp` or `Decimal` types.|
|**NS**| (Not Supported) Apache Spark supports this type but the RAPIDS Accelerator does not.|

# `SparkPlan` or Executor Nodes
# SparkPlan or Executor Nodes
Apache Spark uses a Directed Acyclic Graph(DAG) of processing to build a query.
The nodes in this graph are instances of `SparkPlan` and represent various high
level operations like doing a filter or project. The operations that the RAPIDS
@@ -726,12 +728,12 @@ Accelerator supports are described below.
<td><b>NS</b></td>
</tr>
</table>
* as was state previously Decimal is only supported up to a precision of
* as was stated previously Decimal is only supported up to a precision of
18 and Timestamp is only supported in the
UTC time zone. Decimals are off by default due to performance impact in
some cases.

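Returning to the `SparkPlan` discussion above, a quick hedged way to see which nodes were actually placed on the GPU is to print the physical plan; with the plugin enabled, GPU operators typically show up with a `Gpu` prefix (the query below is made up):

```scala
// Assumes `spark` is a SparkSession with the RAPIDS plugin enabled.
val q = spark.range(1000).selectExpr("id % 10 AS k", "id AS v").groupBy("k").count()
q.explain()  // look for Gpu* nodes (e.g. GpuHashAggregate) versus CPU fallbacks in the printed plan
```
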
# `Expression` and SQL Functions
# Expression and SQL Functions
Inside each node in the DAG there can be one or more trees of expressions
that describe various types of processing that happens in that part of the plan.
These can be things like adding two numbers together or checking for null.
2 changes: 1 addition & 1 deletion docs/tuning-guide.md
@@ -1,7 +1,7 @@
---
layout: page
title: Tuning
nav_order: 5
nav_order: 7
---
# RAPIDS Accelerator for Apache Spark Tuning Guide
Tuning a Spark job's configuration settings from the defaults can often improve job performance,