
[Doc]Update 22.06 documentation[skip ci] #5641

Merged 22 commits on Jun 3, 2022

Changes from 4 commits
12 changes: 11 additions & 1 deletion docs/FAQ.md
@@ -307,7 +307,9 @@ Yes

### Are the R APIs for Spark supported?

Yes, but we don't actively test them.
Yes, but we don't actively test them. It is because the RAPIDS Accelerator hooks into Spark not at
Collaborator:

Suggestion for this text and the Java API text below.

Suggested change:
- Yes, but we don't actively test them. It is because the RAPIDS Accelerator hooks into Spark not at
+ Yes, but we don't actively test them, because the RAPIDS Accelerator hooks into Spark not at

Collaborator Author:

Changed both.

the various language APIs but at the Catalyst level after all the various APIs have converged into
the DataFrame API.
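
As an illustrative aside (not part of this documentation change), one way to see that the plugin operates on the Catalyst physical plan rather than on any particular language API is to inspect the plan of a DataFrame query; the session name `spark` and the toy query below are assumptions:

```scala
// Whichever language API built the DataFrame, the Catalyst physical plan is what the
// RAPIDS Accelerator rewrites. When the plugin is active, replaced operators typically
// appear with a "Gpu" prefix (for example GpuProject, GpuFilter) in the plan output.
val df = spark.range(1000).selectExpr("id", "id * 2 AS doubled").filter("doubled > 10")
df.explain()
```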

### Are the Java APIs for Spark supported?

@@ -410,6 +412,14 @@ The Scala UDF byte-code analyzer is disabled by default and must be enabled by the
[`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) configuration
setting.
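
As a minimal, hedged sketch (assuming an existing `SparkSession` named `spark`; the UDF itself is only an illustration), enabling the compiler and defining a simple Scala UDF it could analyze might look like:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Turn on the Scala UDF byte-code analyzer (disabled by default).
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")

// A trivial Scala UDF; simple byte code like this is the kind of UDF the analyzer
// may be able to translate into equivalent Catalyst expressions.
val plusOne = udf((x: Long) => x + 1L)
spark.range(10).select(plusOne(col("id")).as("id_plus_one")).show()
```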

#### Optimize a row-based UDF in a GPU operation

If the UDF cannot be implemented with RAPIDS Accelerated UDFs or automatically translated to
Apache Spark operations, the RAPIDS Accelerator has an experimental feature that transfers only the
data it needs between the GPU and CPU inside a query operation, instead of falling the whole
operation back to the CPU. This feature can be enabled by setting `spark.rapids.sql.rowBasedUDF.enabled` to true.
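
A minimal sketch of enabling it for a session (assuming an existing `SparkSession` named `spark`):

```scala
// Experimental: move only the columns a row-based UDF needs between GPU and CPU
// inside the operation, instead of falling the whole operation back to the CPU.
spark.conf.set("spark.rapids.sql.rowBasedUDF.enabled", "true")
```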


### Why is the size of my output Parquet/ORC file different?

This can come down to a number of factors. The GPU version often compresses data in smaller chunks
59 changes: 59 additions & 0 deletions docs/download.md
@@ -18,6 +18,65 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or submitted with each job
that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started
guide](https://nvidia.github.io/spark-rapids/Getting-Started/) for more details.

## Release v22.06.0
Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA V100, T4 and A2/A10/A30/A100 GPUs

Software Requirements:

OS: Ubuntu 18.04, Ubuntu 20.04 or CentOS 7, CentOS 8

CUDA & NVIDIA Drivers*: 11.x & v450.80.02+

Apache Spark 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, Databricks 9.1 ML LTS or 10.4 ML LTS Runtime and GCP Dataproc 2.0

Python 3.6+, Scala 2.12, Java 8

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v22.06.0
* Download the [RAPIDS
Accelerator for Apache Spark 22.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar)
Member:

Because of this currently bad link, I'd like to see this checked in as late as possible. Otherwise we end up with every PR in the meantime being flagged for a bad link because it's checked in that way.

Collaborator Author:

Yes, we can wait for some time to merge this PR. My plan is to merge it before the merge request to main, so that future gh-pages update PRs can take it from there.


This package is built against CUDA 11.5 and has [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) enabled. It is tested
on V100, T4, A2, A10, A30 and A100 GPUs with CUDA 11.0-11.5. For those using other types of GPUs which
do not have CUDA forward compatibility (for example, GeForce), CUDA 11.5 is required. Users will
Member:

Should say "CUDA 11.5 or later is required" here, as CUDA backward compatibility will allow us to run on CUDA versions > 11.5.

Collaborator Author:

Changed.

need to ensure the minimum driver (450.80.02) and CUDA toolkit are installed on each Spark node.

### Verify signature
* Download the [RAPIDS Accelerator for Apache Spark 22.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar)
and [RAPIDS Accelerator for Apache Spark 22.06.0 jars.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar.asc)
* Download the [PUB_KEY](https://keys.openpgp.org/search?q=sw-spark@nvidia.com).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature: `gpg --verify rapids-4-spark_2.12-22.06.0.jar.asc rapids-4-spark_2.12-22.06.0.jar`

The output if the signature verifies successfully:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <sw-spark@nvidia.com>"

### Release Notes
New functionality and performance improvements for this release include:
* Combined the cuDF jar and the rapids-4-spark jar into a single rapids-4-spark jar
* Add UI for Qualification tool
* Support function map_filter
* Support spark.sql.mapKeyDedupPolicy=LAST_WIN for function transform_keys
* Enable MIG with YARN on Dataproc 2.0
* Enable CSV read by default
* Enable regular expression by default
* Enable some float related configurations by default
Comment on lines +71 to +72:

Collaborator:

Enabling CSV reads, regular expressions, and floating point operations by default ought to be higher on the list of new features. spark.sql.mapKeyDedupPolicy=LAST_WIN is probably not that important to highlight. Rather, we can highlight features like improved ANSI support and support for Avro reading of primitive types.

Collaborator Author:

Refactored the release notes. BTW: "Avro reading of primitive types" was already added in 22.04.

Collaborator:

Got it, thanks.

* Changed the default allocator from ARENA to ASYNC

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v22.04.0
Hardware Requirements:

2 changes: 1 addition & 1 deletion docs/get-started/getting-started-databricks.md
@@ -156,7 +156,7 @@ cluster.
```bash
spark.rapids.sql.python.gpu.enabled true
spark.python.daemon.module rapids.daemon_databricks
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.04.0.jar:/databricks/spark/python
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.06.0.jar:/databricks/spark/python
```

7. Once you’ve added the Spark config, click “Confirm and Restart”.
4 changes: 4 additions & 0 deletions docs/get-started/getting-started-kubernetes.md
@@ -17,6 +17,10 @@ Kubernetes requires a Docker image to run Spark. Generally everything needed is in the
image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script. See this
[Dockerfile.cuda](Dockerfile.cuda) example.

You can find other supported base CUDA images on
[CUDA Docker Hub](https://hub.docker.com/r/nvidia/cuda). Their source Dockerfiles are in this
[GitLab repository](https://gitlab.com/nvidia/container-images/cuda/), which can be used to build
the Docker images from an OS base image from scratch.

## Prerequisites
* Kubernetes cluster is up and running with NVIDIA GPU support
7 changes: 1 addition & 6 deletions docs/get-started/gpu_dataproc_packages_ubuntu_sample.sh
@@ -139,14 +139,12 @@ EOF
systemctl start dataproc-cgroup-device-permissions
}

readonly DEFAULT_SPARK_RAPIDS_VERSION="22.04.0"
readonly DEFAULT_SPARK_RAPIDS_VERSION="22.06.0"
readonly DEFAULT_CUDA_VERSION="11.0"
readonly DEFAULT_CUDF_VERSION="22.04.0"
readonly DEFAULT_XGBOOST_VERSION="1.4.2"
readonly DEFAULT_XGBOOST_GPU_SUB_VERSION="0.3.0"
readonly SPARK_VERSION="3.0"

readonly CUDF_VERSION=${DEFAULT_CUDF_VERSION}
# SPARK config
readonly SPARK_RAPIDS_VERSION=${DEFAULT_SPARK_RAPIDS_VERSION}
readonly XGBOOST_VERSION=${DEFAULT_XGBOOST_VERSION}
@@ -174,9 +172,6 @@ function install_spark_rapids() {
wget -nv --timeout=30 --tries=5 --retry-connrefused \
"${nvidia_repo_url}/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar" \
-P /usr/lib/spark/jars/
wget -nv --timeout=30 --tries=5 --retry-connrefused \
"${rapids_repo_url}/cudf/${CUDF_VERSION}/cudf-${CUDF_VERSION}-cuda${cudf_cuda_version}.jar" \
-P /usr/lib/spark/jars/
}

function configure_spark() {
2 changes: 1 addition & 1 deletion docs/spark-profiling-tool.md
@@ -31,7 +31,7 @@ more information.
The Profiling tool requires the Spark 3.x jars to be able to run but does not need an Apache Spark runtime.
If you do not already have Spark 3.x installed,
you can download the Spark distribution to any machine and include the jars in the classpath.
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.04.0/)
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.06.0/)
- [Download Apache Spark 3.x](http://spark.apache.org/downloads.html) - Spark 3.1.1 for Apache Hadoop is recommended
If you want to compile the jars, please refer to the instructions [here](./spark-qualification-tool.md#How-to-compile-the-tools-jar).

5 changes: 2 additions & 3 deletions docs/spark-qualification-tool.md
@@ -3,7 +3,6 @@ layout: page
title: Qualification Tool
nav_order: 8
---

# Qualification Tool

The Qualification tool analyzes Spark events generated from CPU based Spark applications to determine
@@ -41,7 +40,7 @@ more information.
The Qualification tool requires the Spark 3.x jars to be able to run but does not need an Apache Spark runtime.
If you do not already have Spark 3.x installed, you can download the Spark distribution to
any machine and include the jars in the classpath.
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.04.0/)
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.06.0/)
- [Download Apache Spark 3.x](http://spark.apache.org/downloads.html) - Spark 3.1.1 for Apache Hadoop is recommended

### Step 2 Run the Qualification tool
@@ -236,7 +235,7 @@ below for the description of output fields.
- Java 8 or above, Spark 3.0.1+

### Download the tools jar
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.04.0/)
- Download the jar file from [Maven repository](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.06.0/)

### Modify your application code to call the APIs

8 changes: 8 additions & 0 deletions docs/tuning-guide.md
@@ -194,6 +194,14 @@ rather than megabytes or smaller.
Note that the GPU can encode Parquet and ORC data much faster than the CPU, so the costs of
writing large files can be significantly lower.

## Input Files' Column Order
When there are a large number of columns for file formats like Parquet and ORC, the size of the
contiguous data for each individual column can be very small. This can result in doing lots of very
small random reads to the file system to read the data for the subset of columns that are needed.

We suggest reordering the columns needed by the queries and then rewriting the files to make those
columns adjacent. This could help Spark on both the CPU and the GPU.
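
As an illustrative sketch only (the paths and column names are hypothetical), rewriting a wide Parquet table so that the frequently queried columns come first could look like:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical example: put the columns the queries actually read first so their
// data is stored adjacently, then rewrite the table to a new location.
val df = spark.read.parquet("/data/events")
val hotColumns = Seq("event_time", "user_id", "event_type")
val ordered = hotColumns ++ df.columns.filterNot(hotColumns.contains)
df.select(ordered.map(col): _*)
  .write.mode("overwrite").parquet("/data/events_reordered")
```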

Collaborator:

Should we add a comment here about using spark.rapids.sql.format.parquet.reader.footer.type=NATIVE if there are a large number of columns and the data format is Parquet?

Member:

The feature is experimental. Not sure we're ready to widely advertise it yet, but I'd defer to @revans2 on this.

Collaborator:

Fair enough, we can add the note about it in the tuning guide after it is no longer experimental.

## Input Partition Size

Similar to the discussion on [input file size](#input-files), many queries can benefit from using