Skip to content

Commit

Permalink
Extended the FAQ (NVIDIA#401)
Browse files Browse the repository at this point in the history
* Extended the FAQ

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

* Addressed review comments

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
  • Loading branch information
revans2 authored Jul 22, 2020
1 parent e8cb4dd commit ce684f7
Showing 1 changed file with 181 additions and 1 deletion.
182 changes: 181 additions & 1 deletion docs/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,187 @@ settings. This is true of all configs in Spark. If you changed
`spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting
query would not change to reflect that config and still show a `SortMergeJoin` even though the
new config might have changed to be a `BroadcastHashJoin` instead. When actually running something
like with `collect`, `show` or `write` a new `DataFrame` is constructed causing spark to replan the
like with `collect`, `show` or `write` a new `DataFrame` is constructed causing spark to re-plan the
query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain
shows stale results.

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.0 of Apache Spark. Because the plugin
replaces parts of the physical plan that Apache Spark considers to be internal the code for those
plans can change even between bug fix releases. As a part of our process, we try to stay on top of
these changes and release updates as quickly as possible.

### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports
[Apache Spark](get-started/getting-started.md),
[Databricks Runtime 7.0](get-started/getting-started-with-rapids-accelerator-on-databricks.md)
and [Google Cloud Dataproc](get-started/getting-started-gcp.md).
Most distributions based off of Apache Spark 3.0.0 should work, but because the plugin replaces
parts of the physical plan that Apache Spark considers to be internal the code for those plans
can change from one distribution to another. We are working with most cloud service providers to
set up testing and validation on their distributions.

### What is the right hardware setup to run GPU accelerated Spark?

Reference Architectures should be available around Q4 2020.

### What CUDA versions are supported?

CUDA 10.1 and 10.2 are currently supported, but you need to download the cudf jar that corresponds
to the version you are using. Please look [here][version/stable-release.md] for download links
for the stable release.

### What parts of Apache Spark are accelerated?

Currently a limited set of SQL and DataFrame operations are supported, please see the
[configs](configs.md) for a more complete list of what is supported. Some of structured streaming
is likely to be accelerated, but it has not been an area of focus right now. Other areas like
MLLib, GraphX or RDDs are not accelerated.

### What is the road-map like?

Please look at the github repository https://github.com/nvidia/spark-rapids It contains
issue tracking and planning for sprints and releases.

### How much faster will my query run?

Any single operator isn’t a fixed amount faster. So there is no simple algorithm to see how much
faster a query will run. In addition, Apache Spark can store intermediate data to disk and send it
across the network, both of which we typically see as bottlenecks in real world queries. Generally
for complicated queries where all the processing can run on the GPU we see between 3x and 7x
speedup, with a 4x speedup typical. We have seen as high as 100x in some specific cases.

### What operators are best suited for the GPU?

* Group by operations with high cardinality
* Joins with a high cardinality
* Sorts with a high cardinality
* Window operations, especially for large windows
* Complicated processing
* Writing Parquet/ORC
* Reading CSV
* Transcoding (reading an input file and doing minimal processing before writing it out again,
possibly in a different format, like CSV to parquet)

### Are there initialization costs?

From our tests the GPU typically takes about 2 to 3 seconds to initialize when an executor first
starts. If you are only going to run a single query that only takes a few seconds to run this can
be problematic. In general if you are going to do 30 seconds or more of processing within a single
session the overhead can be amortized.

### Why is the size of my output Parquet/ORC file different?

This can come down to a number of factors. The GPU version often compresses data in smaller chunks
to get more parallelism and performance. This can result in larger files in some instances. We have
also seen instances where the ordering of the data can have a big impact on the output size of the
files. Spark tends to prefer sort based joins, and in some cases sort based aggregations, whereas
the GPU versions are all hash based. This means that the resulting data can come out in a different
order for the CPU and the GPU. This is not wrong, but can make the size of the output data
different because of compression. Users can turn on
(spark.rapids.sql.hashOptimizeSort.enabled)[configs.md#sql.hashOptimizeSort.enabled] to have
the GPU try to replicate more closely what the output ordering would have been if sort were used,
like on the CPU.

### How are failures handled?

The RAPIDS accelerator does not change the way failures are normally handled by Apache Spark.

### How does the Spark scheduler decide what to do on the GPU vs the CPU?

Technically the Spark scheduler does not make those decisions. The plugin has a set of rules that
decide if an operation can safely be replaced by a GPU enabled version. We are working on some cost
based optimizations to try and improve performance for some situations where it might be more
efficient to stay on the CPU instead of going back and forth.

### Is Dynamic Partition Pruning (DPP) Supported?

Yes, DPP still works. It might not be as efficient as it could be, and we are working to improve it.

### Is Adaptive Query Execution (AQE) Supported?

We are in the process of making sure AQE works. Some parts work now, but other parts require some
changes to the internals of Spark, that we are working with the community to be able to support.

### Are cache and persist supported?

Yes cache and persist are supported, but they are not GPU accelerated yet. We are working with
the Spark community on changes that would allow us to accelerate compression when caching data.

### Can I cache data into GPU memory?

No, that is not currently supported. It would require much larger changes to Apache Spark to be able
to support this.

### Is pyspark supported?

Yes

### Are the R APIs for Spark supported?

Yes, but we don't actively test them.

## Are the Java APIs for Spark supported?

Yes, but we don't actively test them.

## Are the Scala APIs for Spark supported?

Yes

## Is the GPU needed on the driver? Are there any benefits to having a GPU on the driver?

The GPU is not needed on the driver and there is no benefit to having one available on the driver
for the RAPIDS plugin.

## How does the performance compare to DataBricks' DeltaEngine?

We have not evaluated the performance yet. DeltaEngine is not open source, so any analysis needs to
be done with Databricks in some form. When DeltaEngine is generally available and the terms of
service allow it, we will look into doing a comparison.

### How many tasks can I run per executor? How many should I run per executor?

There is no limit on the number of tasks per executor that you can run. Generally we recommend 2 to
6 tasks per executor and 1 GPU per executor. The GPU typically benefits from having 2 tasks run
in [parallel](configs.md#sql.concurrentGpuTasks) on it at a time, assuming your GPU has enough
memory to support that. Having 2 to 3 times as many tasks off of the GPU as on the GPU allows for
I/O to be run in parallel with the processing. If you increase the tasks too high you can overload
the I/O and starting the initial processing can suffer. But if you have a lot of processing that
cannot be done on the GPU, like complex UDFs, the more tasks you have the more CPU processing you
can throw at it.

### How can I run custom expressions/UDFs on the GPU?

We do not currently support this, but we are working on ways to make it possible.

### Why am I getting an error when trying to use pinned memory?

```
Caused by: ai.rapids.cudf.CudaException: OS call failed or operation not supported on this OS
at ai.rapids.cudf.Cuda.hostAllocPinned(Native Method)
at ai.rapids.cudf.PinnedMemoryPool.<init>(PinnedMemoryPool.java:254)
at ai.rapids.cudf.PinnedMemoryPool.lambda$initialize$1(PinnedMemoryPool.java:185)
at java.util.concurrent.FutureTask.run(FutureTask.java:264)
```

This is typically caused by the IOMMU being enabled. Please see the
[CUDA docs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux)
for this issue.

To fix it you can either disable the IOMMU, or you can disable using pinned memory by setting
[spark.rapids.memory.pinnedPool.size](configs.md#memory.pinnedPool.size) to 0.

# Is speculative execution supported?

Yes, speculative execution in Spark is fine with the RAPIDS accelerator plugin.

As with all speculative execution, it may or may not be beneficial depending on the nature of why a
particular task is slow and how easily speculation is triggered. You should monitor your Spark jobs
to see how often task speculation occurs and how often the speculating task (i.e.: the one launched
later) finishes before the slow task that triggered speculation. If the speculating task often
finishes first then that's good, it is working as intended. If many tasks are speculating, but the
original task always finishes first then this is a pure loss, the speculation is adding load to
the Spark cluster with no benefit.

0 comments on commit ce684f7

Please sign in to comment.