Branch 0.4 FAQ reorg (NVIDIA#1725)
* Reorganization of FAQ, addition of .toc

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Fix links and capitalization in the FAQ

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
sameerz authored Feb 17, 2021
1 parent 6bef3a0 commit 0dfc2a8
Showing 1 changed file with 97 additions and 93 deletions.
190 changes: 97 additions & 93 deletions docs/FAQ.md
@@ -5,17 +5,76 @@ nav_order: 11
---
# Frequently Asked Questions

### Why does `explain()` show that the GPU will be used even after setting `spark.rapids.sql.enabled` to `false`?
* TOC
{:toc}

Apache Spark caches what is used to build the output of the `explain()` function. That cache has no
knowledge about configs, so it may return results that are not up to date with the current config
settings. This is true of all configs in Spark. If you changed
`spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting
query would not change to reflect that config and still show a `SortMergeJoin` even though the new
config might have changed to be a `BroadcastHashJoin` instead. When actually running something like
with `collect`, `show` or `write` a new `DataFrame` is constructed causing Spark to re-plan the
query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain shows
stale results.
### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.0 or 3.0.1 of Apache Spark. Because the
plugin replaces parts of the physical plan that Apache Spark considers to be internal, the code for
those plans can change even between bug-fix releases. As part of our process, we try to stay on
top of these changes and release updates as quickly as possible.

### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports [Apache
Spark](get-started/getting-started-on-prem.md), [AWS EMR
6.2.0](get-started/getting-started-aws-emr.md), [Databricks Runtime
7.3](get-started/getting-started-databricks.md) and [Google Cloud
Dataproc](get-started/getting-started-gcp.md). Most distributions based on Apache Spark 3.0.0
should work, but because the plugin replaces parts of the physical plan that Apache Spark considers
to be internal, the code for those plans can change from one distribution to another. We are working
with most cloud service providers to set up testing and validation on their distributions.

### What CUDA versions are supported?

CUDA 10.1, 10.2 and 11.0 are currently supported, but you need to download the cudf jar that
corresponds to the version you are using. Please look [here](download.md) for download
links for the latest release.

### What is the right hardware setup to run GPU accelerated Spark?

Reference architectures should be available around Q1 2021.

### What parts of Apache Spark are accelerated?

Currently a limited set of SQL and DataFrame operations are supported; please see the
[configs](configs.md) and [supported operations](supported_ops.md) for a more complete list of what
is supported. Parts of structured streaming are likely to be accelerated, but structured streaming has
not been an area of focus so far. Other areas like MLlib, GraphX or RDDs are not accelerated.

### What is the road-map like?

Please look at the GitHub repository
[https://github.com/nvidia/spark-rapids](https://github.com/nvidia/spark-rapids). It contains issue
tracking and planning for sprints and releases.

### How much faster will my query run?

No single operator is faster by a fixed amount, so there is no simple algorithm to predict how much
faster a query will run. In addition, Apache Spark can store intermediate data to disk and send it
across the network, both of which we typically see as bottlenecks in real-world queries. Generally,
for complicated queries where all the processing can run on the GPU we see between 3x and 7x
speedup, with a 4x speedup typical. We have seen as high as 100x in some specific cases.

### What operators are best suited for the GPU?

* Group by operations with high cardinality
* Joins with a high cardinality
* Sorts with a high cardinality
* Window operations, especially for large windows
* Complicated processing
* Writing Parquet/ORC
* Reading CSV
* Transcoding (reading an input file and doing minimal processing before writing it out again,
possibly in a different format, like CSV to Parquet)
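
As an illustrative sketch of the list above (not a benchmark), here is the kind of query that
combines several of these traits: a high-cardinality aggregation and join followed by a Parquet
write. The input paths and column names are placeholders.

```scala
// Illustrative only: high-cardinality group-by, join, and Parquet output are the
// kinds of operations that tend to benefit most on the GPU.
val facts = spark.read.parquet("/data/facts")        // hypothetical input path
val dims  = spark.read.parquet("/data/dimensions")   // hypothetical input path

facts
  .groupBy("user_id")                                // high-cardinality group by
  .count()
  .join(dims, "user_id")                             // high-cardinality join
  .write
  .mode("overwrite")
  .parquet("/data/output")                           // Parquet writes do well on the GPU
```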

### Are there initialization costs?

From our tests the GPU typically takes about 2 to 3 seconds to initialize when an executor first
starts. If you are only going to run a single query that takes just a few seconds, this can
be problematic. In general, if you are going to do 30 seconds or more of processing within a single
session, the overhead can be amortized.

### How can I tell what will run on the GPU and what will not run on it?
<a name="explain"></a>
@@ -88,7 +147,7 @@ few places where this is not the case.
this we want to process larger batches of data when possible. These operators will try to combine
smaller batches of data into fewer, larger batches to process more efficiently.

* `SortMergeJoin` - The RAPIDS accelerator does not support sort merge joins yet. For now, we
* `SortMergeJoin` - The RAPIDS Accelerator does not support sort merge joins yet. For now, we
translate sort merge joins into shuffled hash joins. Because of this there are times when sorts
may be removed or other sorts added to meet the ordering requirements of the query.

@@ -98,89 +157,21 @@ few places where this is not the case.
these operations into a single stage the metrics were confusing to understand. Instead, we split
the single stage up into multiple smaller parts, so the metrics are clearer.

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.0 or 3.0.1 of Apache Spark. Because the
plugin replaces parts of the physical plan that Apache Spark considers to be internal the code for
those plans can change even between bug fix releases. As a part of our process, we try to stay on
top of these changes and release updates as quickly as possible.

### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports [Apache
Spark](get-started/getting-started-on-prem.md), [AWS EMR
6.2.0](get-started/getting-started-aws-emr.md), [Databricks Runtime
7.3](get-started/getting-started-databricks.md) and [Google Cloud
Dataproc](get-started/getting-started-gcp.md). Most distributions based off of Apache Spark 3.0.0
should work, but because the plugin replaces parts of the physical plan that Apache Spark considers
to be internal the code for those plans can change from one distribution to another. We are working
with most cloud service providers to set up testing and validation on their distributions.

### What is the right hardware setup to run GPU accelerated Spark?

Reference architectures should be available around Q1 2021.

### What CUDA versions are supported?

CUDA 10.1, 10.2 and 11.0 are currently supported, but you need to download the cudf jar that
corresponds to the version you are using. Please look [here][download.md] for download
links for the latest release.

### What parts of Apache Spark are accelerated?

Currently a limited set of SQL and DataFrame operations are supported, please see the
[configs](configs.md) for a more complete list of what is supported. Some of structured streaming
is likely to be accelerated, but it has not been an area of focus right now. Other areas like
MLLib, GraphX or RDDs are not accelerated.

### What is the road-map like?

Please look at the github repository https://github.com/nvidia/spark-rapids It contains
issue tracking and planning for sprints and releases.

### How much faster will my query run?

Any single operator isn’t a fixed amount faster. So there is no simple algorithm to see how much
faster a query will run. In addition, Apache Spark can store intermediate data to disk and send it
across the network, both of which we typically see as bottlenecks in real world queries. Generally
for complicated queries where all the processing can run on the GPU we see between 3x and 7x
speedup, with a 4x speedup typical. We have seen as high as 100x in some specific cases.

### What operators are best suited for the GPU?

* Group by operations with high cardinality
* Joins with a high cardinality
* Sorts with a high cardinality
* Window operations, especially for large windows
* Complicated processing
* Writing Parquet/ORC
* Reading CSV
* Transcoding (reading an input file and doing minimal processing before writing it out again,
possibly in a different format, like CSV to Parquet)

### Are there initialization costs?

From our tests the GPU typically takes about 2 to 3 seconds to initialize when an executor first
starts. If you are only going to run a single query that only takes a few seconds to run this can
be problematic. In general if you are going to do 30 seconds or more of processing within a single
session the overhead can be amortized.

### Why is the size of my output Parquet/ORC file different?
### Why does `explain()` show that the GPU will be used even after setting `spark.rapids.sql.enabled` to `false`?

This can come down to a number of factors. The GPU version often compresses data in smaller chunks
to get more parallelism and performance. This can result in larger files in some instances. We have
also seen instances where the ordering of the data can have a big impact on the output size of the
files. Spark tends to prefer sort based joins, and in some cases sort based aggregations, whereas
the GPU versions are all hash based. This means that the resulting data can come out in a different
order for the CPU and the GPU. This is not wrong, but can make the size of the output data
different because of compression. Users can turn on
(spark.rapids.sql.hashOptimizeSort.enabled)[configs.md#sql.hashOptimizeSort.enabled] to have
the GPU try to replicate more closely what the output ordering would have been if sort were used,
like on the CPU.
Apache Spark caches what is used to build the output of the `explain()` function. That cache has no
knowledge about configs, so it may return results that are not up to date with the current config
settings. This is true of all configs in Spark. If you changed
`spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting
query would not change to reflect that config and still show a `SortMergeJoin` even though the new
config might have changed to be a `BroadcastHashJoin` instead. When an action such as `collect`,
`show` or `write` actually runs, a new `DataFrame` is constructed, causing Spark to re-plan the
query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain shows
stale results.
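
As a minimal sketch of this behavior (assuming an interactive `spark-shell` session with the
RAPIDS Accelerator on the classpath; the query itself is only illustrative):

```scala
// Build the same query twice so we can compare a cached plan with a fresh one.
def buildQuery() = spark.range(0, 1000000L)
  .selectExpr("id", "id % 100 AS key")
  .groupBy("key")
  .count()

val df = buildQuery()
df.explain()                                         // may show GPU operators

spark.conf.set("spark.rapids.sql.enabled", "false")
df.explain()                                         // can still show the stale, cached GPU plan

buildQuery().explain()                               // a freshly constructed query reflects the new setting
```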

### How are failures handled?

The RAPIDS accelerator does not change the way failures are normally handled by Apache Spark.
The RAPIDS Accelerator does not change the way failures are normally handled by Apache Spark.

### How does the Spark scheduler decide what to do on the GPU vs the CPU?

@@ -230,7 +221,7 @@ Yes
The GPU is not needed on the driver and there is no benefit to having one available on the driver
for the RAPIDS plugin.

### How does the performance compare to DataBricks' DeltaEngine?
### How does the performance compare to Databricks' DeltaEngine?

We have not evaluated the performance yet. DeltaEngine is not open source, so any analysis needs to
be done with Databricks in some form. When DeltaEngine is generally available and the terms of
@@ -266,7 +257,7 @@ There are multiple reasons why this is a problematic configuration:
- CUDA context switches between processes sharing a single GPU can be expensive
- Each executor would have a fraction of the GPU memory available for processing

### Is [Multi Instance GPU (MIG)](https://docs.nvidia.com/cuda/mig/index.html) supported?
### Is [Multi-Instance GPU (MIG)](https://docs.nvidia.com/cuda/mig/index.html) supported?

Yes, but it requires support from the underlying cluster manager to isolate the MIG GPU instance
for each executor (e.g.: by setting `CUDA_VISIBLE_DEVICES` or other means).
@@ -299,6 +290,19 @@ The Scala UDF byte-code analyzer is disabled by default and must be enabled by the
[`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) configuration
setting.
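
As a rough sketch of enabling this setting and registering a simple Scala UDF of the kind the
analyzer may be able to translate (the UDF and query below are only illustrations; whether a given
UDF is actually translated depends on what the analyzer supports):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the $"..." column syntax in spark-shell

// Opt in to the experimental Scala UDF byte-code analyzer (off by default).
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")

// A simple arithmetic UDF that could plausibly be turned into Catalyst expressions.
val plusOne = udf((x: Long) => x + 1L)
spark.range(0, 1000L).select(plusOne($"id").as("id_plus_one")).show()
```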

### Why is the size of my output Parquet/ORC file different?

This can come down to a number of factors. The GPU version often compresses data in smaller chunks
to get more parallelism and performance. This can result in larger files in some instances. We have
also seen instances where the ordering of the data can have a big impact on the output size of the
files. Spark tends to prefer sort based joins, and in some cases sort based aggregations, whereas
the GPU versions are all hash based. This means that the resulting data can come out in a different
order for the CPU and the GPU. This is not wrong, but can make the size of the output data
different because of compression. Users can turn on
[spark.rapids.sql.hashOptimizeSort.enabled](configs.md#sql.hashOptimizeSort.enabled) to have
the GPU try to replicate more closely what the output ordering would have been if sort were used,
like on the CPU.
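
A hedged example of turning this option on for a session before writing output; whether it helps
depends on the data and on how well the reordered output compresses (the table, column and path
below are placeholders):

```scala
// Ask the GPU to sort hash-based results so the output ordering, and therefore the
// compressed file size, is closer to what a CPU sort-based plan would produce.
spark.conf.set("spark.rapids.sql.hashOptimizeSort.enabled", "true")

spark.table("some_table")          // placeholder table
  .groupBy("some_key")             // placeholder column
  .count()
  .write
  .mode("overwrite")
  .parquet("/tmp/example_output")  // placeholder path
```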

### Why am I getting an error when trying to use pinned memory?

@@ -318,7 +322,7 @@ To fix it you can either disable the IOMMU, or you can disable using pinned memory
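
If disabling pinned memory is the route you take, one hedged sketch of doing so is below. It
assumes the pinned pool is controlled by `spark.rapids.memory.pinnedPool.size` and that the plugin
is enabled through `spark.plugins`, as in current plugin releases; setting the pool size to zero
disables pinned memory.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: start a session with the pinned memory pool disabled (size 0).
// Both config keys are assumptions based on current plugin documentation.
val spark = SparkSession.builder()
  .appName("no-pinned-memory-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.memory.pinnedPool.size", "0")
  .getOrCreate()
```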

### Is speculative execution supported?

Yes, speculative execution in Spark is fine with the RAPIDS accelerator plugin.
Yes, speculative execution in Spark is fine with the RAPIDS Accelerator plugin.

As with all speculative execution, it may or may not be beneficial depending on the nature of why a
particular task is slow and how easily speculation is triggered. You should monitor your Spark jobs
