Merge branch 'branch-21.10' into nested_docs_update
revans2 committed Aug 30, 2021
2 parents 77fbc87 + 5290586 commit 4ebd909
Showing 90 changed files with 2,262 additions and 780 deletions.
17 changes: 10 additions & 7 deletions docs/FAQ.md
@@ -10,9 +10,9 @@ nav_order: 11

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.1.1 or 3.1.2 of Apache
Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to be
internal the code for those plans can change even between bug fix releases. As a part of our
The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.0.3, 3.1.1, or 3.1.2 of
Apache Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to
be internal, the code for those plans can change even between bug fix releases. As a part of our
process, we try to stay on top of these changes and release updates as quickly as possible.
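
If you need to confirm which Spark version a given installation ships, `spark-submit` can report it
directly. This is a quick generic check rather than anything provided by the plugin:

```bash
# Prints a version banner that includes the Spark version, e.g. "version 3.1.2".
spark-submit --version
```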

### Which distributions are supported?
@@ -30,15 +30,15 @@ to set up testing and validation on their distributions.

### What CUDA versions are supported?

CUDA 11.0 and 11.2 are currently supported. Please look [here](download.md) for download links for
the latest release.
CUDA 11.x is currently supported. Please look [here](download.md) for download links for the latest
release.

### What hardware is supported?

The plugin is tested and supported on V100, T4, A10, A30 and A100 datacenter GPUs. It is possible
to run the plugin on GeForce desktop hardware with Volta or better architectures. GeForce hardware
does not support [CUDA enhanced
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases),
does not support [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
and will need CUDA 11.2 installed. If not, the following error will be displayed:

```
@@ -47,6 +47,9 @@ ai.rapids.cudf.CudaException: forward compatibility was attempted on non support
at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(GpuDeviceManager.scala:78)
```

More information about cards that support forward compatibility can be found
[here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#faq).
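
If you are unsure which CUDA version your driver supports, `nvidia-smi` is a quick way to check.
This is a general driver query, not something provided by the plugin:

```bash
# The header of the default summary output includes a "CUDA Version: XX.Y" field,
# which is the highest CUDA runtime version the installed driver supports.
nvidia-smi
```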

### How can I check if the RAPIDS Accelerator is installed and which version is running?

On startup the RAPIDS Accelerator will log a warning message on the Spark driver showing the
52 changes: 43 additions & 9 deletions docs/additional-functionality/qualification-profiling-tools.md
@@ -71,11 +71,12 @@ If any input is a S3 file path or directory path, 2 extra steps are needed to ac
- `hadoop-aws-<version>.jar`
- `aws-java-sdk-<version>.jar`

Taking Hadoop 2.7.4 as an example, we can download the jars below and include them in the '--jars' option to spark-shell or spark-submit:
[hadoop-aws-2.7.4.jar](https://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar) and
[aws-java-sdk-1.7.4.jar](https://repo.maven.apache.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar)
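
For example, a spark-shell launch that includes these jars might look like the following sketch,
assuming both jars were downloaded to the current directory (adjust the paths to wherever you saved
them):

```bash
# Add the S3 connector jars to the session so s3a:// event log paths can be read.
spark-shell \
  --jars ./hadoop-aws-2.7.4.jar,./aws-java-sdk-1.7.4.jar
```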

2. In $SPARK_HOME/conf, create `hdfs-site.xml` containing the AWS S3 keys below:

```xml
<?xml version="1.0"?>
<configuration>
@@ -89,6 +90,7 @@ Take Hadoop 2.7.4 for example, we can download and include below jars in the '--
</property>
</configuration>
```

Please refer to this [doc](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) for
more options on integrating the hadoop-aws module with S3.

@@ -145,13 +147,15 @@ or sql execution time we find as the end time used to calculate the duration.
Note that SQL queries that contain failed jobs are not included.

Sample output in csv:

```
App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types
job3,app-20210507174503-1704,4320658.0,"",9569,4320658,26171,35.34,false,0,"",20,100.0,""
job1,app-20210507174503-2538,19864.04,"",6760,21802,83728,71.3,false,0,"",20,55.56,"Parquet[decimal]"
```

Sample output in text:

```
===========================================================================
| App ID|App Duration|SQL DF Duration|Problematic Duration|
@@ -174,6 +178,7 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
```

Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath)

```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir
@@ -182,6 +187,7 @@ java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP
### Qualification tool options

Note: `--help` should be before the trailing event logs.

```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain --help
@@ -282,16 +288,21 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
Example commands:
- Process the 10 newest logs, and only output the top 3:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir
```
- Process last 100 days' logs:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir
```
- Process only the newest log with the same application name:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir
@@ -359,6 +370,7 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
```
Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath)
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.profiling.ProfileMain /eventlogDir
@@ -457,6 +469,8 @@ For example, GPU run vs CPU run performance comparison or different runs with di
We can input multiple Spark event logs, and this tool can compare environments, executors, and Rapids-related Spark parameters:
- Compare the durations/versions/gpuMode on or off:
```
### A. Information Collected ###
Application Information:
@@ -470,6 +484,7 @@ Application Information:
```
- Executor information:
```
Executor Information:
+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+
@@ -483,6 +498,7 @@
- Data Source information
The details of this output differ depending on whether a Spark Data Source V1 or Data Source V2 reader is used. The Data Source V2 reader truncates the schema, so if you see `...`, then
the full schema is not available.
```
Data Source Information:
+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+
@@ -501,6 +517,7 @@
```
- Matching SQL IDs Across Applications:
```
Matching SQL IDs Across Applications:
+-----------------------+-----------------------+
@@ -522,6 +539,7 @@ also match between CPU plans and GPU plans so in most cases the same query run o
CPU and on the GPU will match.
- Matching Stage IDs Across Applications:
```
Matching Stage IDs Across Applications:
+-----------------------+-----------------------+
@@ -545,6 +563,7 @@ cases there are a different number of stages because of slight differences in th
is a best effort, and it is not guaranteed to match up all stages in a plan.
- Compare Rapids related Spark properties side-by-side:
```
Compare Rapids Properties which are set explicitly:
+-------------------------------------------+----------+----------+
@@ -563,6 +582,7 @@ Compare Rapids Properties which are set explicitly:
```
- List rapids-4-spark and cuDF jars based on classpath:
```
Rapids Accelerator Jar and cuDF Jar:
+--------+------------------------------------------------------------+
@@ -576,6 +596,7 @@ Rapids Accelerator Jar and cuDF Jar:
```
- Job, stage and SQL ID information(not in `compare` mode yet):
```
+--------+-----+---------+-----+
|appIndex|jobID|stageIds |sqlID|
@@ -588,6 +609,7 @@ Rapids Accelerator Jar and cuDF Jar:
- SQL Plan Metrics for Application for each SQL plan node in each SQL:
These are also called accumulables in Spark.
```
SQL Plan Metrics for Application:
+--------+-----+------+-----------------------------------------------------------+-------------+-----------------------+-------------+----------+
@@ -607,11 +629,14 @@ For example if your application id is app-20210507103057-0000, then the
filename will be `app-20210507103057-0000-planDescriptions.log`
- Generate DOT graph for each SQL (-g option):
```
Generated DOT graphs for app app-20210507103057-0000 to /path/. in 17 second(s)
```
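
A run that prints this message can be launched like the earlier profiling examples with the `-g`
flag added; this is a sketch that reuses the same classpath and event log placeholders shown above:

```bash
# Generate DOT graphs (-g) in addition to the regular profiling output.
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.profiling.ProfileMain -g /eventlogDir
```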
Once the DOT file is generated, you can install [graphviz](http://www.graphviz.org) to convert the DOT file
into a graph in pdf format using the command below:
```bash
dot -Tpdf ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.pdf
```
@@ -620,6 +645,7 @@ Or to svg using
```bash
dot -Tsvg ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.svg
```
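
When a run generates DOT files for many queries, it can be convenient to convert all of them at
once; below is a small bash sketch that assumes the `<app-id>-query-<n>` directory layout shown
above:

```bash
# Convert every generated .dot file into an .svg alongside it.
for f in ./app-20210507103057-0000-query-*/*.dot; do
  dot -Tsvg "$f" > "${f%.dot}.svg"
done
```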
The pdf or svg file has the SQL plan graph with metrics. The svg file will act a little
more like the Spark UI and include extra information for nodes when hovering over them with
a mouse.
@@ -641,20 +667,20 @@ timeline view similar Apache Spark's
This displays several data sections.
1) **Tasks** This shows all tasks in the application divided by executor. Please note that this
1. **Tasks** This shows all tasks in the application divided by executor. Please note that this
tries to pack the tasks in the graph. It does not represent actual scheduling on CPU cores.
The tasks are labeled with the time it took for them to run, but there is no breakdown about
different aspects of each task, like there is in Spark's timeline.
2) **STAGES** This shows the stages times reported by Spark. It starts with when the stage was
2. **STAGES** This shows the stages times reported by Spark. It starts with when the stage was
scheduled and ends when Spark considered the stage done.
3) **STAGE RANGES** This shows the time from the start of the first task to the end of the last
3. **STAGE RANGES** This shows the time from the start of the first task to the end of the last
task. Often a stage is scheduled, but there are not enough resources in the cluster to run it.
This helps to show how long it takes for a task to start running after it is scheduled, and in
many cases how long it took to run all of the tasks in the stage. This is not always true because
Spark can intermix tasks from different stages.
4) **JOBS** This shows the time range reported by Spark from when a job was scheduled to when it
4. **JOBS** This shows the time range reported by Spark from when a job was scheduled to when it
completed.
5) **SQL** This shows the time range reported by Spark from when a SQL statement was scheduled to
5. **SQL** This shows the time range reported by Spark from when a SQL statement was scheduled to
when it completed.
Tasks and stages are all color coordinated to help identify which tasks are associated with a given
@@ -669,6 +695,7 @@ stage. Jobs and SQL are not color coordinated.
Below we will aggregate the task level metrics at different levels to do some analysis such as detecting possible shuffle skew.
- Job + Stage level aggregated task metrics:
```
### B. Analysis ###
@@ -678,9 +705,9 @@ Job + Stage level aggregated task metrics:
+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+---------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|1 |job_0 |3333 |222222 |0 |11111111 |111111 |111 |1111.1 |6666666 |55555 |55555 |55555555 |0 |222222222222 |22222222222 |111111 |0 |0 |0 |222222222 |1 |11111 |11111 |99999 |22222222222 |2222221 |222222222222 |0 |222222222222 |222222222222 |5555555 |444444 |
```
- SQL level aggregated task metrics:
```
SQL level aggregated task metrics:
+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+---------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
@@ -690,6 +717,7 @@ SQL level aggregated task metrics:
```
- SQL duration, application duration, if it contains a Dataset operation, potential problems, executor CPU time percent:
```
SQL Duration and Executor CPU Time Percent
+--------+------------------------------+-----+------------+-------------------+------------+------------------+-------------------------+
@@ -700,6 +728,7 @@ SQL Duration and Executor CPU Time Percent
```
- Shuffle Skew Check:
```
Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size)
+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
@@ -709,13 +738,15 @@ Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size)
|1 |2 |0 |2224 |1 |222.22 |8.8 |3333.33 |111.11 |0.01 |false |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa|
+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
```
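
For instance, with this threshold a stage whose tasks read 100 MB from shuffle on average would
have any task reading more than 300 MB flagged as skewed.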
#### C. Health Check
- List failed tasks, stages and jobs
- Removed BlockManagers and Executors
- SQL Plan HealthCheck
Below are examples.
- Print failed tasks:
```
Failed tasks:
+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
@@ -731,6 +762,7 @@ Failed tasks:
```
- Print failed stages:
```
Failed stages:
+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+
@@ -741,6 +773,7 @@ Failed stages:
```
- Print failed jobs:
```
Failed jobs:
+--------+-----+---------+------------------------------------------------------------------------+
@@ -753,6 +786,7 @@ Failed jobs:
- SQL Plan HealthCheck:
Prints possibly unsupported query plan nodes; for example, the `$Lambda` keyword indicates use of the Dataset API.
```
+--------+-----+------+--------+---------------------------------------------------------------------------------------------------+
|appIndex|sqlID|nodeID|nodeName|nodeDescription |