Merge branch 'branch-21.10' into nested_docs_update
revans2 committed Aug 30, 2021
2 parents 77fbc87 + 5290586 commit 4ebd909
Showing 90 changed files with 2,262 additions and 780 deletions.
17 changes: 10 additions & 7 deletions docs/FAQ.md
@@ -10,9 +10,9 @@ nav_order: 11

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.1.1 or 3.1.2 of Apache
Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to be
internal the code for those plans can change even between bug fix releases. As a part of our
The RAPIDS Accelerator for Apache Spark requires version 3.0.1, 3.0.2, 3.0.3, 3.1.1, or 3.1.2 of
Apache Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to
be internal, the code for those plans can change even between bug fix releases. As a part of our
process, we try to stay on top of these changes and release updates as quickly as possible.
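
If you need to confirm which Spark version a given installation ships, `spark-submit` can report it
directly. This is a quick generic check rather than anything provided by the plugin:

```bash
# Prints a version banner that includes the Spark version, e.g. "version 3.1.2".
spark-submit --version
```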

### Which distributions are supported?
@@ -30,15 +30,15 @@ to set up testing and validation on their distributions.

### What CUDA versions are supported?

CUDA 11.0 and 11.2 are currently supported. Please look [here](download.md) for download links for
the latest release.
CUDA 11.x is currently supported. Please look [here](download.md) for download links for the latest
release.

### What hardware is supported?

The plugin is tested and supported on V100, T4, A10, A30 and A100 datacenter GPUs. It is possible
to run the plugin on GeForce desktop hardware with Volta or better architectures. GeForce hardware
does not support [CUDA enhanced
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#enhanced-compat-minor-releases),
does not support [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
and will need CUDA 11.2 installed. If not, the following error will be displayed:

```
@@ -47,6 +47,9 @@ ai.rapids.cudf.CudaException: forward compatibility was attempted on non support
at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(GpuDeviceManager.scala:78)
```

More information about cards that support forward compatibility can be found
[here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#faq).
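
If you are unsure which CUDA version your driver supports, `nvidia-smi` is a quick way to check.
This is a general driver query, not something provided by the plugin:

```bash
# The header of the default summary output includes a "CUDA Version: XX.Y" field,
# which is the highest CUDA runtime version the installed driver supports.
nvidia-smi
```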

### How can I check if the RAPIDS Accelerator is installed and which version is running?

On startup the RAPIDS Accelerator will log a warning message on the Spark driver showing the
52 changes: 43 additions & 9 deletions docs/additional-functionality/qualification-profiling-tools.md
@@ -71,11 +71,12 @@ If any input is a S3 file path or directory path, 2 extra steps are needed to ac
- `hadoop-aws-<version>.jar`
- `aws-java-sdk-<version>.jar`

Taking Hadoop 2.7.4 as an example, we can download the jars below and include them in the '--jars' option to spark-shell or spark-submit:
[hadoop-aws-2.7.4.jar](https://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar) and
[aws-java-sdk-1.7.4.jar](https://repo.maven.apache.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar)
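
For example, a spark-shell launch that includes these jars might look like the following sketch,
assuming both jars were downloaded to the current directory (adjust the paths to wherever you saved
them):

```bash
# Add the S3 connector jars to the session so s3a:// event log paths can be read.
spark-shell \
  --jars ./hadoop-aws-2.7.4.jar,./aws-java-sdk-1.7.4.jar
```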

2. In $SPARK_HOME/conf, create `hdfs-site.xml` containing the AWS S3 keys below:

```xml
<?xml version="1.0"?>
<configuration>
@@ -89,6 +90,7 @@ Take Hadoop 2.7.4 for example, we can download and include below jars in the '--
</property>
</configuration>
```

Please refer to this [doc](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) for
more options on integrating the hadoop-aws module with S3.

@@ -145,13 +147,15 @@ or sql execution time we find as the end time used to calculate the duration.
Note that SQL queries that contain failed jobs are not included.

Sample output in csv:

```
App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types
job3,app-20210507174503-1704,4320658.0,"",9569,4320658,26171,35.34,false,0,"",20,100.0,""
job1,app-20210507174503-2538,19864.04,"",6760,21802,83728,71.3,false,0,"",20,55.56,"Parquet[decimal]"
```

Sample output in text:

```
===========================================================================
| App ID|App Duration|SQL DF Duration|Problematic Duration|
@@ -174,6 +178,7 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
```

Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath)

```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain /eventlogDir
@@ -182,6 +187,7 @@ java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP
### Qualification tool options

Note: `--help` should be before the trailing event logs.

```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain --help
@@ -282,16 +288,21 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
Example commands:
- Process the 10 newest logs, and only output the top 3:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir
```
- Process last 100 days' logs:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir
```
- Process only the newest log with the same application name:
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir
@@ -359,6 +370,7 @@ Usage: java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*
```
Example running on files in HDFS: (include $HADOOP_CONF_DIR in classpath)
```bash
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
com.nvidia.spark.rapids.tool.profiling.ProfileMain /eventlogDir
@@ -457,6 +469,8 @@ For example, GPU run vs CPU run performance comparison or different runs with di
We can input multiple Spark event logs, and this tool can compare environments, executors, and Rapids-related Spark parameters:
- Compare the durations/versions/gpuMode on or off:
```
### A. Information Collected ###
Application Information:
@@ -470,6 +484,7 @@ Application Information:
```
- Executor information:
```
Executor Information:
+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+
@@ -483,6 +498,7 @@
- Data Source information
The details of this output differ depending on whether a Spark Data Source V1 or Data Source V2 reader is used. The Data Source V2 reader truncates the schema, so if you see `...`, then
the full schema is not available.
```
Data Source Information:
+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+
@@ -501,6 +517,7 @@
```
- Matching SQL IDs Across Applications:
```
Matching SQL IDs Across Applications:
+-----------------------+-----------------------+
@@ -522,6 +539,7 @@ also match between CPU plans and GPU plans so in most cases the same query run o
CPU and on the GPU will match.
- Matching Stage IDs Across Applications:
```
Matching Stage IDs Across Applications:
+-----------------------+-----------------------+
@@ -545,6 +563,7 @@ cases there are a different number of stages because of slight differences in th
is a best effort, and it is not guaranteed to match up all stages in a plan.
- Compare Rapids related Spark properties side-by-side:
```
Compare Rapids Properties which are set explicitly:
+-------------------------------------------+----------+----------+
@@ -563,6 +582,7 @@ Compare Rapids Properties which are set explicitly:
```
- List rapids-4-spark and cuDF jars based on classpath:
```
Rapids Accelerator Jar and cuDF Jar:
+--------+------------------------------------------------------------+
@@ -576,6 +596,7 @@ Rapids Accelerator Jar and cuDF Jar:
```
- Job, stage and SQL ID information(not in `compare` mode yet):
```
+--------+-----+---------+-----+
|appIndex|jobID|stageIds |sqlID|
@@ -588,6 +609,7 @@ Rapids Accelerator Jar and cuDF Jar:
- SQL Plan Metrics for Application for each SQL plan node in each SQL:
These are also called accumulables in Spark.
```
SQL Plan Metrics for Application:
+--------+-----+------+-----------------------------------------------------------+-------------+-----------------------+-------------+----------+
@@ -607,11 +629,14 @@ For example if your application id is app-20210507103057-0000, then the
filename will be `app-20210507103057-0000-planDescriptions.log`
- Generate DOT graph for each SQL (-g option):
```
Generated DOT graphs for app app-20210507103057-0000 to /path/. in 17 second(s)
```
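
A run that prints this message can be launched like the earlier profiling examples with the `-g`
flag added; this is a sketch that reuses the same classpath and event log placeholders shown above:

```bash
# Generate DOT graphs (-g) in addition to the regular profiling output.
java -cp ~/rapids-4-spark-tools_2.12-21.<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
 com.nvidia.spark.rapids.tool.profiling.ProfileMain -g /eventlogDir
```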
Once the DOT file is generated, you can install [graphviz](http://www.graphviz.org) to convert the DOT file
into a graph in pdf format using the command below:
```bash
dot -Tpdf ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.pdf
```
@@ -620,6 +645,7 @@ Or to svg using
```bash
dot -Tsvg ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.svg
```
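
When a run generates DOT files for many queries, it can be convenient to convert all of them at
once; below is a small bash sketch that assumes the `<app-id>-query-<n>` directory layout shown
above:

```bash
# Convert every generated .dot file into an .svg alongside it.
for f in ./app-20210507103057-0000-query-*/*.dot; do
  dot -Tsvg "$f" > "${f%.dot}.svg"
done
```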
The pdf or svg file has the SQL plan graph with metrics. The svg file will act a little
more like the Spark UI and include extra information for nodes when hovering over them with
a mouse.
@@ -641,20 +667,20 @@ timeline view similar Apache Spark's
This displays several data sections.
1) **Tasks** This shows all tasks in the application divided by executor. Please note that this
1. **Tasks** This shows all tasks in the application divided by executor. Please note that this
tries to pack the tasks in the graph. It does not represent actual scheduling on CPU cores.
The tasks are labeled with the time it took for them to run, but there is no breakdown about
different aspects of each task, like there is in Spark's timeline.
2) **STAGES** This shows the stages times reported by Spark. It starts with when the stage was
2. **STAGES** This shows the stages times reported by Spark. It starts with when the stage was
scheduled and ends when Spark considered the stage done.
3) **STAGE RANGES** This shows the time from the start of the first task to the end of the last
3. **STAGE RANGES** This shows the time from the start of the first task to the end of the last
task. Often a stage is scheduled, but there are not enough resources in the cluster to run it.
This helps to show how long it takes for a task to start running after it is scheduled, and in
many cases how long it took to run all of the tasks in the stage. This is not always true because
Spark can intermix tasks from different stages.
4) **JOBS** This shows the time range reported by Spark from when a job was scheduled to when it
4. **JOBS** This shows the time range reported by Spark from when a job was scheduled to when it
completed.
5) **SQL** This shows the time range reported by Spark from when a SQL statement was scheduled to
5. **SQL** This shows the time range reported by Spark from when a SQL statement was scheduled to
when it completed.
Tasks and stages are all color coordinated to help identify which tasks are associated with a given
@@ -669,6 +695,7 @@ stage. Jobs and SQL are not color coordinated.
Below we will aggregate the task level metrics at different levels to do some analysis such as detecting possible shuffle skew.
- Job + Stage level aggregated task metrics:
```
### B. Analysis ###
@@ -678,9 +705,9 @@ Job + Stage level aggregated task metrics:
+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+---------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|1 |job_0 |3333 |222222 |0 |11111111 |111111 |111 |1111.1 |6666666 |55555 |55555 |55555555 |0 |222222222222 |22222222222 |111111 |0 |0 |0 |222222222 |1 |11111 |11111 |99999 |22222222222 |2222221 |222222222222 |0 |222222222222 |222222222222 |5555555 |444444 |
```
- SQL level aggregated task metrics:
```
SQL level aggregated task metrics:
+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+---------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
@@ -690,6 +717,7 @@ SQL level aggregated task metrics:
```
- SQL duration, application duration, if it contains a Dataset operation, potential problems, executor CPU time percent:
```
SQL Duration and Executor CPU Time Percent
+--------+------------------------------+-----+------------+-------------------+------------+------------------+-------------------------+
@@ -700,6 +728,7 @@ SQL Duration and Executor CPU Time Percent
```
- Shuffle Skew Check:
```
Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size)
+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
@@ -709,13 +738,15 @@ Shuffle Skew Check: (When task's Shuffle Read Size > 3 * Avg Stage-level size)
|1 |2 |0 |2224 |1 |222.22 |8.8 |3333.33 |111.11 |0.01 |false |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa|
+--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
```
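
For instance, with this threshold a stage whose tasks read 100 MB from shuffle on average would
have any task reading more than 300 MB flagged as skewed.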
#### C. Health Check
- List failed tasks, stages and jobs
- Removed BlockManagers and Executors
- SQL Plan HealthCheck
Below are examples.
- Print failed tasks:
```
Failed tasks:
+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
@@ -731,6 +762,7 @@ Failed tasks:
```
- Print failed stages:
```
Failed stages:
+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+
@@ -741,6 +773,7 @@ Failed stages:
```
- Print failed jobs:
```
Failed jobs:
+--------+-----+---------+------------------------------------------------------------------------+
@@ -753,6 +786,7 @@ Failed jobs:
- SQL Plan HealthCheck:
Prints possibly unsupported query plan nodes; for example, the `$Lambda` keyword indicates use of the Dataset API.
```
+--------+-----+------+--------+---------------------------------------------------------------------------------------------------+
|appIndex|sqlID|nodeID|nodeName|nodeDescription |