From 42c7eb11f6828f3970368d522a1e026faeaf8062 Mon Sep 17 00:00:00 2001
From: liyuan <84758614+nvliyuan@users.noreply.github.com>
Date: Sat, 12 Feb 2022 00:38:35 +0800
Subject: [PATCH 1/2] Fix broken hyperlinks in documentation [skip ci] (#4751)

* fix broken links in branch2202

Signed-off-by: liyuan

* update a link which split across lines

Signed-off-by: liyuan
---
 docs/additional-functionality/rapids-udfs.md  | 12 ++++++------
 docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb  |  2 +-
 docs/demo/GCP/Mortgage-ETL-CPU.ipynb          |  2 +-
 docs/demo/GCP/Mortgage-ETL-GPU.ipynb          |  2 +-
 docs/download.md                              |  4 ++--
 docs/get-started/getting-started-gcp.md       |  6 +++---
 .../getting-started-workload-qualification.md | 12 ++++++------
 docs/tuning-guide.md                          |  2 +-
 8 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/docs/additional-functionality/rapids-udfs.md b/docs/additional-functionality/rapids-udfs.md
index 13c6767f1c6..a58130ef97c 100644
--- a/docs/additional-functionality/rapids-udfs.md
+++ b/docs/additional-functionality/rapids-udfs.md
@@ -141,19 +141,19 @@ in the [udf-examples](../../udf-examples) project.
 
 - [URLDecode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLDecode.scala)
 decodes URL-encoded strings using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 - [URLEncode](../../udf-examples/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLEncode.scala)
 URL-encodes strings using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 
 ### Spark Java UDF Examples
 
 - [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java)
 decodes URL-encoded strings using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 - [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java)
 URL-encodes strings using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 - [CosineSimilarity](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java)
 computes the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
 between two float vectors using [native code](../../udf-examples/src/main/cpp/src)
@@ -162,11 +162,11 @@ between two float vectors using [native code](../../udf-examples/src/main/cpp/sr
 
 - [URLDecode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java)
 implements a Hive simple UDF using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 to decode URL-encoded strings
 - [URLEncode](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java)
 implements a Hive generic UDF using the
-[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/stable)
+[Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy)
 to URL-encode strings
 - [StringWordCount](../../udf-examples/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java)
 implements a Hive simple UDF using
diff --git a/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb b/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
index 2be5b0f1419..5ec48bf4c58 100644
--- a/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
+++ b/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb
@@ -12,7 +12,7 @@
     "\n",
     "Dataset is derived from Fannie Mae’s [Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. For the full raw dataset visit [Fannie Mae]() to register for an account and to download\n",
     "\n",
-    "Instruction is available at NVIDIA [RAPIDS demo site](https://rapidsai.github.io/demos/datasets/mortgage-data).\n",
+    "Instruction is available at NVIDIA [RAPIDS demo site](https://docs.rapids.ai/datasets/mortgage-data).\n",
     "\n",
     "## Prerequisite\n",
     "\n",
diff --git a/docs/demo/GCP/Mortgage-ETL-CPU.ipynb b/docs/demo/GCP/Mortgage-ETL-CPU.ipynb
index 23394714162..83bbe0e1202 100644
--- a/docs/demo/GCP/Mortgage-ETL-CPU.ipynb
+++ b/docs/demo/GCP/Mortgage-ETL-CPU.ipynb
@@ -8,7 +8,7 @@
     "\n",
     "Dataset is derived from Fannie Mae’s [Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. For the full raw dataset visit [Fannie Mae]() to register for an account and to download\n",
     "\n",
-    "Instruction is available at NVIDIA [RAPIDS demo site](https://rapidsai.github.io/demos/datasets/mortgage-data).\n",
+    "Instruction is available at NVIDIA [RAPIDS demo site](https://docs.rapids.ai/datasets/mortgage-data).\n",
    "\n",
     "### Prerequisite\n",
     "\n",
diff --git a/docs/demo/GCP/Mortgage-ETL-GPU.ipynb b/docs/demo/GCP/Mortgage-ETL-GPU.ipynb
index 059a38082b9..1740074b9fa 100644
--- a/docs/demo/GCP/Mortgage-ETL-GPU.ipynb
+++ b/docs/demo/GCP/Mortgage-ETL-GPU.ipynb
@@ -12,7 +12,7 @@
     "\n",
     "Dataset is derived from Fannie Mae’s [Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. For the full raw dataset visit [Fannie Mae]() to register for an account and to download\n",
     "\n",
-    "Instruction is available at NVIDIA [RAPIDS demo site](https://rapidsai.github.io/demos/datasets/mortgage-data).\n",
+    "Instruction is available at NVIDIA [RAPIDS demo site](https://docs.rapids.ai/datasets/mortgage-data).\n",
     "\n",
     "### Prerequisite\n",
     "\n",
diff --git a/docs/download.md b/docs/download.md
index 7cb561a5ae6..57e07898100 100644
--- a/docs/download.md
+++ b/docs/download.md
@@ -619,8 +619,8 @@ account the scenario where input data can be stored across many small files. By
 CPU threads v0.2 delivers up to 6x performance improvement over the previous release for small
 Parquet file reads.
 
-The RAPIDS Accelerator introduces a beta feature that accelerates [Spark shuffle for
-GPUs](get-started/getting-started-on-prem.md#enabling-rapidsshufflemanager). Accelerated
+The RAPIDS Accelerator introduces a beta feature that accelerates
+[Spark shuffle for GPUs](get-started/getting-started-on-prem.md#enabling-rapids-shuffle-manager). Accelerated
 shuffle makes use of high bandwidth transfers between GPUs (NVLink or p2p over PCIe) and leverages
 RDMA (RoCE or Infiniband) for remote transfers.
 
diff --git a/docs/get-started/getting-started-gcp.md b/docs/get-started/getting-started-gcp.md
index 606138732ff..94fe1208639 100644
--- a/docs/get-started/getting-started-gcp.md
+++ b/docs/get-started/getting-started-gcp.md
@@ -85,9 +85,9 @@ If you'd like to further accelerate init time to 4-5 minutes, create a custom Da
 ## Run PySpark or Scala Notebook on a Dataproc Cluster Accelerated by GPUs
 To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab
 and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or
-Jupyter link to start to use sample [Mortgage ETL on GPU Jupyter
-Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process full 17 years [Mortgage
-data](https://rapidsai.github.io/demos/datasets/mortgage-data).
+Jupyter link to start to use sample
+[Mortgage ETL on GPU Jupyter Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process full 17 years
+[Mortgage data](https://docs.rapids.ai/datasets/mortgage-data).
 
 ![Dataproc Web Interfaces](../img/GCP/dataproc-service.png)
 
diff --git a/docs/get-started/getting-started-workload-qualification.md b/docs/get-started/getting-started-workload-qualification.md
index 7272f36b0af..27bd0d0a174 100644
--- a/docs/get-started/getting-started-workload-qualification.md
+++ b/docs/get-started/getting-started-workload-qualification.md
@@ -30,8 +30,8 @@ This article describes the tools we provide and how to do gap analysis and workl
 ### How to use
 
 If you have Spark event logs from prior runs of the applications on Spark 2.x or 3.x, you can use
-the [Qualification tool](../spark-qualification-tool.md) and [Profiling
-tool](../spark-profiling-tool.md) to analyze them. The qualification tool outputs the score, rank
+the [Qualification tool](../spark-qualification-tool.md) and
+[Profiling tool](../spark-profiling-tool.md) to analyze them. The qualification tool outputs the score, rank
 and some of the potentially not-supported features for each Spark application. For example, the
 CSV output can print `Unsupported Read File Formats and Types`, `Unsupported Write Data Format` and
 `Potential Problems` which are the indication of some not-supported features. Its output can help
@@ -119,8 +119,8 @@ the driver logs with `spark.rapids.sql.explain=all`.
 
 This log can show you which operators (on what data type) can not run on GPU and the reason.
 If it shows a specific RAPIDS Accelerator parameter which can be turned on to enable that feature,
-you should first understand the risk and applicability of that parameter based on [configs
-doc](../configs.md) and then enable that parameter and try the tool again.
+you should first understand the risk and applicability of that parameter based on
+[configs doc](../configs.md) and then enable that parameter and try the tool again.
 
 Since its output is directly based on specific version of `rapids-4-spark` jar, the gap analysis
 is pretty accurate.
@@ -213,8 +213,8 @@ which is the same as the driver logs with `spark.rapids.sql.explain=all`.
 
 This log can show you which operators (on what data type) can not run on GPU and the reason.
 If it shows a specific RAPIDS Accelerator parameter which can be turned on to enable that feature,
-you should first understand the risk and applicability of that parameter based on [configs
-doc](../configs.md) and then enable that parameter and try the tool again.
+you should first understand the risk and applicability of that parameter based on
+[configs doc](../configs.md) and then enable that parameter and try the tool again.
 
 Since its output is directly based on specific version of `rapids-4-spark` jar, the gap analysis
 is pretty accurate.
diff --git a/docs/tuning-guide.md b/docs/tuning-guide.md
index 0796d6e943f..1b17fca67cd 100644
--- a/docs/tuning-guide.md
+++ b/docs/tuning-guide.md
@@ -337,7 +337,7 @@ Custom Spark SQL Metrics are available which can help identify performance bottl
 
 Not all metrics are enabled by default. The configuration setting `spark.rapids.sql.metrics.level` can be set
 to `DEBUG`, `MODERATE`, or `ESSENTIAL`, with `MODERATE` being the default value. More information about this
-configuration option is available in the configuration documentation.
+configuration option is available in the [configuration documentation](configs.md#sql.metrics.level).
 
 Output row and batch counts show up for operators where the number of output rows or batches are
 expected to change. For example a filter operation would show the number of rows that passed the

From 75860513610a169c1e721f225c6accec46e4edc7 Mon Sep 17 00:00:00 2001
From: Hao Zhu <9665750+viadea@users.noreply.github.com>
Date: Fri, 11 Feb 2022 08:40:57 -0800
Subject: [PATCH 2/2] Fix databricks doc for limitations. (#4755)

Signed-off-by: Hao Zhu
---
 docs/get-started/getting-started-databricks.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md
index d66243660ca..296ef52e61f 100644
--- a/docs/get-started/getting-started-databricks.md
+++ b/docs/get-started/getting-started-databricks.md
@@ -26,12 +26,12 @@ The number of GPUs per node dictates the number of Spark executors that can run
 1. Adaptive query execution(AQE) and Delta optimization write do not work. These should be disabled
 when using the plugin. Queries may still see significant speedups even with AQE disabled.
 
-   ```bash
-   spark.databricks.delta.optimizeWrite.enabled false
-   spark.sql.adaptive.enabled false
-   ```
+    ```bash
+    spark.databricks.delta.optimizeWrite.enabled false
+    spark.sql.adaptive.enabled false
+    ```
 
-   See [issue-1059](https://github.com/NVIDIA/spark-rapids/issues/1059) for more detail.
+    See [issue-1059](https://github.com/NVIDIA/spark-rapids/issues/1059) for more detail.
 
 2. Dynamic partition pruning(DPP) does not work. This results in poor performance for queries which
 would normally benefit from DPP. See
@@ -42,10 +42,10 @@
 
 4. Cannot spin off multiple executors on a multi-GPU node.
 
-   Even though it is possible to set `spark.executor.resource.gpu.amount=N` (where N is the number
-   of GPUs per node) in the in Spark Configuration tab, Databricks overrides this to
-   `spark.executor.resource.gpu.amount=1`. This will result in failed executors when starting the
-   cluster.
+   Even though it is possible to set `spark.executor.resource.gpu.amount=1` in the Spark
+   Configuration tab, Databricks overrides this to `spark.executor.resource.gpu.amount=N`
+   (where N is the number of GPUs per node). This will result in failed executors when starting the
+   cluster.
 
 5. Databricks makes changes to the runtime without notification.