From 474c12bfa95e65510f508d9518a939da4b4eb8c2 Mon Sep 17 00:00:00 2001 From: Haoyang Li Date: Thu, 28 Sep 2023 13:24:42 +0800 Subject: [PATCH 1/2] fix incorrect links Signed-off-by: Haoyang Li --- docs/FAQ.md | 2 +- docs/compatibility.md | 26 +++++++++---------- .../get-started/getting-started-databricks.md | 2 +- docs/tuning-guide.md | 4 +-- 4 files changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/FAQ.md b/docs/FAQ.md index e2c7242ae46..1d920bfc7cb 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -458,7 +458,7 @@ files. Spark tends to prefer sort based joins, and in some cases sort based agg the GPU versions are all hash based. This means that the resulting data can come out in a different order for the CPU and the GPU. This is not wrong, but can make the size of the output data different because of compression. Users can turn on -[spark.rapids.sql.hashOptimizeSort.enabled](configs.md#sql.hashOptimizeSort.enabled) to have +[spark.rapids.sql.hashOptimizeSort.enabled](additional-functionality/advanced_configs.md#sql.hashOptimizeSort.enabled) to have the GPU try to replicate more closely what the output ordering would have been if sort were used, like on the CPU. diff --git a/docs/compatibility.md b/docs/compatibility.md index 01f9707e17a..335cac93665 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -36,7 +36,7 @@ task/partition. The RAPIDS Accelerator does an unstable simply means that the sort algorithm allows for spilling parts of the data if it is larger than can fit in the GPU's memory, but it does not guarantee ordering of rows when the ordering of the keys is ambiguous. If you do rely on a stable sort in your processing you can request this by -setting [spark.rapids.sql.stableSort.enabled](configs.md#sql.stableSort.enabled) to `true` and +setting [spark.rapids.sql.stableSort.enabled](additional-functionality/advanced_configs.md#sql.stableSort.enabled) to `true` and RAPIDS will try to sort all the data for a given task/partition at once on the GPU. This may change in the future to allow for a spillable stable sort. @@ -67,7 +67,7 @@ joins on a floating point value, which is not wise to do anyways, and the value floating point aggregation then the join may fail to work properly with the plugin but would have worked with plain Spark. Starting from 22.06 this is behavior is enabled by default but can be disabled with the config -[`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled). +[`spark.rapids.sql.variableFloatAgg.enabled`](additional-functionality/advanced_configs.md#sql.variableFloatAgg.enabled). ### `0.0` vs `-0.0` @@ -513,13 +513,13 @@ GPU: WrappedArray([0], [19], [19], [19], [19], [19], [19], [19], [19], [19], [19 ``` To enable byte-range windowing on the GPU, set -[`spark.rapids.sql.window.range.byte.enabled`](configs.md#sql.window.range.byte.enabled) to true. +[`spark.rapids.sql.window.range.byte.enabled`](additional-functionality/advanced_configs.md#sql.window.range.byte.enabled) to true. 
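For readers following the links being corrected above: the keys in these hunks are ordinary Spark configuration entries. A minimal Scala sketch of setting them when building a `SparkSession` follows; the application name, `local[*]` master, and the `true` values are illustrative assumptions, and the keys only have an effect when the RAPIDS Accelerator jar and `spark.plugins=com.nvidia.spark.SQLPlugin` are already part of the deployment.

```scala
// Illustrative sketch only, not part of the patch: wiring up the configs
// referenced in the FAQ/compatibility hunks above.
import org.apache.spark.sql.SparkSession

object RapidsConfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rapids-conf-example")          // assumed name
      .master("local[*]")                      // assumed master for a local test
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      // ask the GPU to sort after hash aggregates/joins so output ordering
      // is closer to what a CPU sort-based plan would produce
      .config("spark.rapids.sql.hashOptimizeSort.enabled", "true")
      // request a stable sort: all data for a task/partition sorted at once
      .config("spark.rapids.sql.stableSort.enabled", "true")
      // allow floating-point aggregations that may differ slightly from the CPU
      .config("spark.rapids.sql.variableFloatAgg.enabled", "true")
      // opt in to byte-range window frames on the GPU
      .config("spark.rapids.sql.window.range.byte.enabled", "true")
      .getOrCreate()

    // trivial aggregation just to exercise the session
    spark.range(10).selectExpr("id % 3 AS k", "id AS v")
      .groupBy("k").sum("v").show()

    spark.stop()
  }
}
```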
We also provide configurations for other integral range types: -- [`spark.rapids.sql.window.range.short.enabled`](configs.md#sql.window.range.short.enabled) -- [`spark.rapids.sql.window.range.int.enabled`](configs.md#sql.window.range.int.enabled) -- [`spark.rapids.sql.window.range.long.enabled`](configs.md#sql.window.range.short.enabled) +- [`spark.rapids.sql.window.range.short.enabled`](additional-functionality/advanced_configs.md#sql.window.range.short.enabled) +- [`spark.rapids.sql.window.range.int.enabled`](additional-functionality/advanced_configs.md#sql.window.range.int.enabled) +- [`spark.rapids.sql.window.range.long.enabled`](additional-functionality/advanced_configs.md#sql.window.range.short.enabled) The reason why we default the configurations to false for byte/short and to true for int/long is that we think the most real-world queries are based on int or long. @@ -563,7 +563,7 @@ extensively tested and may produce different results compared to the CPU. Known values on GPU where Spark would treat the data as invalid and return null To attempt to use other formats on the GPU, set -[`spark.rapids.sql.incompatibleDateFormats.enabled`](configs.md#sql.incompatibleDateFormats.enabled) +[`spark.rapids.sql.incompatibleDateFormats.enabled`](additional-functionality/advanced_configs.md#sql.incompatibleDateFormats.enabled) to `true`. Formats that contain any of the following characters are unsupported and will fall back to CPU: @@ -585,7 +585,7 @@ Formats that contain any of the following words are unsupported and will fall ba ### LEGACY timeParserPolicy With timeParserPolicy set to `LEGACY` and -[`spark.rapids.sql.incompatibleDateFormats.enabled`](configs.md#sql.incompatibleDateFormats.enabled) +[`spark.rapids.sql.incompatibleDateFormats.enabled`](additional-functionality/advanced_configs.md#sql.incompatibleDateFormats.enabled) set to `true`, and `spark.sql.ansi.enabled` set to `false`, the following formats are supported but not guaranteed to produce the same results as the CPU: @@ -642,7 +642,7 @@ leads to restrictions: Starting from 22.06 this conf is enabled, to disable this operation on the GPU when using Spark 3.1.0 or later, set -[`spark.rapids.sql.castFloatToDecimal.enabled`](configs.md#sql.castFloatToDecimal.enabled) to `false` +[`spark.rapids.sql.castFloatToDecimal.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToDecimal.enabled) to `false` ### Float to Integral Types @@ -653,7 +653,7 @@ starting with 3.1.0 these are now integral types such as `Int.MaxValue` so this affected the valid range of values and now differs slightly from the behavior on GPU in some cases. Starting from 22.06 this conf is enabled, to disable this operation on the GPU when using Spark 3.1.0 or later, set -[`spark.rapids.sql.castFloatToIntegralTypes.enabled`](configs.md#sql.castFloatToIntegralTypes.enabled) +[`spark.rapids.sql.castFloatToIntegralTypes.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToIntegralTypes.enabled) to `false`. This configuration setting is ignored when using Spark versions prior to 3.1.0. @@ -665,7 +665,7 @@ types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spar `E`. As a result the computed string can differ from the default behavior in Spark. Starting from 22.06 this conf is enabled by default, to disable this operation on the GPU, set -[`spark.rapids.sql.castFloatToString.enabled`](configs.md#sql.castFloatToString.enabled) to `false`. 
+[`spark.rapids.sql.castFloatToString.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToString.enabled) to `false`.
 
 ### String to Float
 
@@ -679,7 +679,7 @@ default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respe
 Also, the GPU does not support casting from strings containing hex values.
 
 Starting from 22.06 this conf is enabled by default, to disable this operation on the GPU, set
-[`spark.rapids.sql.castStringToFloat.enabled`](configs.md#sql.castStringToFloat.enabled) to `false`.
+[`spark.rapids.sql.castStringToFloat.enabled`](additional-functionality/advanced_configs.md#sql.castStringToFloat.enabled) to `false`.
 
 ### String to Date
 
@@ -703,7 +703,7 @@ The following formats/patterns are supported on the GPU. Timezone of UTC is assu
 ### String to Timestamp
 
 To allow casts from string to timestamp on the GPU, enable the configuration property
-[`spark.rapids.sql.castStringToTimestamp.enabled`](configs.md#sql.castStringToTimestamp.enabled).
+[`spark.rapids.sql.castStringToTimestamp.enabled`](additional-functionality/advanced_configs.md#sql.castStringToTimestamp.enabled).
 
 Casting from string to timestamp currently has the following limitations.
 
diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md
index f429d361fd9..459a637153c 100644
--- a/docs/get-started/getting-started-databricks.md
+++ b/docs/get-started/getting-started-databricks.md
@@ -107,7 +107,7 @@ cluster meets the prerequisites above by configuring it as follows:
    of python for Databricks. On Databricks, the python runtime requires different parameters than
    the Spark one, so a dedicated python daemon module `rapids.daemon_databricks` is created and
    should be specified here. Set the config
-   [`spark.rapids.sql.python.gpu.enabled`](../configs.md#sql.python.gpu.enabled) to `true` to
+   [`spark.rapids.sql.python.gpu.enabled`](../additional-functionality/advanced_configs.md#sql.python.gpu.enabled) to `true` to
    enable GPU support for python. Add the path of the plugin jar (supposing it is placed under
    `/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details please go to
    [GPU Scheduling For Pandas UDF](../additional-functionality/rapids-udfs.md#gpu-support-for-pandas-udf)
 
diff --git a/docs/tuning-guide.md b/docs/tuning-guide.md
index 19935364124..2e61e72c425 100644
--- a/docs/tuning-guide.md
+++ b/docs/tuning-guide.md
@@ -46,11 +46,11 @@ If there are too many tasks this can increase the memory pressure on the GPU
 and spilling.
 
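The cast-related keys whose links are fixed above are runtime SQL configs, so a session can toggle them without restarting. A hedged sketch, with made-up sample data, is shown below; whether the casts actually run on the GPU still depends on the RAPIDS plugin being loaded, and on plain Spark these keys are simply inert.

```scala
// Sketch under the assumptions above: flip the cast configs per session and
// exercise a string-to-timestamp cast.
import org.apache.spark.sql.SparkSession

object CastConfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cast-conf-example")            // assumed name
      .master("local[*]")                      // assumed master
      .getOrCreate()
    import spark.implicits._

    // assumed to be settable per session; defaults are discussed in the hunks above
    spark.conf.set("spark.rapids.sql.castStringToTimestamp.enabled", "true")
    spark.conf.set("spark.rapids.sql.castStringToFloat.enabled", "true")
    spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "true")

    // sample strings chosen only for illustration
    Seq("2023-09-28 13:24:42", "2023-01-01").toDF("s")
      .selectExpr("CAST(s AS TIMESTAMP) AS ts", "CAST(s AS DATE) AS d")
      .show(false)

    spark.stop()
  }
}
```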
## Pooled Memory -Configuration key: [`spark.rapids.memory.gpu.pooling.enabled`](configs.md#memory.gpu.pooling.enabled) +Configuration key: [`spark.rapids.memory.gpu.pooling.enabled`](additional-functionality/advanced_configs.md#memory.gpu.pooling.enabled) Default value: `true` -Configuration key: [`spark.rapids.memory.gpu.allocFraction`](configs.md#memory.gpu.allocFraction) +Configuration key: [`spark.rapids.memory.gpu.allocFraction`](additional-functionality/advanced_configs.md#memory.gpu.allocFraction) Default value: `1.0` From 3743152d4ebc10e26cd82ed811519559af5d649d Mon Sep 17 00:00:00 2001 From: Haoyang Li Date: Thu, 28 Sep 2023 14:13:58 +0800 Subject: [PATCH 2/2] address comment Signed-off-by: Haoyang Li --- docs/compatibility.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/compatibility.md b/docs/compatibility.md index 335cac93665..e72415b634f 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -519,7 +519,7 @@ We also provide configurations for other integral range types: - [`spark.rapids.sql.window.range.short.enabled`](additional-functionality/advanced_configs.md#sql.window.range.short.enabled) - [`spark.rapids.sql.window.range.int.enabled`](additional-functionality/advanced_configs.md#sql.window.range.int.enabled) -- [`spark.rapids.sql.window.range.long.enabled`](additional-functionality/advanced_configs.md#sql.window.range.short.enabled) +- [`spark.rapids.sql.window.range.long.enabled`](additional-functionality/advanced_configs.md#sql.window.range.long.enabled) The reason why we default the configurations to false for byte/short and to true for int/long is that we think the most real-world queries are based on int or long.
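As a rough illustration of the tuning-guide keys above, and of the window-range config whose anchor the second commit corrects, the settings might be passed at session startup as in the sketch below; the allocation fraction, local master, and query are placeholders, and on a real cluster these would normally be supplied via `spark-submit --conf`.

```scala
// Illustrative only: the memory-pool settings from the tuning-guide hunk plus
// the long-range window config fixed in the second commit.
import org.apache.spark.sql.SparkSession

object GpuPoolExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gpu-pool-example")             // assumed name
      .master("local[*]")                      // assumed master
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      // keep the GPU memory pool enabled (the documented default)
      .config("spark.rapids.memory.gpu.pooling.enabled", "true")
      // reserve only part of the free GPU memory instead of the documented 1.0
      .config("spark.rapids.memory.gpu.allocFraction", "0.8")
      // long-based range window frames on the GPU
      .config("spark.rapids.sql.window.range.long.enabled", "true")
      .getOrCreate()

    spark.range(1000).selectExpr("id", "id % 10 AS k").groupBy("k").count().show()
    spark.stop()
  }
}
```

The memory settings are generally read when the plugin initializes, so they need to be in place at launch rather than changed later with `spark.conf.set`, which is why the builder (or `spark-submit --conf`) is the natural place for them in this sketch.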