From c24f92b685f5edf1615bc6ff72cc203dfece9cb4 Mon Sep 17 00:00:00 2001
From: "Robert (Bobby) Evans"
Date: Mon, 23 Nov 2020 08:53:53 -0600
Subject: [PATCH] Updated documentation for distinct count compatibility

Signed-off-by: Robert (Bobby) Evans
---
 docs/compatibility.md | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/docs/compatibility.md b/docs/compatibility.md
index 347dce36e16..0b1f19a45b6 100644
--- a/docs/compatibility.md
+++ b/docs/compatibility.md
@@ -48,23 +48,35 @@ floating point aggregations are off by default but can be enabled with the confi
 
 Additionally, some aggregations on floating point columns that contain `NaN` can produce
 incorrect results. More details on this behavior can be found
-[here](https://github.com/NVIDIA/spark-rapids/issues/87)
+[here](https://github.com/NVIDIA/spark-rapids/issues/87),
+[here](https://github.com/NVIDIA/spark-rapids/issues/837),
 and in this cudf [feature request](https://github.com/rapidsai/cudf/issues/4753).
 If it is known with certainty that the floating point columns do not contain `NaN`, set
 [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled aggregations on
 them.
 
+In the case of a distinct count on `NaN` values, the
+[issue](https://github.com/NVIDIA/spark-rapids/issues/837) only shows up if the data contains
+different `NaN` values. There are several different binary values that are all considered to be
+`NaN` by floating point. The plugin treats all of these as the same value, whereas Spark treats
+them all as different values. Because this is considered to be rare, we do not disable distinct
+count for floating point values even if [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`.
+
 ### `0.0` vs `-0.0`
 
-Floating point allows zero to be encoded as `0.0` and `-0.0`, but the standard says that
+Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that
 they should be interpreted as the same. Most databases normalize these values to always be
 `0.0`. Spark does this in some cases but not all as is documented
 [here](https://issues.apache.org/jira/browse/SPARK-32110). The underlying implementation of this
 plugin treats them as the same for essentially all processing. This can result in some differences
 with Spark for operations like [sorting](https://github.com/NVIDIA/spark-rapids/issues/84),
+[distinct count](https://github.com/NVIDIA/spark-rapids/issues/837),
 [joins, and comparisons](https://github.com/NVIDIA/spark-rapids/issues/294).
 
+We do not disable operations that produce different results due to `-0.0` in the data because
+it is considered to be a rare occurrence.
+
 ## Unicode
 
 Spark delegates Unicode operations to the underlying JVM. Each version of Java complies with a
@@ -407,4 +419,4 @@ When translating UDFs to Catalyst expressions, the supported UDF functions are l
 | | Array.empty[Float] |
 | | Array.empty[Double] |
 | | Array.empty[String] |
-| Method call | Only if the method being called<br/>1. consists of operations supported by the UDF compiler, and<br/>2. is one of the folllowing:<br/>&nbsp;&nbsp;• a final method, or<br/>&nbsp;&nbsp;• a method in a final class, or<br/>&nbsp;&nbsp;• a method in a final object |
+| Method call | Only if the method being called<br/>1. consists of operations supported by the UDF compiler, and<br/>2. is one of the following:<br/>&nbsp;&nbsp;• a final method, or<br/>&nbsp;&nbsp;• a method in a final class, or<br/>&nbsp;&nbsp;• a method in a final object |
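
As a usage sketch of the `spark.rapids.sql.hasNans` setting described in the hunk above (a hypothetical `spark-shell` session; the config key and its meaning come from the patched text, everything else is illustrative):

```scala
// Safe only if the floating point columns are known to contain no NaN values;
// with that promise the plugin can run the affected aggregations on the GPU.
spark.conf.set("spark.rapids.sql.hasNans", "false")
```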
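To see why multiple `NaN` encodings matter for a distinct count, here is a minimal `spark-shell` sketch. It is illustrative only: the variable names are made up, and the expected divergence between CPU and GPU restates the behavior described in the added paragraph rather than demonstrating a guaranteed result.

```scala
import java.lang.{Double => JDouble}
import org.apache.spark.sql.functions.countDistinct
import spark.implicits._

// Two different bit patterns that both decode to NaN.
val quietNan   = JDouble.longBitsToDouble(0x7ff8000000000000L)
val payloadNan = JDouble.longBitsToDouble(0x7ff8000000000001L)

val df = Seq(quietNan, payloadNan, 1.0).toDF("v")

// Per the paragraph above, Spark on the CPU treats the two NaN encodings as
// distinct values, while the plugin normalizes them and counts them as one.
df.select(countDistinct($"v")).show()
```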
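The `0.0` vs `-0.0` difference can be probed the same way; again an illustrative `spark-shell` sketch, with the possible divergence taken from the text above rather than guaranteed by the code:

```scala
import spark.implicits._

// Equal under IEEE comparison rules, but different bit patterns.
assert(0.0 == -0.0)
assert(java.lang.Double.doubleToRawLongBits(0.0) !=
  java.lang.Double.doubleToRawLongBits(-0.0))

// Operations such as sorting, distinct count, joins, and comparisons over
// this column may disagree between Spark's CPU path, which distinguishes the
// two encodings in some cases, and the plugin, which always normalizes them.
val df = Seq(0.0, -0.0).toDF("v")
df.distinct().count()
```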
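For the `Method call` row in the table, here is a sketch of a UDF that fits those rules (the object, method, and session setup are hypothetical, not from the patch):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Scala objects are final, so this satisfies "a method in a final object",
// and the body uses only arithmetic the UDF compiler supports.
object Conversions {
  def cToF(c: Double): Double = c * 9.0 / 5.0 + 32.0
}

val cToF = udf((c: Double) => Conversions.cToF(c))
Seq(0.0, 100.0).toDF("c").select(cToF($"c").as("f")).show()
```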