# Updated documentation for distinct count compatibility #1187

Merged Nov 23, 2020
**docs/compatibility.md** (15 additions, 3 deletions)

…floating point aggregations are off by default but can be enabled with the config…

Additionally, some aggregations on floating point columns that contain `NaN` can produce
incorrect results. More details on this behavior can be found
[here](https://github.com/NVIDIA/spark-rapids/issues/87),
[here](https://github.com/NVIDIA/spark-rapids/issues/837),
and in this cudf [feature request](https://github.com/rapidsai/cudf/issues/4753).
If it is known with certainty that the floating point columns do not contain `NaN`,
set [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU-enabled
aggregations on them.
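
As a minimal sketch, assuming a `SparkSession` named `spark` with the RAPIDS plugin loaded,
opting in looks like this:

```scala
// A sketch, assuming a SparkSession named `spark` with the RAPIDS plugin
// loaded. Declaring the data NaN-free lets floating point aggregations
// run on the GPU.
spark.conf.set("spark.rapids.sql.hasNans", "false")

// This aggregation is now eligible for the GPU.
val df = spark.range(1000).selectExpr("CAST(id AS DOUBLE) AS x")
df.agg(org.apache.spark.sql.functions.max("x")).show()
```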

In the case of a distinct count on `NaN` values, the
[issue](https://github.com/NVIDIA/spark-rapids/issues/837) only shows up if the column
contains different `NaN` values. IEEE 754 floating point allows several different binary
values that are all considered to be `NaN`. The plugin treats all of these as the same value,
whereas Spark treats them all as different values. Because this is considered to be rare, we
do not disable distinct count for floating point values even if
[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`.
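
A sketch of how two different `NaN` encodings can arise, on plain CPU Spark and assuming a
`SparkSession` named `spark` (the second bit pattern is an arbitrary illustrative value):

```scala
// Two different bit patterns that are both NaN per IEEE 754: the
// exponent bits are all ones and the mantissa is nonzero in each case.
import spark.implicits._

val canonicalNan = java.lang.Double.longBitsToDouble(0x7ff8000000000000L)
val anotherNan   = java.lang.Double.longBitsToDouble(0x7ff8000000000123L)
assert(canonicalNan.isNaN && anotherNan.isNaN)

// Per the issue above, CPU Spark can count these as two distinct
// values, while the plugin normalizes them and counts one.
val n = Seq(canonicalNan, anotherNan).toDF("x").distinct().count()
println(n)
```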

### `0.0` vs `-0.0`

Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that
they should be interpreted as the same. Most databases normalize these values to always
be `0.0`. Spark does this in some cases, but not all, as documented
[here](https://issues.apache.org/jira/browse/SPARK-32110). The underlying implementation of
this plugin treats them as the same value for essentially all processing. This can result in
some differences from Spark for operations like
[sorting](https://github.com/NVIDIA/spark-rapids/issues/84),
[distinct count](https://github.com/NVIDIA/spark-rapids/issues/837),
[joins, and comparisons](https://github.com/NVIDIA/spark-rapids/issues/294).

We do not disable operations that produce different results due to `-0.0` in the data because
it is considered a rare occurrence.
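
A small illustration of why the two zeros can diverge, using plain JVM behavior with no Spark
required:

```scala
// IEEE equality says the zeros match, but the total ordering and the
// raw bit patterns do not, which is what sorting and hashing observe.
println(0.0 == -0.0)                                // true
println(java.lang.Double.compare(0.0, -0.0))        // positive: 0.0 sorts after -0.0
println(java.lang.Double.doubleToRawLongBits(0.0))  // 0
println(java.lang.Double.doubleToRawLongBits(-0.0)) // sign bit set: -9223372036854775808
```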

## Unicode

Spark delegates Unicode operations to the underlying JVM. Each version of Java complies with a
specific version of the Unicode standard.
…

When translating UDFs to Catalyst expressions, the supported UDF functions are limited:
| | Array.empty[Float] |
| | Array.empty[Double] |
| | Array.empty[String] |
| Method call | Only if the method being called <ol><li>consists of operations supported by the UDF compiler, and</li><li>is one of the following:<ul><li>a final method, or</li><li>a method in a final class, or</li><li>a method in a final object</li></ul></li></ol> |
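
As an illustration, a hypothetical UDF that stays inside these constraints (the names are made
up; assumes a `SparkSession` named `spark` and the UDF compiler enabled via
`spark.rapids.sql.udfCompiler.enabled`):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

final class Scaler {                     // final class, so its methods qualify
  def scale(x: Double): Double = x * 2.0
}
val scaler = new Scaler

// Arithmetic plus a call to a method in a final class: both listed as
// supported above, so this UDF is a candidate for translation.
val scaleUdf = udf((x: Double) => scaler.scale(x) + 1.0)

spark.range(10).selectExpr("CAST(id AS DOUBLE) AS x")
  .select(scaleUdf($"x").as("y"))
  .show()
```

If the compiler cannot translate a UDF, the query still runs; it simply falls back to executing
the UDF on the JVM.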