
[SPARK-26021][SQL][followup] only deal with NaN and -0.0 in UnsafeWriter #23239

Closed · cloud-fan wants to merge 3 commits into apache:master from cloud-fan:minor

Conversation

@cloud-fan (Contributor) commented Dec 5, 2018

What changes were proposed in this pull request?

A followup of #23043

There are 4 places where we need to deal with NaN and -0.0:

  1. Comparison expressions. -0.0 and 0.0 should be treated as the same value, and different NaNs should be treated as the same value. This should hold for the prefix comparator as well; [SPARK-26382][CORE] prefix comparator should handle -0.0 #23334 fixes it.
  2. Join keys. -0.0 and 0.0 should be treated as the same value, and different NaNs should be treated as the same value.
  3. Grouping keys. -0.0 and 0.0 should be assigned to the same group, and different NaNs should be assigned to the same group.
  4. Window partition keys. -0.0 and 0.0 should be treated as the same value, and different NaNs should be treated as the same value.

Case 1 is OK: our comparison already handles NaN and -0.0, and for struct/array/map we recursively compare the fields/elements.

Cases 2, 3 and 4 are problematic: they compare the `UnsafeRow` binary directly, and different NaNs have different binary representations, as do -0.0 and 0.0.
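As an illustration of the problem (standard IEEE-754 facts, not code from this PR), the raw bit patterns show why byte-wise comparison breaks:

```java
public class FloatBitsDemo {
  public static void main(String[] args) {
    // 0.0f == -0.0f is true as floats, but their raw bit patterns differ.
    System.out.println(Integer.toHexString(Float.floatToRawIntBits(0.0f)));  // 0
    System.out.println(Integer.toHexString(Float.floatToRawIntBits(-0.0f))); // 80000000

    // Many bit patterns encode NaN; both of these are NaN, yet not byte-equal.
    float nan1 = Float.intBitsToFloat(0x7fc00000); // the canonical NaN
    float nan2 = Float.intBitsToFloat(0x7fc00001); // a NaN with a different payload
    System.out.println(Float.isNaN(nan1) && Float.isNaN(nan2));                         // true
    System.out.println(Float.floatToRawIntBits(nan1) == Float.floatToRawIntBits(nan2)); // false
  }
}
```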

To fix them, a simple solution is to normalize float/double values when building unsafe data (`UnsafeRow`, `UnsafeArrayData`, `UnsafeMapData`). Then we don't need to worry about it anywhere else.

Following this direction, this PR moves the handling of NaN and -0.0 from `Platform` to `UnsafeWriter`, so that places like `UnsafeRow.setFloat` no longer handle them, which reduces the perf overhead. It also makes it easier to add comments in `UnsafeWriter` explaining why we do this.
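Concretely, the float write path in `UnsafeWriter` now looks roughly like this (a simplified sketch based on the diff excerpt discussed below; the double path is analogous):

```java
// Sketch of the normalization this PR moves into UnsafeWriter.
protected final void writeFloat(long offset, float value) {
  if (Float.isNaN(value)) {
    value = Float.NaN;   // collapse every NaN payload into the canonical NaN
  } else if (value == -0.0f) {
    value = 0.0f;        // collapse negative zero into positive zero
  }
  Platform.putFloat(getBuffer(), offset, value);
}
```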

How was this patch tested?

existing tests

@cloud-fan (Contributor, Author)

cc @adoron @kiszk @viirya @gatorsmile

@adoron commented Dec 5, 2018

@cloud-fan what about `UnsafeRow::setDouble/Float`? It doesn't go through the same flow. Is it not used?

@cloud-fan (Contributor, Author)

Yes. The 3 cases I pointed out that need to handle NaN and -0.0 do not change values in an existing `UnsafeRow`, so those setters don't need to normalize.

@SparkQA commented Dec 5, 2018

Test build #99738 has finished for PR 23239 at commit 7d5ff06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 5, 2018

Test build #99737 has finished for PR 23239 at commit 797ade3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (Float.isNaN(value)) {
  value = Float.NaN;
} else if (value == -0.0f) {
  value = 0.0f;
}
@dongjoon-hyun (Member) commented Dec 5, 2018

These changes are expected to cause a test case failure in PlatformUtilSuite, but it seems to have been missed. Could you fix the test case or remove it, @cloud-fan?

// We need to take care of NaN and -0.0 in several places:
// 1. When comparing values, different NaNs should be treated as the same, `-0.0` and `0.0` should be
// treated as the same.
// 2. In the range partitioner, different NaNs should belong to the same partition, and -0.0 and 0.0
// should belong to the same partition.
Member
Do we already have a related test for case 2?

@cloud-fan (Contributor, Author)
It turns out this is not a problem. The doc of `RangePartitioning` is misleading; I'm updating it in #23249.

Member
As this is not a problem, we should update the PR description too.

@SparkQA commented Dec 6, 2018

Test build #99780 has finished for PR 23239 at commit 84e3989.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@viirya (Member) commented Dec 7, 2018

The migration guide was changed by another followup, #23141:

In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via Dataset.show, Dataset.collect etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.

Is the above still correct after this change?

@cloud-fan (Contributor, Author)

Yes it is. `UnsafeProjection` always normalizes NaN and -0.0, and Spark uses `UnsafeProjection` to produce output, so users can't distinguish them.
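For example, a snippet along these lines (illustrative, assuming a running `SparkSession` named `spark`; not code from this PR) shows only 0.0 in the output:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// After this change, -0.0 is normalized to 0.0 when the unsafe output row is
// built, so both rows display as 0.0 in show() and collect().
Dataset<Row> df = spark
    .createDataset(Arrays.asList(0.0, -0.0), Encoders.DOUBLE())
    .toDF("v");
df.show();
```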

@dongjoon-hyun (Member) left a comment
+1, LGTM.

@SparkQA commented Dec 7, 2018

Test build #99801 has finished for PR 23239 at commit 84e3989.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Dec 7, 2018

The change looks fine.
Do we already have tests for cases 2 and 4? We know the test for case 3 is here.

@cloud-fan (Contributor, Author)

I checked the original PR that handles NaN: c032b0b

It didn't add end-to-end tests, so I added 2 new tests.
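Roughly what such an end-to-end test verifies (an illustrative sketch, not the PR's actual test code; assumes a `SparkSession` named `spark`):

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// 0.0/-0.0 and the two different NaN payloads should each land in a single
// group, so grouping these four values should produce exactly two groups.
Dataset<Row> df = spark
    .createDataset(
        Arrays.asList(
            0.0, -0.0,
            Double.NaN,
            Double.longBitsToDouble(0x7ff8000000000001L)), // a non-canonical NaN
        Encoders.DOUBLE())
    .toDF("d");
long numGroups = df.groupBy("d").count().count(); // expected: 2
```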

@SparkQA commented Dec 7, 2018

Test build #99816 has finished for PR 23239 at commit c9dfe67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 7, 2018

Test build #99820 has finished for PR 23239 at commit b7a5497.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 7, 2018

Test build #99819 has finished for PR 23239 at commit b9371a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Dec 7, 2018

LGTM

@dongjoon-hyun (Member) left a comment
It seems that all comments are addressed.
Merged to master.

@dongjoon-hyun (Member) commented Dec 8, 2018

@cloud-fan, please make another PR for branch-2.4; there is a conflict on branch-2.4.

asfgit closed this in bdf3284 on Dec 8, 2018
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Dec 9, 2018
Closes apache#23239 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Dec 9, 2018
backport #23239 to 2.4

Closes #23265 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Dec 18, 2018
## What changes were proposed in this pull request?

This is kind of a followup of #23239

`UnsafeProjection` will normalize the special float/double values (NaN and -0.0), so the sorter doesn't have to handle them.

However, for consistency and future-proofing, this PR proposes to normalize `-0.0` in the prefix comparator as well, so that it matches the normal ordering. Note that the prefix comparator already handles NaN.

This is not a bug fix, but a safeguard.

## How was this patch tested?

existing tests

Closes #23334 from cloud-fan/sort.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit befca98)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
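For reference, the usual double-to-prefix trick with the `-0.0` normalization added looks roughly like this (an illustrative sketch, not Spark's exact code):

```java
// Map a double to a 64-bit prefix whose *unsigned* order matches double order.
static long computeDoublePrefix(double value) {
  if (value == -0.0d) {
    value = 0.0d; // normalize negative zero so 0.0 and -0.0 get equal prefixes
  }
  // doubleToLongBits (unlike doubleToRawLongBits) canonicalizes every NaN,
  // so all NaN payloads already map to the same prefix.
  long bits = Double.doubleToLongBits(value);
  // For negative doubles flip all bits; for positive ones flip just the sign
  // bit. The resulting longs compare correctly as unsigned 64-bit integers.
  long mask = (bits >> 63) | Long.MIN_VALUE;
  return bits ^ mask;
}
```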
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
Closes apache#23334 from cloud-fan/sort.
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23239 from cloud-fan/minor.
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23334 from cloud-fan/sort.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
backport apache#23239 to 2.4

Closes apache#23265 from cloud-fan/minor.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
Closes apache#23334 from cloud-fan/sort.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
backport apache#23239 to 2.4

Closes apache#23265 from cloud-fan/minor.
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
Closes apache#23334 from cloud-fan/sort.
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
backport apache/spark#23239 to 2.4

Closes #23265 from cloud-fan/minor.
(cherry picked from commit 33460c5)
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
Closes #23334 from cloud-fan/sort.
(cherry picked from commit befca98)
(cherry picked from commit 16986b2)