[BUG] test_array_union_before_spark313 failed in UCX job #6249

NVnavkumar · 2022-08-06T04:18:34Z

Describe the bug
related to #5958 and #6208

cpu = [Decimal('5767.891'), Decimal('7754.981')], gpu = [Decimal('7754.981')]
...
AssertionError: CPU and GPU list have different lengths at [512, 'sort_array(array_union(a, a), true)'] CPU: 2 GPU: 1

Filing this separately from #6208 since this looks to have a different root cause from the test_array_intersect failure there.

The text was updated successfully, but these errors were encountered:

NVnavkumar · 2022-08-06T04:51:09Z

Saw another failure, this one in the UCX standalone integration tests

test_array_union_before_spark313[Integer]

cpu = None, gpu = 1035607283
...
AssertionError: GPU and CPU are not both null at [714, 'sort_array(array_union(a, array()), true)', 0]

revans2 · 2022-08-08T15:07:09Z

Do we know if the CPU is correct and the GPU is failing occasionally or is it the CPU that produces different answers from time to time? Meaning when the test passes what is on line 714 of the result? Is it None or is it 1035607283?

abellina · 2022-08-08T15:45:41Z

This looks to be the line, I used from_cpu[714] to get it:

sort_array(array_union(a, array()), true)=[None, -2144931963, -1086727123, -1031458801, -1016247846, -980093895, -933089324, -532120838, 225126737, 433087733, 1355589379, 1579167152, 1790542840, 2146429012]

So if I understand the path correctly from the assertion error, the None became 1035607283 at some point.

In all of the output, Row 546 is the only one with this number, where it appears in two columns:

Row(
sort_array(array_union(a, b), true)=None, 
sort_array(array_union(b, a), true)=None, 
sort_array(array_union(a, array()), true)=[
  -1709711017, -1296908204, -1194847573, -1134859072, -821509576, -778649659, 
  -433746138, -376668313, -297019147, -208121173, -1, 30024366, 1035607283, 
  1246886084, 1734760735, 2016189556, 2028370132], 
sort_array(array_union(array(), b), true)=None, 
sort_array(array_union(a, a), true)=[
  -1709711017, -1296908204, -1194847573, -1134859072, -821509576, -778649659, 
  -433746138, -376668313, -297019147, -208121173, -1, 30024366, 1035607283, 
  1246886084, 1734760735, 2016189556, 2028370132
 ], 
sort_array(array_union(array(1), array(1, 2, 3)), true)=[1, 2, 3], 
sort_array(array_union(array(), array(1, 2, 3)), true)=[1, 2, 3])

abellina · 2022-08-08T16:23:45Z

Attempting to replicate this in a loop (1k iterations, same app) is not reproducing it. The plan is very simple, it's basically just a project sandwiched between a row->columnar and columnar->row transitions, as such I don't believe this is related to UCX at all since there is no shuffle, and it is some sort of race condition likely in cuDF as the plugin code seems pretty straightforward. @ttnghia any ideas of what could cause the corruption here?

ttnghia · 2022-08-08T16:29:27Z

array_union is implemented as:

Concatenating two arrays
Remove duplicates

The second step should be valid. So I'll investigate the first step to see if there's something wrong with it.

ttnghia · 2022-08-08T16:36:19Z

Can somebody provide me the full input arrays, please?

ttnghia · 2022-08-08T17:07:33Z

Is there any chance that the Spark CPU on databrick has a bug instead? We need to look at the input to know that.

abellina · 2022-08-08T17:12:10Z

This was on our own cluster, so no I don't think so.

ttnghia · 2022-08-08T17:14:59Z

1035607283 seems to be given in an input row that doesn't have nulls. So a null sometimes becomes 1035607283 look suspicious to me. It seems like the array concatenation process is wrongly generating the output, pushing data around between rows.

abellina · 2022-08-08T22:17:58Z

I cannot reproduce this bug so far, I have tried making the test "worse" by adding several attempts at the union of a column with an empty array, and a column with itself. I've also made the input much bigger 40K rows and run it in a loop, and it hasn't reproduced so far.

In `cudf::detail::label_segments`, when the input lists column has empty/nulls lists at the end of the column, its `offsets` column will contain out-of-bound indices. This leads to invalid memory access bug. Such bug is elusive and doesn't show up consistently. Test failures reported in NVIDIA/spark-rapids#6249 are due to this. The existing unit tests already cover such corner case. Unfortunately, the bug didn't show up until being tested on some systems. Even that, it was very difficult to reproduce it. Closes #11495. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Tobias Ribizel (https://github.com/upsj) - Bradley Dice (https://github.com/bdice) - Jim Brennan (https://github.com/jbrennan333) - Alessandro Bellina (https://github.com/abellina) - Karthikeyan (https://github.com/karthikeyann)

ttnghia · 2022-08-09T20:57:28Z

The cudf fix went in. I'm not sure if we have to manually pull cudf in spark-rapids-jni (into branch 22.08)?

abellina · 2022-08-10T18:54:27Z

I tried out the @ttnghia's fix before and after with a repro case and I think this issue should be resolved. Note that the actual test never reproed for me, but instead a modified test without sort_array and with 3 array_union(a,a) expressions was used. rapidsai/cudf#11497 (comment).

NVnavkumar added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 6, 2022

NVnavkumar mentioned this issue Aug 6, 2022

[BUG] test_array_intersect failed in databricks 10.4 runtime and Spark 3.3+ #6208

Closed

NVnavkumar changed the title ~~[BUG] test_array_union_before_spark313[Decimal(7,3)] failed in YARN testing~~ [BUG] test_array_union_before_spark313 failed in YARN testing Aug 6, 2022

NVnavkumar changed the title ~~[BUG] test_array_union_before_spark313 failed in YARN testing~~ [BUG] test_array_union_before_spark313 failed in UCX Aug 6, 2022

sameerz removed the ? - Needs Triage Need team to review and classify label Aug 8, 2022

ttnghia mentioned this issue Aug 9, 2022

[BUG] Out-of-bound access in cudf::detail::label_segments rapidsai/cudf#11495

Closed

abellina changed the title ~~[BUG] test_array_union_before_spark313 failed in UCX~~ [BUG] test_array_union_before_spark313 failed in UCX job Aug 9, 2022

ttnghia mentioned this issue Aug 9, 2022

Fix out-of-bound access in cudf::detail::label_segments rapidsai/cudf#11497

Merged

sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Aug 9, 2022

abellina closed this as completed Aug 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] test_array_union_before_spark313 failed in UCX job #6249

[BUG] test_array_union_before_spark313 failed in UCX job #6249

NVnavkumar commented Aug 6, 2022

NVnavkumar commented Aug 6, 2022

revans2 commented Aug 8, 2022

abellina commented Aug 8, 2022

abellina commented Aug 8, 2022 •

edited

Loading

ttnghia commented Aug 8, 2022

ttnghia commented Aug 8, 2022

ttnghia commented Aug 8, 2022 •

edited

Loading

abellina commented Aug 8, 2022

ttnghia commented Aug 8, 2022

abellina commented Aug 8, 2022

ttnghia commented Aug 9, 2022 •

edited

Loading

abellina commented Aug 10, 2022

[BUG] test_array_union_before_spark313 failed in UCX job #6249

[BUG] test_array_union_before_spark313 failed in UCX job #6249

Comments

NVnavkumar commented Aug 6, 2022

NVnavkumar commented Aug 6, 2022

revans2 commented Aug 8, 2022

abellina commented Aug 8, 2022

abellina commented Aug 8, 2022 • edited Loading

ttnghia commented Aug 8, 2022

ttnghia commented Aug 8, 2022

ttnghia commented Aug 8, 2022 • edited Loading

abellina commented Aug 8, 2022

ttnghia commented Aug 8, 2022

abellina commented Aug 8, 2022

ttnghia commented Aug 9, 2022 • edited Loading

abellina commented Aug 10, 2022

abellina commented Aug 8, 2022 •

edited

Loading

ttnghia commented Aug 8, 2022 •

edited

Loading

ttnghia commented Aug 9, 2022 •

edited

Loading