Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] test_array_union_before_spark313 failed in UCX job #6249

Closed
NVnavkumar opened this issue Aug 6, 2022 · 12 comments
Closed

[BUG] test_array_union_before_spark313 failed in UCX job #6249

NVnavkumar opened this issue Aug 6, 2022 · 12 comments
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@NVnavkumar
Copy link
Collaborator

Describe the bug
related to #5958 and #6208

cpu = [Decimal('5767.891'), Decimal('7754.981')], gpu = [Decimal('7754.981')]
...
AssertionError: CPU and GPU list have different lengths at [512, 'sort_array(array_union(a, a), true)'] CPU: 2 GPU: 1

Filing this separately from #6208 since this looks to have a different root cause from the test_array_intersect failure there.

@NVnavkumar NVnavkumar added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 6, 2022
@NVnavkumar NVnavkumar changed the title [BUG] test_array_union_before_spark313[Decimal(7,3)] failed in YARN testing [BUG] test_array_union_before_spark313 failed in YARN testing Aug 6, 2022
@NVnavkumar NVnavkumar changed the title [BUG] test_array_union_before_spark313 failed in YARN testing [BUG] test_array_union_before_spark313 failed in UCX Aug 6, 2022
@NVnavkumar
Copy link
Collaborator Author

Saw another failure, this one in the UCX standalone integration tests

test_array_union_before_spark313[Integer]

cpu = None, gpu = 1035607283
...
AssertionError: GPU and CPU are not both null at [714, 'sort_array(array_union(a, array()), true)', 0]

@revans2
Copy link
Collaborator

revans2 commented Aug 8, 2022

Do we know if the CPU is correct and the GPU is failing occasionally or is it the CPU that produces different answers from time to time? Meaning when the test passes what is on line 714 of the result? Is it None or is it 1035607283?

@abellina
Copy link
Collaborator

abellina commented Aug 8, 2022

This looks to be the line, I used from_cpu[714] to get it:

sort_array(array_union(a, array()), true)=[None, -2144931963, -1086727123, -1031458801, -1016247846, -980093895, -933089324, -532120838, 225126737, 433087733, 1355589379, 1579167152, 1790542840, 2146429012]

So if I understand the path correctly from the assertion error, the None became 1035607283 at some point.

In all of the output, Row 546 is the only one with this number, where it appears in two columns:

Row(
sort_array(array_union(a, b), true)=None, 
sort_array(array_union(b, a), true)=None, 
sort_array(array_union(a, array()), true)=[
  -1709711017, -1296908204, -1194847573, -1134859072, -821509576, -778649659, 
  -433746138, -376668313, -297019147, -208121173, -1, 30024366, 1035607283, 
  1246886084, 1734760735, 2016189556, 2028370132], 
sort_array(array_union(array(), b), true)=None, 
sort_array(array_union(a, a), true)=[
  -1709711017, -1296908204, -1194847573, -1134859072, -821509576, -778649659, 
  -433746138, -376668313, -297019147, -208121173, -1, 30024366, 1035607283, 
  1246886084, 1734760735, 2016189556, 2028370132
 ], 
sort_array(array_union(array(1), array(1, 2, 3)), true)=[1, 2, 3], 
sort_array(array_union(array(), array(1, 2, 3)), true)=[1, 2, 3])

@abellina
Copy link
Collaborator

abellina commented Aug 8, 2022

Attempting to replicate this in a loop (1k iterations, same app) is not reproducing it. The plan is very simple, it's basically just a project sandwiched between a row->columnar and columnar->row transitions, as such I don't believe this is related to UCX at all since there is no shuffle, and it is some sort of race condition likely in cuDF as the plugin code seems pretty straightforward. @ttnghia any ideas of what could cause the corruption here?

@ttnghia
Copy link
Collaborator

ttnghia commented Aug 8, 2022

array_union is implemented as:

  • Concatenating two arrays
  • Remove duplicates

The second step should be valid. So I'll investigate the first step to see if there's something wrong with it.

@ttnghia
Copy link
Collaborator

ttnghia commented Aug 8, 2022

Can somebody provide me the full input arrays, please?

@ttnghia
Copy link
Collaborator

ttnghia commented Aug 8, 2022

Is there any chance that the Spark CPU on databrick has a bug instead? We need to look at the input to know that.

@abellina
Copy link
Collaborator

abellina commented Aug 8, 2022

This was on our own cluster, so no I don't think so.

@ttnghia
Copy link
Collaborator

ttnghia commented Aug 8, 2022

1035607283 seems to be given in an input row that doesn't have nulls. So a null sometimes becomes 1035607283 look suspicious to me. It seems like the array concatenation process is wrongly generating the output, pushing data around between rows.

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 8, 2022
@abellina
Copy link
Collaborator

abellina commented Aug 8, 2022

I cannot reproduce this bug so far, I have tried making the test "worse" by adding several attempts at the union of a column with an empty array, and a column with itself. I've also made the input much bigger 40K rows and run it in a loop, and it hasn't reproduced so far.

@abellina abellina changed the title [BUG] test_array_union_before_spark313 failed in UCX [BUG] test_array_union_before_spark313 failed in UCX job Aug 9, 2022
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Aug 9, 2022
raydouglass pushed a commit to rapidsai/cudf that referenced this issue Aug 9, 2022
In `cudf::detail::label_segments`, when the input lists column has empty/nulls lists at the end of the column, its `offsets` column will contain out-of-bound indices. This leads to invalid memory access bug. Such bug is elusive and doesn't show up consistently. Test failures reported in NVIDIA/spark-rapids#6249 are due to this.

The existing unit tests already cover such corner case. Unfortunately, the bug didn't show up until being tested on some systems. Even that, it was very difficult to reproduce it.

Closes #11495.

Authors:
   - Nghia Truong (https://github.com/ttnghia)

Approvers:
   - Tobias Ribizel (https://github.com/upsj)
   - Bradley Dice (https://github.com/bdice)
   - Jim Brennan (https://github.com/jbrennan333)
   - Alessandro Bellina (https://github.com/abellina)
   - Karthikeyan (https://github.com/karthikeyann)
@ttnghia
Copy link
Collaborator

ttnghia commented Aug 9, 2022

The cudf fix went in. I'm not sure if we have to manually pull cudf in spark-rapids-jni (into branch 22.08)?

@abellina
Copy link
Collaborator

I tried out the @ttnghia's fix before and after with a repro case and I think this issue should be resolved. Note that the actual test never reproed for me, but instead a modified test without sort_array and with 3 array_union(a,a) expressions was used. rapidsai/cudf#11497 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf
Projects
None yet
Development

No branches or pull requests

5 participants