[BUG] test_array_union_before_spark313 failed in UCX job #6249
Saw another failure, this one in the UCX standalone integration tests
Do we know whether the CPU is correct and the GPU is failing occasionally, or whether it is the CPU that produces different answers from time to time? That is, when the test passes, what is on line 714 of the result? Is it
This looks to be the line; I used
So if I understand the
In all of the output, Row 546 is the only one with this number, where it appears in two columns:
Attempting to replicate this in a loop (1k iterations, same app) has not reproduced it. The plan is very simple: it's basically just a project sandwiched between a row-to-columnar and a columnar-to-row transition. As such, I don't believe this is related to UCX at all, since there is no shuffle; it is likely some sort of race condition in cuDF, as the plugin code seems pretty straightforward. @ttnghia any ideas what could cause the corruption here?
The second step should be valid, so I'll investigate the first step to see if there's something wrong with it.
Can somebody provide me with the full input arrays, please?
Is there any chance that the Spark CPU on Databricks has a bug instead? We need to look at the input to know that.
This was on our own cluster, so no, I don't think so.
I cannot reproduce this bug so far. I have tried making the test "worse" by adding several attempts at unioning a column with an empty array, and a column with itself. I've also made the input much bigger (40K rows) and run it in a loop, and it hasn't reproduced so far.
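For context, a rough Python model of the `array_union` semantics the test exercises (illustrative only, not Spark or cuDF code): the union keeps the distinct elements of both arrays, so unioning with an empty array or with the column itself should leave the distinct elements unchanged, which is why those variants were used to try to provoke the bug.

```python
# Rough model of Spark's array_union semantics (illustrative, not Spark code).
# Returns the distinct elements of both inputs in first-seen order;
# a null input array yields a null result.

def array_union(a, b):
    if a is None or b is None:
        return None
    seen, out = set(), []
    for x in a + b:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

print(array_union([1, 2, 3], []))         # [1, 2, 3]
print(array_union([1, 2, 3], [1, 2, 3]))  # [1, 2, 3] -- union with itself is a no-op
```

Under this model, both "worsened" variants are identity operations on the distinct elements, so any divergence between CPU and GPU output points at the execution path rather than the semantics.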
In `cudf::detail::label_segments`, when the input lists column has empty/null lists at the end of the column, its `offsets` column will contain out-of-bound indices. This leads to an invalid memory access bug. Such a bug is elusive and doesn't show up consistently. The test failures reported in NVIDIA/spark-rapids#6249 are due to this. The existing unit tests already cover this corner case; unfortunately, the bug didn't show up until being tested on certain systems, and even then it was very difficult to reproduce. Closes #11495.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Tobias Ribizel (https://github.com/upsj)
- Bradley Dice (https://github.com/bdice)
- Jim Brennan (https://github.com/jbrennan333)
- Alessandro Bellina (https://github.com/abellina)
- Karthikeyan (https://github.com/karthikeyann)
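To illustrate the corner case, here is a minimal Python sketch of segment labeling from a lists column's offsets (this is not cuDF's implementation, just a model of the layout). A lists column stores a flat child column plus an offsets array where list `i` spans `child[offsets[i]:offsets[i+1]]`; trailing empty lists simply repeat the final offset, so any kernel that dereferences `child[offsets[i]]` for those lists without a bounds check reads past the end of the child column.

```python
# Sketch of segment labeling from list offsets (illustrative, not cuDF code).
# labels[j] gives the index of the list that child element j belongs to.

def label_segments(offsets):
    labels = []
    for i in range(len(offsets) - 1):
        # List i covers child positions offsets[i] .. offsets[i+1]-1.
        labels.extend([i] * (offsets[i + 1] - offsets[i]))
    return labels

# Four lists: [10, 20], [30], and two trailing empty lists.
# The trailing empty lists repeat the final offset (3), which equals the
# child column's length -- a valid *end* offset but an out-of-bounds *index*.
offsets = [0, 2, 3, 3, 3]
print(label_segments(offsets))  # [0, 0, 1]
```

Note that the empty lists contribute no labels at all, which is why a correct implementation must never index the child column at their offsets.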
The cudf fix went in. I'm not sure whether we have to manually pull cudf into spark-rapids-jni (branch 22.08)?
I tried out @ttnghia's fix before and after with a repro case, and I think this issue should be resolved. Note that the actual test never reproduced for me; instead, a modified test without
Describe the bug
Related to #5958 and #6208.
Filing this separately from #6208 since it looks to have a different root cause from the test_array_intersect failure there.