[BUG] UCX EGX integration test array_test.py::test_array_exists failures #5183
Comments
I was able to reproduce this in the same environment, but I didn't run with UCX on.
I dug into the differences a bit more:
It seems that either an array of strings or an array of integers fails with 3VL off (that said, 3VL on seems to be OK?)
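For anyone not familiar with the 3VL switch, here is a minimal pure-Python sketch of how `exists` on an array is expected to behave with three-valued logic on and off. This is an illustration of the semantics only, not the plugin or cuDF code; the `array_exists` and `longer_than_5` names are hypothetical, and the "length greater than 5" predicate is an assumption based on the `exists_longer_than_5` column name in the failure.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def array_exists(
    values: Optional[Sequence[Optional[T]]],
    pred: Callable[[Optional[T]], Optional[bool]],
    three_valued_logic: bool = True,
) -> Optional[bool]:
    """Sketch of exists(array, pred) with SQL-style NULL handling.

    None stands in for SQL NULL. Hypothetical helper, for illustration only.
    """
    if values is None:
        # A NULL array gives a NULL result either way.
        return None
    found_null = False
    for v in values:
        result = pred(v)
        if result is None:
            found_null = True
        elif result:
            # Any TRUE wins immediately.
            return True
    # No element matched. With 3VL on, an unknown (NULL) predicate result
    # makes the whole answer unknown; with 3VL off it counts as FALSE.
    if three_valued_logic and found_null:
        return None
    return False


def longer_than_5(s: Optional[str]) -> Optional[bool]:
    # Assumed predicate: NULL in, NULL out; otherwise compare the length.
    return None if s is None else len(s) > 5
```

The two modes only disagree when no element matches and at least one predicate result is NULL, so rows like `["abc", None]` are the ones worth looking at when comparing the 3VL-on and 3VL-off code paths.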
I tried this against 22.04 and the test passes there.
Note: I also tested with the 22.06 build from April 1st and it passed the tests, but with the April 5th build it fails.
So just to be sure, you were getting the
The LZ4Compress reference was removed by #5151
Thanks @jbrennan333. This looks to be a cuDF issue, and according to what I've triaged so far it is somewhere in this diff: https://github.com/rapidsai/cudf/compare/291fbcfdf38c33641da277365fc2a40fa3ddb606..090f6b886ad0ebef62ffb0ea25adc42f5b059081. I am building cuDF without the RMM static changes to see whether it still reproduces. If it does, my guess is something really odd could be happening with the thrust patch, but that was just headers (so that doesn't make a whole lot of sense).
According to my experiments, the issue looks to be related to the ASYNC allocator. In 22.06 we added a fix to cudfjni that lets RMM know that the libcudart to use is the statically linked one, so we started using the ASYNC allocator after this commit (abellina/cudf@fa0938f). If I back this out, the job passes, but only because it fails to initialize ASYNC (and all of RMM with it). Another way to make it pass is to force ARENA. So this looks like some sort of race condition in cuDF triggered by ASYNC, which is really odd since there shouldn't be multiple streams involved.
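For anyone who wants to repeat the ARENA vs ASYNC comparison, a rough sketch of building the `--conf` argument for each run is below. It assumes the pool is selected through the plugin's `spark.rapids.memory.gpu.pool` setting with the values `DEFAULT`, `ARENA`, `ASYNC` and `NONE`; that is my assumption about how to force the allocator, not something stated above, and `pool_conf_args` is a hypothetical helper name.

```python
from typing import List

# Assumed set of accepted pool names for spark.rapids.memory.gpu.pool.
ALLOWED_POOLS = {"DEFAULT", "ARENA", "ASYNC", "NONE"}


def pool_conf_args(pool: str) -> List[str]:
    """Return spark-submit style arguments that pin the GPU memory pool.

    Hypothetical helper for illustration; it only builds strings and does
    not talk to Spark or RMM.
    """
    name = pool.upper()
    if name not in ALLOWED_POOLS:
        raise ValueError(f"unknown pool {pool!r}, expected one of {sorted(ALLOWED_POOLS)}")
    return ["--conf", f"spark.rapids.memory.gpu.pool={name}"]
```

Running the same test once with `pool_conf_args("ARENA")` and once with `pool_conf_args("ASYNC")` appended to the submit command keeps everything else identical between the two runs.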
I also want to bring up again that I can't get this to happen with 3VL on. So I believe the code here is suspect: https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/higherOrderFunctions.scala#L368, since the code here doesn't trigger it:
Describe the bug
The latest integration tests on EGX with UCX had the following failures:
12:48:34 E AssertionError: GPU and CPU boolean values are different at [1682, 'exists_longer_than_5']
and another one with different values
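To help read the assertion message: the bracketed part looks like a path of row index and column name. Below is a minimal sketch of a CPU vs GPU comparison that reports the first differing cell in that shape. It is not the integration test framework's actual comparison code; `first_mismatch` is a hypothetical helper and the rows are assumed to be plain dicts keyed by column name.

```python
from typing import Any, Dict, List, Optional, Sequence


def first_mismatch(
    cpu_rows: Sequence[Dict[str, Any]],
    gpu_rows: Sequence[Dict[str, Any]],
) -> Optional[List[Any]]:
    """Return [row_index, column_name] of the first differing cell, or None.

    Hypothetical helper for illustration; assumes both sides have the same
    number of rows in the same order.
    """
    for row_idx, (cpu, gpu) in enumerate(zip(cpu_rows, gpu_rows)):
        for col in cpu:
            if cpu[col] != gpu.get(col):
                return [row_idx, col]
    return None


# Example: the second row disagrees on the boolean column.
cpu = [{"exists_longer_than_5": True}, {"exists_longer_than_5": False}]
gpu = [{"exists_longer_than_5": True}, {"exists_longer_than_5": True}]
mismatch = first_mismatch(cpu, gpu)
```

Read this way, `[1682, 'exists_longer_than_5']` would point at row 1682 of the `exists_longer_than_5` column.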