
[BUG] UCX EGX integration test array_test.py::test_array_exists failures #5183

Closed
tgravescs opened this issue Apr 8, 2022 · 8 comments · Fixed by #5232
Labels: bug (Something isn't working), P0 (Must have for release)
@tgravescs (Collaborator)

Describe the bug
The latest integration tests on EGX with UCX had the following failures:

12:48:34  FAILED ../../src/main/python/array_test.py::test_array_exists[3VL:off-data_gen0]
12:48:34  FAILED ../../src/main/python/array_test.py::test_array_exists[3VL:off-data_gen1]

12:48:34 E AssertionError: GPU and CPU boolean values are different at [1682, 'exists_longer_than_5']

and another failure with different values.
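
For context, here is a minimal sketch of the kind of CPU-vs-GPU comparison the test performs. This is not the actual array_test.py code; the predicates are guesses based on the derived column names in the failure output (exists_longer_than_5, exists_even, exists_negative), and a running SparkSession named `spark` with the RAPIDS plugin enabled is assumed.

```python
# Minimal sketch (not the real spark-rapids test): build an array<int> column
# and derive boolean columns with Spark's exists() higher-order function.
# The integration test runs the same query on CPU and GPU and asserts that
# the collected rows match.
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=[332248246, 545395431, -495984289]),
    Row(a=[-655713209, 0, None, -132615933, -808871756, None]),
])

result = df.selectExpr(
    "a",
    "exists(a, x -> x % 2 = 0) AS exists_even",
    "exists(a, x -> x < 0) AS exists_negative",
    "exists(a, x -> x >= 0) AS exists_non_negative",
)
result.show(truncate=False)
```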

@tgravescs added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on Apr 8, 2022
@abellina (Collaborator)

I was able to reproduce this in the same environment, though I didn't run with UCX enabled.

../../src/main/python/array_test.py::test_array_exists[3VL:off-data_gen0][IGNORE_ORDER({'local': True})] FAILED [ 25%]
../../src/main/python/array_test.py::test_array_exists[3VL:off-data_gen1][IGNORE_ORDER({'local': True})] FAILED [ 50%]
../../src/main/python/array_test.py::test_array_exists[3VL:on-data_gen0][IGNORE_ORDER({'local': True})] PASSED [ 75%]
../../src/main/python/array_test.py::test_array_exists[3VL:on-data_gen1][IGNORE_ORDER({'local': True})] PASSED [100%]

I dug into the differences a bit more:

CPU:
Row(a=[332248246, 545395431, -495984289], exists_even=True, exists_negative=True

GPU:
Row(a=[332248246, 545395431, -495984289], exists_even=True, exists_negative=False

CPU:
Row(a=[-655713209, 0, None, -132615933, -808871756, None], exists_even=True, exists_negative=True, exists_non_negative=True

GPU:
Row(a=[-655713209, 0, None, -132615933, -808871756, None], exists_even=True, exists_negative=True, exists_non_negative=False

It seems that both the array-of-string and the array-of-integer cases fail with 3VL off (that said, 3VL on seems to be OK?).
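
For anyone reproducing this by hand: I believe the 3VL:off / 3VL:on parametrization corresponds to Spark's legacy three-valued-logic flag for exists(); treat the exact config name below as an assumption and check it against array_test.py.

```python
# Sketch of toggling 3VL behavior for exists() by hand (assumes a SparkSession
# named `spark`; the config name is Spark's legacy flag and my assumption of
# what 3VL:off/on maps to in the test).
df = spark.createDataFrame([([1, None, -3],)], "a array<int>")

for setting in ("false", "true"):  # roughly 3VL:off vs 3VL:on
    spark.conf.set(
        "spark.sql.legacy.followThreeValuedLogicInArrayExists", setting)
    df.selectExpr("exists(a, x -> x < 0) AS exists_negative").show()
```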

@tgravescs (Collaborator, Author)


I tried this against 22.04 and the test passes there.

@tgravescs (Collaborator, Author)


Note: I also tested with the 22.06 build from April 1st and it passed the tests, but the April 5th build fails.

@abellina (Collaborator)


> but the April 5th build fails

So just to be sure, you were getting the missing LZ4Compressor reference in cuDF? That's what I am getting. So it seems April 1 works and something broke after that.

@jbrennan333 (Collaborator)


The LZ4Compressor reference was removed by #5151.

@abellina (Collaborator)


abellina commented Apr 11, 2022

Thanks @jbrennan333.

This looks to be a cuDF issue, and it is in this diff, according to what I've triaged so far: https://github.com/rapidsai/cudf/compare/291fbcfdf38c33641da277365fc2a40fa3ddb606..090f6b886ad0ebef62ffb0ea25adc42f5b059081.

I am building cuDF without the RMM static changes to see if we still hit it. If not, my guess is that something really odd could be happening with the thrust patch, but that was just headers (so it doesn't make a whole lot of sense).

@abellina self-assigned this on Apr 11, 2022
@abellina (Collaborator)

abellina commented Apr 12, 2022

According to my experiments, the issue looks to be related to the ASYNC allocator. In 22.06 we added a fix to cudfjni that lets RMM know that the libcudart to use is the statically linked one, so we started using the ASYNC allocator after this commit (abellina/cudf@fa0938f).

If I back this out, the job works, but only because it fails to initialize ASYNC (and, with it, all of RMM). Another way to make it pass is to force the ARENA allocator.

So this looks to be some sort of race condition in cuDF triggered by ASYNC, which is really odd since there shouldn't be multiple streams involved.
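
For reference, the ARENA workaround amounts to overriding the plugin's pool selection. Here is a sketch; the config names are my assumptions of the relevant spark-rapids settings, so verify them against the plugin docs for your branch.

```python
# Sketch of forcing the ARENA allocator instead of ASYNC as a workaround.
# Assumption: spark.rapids.memory.gpu.pool selects the RMM pool; check the
# spark-rapids configuration docs for the exact accepted values on 22.06.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.rapids.memory.gpu.pool", "ARENA")  # instead of ASYNC
    .getOrCreate()
)
```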
