[BUG] orc_test.py::test_orc_scan_with_aggregate_pushdown fails with a standalone cluster on spark 3.3.0 #10099
Comments
@NvTimLiu this has nothing to do with UCX or the environment. I got it to happen on spark-3.3.0 locally. It happens whenever the Spark master is set to a standalone cluster:
Yields:
It really just says to me that either the test or the distributed nature of a cluster is somehow causing the ORC metadata not to be read.
The issue looks to be on the ORC write path. If I disable the GPU for writing the ORC file, it works. If I write with the GPU and read with the CPU, it fails.
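The write/read comparison above can be laid out as a small matrix. This is only a sketch in plain Python: it assumes `spark.rapids.sql.enabled` is the switch used to turn the GPU on or off, and the helper name `write_read_matrix` is hypothetical. It only builds the configurations and does not start Spark.

```python
from itertools import product

# Assumption: this is the conf used to toggle GPU execution on each side.
GPU_SWITCH = "spark.rapids.sql.enabled"


def write_read_matrix():
    """Return the four (writer, reader) combinations and the conf for each side."""
    cases = []
    for writer_gpu, reader_gpu in product([True, False], repeat=2):
        writer = "GPU" if writer_gpu else "CPU"
        reader = "GPU" if reader_gpu else "CPU"
        cases.append({
            "name": f"write={writer}, read={reader}",
            "write_conf": {GPU_SWITCH: str(writer_gpu).lower()},
            "read_conf": {GPU_SWITCH: str(reader_gpu).lower()},
        })
    return cases


if __name__ == "__main__":
    for case in write_read_matrix():
        print(case["name"], case["write_conf"], case["read_conf"])
```

Per the comment above, a CPU write works while the `write=GPU, read=CPU` case fails, which is what points at the write path rather than the read path.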
Do you know why it is failing in standalone cluster mode but not in local mode?
I don't. I suspect it has to do with the different partitioning we are likely hitting in standalone mode => different row count per writer? I am not sure whether there is a code path where the metadata isn't written in all scenarios, but that's what it looks like.
e.g. could some partitions have 0 rows, and does that mean that the ORC writer for the GPU behaves differently?
Oh yeah, that may be the issue. I looked at the written files and indeed one of them has 0 rows. I'm going to check that in the cudf code.
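To illustrate the hypothesis, here is a minimal sketch of how a small dataset spread over more partitions can leave some writers with no rows. The round-robin assignment and the function names are assumptions made for illustration; this is not how Spark actually partitions the test data.

```python
def rows_per_partition(num_rows, num_partitions):
    """Deal rows round-robin into partitions and return each partition's row count."""
    counts = [0] * num_partitions
    for row in range(num_rows):
        counts[row % num_partitions] += 1
    return counts


def empty_partitions(counts):
    """Return the indices of partitions that received no rows."""
    return [i for i, count in enumerate(counts) if count == 0]


if __name__ == "__main__":
    # Few partitions: every writer gets at least one row.
    print(rows_per_partition(3, 1), empty_partitions(rows_per_partition(3, 1)))
    # More partitions than rows: at least one writer produces a 0-row file.
    print(rows_per_partition(3, 4), empty_partitions(rows_per_partition(3, 4)))
```

With more partitions than rows, at least one writer ends up with an empty partition, which matches the 0-row file found among the written files.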
So for the problematic file:
It seems that there are at least 2 problems:
I've filed a cudf issue: rapidsai/cudf#14675
Thanks @ttnghia!!
Verified that rapidsai/cudf#14707 fixes this.
@NvTimLiu The cudf PR above is merged. Please verify in your new environment and close this if the issue no longer exists.
The test passes now; closing.
Describe the bug
orc_test.py::test_orc_scan_with_aggregate_pushdown fails in the UCX/MULTITHREAD test runs.
This failure looks like an environment-related issue, as
BTW, we did not observe these failures before because we previously ran the spark-3.1.2 JDK8 tests on EGX06, and spark-3.1.2 skips orc_test.py::test_orc_scan_with_aggregate_pushdown.
After switching EGX06 to JDK17/spark-3.3.0, these tests fail.
Environment details (please complete the following information)
Additional context