Describe the bug
A user reported orc_test failing when running in their environment. The diffs were in the hour fields of the DateTime values.
One of them was:
FAILED src/main/python/orc_test.py::test_basic_read[{'spark.rapids.sql.format.orc.reader.type': 'PERFILE'}-native--read_orc_df-timestamp-date-test.orc]
along with many others.
After some debugging it turns out they only set:
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
and were missing:
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
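For reference, a submission that pins the timezone in all three places might look like the sketch below (the application jar name is hypothetical; the three `--conf` settings are the ones from this report):

```shell
# Pin the JVM default timezone on both driver and executors, and the
# Spark SQL session timezone, so planning and execution agree.
spark-submit \
  --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC \
  --conf spark.sql.session.timeZone=UTC \
  my-app.jar
```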
The host timezone was set to America/New_York (EDT, -0400).
So here I think planning on the driver passed because the driver was UTC, but the executors weren't UTC, so the data returned wasn't the same as the CPU-generated data.
Perhaps we can add more validation on the executor side to make sure the timezone is UTC; if nothing else, throw so the job fails rather than corrupting data.
Note: I haven't tried to reproduce this yet. Once the user set all the timezone settings properly, the tests started to pass.
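A minimal sketch of what that executor-side check could look like (this is a hypothetical helper, not the actual spark-rapids implementation; in a real plugin the two arguments would come from the driver's plan metadata and the executor's default JVM timezone):

```python
def validate_executor_timezone(driver_tz: str, executor_tz: str) -> None:
    """Fail fast if an executor's timezone differs from the driver's.

    Hypothetical validation: better to throw on the executor than to
    silently return timestamps shifted by the timezone offset.
    """
    if executor_tz != driver_tz:
        raise RuntimeError(
            f"Executor timezone {executor_tz!r} does not match driver "
            f"timezone {driver_tz!r}; query results could be corrupted. "
            "Set spark.executor.extraJavaOptions=-Duser.timezone=UTC."
        )
```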
tgravescs changed the title from "[BUG] mismatching UTC settings on executor and driver can cause ORC read data corruption" to "[BUG] mismatching timezone settings on executor and driver can cause ORC read data corruption" on Oct 29, 2021.
I was able to reproduce this bug on a YARN cluster by passing --conf spark.driver.extraJavaOptions=-Duser.timezone=America/New_York to the tests.
After looking further into the results, the mismatch between the GPU and CPU results is that on the CPU the timestamps are read in the provided timezone, i.e. America/New_York (EDT), but in the GPU results the timestamps are read in UTC.
I thought it would have failed with an unsupported data type error (as spark-rapids only supports UTC), but that is not the case. The GPU is reading in UTC.
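The kind of diff this produces can be illustrated with plain Python datetimes (the instant below is an arbitrary example; a fixed UTC-4 offset stands in for EDT): the same instant rendered in UTC (as the GPU does) versus America/New_York (as the CPU does) differs in the hour field.

```python
from datetime import datetime, timedelta, timezone

# One instant, two renderings: UTC vs a fixed UTC-4 offset (EDT stand-in).
instant = datetime(2021, 10, 29, 12, 0, 0, tzinfo=timezone.utc)
edt = timezone(timedelta(hours=-4))

utc_hour = instant.hour                  # 12 (GPU-style, UTC)
local_hour = instant.astimezone(edt).hour  # 8 (CPU-style, EDT)
```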