
[BUG] driver time zone check does not cover run-time default timezone changes #5820

Open
gerashegalov opened this issue Jun 14, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@gerashegalov (Collaborator)

gerashegalov commented Jun 14, 2022

Describe the bug
We have init code in the executor that is supposed to reject a non-UTC default timezone on the executor side when the driver side is on UTC.

However, it does not account for the fact that the JVM's default timezone is mutable at run time.
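The mutability is easy to demonstrate in isolation: `ZoneId.systemDefault()` re-reads the JVM default on every call, so any value captured by a one-time check goes stale as soon as `TimeZone.setDefault` is invoked. A minimal standalone sketch (class name is illustrative, not from the plugin):

```java
import java.time.ZoneId;
import java.util.TimeZone;

public class TimezoneMutability {
    // The JVM default zone as seen through ZoneId.systemDefault();
    // this is NOT a constant -- it tracks TimeZone.setDefault.
    static String currentZone() {
        return ZoneId.systemDefault().getId();
    }

    public static void main(String[] args) {
        TimeZone original = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone("GMT-8"));
            System.out.println(currentZone()); // a GMT-08:00 zone id
            // Any validation that ran before this call is now stale.
            TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
            System.out.println(currentZone()); // UTC
        } finally {
            TimeZone.setDefault(original);
        }
    }
}
```

A check that samples the zone once at executor start-up therefore cannot see a later `setDefault` call.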

Steps/Code to reproduce bug

Start the driver and executor in GMT-8:
 $SPARK_HOME/bin/spark-shell \
  --jars ./dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --driver-java-options -Duser.timezone="GMT-8" \
  --conf spark.executor.extraJavaOptions="-Duser.timezone=GMT-8" \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --master local-cluster[1,1,1200]

This satisfies the executor-side check at start-up:

if (TypeChecks.areTimestampsSupported(driverTimezone)) {
  val executorTimezone = ZoneId.systemDefault()
  if (executorTimezone.normalized() != driverTimezone.normalized()) {
    throw new RuntimeException(s" Driver and executor timezone mismatch. " +
      s"Driver timezone is $driverTimezone and executor timezone is " +
      s"$executorTimezone. Set executor timezone to $driverTimezone.")
  }
}
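One way to close the gap is to keep the driver's zone and re-validate against the live `ZoneId.systemDefault()` on the timestamp path, rather than only once at executor init. A hedged sketch in Java; `TimezoneGuard` and its methods are hypothetical names, not the plugin's actual API:

```java
import java.time.ZoneId;

// Sketch: re-check the executor's *current* default zone against the
// driver's zone each time timestamp work is about to run, so a
// TimeZone.setDefault call made after start-up is still caught.
public class TimezoneGuard {
    private final ZoneId driverZone;

    public TimezoneGuard(ZoneId driverZone) {
        this.driverZone = driverZone;
    }

    // Call before GPU timestamp processing, not just at executor init.
    public void check() {
        ZoneId current = ZoneId.systemDefault();
        if (!current.normalized().equals(driverZone.normalized())) {
            throw new IllegalStateException(
                "Driver timezone " + driverZone +
                " does not match current executor timezone " + current);
        }
    }
}
```

The trade-off is an extra `ZoneId.systemDefault()` read per check on the hot path, which is cheap but not free.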

Change the default timezone to UTC on the driver:
scala> java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("UTC"))
Read the ORC file from test_basic_reads on the GPU:
scala> spark.read.orc("integration_tests/src/test/resources/timestamp-date-test.orc").select($"time").take(1)
22/06/15 22:27:49 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <FileSourceScanExec> will run on GPU

res45: Array[org.apache.spark.sql.Row] = Array([1900-05-05 12:34:56.1])
Then read it on the CPU:
scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> spark.read.orc("integration_tests/src/test/resources/timestamp-date-test.orc").select($"time").take(1)
res47: Array[org.apache.spark.sql.Row] = Array([1900-05-05 20:34:56.1])

and observe an 8-hour difference between the CPU and GPU results.
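The 8-hour gap matches the fixed offset between the two zones involved: the GPU path renders the value as if in UTC while the CPU path uses GMT-8. A small check of that arithmetic (`OffsetDemo` is an illustrative name):

```java
import java.time.Instant;
import java.time.ZoneId;

public class OffsetDemo {
    // Difference in UTC-offset, in hours, between UTC and GMT-8 at instant t.
    // GMT-8 is a fixed offset with no DST, so this is always 8 -- the same
    // gap seen between the GPU (12:34:56.1) and CPU (20:34:56.1) rows above.
    static long hoursBetweenZones(Instant t) {
        int utcOffset = ZoneId.of("UTC").getRules().getOffset(t).getTotalSeconds();
        int gmt8Offset = ZoneId.of("GMT-8").getRules().getOffset(t).getTotalSeconds();
        return (utcOffset - gmt8Offset) / 3600;
    }

    public static void main(String[] args) {
        System.out.println(hoursBetweenZones(Instant.now())); // prints 8
    }
}
```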

Expected behavior
The mismatch check should still fire (or the results should match the CPU) even when the default timezone is changed after executor start-up.
Environment details (please complete the following information)

  • Environment location: any
  • Spark configuration settings related to the issue: see repro

Additional context

Originally posted by @gerashegalov in #5767 (comment)

@gerashegalov gerashegalov added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 16, 2022
@gerashegalov gerashegalov changed the title [BUG] driver time zone check is brittle [BUG] driver time zone check does not cover run-time default timezone changes Jun 16, 2022
@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Jun 21, 2022
@mattahrens mattahrens removed the P1 Nice to have for release label Aug 7, 2023