
Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark #9536

Merged
5 commits merged into NVIDIA:branch-23.12 from the non-utc branch on Oct 27, 2023

Conversation

res-life (Collaborator) commented Oct 25, 2023

Contributes to #6832.

In order to test GPUTimeZoneDB, we first implement a CpuTimeZoneDB and test the CPU POC against Spark.
When the GPU kernel is ready, it will be easy to switch the tests to it.
Ultimately we want to test every time zone and every time point (every 15 minutes from year 0001 to year 9999):

  • Add a CPU POC of TimeZoneDB
  • Test the Shanghai time zone and 2 other random time zones by comparing the CPU POC and Spark

Now this PR tests the Shanghai time zone from year 1 to year 9999, stepping by 7 years.
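
A rough, hypothetical sketch of that sweep (not the PR's code; cpuPocToUtc stands in for the POC's conversion, and Spark's DateTimeUtils.toUTCTime serves as the reference):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object SweepSketch {
      // Compare a CPU POC conversion against Spark for one zone, stepping
      // roughly 7 years at a time from year 1 to year 9999.
      def check(zone: ZoneId, cpuPocToUtc: (Long, ZoneId) => Long): Unit = {
        val microsPerYear = 365L * 24 * 60 * 60 * 1000 * 1000
        var micros = -62135596800000000L // about 0001-01-01T00:00:00Z
        val end = 253402300799000000L    // about 9999-12-31T23:59:59Z
        while (micros < end) {
          val expected = DateTimeUtils.toUTCTime(micros, zone.getId)
          assert(cpuPocToUtc(micros, zone) == expected)
          micros += 7 * microsPerYear
        }
      }
    }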

Signed-off-by: Chong Gao res_life@163.com

*
* @return
*/
def convertToUTC(inputVector: ColumnVector, currentTimeZone: ZoneId): ColumnVector = {

A Collaborator commented:

I am kind of confused by some of the code here. I think the equivalent of these functions are DateTimeUtils.fromUTCTime and DateTimeUtils.toUTCTime.

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L491-L505

Which ends up being implemented with SparkDateTimeUtils.convertTz

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L130-L133

And for conversion to/from dates it is SparkDateTimeUtils.daysToMicros and SparkDateTimeUtils.microsToDays

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L159-L172

If this is just for testing and getting things started, it feels very much like we should just use these methods instead of implementing something ourselves. It also might be nice to split up the timestamp and date APIs so it is clearer what we expect to be passed in.
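
A minimal sketch of delegating to Spark's utilities as suggested (scalar signatures shown for clarity; wiring this over a ColumnVector is omitted, and the object name is illustrative):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object SparkRefConversions {
      // Timestamp whose wall clock is rendered in zone -> UTC microseconds.
      def toUtc(micros: Long, zone: ZoneId): Long =
        DateTimeUtils.toUTCTime(micros, zone.getId)

      // UTC microseconds -> the wall clock of zone.
      def fromUtc(micros: Long, zone: ZoneId): Long =
        DateTimeUtils.fromUTCTime(micros, zone.getId)

      // Date as epoch days -> microseconds at local midnight in zone.
      def dateToMicros(epochDays: Int, zone: ZoneId): Long =
        DateTimeUtils.daysToMicros(epochDays, zone)

      // Microseconds -> date as epoch days in zone.
      def microsToDate(micros: Long, zone: ZoneId): Int =
        DateTimeUtils.microsToDays(micros, zone)
    }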


res-life (Collaborator, Author) replied:
Done.

revans2
revans2 previously approved these changes Oct 26, 2023
}

test("test all time zones") {
assume(false,

A Collaborator commented:

Right now we are doing one time zone. Should we have a few time zones enabled by default? Could we also reduce the number of years we test so that we don't run the loop 9999 times? Perhaps every 7 years or every 13 years?

Just as a side note, we are going to have the same problem with testing different time zones for all kinds of operators. We probably want a representative handful of time zones that we try to test in all cases. We also have the other problem of testing what happens if the default time zone is set differently. That gets even harder because we might need to launch a new JVM/Python process for each setting to configure it properly.


res-life (Collaborator, Author) commented Oct 27, 2023:

Should we have a few time zones enabled by default?

Added "America/Sao_Paulo" (Brazil) and "Asia/Shanghai", and the test also randomly selects 2 more zones.

Perhaps every 7 years or every 13 years?

Now it's 7 years.

We also have the other problem of testing what happens if the default timezone is set differently.

Our plugin already checks this; the default time zone on the executors must be the same as on the driver:
https://github.com/NVIDIA/spark-rapids/blob/v23.08.2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Plugin.scala#L338-L341

        if (executorTimezone.normalized() != driverTimezone.normalized()) {
          throw new RuntimeException(s" Driver and executor timezone mismatch. " +
              s"Driver timezone is $driverTimezone and executor timezone is " +
              s"$executorTimezone. Set executor timezone to $driverTimezone.")
        }

revans2 (Collaborator) commented Oct 26, 2023:

It looks like you need to update the formatting for some of the code here.

winningsix
winningsix previously approved these changes Oct 26, 2023

winningsix (Collaborator) left a comment:

Just a few nits; we can address them in the next PR.

test("test all time zones") {
assume(false,
"It's time consuming for test all time zones, by default it's disabled")
// val zones = ZoneId.getAvailableZoneIds.asScala.toList.map(z => ZoneId.of(z)).filter { z =>

A Collaborator commented:

How about making this a fuzzer that randomly picks some time zones to verify? We can still keep tests for basic zone IDs (e.g., Asia/Shanghai) below.


res-life (Collaborator, Author) replied:

Now the test randomly selects 2 time zones.
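
A minimal sketch of that sampling (a hypothetical helper, not the PR's code): a couple of fixed representative zones plus 2 drawn at random from the JVM's available zone IDs.

    import java.time.ZoneId

    import scala.collection.JavaConverters._
    import scala.util.Random

    object ZoneSampling {
      // Always-tested representative zones.
      val fixedZones: Seq[ZoneId] = Seq("Asia/Shanghai", "America/Sao_Paulo").map(ZoneId.of)

      // Fixed zones plus randomCount zones sampled from all available IDs.
      def zonesToTest(randomCount: Int = 2): Seq[ZoneId] = {
        val all = ZoneId.getAvailableZoneIds.asScala.toSeq
        fixedZones ++ Random.shuffle(all).take(randomCount).map(ZoneId.of)
      }
    }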


import org.apache.spark.sql.catalyst.util.DateTimeUtils

object CpuTimeZoneDB {

A Collaborator commented:

Nit: probably call it TimeZoneDB directly. As discussed, this class will later also provide a GPU path, controlled by an internal RAPIDS configuration.
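
A hypothetical sketch of that shape (the name, flag, and fallback are illustrative only, not the plugin's actual API):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object TimeZoneDBSketch {
      // Would be driven by an internal RAPIDS configuration in practice.
      private val useGpu = false

      def toUtc(micros: Long, zone: ZoneId): Long =
        if (useGpu) {
          // The GPU kernel path plugs in here once the kernel is ready.
          throw new UnsupportedOperationException("GPU path not implemented yet")
        } else {
          DateTimeUtils.toUTCTime(micros, zone.getId)
        }
    }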


res-life (Collaborator, Author) replied:

Done.

res-life (Collaborator, Author) commented:

build

res-life (Collaborator, Author) commented Oct 27, 2023:

It looks like you need to update the formatting for some of the code here.

Done.

res-life marked this pull request as ready for review on October 27, 2023 05:24.
res-life changed the title from "Add CPU POC of TimeZoneDB; Test Shanghai time zone by comparing CPU POC and Spark" to "Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark" on Oct 27, 2023.

winningsix (Collaborator) left a comment:

LGTM.

res-life merged commit a0815e0 into NVIDIA:branch-23.12 on Oct 27, 2023; 29 of 30 checks passed.
res-life deleted the non-utc branch on October 27, 2023 09:47.
sameerz added the "test (Only impacts tests)" label on Oct 31, 2023.

Labels: test (Only impacts tests)