
Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark #9536

Merged
5 commits merged into NVIDIA:branch-23.12 from the non-utc branch on Oct 27, 2023

Conversation

res-life (Collaborator) commented Oct 25, 2023

Contributes to #6832.

In order to test GPUTimeZoneDB, we first implement a CpuTimeZoneDB and test the CPU POC against Spark.
When the GPU kernel is ready, it will be easy to switch the tests to it.
Ultimately we want to test every time zone and every time point (every 15 minutes from year 0001 to year 9999):

  • Add a CPU POC of TimeZoneDB
  • Test the Shanghai time zone and 2 other random time zones by comparing the CPU POC and Spark

Now this PR tests the Shanghai time zone from year 1 to year 9999, stepping by 7 years.
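
A rough, hypothetical sketch of that sweep (not the PR's code; cpuPocToUtc stands in for the POC's conversion, and Spark's DateTimeUtils.toUTCTime serves as the reference):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object SweepSketch {
      // Compare a CPU POC conversion against Spark for one zone, stepping
      // roughly 7 years at a time from year 1 to year 9999.
      def check(zone: ZoneId, cpuPocToUtc: (Long, ZoneId) => Long): Unit = {
        val microsPerYear = 365L * 24 * 60 * 60 * 1000 * 1000
        var micros = -62135596800000000L // about 0001-01-01T00:00:00Z
        val end = 253402300799000000L    // about 9999-12-31T23:59:59Z
        while (micros < end) {
          val expected = DateTimeUtils.toUTCTime(micros, zone.getId)
          assert(cpuPocToUtc(micros, zone) == expected)
          micros += 7 * microsPerYear
        }
      }
    }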

Signed-off-by: Chong Gao res_life@163.com

*
* @return
*/
def convertToUTC(inputVector: ColumnVector, currentTimeZone: ZoneId): ColumnVector = {

A Collaborator commented:

I am kind of confused by some of the code here. I think the equivalent of these functions are DateTimeUtils.fromUTCTime and DateTimeUtils.toUTCTime.

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L491-L505

Which ends up being implemented with SparkDateTimeUtils.convertTz

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L130-L133

And for conversion to/from dates it is SparkDateTimeUtils.daysToMicros and SparkDateTimeUtils.microsToDays

https://github.com/apache/spark/blob/94607dd001b133a25dc9865f25b3f9e7f5a5daa3/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala#L159-L172

If this is just for testing and getting things started, it feels very much like we should just use these methods instead of implementing something ourselves. It also might be nice to split up the timestamp and date APIs so it is clearer what we expect to be passed in.
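
A minimal sketch of delegating to Spark's utilities as suggested (scalar signatures shown for clarity; wiring this over a ColumnVector is omitted, and the object name is illustrative):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object SparkRefConversions {
      // Timestamp whose wall clock is rendered in zone -> UTC microseconds.
      def toUtc(micros: Long, zone: ZoneId): Long =
        DateTimeUtils.toUTCTime(micros, zone.getId)

      // UTC microseconds -> the wall clock of zone.
      def fromUtc(micros: Long, zone: ZoneId): Long =
        DateTimeUtils.fromUTCTime(micros, zone.getId)

      // Date as epoch days -> microseconds at local midnight in zone.
      def dateToMicros(epochDays: Int, zone: ZoneId): Long =
        DateTimeUtils.daysToMicros(epochDays, zone)

      // Microseconds -> date as epoch days in zone.
      def microsToDate(micros: Long, zone: ZoneId): Int =
        DateTimeUtils.microsToDays(micros, zone)
    }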


res-life (Collaborator, Author) replied:
Done.

revans2
revans2 previously approved these changes Oct 26, 2023
}

test("test all time zones") {
assume(false,

A Collaborator commented:

Right now we are doing one time zone. Should we have a few time zones enabled by default? Could we also reduce the number of years we test so that we don't run the loop 9999 times? Perhaps every 7 years or every 13 years?

Just as a side note, we are going to have the same problem with testing different time zones for all kinds of operators. We probably want a representative handful of time zones that we try to test in all cases. We also have the other problem of testing what happens if the default time zone is set differently. That gets even harder because we might need to launch a new JVM/Python process for each setting to configure it properly.


res-life (Collaborator, Author) commented Oct 27, 2023:

Should we have a few time zones enabled by default?

Added "America/Sao_Paulo" (Brazil) and "Asia/Shanghai", and the test also randomly selects 2 more zones.

Perhaps every 7 years or every 13 years?

Now it's 7 years.

We also have the other problem of testing what happens if the default timezone is set differently.

Our plugin already checks this; the default time zone on the executors must be the same as on the driver:
https://github.com/NVIDIA/spark-rapids/blob/v23.08.2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Plugin.scala#L338-L341

        if (executorTimezone.normalized() != driverTimezone.normalized()) {
          throw new RuntimeException(s" Driver and executor timezone mismatch. " +
              s"Driver timezone is $driverTimezone and executor timezone is " +
              s"$executorTimezone. Set executor timezone to $driverTimezone.")
        }

revans2 (Collaborator) commented Oct 26, 2023:

It looks like you need to update the formatting for some of the code here.

winningsix
winningsix previously approved these changes Oct 26, 2023

winningsix (Collaborator) left a comment:

Just a few nits; we can address them in the next PR.

test("test all time zones") {
assume(false,
"It's time consuming for test all time zones, by default it's disabled")
// val zones = ZoneId.getAvailableZoneIds.asScala.toList.map(z => ZoneId.of(z)).filter { z =>

A Collaborator commented:

How about making this a fuzzer that randomly picks some time zones to verify? We can still keep tests for basic zone IDs (e.g., Asia/Shanghai) below.


res-life (Collaborator, Author) replied:

Now the test randomly selects 2 time zones.
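
A minimal sketch of that sampling (a hypothetical helper, not the PR's code): a couple of fixed representative zones plus 2 drawn at random from the JVM's available zone IDs.

    import java.time.ZoneId

    import scala.collection.JavaConverters._
    import scala.util.Random

    object ZoneSampling {
      // Always-tested representative zones.
      val fixedZones: Seq[ZoneId] = Seq("Asia/Shanghai", "America/Sao_Paulo").map(ZoneId.of)

      // Fixed zones plus randomCount zones sampled from all available IDs.
      def zonesToTest(randomCount: Int = 2): Seq[ZoneId] = {
        val all = ZoneId.getAvailableZoneIds.asScala.toSeq
        fixedZones ++ Random.shuffle(all).take(randomCount).map(ZoneId.of)
      }
    }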


import org.apache.spark.sql.catalyst.util.DateTimeUtils

object CpuTimeZoneDB {

A Collaborator commented:

Nit: probably call it TimeZoneDB directly. As discussed, this class will later also provide a GPU path, controlled by an internal RAPIDS configuration.
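
A hypothetical sketch of that shape (the name, flag, and fallback are illustrative only, not the plugin's actual API):

    import java.time.ZoneId

    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    object TimeZoneDBSketch {
      // Would be driven by an internal RAPIDS configuration in practice.
      private val useGpu = false

      def toUtc(micros: Long, zone: ZoneId): Long =
        if (useGpu) {
          // The GPU kernel path plugs in here once the kernel is ready.
          throw new UnsupportedOperationException("GPU path not implemented yet")
        } else {
          DateTimeUtils.toUTCTime(micros, zone.getId)
        }
    }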


res-life (Collaborator, Author) replied:

Done.

res-life (Collaborator, Author) commented:

build

res-life (Collaborator, Author) commented Oct 27, 2023:

It looks like you need to update the formatting for some of the code here.

Done.

res-life marked this pull request as ready for review on October 27, 2023 05:24.
res-life changed the title from "Add CPU POC of TimeZoneDB; Test Shanghai time zone by comparing CPU POC and Spark" to "Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark" on Oct 27, 2023.

winningsix (Collaborator) left a comment:

LGTM.

res-life merged commit a0815e0 into NVIDIA:branch-23.12 on Oct 27, 2023; 29 of 30 checks passed.
res-life deleted the non-utc branch on October 27, 2023 09:47.
sameerz added the "test (Only impacts tests)" label on Oct 31, 2023.

Labels: test (Only impacts tests)