Add CPU POC of TimeZoneDB; Test some time zones by comparing CPU POC and Spark #9536
Conversation
 *
 * @return
 */
def convertToUTC(inputVector: ColumnVector, currentTimeZone: ZoneId): ColumnVector = {
I am kind of confused by some of the code here. I think the equivalents of these functions are DateTimeUtils.fromUTCTime and DateTimeUtils.toUTCTime, which end up being implemented with SparkDateTimeUtils.convertTz. For conversion to/from dates it is SparkDateTimeUtils.daysToMicros and SparkDateTimeUtils.microsToDays.
To me, if this is just for testing and getting things started, it feels very much like we should just be using these methods instead of trying to implement something ourselves. It also might be nice to split up the timestamp and date APIs so it is clearer what we expect to be passed in.
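For reference, here is a minimal `java.time` sketch of the kind of wall-clock re-zoning these Spark helpers perform. The class name and argument convention are my own for illustration; Spark's actual `convertTz` signature and argument order may differ, so treat this as a sketch of the semantics, not the plugin's API.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class ConvertTzSketch {
    // Reinterpret a timestamp's wall clock from one zone in another zone:
    // read the wall clock of `micros` in fromZone, then resolve that same
    // wall clock as an instant in toZone. (Illustrative convention only.)
    static long convertTz(long micros, ZoneId fromZone, ZoneId toZone) {
        Instant instant = Instant.ofEpochSecond(
                Math.floorDiv(micros, 1_000_000L),
                Math.floorMod(micros, 1_000_000L) * 1_000L);
        LocalDateTime wallClock = LocalDateTime.ofInstant(instant, fromZone);
        Instant shifted = wallClock.atZone(toZone).toInstant();
        return Math.multiplyExact(shifted.getEpochSecond(), 1_000_000L)
                + shifted.getNano() / 1_000L;
    }

    public static void main(String[] args) {
        ZoneId shanghai = ZoneId.of("Asia/Shanghai"); // fixed UTC+8, no DST
        long micros = 1_672_531_200_000_000L;         // 2023-01-01T00:00:00Z
        long shifted = convertTz(micros, shanghai, ZoneOffset.UTC);
        // The wall clock moves by Shanghai's offset from UTC: 8 hours.
        System.out.println((shifted - micros) / 3_600_000_000L);
    }
}
```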
Done.
}

test("test all time zones") {
  assume(false,
Right now we are doing one time zone. Should we have a few time zones enabled by default? Could we also reduce the number of years that we test so that we don't run the loop 9999 times? Perhaps every 7 years or every 13 years?
Just as a side note, we are going to have the same problem with testing different time zones for all kinds of operators. We probably want a representative handful of time zones that we try to test in all cases. We also have the other problem of testing what happens if the default time zone is set differently. That gets to be even harder, because we might need to launch a new JVM/python process for each of those to set it properly.
> Should we have a few time zones enabled by default?

Added "America/Sao_Paulo" (Brazil) and "Asia/Shanghai", and randomly select 2 more zones.

> Perhaps every 7 years or every 13 years?

Now it's 7 years.

> We also have the other problem of testing what happens if the default timezone is set differently.

Our plugin already tests this; the default time zone on the executors must be the same as on the driver:
https://github.com/NVIDIA/spark-rapids/blob/v23.08.2/sql-plugin/src/main/scala/com/nvidia/spark/rapids/Plugin.scala#L338-L341
if (executorTimezone.normalized() != driverTimezone.normalized()) {
throw new RuntimeException(s" Driver and executor timezone mismatch. " +
s"Driver timezone is $driverTimezone and executor timezone is " +
s"$executorTimezone. Set executor timezone to $driverTimezone.")
}
It looks like you need to update the formatting for some of the code here.
Just a few nits. We can address them in the next PR.
test("test all time zones") {
  assume(false,
    "It's time consuming for test all time zones, by default it's disabled")
  // val zones = ZoneId.getAvailableZoneIds.asScala.toList.map(z => ZoneId.of(z)).filter { z =>
How about making this a fuzzer that randomly picks some time zones to verify? We can still reserve tests for basic zone ids (e.g., Asia/Shanghai) below.
Now it randomly selects 2 time zones to test.
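A hypothetical sketch of that selection strategy on the JVM: always test a fixed baseline of zones, then add a couple sampled at random from the platform's zone database. The class name and seeding scheme here are illustrative, not the test suite's actual code.

```java
import java.time.ZoneId;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ZonePicker {
    // Baseline zones always tested, plus `randomCount` zones sampled from
    // the JVM's available zone ids. A fixed seed keeps failures reproducible.
    static List<String> pickZones(long seed, int randomCount) {
        List<String> zones = new ArrayList<>(
                List.of("America/Sao_Paulo", "Asia/Shanghai"));
        List<String> all = new ArrayList<>(ZoneId.getAvailableZoneIds());
        Collections.sort(all);                       // deterministic order first
        Collections.shuffle(all, new Random(seed));  // then a seeded shuffle
        zones.addAll(all.subList(0, randomCount));
        return zones;
    }

    public static void main(String[] args) {
        // 2 baseline zones + 2 randomly sampled ones.
        System.out.println(pickZones(42L, 2).size());
    }
}
```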
import org.apache.spark.sql.catalyst.util.DateTimeUtils

object CpuTimeZoneDB {
nit: Probably call it TimeZoneDB directly. As discussed, later this class will also provide a GPU path, controlled by an internal Rapids configuration.
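A sketch of the dispatch shape the reviewer describes: one TimeZoneDB entry point that routes to a CPU or GPU implementation based on a configuration flag. All names and the interface here are hypothetical placeholders, not the plugin's real API.

```java
public class TimeZoneDBSketch {
    // Hypothetical converter interface; the real plugin works on column
    // vectors rather than single scalar values.
    interface Converter {
        long toUtcMicros(long micros);
    }

    // Placeholder implementations; the real CPU/GPU paths would apply the
    // actual time-zone rules.
    static final Converter CPU = micros -> micros;
    static final Converter GPU = micros -> micros;

    // Route based on a config flag, standing in for the internal Rapids
    // configuration mentioned in the review.
    static Converter select(boolean useGpu) {
        return useGpu ? GPU : CPU;
    }

    public static void main(String[] args) {
        // With the flag off, the CPU path is chosen.
        System.out.println(select(false) == CPU);
    }
}
```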
Done.
build
Done
LGTM.
Contributes to #6832.
In order to test GPUTimeZoneDB, we first implement a CPU TimeZoneDB and test the CPU POC against Spark. When the GPU kernel is ready, it will be easy to shift to testing the GPU kernel.
Ideally we want to test every time zone and every time point (every 15 minutes from year 0001 to year 9999). For now, this PR tests the Shanghai time zone from year 1 to year 9999 in steps of 7 years.
Signed-off-by: Chong Gao res_life@163.com
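The year-stepping strategy described above can be sketched as follows: rather than enumerating every 15-minute instant across years 1..9999, walk the years with a stride of 7 and emit one timestamp (in microseconds) per sampled year. The class name and one-instant-per-year choice are illustrative; the actual test generates more points per year.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

public class TestPointGenerator {
    // Sample one microsecond timestamp (Jan 1, midnight UTC) for each year
    // in [startYear, endYear], stepping by yearStep.
    static List<Long> sampleMicros(int startYear, int endYear, int yearStep) {
        List<Long> micros = new ArrayList<>();
        for (int year = startYear; year <= endYear; year += yearStep) {
            LocalDateTime ldt = LocalDateTime.of(year, 1, 1, 0, 0, 0);
            micros.add(ldt.toEpochSecond(ZoneOffset.UTC) * 1_000_000L);
        }
        return micros;
    }

    public static void main(String[] args) {
        // Years 1, 8, 15, ..., 9997: floor((9999 - 1) / 7) + 1 = 1429 samples,
        // versus ~350 million points at 15-minute granularity.
        System.out.println(sampleMicros(1, 9999, 7).size());
    }
}
```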