-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support tagging Cast
for timezone conditionally
#7894
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
Need to double confirm if any timezone aware expression will escape from the type checks. |
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Cast
for timezone conditionally
build |
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
build |
The issue was previously raised in #6138. Thanks for fixing this. LGTM pending other review items |
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
@@ -840,6 +842,10 @@ object TypeChecks { | |||
areTimestampsSupported(ZoneId.systemDefault()) && | |||
areTimestampsSupported(SQLConf.get.sessionLocalTimeZone) | |||
} | |||
|
|||
def isTimezoneSensitiveType(dataType: DataType): Boolean = { | |||
dataType == DateType || dataType == TimestampType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a DateType really timezone sensitive? Seems like it depends on what you're doing with the date rather than declaring "all dates are a problem if timezone is not UTC." For example, casting a date to a timestamp would be problematic (mostly for the timestamp part) but casting a date to a string seems unrelated to a timezone setting (unless the timezone setting appears in the resulting string somehow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. What I currently know is Spark has a timezone requirement check in Cast
. Conversions between String
and Date
also need time zone.
/**
* Return true if we need to use the `timeZone` information casting `from` type to `to` type.
* The patterns matched reflect the current implementation in the Cast node.
* c.f. usage of `timeZone` in:
* * Cast.castToString
* * Cast.castToDate
* * Cast.castToTimestamp
*/
def needsTimeZone(from: DataType, to: DataType): Boolean = (from, to) match {
case (StringType, TimestampType | DateType) => true
case (TimestampType | DateType, StringType) => true
case (DateType, TimestampType) => true
case (TimestampType, DateType) => true
case (ArrayType(fromType, _), ArrayType(toType, _)) => needsTimeZone(fromType, toType)
case (MapType(fromKey, fromValue, _), MapType(toKey, toValue, _)) =>
needsTimeZone(fromKey, toKey) || needsTimeZone(fromValue, toValue)
case (StructType(fromFields), StructType(toFields)) =>
fromFields.length == toFields.length &&
fromFields.zip(toFields).exists {
case (fromField, toField) =>
needsTimeZone(fromField.dataType, toField.dataType)
}
case _ => false
}
Besides, DateAddInterval
is an expression for the pure date calculation, and it is also a TimeZoneAwareExpression. I checked its eval
code, and it indeed passes in the zoneId
.
val startTs = DateTimeUtils.daysToMicros(start.asInstanceOf[Int], zoneId)
val resultTs = DateTimeUtils.timestampAddInterval(
startTs, itvl.months, itvl.days, itvl.microseconds, zoneId)
DateTimeUtils.microsToDays(resultTs, zoneId)
I guess DateType
is a time zone sensitive type for most cases, but not 100% sure. Since Cast
does not need timezone for casting a Date to an Integer or Long.
So we do not treat DateType
as a time zone sensitive type in type checks, but do for a TimeZoneAwareExpression except Cast
. And for Cast
, we can leverage the Spark Cast.needsTimeZone
API.
Does it make sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks to be broken to me. Look at the code for casting a sting to a date.
String to date does not take time zone info into account at all. Timestamp to date does. I can see that Spark should take time zone info into account when doing it. Because a string can technically have a time zone associated with it. But the current Spark code ignores it.
I am fine with keeping it as is to match Spark, but we should file something, because Spark has a bug either in not taking the timezone into account or a bug in marking it as needing the time zone.
As for DateAddInterval it is NOT just adding dates. It is adding an interval, This could include microseconds, so it really ends up being CAST(AddInterval(CAST(date_col, TimeStamp), interval_value), Date)
That is why it uses/needs a time zone. But in reality they are being dumb and the only time they need to deal with a time zone is if microseconds is non zero. But in the general case they cannot tell this so they do the generic processing all the time. Because we don't support time zones we do specific processing to make it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a little spelunking and 3 years ago https://issues.apache.org/jira/browse/SPARK-35581 removed zone ID from being used as a part of the cast. This went into 3.2.0. So anything in 3.1.X uses the time zone when doing a cast from string to date. Technically even then the time zone was only used for for special cases that were removed as a part of the Spark issue. So it is a bug in Spark. We probably should put up a small PR to fix the issue in Spark. @firestarman if you want to do it the go ahead. If not I'll put up a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @revans2, can you help file a PR for that ? And thx for the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, I choose to do timzone tagging for Cast
only when TimestampType
exists in input or output types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://issues.apache.org/jira/browse/SPARK-42898 Will try to get a PR up soon-ish
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
build |
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build |
fixes #7893
fixes #6138
Cast
is a specialTimeZoneAwareExpression
who requires tagging for timezone only when it has timetamp or date type as input or output. This PR adds in the support for this feature.