Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support tagging Cast for timezone conditionally #7894

Merged
merged 13 commits into from
Mar 23, 2023

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Mar 16, 2023

fixes #7893
fixes #6138

Cast is a special TimeZoneAwareExpression who requires tagging for timezone only when it has timetamp or date type as input or output. This PR adds in the support for this feature.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

build

@firestarman firestarman marked this pull request as draft March 16, 2023 09:17
@firestarman
Copy link
Collaborator Author

Need to double confirm if any timezone aware expression will escape from the type checks.

@firestarman firestarman changed the title Remove the timezone check for the TimeZoneAwareExpression Support tagging Cast for timezone conditionally Mar 17, 2023
@firestarman firestarman marked this pull request as ready for review March 17, 2023 04:28
@firestarman
Copy link
Collaborator Author

build

Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
@firestarman
Copy link
Collaborator Author

build

@gerashegalov
Copy link
Collaborator

gerashegalov commented Mar 17, 2023

The issue was previously raised in #6138. Thanks for fixing this. LGTM pending other review items

firestarman and others added 4 commits March 20, 2023 02:49
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Liangcai Li <liangcail@nvidia.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

build

@firestarman firestarman requested a review from jlowe March 20, 2023 07:21
@@ -840,6 +842,10 @@ object TypeChecks {
areTimestampsSupported(ZoneId.systemDefault()) &&
areTimestampsSupported(SQLConf.get.sessionLocalTimeZone)
}

def isTimezoneSensitiveType(dataType: DataType): Boolean = {
dataType == DateType || dataType == TimestampType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a DateType really timezone sensitive? Seems like it depends on what you're doing with the date rather than declaring "all dates are a problem if timezone is not UTC." For example, casting a date to a timestamp would be problematic (mostly for the timestamp part) but casting a date to a string seems unrelated to a timezone setting (unless the timezone setting appears in the resulting string somehow).

Copy link
Collaborator Author

@firestarman firestarman Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. What I currently know is Spark has a timezone requirement check in Cast. Conversions between String and Date also need time zone.

  /**
   * Return true if we need to use the `timeZone` information casting `from` type to `to` type.
   * The patterns matched reflect the current implementation in the Cast node.
   * c.f. usage of `timeZone` in:
   * * Cast.castToString
   * * Cast.castToDate
   * * Cast.castToTimestamp
   */
  def needsTimeZone(from: DataType, to: DataType): Boolean = (from, to) match {
    case (StringType, TimestampType | DateType) => true
    case (TimestampType | DateType, StringType) => true
    case (DateType, TimestampType) => true
    case (TimestampType, DateType) => true
    case (ArrayType(fromType, _), ArrayType(toType, _)) => needsTimeZone(fromType, toType)
    case (MapType(fromKey, fromValue, _), MapType(toKey, toValue, _)) =>
      needsTimeZone(fromKey, toKey) || needsTimeZone(fromValue, toValue)
    case (StructType(fromFields), StructType(toFields)) =>
      fromFields.length == toFields.length &&
        fromFields.zip(toFields).exists {
          case (fromField, toField) =>
            needsTimeZone(fromField.dataType, toField.dataType)
        }
    case _ => false
  }

Besides, DateAddInterval is an expression for the pure date calculation, and it is also a TimeZoneAwareExpression. I checked its eval code, and it indeed passes in the zoneId.

     val startTs = DateTimeUtils.daysToMicros(start.asInstanceOf[Int], zoneId)
     val resultTs = DateTimeUtils.timestampAddInterval(
       startTs, itvl.months, itvl.days, itvl.microseconds, zoneId)
     DateTimeUtils.microsToDays(resultTs, zoneId)

I guess DateType is a time zone sensitive type for most cases, but not 100% sure. Since Cast does not need timezone for casting a Date to an Integer or Long.

So we do not treat DateType as a time zone sensitive type in type checks, but do for a TimeZoneAwareExpression except Cast. And for Cast, we can leverage the Spark Cast.needsTimeZone API.
Does it make sense ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks to be broken to me. Look at the code for casting a sting to a date.

https://github.com/apache/spark/blob/ef0a76eeea30fabb04499908b04124464225f5fd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L844-L849

String to date does not take time zone info into account at all. Timestamp to date does. I can see that Spark should take time zone info into account when doing it. Because a string can technically have a time zone associated with it. But the current Spark code ignores it.

I am fine with keeping it as is to match Spark, but we should file something, because Spark has a bug either in not taking the timezone into account or a bug in marking it as needing the time zone.

As for DateAddInterval it is NOT just adding dates. It is adding an interval, This could include microseconds, so it really ends up being CAST(AddInterval(CAST(date_col, TimeStamp), interval_value), Date) That is why it uses/needs a time zone. But in reality they are being dumb and the only time they need to deal with a time zone is if microseconds is non zero. But in the general case they cannot tell this so they do the generic processing all the time. Because we don't support time zones we do specific processing to make it work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little spelunking and 3 years ago https://issues.apache.org/jira/browse/SPARK-35581 removed zone ID from being used as a part of the cast. This went into 3.2.0. So anything in 3.1.X uses the time zone when doing a cast from string to date. Technically even then the time zone was only used for for special cases that were removed as a part of the Spark issue. So it is a bug in Spark. We probably should put up a small PR to fix the issue in Spark. @firestarman if you want to do it the go ahead. If not I'll put up a PR.

Copy link
Collaborator Author

@firestarman firestarman Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @revans2, can you help file a PR for that ? And thx for the details.

Copy link
Collaborator Author

@firestarman firestarman Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I choose to do timzone tagging for Cast only when TimestampType exists in input or output types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://issues.apache.org/jira/browse/SPARK-42898 Will try to get a PR up soon-ish

Signed-off-by: Liangcai Li <liangcail@nvidia.com>
@firestarman firestarman requested a review from jlowe March 21, 2023 03:21
@firestarman
Copy link
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Copy link
Collaborator Author

build

@firestarman firestarman merged commit b078e0f into NVIDIA:branch-23.04 Mar 23, 2023
@firestarman firestarman deleted the fix-7893 branch March 23, 2023 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants