[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode #30260
@@ -107,6 +107,22 @@ SELECT * FROM t;
+---+
```

The valid combinations of target data type and source data type in a `Cast` expression are given by the following table.
“Y” indicates that the combination is syntactically valid without restriction and “N” indicates that the combination is not valid.

| From\To   | Numeric | String | Date | Timestamp | Interval | Boolean | Binary | Array | Map | Struct |
|-----------|---------|--------|------|-----------|----------|---------|--------|-------|-----|--------|
| Numeric   | Y       | Y      | N    | N         | N        | Y       | N      | N     | N   | N      |
| String    | Y       | Y      | Y    | Y         | Y        | Y       | Y      | N     | N   | N      |
| Date      | N       | Y      | Y    | Y         | N        | N       | N      | N     | N   | N      |
| Timestamp | N       | Y      | Y    | Y         | N        | N       | N      | N     | N   | N      |
| Interval  | N       | Y      | N    | N         | Y        | N       | N      | N     | N   | N      |
| Boolean   | Y       | Y      | N    | N         | N        | Y       | N      | N     | N   | N      |
| Binary    | N       | Y      | N    | N         | N        | N       | Y      | N     | N   | N      |
| Array     | N       | N      | N    | N         | N        | N       | N      | Y     | N   | N      |
| Map       | N       | N      | N    | N         | N        | N       | N      | N     | Y   | N      |
| Struct    | N       | N      | N    | N         | N        | N       | N      | N     | N   | Y      |

Review comment: How about following the wording in the data type document, e.g.,

### SQL Functions

The behavior of some SQL functions can be different under ANSI mode (`spark.sql.ansi.enabled=true`).

@@ -25,6 +25,7 @@ import java.util.concurrent.TimeUnit._
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.{TypeCheckResult, TypeCoercion}
import org.apache.spark.sql.catalyst.expressions.Cast.{canCast, forceNullable, resolvableNullability}
import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.catalyst.util._

@@ -258,13 +259,18 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit

  def dataType: DataType

  /**
   * Returns true iff we can cast `from` type to `to` type.
   */
  def canCast(from: DataType, to: DataType): Boolean

  override def toString: String = {
    val ansi = if (ansiEnabled) "ansi_" else ""
    s"${ansi}cast($child as ${dataType.simpleString})"
  }

  override def checkInputDataTypes(): TypeCheckResult = {
-    if (Cast.canCast(child.dataType, dataType)) {
+    if (canCast(child.dataType, dataType)) {
      TypeCheckResult.TypeCheckSuccess
    } else {
      TypeCheckResult.TypeCheckFailure(

@@ -1753,6 +1759,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
    copy(timeZoneId = Option(timeZoneId))

  override protected val ansiEnabled: Boolean = SQLConf.get.ansiEnabled

  override def canCast(from: DataType, to: DataType): Boolean = if (ansiEnabled) {
Review comment: How about describing this new behaviour in the usage above of
Reply: Well, then we need to mention the behavior of throwing overflow exceptions when the ANSI flag is enabled. I will add some content in the
    AnsiCast.canCast(from, to)
  } else {
    Cast.canCast(from, to)
  }
}

/**

@@ -1770,6 +1782,110 @@ case class AnsiCast(child: Expression, dataType: DataType, timeZoneId: Option[St
    copy(timeZoneId = Option(timeZoneId))

  override protected val ansiEnabled: Boolean = true

  override def canCast(from: DataType, to: DataType): Boolean = AnsiCast.canCast(from, to)
}
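
The two `canCast` overrides above differ only in which rule set they consult: `Cast` follows the ANSI flag, while `AnsiCast` is always strict. The sketch below illustrates that dispatch in isolation. Everything in it is hypothetical and does not use Spark's classes: the string type names, the `CastBaseSketch`, `CastSketch`, and `AnsiCastSketch` classes, and the two tiny rule sets (including the single extra non-ANSI entry, which is only there for illustration).

```scala
// Toy sketch of the dispatch pattern in this change; none of these names are Spark's.
object CastDispatchSketch {
  // Hypothetical rule sets over type names; the real rules live in
  // Cast.canCast and AnsiCast.canCast.
  private val ansiAllowed: Set[(String, String)] =
    Set("int" -> "string", "string" -> "int")
  // Assumption for illustration: the non-ANSI rules accept one extra combination.
  private val legacyAllowed: Set[(String, String)] =
    ansiAllowed + ("int" -> "timestamp")

  abstract class CastBaseSketch(val from: String, val to: String) {
    // Each concrete cast decides which rule set applies.
    def canCast(from: String, to: String): Boolean

    // Mirrors checkInputDataTypes: succeed or report a type-check failure.
    def check(): Either[String, Unit] =
      if (canCast(from, to)) Right(()) else Left(s"cannot cast $from to $to")
  }

  // Follows the session flag, like Cast.
  final class CastSketch(src: String, dst: String, ansiEnabled: Boolean)
      extends CastBaseSketch(src, dst) {
    override def canCast(from: String, to: String): Boolean =
      if (ansiEnabled) ansiAllowed.contains(from -> to)
      else legacyAllowed.contains(from -> to)
  }

  // Always strict, like AnsiCast.
  final class AnsiCastSketch(src: String, dst: String)
      extends CastBaseSketch(src, dst) {
    override def canCast(from: String, to: String): Boolean =
      ansiAllowed.contains(from -> to)
  }

  def main(args: Array[String]): Unit = {
    println(new CastSketch("int", "timestamp", ansiEnabled = false).check())
    println(new CastSketch("int", "timestamp", ansiEnabled = true).check())
    println(new AnsiCastSketch("int", "string").check())
  }
}
```

Declaring `canCast` abstract on the base class keeps the type check itself in one place, so the two expressions only have to say which rules they want.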

object AnsiCast {
Review comment: Could you leave some comments to summarize the current behaviour of the ANSI explicit cast as described in the PR description (references
  /**
   * As per section 6.13 "cast specification" in "Information technology — Database languages -
   * SQL — Part 2: Foundation (SQL/Foundation)":
Review comment: Nice update. Thanks!
   * If the <cast operand> is a <value expression>, then the valid combinations of TD and SD
   * in a <cast specification> are given by the following table. “Y” indicates that the
   * combination is syntactically valid without restriction; “M” indicates that the combination
   * is valid subject to other Syntax Rules in this Subclause being satisfied; and “N” indicates
   * that the combination is not valid:
   *   SD                   TD
   *        EN  AN  C   D   T   TS  YM  DT  BO  UDT B   RT  CT  RW
   *   EN   Y   Y   Y   N   N   N   M   M   N   M   N   M   N   N
   *   AN   Y   Y   Y   N   N   N   N   N   N   M   N   M   N   N
   *   C    Y   Y   Y   Y   Y   Y   Y   Y   Y   M   N   M   N   N
   *   D    N   N   Y   Y   N   Y   N   N   N   M   N   M   N   N
   *   T    N   N   Y   N   Y   Y   N   N   N   M   N   M   N   N
   *   TS   N   N   Y   Y   Y   Y   N   N   N   M   N   M   N   N
   *   YM   M   N   Y   N   N   N   Y   N   N   M   N   M   N   N
   *   DT   M   N   Y   N   N   N   N   Y   N   M   N   M   N   N
   *   BO   N   N   Y   N   N   N   N   N   Y   M   N   M   N   N
   *   UDT  M   M   M   M   M   M   M   M   M   M   M   M   M   N
   *   B    N   N   N   N   N   N   N   N   N   M   Y   M   N   N
   *   RT   M   M   M   M   M   M   M   M   M   M   M   M   N   N
   *   CT   N   N   N   N   N   N   N   N   N   M   N   N   M   N
   *   RW   N   N   N   N   N   N   N   N   N   N   N   N   N   M
   *
   * Where:
   *   EN  = Exact Numeric
   *   AN  = Approximate Numeric
   *   C   = Character (Fixed- or Variable-Length, or Character Large Object)
   *   D   = Date
   *   T   = Time
   *   TS  = Timestamp
   *   YM  = Year-Month Interval
   *   DT  = Day-Time Interval
   *   BO  = Boolean
   *   UDT = User-Defined Type
   *   B   = Binary (Fixed- or Variable-Length or Binary Large Object)
   *   RT  = Reference type
   *   CT  = Collection type
   *   RW  = Row type
   *
   * Spark's ANSI mode follows the syntax rules, except that it specially allows the following
   * straightforward type conversions, which are disallowed by the SQL standard:
   *   - Numeric <=> Boolean
   *   - String <=> Binary
Review comment (on lines +1831 to +1834): How about describing the special case above in
   */
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (fromType, toType) if fromType == toType => true

    case (NullType, _) => true

    case (StringType, _: BinaryType) => true

    case (StringType, BooleanType) => true
    case (_: NumericType, BooleanType) => true

    case (StringType, TimestampType) => true
    case (DateType, TimestampType) => true

    case (StringType, _: CalendarIntervalType) => true

    case (StringType, DateType) => true
    case (TimestampType, DateType) => true

    case (_: NumericType, _: NumericType) => true
    case (StringType, _: NumericType) => true
    case (BooleanType, _: NumericType) => true
Review comment (on lines +1854 to +1856): (Just a suggestion) For readability, could you reorder these entries according to spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala Lines 70 to 74 in 35ac314

    case (_: NumericType, StringType) => true
    case (_: DateType, StringType) => true
    case (_: TimestampType, StringType) => true
    case (_: CalendarIntervalType, StringType) => true
    case (BooleanType, StringType) => true
    case (BinaryType, StringType) => true

    case (ArrayType(fromType, fn), ArrayType(toType, tn)) =>
      canCast(fromType, toType) &&
        resolvableNullability(fn || forceNullable(fromType, toType), tn)

    case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) =>
      canCast(fromKey, toKey) &&
        (!forceNullable(fromKey, toKey)) &&
        canCast(fromValue, toValue) &&
        resolvableNullability(fn || forceNullable(fromValue, toValue), tn)

    case (StructType(fromFields), StructType(toFields)) =>
      fromFields.length == toFields.length &&
        fromFields.zip(toFields).forall {
          case (fromField, toField) =>
            canCast(fromField.dataType, toField.dataType) &&
              resolvableNullability(
                fromField.nullable || forceNullable(fromField.dataType, toField.dataType),
                toField.nullable)
        }

    case (udt1: UserDefinedType[_], udt2: UserDefinedType[_]) if udt2.acceptsType(udt1) => true

    case _ => false
  }
}
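
The atomic-type rules and the element-wise recursion for arrays in `AnsiCast.canCast` can be exercised with a small standalone model. This is a simplified sketch, not Spark code: `SimpleType` and its cases are hypothetical stand-ins for Spark's `DataType` classes, all numeric types are collapsed into one case, and nullability checks, `NullType`, maps, structs, and user-defined types are left out.

```scala
// Simplified standalone model of the rules above; these types are hypothetical, not Spark's.
object AnsiCastRulesSketch {
  sealed trait SimpleType
  case object NumericT extends SimpleType
  case object StringT extends SimpleType
  case object DateT extends SimpleType
  case object TimestampT extends SimpleType
  case object IntervalT extends SimpleType
  case object BooleanT extends SimpleType
  case object BinaryT extends SimpleType
  final case class ArrayT(element: SimpleType) extends SimpleType

  def canCast(from: SimpleType, to: SimpleType): Boolean = (from, to) match {
    case (f, t) if f == t => true
    // Arrays cast only to arrays, element by element (nullability is ignored here).
    case (ArrayT(f), ArrayT(t)) => canCast(f, t)
    case (ArrayT(_), _) | (_, ArrayT(_)) => false
    // Every atomic type in the rules above casts to and from String.
    case (StringT, _) | (_, StringT) => true
    // Allowed by Spark although the standard's matrix marks it "N".
    case (NumericT, BooleanT) | (BooleanT, NumericT) => true
    case (DateT, TimestampT) | (TimestampT, DateT) => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    println(canCast(NumericT, BooleanT))
    println(canCast(NumericT, DateT))
    println(canCast(ArrayT(NumericT), ArrayT(StringT)))
    println(canCast(ArrayT(NumericT), StringT))
  }
}
```

In this model, as in the rules above, an array does not cast to a string, because no `(ArrayType, StringType)` case is listed and the match falls through to `case _ => false`.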

/**

Review comment: How about moving these statements into L61-62?

Review comment: nit: `Cast` expression -> `CAST` syntax?