[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode
### What changes were proposed in this pull request?

In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.
![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png)

Comparing the ANSI CAST syntax rules with the current default behavior of Spark:
![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png)

To make Spark's ANSI mode more compliant with ANSI SQL, I propose disallowing the following casts in ANSI mode:
```
Timestamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```
The following casts are invalid under the ANSI SQL standard, but they are straightforward, so let's allow them for now:
```
Numeric <=> Boolean
String <=> Binary
```
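
As a rough illustration of how the two lists above relate, the following Scala sketch models each `<=>` entry as an unordered pair, so that a cast is treated the same in both directions. The `AnsiCastProposal` object, the `Family` names, and both helper methods are hypothetical and only mirror the lists in this description; they are not Spark's `DataType` classes or any Spark API.

```scala
object AnsiCastProposal {
  // Hypothetical type families, named after the entries in the lists above.
  sealed trait Family
  case object Numeric extends Family
  case object Str extends Family
  case object Bool extends Family
  case object Date extends Family
  case object Timestamp extends Family
  case object Binary extends Family
  case object Arr extends Family
  case object MapF extends Family
  case object Struct extends Family

  // "A <=> B" is symmetric, so an unordered pair is enough to describe it.
  private def pair(a: Family, b: Family): Set[Family] = Set(a, b)

  // Casts this change proposes to reject in ANSI mode.
  private val disallowed: Set[Set[Family]] = Set(
    pair(Timestamp, Bool), pair(Date, Bool),
    pair(Numeric, Timestamp), pair(Numeric, Date), pair(Numeric, Binary),
    pair(Str, Arr), pair(Str, MapF), pair(Str, Struct)
  )

  // Casts the standard forbids but this change keeps for now.
  private val keptExceptions: Set[Set[Family]] = Set(
    pair(Numeric, Bool), pair(Str, Binary)
  )

  def newlyDisallowed(from: Family, to: Family): Boolean =
    disallowed.contains(pair(from, to))

  def keptDespiteStandard(from: Family, to: Family): Boolean =
    keptExceptions.contains(pair(from, to))
}
```

Using a set of unordered pairs keeps the direction out of the data, which matches how the lists are written here; the actual implementation enumerates directed pairs in a pattern match.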
### Why are the changes needed?

Better ANSI SQL compliance

### Does this PR introduce _any_ user-facing change?

Yes, the following casts will no longer be allowed in ANSI mode:
```
Timestamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```

### How was this patch tested?

Unit test

The ANSI Compliance doc preview:
![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png)

Closes #30260 from gengliangwang/ansiCanCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
gengliangwang authored and maropu committed Nov 19, 2020
1 parent fbfc0bf commit 9a4c790
Showing 4 changed files with 635 additions and 395 deletions.
21 changes: 21 additions & 0 deletions docs/sql-ref-ansi-compliance.md
@@ -61,6 +61,27 @@ Spark SQL has three kinds of type conversions: explicit casting, type coercion,
When `spark.sql.ansi.enabled` is set to `true`, explicit casting by `CAST` syntax throws a runtime exception for illegal cast patterns defined in the standard, e.g. casts from a string to an integer.
On the other hand, `INSERT INTO` syntax throws an analysis exception when ANSI mode is enabled via `spark.sql.storeAssignmentPolicy=ANSI`.

The type conversion of Spark ANSI mode follows the syntax rules of section 6.13 "cast specification" in [ISO/IEC 9075-2:2011 Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation)](https://www.iso.org/standard/53682.html), except that it specifically allows the following
straightforward type conversions, which are disallowed by the ANSI standard:
* NumericType <=> BooleanType
* StringType <=> BinaryType

The valid combinations of target data type and source data type in a `CAST` expression are given by the following table.
“Y” indicates that the combination is syntactically valid without restriction and “N” indicates that the combination is not valid.

| From\To | NumericType | StringType | DateType | TimestampType | IntervalType | BooleanType | BinaryType | ArrayType | MapType | StructType |
|-----------|---------|--------|------|-----------|----------|---------|--------|-------|-----|--------|
| NumericType | Y | Y | N | N | N | Y | N | N | N | N |
| StringType | Y | Y | Y | Y | Y | Y | Y | N | N | N |
| DateType | N | Y | Y | Y | N | N | N | N | N | N |
| TimestampType | N | Y | Y | Y | N | N | N | N | N | N |
| IntervalType | N | Y | N | N | Y | N | N | N | N | N |
| BooleanType | Y | Y | N | N | N | Y | N | N | N | N |
| BinaryType | N | Y | N | N | N | N | Y | N | N | N |
| ArrayType | N | N | N | N | N | N | N | Y | N | N |
| MapType | N | N | N | N | N | N | N | N | Y | N |
| StructType | N | N | N | N | N | N | N | N | N | Y |

Currently, the ANSI mode affects explicit casting and assignment casting only.
In future releases, the behaviour of type coercion might change along with the other two type conversion rules.

@@ -25,6 +25,7 @@ import java.util.concurrent.TimeUnit._
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.{TypeCheckResult, TypeCoercion}
import org.apache.spark.sql.catalyst.expressions.Cast.{canCast, forceNullable, resolvableNullability}
import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.catalyst.util._
@@ -258,13 +259,18 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit

def dataType: DataType

/**
* Returns true iff we can cast `from` type to `to` type.
*/
def canCast(from: DataType, to: DataType): Boolean

override def toString: String = {
val ansi = if (ansiEnabled) "ansi_" else ""
s"${ansi}cast($child as ${dataType.simpleString})"
}

override def checkInputDataTypes(): TypeCheckResult = {
if (Cast.canCast(child.dataType, dataType)) {
if (canCast(child.dataType, dataType)) {
TypeCheckResult.TypeCheckSuccess
} else {
TypeCheckResult.TypeCheckFailure(
@@ -1753,6 +1759,12 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
copy(timeZoneId = Option(timeZoneId))

override protected val ansiEnabled: Boolean = SQLConf.get.ansiEnabled

override def canCast(from: DataType, to: DataType): Boolean = if (ansiEnabled) {
AnsiCast.canCast(from, to)
} else {
Cast.canCast(from, to)
}
}

/**
@@ -1770,6 +1782,110 @@ case class AnsiCast(child: Expression, dataType: DataType, timeZoneId: Option[St
copy(timeZoneId = Option(timeZoneId))

override protected val ansiEnabled: Boolean = true

override def canCast(from: DataType, to: DataType): Boolean = AnsiCast.canCast(from, to)
}
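
The dispatch above can be sketched in isolation as follows: the base class declares `canCast` as abstract and uses it in its type check, `Cast` picks a rule set from the ANSI flag, and `AnsiCast` always uses the ANSI rules. Everything in this sketch is a simplified stand-in: types are plain strings, `legacyCanCast` and `ansiCanCast` are hypothetical placeholders that model a single pair, and the flag is passed as a constructor argument rather than read from `SQLConf`.

```scala
object CastDispatchSketch {
  // Placeholder rule sets (assumptions): the legacy rules accept everything here,
  // and the ANSI rules reject only numeric <=> timestamp.
  def legacyCanCast(from: String, to: String): Boolean = true
  def ansiCanCast(from: String, to: String): Boolean =
    Set(from, to) != Set("numeric", "timestamp")

  abstract class CastBase {
    protected def ansiEnabled: Boolean

    // Each concrete cast decides which rule set applies.
    def canCast(from: String, to: String): Boolean

    // Mirrors the shape of checkInputDataTypes: succeed or report a failure.
    def check(from: String, to: String): Either[String, Unit] =
      if (canCast(from, to)) Right(()) else Left(s"cannot cast $from to $to")
  }

  // Follows the flag, as Cast does with the session's ANSI setting.
  final class Cast(protected val ansiEnabled: Boolean) extends CastBase {
    def canCast(from: String, to: String): Boolean =
      if (ansiEnabled) ansiCanCast(from, to) else legacyCanCast(from, to)
  }

  // Always applies the ANSI rules, regardless of any flag.
  final class AnsiCast extends CastBase {
    protected val ansiEnabled: Boolean = true
    def canCast(from: String, to: String): Boolean = ansiCanCast(from, to)
  }
}
```

Keeping `canCast` abstract on the base class means the shared type check does not need to know which rule set is active.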

object AnsiCast {
/**
* As per section 6.13 "cast specification" in "Information technology — Database languages -
* SQL — Part 2: Foundation (SQL/Foundation)":
* If the <cast operand> is a <value expression>, then the valid combinations of TD and SD
* in a <cast specification> are given by the following table. “Y” indicates that the
* combination is syntactically valid without restriction; “M” indicates that the combination
* is valid subject to other Syntax Rules in this Subclause being satisfied; and “N” indicates
* that the combination is not valid:
* SD TD
* EN AN C D T TS YM DT BO UDT B RT CT RW
* EN Y Y Y N N N M M N M N M N N
* AN Y Y Y N N N N N N M N M N N
* C Y Y Y Y Y Y Y Y Y M N M N N
* D N N Y Y N Y N N N M N M N N
* T N N Y N Y Y N N N M N M N N
* TS N N Y Y Y Y N N N M N M N N
* YM M N Y N N N Y N N M N M N N
* DT M N Y N N N N Y N M N M N N
* BO N N Y N N N N N Y M N M N N
* UDT M M M M M M M M M M M M M N
* B N N N N N N N N N M Y M N N
* RT M M M M M M M M M M M M N N
* CT N N N N N N N N N M N N M N
* RW N N N N N N N N N N N N N M
*
* Where:
* EN = Exact Numeric
* AN = Approximate Numeric
* C = Character (Fixed- or Variable-Length, or Character Large Object)
* D = Date
* T = Time
* TS = Timestamp
* YM = Year-Month Interval
* DT = Day-Time Interval
* BO = Boolean
* UDT = User-Defined Type
* B = Binary (Fixed- or Variable-Length or Binary Large Object)
* RT = Reference type
* CT = Collection type
* RW = Row type
*
* Spark's ANSI mode follows the syntax rules, except that it specifically allows the following
* straightforward type conversions, which are disallowed by the SQL standard:
* - Numeric <=> Boolean
* - String <=> Binary
*/
def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
case (fromType, toType) if fromType == toType => true

case (NullType, _) => true

case (StringType, _: BinaryType) => true

case (StringType, BooleanType) => true
case (_: NumericType, BooleanType) => true

case (StringType, TimestampType) => true
case (DateType, TimestampType) => true

case (StringType, _: CalendarIntervalType) => true

case (StringType, DateType) => true
case (TimestampType, DateType) => true

case (_: NumericType, _: NumericType) => true
case (StringType, _: NumericType) => true
case (BooleanType, _: NumericType) => true

case (_: NumericType, StringType) => true
case (_: DateType, StringType) => true
case (_: TimestampType, StringType) => true
case (_: CalendarIntervalType, StringType) => true
case (BooleanType, StringType) => true
case (BinaryType, StringType) => true

case (ArrayType(fromType, fn), ArrayType(toType, tn)) =>
canCast(fromType, toType) &&
resolvableNullability(fn || forceNullable(fromType, toType), tn)

case (MapType(fromKey, fromValue, fn), MapType(toKey, toValue, tn)) =>
canCast(fromKey, toKey) &&
(!forceNullable(fromKey, toKey)) &&
canCast(fromValue, toValue) &&
resolvableNullability(fn || forceNullable(fromValue, toValue), tn)

case (StructType(fromFields), StructType(toFields)) =>
fromFields.length == toFields.length &&
fromFields.zip(toFields).forall {
case (fromField, toField) =>
canCast(fromField.dataType, toField.dataType) &&
resolvableNullability(
fromField.nullable || forceNullable(fromField.dataType, toField.dataType),
toField.nullable)
}

case (udt1: UserDefinedType[_], udt2: UserDefinedType[_]) if udt2.acceptsType(udt1) => true

case _ => false
}
}
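
To show how the recursive array case interacts with nullability, here is a self-contained sketch of the same pattern-match structure over a tiny hypothetical type model. `DType` and its members are not Spark classes. The bodies of `resolvableNullability` and `forceNullable` are assumptions written for this sketch (the diff only imports them from `Cast`), and `forceNullable` in particular is heavily simplified.

```scala
object CanCastSketch {
  // Hypothetical miniature type model, not Spark's DataType hierarchy.
  sealed trait DType
  case object NumericT extends DType
  case object StringT extends DType
  case object BooleanT extends DType
  case object DateT extends DType
  case object TimestampT extends DType
  case object BinaryT extends DType
  final case class ArrayT(element: DType, containsNull: Boolean) extends DType

  // Assumption: a possibly-null source only fits a target that allows nulls.
  def resolvableNullability(from: Boolean, to: Boolean): Boolean = !from || to

  // Assumption (simplified): only parsing a string into another type may yield null.
  def forceNullable(from: DType, to: DType): Boolean = (from, to) match {
    case (StringT, StringT) | (StringT, BinaryT) => false
    case (StringT, _) => true
    case _ => false
  }

  def canCast(from: DType, to: DType): Boolean = (from, to) match {
    case (f, t) if f == t => true
    case (StringT, BinaryT) | (BinaryT, StringT) => true
    case (StringT, BooleanT) | (NumericT, BooleanT) => true
    case (StringT, TimestampT) | (DateT, TimestampT) => true
    case (StringT, DateT) | (TimestampT, DateT) => true
    case (StringT, NumericT) | (BooleanT, NumericT) => true
    case (NumericT | DateT | TimestampT | BooleanT, StringT) => true
    // Arrays: the element cast must be valid and the target must tolerate
    // any nulls the element cast could introduce.
    case (ArrayT(fromElem, fn), ArrayT(toElem, tn)) =>
      canCast(fromElem, toElem) &&
        resolvableNullability(fn || forceNullable(fromElem, toElem), tn)
    case _ => false
  }
}
```

Under these assumptions, an array of strings can be cast to an array of numerics only when the target array allows null elements, while a cast between a string and an array is rejected outright.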

/**