CPU ORC reading supports schema evolution as described in issue #135, but the GPU does not. The GPU reader throws an exception when users specify a reader schema that differs from the file schema, e.g.
scala> spark.conf.set("spark.rapids.sql.enabled", "false")
scala> Seq(1,2,3,4,5).toDF("ci").write.mode("overwrite").orc("/data/tmp/orc/")
scala> sql("create table if not exists tcl2 (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("select * from tcl2").show
+---+
| ci|
+---+
| 3|
| 4|
| 1|
| 2|
| 5|
+---+
scala> spark.conf.set("spark.rapids.sql.enabled", "true")
scala> sql("create table if not exists tcl (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res5: org.apache.spark.sql.DataFrame = []
scala> sql("select * from tcl").show
22/06/20 08:42:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 10)
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1244)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$3(GpuOrcScan.scala:1279)
at scala.collection.immutable.List.foreach(List.scala:431)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1264)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:990)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$4(GpuOrcScan.scala:844)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.withResource(GpuOrcScan.scala:789)
at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$1(GpuOrcScan.scala:842)
at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader(OrcShims320untilAllBase.scala:36)
at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader$(OrcShims320untilAllBase.scala:34)
at com.nvidia.spark.rapids.shims.OrcShims$.withReader(OrcShims.scala:22)
...
Unfortunately, we cannot detect this case at the tagging stage, because the file schema is not known at that time. dataSchema is not reliable because files may have different schemas.
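To make the per-file nature of the problem concrete, here is a minimal sketch of a compatibility check that can only run once a file's own schema has been read. It is plain Scala with no Spark or plugin dependency. `SimpleType`, `SchemaCheck`, `canRead`, and `incompatibleColumns` are hypothetical names, not the plugin's code, and the widening table (only INT to LONG, matching the repro above) and the treatment of missing columns are assumptions made for illustration.

```scala
// Hypothetical, simplified model of column types; not the plugin's real classes.
sealed trait SimpleType
case object IntT extends SimpleType
case object LongT extends SimpleType
case object StringT extends SimpleType

object SchemaCheck {
  type Schema = Seq[(String, SimpleType)]

  // Assumed widening table for illustration only; real ORC conversion rules are broader.
  private val widenings: Set[(SimpleType, SimpleType)] = Set((IntT, LongT))

  // A file column can be read as `readType` if the types match or a widening is listed.
  def canRead(fileType: SimpleType, readType: SimpleType): Boolean =
    fileType == readType || widenings.contains((fileType, readType))

  // Returns the read-schema columns that this particular file cannot satisfy.
  // Assumption: columns missing from the file are not reported as incompatible.
  def incompatibleColumns(fileSchema: Schema, readSchema: Schema): Seq[String] = {
    val fileTypes = fileSchema.toMap
    readSchema.collect {
      case (name, readType) if fileTypes.get(name).exists(ft => !canRead(ft, readType)) => name
    }
  }
}
```

Because the result depends on `fileSchema`, two files under the same path can give different answers for the same read schema, which is why a single table-level check at tagging time is not enough.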
Here is the feature list I can think of to support schema evolution for ORC reading.
Remove columns that exist in the file but are not required by the reader schema.
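The column-removal item could look roughly like the sketch below. It is plain Scala over column names only. `ColumnPruning` and `requiredFileColumns` are hypothetical names, and matching columns by exact name is an assumption; this is not how the plugin is implemented today.

```scala
object ColumnPruning {
  // Hypothetical helper: for one file, return the indices of the file columns that
  // the read schema asks for, in read-schema order. File columns the reader did not
  // request are dropped. Requested columns missing from the file are skipped here.
  def requiredFileColumns(fileColumns: Seq[String], readColumns: Seq[String]): Seq[Int] = {
    // Assumption: columns are matched by exact name.
    val indexByName = fileColumns.zipWithIndex.toMap
    readColumns.flatMap(name => indexByName.get(name))
  }
}
```

For example, a file with columns `a, b, c` read with schema `c, a` keeps only file columns 2 and 0, so `b` would never need to be decoded.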