[FEA] ORC reading supports schema evolution #5895

firestarman · 2022-06-23T02:57:14Z

CPU ORC reading supports schema evolution as described in issue #135, but GPU ORC reading does not. The GPU reader runs into an exception when users specify a read schema that differs from the file schema, e.g.

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq(1,2,3,4,5).toDF("ci").write.mode("overwrite").orc("/data/tmp/orc/")

scala> sql("create table if not exists tcl2 (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl2").show
+---+
| ci|
+---+
|  3|
|  4|
|  1|
|  2|
|  5|
+---+

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> sql("create table if not exists tcl (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl").show
22/06/20 08:42:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 10)
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1244)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$3(GpuOrcScan.scala:1279)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1264)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:990)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$4(GpuOrcScan.scala:844)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.withResource(GpuOrcScan.scala:789)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$1(GpuOrcScan.scala:842)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader(OrcShims320untilAllBase.scala:36)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader$(OrcShims320untilAllBase.scala:34)
	at com.nvidia.spark.rapids.shims.OrcShims$.withReader(OrcShims.scala:22)
...

Unfortunately, we cannot detect this case at the tagging stage, because the file schema is not known at that time. dataSchema is not reliable either, because the files may have different schemas.
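To illustrate why the check has to happen per file at read time, here is a minimal sketch in plain Scala. Everything in it is hypothetical: `Field`, `filesNeedingEvolution`, and the file names are made up for illustration and are not plugin, Spark, or ORC APIs.

```scala
object PerFileSchemaCheck {
  // Hypothetical stand-in for a column; not a Spark or ORC type.
  final case class Field(name: String, dataType: String)

  // Returns the files whose physical schema differs from the requested read
  // schema. A plan-time (tagging) check only sees readSchema, so it cannot
  // compute this without opening every file.
  def filesNeedingEvolution(
      fileSchemas: Map[String, Seq[Field]],
      readSchema: Seq[Field]): Seq[String] =
    fileSchemas.collect { case (file, schema) if schema != readSchema => file }.toSeq.sorted
}
```

With one file written as `int` and another as `bigint` under the same path, only the first file needs evolution for a `bigint` read schema, which is why a single table-level schema cannot answer the question.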

Here is the list of features I can think of that are needed to support schema evolution for ORC reading.

- Remove columns that exist in the file but are not required by the read schema.
- Add columns that are missing in the file.
- Re-order columns according to the read schema.
- Support type casting between compatible types.
  - Set up the framework of type casting for ORC reading #5960
  - Implement all the casting cases that GPU can support for ORC reading. #6149
- [BUG] GPU ORC reading fails when positional schema is enabled and more columns are required. #5948
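The items above can be sketched as a small planning step that maps each read-schema column to an action on the file. This is only an illustration under stated assumptions: columns are matched by name (not by position), types are plain strings, and the `wideningCasts` table is an invented example rather than the set of casts the GPU reader actually supports. None of these names come from the plugin.

```scala
object SchemaEvolutionSketch {
  // Hypothetical model of a column; not a Spark or ORC type.
  final case class Field(name: String, dataType: String)

  sealed trait ColumnAction
  final case class ReadAsIs(fileIndex: Int) extends ColumnAction
  final case class ReadAndCast(fileIndex: Int, from: String, to: String) extends ColumnAction
  final case class FillWithNulls(name: String, dataType: String) extends ColumnAction

  // Assumed cast rules, for illustration only.
  private val wideningCasts: Set[(String, String)] = Set(
    ("smallint", "int"), ("smallint", "bigint"), ("int", "bigint"), ("float", "double"))

  // Produces one action per read-schema column, in read-schema order.
  // File columns that the read schema does not mention are never referenced,
  // which is how they get removed.
  def plan(fileSchema: Seq[Field], readSchema: Seq[Field]): Either[String, Seq[ColumnAction]] = {
    val fileByName: Map[String, (Field, Int)] =
      fileSchema.zipWithIndex.map { case (f, i) => (f.name, (f, i)) }.toMap

    val actions: Seq[Either[String, ColumnAction]] = readSchema.map { want =>
      fileByName.get(want.name) match {
        case None =>
          Right(FillWithNulls(want.name, want.dataType)) // column missing in the file
        case Some((have, idx)) if have.dataType == want.dataType =>
          Right(ReadAsIs(idx))
        case Some((have, idx)) if wideningCasts.contains((have.dataType, want.dataType)) =>
          Right(ReadAndCast(idx, have.dataType, want.dataType))
        case Some((have, _)) =>
          Left(s"cannot cast ${want.name} from ${have.dataType} to ${want.dataType}")
      }
    }

    // Report the first incompatible column, otherwise return all actions.
    actions.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None => Right(actions.collect { case Right(a) => a })
    }
  }
}
```

For the case in this issue, `plan(Seq(Field("ci", "int")), Seq(Field("ci", "bigint")))` yields a single `ReadAndCast` action instead of a failed assertion. Returning an `Either` keeps unsupported casts as a reportable error rather than an exception inside the reader.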

firestarman added the "feature request" and "? - Needs Triage" labels Jun 23, 2022

firestarman mentioned this issue Jun 23, 2022

[BUG] mergeSchema on ORC reads does not work #135

Closed

sameerz removed the "? - Needs Triage" label Jun 28, 2022

firestarman mentioned this issue Jul 7, 2022

Set up the framework of type casting for ORC reading #5960

Merged

firestarman mentioned this issue Sep 8, 2022

ORC reading supports mergeSchema[databricks] #6523

Merged

firestarman commented Jun 23, 2022 (edited by sinkinben)
