[FEA] ORC reading supports schema evolution #5895

Open
5 of 7 tasks
firestarman opened this issue Jun 23, 2022 · 0 comments
Labels
feature request New feature or request

firestarman commented Jun 23, 2022

CPU ORC reading supports schema evolution, as described in issue #135, but GPU reading does not. The GPU reader runs into an exception when users specify a reader schema that differs from the file schema, e.g.:

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq(1,2,3,4,5).toDF("ci").write.mode("overwrite").orc("/data/tmp/orc/")

scala> sql("create table if not exists tcl2 (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl2").show
+---+
| ci|
+---+
|  3|
|  4|
|  1|
|  2|
|  5|
+---+

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> sql("create table if not exists tcl (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl").show
22/06/20 08:42:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 10)
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1244)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$3(GpuOrcScan.scala:1279)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1264)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:990)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$4(GpuOrcScan.scala:844)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.withResource(GpuOrcScan.scala:789)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$1(GpuOrcScan.scala:842)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader(OrcShims320untilAllBase.scala:36)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader$(OrcShims320untilAllBase.scala:34)
	at com.nvidia.spark.rapids.shims.OrcShims$.withReader(OrcShims.scala:22)
...

Unfortunately, we cannot detect this case at the tagging stage, because the file schema is not known at that time. dataSchema is not reliable either, because files may have different schemas.
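To illustrate why the check has to happen per file at read time, here is a minimal sketch in plain Scala. It is not the spark-rapids or ORC API: `OrcSchemaSketch`, `ColType`, `canRead`, and `incompatibleColumns` are hypothetical names, and the only conversion it accepts (int to long widening) is an assumption chosen to match the example above, not the full set of conversions ORC allows.

```scala
// Hypothetical, simplified model of a per-file schema check.
// None of these names come from spark-rapids or ORC.
object OrcSchemaSketch {
  sealed trait ColType
  case object IntCol extends ColType
  case object LongCol extends ColType
  case object StringCol extends ColType

  type Schema = Map[String, ColType]

  // Assumption for illustration: a column is readable if the types match,
  // or if the file stores an int and the reader asks for a long.
  def canRead(fileType: ColType, readType: ColType): Boolean =
    fileType == readType || (fileType == IntCol && readType == LongCol)

  // Returns the requested columns that one file cannot serve.
  // Columns missing from the file are treated as readable here
  // (assumed to come back as nulls).
  def incompatibleColumns(fileSchema: Schema, readSchema: Schema): Seq[String] =
    readSchema.toSeq.collect {
      case (name, readType)
          if fileSchema.get(name).exists(ft => !canRead(ft, readType)) => name
    }

  def main(args: Array[String]): Unit = {
    val readSchema: Schema = Map("ci" -> LongCol)
    // Two files under the same path, written with different schemas.
    val files: Seq[(String, Schema)] = Seq(
      "file-a" -> Map("ci" -> IntCol),
      "file-b" -> Map("ci" -> StringCol))
    files.foreach { case (name, fileSchema) =>
      val bad = incompatibleColumns(fileSchema, readSchema)
      println(s"$name: incompatible columns = ${bad.mkString("[", ", ", "]")}")
    }
  }
}
```

The point of the sketch is that the result depends on each file's own schema, so a single decision made at planning time from the table schema cannot cover every file under the path.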

Here is the list of features I can think of that are needed to support schema evolution for ORC reading.
