[BUG] mergeSchema on ORC reads does not work #135

Closed
revans2 opened this issue Jun 9, 2020 · 5 comments · Fixed by #6523
Labels
bug (Something isn't working), P1 (Nice to have for release), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin), SQL (part of the SQL/Dataframe plugin)

Comments

Collaborator

revans2 commented Jun 9, 2020

Describe the bug
This is a lot like #60, but for ORC files. If you try to use mergeSchema, or provide your own reader schema that has more columns than the ORC file does, the read fails with an error.

Steps/Code to reproduce bug
An integration test is being added for this.
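
Until that integration test is available, the two cases described above can be sketched roughly as follows. This is not the integration test itself, only a minimal illustration: the column names `a` and `b`, the temp directory, and the `v1`/`v2` layout are all made up for the example. It assumes Spark 3.x on the classpath (or a spark-shell session), and the result checks reflect CPU Spark behaviour.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Reuses the active session inside spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("orc-merge-schema-sketch")
  .getOrCreate()

// Hypothetical scratch location, used only for this sketch.
val base = Files.createTempDirectory("orc-merge-schema").toString

// Two ORC directories with different schemas: v2 has an extra column "b".
spark.range(1, 4).selectExpr("cast(id as int) as a")
  .write.mode("overwrite").orc(s"$base/v1")
spark.range(4, 6).selectExpr("cast(id as int) as a", "cast(id as string) as b")
  .write.mode("overwrite").orc(s"$base/v2")

// Case 1: mergeSchema asks Spark to union the schemas of all files read.
val merged = spark.read.option("mergeSchema", "true").orc(s"$base/v1", s"$base/v2")

// Case 2: a reader schema with more columns than the file (v1 has no "b").
val readSchema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))
val widened = spark.read.schema(readSchema).orc(s"$base/v1")

// On the CPU both reads succeed and the missing column comes back as null.
merged.show()
widened.show()
```

Running the same reads with `spark.rapids.sql.enabled` set to `true` is where the error described above would be expected to show up.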

revans2 added the bug, ? - Needs Triage, and SQL labels on Jun 9, 2020
Collaborator Author

revans2 commented Jun 11, 2020

This is actually rather complex now that I have dug into it. To fully make this work we are going to need to support schema evolution for ORC, which is rather hard. I filed rapidsai/cudf#5447 for this with CUDF.

Collaborator Author

revans2 commented Jun 11, 2020

In the short term I am going to do what I can to fall back to the CPU in the cases that we know will not work.

sameerz added the P1 label and removed the ? - Needs Triage label on Aug 18, 2020
revans2 added the reliability and ? - Needs Triage labels on May 11, 2022
Collaborator Author

revans2 commented May 11, 2022

Marking this for us to look at again because it is related to #5445 for Parquet.

Collaborator

firestarman commented Jun 20, 2022

It seems mergeSchema itself does not need all of the schema evolution functionality; support for adding and re-ordering columns (already supported on the GPU) would be enough according to the Schema Merging doc (maybe I missed some cases). Yes, the doc is for Parquet, but I think it also applies to ORC.
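
To make the "adding/re-ordering columns only" idea concrete, here is a small plain-Scala model of it. Everything in it (`Field`, `mergeSchemas`, `project`) is hypothetical and exists only for illustration; it is neither Spark's schema merging code nor anything from the plugin, and it deliberately rejects type changes, since those fall under schema evolution rather than merging.

```scala
// Hypothetical minimal model: a schema is an ordered list of named, typed fields.
final case class Field(name: String, dataType: String)

// Merge by name: keep the left schema's order and append columns only seen on the right.
// Same-named columns must agree on type; a type change is out of scope for this sketch.
def mergeSchemas(left: Seq[Field], right: Seq[Field]): Seq[Field] = {
  val known = left.map(f => f.name -> f.dataType).toMap
  right.foreach { f =>
    known.get(f.name).foreach { t =>
      require(t == f.dataType, s"type conflict on ${f.name}: $t vs ${f.dataType}")
    }
  }
  left ++ right.filterNot(f => known.contains(f.name))
}

// Project one row of a file into the merged schema:
// columns are matched by name (re-ordering) and absent columns become null (adding).
def project(merged: Seq[Field], fileSchema: Seq[Field], row: Seq[Any]): Seq[Any] = {
  val byName = fileSchema.map(_.name).zip(row).toMap
  merged.map(f => byName.getOrElse(f.name, null))
}

val v1 = Seq(Field("a", "int"), Field("b", "string"))
val v2 = Seq(Field("c", "double"), Field("a", "int"))
val merged = mergeSchemas(v1, v2)               // a, b, c
val row = project(merged, v2, Seq(1.5, 7))      // 7, null, 1.5
```

In this model no value is ever converted, which is the sense in which merging alone would not need the full schema evolution machinery.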

In my view, schema evolution is required to support a user-specified schema. I can hit the type-casting case even without mergeSchema by specifying a read schema, as below. Shall we track all the TODOs in this issue, or open a new one to track full support for schema evolution?

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq(1,2,3,4,5).toDF("ci").write.mode("overwrite").orc("/data/tmp/orc/")

scala> sql("create table if not exists tcl2 (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl2").show
+---+
| ci|
+---+
|  3|
|  4|
|  1|
|  2|
|  5|
+---+

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> sql("create table if not exists tcl (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl").show
22/06/20 08:42:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 10)
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1244)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$3(GpuOrcScan.scala:1279)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1264)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:990)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$4(GpuOrcScan.scala:844)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.withResource(GpuOrcScan.scala:789)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$1(GpuOrcScan.scala:842)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader(OrcShims320untilAllBase.scala:36)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader$(OrcShims320untilAllBase.scala:34)
	at com.nvidia.spark.rapids.shims.OrcShims$.withReader(OrcShims.scala:22)
...
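
The session above differs from the missing-column case: the file stores `ci` as an int, while the table declares it as LONG (bigint). A rough plain-Scala sketch of that distinction follows. The names `ColumnPlan` and `planRead` are hypothetical and this is not what `checkSchemaCompatibility` in the plugin does; it only separates columns that can be read as-is, columns that need null filling, and columns that need a cast.

```scala
// Hypothetical classification of what a reader must do for each requested column.
sealed trait ColumnPlan
case object ReadAsIs extends ColumnPlan
case object FillWithNulls extends ColumnPlan
final case class NeedsCast(from: String, to: String) extends ColumnPlan

// fileSchema: column name -> type stored in the file.
// readSchema: requested columns in order, as (name, type) pairs.
def planRead(
    fileSchema: Map[String, String],
    readSchema: Seq[(String, String)]): Seq[(String, ColumnPlan)] =
  readSchema.map { case (name, wanted) =>
    val plan: ColumnPlan = fileSchema.get(name) match {
      case None                             => FillWithNulls            // column absent from the file
      case Some(actual) if actual == wanted => ReadAsIs                 // exact match
      case Some(actual)                     => NeedsCast(actual, wanted) // schema evolution case
    }
    name -> plan
  }

// The file written above has ci as int; the table asks for bigint.
// "extra" is an invented column, added to show the missing-column case alongside it.
val plans = planRead(Map("ci" -> "int"), Seq("ci" -> "bigint", "extra" -> "string"))
// plans: ci needs a cast from int to bigint, extra is filled with nulls
```

Only the `NeedsCast` branch corresponds to the type-casting case reproduced in the session; the `FillWithNulls` branch is the added-column case that mergeSchema relies on.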

Collaborator

firestarman commented Jun 21, 2022

Filed a new issue #5895 to track the schema evolution feature.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>