[BUG] mergeSchema on ORC reads does not work #135

revans2 · 2020-06-09T17:37:22Z

Describe the bug
This is a lot like #60, but for ORC files. If you try to use mergeSchema, or provide your own reader schema that has more columns than the ORC file does, the read fails with an error.

Steps/Code to reproduce bug
An integration test is being added for this.
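
Until that test lands, the two cases described above could be exercised with something like the following. This is only a sketch, not the integration test itself: the scratch directory, the column names `a` and `b`, and the app name are made up for illustration, and it assumes Spark 3.x with spark-sql on the classpath (for example, pasted into spark-shell). Toggling `spark.rapids.sql.enabled` before the reads should show the difference between the CPU and GPU paths.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the active session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orc-merge-schema-repro") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch locations, not the paths used by the real test.
val base = Files.createTempDirectory("orc_merge_schema").toString
val narrow = s"$base/narrow"
val wide = s"$base/wide"

// Two ORC outputs whose file schemas differ: the second has an extra column.
Seq(1, 2, 3).toDF("a").write.mode("overwrite").orc(narrow)
Seq((4, "x"), (5, "y")).toDF("a", "b").write.mode("overwrite").orc(wide)

// Case 1: let Spark merge the file schemas.
val merged = spark.read.option("mergeSchema", "true").orc(narrow, wide)
merged.printSchema()
merged.show()

// Case 2: a reader schema with more columns than the file contains.
val padded = spark.read.schema("a INT, b STRING").orc(narrow)
padded.show()
```

On the CPU, both reads should succeed, with `b` coming back as null for rows from the narrower file.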


revans2 · 2020-06-11T16:42:47Z

This is actually rather complex now that I have dug into it. To fully make this work we are going to need to support schema evolution for ORC, which is rather hard. I filed rapidsai/cudf#5447 for this with CUDF.

revans2 · 2020-06-11T16:43:36Z

In the short term I am going to do what I can to fall back to the CPU in cases that we know will not work.

revans2 · 2022-05-11T13:57:04Z

Marking this for us to look at again because it is related to #5445 in parquet.

firestarman · 2022-06-20T08:49:47Z

It seems mergeSchema itself does not need all of the schema evolution functionality. According to the Schema Merging doc, support for adding and reordering columns (already supported on the GPU) would be enough, though I may have missed some cases. The doc is written for Parquet, but I think it also applies to ORC.
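
To make the "adding/re-ordering columns only" point concrete, here is a toy model in plain Scala. It is not Spark's or the plugin's implementation: `MergeSchemaSketch`, `merge`, and `align` are hypothetical names, and a schema is simplified to an ordered list of (name, type) pairs. It shows that merging by name needs only a union of columns plus a per-file projection that reorders columns and fills missing ones with null, and that a type mismatch is where real schema evolution (casting) would have to start.

```scala
// Hypothetical, simplified model: a schema is an ordered list of (name, type) pairs.
object MergeSchemaSketch {
  type Schema = Vector[(String, String)]

  // Union the columns by name, keeping first-seen order. Conflicting types are
  // rejected here, because reconciling them would need type-casting support.
  def merge(a: Schema, b: Schema): Schema =
    b.foldLeft(a) { case (acc, (name, tpe)) =>
      acc.find(_._1 == name) match {
        case Some((_, existing)) =>
          require(existing == tpe, s"column $name: $existing vs $tpe")
          acc
        case None => acc :+ (name -> tpe)
      }
    }

  // Project one file's row onto the merged schema: reorder by name and
  // use null for columns the file does not have.
  def align(row: Map[String, Any], merged: Schema): Vector[Any] =
    merged.map { case (name, _) => row.getOrElse(name, null) }

  def main(args: Array[String]): Unit = {
    val file1: Schema = Vector("a" -> "int")
    val file2: Schema = Vector("b" -> "string", "a" -> "int")
    val merged = merge(file1, file2)
    println(merged)                       // Vector((a,int), (b,string))
    println(align(Map("a" -> 1), merged)) // Vector(1, null)
  }
}
```

Under this model, the `ci` INT versus LONG case in the session further down would trip the `require` check, which is the part that needs the broader schema evolution work.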

Personally, I think schema evolution is required to support a user-specified schema. I can run into the type-casting case even without mergeSchema by specifying a read schema, as below. Shall we track all the TODOs in this issue, or open a new one to track full support for schema evolution?

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> Seq(1,2,3,4,5).toDF("ci").write.mode("overwrite").orc("/data/tmp/orc/")

scala> sql("create table if not exists tcl2 (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res2: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl2").show
+---+
| ci|
+---+
|  3|
|  4|
|  1|
|  2|
|  5|
+---+

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> sql("create table if not exists tcl (`ci` LONG) using ORC options(path '/data/tmp/orc/')")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("select * from tcl").show
22/06/20 08:42:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 10)
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1244)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$3(GpuOrcScan.scala:1279)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1264)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:990)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$4(GpuOrcScan.scala:844)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.withResource(GpuOrcScan.scala:789)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$1(GpuOrcScan.scala:842)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader(OrcShims320untilAllBase.scala:36)
	at com.nvidia.spark.rapids.shims.OrcShims320untilAllBase.withReader$(OrcShims320untilAllBase.scala:34)
	at com.nvidia.spark.rapids.shims.OrcShims$.withReader(OrcShims.scala:22)
...

firestarman · 2022-06-21T02:47:08Z

Filed a new issue #5895 to track the schema evolution feature.


revans2 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and SQL (part of the SQL/Dataframe plugin) labels Jun 9, 2020

revans2 mentioned this issue Jun 11, 2020

Orc merge schema fallback and FileScan format configs #158

Merged

sameerz added the P1 (Nice to have for release) label and removed the ? - Needs Triage (Need team to review and classify) label Aug 18, 2020

revans2 added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) and ? - Needs Triage (Need team to review and classify) labels May 11, 2022

revans2 mentioned this issue May 11, 2022

[FEA] parquet and orc corner case tests #5462

Open

sperlingxx self-assigned this May 13, 2022

sameerz removed the ? - Needs Triage (Need team to review and classify) label May 17, 2022

sperlingxx mentioned this issue May 20, 2022

[BUG] read ORC file with various file schemas #5562

Closed

sperlingxx mentioned this issue May 24, 2022

Enable merge schema reading on ORC #5605

Closed

sperlingxx removed their assignment Jun 6, 2022

firestarman mentioned this issue Jul 4, 2022

[FEA] ORC reading supports schema evolution #5895

Open

7 tasks

firestarman mentioned this issue Sep 8, 2022

ORC reading supports mergeSchema[databricks] #6523

Merged

firestarman closed this as completed in #6523 Sep 14, 2022

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023

Update submodule cudf to 54918d8 (NVIDIA#135)

63a2dd0

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
