Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075 (NVIDIA#4471)

* Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075

Spark 3.1.1+ fails to read ORC files written by cudf when a filter is pushed down.

This PR disables ORC write by default and documents the reason.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>

* Update sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* resolve comments

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* resolve comment

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
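Users who accept the ORC-1075 risk can still opt back in at runtime. A minimal sketch, assuming an active `SparkSession` named `spark` with the RAPIDS plugin on the classpath:

```scala
// Re-enable GPU-accelerated ORC writes, accepting that files written
// without RowIndex statistics may be unreadable by Spark 3.1.1+ when a
// filter is pushed down (see ORC-1075).
spark.conf.set("spark.rapids.sql.format.orc.write.enabled", "true")
```

The same key can also be set at launch time, e.g. `--conf spark.rapids.sql.format.orc.write.enabled=true`.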
wbo4958 and jlowe authored Jan 10, 2022
1 parent 5749673 commit 8773141
Showing 3 changed files with 12 additions and 6 deletions.
2 changes: 1 addition & 1 deletion docs/configs.md
@@ -86,7 +86,7 @@ Name | Description | Default Value
<a name="sql.format.orc.multiThreadedRead.numThreads"></a>spark.rapids.sql.format.orc.multiThreadedRead.numThreads|The maximum number of threads, on the executor, to use for reading small orc files in parallel. This can not be changed at runtime after the executor has started. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.|20
<a name="sql.format.orc.read.enabled"></a>spark.rapids.sql.format.orc.read.enabled|When set to false disables orc input acceleration|true
<a name="sql.format.orc.reader.type"></a>spark.rapids.sql.format.orc.reader.type|Sets the orc reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.orc.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.orc.multiThreadedRead.numThreads and spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.|AUTO
<a name="sql.format.orc.write.enabled"></a>spark.rapids.sql.format.orc.write.enabled|When set to false disables orc output acceleration|true
<a name="sql.format.orc.write.enabled"></a>spark.rapids.sql.format.orc.write.enabled|When set to true enables orc output acceleration. It defaults to false because of an ORC bug: the ORC Java library fails to read ORC files that lack statistics in the RowIndex. For more details, please refer to https://issues.apache.org/jira/browse/ORC-1075|false
<a name="sql.format.parquet.enabled"></a>spark.rapids.sql.format.parquet.enabled|When set to false disables all parquet input and output acceleration|true
<a name="sql.format.parquet.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type|2147483647
<a name="sql.format.parquet.multiThreadedRead.numThreads"></a>spark.rapids.sql.format.parquet.multiThreadedRead.numThreads|The maximum number of threads, on the executor, to use for reading small parquet files in parallel. This can not be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type.|20
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -779,9 +779,11 @@ object RapidsConf {
.createWithDefault(true)

val ENABLE_ORC_WRITE = conf("spark.rapids.sql.format.orc.write.enabled")
.doc("When set to false disables orc output acceleration")
.doc("When set to true enables orc output acceleration. It defaults to false because " +
"of an ORC bug: the ORC Java library fails to read ORC files that lack statistics " +
"in the RowIndex. For more details, please refer to " +
"https://issues.apache.org/jira/browse/ORC-1075")
.booleanConf
.createWithDefault(true)
.createWithDefault(false)

// This will be deleted when COALESCING is implemented for ORC
object OrcReaderType extends Enumeration {
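The hunk above registers the entry through RapidsConf's fluent conf builder. A self-contained sketch of that pattern, where `ConfSketch`, `ConfBuilder`, and `BooleanBuilder` are hypothetical stand-ins for the plugin's internals:

```scala
object ConfSketch {
  // A registered config entry: key, documentation, and a typed default.
  final case class ConfEntry[T](key: String, doc: String, defaultValue: T)

  // Terminal builder: fixes the type to Boolean and materializes the entry.
  final class BooleanBuilder(key: String, docText: String) {
    def createWithDefault(d: Boolean): ConfEntry[Boolean] =
      ConfEntry(key, docText, d)
  }

  // Fluent builder: accumulates the doc string, then picks a value type.
  final class ConfBuilder(key: String, docText: String = "") {
    def doc(d: String): ConfBuilder = new ConfBuilder(key, d)
    def booleanConf: BooleanBuilder = new BooleanBuilder(key, docText)
  }

  def conf(key: String): ConfBuilder = new ConfBuilder(key)

  // Mirrors the edited entry: disabled by default because of ORC-1075.
  val ENABLE_ORC_WRITE: ConfEntry[Boolean] =
    conf("spark.rapids.sql.format.orc.write.enabled")
      .doc("When set to true enables orc output acceleration.")
      .booleanConf
      .createWithDefault(false)
}
```

The chain reads the same way as the diff: `conf(...)` names the key, `.doc(...)` attaches the description rendered into configs.md, and `.createWithDefault(false)` is the one-word change that flips the default.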
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -53,8 +53,12 @@ object GpuOrcFileFormat extends Logging {
}

if (!meta.conf.isOrcWriteEnabled) {
meta.willNotWorkOnGpu("ORC output has been disabled. To enable set" +
s"${RapidsConf.ENABLE_ORC_WRITE} to true")
meta.willNotWorkOnGpu("ORC output has been disabled. To enable set " +
s"${RapidsConf.ENABLE_ORC_WRITE} to true.\n" +
"Please note that ORC files written by spark-rapids do not include statistics " +
"in the RowIndex, which causes Spark 3.1.1+ to fail to read them when a filter " +
"is pushed down. This is an ORC issue, " +
"please refer to https://issues.apache.org/jira/browse/ORC-1075")
}

FileFormatChecks.tag(meta, schema, OrcFormatType, WriteFileOp)
