
[BUG] test_read_merge_schema fails on Databricks #192

Closed
tgravescs opened this issue Jun 16, 2020 · 4 comments · Fixed by #597
Labels: bug (Something isn't working), P1 (Nice to have for release)

Comments

tgravescs (Collaborator) commented on Jun 16, 2020

Describe the bug
The integration test test_read_merge_schema fails on Databricks: the GPU Parquet reader errors with org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 when reading partitioned Parquet data whose files were written with different schemas.

Steps/Code to reproduce bug

  • Start up a Databricks cluster on AWS
  • Log into the master node and check out the source code
  • Build and run the integration tests (a sketch of running just the failing test follows this list)
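
For the last step, a minimal sketch of driving only the failing test from the checked-out repo, assuming the plugin jars are already built and on the Spark classpath; the test-file path is an assumption about where the Parquet suite lives and may differ by branch:

```python
# Hypothetical sketch: run just the failing schema-merge test via pytest.
# The module path below is an assumption, not taken from this issue.
import pytest

pytest.main([
    "-k", "test_read_merge_schema",              # narrow the run to this one test
    "integration_tests/src/main/python/parquet_test.py",
])
```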

Expected behavior
test_read_merge_schema passes on Databricks, as it does on other Spark deployments.

Environment details (please complete the following information)

  • Environment location: Cloud (Databricks on AWS)

Additional context
The test fails with the following output:

E : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1854.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1854.0 (TID 40584, ip-10-59-249-53.us-west-2.compute.internal, executor driver): com.databricks.sql.io.FileReadException: Error while reading file file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet.
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:343)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:322)
E at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:409)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:258)
E at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:126)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.gpucolumnartorow_nextBatch_0$(Unknown Source)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
E at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
E at org.apache.spark.sql.execution.collect.Collector.$anonfun$processPartition$1(Collector.scala:179)
E at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2401)
E at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
E at org.apache.spark.scheduler.Task.run(Task.scala:117)
E at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
E at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
E at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
E at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E at java.lang.Thread.run(Thread.java:748)
E Caused by: org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 from file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet
E at ai.rapids.spark.ParquetPartitionReader.readToTable(GpuParquetScan.scala:477)
E at ai.rapids.spark.ParquetPartitionReader.readBatch(GpuParquetScan.scala:435)
E at ai.rapids.spark.ParquetPartitionReader.next(GpuParquetScan.scala:254)
E at ai.rapids.spark.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
E at ai.rapids.spark.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:38)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
E ... 20 more
E
E Driver stacktrace:
E at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2476)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2425)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2424)
E at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2424)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1129)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1129)
E at scala.Option.foreach(Option.scala:407)
E at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1129)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2676)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2623)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2611)
E at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
E at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:915)
E at org.apache.spark.SparkContext.runJob(SparkContext.scala:2307)
E at org.apache.spark.SparkContext.runJob(SparkContext.scala:2402)
E at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:273)
E at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
E at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
E at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
E at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
E at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
E at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:396)
E at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3487)
E at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3682)
E at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:115)
E at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:246)
E at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:100)
E at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:828)
E at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:76)
E at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:196)
E at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3680)
E at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3485)
E at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
E at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E at java.lang.reflect.Method.invoke(Method.java:498)
E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
E at py4j.Gateway.invoke(Gateway.java:295)
E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E at py4j.commands.CallCommand.execute(CallCommand.java:79)
E at py4j.GatewayConnection.run(GatewayConnection.java:251)
E at java.lang.Thread.run(Thread.java:748)
E Caused by: com.databricks.sql.io.FileReadException: Error while reading file file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet.
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:343)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:322)
E at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:409)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:258)
E at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:126)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.gpucolumnartorow_nextBatch_0$(Unknown Source)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
E at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
E at org.apache.spark.sql.execution.collect.Collector.$anonfun$processPartition$1(Collector.scala:179)
E at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2401)
E at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
E at org.apache.spark.scheduler.Task.run(Task.scala:117)
E at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
E at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
E at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
E at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E ... 1 more
E Caused by: org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 from file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet
E at ai.rapids.spark.ParquetPartitionReader.readToTable(GpuParquetScan.scala:477)
E at ai.rapids.spark.ParquetPartitionReader.readBatch(GpuParquetScan.scala:435)
E at ai.rapids.spark.ParquetPartitionReader.next(GpuParquetScan.scala:254)
E at ai.rapids.spark.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
E at ai.rapids.spark.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:38)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
E ... 20 more
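
For context, "Expected 15 columns but read 10" suggests the merged read schema is wider than an individual file's physical schema, so files written with the older, narrower schema cannot satisfy the columns the reader asks for. A minimal PySpark sketch of that schema-evolution pattern, with illustrative paths and column names that are not taken from the test itself:

```python
# Hypothetical sketch of the pattern behind the failure: two Parquet files
# under one table path with different column sets, read back with schema
# merging so the read schema is wider than either file's footer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First write: a narrow schema.
spark.range(10).selectExpr("id", "id * 2 AS a") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/key=0")

# Second write: a wider schema with an extra column.
spark.range(10).selectExpr("id", "id * 2 AS a", "id * 3 AS b") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/key=1")

# mergeSchema=true unions the file schemas, so files under key=0 must be
# read with columns they do not physically contain; the reader should fill
# the missing columns with nulls rather than fail.
df = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
df.show()
```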

tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jun 16, 2020
tgravescs (Collaborator, Author) commented

Note this requires additional changes that are not yet committed. PRs are up for some of those changes, and some build changes have not yet been submitted as pull requests.

sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jun 29, 2020
sameerz (Collaborator) commented on Jul 22, 2020

> Note this requires additional changes that are not yet committed. PRs are up for some of those changes, and some build changes have not yet been submitted as pull requests.

@tgravescs, can you link to the PRs for these changes?

tgravescs (Collaborator, Author) commented

The changes are already committed for the Databricks build. See the Jenkinsfile.databricksnightly build file for how to build; it requires a cluster to be set up. I am happy to show folks how to do that.

sameerz added the P1 (Nice to have for release) label on Aug 19, 2020
tgravescs (Collaborator, Author) commented

This issue no longer exists since we rearranged the FileSource code. I'll put up a PR to remove the xfail from the test.
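
A minimal sketch of what such a change amounts to in pytest terms; the marker form and test signature below are hypothetical, not the suite's actual code:

```python
import pytest

# Hypothetical form of the expected-failure marker this comment refers to.
@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/192")
def test_read_merge_schema():
    ...

# The follow-up PR (#597) removes the decorator, so the test's assertions
# are enforced again instead of being recorded as an expected failure.
```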

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>