
[BUG] test_read_merge_schema fails on Databricks #192

Closed
tgravescs opened this issue Jun 16, 2020 · 4 comments · Fixed by #597
Labels: bug (Something isn't working), P1 (Nice to have for release)

Comments

tgravescs (Collaborator) commented on Jun 16, 2020

Describe the bug
The integration test test_read_merge_schema fails on Databricks: the GPU Parquet reader errors with org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 when reading partitioned Parquet data whose files were written with different schemas.

Steps/Code to reproduce bug

  • Start up a Databricks cluster on AWS
  • Log into the master node and check out the source code
  • Build and run the integration tests (a sketch of running just the failing test follows this list)
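
For the last step, a minimal sketch of driving only the failing test from the checked-out repo, assuming the plugin jars are already built and on the Spark classpath; the test-file path is an assumption about where the Parquet suite lives and may differ by branch:

```python
# Hypothetical sketch: run just the failing schema-merge test via pytest.
# The module path below is an assumption, not taken from this issue.
import pytest

pytest.main([
    "-k", "test_read_merge_schema",              # narrow the run to this one test
    "integration_tests/src/main/python/parquet_test.py",
])
```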

Expected behavior
test_read_merge_schema passes on Databricks, as it does on other Spark deployments.

Environment details (please complete the following information)

  • Environment location: Cloud (Databricks on AWS)

Additional context
The test fails with the following output:

E : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1854.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1854.0 (TID 40584, ip-10-59-249-53.us-west-2.compute.internal, executor driver): com.databricks.sql.io.FileReadException: Error while reading file file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet.
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:343)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:322)
E at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:409)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:258)
E at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:126)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.gpucolumnartorow_nextBatch_0$(Unknown Source)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
E at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
E at org.apache.spark.sql.execution.collect.Collector.$anonfun$processPartition$1(Collector.scala:179)
E at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2401)
E at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
E at org.apache.spark.scheduler.Task.run(Task.scala:117)
E at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
E at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
E at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
E at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E at java.lang.Thread.run(Thread.java:748)
E Caused by: org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 from file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet
E at ai.rapids.spark.ParquetPartitionReader.readToTable(GpuParquetScan.scala:477)
E at ai.rapids.spark.ParquetPartitionReader.readBatch(GpuParquetScan.scala:435)
E at ai.rapids.spark.ParquetPartitionReader.next(GpuParquetScan.scala:254)
E at ai.rapids.spark.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
E at ai.rapids.spark.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:38)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
E ... 20 more
E
E Driver stacktrace:
E at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2476)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2425)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2424)
E at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2424)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1129)
E at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1129)
E at scala.Option.foreach(Option.scala:407)
E at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1129)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2676)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2623)
E at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2611)
E at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
E at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:915)
E at org.apache.spark.SparkContext.runJob(SparkContext.scala:2307)
E at org.apache.spark.SparkContext.runJob(SparkContext.scala:2402)
E at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:273)
E at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:308)
E at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:82)
E at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
E at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:508)
E at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:480)
E at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:396)
E at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3487)
E at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3682)
E at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:115)
E at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:246)
E at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:100)
E at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:828)
E at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:76)
E at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:196)
E at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3680)
E at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3485)
E at sun.reflect.GeneratedMethodAccessor88.invoke(Unknown Source)
E at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E at java.lang.reflect.Method.invoke(Method.java:498)
E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
E at py4j.Gateway.invoke(Gateway.java:295)
E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E at py4j.commands.CallCommand.execute(CallCommand.java:79)
E at py4j.GatewayConnection.run(GatewayConnection.java:251)
E at java.lang.Thread.run(Thread.java:748)
E Caused by: com.databricks.sql.io.FileReadException: Error while reading file file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet.
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:343)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:322)
E at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:409)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:258)
E at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:126)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.gpucolumnartorow_nextBatch_0$(Unknown Source)
E at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
E at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
E at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
E at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
E at org.apache.spark.sql.execution.collect.Collector.$anonfun$processPartition$1(Collector.scala:179)
E at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2401)
E at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
E at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
E at org.apache.spark.scheduler.Task.run(Task.scala:117)
E at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
E at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
E at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
E at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
E at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
E ... 1 more
E Caused by: org.apache.spark.sql.execution.QueryExecutionException: Expected 15 columns but read 10 from file:/tmp/pyspark_tests/227876/PARQUET_DATA/key=0/part-00000-tid-2694644956922271293-3424fdcc-0a76-4dc5-9edc-495760f7c104-40579-1-c000.snappy.parquet
E at ai.rapids.spark.ParquetPartitionReader.readToTable(GpuParquetScan.scala:477)
E at ai.rapids.spark.ParquetPartitionReader.readBatch(GpuParquetScan.scala:435)
E at ai.rapids.spark.ParquetPartitionReader.next(GpuParquetScan.scala:254)
E at ai.rapids.spark.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:35)
E at ai.rapids.spark.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:38)
E at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:291)
E ... 20 more
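
For context, "Expected 15 columns but read 10" suggests the merged read schema is wider than an individual file's physical schema, so files written with the older, narrower schema cannot satisfy the columns the reader asks for. A minimal PySpark sketch of that schema-evolution pattern, with illustrative paths and column names that are not taken from the test itself:

```python
# Hypothetical sketch of the pattern behind the failure: two Parquet files
# under one table path with different column sets, read back with schema
# merging so the read schema is wider than either file's footer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First write: a narrow schema.
spark.range(10).selectExpr("id", "id * 2 AS a") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/key=0")

# Second write: a wider schema with an extra column.
spark.range(10).selectExpr("id", "id * 2 AS a", "id * 3 AS b") \
    .write.mode("overwrite").parquet("/tmp/merge_demo/key=1")

# mergeSchema=true unions the file schemas, so files under key=0 must be
# read with columns they do not physically contain; the reader should fill
# the missing columns with nulls rather than fail.
df = spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
df.show()
```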

tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jun 16, 2020
tgravescs (Collaborator, Author) commented

Note this requires additional changes that are not yet committed. PRs are up for some of those changes, and some build changes have not yet been submitted as pull requests.

sameerz removed the ? - Needs Triage (Need team to review and classify) label on Jun 29, 2020
sameerz (Collaborator) commented on Jul 22, 2020

> Note this requires additional changes that are not yet committed. PRs are up for some of those changes, and some build changes have not yet been submitted as pull requests.

@tgravescs, can you link to the PRs for these changes?

tgravescs (Collaborator, Author) commented

The changes are already committed for the Databricks build. See the Jenkinsfile.databricksnightly build file for how to build; it requires a cluster to be set up. I am happy to show folks how to do that.

sameerz added the P1 (Nice to have for release) label on Aug 19, 2020
tgravescs (Collaborator, Author) commented

This issue no longer exists since we rearranged the FileSource code. I'll put up a PR to remove the xfail from the test.
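
A minimal sketch of what such a change amounts to in pytest terms; the marker form and test signature below are hypothetical, not the suite's actual code:

```python
import pytest

# Hypothetical form of the expected-failure marker this comment refers to.
@pytest.mark.xfail(reason="https://github.com/NVIDIA/spark-rapids/issues/192")
def test_read_merge_schema():
    ...

# The follow-up PR (#597) removes the decorator, so the test's assertions
# are enforced again instead of being recorded as an expected failure.
```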

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>