
[BUG] udf_cudf_test.py integration tests fail #2027

Closed
tgravescs opened this issue Mar 26, 2021 · 10 comments
Labels: bug (Something isn't working), P0 (Must have for release)

@tgravescs (Collaborator):

Describe the bug
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_with_column[small data] - py4j....
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_with_column[large data] - py4j....
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_sql - py4j.protocol.Py4JJavaErr...
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_select - py4j.protocol.Py4JJava...
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_map_in_pandas[ALLOW_NON_GPU(GpuMapInPandasExec,PythonUDF)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_group_apply[ALLOW_NON_GPU(GpuFlatMapGroupsInPandasExec,PythonUDF)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_group_apply_in_pandas[ALLOW_NON_GPU(GpuFlatMapGroupsInPandasExec,PythonUDF)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_group_agg[ALLOW_NON_GPU(GpuAggregateInPandasExec,PythonUDF,Alias)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_sql_group[ALLOW_NON_GPU(GpuAggregateInPandasExec,PythonUDF,Alias)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_window[ALLOW_NON_GPU(GpuWindowInPandasExec,PythonUDF,Alias,WindowExpression,WindowSpecDefinition,SpecifiedWindowFrame,UnboundedPreceding$,UnboundedFollowing$)]
10:14:01 FAILED src/main/python/udf_cudf_test.py::test_cogroup[ALLOW_NON_GPU(GpuFlatMapCoGroupsInPandasExec,PythonUDF)]

10:14:01  E                   py4j.protocol.Py4JJavaError: An error occurred while calling o2084.collectToPython.
10:14:01  E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 35.0 failed 4 times, most recent failure: Lost task 8.3 in stage 35.0 (TID 443, 10.233.122.216, executor 0): java.io.EOFException
10:14:01  E                   	at java.io.DataInputStream.readInt(DataInputStream.java:392)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:120)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:136)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
10:14:01  E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
10:14:01  E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:131)
10:14:01  E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
10:14:01  E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
10:14:01  E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:135)
10:14:01  E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
10:14:01  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
10:14:01  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
10:14:01  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
10:14:01  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
10:14:01  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
10:14:01  E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
10:14:01  E                   	at org.apache.spark.scheduler.Task.run(Task.scala:127)
10:14:01  E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
10:14:01  E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
10:14:01  E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
10:14:01  E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
10:14:01  E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
10:14:01  E                   	at java.lang.Thread.run(Thread.java:748)
10:14:01  E                   
10:14:01  E                   Driver stacktrace:
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
10:14:01  E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
10:14:01  E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
10:14:01  E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
10:14:01  E                   	at scala.Option.foreach(Option.scala:407)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
10:14:01  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
10:14:01  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
10:14:01  E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
10:14:01  E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
10:14:01  E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
10:14:01  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
10:14:01  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
10:14:01  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
10:14:01  E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
10:14:01  E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
10:14:01  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
10:14:01  E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
10:14:01  E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
10:14:01  E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
10:14:01  E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
10:14:01  E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3450)
10:14:01  E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
10:14:01  E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
10:14:01  E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
10:14:01  E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
10:14:01  E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
10:14:01  E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
10:14:01  E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
10:14:01  E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3447)
10:14:01  E                   	at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
10:14:01  E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
10:14:01  E                   	at java.lang.reflect.Method.invoke(Method.java:498)
10:14:01  E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
10:14:01  E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
10:14:01  E                   	at py4j.Gateway.invoke(Gateway.java:282)
10:14:01  E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
10:14:01  E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
10:14:01  E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
10:14:01  E                   	at java.lang.Thread.run(Thread.java:748)
10:14:01  E                   Caused by: java.io.EOFException
10:14:01  E                   	at java.io.DataInputStream.readInt(DataInputStream.java:392)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:120)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:136)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:135)
10:14:01  E                   	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
10:14:01  E                   	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
10:14:01  E                   	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:131)
10:14:01  E                   	at org.apache.spark.sql.execution.python.PandasGroupUtils$.executePython(PandasGroupUtils.scala:44)
10:14:01  E                   	at org.apache.spark.sql.execution.python.rapids.GpuPandasUtils$.executePython(GpuPandasUtils.scala:35)
10:14:01  E                   	at org.apache.spark.sql.rapids.execution.python.GpuFlatMapCoGroupsInPandasExec.$anonfun$doExecute$1(GpuFlatMapCoGroupsInPandasExec.scala:135)
10:14:01  E                   	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
10:14:01  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
10:14:01  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
10:14:01  E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
10:14:01  E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
10:14:01  E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
10:14:01  E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
10:14:01  E                   	at org.apache.spark.scheduler.Task.run(Task.scala:127)
10:14:01  E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
10:14:01  E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
10:14:01  E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
10:14:01  E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
10:14:01  E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
10:14:01  E                   	... 1 more
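The failure mode above, a `java.io.EOFException` from `DataInputStream.readInt` inside `PythonWorkerFactory`, means the JVM was reading an int from the Python daemon it had just launched and hit end-of-stream because the Python side died before replying. A minimal, stdlib-only sketch of that handshake pattern (names and the port value are made up for illustration; this is not Spark's actual code):

```python
import struct
import subprocess
import sys

# Hypothetical sketch of the PythonWorkerFactory handshake: the JVM
# launches a Python daemon and reads a 4-byte big-endian int from it
# (DataInputStream.readInt). If the Python side exits before writing
# those bytes -- e.g. the worker environment is broken and the daemon
# crashes on startup -- the read sees end-of-stream, which the JVM
# surfaces as java.io.EOFException.

def read_int(stream):
    """Mimic DataInputStream.readInt: read exactly 4 big-endian bytes."""
    raw = stream.read(4)
    if len(raw) < 4:
        raise EOFError("stream closed before a full int was received")
    return struct.unpack(">i", raw)[0]

# A "healthy daemon": writes its port as 4 bytes and exits.
ok = subprocess.Popen(
    [sys.executable, "-c",
     "import sys, struct; sys.stdout.buffer.write(struct.pack('>i', 45678))"],
    stdout=subprocess.PIPE,
)
print(read_int(ok.stdout))  # 45678

# A "crashing daemon": exits without writing anything, like a worker
# whose startup fails -- the reader gets EOF instead of a port number.
bad = subprocess.Popen(
    [sys.executable, "-c", "raise SystemExit(1)"],
    stdout=subprocess.PIPE,
)
try:
    read_int(bad.stdout)
except EOFError as exc:
    print("EOF:", exc)
```

This is consistent with an environment problem (as discussed below) rather than a code change: if the Python worker cannot start at all, the JVM side only ever sees the dropped stream.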
@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 26, 2021
@tgravescs (Collaborator, Author):

@firestarman @GaryShen2008

@tgravescs tgravescs added the P0 Must have for release label Mar 26, 2021
@viadea (Collaborator) commented Mar 26, 2021:

I remember running this test a couple of days ago and it succeeded, though I had to tweak a lot of configs to make it work.
Here are my tests

@firestarman (Collaborator):

Did they fail in all the Spark versions?

@GaryShen2008 (Collaborator):

> Did they fail in all the Spark versions?

From the Jenkins job page, it failed on all the Spark versions (3.0.1, 3.0.2, and 3.1.2).
It started on Mar 26.

@firestarman (Collaborator) commented Mar 29, 2021:

It passed in the latest run, build no. 127.

@NvTimLiu (Collaborator) commented Mar 29, 2021:

Build 131 passed, too.

@firestarman (Collaborator):

We can close this if the next build passes.

@sameerz (Collaborator) commented Mar 30, 2021:

@firestarman do we know why this was broken and is now working?

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Mar 30, 2021
@sameerz sameerz added this to the Mar 30 - Apr 9 milestone Mar 30, 2021
@firestarman (Collaborator) commented Mar 31, 2021:

> @firestarman do we know why this was broken and is now working?

@sameerz Actually I have no idea; it may be related to the environment, since nothing was changed in the code.

@sameerz (Collaborator) commented Apr 6, 2021:

Spoke with @GaryShen2008, and we have not seen this failure since it happened. Closing for now; if it happens again, we will need to investigate.

@sameerz sameerz closed this as completed Apr 6, 2021
Projects: none yet
Development: no branches or pull requests
6 participants