
[BUG] IT hangs when running locally with Java max memory of 25G and a non-UTC time zone. #9915

Closed
res-life opened this issue Dec 1, 2023 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

res-life (Collaborator) commented Dec 1, 2023

Describe the bug
This is split out from #9829.
The IT hangs when running locally with Java max memory set to 25G and a non-UTC time zone.
It's caused by a heartbeat timeout.

Steps/Code to reproduce bug

  1. export TEST_PARALLEL=1
  2. export TZ=Iran
  3. Set the driver memory to 32g via the following change to run_pyspark_from_build.sh.
    Setting 32g here is to avoid a CPU OOM; refer to #9829.
-        exec "$SPARK_HOME"/bin/spark-submit "${jarOpts[@]}" \
+        exec "$SPARK_HOME"/bin/spark-submit --driver-memory 32g "${jarOpts[@]}" \
  4. Run the IT (a consolidated sketch follows this list).
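
For convenience, a consolidated reproduction sketch is shown below. It assumes the spark-rapids repo layout (run_pyspark_from_build.sh lives under integration_tests) and that the --driver-memory 32g edit above has been applied; the SPARK_HOME path is illustrative.

# Consolidated repro sketch; assumes the --driver-memory 32g change above is applied.
export TEST_PARALLEL=1                 # single pytest worker
export TZ=Iran                         # any non-UTC time zone
export SPARK_HOME=/path/to/spark       # local Spark install (illustrative path)

cd integration_tests
./run_pyspark_from_build.sh            # runs the ITs; hangs after a while

When the hang happens, the driver log shows the heartbeat timeout and the subsequent RPC failures: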
### ITERATOR: GPU TOOK 0.5742542743682861 CPU TOOK 3.6484925746917725 ###
23/11/30 22:03:21 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 158268 ms exceeds timeout 120000 ms
23/11/30 22:03:21 WARN SparkContext: Killing executors is not supported by current scheduler.
23/11/30 22:29:29 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
  at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
  at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
  at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:112)
  at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$driverEndpoint(BlockManagerMasterEndpoint.scala:111)
  at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$handleBlockRemovalFailure$1.applyOrElse(BlockManagerMasterEndpoint.scala:226)
  at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$handleBlockRemovalFailure$1.applyOrElse(BlockManagerMasterEndpoint.scala:217)
  at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
  at scala.util.Failure.recover(Try.scala:234)
  at scala.concurrent.Future.$anonfun$recover$1(Future.scala:395)
  at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
  at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@chongg-pc:44589
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
  at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
  at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.complete(Promise.scala:53)
  at scala.concurrent.Promise.complete$(Promise.scala:52)
  at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
  at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.complete(Promise.scala:53)
  at scala.concurrent.Promise.complete$(Promise.scala:52)
  at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
  at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
  at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
  at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67)
  at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
  at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59)
  at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875)
  at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110)
  at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
  at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
  at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
  at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
  at scala.concurrent.Promise.trySuccess(Promise.scala:94)
  at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
  at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
  at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$5(NettyRpcEnv.scala:239)
  at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$5$adapted(NettyRpcEnv.scala:238)
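
One way to check whether the heartbeat timeout is only a symptom (rather than the root cause) is to give the driver more slack before its executor is declared dead. This is a diagnostic sketch, not something tried in this issue; spark.executor.heartbeatInterval and spark.network.timeout are standard Spark configs, and the rest of the command mirrors the edited script line above (the trailing arguments are elided).

# Diagnostic sketch (not part of the original repro): raise the heartbeat and
# network timeouts so long GC pauses do not get the driver executor removed.
exec "$SPARK_HOME"/bin/spark-submit --driver-memory 32g \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s \
  "${jarOpts[@]}" \
  ...

If the run then finishes, or fails with an explicit OutOfMemoryError instead of hanging, that points back at GC pressure in the driver JVM as the underlying problem.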

Environment details (please complete the following information)
Spark 3.1.1 (spark311)

Additional context
The error occurred when running the case mortgage_test.py::test_mortgage.
It may be related to slow GC, and it may be related to the JVM OOM in #9829.
From the log, it does not report a JVM OOM or GPU OOM.
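
To narrow the reproduction to this single case, the IT run can be filtered to just that test. A hypothetical sketch, assuming run_pyspark_from_build.sh supports the usual TEST filter variable (which maps to pytest -k); check the script if the variable name differs.

# Hypothetical: run only the suspect test, with the same environment as above.
export TEST_PARALLEL=1
export TZ=Iran
TEST='test_mortgage' ./run_pyspark_from_build.sh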

@res-life res-life added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 1, 2023
@res-life res-life self-assigned this Dec 1, 2023
res-life (Collaborator, Author) commented Dec 1, 2023

Actually it's the same as #9829.
In #9829 the IT reports OutOfMemory, while this issue did not report OutOfMemory directly. However, the GC log also reports OutOfMemory.

it-gc.log

Referring to the last lines of this file, it is constantly doing Full GC.
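
For reference, a minimal sketch (not from the original report) of how such a GC log can be produced and inspected. The flags assume a HotSpot Java 8 driver, which matches the Thread.java:750 frames in the trace above; --driver-java-options is a standard spark-submit flag.

# Sketch: write a driver GC log, then check whether its tail is dominated by Full GC.
exec "$SPARK_HOME"/bin/spark-submit --driver-memory 32g \
  --driver-java-options "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/it-gc.log" \
  "${jarOpts[@]}" \
  ...

# After (or during) the run:
grep -c 'Full GC' /tmp/it-gc.log   # how many Full GC cycles occurred
tail -n 40 /tmp/it-gc.log          # back-to-back Full GCs that reclaim little space => heap exhaustion

Repeated Full GC cycles that free almost nothing are the classic signature of a JVM out of heap, even when no OutOfMemoryError surfaces on the main path, which is consistent with treating this as a duplicate of #9829.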

@res-life res-life closed this as completed Dec 1, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 1, 2023
@mattahrens mattahrens closed this as not planned Dec 1, 2023