SPY-302 backporting of SPARK-1620, SPARK-1685, SPARK-1686, SPARK-1772 #7

Merged 5 commits on May 14, 2014

Commits on May 14, 2014

  1. SPARK-1772 Stop catching Throwable, let Executors die

    The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.
    
    This patch also continues the fight in the never-ending war against `case t: Throwable =>`, by catching only Exceptions in many places, and by adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.
    
    It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
    [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.
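
    The distinction the message draws between catching `Exception` and catching `Throwable` can be illustrated with a minimal sketch. This is not the code from the patch: `ThrowableSketch`, `runTask`, `loggingRunnable`, and the in-memory `logged` buffer are hypothetical names used only for illustration.

    ```scala
    object ThrowableSketch {
      // Hypothetical stand-in for a logger; collects messages so the behavior is observable.
      val logged = scala.collection.mutable.ArrayBuffer.empty[String]

      // Catches only Exception. An Error such as OutOfMemoryError is not an Exception,
      // so it propagates to the caller and, eventually, to the thread's uncaught exception handler.
      def runTask[T](body: => T): Option[T] =
        try Some(body)
        catch {
          case e: Exception =>
            logged += s"task failed: ${e.getMessage}"
            None
        }

      // Wraps a Runnable so that anything escaping it is logged and then rethrown,
      // rather than disappearing silently.
      def loggingRunnable(r: Runnable): Runnable = new Runnable {
        override def run(): Unit =
          try r.run()
          catch {
            case t: Throwable =>
              logged += s"uncaught: ${t.getClass.getSimpleName}"
              throw t
          }
      }
    }
    ```

    With this shape, `ThrowableSketch.runTask(throw new IllegalStateException("boom"))` returns `None`, while a simulated `OutOfMemoryError` thrown inside `runTask` escapes to the caller instead of being swallowed.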
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#715 from aarondav/throwable and squashes the following commits:
    
    f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
    e937a0a [Aaron Davidson] Address Prashant and Matei's comments
    1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
    (cherry picked from commit 3af1f38)
    
    Signed-off-by: Patrick Wendell <pwendell@gmail.com>
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/ContextCleaner.scala
    	core/src/main/scala/org/apache/spark/SparkContext.scala
    	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
    	core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala
    	core/src/main/scala/org/apache/spark/deploy/Client.scala
    	core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala
    	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
    	core/src/main/scala/org/apache/spark/deploy/worker/DriverWrapper.scala
    	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
    	core/src/main/scala/org/apache/spark/executor/Executor.scala
    	core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala
    	core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala
    	core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
    	core/src/main/scala/org/apache/spark/storage/TachyonBlockManager.scala
    	core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
    	core/src/main/scala/org/apache/spark/util/Utils.scala
    aarondav authored and markhamstra committed May 14, 2014
    Commit be93773
  2. [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler
    
    If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that.  I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions.
    
    In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620).
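
    The general idea of routing exceptions from scheduled functions to an uncaught exception handler can be sketched as follows. The commit list mentions `Utils.tryOrExit`, but its implementation is not shown here, so this is only an assumption about the shape of such a wrapper. `TryOrExitSketch`, `tryOrHandle`, and `runScheduled` are hypothetical names, and a JDK `ScheduledExecutorService` stands in for the Akka scheduler.

    ```scala
    import java.util.concurrent.{Executors, TimeUnit}

    object TryOrExitSketch {
      // Runs `block` and forwards anything it throws to `handler` on the current thread,
      // instead of leaving it to whatever the scheduler does with failed tasks.
      def tryOrHandle(handler: Thread.UncaughtExceptionHandler)(block: => Unit): Unit =
        try block
        catch {
          case t: Throwable => handler.uncaughtException(Thread.currentThread(), t)
        }

      // Stand-in for a scheduled function: runs the wrapped block once on a
      // scheduler thread and waits for it to finish.
      def runScheduled(handler: Thread.UncaughtExceptionHandler)(block: => Unit): Unit = {
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        try {
          val task = new Runnable {
            override def run(): Unit = tryOrHandle(handler)(block)
          }
          scheduler.schedule(task, 10L, TimeUnit.MILLISECONDS).get()
          ()
        } finally scheduler.shutdown()
      }
    }
    ```

    A handler passed to `runScheduled` sees the exception thrown by the block, even though the block ran on the scheduler's thread rather than the caller's.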
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#622 from markhamstra/SPARK-1620 and squashes the following commits:
    
    071d193 [Mark Hamstra] refactored post-SPARK-1772
    1a6a35e [Mark Hamstra] another style fix
    d30eb94 [Mark Hamstra] scalastyle
    3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit
    8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler
    (cherry picked from commit 17f3075)
    
    Signed-off-by: Patrick Wendell <pwendell@gmail.com>
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
    	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
    	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
    	core/src/main/scala/org/apache/spark/util/Utils.scala
    markhamstra committed May 14, 2014
    Commit 7cfc5b4
  3. [SPARK-1685] Cancel retryTimer on restart of Worker or AppClient

    See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: if the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to fewer than the full number of registration retries being attempted before the new actor is forced to give up.
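
    A minimal sketch of the fix's idea, cancelling the pending timer when the component stops so that a restarted instance does not run alongside a stale one. The field name `registrationRetryTimer` comes from the commit list below; the `Registrar` class, its `start`/`stop` methods, and the use of a JDK `ScheduledExecutorService` in place of the Akka scheduler are assumptions made for illustration.

    ```scala
    import java.util.concurrent.{ScheduledExecutorService, ScheduledFuture, TimeUnit}

    // Hypothetical registering component; `start` and `stop` stand in for actor lifecycle hooks.
    class Registrar(scheduler: ScheduledExecutorService, retryIntervalMs: Long, retry: () => Unit) {
      private var registrationRetryTimer: Option[ScheduledFuture[_]] = None

      // Begins periodic registration retries and remembers the handle.
      def start(): Unit = {
        val task = new Runnable { override def run(): Unit = retry() }
        registrationRetryTimer = Some(
          scheduler.scheduleAtFixedRate(task, retryIntervalMs, retryIntervalMs, TimeUnit.MILLISECONDS))
      }

      // Cancels the outstanding timer so it cannot keep firing after a restart.
      def stop(): Unit = {
        registrationRetryTimer.foreach(_.cancel(true))
        registrationRetryTimer = None
      }

      def timerActive: Boolean = registrationRetryTimer.exists(!_.isCancelled)
    }
    ```

    Calling `stop()` on the old instance before a new one calls `start()` leaves exactly one active timer on the shared scheduler.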
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits:
    
    11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
    69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
    	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
    markhamstra committed May 14, 2014
    Commit 989e184
  4. Fixed mis-merge

    markhamstra committed May 14, 2014
    Commit 75c8638
  5. SPARK-1686: keep schedule() calling in the main thread

    https://issues.apache.org/jira/browse/SPARK-1686
    
    moved from original JIRA (by @markhamstra):
    
    In deploy.master.Master, the completeRecovery method is the last thing to be called when a standalone Master is recovering from failure. It is responsible for resetting some state, relaunching drivers, and eventually resuming its scheduling duties.
    
    There are currently four places in Master.scala where completeRecovery is called. Three of them are from within the actor's receive method, and aren't problems. The last starts from within receive when the ElectedLeader message is received, but the actual completeRecovery() call is made from the Akka scheduler. That means that it will execute on a different scheduler thread, and Master itself will end up running (i.e., calling schedule()) from that Akka scheduler thread.
    
    In this PR, I added a new master message, TriggerSchedule, to trigger the "local" call of schedule() in the scheduler thread.
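
    The pattern of posting a message back to the actor, rather than calling into it directly from a timer thread, can be sketched without Akka. This is only an illustration of the threading idea and not the Master code: the message name `TriggerSchedule` is taken from the text above, while `MiniMaster`, `recoverLater`, and the two JDK executors (one standing in for the actor's mailbox thread, one for the Akka scheduler) are assumptions.

    ```scala
    import java.util.concurrent.{Callable, CountDownLatch, Executors, TimeUnit}
    import java.util.concurrent.atomic.AtomicReference

    sealed trait Message
    case object TriggerSchedule extends Message

    class MiniMaster {
      private val mailbox = Executors.newSingleThreadExecutor()        // stand-in for the actor thread
      private val timer = Executors.newSingleThreadScheduledExecutor() // stand-in for the Akka scheduler

      val scheduleThread = new AtomicReference[Thread]()
      val done = new CountDownLatch(1)

      // The single thread on which all messages are processed.
      val actorThread: Thread =
        mailbox.submit(new Callable[Thread] { override def call(): Thread = Thread.currentThread() }).get()

      // Records which thread ran the scheduling step.
      private def schedule(): Unit = {
        scheduleThread.set(Thread.currentThread())
        done.countDown()
      }

      private def receive(msg: Message): Unit = msg match {
        case TriggerSchedule => schedule()
      }

      // Enqueues a message; it is handled later on the mailbox thread.
      def send(msg: Message): Unit =
        mailbox.execute(new Runnable { override def run(): Unit = receive(msg) })

      // The timer thread only posts a message; it never calls schedule() itself.
      def recoverLater(delayMs: Long): Unit = {
        timer.schedule(new Runnable { override def run(): Unit = send(TriggerSchedule) },
          delayMs, TimeUnit.MILLISECONDS)
        ()
      }

      def shutdown(): Unit = { timer.shutdownNow(); mailbox.shutdownNow(); () }
    }
    ```

    After `recoverLater` fires, `scheduleThread` holds the mailbox thread rather than the timer thread, which is the property the commit title describes.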
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes apache#639 from CodingCat/SPARK-1686 and squashes the following commits:
    
    81bb4ca [CodingCat] rename variable
    69e0a2a [CodingCat] style fix
    36a2ac0 [CodingCat] address Aaron's comments
    ec9b7bb [CodingCat] address the comments
    02b37ca [CodingCat] keep schedule() calling in the main thread
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
    CodingCat authored and markhamstra committed May 14, 2014
    Commit 76dc266