Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1601 & SPARK-1602: two bug fixes related to cancellation #521

Closed
wants to merge 3 commits into from

Conversation

rxin
Copy link
Contributor

@rxin rxin commented Apr 24, 2014

This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

  1. Do not put partially executed partitions into cache (in task killing).
  2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14419/

@aarondav
Copy link
Contributor

Test failure seems unrelated since the test does not seem to construct an RDD or interact with a SparkContext or InterruptibleIterator at all.

Jenkins, retest this please.

@@ -206,3 +235,9 @@ class JobCancellationSuite extends FunSuite with ShouldMatchers with BeforeAndAf
}
}
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra new line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually two blank lines to separate top level objects is a habit and sometimes recommended style :)

@aarondav
Copy link
Contributor

Note that this also fixes a bug where the iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

@aarondav
Copy link
Contributor

I created SPARK-1601 and SPARK-1602 to track the issues fixed by this PR. As SPARK-1602 is the more serious issue, and the main one fixed here, please put it in the PR title.

@rxin rxin changed the title Fixed a bug that partially executed partitions can be put into cache (in task killing). SPARK-1601 & SPARK-1602: Fixed a bug that partially executed partitions can be put into cache (in task killing). Apr 24, 2014
@rxin rxin changed the title SPARK-1601 & SPARK-1602: Fixed a bug that partially executed partitions can be put into cache (in task killing). SPARK-1601 & SPARK-1602: Do not put partially executed partitions into cache (in task killing). Apr 24, 2014
@rxin rxin changed the title SPARK-1601 & SPARK-1602: Do not put partially executed partitions into cache (in task killing). SPARK-1601 & SPARK-1602: bug fix related to cancellation Apr 24, 2014
@rxin rxin changed the title SPARK-1601 & SPARK-1602: bug fix related to cancellation SPARK-1601 & SPARK-1602: two bug fixes related to cancellation Apr 24, 2014
…into kill

Conflicts:
	core/src/test/scala/org/apache/spark/JobCancellationSuite.scala
@rxin
Copy link
Contributor Author

rxin commented Apr 24, 2014

Ok I brought this up to date. There are couple commits that show up because the asf git bot hasn't sync those commits to github yet.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14425/

@mateiz
Copy link
Contributor

mateiz commented Apr 24, 2014

Looks good to me.

@rxin
Copy link
Contributor Author

rxin commented Apr 24, 2014

Thanks. I've merged this.

asfgit pushed a commit that referenced this pull request Apr 24, 2014
This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).

2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

Author: Reynold Xin <rxin@apache.org>

Closes #521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).

(cherry picked from commit 1fdf659)
Signed-off-by: Reynold Xin <rxin@apache.org>
@asfgit asfgit closed this in 1fdf659 Apr 24, 2014
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 24, 2014
This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).

2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

Author: Reynold Xin <rxin@apache.org>

Closes apache#521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).

Conflicts:
	core/src/main/scala/org/apache/spark/CacheManager.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/test/scala/org/apache/spark/JobCancellationSuite.scala
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 28, 2014
This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).

2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

Author: Reynold Xin <rxin@apache.org>

Closes apache#521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).

Conflicts:
	core/src/main/scala/org/apache/spark/CacheManager.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/test/scala/org/apache/spark/JobCancellationSuite.scala

Conflicts:
	core/src/main/scala/org/apache/spark/CacheManager.scala
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
This should go into 1.0 since it would return wrong data when the bug happens (which is pretty likely if cancellation is used). Test case attached.

1. Do not put partially executed partitions into cache (in task killing).

2. Iterator returned by CacheManager#getOrCompute was not an InterruptibleIterator, and was thus leading to uninterruptible jobs.

Thanks @aarondav and @ahirreddy for reporting and helping debug.

Author: Reynold Xin <rxin@apache.org>

Closes apache#521 from rxin/kill and squashes the following commits:

401033f [Reynold Xin] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into kill
7a7bdd2 [Reynold Xin] Add a new line in the end of JobCancellationSuite.scala.
35cd9f7 [Reynold Xin] Fixed a bug that partially executed partitions can be put into cache (in task killing).
@rxin rxin deleted the kill branch August 13, 2014 08:01
helenyugithub pushed a commit to helenyugithub/spark that referenced this pull request Aug 20, 2019
* Add back local file mounting.

* Commit back entrypoint changes, add unit test

* Remove unnecessary whitespace change
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
After talking with mada, 2 TF jobs running in parallel can also present
the performance comparing, and will avoid the CI job timeout of kubeflow
job.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants