Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6955] Perform port retries at NettyBlockTransferService level #5575

Closed
wants to merge 4 commits into from

Conversation

aarondav
Copy link
Contributor

Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.

@SparkQA
Copy link

SparkQA commented Apr 18, 2015

Test build #30526 has started for PR 5575 at commit c671f6f.

@SparkQA
Copy link

SparkQA commented Apr 18, 2015

Test build #30526 has finished for PR 5575 at commit c671f6f.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30526/
Test FAILed.

@aarondav
Copy link
Contributor Author

cc @SaintBacchus @andrewor13 @vanzin This patch aims to resolve SPARK-6955 while keeping the fix for SPARK-5444.

@aarondav
Copy link
Contributor Author

err, cc @andrewor14, not @andrewor13.

@SparkQA
Copy link

SparkQA commented Apr 18, 2015

Test build #30527 has started for PR 5575 at commit a3dc42d.

@SparkQA
Copy link

SparkQA commented Apr 18, 2015

Test build #30527 timed out for PR 5575 at commit a3dc42d after a configured wait of 150m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30527/
Test FAILed.

@aarondav
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30548 has started for PR 5575 at commit a3dc42d.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30548 timed out for PR 5575 at commit a3dc42d after a configured wait of 150m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30548/
Test FAILed.

@aarondav
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30568 has started for PR 5575 at commit a3dc42d.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30568 timed out for PR 5575 at commit a3dc42d after a configured wait of 150m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30568/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Apr 19, 2015

Test build #30569 has started for PR 5575 at commit df60b80.

@SparkQA
Copy link

SparkQA commented Apr 20, 2015

Test build #30569 has finished for PR 5575 at commit df60b80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30569/
Test PASSed.

@SaintBacchus
Copy link
Contributor

I seems to be a better way. Thanks for taking time to solve it.

val port = 17634
service0 = createService(port)
service0.port should be >= port
service0.port should be <= (port + 10) // avoid testing equality in case of simultaneous tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems flaky. Isn't the previous check enough? Otherwise, instead of 10 I'd use 1024 which is the actual limit in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can increase the number no problem, though I guess we should use Utils.portMaxRetries instead. However, note that even 10 should be very unlikely to be hit, that would mean there's 5-10 concurrently running tests at this exact part of the build, or else we're very, very unlucky.

@vanzin
Copy link
Contributor

vanzin commented Apr 20, 2015

LGTM aside from the potentially flaky test.

@@ -43,6 +43,9 @@ class ExternalShuffleServiceSuite extends ShuffleSuite with BeforeAndAfterAll {
conf.set("spark.shuffle.manager", "sort")
conf.set("spark.shuffle.service.enabled", "true")
conf.set("spark.shuffle.service.port", server.getPort.toString)

// local-cluster mode starts a Worker which would start its own shuffle service without this:
conf.set("spark.worker.shouldHostShuffleServiceIfEnabled", "false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually ever want to start an external shuffle service in a local cluster? If not I think it makes more sense to just set spark.shuffle.service.enabled to false in LocalSparkCluster (we already do this for the REST submission server for Master)

@SparkQA
Copy link

SparkQA commented Apr 27, 2015

Test build #31002 has started for PR 5575 at commit df60b80.

@AmplabJenkins
Copy link

Merged build triggered.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31568 timed out for PR 5575 at commit d38075b after a configured wait of 150m.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31568/
Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31584 has started for PR 5575 at commit ca1151d.

@SparkQA
Copy link

SparkQA commented May 1, 2015

Test build #31584 has finished for PR 5575 at commit ca1151d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class Binarizer extends Transformer with HasInputCol with HasOutputCol

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31584/
Test PASSed.

Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behvior for the Yarn and Standalone shuffle services.
@aarondav
Copy link
Contributor Author

aarondav commented May 8, 2015

cc @andrewor14 @pwendell We should consider merging this into the 1.4 branch in case there is another RC. It has been an outstanding issue for a while.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32254 has started for PR 5575 at commit 3c2d6ed.

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32254 has finished for PR 5575 at commit 3c2d6ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class EnumUtil
    • class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class IDF(JavaEstimator, HasInputCol, HasOutputCol):
    • class IDFModel(JavaModel):
    • class Normalizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol):
    • class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol):
    • class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol):
    • class StandardScalerModel(JavaModel):
    • class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class StringIndexerModel(JavaModel):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
    • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol):
    • class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
    • class Word2VecModel(JavaModel):
    • class HasSeed(Params):
    • class HasTol(Params):
    • class HasStepSize(Params):

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32254/
Test PASSed.

@andrewor14
Copy link
Contributor

Nice, I merge.

@andrewor14
Copy link
Contributor

master 1.4

@asfgit asfgit closed this in ffdc40c May 9, 2015
asfgit pushed a commit that referenced this pull request May 9, 2015
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.

Author: Aaron Davidson <aaron@databricks.com>

Closes #5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level

(cherry picked from commit ffdc40c)
Signed-off-by: Andrew Or <andrew@databricks.com>
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Currently we're doing port retries in the TransportServer level, but this is not specified by the TransportContext API and it has other further-reaching impacts like causing undesirable behavior for the Yarn and Standalone shuffle services.

Author: Aaron Davidson <aaron@databricks.com>

Closes apache#5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants