
SPARK-1974. Most examples fail at startup because spark.master is not set #926

Closed
srowen wants to merge 2 commits from the SPARK-1974 branch

Conversation

srowen (Member) commented May 30, 2014

Most example code has a few lines like:

val sparkConf = new SparkConf().setAppName("Foo")
val sc = new SparkContext(sparkConf)

However, the SparkContext constructor throws a SparkException if spark.master is not set, so this fails immediately.

This changes all examples to call new SparkContext("local[2]", "Foo") or similar. local[2] is used because it's necessary for the streaming examples, and because elsewhere in Spark it is preferred over local[1] as a default.
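Concretely, the examples change to something like this (a sketch; app names vary per example):

```
import org.apache.spark.{SparkConf, SparkContext}

// Before (fails at startup because spark.master is unset):
//   val sc = new SparkContext(new SparkConf().setAppName("Foo"))

// After (hard-codes a local master so the example runs stand-alone):
val sc = new SparkContext("local[2]", "Foo")
```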

(Since this started off while debugging the Kafka streaming code, I also included some refinements there to logging and resource management. Only lightly related.)

AmplabJenkins: Merged build triggered.

AmplabJenkins: Merged build started.

vanzin (Contributor) commented May 30, 2014

Won't this break things if you try to submit the examples with spark-submit (which I think is the New And Approved Way (tm))?

spark-submit will set spark.master, and if I'm following things correctly, this change will override that.

srowen (Member, Author) commented May 30, 2014

Hm, good question. You could make it work with -Dspark.master=..., but this change would overwrite that setting. It seems like the examples are intended to work without that setting, though. Would it be better to make the global internal default something like local[2]?

AmplabJenkins: Merged build finished.

AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15309/

vanzin (Contributor) commented May 31, 2014

If you really want to keep the tests working outside of spark-submit, I'd suggest using the SparkContext(SparkConf) constructor instead, and using SparkConf.setIfMissing() to set "spark.master".
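That suggestion would look roughly like this (a sketch; setIfMissing is an existing SparkConf method, and local[2] is just an illustrative default):

```
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setAppName("Foo")
  .setIfMissing("spark.master", "local[2]") // no-op when spark-submit has already set the master
val sc = new SparkContext(sparkConf)
```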

Given that bin/run-example uses spark-submit, I'm not sure we should bother though.

srowen (Member, Author) commented May 31, 2014

Yeah, I think it's essential not to prevent -Dspark.master=... from working; oops. I think it may also be useful to have this work when one copies and pastes, as I just did. The javadoc doesn't indicate that you have to set the master, either. I will rework it to use setIfMissing().

AmplabJenkins: Merged build triggered.

AmplabJenkins: Merged build started.

srowen (Member, Author) commented May 31, 2014

I pushed again, with setIfMissing. Would it be better in the SparkConf constructor, or am I off base here?

AmplabJenkins: Merged build finished.

AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15321/

pwendell (Contributor) commented:

Hey Sean, how are you running the examples? Are you using the run-example script? That script should set the master to local[*] if the user hasn't specified it, which will use all cores locally. I think in some cases we might need to update the javadocs in the examples to tell users to use run-example.

srowen (Member, Author) commented May 31, 2014

I'm just copying and pasting to get something similar running externally. Maybe it's a little surprising that the example code doesn't work that way -- being in a main() kind of suggests it's a stand-alone program. Maybe that's just me.

I think there are a few possibilities:

  • Change all example code to set master if missing (that's what the current PR does)
  • Change SparkConf to do something similar as a global default
  • Just update javadoc to make it clear that the examples require the spark.master system property to be set

I slightly prefer one of the first two on the principle of least surprise, but can go any direction. I think at least the third should be done. What say everyone?

vanzin (Contributor) commented Jun 2, 2014

Hey Sean,

I'm still a little confused about what it is you're doing. What is the javadoc you refer to? I've looked at a few classes in org.apache.spark.examples and I don't really see a lot of comments.

If you're using spark-submit or run-example, you shouldn't be running into this issue ("spark.master" not set) at all. You'd only run into it if using spark-class directly.

In 0.9 the examples (or at least a few of them) used to take the master as the first argument. In 1.0.0 the approach taken seems to be to just ignore that argument, since it's provided automatically by spark-submit. Your command line is "backwards compatible", although that one argument is just ignored. I don't know how important that is for examples, though - I'd rather the examples serve as a model of how to write an app, and having that old argument there kind of defeats that purpose.

So, if there are docs telling people to run the examples directly with java or spark-class, we should fix those to use run-example / spark-submit instead.

Perhaps a future approach could be to have a "SparkApp" base class that calls a "run(SparkContext)" method so that initializing the context clearly becomes the job of the framework. But that approach has all sorts of other issues.
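Such a base class might look roughly like this (purely hypothetical; no SparkApp class exists in Spark):

```
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical: the framework owns the context lifecycle; examples only
// implement run(), so they can't get master configuration wrong.
abstract class SparkApp {
  def run(sc: SparkContext): Unit

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    try run(sc) finally sc.stop()
  }
}
```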

syedhashmi commented:

@vanzin: I was running into this issue while trying to run and debug the examples from IntelliJ. It worked fine pre-1.0.

vanzin (Contributor) commented Jun 2, 2014

Ah, that case makes sense. For that I think Sean's current fix should be enough. But still, if there's documentation telling users to run examples that way, it should probably be fixed.

srowen (Member, Author) commented Jun 2, 2014

@vanzin I was just copying and pasting the substance of the main() method from an example in order to modify it. You are right that most examples tell you to use run-example, and that works. One asks you to set a master argument directly, but it's just the one.

Nothing wrong with that per se, but the code almost completely configures Spark, except for the master. At a glance, I thought it wasn't intentional, and that spark.master was supposed to default to local[2].

I don't quite like my change -- a global default is much simpler but has perhaps more implications. I am not really wedded to inserting a default.

I could instead make sure every bit of example code notes in its javadoc that run-example should be used. (Really, this all started because I wanted to suggest a few touch-ups to KafkaInputDStream too, so I wouldn't mind slipping that in.)

mateiz (Contributor) commented Jun 3, 2014

I don't think we want to change the examples like this. Instead, we should make the default spark.master be local rather than throwing an exception (I believe there was a JIRA for that). Nonetheless, the examples show how you are supposed to write a spark-submit app today, so we should leave them as they are.
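A global default along those lines might amount to something like the following (a hypothetical sketch of the fallback logic, not necessarily what was eventually implemented):

```
import org.apache.spark.SparkConf

// Hypothetical: resolve the master with a local fallback instead of
// throwing a SparkException when spark.master is unset.
def resolveMaster(conf: SparkConf): String =
  conf.getOption("spark.master").getOrElse("local[*]")
```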

srowen (Member, Author) commented Jun 3, 2014

Got it. I'm going to close this and will open a PR for the JIRA you may have in mind (https://issues.apache.org/jira/browse/SPARK-1906) to just set a spark.master default. (I'll deal with the Kafka change separately.) Then it's there for consideration.

srowen closed this Jun 3, 2014
srowen deleted the SPARK-1974 branch June 3, 2014 09:25
cloud-fan pushed a commit that referenced this pull request Jun 21, 2021
…subquery reuse

### What changes were proposed in this pull request?
This PR:
1. Fixes an issue in the `ReuseExchange` rule that can result in a `ReusedExchange` node pointing to an invalid exchange. This can happen because of the 2 separate traversals in `ReuseExchange`, when the 2nd traversal modifies an exchange that has already been referenced (reused) in the 1st traversal.
   Consider the following query:
   ```
   WITH t AS (
     SELECT df1.id, df2.k
     FROM df1 JOIN df2 ON df1.k = df2.k
     WHERE df2.id < 2
   )
   SELECT * FROM t AS a JOIN t AS b ON a.id = b.id
   ```
   Before this PR the plan of the query was (note the `<== this reuse node points to a non-existing node` marker):
   ```
   == Physical Plan ==
   *(7) SortMergeJoin [id#14L], [id#18L], Inner
   :- *(3) Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#14L, 5), true, [id=#298]
   :     +- *(2) Project [id#14L, k#17L]
   :        +- *(2) BroadcastHashJoin [k#15L], [k#17L], Inner, BuildRight
   :           :- *(2) Project [id#14L, k#15L]
   :           :  +- *(2) Filter isnotnull(id#14L)
   :           :     +- *(2) ColumnarToRow
   :           :        +- FileScan parquet default.df1[id#14L,k#15L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#15L), dynamicpruningexpression(k#15L IN dynamicpruning#26)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :           :              +- SubqueryBroadcast dynamicpruning#26, 0, [k#17L], [id=#289]
   :           :                 +- ReusedExchange [k#17L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#179]
   :           +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#179]
   :              +- *(1) Project [k#17L]
   :                 +- *(1) Filter ((isnotnull(id#16L) AND (id#16L < 2)) AND isnotnull(k#17L))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.df2[id#16L,k#17L] Batched: true, DataFilters: [isnotnull(id#16L), (id#16L < 2), isnotnull(k#17L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>
   +- *(6) Sort [id#18L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#18L, k#21L], Exchange hashpartitioning(id#14L, 5), true, [id=#184] <== this reuse node points to a non-existing node
   ```
   After this PR:
   ```
   == Physical Plan ==
   *(7) SortMergeJoin [id#14L], [id#18L], Inner
   :- *(3) Sort [id#14L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#14L, 5), true, [id=#231]
   :     +- *(2) Project [id#14L, k#17L]
   :        +- *(2) BroadcastHashJoin [k#15L], [k#17L], Inner, BuildRight
   :           :- *(2) Project [id#14L, k#15L]
   :           :  +- *(2) Filter isnotnull(id#14L)
   :           :     +- *(2) ColumnarToRow
   :           :        +- FileScan parquet default.df1[id#14L,k#15L] Batched: true, DataFilters: [isnotnull(id#14L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#15L), dynamicpruningexpression(k#15L IN dynamicpruning#26)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :           :              +- SubqueryBroadcast dynamicpruning#26, 0, [k#17L], [id=#103]
   :           :                 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#102]
   :           :                    +- *(1) Project [k#17L]
   :           :                       +- *(1) Filter ((isnotnull(id#16L) AND (id#16L < 2)) AND isnotnull(k#17L))
   :           :                          +- *(1) ColumnarToRow
   :           :                             +- FileScan parquet default.df2[id#16L,k#17L] Batched: true, DataFilters: [isnotnull(id#16L), (id#16L < 2), isnotnull(k#17L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>
   :           +- ReusedExchange [k#17L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#102]
   +- *(6) Sort [id#18L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#18L, k#21L], Exchange hashpartitioning(id#14L, 5), true, [id=#231]
   ```
2. Fixes an issue with the separate, consecutive `ReuseExchange` and `ReuseSubquery` rules that can result in a `ReusedExchange` node pointing to an invalid exchange. This can happen because the 2 rules run separately, when the `ReuseSubquery` rule modifies an exchange that has already been referenced (reused) by the `ReuseExchange` rule.
   Consider the following query:
   ```
   WITH t AS (
     SELECT df1.id, df2.k
     FROM df1 JOIN df2 ON df1.k = df2.k
     WHERE df2.id < 2
   ),
   t2 AS (
     SELECT * FROM t
     UNION
     SELECT * FROM t
   )
   SELECT * FROM t2 AS a JOIN t2 AS b ON a.id = b.id
   ```
   Before this PR the plan of the query was (note the `<== this reuse node points to a non-existing node` marker):
   ```
   == Physical Plan ==
   *(15) SortMergeJoin [id#46L], [id#58L], Inner
   :- *(7) Sort [id#46L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#46L, 5), true, [id=#979]
   :     +- *(6) HashAggregate(keys=[id#46L, k#49L], functions=[])
   :        +- Exchange hashpartitioning(id#46L, k#49L, 5), true, [id=#975]
   :           +- *(5) HashAggregate(keys=[id#46L, k#49L], functions=[])
   :              +- Union
   :                 :- *(2) Project [id#46L, k#49L]
   :                 :  +- *(2) BroadcastHashJoin [k#47L], [k#49L], Inner, BuildRight
   :                 :     :- *(2) Project [id#46L, k#47L]
   :                 :     :  +- *(2) Filter isnotnull(id#46L)
   :                 :     :     +- *(2) ColumnarToRow
   :                 :     :        +- FileScan parquet default.df1[id#46L,k#47L] Batched: true, DataFilters: [isnotnull(id#46L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#47L), dynamicpruningexpression(k#47L IN dynamicpruning#66)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :                 :     :              +- SubqueryBroadcast dynamicpruning#66, 0, [k#49L], [id=#926]
   :                 :     :                 +- ReusedExchange [k#49L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#656]
   :                 :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#656]
   :                 :        +- *(1) Project [k#49L]
   :                 :           +- *(1) Filter ((isnotnull(id#48L) AND (id#48L < 2)) AND isnotnull(k#49L))
   :                 :              +- *(1) ColumnarToRow
   :                 :                 +- FileScan parquet default.df2[id#48L,k#49L] Batched: true, DataFilters: [isnotnull(id#48L), (id#48L < 2), isnotnull(k#49L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>
   :                 +- *(4) Project [id#46L, k#49L]
   :                    +- *(4) BroadcastHashJoin [k#47L], [k#49L], Inner, BuildRight
   :                       :- *(4) Project [id#46L, k#47L]
   :                       :  +- *(4) Filter isnotnull(id#46L)
   :                       :     +- *(4) ColumnarToRow
   :                       :        +- FileScan parquet default.df1[id#46L,k#47L] Batched: true, DataFilters: [isnotnull(id#46L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#47L), dynamicpruningexpression(k#47L IN dynamicpruning#66)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :                       :              +- ReusedSubquery SubqueryBroadcast dynamicpruning#66, 0, [k#49L], [id=#926]
   :                       +- ReusedExchange [k#49L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#656]
   +- *(14) Sort [id#58L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#58L, k#61L], Exchange hashpartitioning(id#46L, 5), true, [id=#761] <== this reuse node points to a non-existing node
   ```
   After this PR:
   ```
   == Physical Plan ==
   *(15) SortMergeJoin [id#46L], [id#58L], Inner
   :- *(7) Sort [id#46L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#46L, 5), true, [id=#793]
   :     +- *(6) HashAggregate(keys=[id#46L, k#49L], functions=[])
   :        +- Exchange hashpartitioning(id#46L, k#49L, 5), true, [id=#789]
   :           +- *(5) HashAggregate(keys=[id#46L, k#49L], functions=[])
   :              +- Union
   :                 :- *(2) Project [id#46L, k#49L]
   :                 :  +- *(2) BroadcastHashJoin [k#47L], [k#49L], Inner, BuildRight
   :                 :     :- *(2) Project [id#46L, k#47L]
   :                 :     :  +- *(2) Filter isnotnull(id#46L)
   :                 :     :     +- *(2) ColumnarToRow
   :                 :     :        +- FileScan parquet default.df1[id#46L,k#47L] Batched: true, DataFilters: [isnotnull(id#46L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#47L), dynamicpruningexpression(k#47L IN dynamicpruning#66)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :                 :     :              +- SubqueryBroadcast dynamicpruning#66, 0, [k#49L], [id=#485]
   :                 :     :                 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#484]
   :                 :     :                    +- *(1) Project [k#49L]
   :                 :     :                       +- *(1) Filter ((isnotnull(id#48L) AND (id#48L < 2)) AND isnotnull(k#49L))
   :                 :     :                          +- *(1) ColumnarToRow
   :                 :     :                             +- FileScan parquet default.df2[id#48L,k#49L] Batched: true, DataFilters: [isnotnull(id#48L), (id#48L < 2), isnotnull(k#49L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,2), IsNotNull(k)], ReadSchema: struct<id:bigint,k:bigint>
   :                 :     +- ReusedExchange [k#49L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#484]
   :                 +- *(4) Project [id#46L, k#49L]
   :                    +- *(4) BroadcastHashJoin [k#47L], [k#49L], Inner, BuildRight
   :                       :- *(4) Project [id#46L, k#47L]
   :                       :  +- *(4) Filter isnotnull(id#46L)
   :                       :     +- *(4) ColumnarToRow
   :                       :        +- FileScan parquet default.df1[id#46L,k#47L] Batched: true, DataFilters: [isnotnull(id#46L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/petertoth/git/apache/spark/sql/core/spark-warehouse/org.apache.spar..., PartitionFilters: [isnotnull(k#47L), dynamicpruningexpression(k#47L IN dynamicpruning#66)], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
   :                       :              +- ReusedSubquery SubqueryBroadcast dynamicpruning#66, 0, [k#49L], [id=#485]
   :                       +- ReusedExchange [k#49L], BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])), [id=#484]
   +- *(14) Sort [id#58L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#58L, k#61L], Exchange hashpartitioning(id#46L, 5), true, [id=#793]
   ```
   (This example contains issue 1 as well.)

3. Improves the reuse of exchanges and subqueries by enabling reuse across the whole plan. By traversing the whole plan, the new combined rule can exploit reuse opportunities between parent queries and subqueries. The traversal is started on the top-level query only.

4. Due to the traversal order this PR uses while adding reuse nodes, the reuse nodes appear in parent queries when reuse is possible between different levels of queries (typical for DPP). This is not an issue from an execution perspective, but it does mean "forward references" in formatted explain output, where parent queries come first. The changes I made to `ExplainUtils` handle these references properly.

This PR addresses the points above by unifying the separate rules into a `ReuseExchangeAndSubquery` rule that does a single-pass, whole-plan, bottom-up traversal.
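In outline, the combined rule works like this sketch (a simplification; the real `ReuseExchangeAndSubquery` rule also wraps duplicate subqueries in reused-subquery nodes and uses the schema-keyed `ReuseMap` rather than a plain map):

```
import scala.collection.mutable
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.{Exchange, ReusedExchangeExec}

// Simplified sketch of a single bottom-up pass: the first occurrence of
// each semantically-equal exchange is kept; later duplicates become
// references to it, so a ReusedExchange can never point at a node that a
// later traversal then rewrites.
def reuseExchanges(plan: SparkPlan): SparkPlan = {
  val cache = mutable.Map.empty[SparkPlan, Exchange]
  plan.transformUp {
    case exchange: Exchange =>
      cache.get(exchange.canonicalized) match {
        case Some(first) => ReusedExchangeExec(exchange.output, first)
        case None =>
          cache(exchange.canonicalized) = exchange
          exchange
      }
  }
}
```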

### Why are the changes needed?
Performance improvement.

### How was this patch tested?
- New UTs in `ReuseExchangeAndSubquerySuite` to cover 1. and 2.
- New UTs in `DynamicPartitionPruningSuite`, `SubquerySuite` and `ExchangeSuite` to cover 3.
- New `ReuseMapSuite` to test `ReuseMap`.
- Checked new golden files of `PlanStabilitySuite`s for invalid reuse references.
- TPCDS benchmarks.

Closes #28885 from peter-toth/SPARK-29375-SPARK-28940-whole-plan-reuse.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
### What changes were proposed in this pull request?

This PR moves to a released Iceberg version.

### Why are the changes needed?

These changes are needed to avoid relying on snapshot builds.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local tests.