Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3369] [CORE] [STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator #10413

Closed
wants to merge 2 commits into from

Conversation

srowen
Copy link
Member

@srowen srowen commented Dec 21, 2015

Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC @rxin @pwendell for API change; @tdas since it also touches streaming.

@SparkQA
Copy link

SparkQA commented Dec 21, 2015

Test build #48112 has finished for PR 10413 at commit ac83f56.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Dec 21, 2015

@srowen I've been thinking about all the RDD APIs last night, and one thought I have is that if the goal is to push more people towards DataFrame/Dataset, then maybe it wouldn't be crazy to not change any of the existing RDD APIs to ensure maximum compatibility.

@srowen
Copy link
Member Author

srowen commented Dec 21, 2015

I think it probably cuts the other way - if you want people to use dataframes then keeping the RDDs as-is is not that important. I don't see that RDDs are going away and fixing this API seems natural for 2.x.

EDIT: I didn't really finish the thought -- the issue here is that this is fixing a moderate problem for Java callers. By having to return an Iterable it requires staging the entire output of a flatMap or mapPartitions in memory. This one is worth fixing.

I certainly agree that breaking changes still need a decent reason, but 2.x is the right time to make those. This will be an interesting point of debate then when it comes to removing deprecated methods.

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #48131 has finished for PR 10413 at commit 33e4559.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #48179 has finished for PR 10413 at commit afe2af0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 22, 2015

Test build #2248 has finished for PR 10413 at commit afe2af0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Dec 28, 2015

@rxin I wanted to sound you out on this one -- do you feel strongly against it or are the reasons above a little persuasive?

@SparkQA
Copy link

SparkQA commented Dec 28, 2015

Test build #48365 has finished for PR 10413 at commit 6240965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Dec 28, 2015

Let me think about it a little bit more...

@SparkQA
Copy link

SparkQA commented Jan 3, 2016

Test build #48606 has finished for PR 10413 at commit 1dfe590.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48651 has finished for PR 10413 at commit d0323d8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 4, 2016

Test build #48655 has finished for PR 10413 at commit bf2480d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…e producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
@SparkQA
Copy link

SparkQA commented Jan 8, 2016

Test build #49028 has finished for PR 10413 at commit d792f9a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 9, 2016

Test build #49030 has finished for PR 10413 at commit c3e0375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member Author

srowen commented Jan 14, 2016

I'd like to raise the issue of whether to proceed with this change. @rxin what's your current thinking? I tend to favor it. Pinging, say, @pwendell or @vanzin or @andrewor14 for an opinion.

@srowen
Copy link
Member Author

srowen commented Jan 18, 2016

Hm, I'd like to proceed with this as I am in favor of the change. I don't want to do so over strong-er objections. I don't see any additional opinions here. @rxin what do you think -- are there modifications to this that could make things more compatible? documentation?

@holdenk
Copy link
Contributor

holdenk commented Jan 21, 2016

I think this change is pretty important, returning an Iterable means store the entire output in memory that really limits what you can do.

@srowen
Copy link
Member Author

srowen commented Jan 23, 2016

If there's another vote in favor -- and no strong objection surfacing -- I'd like to proceed soon to merge this.

@srowen
Copy link
Member Author

srowen commented Jan 26, 2016

Merged to master

srowen added a commit to srowen/spark that referenced this pull request Jan 26, 2016
…s inconsistent with Scala's Iterator->Iterator

Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC rxin pwendell for API change; tdas since it also touches streaming.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#10413 from srowen/SPARK-3369.
@srowen srowen closed this Jan 26, 2016
@srowen srowen deleted the SPARK-3369 branch January 26, 2016 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants