
Add a Reader/Writer Interface for Streaming #25

Merged
marmbrus merged 7 commits into streaming-df from streaming-readwrite on Jan 10, 2016

Conversation

@marmbrus (Owner) commented Jan 7, 2016

In this PR I add a new interface for opening new streams (as DataFrames) and starting a new streaming query. These are modeled after the DataFrame reader/writer interface.

val df =
  sqlContext
    .streamFrom
    .format("text")
    .open("/michael/logs")

val filtered = df.filter($"value" contains "ERROR")

val runningQuery =
  filtered
    .streamTo
    .format("text")
    .start("/michael/errors")

runningQuery.stop()

Sources and Sinks are created by a StreamSourceProvider or StreamSinkProvider, which are similar to a RelationProvider in the Data Source API (and in fact a single class can be all of the above if desired).
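For illustration, here is a minimal sketch of a single class playing all three roles. The trait names come from the PR, but the streaming traits' packages, method names, and signatures below are assumptions made for this sketch; only RelationProvider's signature is the standard Data Source API one.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

// Hypothetical sketch: StreamSourceProvider / StreamSinkProvider method
// signatures are assumptions for illustration, not the PR's actual API.
class TextStreamProvider
    extends RelationProvider
    with StreamSourceProvider
    with StreamSinkProvider {

  // Batch path (Data Source API): return a relation for DataFrame reads.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = ???

  // Streaming input (assumed signature): return a Source producing batches.
  def createSource(
      sqlContext: SQLContext,
      parameters: Map[String, String]): Source = ???

  // Streaming output (assumed signature): return a Sink consuming each batch.
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String]): Sink = ???
}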

I include a throwaway implementation of a text file source/sink for demonstration / testing.

TODO:

  • Improve tests
  • Improve comments

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/21/

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/22/

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/23/

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/24/

@AmplabJenkins
Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/25/

@AmplabJenkins
Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/26/

   */
  override def awaitNextBatch(): Unit = {
    while (!batchRun) {
      awaitBatchLock.synchronized { awaitBatchLock.wait(100) }

why use wait(100) instead of wait()?

@marmbrus (Owner, Author)

That's a really good question. When I had it as wait(), it was hanging non-deterministically. I think it's okay to spin occasionally?

Collaborator

That sounds like a sign of corner cases we are missing.

@marmbrus (Owner, Author)

From the docs, interrupts and spurious wakeups are possible, and this method should always be used in a loop. So my guess is that we sometimes wake up spuriously. There is then a race to check the done condition / finish the batch (which is why it would hang with very low probability (3/1000)). So this actually does seem like the best solution: we don't spin a ton wastefully, in most cases we wake immediately, and in very rare cases we sleep 100ms too long.
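For context, a minimal sketch of the missed-notification race being described. The notifying side below is hypothetical, reusing the names from the snippet above; it is not code from the PR.

  // Hypothetical notifying side, for illustration only. If batchRun were
  // set and notifyAll() called without holding awaitBatchLock, a waiter
  // could check !batchRun, lose the race to this code, and then block in
  // a plain wait() forever; the wait(100) timeout bounds that hang to
  // 100ms. Setting the flag and notifying inside the same synchronized
  // block would remove the race entirely.
  def markBatchComplete(): Unit = awaitBatchLock.synchronized {
    batchRun = true
    awaitBatchLock.notifyAll()
  }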

   * to guarantee that a new batch has been processed.
   */
  @DeveloperApi
  def awaitNextBatch(): Unit
Collaborator

awaitBatchCompletion? awaitNextBatch doesn't signify whether to wait for the next batch to start or to end.

@marmbrus (Owner, Author)

changed

@AmplabJenkins
Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/29/

@AmplabJenkins
Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/30/

   *
   * @since 2.0.0
   */
  def format(source: String): DataStreamReader = {
Collaborator

Writing reader.format("kafka") is quite weird, and will be weird for most non-filesystem streaming sources. Instead, I propose having an alias called source, which works nicely for both batch and streaming: source("text"), source("parquet"), and source("kafka") all make sense.
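A minimal sketch of the proposed alias (hypothetical; not code from the PR):

  // Hypothetical alias: delegate to format() so both spellings work.
  def source(name: String): DataStreamReader = format(name)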

@marmbrus (Owner, Author)

/cc @rxin


I think this depends on what other methods are available on the reader/writer interfaces.


Ah, never mind -- I misunderstood it. Your proposal makes sense.

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/35/

@AmplabJenkins
Merged build finished. Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/36/

@AmplabJenkins
Merged build finished. Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/spark-streaming-df-test/37/

marmbrus added a commit that referenced this pull request Jan 10, 2016
Add a Reader/Writer Interface for Streaming
@marmbrus merged commit c1139ec into streaming-df on Jan 10, 2016
@marmbrus deleted the streaming-readwrite branch on March 8, 2016