Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12934][SQL] Count-min sketch serialization #10893

Closed
wants to merge 5 commits into from

Conversation

liancheng
Copy link
Contributor

This PR adds serialization support for CountMinSketch.

A version number is added to version the serialized binary format.

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #49967 has finished for PR 10893 at commit e97d7f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • throw new CountMinSketchMergeException(\"Cannot merge estimator of class \" + other.getClass().getName());
    • public class CountMinSketchMergeException extends Exception

public static CountMinSketch readFrom(InputStream in) {
throw new UnsupportedOperationException("Not implemented yet");
public static CountMinSketch readFrom(InputStream in) throws IOException {
DataInputStream dis = new DataInputStream(in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put the implementation in CountMinSketchImpl, and simply delegate to a static method there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also somewhere in CountMinSketchImpl we should document the serialization format.

@@ -55,6 +57,25 @@
*/
abstract public class CountMinSketch {
/**
* Version number of the serialized binary format.
*/
public enum Version {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this out so that we can also use it in bloom filter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be count min sketch specific, because the two binary protocols can and should evolve separately.

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #50023 has finished for PR 10893 at commit 9acfe33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #50025 has finished for PR 10893 at commit 4636af1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 25, 2016

Test build #50020 has finished for PR 10893 at commit 4abc4e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* - Row 1 (width * 64 bit)
* - ...
* - Row depth - 1 (width * 64 bit)
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move this comment near writeTo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I couldn't decide whether to put it before writeTo or readFrom, and then ended up here...

@cloud-fan
Copy link
Contributor

LGTM overall

DataInputStream dis = new DataInputStream(in);

// Ignores version number
dis.readInt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should add some check here to throw an exception if the version number is not 1.

@rxin
Copy link
Contributor

rxin commented Jan 25, 2016

Going to merge this. Thanks. Please address my (one) feedback in your next pr.

@asfgit asfgit closed this in 6f0f1d9 Jan 25, 2016
@liancheng liancheng deleted the cms-serialization branch January 25, 2016 23:09
public CMSMergeException(String message) {
super(message);
public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
DataInputStream dis = new DataInputStream(in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be closed before returning from the method, right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But DataInputStream.close() also closes the underlying input stream, which isn't our expected behavior, is it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants