
[SPARK-12936][SQL] Initial bloom filter implementation #10883

Closed · 7 commits

Conversation

cloud-fan (Contributor):

This PR adds an initial bloom filter implementation in the newly added sketch module. The implementation is based on Guava's BloomFilter class.

Differences from the design doc:

  • expose bitSize instead of sizeInBytes to the user.
  • the expectedInsertions parameter is always required when creating a bloom filter.
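For context, the reason expectedInsertions matters is that it drives the filter's sizing. The sketch below is illustrative only (it is not code from this PR): it uses the standard bloom filter formulas m = -n·ln(p)/(ln 2)² for the bit count and k = round(m/n · ln 2) for the number of hash functions, given n expected insertions and target false-positive probability p.

```java
// Illustrative sketch only: standard bloom filter sizing math, not code
// from this PR. Given n expected insertions and target false-positive
// probability p, compute the bit count m and hash-function count k.
public class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2
    public static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = max(1, round(m / n * ln 2))
    public static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        double p = 0.03; // the PR's default fpp
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("bits = " + m + ", hashes = " + k);
    }
}
```

With the default 3% fpp this works out to roughly 7.3 bits per expected item and 5 hash functions, which is why the filter cannot be sized without knowing expectedInsertions up front.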

@cloud-fan (Contributor, Author):

cc @rxin @liancheng

@SparkQA

SparkQA commented Jan 23, 2016

Test build #49939 has finished for PR 10883 at commit bbf3822.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public abstract class BloomFilter
    • public class DefaultBloomFilter extends BloomFilter
    • static final class BitArray

@SparkQA

SparkQA commented Jan 23, 2016

Test build #49940 has finished for PR 10883 at commit 2db0171.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public abstract class BloomFilter
    • public class DefaultBloomFilter extends BloomFilter
    • static final class BitArray

@SparkQA

SparkQA commented Jan 23, 2016

Test build #49941 has finished for PR 10883 at commit 1617226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public abstract class BloomFilter
    • public class DefaultBloomFilter extends BloomFilter
    • static final class BitArray

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class DefaultBloomFilter extends BloomFilter {
Contributor:
We should be consistent with the count-min sketch naming, i.e. rename this to BloomFilterImpl.

return bits.bitSize();
}

private static long hashObjectToLong(Object item) {
Contributor (Author):

The string part is the same as count-min, but the long part is different. Is it worth abstracting? cc @rxin

Contributor:

Why is the long part different?

Contributor (Author):

Because they use different strategies to produce n hash values for a long.
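For illustration, one common strategy for deriving several hash values from a single 64-bit hash of a long is Kirsch–Mitzenmacher double hashing, as used by Guava-style bloom filters: split the hash into two 32-bit halves h1 and h2 and combine them as h1 + i·h2. The sketch below is an assumption about the approach, not code quoted from this PR.

```java
// Illustrative sketch: deriving k bit indices from one 64-bit hash via
// double hashing (h1 + i * h2), the strategy used by Guava-style bloom
// filters. Names and details are assumptions, not code from this PR.
public class DoubleHashing {
    public static int[] indices(long hash64, int numHashFunctions, long bitSize) {
        int h1 = (int) hash64;          // low 32 bits
        int h2 = (int) (hash64 >>> 32); // high 32 bits
        int[] result = new int[numHashFunctions];
        for (int i = 1; i <= numHashFunctions; i++) {
            int combined = h1 + (i * h2);
            if (combined < 0) {
                combined = ~combined; // flip negatives (also handles MIN_VALUE)
            }
            result[i - 1] = (int) (combined % bitSize);
        }
        return result;
    }
}
```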

Contributor:

Let's add an inline comment explaining why they are different.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #50002 has finished for PR 10883 at commit 920f292.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

while (i < numInsertion) {
filter.put(allItems(i))
i += 1
}
Contributor:

I think it's OK to use allItems.take(numInsertion).foreach(filter.put) for simplicity if this code path doesn't slow down test execution too much.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #50005 has finished for PR 10883 at commit 3633952.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #50012 has finished for PR 10883 at commit 4fce26e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// After merge, `filter1` has `numItems` items which doesn't exceed `expectedNumItems`, so the
// `expectedFpp` should not be significantly higher than the default one: 3%
// Skip byte type as it has too few distinct values.
assert(typeName == "Byte" || 0.03 - filter1.expectedFpp() < 0.001)
Contributor:

Does the 0.001 here stand for epsilon? Should we wrap 0.03 - filter1.expectedFpp() with Math.abs()?

Contributor:

Do we still need to special case "Byte" here when we already have numItems (it's 200 for Byte type)?

Contributor:

Would be nice to make 0.01 a private static constant, e.g. EPSILON.

Contributor:

Actually we should swap 0.03 and filter1.expectedFpp() here.

@liancheng (Contributor):

One high level comment about testing code: I'm a little bit confused by those magic numbers there. Would be nice to name them.

* Creates a {@link BloomFilter} with given {@code expectedNumItems} and a default 3% {@code fpp}.
*/
public static BloomFilter create(long expectedNumItems) {
return create(expectedNumItems, 0.03);
Contributor:

Maybe we should have a public static constant DEFAULT_EXPECTED_FPP for this 0.03. I was pretty confused when finding the magic number 0.03 in the testing code.

@liancheng (Contributor):

Tip: I renamed CountMinSketchMergeException to IncompatibleMergeException and made it checked in #10893. You can use it in mergeInPlace.
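To make the suggestion concrete, a merge typically has to verify that the two filters were built with the same parameters before OR-ing their bit arrays. The sketch below is hypothetical: IncompatibleMergeException does exist in the sketch module (introduced in #10893), but the field names and the exact checks shown are assumptions for illustration, not code from this PR.

```java
// Hypothetical sketch of the compatibility check mergeInPlace might do.
// IncompatibleMergeException exists in the Spark sketch module (#10893);
// the parameters and messages below are assumptions for illustration.
public class MergeCheck {
    public static class IncompatibleMergeException extends Exception {
        public IncompatibleMergeException(String message) { super(message); }
    }

    public static void checkCompatible(long bitSize1, int numHashes1,
                                       long bitSize2, int numHashes2)
            throws IncompatibleMergeException {
        if (bitSize1 != bitSize2) {
            throw new IncompatibleMergeException(
                "Cannot merge bloom filters with different bit sizes");
        }
        if (numHashes1 != numHashes2) {
            throw new IncompatibleMergeException(
                "Cannot merge bloom filters with different numbers of hash functions");
        }
    }
}
```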

testItemType[Long]("Long", 100000) { _.nextLong() }

testItemType[String]("String", 100000) { r => r.nextString(r.nextInt(512)) }
}
Contributor:

Would be nice to add another test case for incompatible merge (like this one).

@SparkQA

SparkQA commented Jan 25, 2016

Test build #50031 has finished for PR 10883 at commit b850bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2016

Test build #50034 has finished for PR 10883 at commit a9a6e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "Cannot merge bloom filter of class " + other.getClass().getName()

@liancheng (Contributor):

LGTM

@rxin (Contributor):

rxin commented Jan 26, 2016

As discussed offline, it might be better to have put(String) and put(Long) in addition to put(Object).

I'm going to merge this pull request. Please address the remaining comments in a new pull request.
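The motivation for typed overloads is that put(Object) has to dispatch on the runtime type before it can hash, while put(String) and put(Long) can go straight to the right hash path and avoid boxing. The sketch below illustrates that dispatch cost; the hash functions here are simplified stand-ins, not the PR's actual hashing.

```java
// Hypothetical sketch of why typed put overloads help: put(Object) must
// branch on runtime type, while put(String)/put(Long) hash directly.
// The hash functions are simplified stand-ins, not the PR's hashing.
import java.nio.charset.StandardCharsets;

public class TypedPut {
    // Simple 64-bit mix (Murmur3-style finalizer) as a stand-in hash.
    public static long hashLong(long value) {
        long h = value;
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    public static long hashString(String value) {
        long h = 0;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h = h * 31 + (b & 0xff);
        }
        return hashLong(h);
    }

    // put(Object) must inspect the runtime type before hashing.
    public static long hashObject(Object item) {
        if (item instanceof String) return hashString((String) item);
        if (item instanceof Long) return hashLong((Long) item);
        throw new IllegalArgumentException("Unsupported type: " + item.getClass());
    }
}
```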

@asfgit asfgit closed this in 109061f Jan 26, 2016
@cloud-fan cloud-fan deleted the bloom-filter branch January 26, 2016 01:59