A suggestion of code improvement for BloomFilter. #7346

longlong354 · 2024-08-05T08:40:34Z

API(s)

com.google.common.hash.BloomFilter::optimalNumOfBits(long n, double p)
&
com.google.common.hash.BloomFilter::optimalNumOfHashFunctions(long n, long m)

How do you want it to be improved?

1. use static caculated value of log(2) and squared log(2) :

        private static final double LOG_TWO = Math.log(2);
	private static final double SQUARED_LOG_TWO = Math.pow(LOG_TWO,2);

2. calculate optimalNumOfBits by static values:

	static long optimalNumOfBits(long n, double p) {
	    if (p == 0) {
			p = Double.MIN_VALUE;
	    }
	    return (long) (-n * Math.log(p) / SQUARED_LOG_TWO);
	}

3. caculate optimalNumOfHashFunction by false positive rate(p) directly and using LOG_TWO :

     	static int optimalNumOfHashFunctions(double p){
		return  Math.max(1, (int) Math.round( - Math.log(p) / LOG_TWO));
	}

Why do we need it to be improved?

Slightly reduce the method's execution time.
Refine the calculation of the optimalNumOfHashFunctions method to be more concise and efficient. The current code may lead people to mistakenly believe that the result is related to the number of samples(n) and the number of bits（m）, when in fact it is only related to the false positive rate(p).

Example

as the given codes in  “How do you want it to be improved”

Current Behavior

as the source codes in “How do you want it to be improved”

Desired Behavior

as the given codes in
com.google.common.hash.BloomFilter::optimalNumOfBits(long n, double p)
&
com.google.common.hash.BloomFilter::optimalNumOfHashFunctions(long n, long m)

Concrete Use Cases

as "How do you want it to be improved"

Checklist

I agree to follow the code of conduct.
I have read and understood the contribution guidelines.
I have read and understood Guava's philosophy, and I strongly believe that this proposal aligns with it.

eamonnmcmanus · 2024-08-06T13:24:53Z

This appears to have been a deliberate change back in 2012. Maybe @kluever remembers the rationale.

longlong354 · 2024-08-07T05:16:26Z

Thanks for reply. My fault didn't express the main point clearly enough.
The main recommendation is using

optimalNumOfHashFunctions(double p)

instead of

optimalNumOfHashFunctions(long n, long m)

Ayush-Thakur-geek · 2024-08-26T20:21:54Z

Hey, @longlong354

So, you just want the usage of (double p) and not (long m, long n) as argument for bloomfilter.

eamonnmcmanus · 2024-08-27T13:34:20Z

The situation now is roughly that we start from $n$, the expected number of entries, and $p$, the desired false positive probability, and we derive $m$, the optimal number of bits, as

$$ m = \lfloor {-n\ \ln\ p } / { (\ln\ 2)^2} \rfloor $$

Then we further derive $k$, the optimal number of hash functions, as

$$ k = (m / n)\ \ln\ 2 \approx {-n\ \ln\ p } / { \ln\ 2} $$

rounded to an integer. You are proposing instead

$$ k = { -\ln\ p } / { \ln\ 2 } $$

removing the factor of $n$. That's a different number. Are you saying that it's more accurate? Could you explain why?

longlong354 · 2024-08-29T00:58:30Z

The derivation of the formula

k = (m / n) ln 2 ≈ -n ln p / ln 2

is inaccurate; it should be

k = (m / n) ln 2 ≈ - ln p / ln 2

Please kindly note that the bolded 'n' in the numerator cancels out with the 'n' in m( m = ⌊ − n ln ⁡ p / ( ln ⁡ 2 ) 2 ⌋ )"

ps:
refering to https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions
also can be verified in https://krisives.github.io/bloom-calculator/ that n does not affect k

longlong354 · 2024-08-29T00:59:16Z

Hey, @longlong354

So, you just want the usage of (double p) and not (long m, long n) as argument for bloomfilter.

yep~

Romain-E · 2024-10-11T18:56:41Z

Hi, I would like to take on this issue. I propose optimizing the optimalNumOfBits and optimalNumOfHashFunctions methods as follows:

Pre-calculate constant values for log(2) and log(2)^2 to avoid recalculating these in each method call, which will slightly improve performance.
Refine the calculation of optimalNumOfBits to ensure a false positive rate of 0 is handled by setting it to Double.MIN_VALUE, and use the pre-calculated values for efficiency.
Simplify optimalNumOfHashFunctions to calculate the number of hash functions directly from the false positive rate p, as this is the only factor affecting the result, rather than the number of elements n and bits m.
These changes will clarify the logic and improve performance.

Please let me know if I can proceed with these modifications!

eamonnmcmanus · 2024-10-12T16:48:59Z

Sorry, I lost track of this issue.

@longlong354 is right that the algebra in my earlier comment was incorrect, and it does seem as if rewriting the code as suggested would make sense.

I still don't know why the code was rewritten in 2012 to remove the pre-calculation of $\ln\ 2$ and $\ln^2\ 2$. I expect that the JIT compiler is able to inline these constants, but we could reasonably do it explicitly and save it the work.

I think we would accept a PR on the lines of the original comment. If @longlong354 wants to send that PR I think that would make the most sense, and otherwise @Romain-E. The only non-obvious thing I see is how to adjust the corresponding tests in BloomFilterTest. I think the second test method is no longer relevant but I'm not sure about the first.

Romain-E · 2024-10-12T17:05:52Z

I concur that the second test (testOptimalNumOfHashFunctionsRounding) no longer applies with the new approach and should probably be removed. The first test (testOptimalHashes) could be adjusted to work with the updated method by testing for different values of p rather than n and m. Alternatively, we could add a method that derives p from n and m and maintain the test in a slightly modified form.

Let me know if this direction works for you, and I’d be happy to proceed with the PR.

longlong354 · 2024-10-14T03:15:37Z

Thank u all for the discussion.
Pls proceed @Romain-E, thanks in advance.

longlong354 added the type=enhancement Make an existing feature better label Aug 5, 2024

eamonnmcmanus assigned kluever Aug 6, 2024

eamonnmcmanus added the P3 label Aug 6, 2024

eamonnmcmanus self-assigned this Oct 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A suggestion of code improvement for BloomFilter. #7346

A suggestion of code improvement for BloomFilter. #7346

longlong354 commented Aug 5, 2024 •

edited

Loading

eamonnmcmanus commented Aug 6, 2024

longlong354 commented Aug 7, 2024

Ayush-Thakur-geek commented Aug 26, 2024

eamonnmcmanus commented Aug 27, 2024

longlong354 commented Aug 29, 2024 •

edited

Loading

longlong354 commented Aug 29, 2024

Romain-E commented Oct 11, 2024

eamonnmcmanus commented Oct 12, 2024

Romain-E commented Oct 12, 2024

longlong354 commented Oct 14, 2024

A suggestion of code improvement for BloomFilter. #7346

A suggestion of code improvement for BloomFilter. #7346

Comments

longlong354 commented Aug 5, 2024 • edited Loading

API(s)

How do you want it to be improved?

Why do we need it to be improved?

Example

Current Behavior

Desired Behavior

Concrete Use Cases

Checklist

eamonnmcmanus commented Aug 6, 2024

longlong354 commented Aug 7, 2024

Ayush-Thakur-geek commented Aug 26, 2024

eamonnmcmanus commented Aug 27, 2024

longlong354 commented Aug 29, 2024 • edited Loading

longlong354 commented Aug 29, 2024

Romain-E commented Oct 11, 2024

eamonnmcmanus commented Oct 12, 2024

Romain-E commented Oct 12, 2024

longlong354 commented Oct 14, 2024

longlong354 commented Aug 5, 2024 •

edited

Loading

longlong354 commented Aug 29, 2024 •

edited

Loading