Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Rabin-Karp #110

Merged
merged 3 commits into from
Jul 30, 2018
Merged

Conversation

Bruce-Feldman
Copy link
Contributor

Created Rabin hash function module.
Transitioned Rabin-Karp string matching function to Rabin hash function family.

@codecov-io
Copy link

codecov-io commented Jul 24, 2018

Codecov Report

Merging #110 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #110   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files         115    116    +1     
  Lines        2258   2260    +2     
  Branches      394    394           
=====================================
+ Hits         2258   2260    +2
Impacted Files Coverage Δ
src/utils/hash/rolling/Rabin_Fingerprint.js 100% <100%> (ø)
src/algorithms/string/rabin-karp/rabinKarp.js 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d090f76...fab2d46. Read the comment docs.

@Bruce-Feldman
Copy link
Contributor Author

I additionally think that Rabin-Karp should be changed to be a beginner topic.

@dubzzz
Copy link
Contributor

dubzzz commented Jul 25, 2018

Actually it might fix #102 as the hash function has been changed. I can re-run my test on this change to confirm it fixes the problem.

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

It seems that some values are still failing with this implementation. The characters outside the BMP-plan of Unicode make the algorithm fails:

rabinKarp("a\u{ffff}", "\u{ffff}"); // => OK - return: 1
rabinKarp("a\u{10000}", "\u{10000}"); // => FAIL - return: -1

// \u{10000} is LINEAR B SYLLABLE B008 A
// more at https://www.fileformat.info/info/unicode/char/10000/index.htm

See: https://runkit.com/dubzzz/rabin-2

@Bruce-Feldman
Copy link
Contributor Author

This is a really cool bug. For some reason, on my machine, I get the following interesting result:

> var text = "a\u{10000}";
undefined
> text.length
3
> Array.from(text).length
2

I am not so familiar with the guarantees of these library functions, is this expected behaviour? With your first example, I get the following behaviour (which I would expect):

> var text = "a\u{1000}"
undefined
> text.length
2
> Array.from(text).length
2

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

This is an expected behavior. In JavaScript, strings are encoded using UTF-16 which encodes code units on 16 bits. Actually characters outside the BMP range, which means whose code point is greater than 0xffff, require two code units to encode one code point. Legacy methods of JavaScript are counting code units while modern ones (most of the time) work at code point level (spread operator, Array.from...).

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

If your code uses both s.length and Array.from(s) you must definitely choose one.

s.length for code point would be [...s].length or Array.from(s).length

Array.from(s) for code units would be s. split('')

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

Just an additional note:

'\u{10000}'.charCodeAt(0) // 0xd800 
'\u{10000}'.codePointAt(0) // 0x10000

@Bruce-Feldman
Copy link
Contributor Author

Bruce-Feldman commented Jul 26, 2018

@dubzzz I updated the string processing to handle this edge case. Could you take a look at it and see if it seems sufficient in the general case? I think I understood your what you were saying about javascript representation of strings, but am not sure.

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

@Bruce-Feldman Just retried based on the last code you pushed. It seems that there is still an issue:

// your implementation
console.log(rabinKarp("a耀a","耀a")); // 1
console.log(rabinKarp("\u0000耀\u0000","耀\u0000")); // -1 ERROR

// indexOf
console.log("a耀a".indexOf("耀a")); // 1
console.log("\u0000耀\u0000".indexOf("耀\u0000")); // 1

You can easily re-run my test case either by using RunKit or locally by adding fast-check package.
Snippet: https://runkit.com/dubzzz/rabin-3 (tell me if you can't access it)

@Bruce-Feldman
Copy link
Contributor Author

Any recommendations to fix this problem in the general case?

@Bruce-Feldman
Copy link
Contributor Author

@dubzzz Looks like the issue was that I forgot that javascript performs bit operations on signed 32 bit integers. This should be addressed now by using non bit operators.

@dubzzz
Copy link
Contributor

dubzzz commented Jul 26, 2018

With your last implementation, all my property based tests are green ;)

On my side, I think we should integrate such property based testing tools into the repository to have more confidence in the algorithms and cover all possible and legits corner cases. In the case of rabinKarb it helped to discover multiple issues: in the current and the new implementation.

Concerning general ways to cover those edge cases on character encodings I do not have real recommendations :/

@Bruce-Feldman
Copy link
Contributor Author

@dubzzz Woohoo! I am really pleased with your tests - they seem to generate edge cases fairly reliably. Thanks for the help!

@trekhleb trekhleb changed the base branch from master to issue-102-rabin-karp-fix July 30, 2018 09:11
@trekhleb
Copy link
Owner

@Bruce-Feldman, @dubzzz thank you for PR and for new Rabing Karp testing edge cases!

@Bruce-Feldman I like the idea of moving hashing functionality out of RabinKarp. I even think that it should be moved not in utils but into separate section that will contain different hash implementations and maybe different crypto related algorithms. I'll merge your PR to separate branch for now and create this new section for hashing with related READMEs and links for further readings.

@dubzzz regarding the tool you've created that does property based testing - it is really nice I guess since it helped to find such edge cases for Rabing Karp as ^ !/\'#\'pp vs !/\'#\'pp'. But currently I'm not ready to answer your question about whether we'll have it in current repo or not. Let me investigate it further since I didn't have a chance to play with it so far.

@trekhleb trekhleb merged commit c4605ea into trekhleb:issue-102-rabin-karp-fix Jul 30, 2018
@dubzzz
Copy link
Contributor

dubzzz commented Jul 30, 2018

@trekhleb Thanks for the update. Please keep me updated if you need more details on the approach. Meanwhile I will try to find some more times to checks other algorithms

@trekhleb
Copy link
Owner

Thank you @dubzzz

Copy link
Contributor

@dubzzz dubzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the use of Math.random should be replaced by an appropriate property based test. It will be:

  • reproducible
  • able to simplify the failure (in case there is one)
  • and this is exactly designed for this purpose

harshes53 pushed a commit to harshes53/javascript-algorithms that referenced this pull request Dec 6, 2018
* Simplify Rabin-Karp functionality

* Created Rabin Fingerprinting module within util directory

* Updated Rabin-Karp search to use rolling hash module
Incorporate tests from @dubzzz
shoredata pushed a commit to shoredata/javascript-algorithms that referenced this pull request Mar 28, 2019
* Simplify Rabin-Karp functionality

* Created Rabin Fingerprinting module within util directory

* Updated Rabin-Karp search to use rolling hash module
Incorporate tests from @dubzzz
jixianu added a commit to jixianu/javascript-algorithms that referenced this pull request May 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants