The index: 32bit hashes instead of 64bit #27
Turning the hash type into a template parameter would indeed be the way to go. One question: what happens at the moment when there is a hash collision? At what point is that resolved?
As for true randstrobe hash collisions, where the underlying strobemers are different, they will be noticed at the NAM stage, specifically when parsing out the leftmost and rightmost positions of a strobe in the NAM. We do a sanity check that these strobes match the reference exactly. If they don't, it is one of the following:
In all the above cases, the procedure is to check for case 1 by reversing the NAM and checking whether it fits the reference; if it does, the mismatch was caused by false symmetrical matches. This is taken care of in function
If you mean hash collisions in the actual hash table implementation, I don't know how they are resolved under the hood. For the randstrobe hash collisions, here are some example stats for mapping 10M PE reads to rye, where
Thanks, right, there was something about slightly better sensitivity, I remember now. I did mean the randstrobe collisions. The mechanisms for resolving collisions in the hash table are what make hash table implementations different. A hash table as we use it is just an array (vector). With linear probing, if the hash function h says to use slot h(key) and that one is already filled, slots h(key)+m, h(key)+2m etc. are tried (for some constant m) until an empty slot is found. This has some disadvantages: for example, entries tend to cluster, and each new insertion that hits a cluster makes even more probing necessary. Some of this can be avoided with quadratic probing, where the probe sequence is h(key), h(key)+1²m, h(key)+2²m, h(key)+3²m etc. Robin Hood hashing is a different method that can move entries around on collisions. As I understand it, a cost is associated with each item describing how much probing is necessary to retrieve it. If a new entry needs to be inserted that collides with existing ones, the items are shuffled around, modifying their costs so that one "takes from the rich and gives to the poor" (hence Robin Hood).
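To make the linear probing description above concrete, here is a minimal open-addressing sketch with step m = 1. This is purely illustrative and not the hash table implementation strobealign actually uses; a real table would also handle resizing, deletion, and load factor.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Minimal open-addressing table with linear probing (step m = 1).
// Illustrative sketch only; no resizing, no deletion.
struct ProbingTable {
    struct Slot {
        uint64_t key;
        uint32_t value;
        bool used = false;
    };
    std::vector<Slot> slots;

    explicit ProbingTable(size_t capacity) : slots(capacity) {}

    size_t hash(uint64_t key) const { return key % slots.size(); }

    void insert(uint64_t key, uint32_t value) {
        size_t i = hash(key);
        // Probe h(key), h(key)+1, h(key)+2, ... until an empty
        // slot (or the same key) is found.
        while (slots[i].used && slots[i].key != key)
            i = (i + 1) % slots.size();
        slots[i] = Slot{key, value, true};
    }

    std::optional<uint32_t> find(uint64_t key) const {
        size_t i = hash(key);
        // Walk the same probe sequence; an unused slot means "absent".
        while (slots[i].used) {
            if (slots[i].key == key)
                return slots[i].value;
            i = (i + 1) % slots.size();
        }
        return std::nullopt;
    }
};
```

The clustering disadvantage is visible here: once several keys hash near each other, both `insert` and `find` must walk the whole run of occupied slots.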
Note to developer:
Using 32-bit hashes may reduce both the vector and the hash table size. For human, it is expected to reduce peak memory from 32 GB (64-bit) down to around 20–24 GB.
At the final hash value computation h1/2 + h2/2, once the seed has been decided, simply hash this 64-bit value down to 32 bits with wyhash or XXH32.
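A sketch of that reduction step. To keep the example self-contained it uses a MurmurHash3-style finalizer (fmix64 constants) truncated to 32 bits instead of linking against wyhash or xxHash; `randstrobe_hash32` is a hypothetical name for the combined step, not a function in the codebase.

```cpp
#include <cstdint>

// Illustrative 64 -> 32 bit reduction. The issue suggests wyhash or
// XXH32; a MurmurHash3-style finalizer is substituted here so the
// sketch needs no external library. Constants are Murmur3's fmix64.
uint32_t fold_to_32(uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return static_cast<uint32_t>(h);  // keep the low 32 bits
}

// Hypothetical final-hash step: combine the two strobe hashes as
// described in the issue (h1/2 + h2/2), then reduce to 32 bits.
uint32_t randstrobe_hash32(uint64_t h1, uint64_t h2) {
    return fold_to_32(h1 / 2 + h2 / 2);
}
```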
This will result in more collisions on large genomes though, as there are only about 4.2 billion unique slots. For human, we store about 450M unique values, but for larger genomes this may be a problem (can this be solved with a 32/64-bit template?).
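A back-of-the-envelope estimate of how many collisions to expect, assuming the 32-bit hash behaves uniformly (a birthday-problem calculation, not something from the codebase). Plugging in the 450M figure from above gives roughly 23 million colliding keys, about 5% of all values.

```cpp
#include <cmath>

// Expected number of keys that collide when hashing n distinct keys
// uniformly into m slots. The expected number of occupied slots is
// m * (1 - (1 - 1/m)^n) ~= m * (1 - exp(-n/m)); every key beyond
// that count shares a slot with another key.
double expected_colliding_keys(double n, double m) {
    double occupied = m * (1.0 - std::exp(-n / m));
    return n - occupied;
}
```

By the same formula, doubling n on a larger genome more than doubles the collision count, which is why a 32/64-bit template parameter (rather than a hard switch to 32 bits) looks attractive.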
I see quite a reduction in both space and alignment time if we change to a 32-bit hash value (on a small dataset!).