
The index: 32bit hashes instead of 64bit #27

Open
ksahlin opened this issue Jun 1, 2022 · 4 comments
Labels: optimization

Comments

@ksahlin (Owner) commented Jun 1, 2022

Note to developer:

Using 32-bit hashes may reduce both vector and hash table size. For the human genome, this is expected to reduce peak memory from 32 GB (64-bit) to around 20-24 GB.

At the final hash value computation (h1/2 + h2/2), once the seed has been decided, simply hash this 64-bit value down to 32 bits with wyhash or XXH32.
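A minimal sketch of that compression step, assuming the xxHash library is available; the helper name `compress_hash` is mine, not strobealign's:

```cpp
#include <cstdint>
#include "xxhash.h"  // xxHash library (provides XXH32)

// Hypothetical helper: compress the final 64-bit randstrobe hash value
// down to 32 bits by hashing its 8 bytes with XXH32.
static inline uint32_t compress_hash(uint64_t h64) {
    return XXH32(&h64, sizeof(h64), 0);  // seed 0
}
```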

This will result in more collisions on large genomes, though, as there are only about 4.3 billion (2^32) unique slots. For human, we store about 450M unique values, but for larger genomes this may be a problem (could this be solved with a 32/64-bit template parameter?).
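As a rough sense of scale (my own back-of-envelope, not a measurement): under the birthday approximation, storing n keys in m = 2^32 slots gives about

```math
\frac{n^2}{2m} = \frac{(4.5 \times 10^8)^2}{2 \cdot 2^{32}} \approx 2.4 \times 10^7
```

colliding pairs for human, i.e. roughly 5% of the stored values would share their 32-bit hash with some other value.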

I see quite a reduction in both space and alignment time when changing to a 32-bit hash value (on a small dataset!).

ksahlin added the optimization label on Jun 1, 2022
ksahlin changed the title from "The index 32bit hashes instead of 64bit" to "The index: 32bit hashes instead of 64bit" on Jun 1, 2022
@marcelm (Collaborator) commented Dec 19, 2022

Turning the hash type into a template parameter would indeed be the way to go.

One question: what currently happens when there is a hash collision? At what point is it resolved?

@ksahlin (Owner, Author) commented Dec 19, 2022

As for true randstrobe hash collisions, where the underlying strobemers are different, they will be noticed at the NAM stage. Specifically, when parsing out the leftmost and rightmost position of a strobe in the NAM, we do a sanity check that these strobes match the reference exactly. If they don't, it is one of the following:

  1. False symmetrical matches. The paper discusses this as a feature: (s1, s2) gets the same hash value as (s2, s1). This is necessary when masking seeds, and also turns out to help a tad with sensitivity. See the section "Modifications to strobemers" in the paper.
  2. A 'true' hash collision between completely different strobes, i.e. the combined hash of (s1, s2) equals that of (s3, s4).
  3. If neither 1 nor 2, it is some other bug.

In all of the above cases, the procedure is to check for case 1 by reversing the NAM and checking whether it fits the reference; if it does, the mismatch was due to false symmetrical matches. This is taken care of in the function reverse_nam_if_needed (sketched below). If that doesn't resolve it, we give up.
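A rough sketch of that decision flow; all names below are placeholders standing in for strobealign's real NAM code, and the reference check is stubbed out:

```cpp
#include <string>

// Placeholder NAM with just the fields the sketch needs.
struct Nam {
    int query_start = 0, query_end = 0;
    int ref_start = 0, ref_end = 0;
    bool is_reversed = false;
};

// Placeholder for the exact-match sanity check of the leftmost and
// rightmost strobes against the reference (stubbed out here).
static bool strobes_fit_reference(const Nam& nam, const std::string& ref) {
    (void)nam; (void)ref;
    return true;
}

// Mirrors the role of reverse_nam_if_needed: on a mismatch, try the
// reversed interpretation (case 1) before giving up (cases 2/3).
static bool resolve_nam(Nam& nam, const std::string& ref) {
    if (strobes_fit_reference(nam, ref)) {
        return true;                     // strobes match: no collision
    }
    nam.is_reversed = !nam.is_reversed;  // reinterpret (s1, s2) as (s2, s1)
    if (strobes_fit_reference(nam, ref)) {
        return true;                     // false symmetrical match, now fixed
    }
    return false;                        // true collision or a bug: give up
}
```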

@ksahlin (Owner, Author) commented Dec 19, 2022

If you mean hash collisions in the actual hash table implementation, I don't know how they are resolved under the hood.

For the randstrobe hash collisions, here are some example stats from mapping 10M PE reads to rye, where the "Did not fit strobe start site" counter logs these likely hash collisions:

Total mapping sites tried: 100327591
Did not fit strobe start site: 27528

@marcelm (Collaborator) commented Dec 19, 2022

Thanks, right, there was something about slightly better sensitivity, I remember now. I did mean the randstrobe collisions.

The mechanisms for resolving collisions in the hash table are what make hash table implementations different. A hash table as we use it is just an array (vector). With linear probing, if the hash function h says to use slot h(key) and that one is already filled, slots h(key)+m, h(key)+2m etc. are tried (for some constant m) until an empty slot is found. This has some disadvantages: for example, entries tend to cluster, and each new insertion that hits a cluster makes the cluster grow, requiring even more probing (see the sketch below). Some of this can be avoided with quadratic probing, where the probe sequence is h(key), h(key)+1²m, h(key)+2²m, h(key)+3²m etc.
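A toy illustration of linear probing with step m = 1 (not strobealign's actual table, just the mechanism described above):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy open-addressing hash table with linear probing (step m = 1).
// For brevity it never resizes and assumes it never becomes full.
struct LinearProbingTable {
    struct Slot { uint64_t key = 0; uint64_t value = 0; bool used = false; };
    std::vector<Slot> slots;

    explicit LinearProbingTable(size_t capacity = 1024) : slots(capacity) {}

    size_t home(uint64_t key) const {
        return key % slots.size();  // stand-in for a real hash function h(key)
    }

    void insert(uint64_t key, uint64_t value) {
        size_t i = home(key);
        while (slots[i].used && slots[i].key != key) {
            i = (i + 1) % slots.size();  // probe h(key)+1, h(key)+2, ...
        }
        slots[i] = {key, value, true};
    }

    std::optional<uint64_t> find(uint64_t key) const {
        size_t i = home(key);
        while (slots[i].used) {
            if (slots[i].key == key) return slots[i].value;
            i = (i + 1) % slots.size();  // follow the same probe sequence
        }
        return std::nullopt;  // reached an empty slot: key is absent
    }
};
```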

Robin Hood hashing is a different method that can move things around on collisions. As I understood it, a cost is associated with each item, describing how much probing is necessary to retrieve it. If a new entry needs to be inserted that collides with existing ones, the items are shuffled around, modifying their costs so that one "takes from the rich and gives to the poor" (hence Robin Hood).
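And a corresponding sketch of the Robin Hood insertion rule (again illustrative, not a particular library's implementation): each entry carries its probe distance, and a new entry that has probed farther than a resident evicts that resident, which then continues probing:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy Robin Hood hashing: entries remember their probe distance ("cost").
// Assumes distinct keys and spare capacity, for brevity.
struct RobinHoodTable {
    struct Slot { uint64_t key = 0; uint64_t value = 0; int dist = -1; };  // dist < 0: empty
    std::vector<Slot> slots;

    explicit RobinHoodTable(size_t capacity = 1024) : slots(capacity) {}

    void insert(uint64_t key, uint64_t value) {
        size_t i = key % slots.size();  // stand-in for h(key)
        int dist = 0;                   // distance probed from the home slot
        while (true) {
            Slot& s = slots[i];
            if (s.dist < 0) {           // empty slot: place the entry here
                s = {key, value, dist};
                return;
            }
            if (s.dist < dist) {                // resident is "richer" (closer to home):
                std::swap(s.key, key);          // evict it and continue probing
                std::swap(s.value, value);      // on its behalf, taking from the
                std::swap(s.dist, dist);        // rich and giving to the poor
            }
            i = (i + 1) % slots.size();
            ++dist;
        }
    }
};
```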

ksahlin mentioned this issue on Jan 11, 2023