
The index: 32bit hashes instead of 64bit #27

Open
ksahlin opened this issue Jun 1, 2022 · 4 comments
Labels: optimization

Comments

@ksahlin (Owner) commented Jun 1, 2022

Note to developer:

Using 32-bit hashes may reduce both vector and hash table size. For the human genome, this is expected to reduce peak memory from 32 GB (64-bit) to around 20-24 GB.

At the final hash value computation (h1/2 + h2/2), once the seed has been decided, simply hash this 64-bit value down to 32 bits with wyhash or XXH32.
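A minimal sketch of that compression step, assuming the xxHash library is available; the helper name `compress_hash` is mine, not strobealign's:

```cpp
#include <cstdint>
#include "xxhash.h"  // xxHash library (provides XXH32)

// Hypothetical helper: compress the final 64-bit randstrobe hash value
// down to 32 bits by hashing its 8 bytes with XXH32.
static inline uint32_t compress_hash(uint64_t h64) {
    return XXH32(&h64, sizeof(h64), 0);  // seed 0
}
```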

This will result in more collisions on large genomes, though, as there are only about 4.3 billion (2^32) unique slots. For human, we store about 450M unique values, but for larger genomes this may be a problem (could this be solved with a 32/64-bit template parameter?).
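As a rough sense of scale (my own back-of-envelope, not a measurement): under the birthday approximation, storing n keys in m = 2^32 slots gives about

```math
\frac{n^2}{2m} = \frac{(4.5 \times 10^8)^2}{2 \cdot 2^{32}} \approx 2.4 \times 10^7
```

colliding pairs for human, i.e. roughly 5% of the stored values would share their 32-bit hash with some other value.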

I see quite a reduction in both space and alignment time when changing to a 32-bit hash value (on a small dataset!).

ksahlin added the optimization label on Jun 1, 2022
ksahlin changed the title from "The index 32bit hashes instead of 64bit" to "The index: 32bit hashes instead of 64bit" on Jun 1, 2022
@marcelm (Collaborator) commented Dec 19, 2022

Turning the hash type into a template parameter would indeed be the way to go.

One question: what currently happens when there is a hash collision? At what point is it resolved?

@ksahlin (Owner, Author) commented Dec 19, 2022

As for true randstrobe hash collisions, where the underlying strobemers are different, they will be noticed at the NAM stage. Specifically, when parsing out the leftmost and rightmost position of a strobe in the NAM, we do a sanity check that these strobes match the reference exactly. If they don't, it is one of the following:

  1. False symmetrical matches. The paper discusses this as a feature: (s1, s2) gets the same hash value as (s2, s1). This is necessary when masking seeds, and also turns out to help a tad with sensitivity. See the section "Modifications to strobemers" in the paper.
  2. A 'true' hash collision between completely different strobes, i.e. the combined hash of (s1, s2) equals that of (s3, s4).
  3. If neither 1 nor 2, it is some other bug.

In all of the above cases, the procedure is to check for case 1 by reversing the NAM and checking whether it fits the reference; if it does, the mismatch was due to false symmetrical matches. This is taken care of in the function reverse_nam_if_needed (sketched below). If that doesn't resolve it, we give up.
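A rough sketch of that decision flow; all names below are placeholders standing in for strobealign's real NAM code, and the reference check is stubbed out:

```cpp
#include <string>

// Placeholder NAM with just the fields the sketch needs.
struct Nam {
    int query_start = 0, query_end = 0;
    int ref_start = 0, ref_end = 0;
    bool is_reversed = false;
};

// Placeholder for the exact-match sanity check of the leftmost and
// rightmost strobes against the reference (stubbed out here).
static bool strobes_fit_reference(const Nam& nam, const std::string& ref) {
    (void)nam; (void)ref;
    return true;
}

// Mirrors the role of reverse_nam_if_needed: on a mismatch, try the
// reversed interpretation (case 1) before giving up (cases 2/3).
static bool resolve_nam(Nam& nam, const std::string& ref) {
    if (strobes_fit_reference(nam, ref)) {
        return true;                     // strobes match: no collision
    }
    nam.is_reversed = !nam.is_reversed;  // reinterpret (s1, s2) as (s2, s1)
    if (strobes_fit_reference(nam, ref)) {
        return true;                     // false symmetrical match, now fixed
    }
    return false;                        // true collision or a bug: give up
}
```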

@ksahlin (Owner, Author) commented Dec 19, 2022

If you mean hash collisions in the actual hash table implementation, I don't know how they are resolved under the hood.

For the randstrobe hash collisions, here are some example stats from mapping 10M PE reads to rye, where the "Did not fit strobe start site" counter logs these likely hash collisions:

Total mapping sites tried: 100327591
Did not fit strobe start site: 27528

@marcelm (Collaborator) commented Dec 19, 2022

Thanks, right, there was something about slightly better sensitivity, I remember now. I did mean the randstrobe collisions.

The mechanisms for resolving collisions in the hash table are what make hash table implementations different. A hash table as we use it is just an array (vector). With linear probing, if the hash function h says to use slot h(key) and that one is already filled, slots h(key)+m, h(key)+2m etc. are tried (for some constant m) until an empty slot is found. This has some disadvantages: for example, entries tend to cluster, and each new insertion that hits a cluster makes the cluster grow, requiring even more probing (see the sketch below). Some of this can be avoided with quadratic probing, where the probe sequence is h(key), h(key)+1²m, h(key)+2²m, h(key)+3²m etc.
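A toy illustration of linear probing with step m = 1 (not strobealign's actual table, just the mechanism described above):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy open-addressing hash table with linear probing (step m = 1).
// For brevity it never resizes and assumes it never becomes full.
struct LinearProbingTable {
    struct Slot { uint64_t key = 0; uint64_t value = 0; bool used = false; };
    std::vector<Slot> slots;

    explicit LinearProbingTable(size_t capacity = 1024) : slots(capacity) {}

    size_t home(uint64_t key) const {
        return key % slots.size();  // stand-in for a real hash function h(key)
    }

    void insert(uint64_t key, uint64_t value) {
        size_t i = home(key);
        while (slots[i].used && slots[i].key != key) {
            i = (i + 1) % slots.size();  // probe h(key)+1, h(key)+2, ...
        }
        slots[i] = {key, value, true};
    }

    std::optional<uint64_t> find(uint64_t key) const {
        size_t i = home(key);
        while (slots[i].used) {
            if (slots[i].key == key) return slots[i].value;
            i = (i + 1) % slots.size();  // follow the same probe sequence
        }
        return std::nullopt;  // reached an empty slot: key is absent
    }
};
```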

Robin Hood hashing is a different method that can move things around on collisions. As I understood it, a cost is associated with each item, describing how much probing is necessary to retrieve it. If a new entry needs to be inserted that collides with existing ones, the items are shuffled around, modifying their costs so that one "takes from the rich and gives to the poor" (hence Robin Hood).
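And a corresponding sketch of the Robin Hood insertion rule (again illustrative, not a particular library's implementation): each entry carries its probe distance, and a new entry that has probed farther than a resident evicts that resident, which then continues probing:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy Robin Hood hashing: entries remember their probe distance ("cost").
// Assumes distinct keys and spare capacity, for brevity.
struct RobinHoodTable {
    struct Slot { uint64_t key = 0; uint64_t value = 0; int dist = -1; };  // dist < 0: empty
    std::vector<Slot> slots;

    explicit RobinHoodTable(size_t capacity = 1024) : slots(capacity) {}

    void insert(uint64_t key, uint64_t value) {
        size_t i = key % slots.size();  // stand-in for h(key)
        int dist = 0;                   // distance probed from the home slot
        while (true) {
            Slot& s = slots[i];
            if (s.dist < 0) {           // empty slot: place the entry here
                s = {key, value, dist};
                return;
            }
            if (s.dist < dist) {                // resident is "richer" (closer to home):
                std::swap(s.key, key);          // evict it and continue probing
                std::swap(s.value, value);      // on its behalf, taking from the
                std::swap(s.dist, dist);        // rich and giving to the poor
            }
            i = (i + 1) % slots.size();
            ++dist;
        }
    }
};
```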

ksahlin mentioned this issue on Jan 11, 2023