Avoid mod operator when fast alternative available #27299

benaadams · 2019-10-19T04:11:23Z

Use fastmod when 128bit multiply is available. (by @lemire https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/)

Api for review dotnet/corefx#41822 to enable 128bit multiply and make it more efficient.

Other half of PR from #27149 (first half was #27195)

|                       Method |   Toolchain |      Mean | Ratio |
|----------------------------- |------------ |----------:|------:|
|    TryGetValue_17_Int_Object |        base |  8.427 ns |  1.00 |
|    TryGetValue_17_Int_Object |        diff |  7.070 ns |  0.84 |
|    TryGetValue_3k_Int_Object |        base |  7.611 ns |  1.00 |
|    TryGetValue_3k_Int_Object |        diff |  6.635 ns |  0.87 |
|    TryGetValue_1M_Int_Object |        base |  7.743 ns |  1.00 |
|    TryGetValue_1M_Int_Object |        diff |  6.803 ns |  0.88 |
|                              |             |           |       |
| TryGetValue_17_String_Object |        base | 16.233 ns |  1.00 |
| TryGetValue_17_String_Object |        diff | 14.775 ns |  0.91 |
| TryGetValue_3k_String_Object |        base | 22.777 ns |  1.00 |
| TryGetValue_3k_String_Object |        diff | 21.387 ns |  0.94 |
| TryGetValue_1M_String_Object |        base | 74.274 ns |  1.00 |
| TryGetValue_1M_String_Object |        diff | 72.219 ns |  0.97 |
|                              |             |           |       |
| TryGetValue_17_Object_Object |        base | 17.341 ns |  1.00 |
| TryGetValue_17_Object_Object |        diff | 17.396 ns |  1.01 |
| TryGetValue_3k_Object_Object |        base | 24.407 ns |  1.00 |
| TryGetValue_3k_Object_Object |        diff | 22.979 ns |  0.94 |
| TryGetValue_1M_Object_Object |        base | 96.327 ns |  1.00 |
| TryGetValue_1M_Object_Object |        diff | 88.482 ns |  0.92 |

KNucleotide

|        Method | Toolchain |     Mean | Ratio |
|-------------- |---------- |---------:|------:|
| KNucleotide_1 |      base | 378.8 ms |  1.00 |
| KNucleotide_1 |      diff | 352.4 ms |  0.93 |

/cc @jkotas @AntonLapounov @GrabYourPitchforks @AndyAyersMS @danmosemsft

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs

danmoseley · 2019-10-19T04:54:30Z

Seems you tried various @benadams, what range of results did you get
https://gist.github.com/benaadams/b3ba5e45c1f0a13b649296a9660ea960

(second hit when I searched for the hex number!)

benaadams · 2019-10-19T05:06:16Z

😄 They were mostly all terrible as they kept the same order of where the changes were happening (least significant bits); except Crc32 which mixed them up. Reversing the bits also worked well, but there is no fast way to do that on x86.

Summary is:

modulo is chopping off all the most significant bits; so it emphasises small changes; but large changes dissappear.

fastrange is chopping of the least significant bits; so it emphasises large changes; but small changes dissappear.

.NET simple hashes (like int where its 1:1) are small changes (sequential numbers); so to get similar behaviour to modulo from fastrange we need to reverse the bit order. That's not very fast (except on arm), however...

Crc32 is a mixer and scrambles the bits; as a bonus the reflected poly also reverses them (though that doesn't seem to be significant); and its fast (1 cycle, 3 cycle latency); so that also works.

benaadams · 2019-10-19T05:13:19Z

Note: good hashcodes e.g. Marvin (randomised string) and System.HashCode (xxHash32) don't need the Crc32 to work with fastrange; however branching based on hashcode type would take longer than just using Crc32. (also wouldn't detect the type for the infinite variety of user hashcodes)

danmoseley · 2019-10-19T05:15:41Z

I guess to cover the modulo path we will rely on the ARM test runs for now. https://github.com/dotnet/corefx/issues/36113

danmoseley · 2019-10-19T05:21:02Z

You ran all the dictionary perf tests I guess? I suppose Insert is essentially the same. @adamsitnik any reservations about our perf test coverage of dictionary?

benaadams · 2019-10-19T05:24:34Z

Yes, also added a new set dotnet/performance#938 as @AndyAyersMS pointed out strictly sequential keys weren't being tested; which is problematic since the bucketing was being changed and sequential integers was one of the issues for fastrange by itself.

benaadams · 2019-10-19T05:27:19Z

One caveat with this change from @GrabYourPitchforks is #27149 (comment)

We'd need to run the CRC32 logic by the crypto board to make sure that we're not improperly compressing the range of the hash function. The final step of CRC32 is a finite field polynomial division, which likely introduces a slight bias toward zero. It's probably safe but needs an extra set of eyes from somebody more steeped in this field.

jkotas · 2019-10-19T06:03:31Z

You ran all the dictionary perf tests I guess?

We also need to run perf tests with tiered JITing off to make sure that we are not regressing Bing and other folks who run in this config. We may need to do a work in crossgen/crossgen2 to handle this pattern well and avoid regressions in this config.

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs

Need measurements mentioned by Jan

benaadams · 2019-10-19T16:52:12Z

Needs more investigation; gets worse when there are 1M (int) keys

|                    Method | Toolchain |      Mean | Ratio |
|-------------------------- |---------- |----------:|------:|
| TryGetValue_17_Int_Object |      base |  8.160 ns |  1.00 |
| TryGetValue_17_Int_Object |      diff |  5.763 ns |  0.71 |
|                           |           |           |       |
| TryGetValue_3k_Int_Object |      base |  8.213 ns |  1.00 |
| TryGetValue_3k_Int_Object |      diff |  7.236 ns |  0.88 |
|                           |           |           |       |
| TryGetValue_1M_Int_Object |      base |  8.323 ns |  1.00 |
| TryGetValue_1M_Int_Object |      diff | 25.507 ns |  3.07 |

Can I get a [no-merge] label?

AntonLapounov · 2019-10-20T01:23:13Z

TryGetValue_1M_Int_Object

@benaadams Are you basically comparing sequential memory access (base) to random memory access (diff)? Sequential memory access will always be faster due to hardware prefetching. Try looking up pseudorandom keys for both 'base' and 'diff'.

NinoFloris · 2019-10-20T02:09:41Z

@benaadams did you try fibonacci hashing too? I found it a pretty insightful blog https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

Fastrange by itself really isn't great for the reason you described, it's throwing away lower bits. Fastrange + crc is more expensive than fibonacci, fibonacci has some issue in the last upper bit (so sort of mirrors modulos upper bits behavior better) and finally fibonacci should actually get better with larger tables.

Worth a shot?

(Implementation https://github.com/skarupke/flat_hash_map/blob/812aede752d789033a7c439e8fd4f8d81522a642/flat_hash_map.hpp#L1270-L1273)

AntonLapounov · 2019-10-20T03:13:11Z

Fastrange + crc is more expensive than fibonacci

This is not comparing apples to apples. If the size of the hash table is not a power of two, then Fibonacci alone won't work. And if the size is a power of two, Fastrange is not needed.

benaadams · 2019-10-23T23:42:11Z

Could change to fastmod which has the advantage of being the exact same behviour as modulo.

Disadvantage is it adds a ulong to Dictionary; and more problematic here is we can still only make use of it behind an intrinsic check as Math. doesn't expose a 64bit * 64bit => 128bit multiply https://github.com/dotnet/corefx/issues/41822

danmoseley · 2019-10-24T01:12:08Z

Cc @tannergooding for 128 bit multiply ask.

tannergooding · 2019-10-24T01:20:10Z

Cc @tannergooding for 128 bit multiply ask.

Ben already opened an issue for that here: https://github.com/dotnet/corefx/issues/41822. We should hopefully get it marked ready-for-review shortly :)

benaadams · 2019-10-24T01:27:06Z

Doesn't need the No Merge tag any more as not changing the behaviour (whether we want to merge it is a different thing of course 😉 )

benaadams · 2019-10-24T03:31:30Z

K, back on track

|                       Method | Toolchain |      Mean | Ratio |
|----------------------------- |---------- |----------:|------:|
|    TryGetValue_17_Int_Object |      base |  8.938 ns |  1.00 |
|    TryGetValue_17_Int_Object |      diff |  6.229 ns |  0.69 |
|                              |           |           |       |
|    TryGetValue_3k_Int_Object |      base |  7.657 ns |  1.00 |
|    TryGetValue_3k_Int_Object |      diff |  5.727 ns |  0.75 |
|                              |           |           |       |
|    TryGetValue_1M_Int_Object |      base |  7.721 ns |  1.00 |
|    TryGetValue_1M_Int_Object |      diff |  6.831 ns |  0.86 |
|                              |           |           |       |
| TryGetValue_17_String_Object |      base | 16.263 ns |  1.00 |
| TryGetValue_17_String_Object |      diff | 13.354 ns |  0.82 |
|                              |           |           |       |
| TryGetValue_3k_String_Object |      base | 22.804 ns |  1.00 |
| TryGetValue_3k_String_Object |      diff | 19.498 ns |  0.86 |
|                              |           |           |       |
| TryGetValue_1M_String_Object |      base | 79.410 ns |  1.00 |
| TryGetValue_1M_String_Object |      diff | 69.598 ns |  0.88 |
|                              |           |           |       |
| TryGetValue_17_Object_Object |      base | 17.572 ns |  1.00 |
| TryGetValue_17_Object_Object |      diff | 15.245 ns |  0.87 |
|                              |           |           |       |
| TryGetValue_3k_Object_Object |      base | 24.880 ns |  1.00 |
| TryGetValue_3k_Object_Object |      diff | 22.257 ns |  0.89 |
|                              |           |           |       |
| TryGetValue_1M_Object_Object |      base | 96.808 ns |  1.00 |
| TryGetValue_1M_Object_Object |      diff | 87.274 ns |  0.90 |

benaadams · 2019-10-24T03:39:23Z

|        Method | Toolchain |     Mean | Ratio |
|-------------- |---------- |---------:|------:|
| KNucleotide_1 |      base | 369.8 ms |  1.00 |
| KNucleotide_1 |      diff | 337.5 ms |  0.91 |
| KNucleotide_9 |      base | 91.43 ms |  1.00 |
| KNucleotide_9 |      diff | 88.44 ms |  0.97 |

benaadams · 2019-10-24T06:13:21Z

I am assuming dotnet/corefx#41822 will solve the issue with the intrinsic test in the R2R when Tiered Jitting is disabled as it will become a normal multiply?

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs

AntonLapounov

Besides the issue in Resize, changes look good to me. This should benefit most applications, though may lead to increased memory consumption in applications creating lots of small Dictionary instances.

danmoseley · 2019-10-26T04:33:57Z

increased memory consumption in applications creating lots of small Dictionary instances.

We did remove a similar sized field last release (a ref for SyncRoot). I guess it cancels out...

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs

Co-Authored-By: Jan Kotas <jkotas@microsoft.com>

jkotas

Thank you - nice work!

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

GrabYourPitchforks · 2019-10-28T18:32:02Z

If there is concern about memory usage, one thing you could do is to move rarely-accessed data into a separate instance. For example:

public class Dictionary<TKey, TValue>
{
    private KeyCollection _keys; // REMOVE me
    private ValueCollection _values; // REMOVE me

    private RareMembers _rareMembers; // INTRODUCE me

    private sealed class RareMembers
    {
        internal KeyCollection Keys;
        internal ValueCollection Values;
    }
}

It saves 1 reference in any Dictionary<TKey, TValue> instance where the Keys or Values property isn't accessed. (I'd wager that describes most dictionary instances.) The trade-off is that if you do access either of those members there's a bit more indirection and a bit more memory usage.

- dotnet/coreclr#27215 - dotnet/coreclr#27299

benaadams mentioned this pull request Oct 19, 2019

Improve Dictionary perfomance #27149

Closed

benaadams commented Oct 19, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs Outdated Show resolved Hide resolved

danmoseley previously approved these changes Oct 19, 2019

View reviewed changes

danmoseley requested a review from jkotas October 19, 2019 05:21

jkotas reviewed Oct 19, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs Outdated Show resolved Hide resolved

benaadams force-pushed the dict-divisor branch 2 times, most recently from a16535d to 1f00260 Compare October 19, 2019 11:22

stephentoub added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Oct 19, 2019

benaadams force-pushed the dict-divisor branch from 1f00260 to ab30be9 Compare October 23, 2019 23:36

benaadams force-pushed the dict-divisor branch from ff3bbda to ee71bd6 Compare October 24, 2019 02:31

AntonLapounov reviewed Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs Outdated Show resolved Hide resolved

nits

9c9c7f6

AntonLapounov reviewed Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs Outdated Show resolved Hide resolved

AntonLapounov approved these changes Oct 26, 2019

View reviewed changes

Guard against corruption from OOM

7a008ef

jkotas reviewed Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/HashHelpers.cs Outdated Show resolved Hide resolved

Feedback

9db1979

benaadams commented Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs Outdated Show resolved Hide resolved

nit

c6c9001

jkotas reviewed Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs Outdated Show resolved Hide resolved

jkotas reviewed Oct 26, 2019

View reviewed changes

src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs Outdated Show resolved Hide resolved

Apply suggestions from code review

06aacfd

Co-Authored-By: Jan Kotas <jkotas@microsoft.com>

jkotas approved these changes Oct 26, 2019

View reviewed changes

jkotas merged commit e532bf6 into dotnet:master Oct 26, 2019

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

28d4b1f

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

438a39a

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

8ae5326

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

benaadams deleted the dict-divisor branch October 26, 2019 14:22

jkotas pushed a commit to dotnet/corert that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

c17eeaf

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

marek-safar pushed a commit to mono/mono that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

2da4e54

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

stephentoub pushed a commit to dotnet/corefx that referenced this pull request Oct 26, 2019

Avoid mod operator when fast alternative available (dotnet/coreclr#27299

36d9ba1

) Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

jkotas mentioned this pull request Nov 15, 2019

Add recently added entries for THIRD-PARTY-NOTICES dotnet/runtime#54

Merged

safern pushed a commit to dotnet/runtime that referenced this pull request Nov 15, 2019

Add recently added entries for THIRD-PARTY-NOTICES (#54)

6c0cdd8

- dotnet/coreclr#27215 - dotnet/coreclr#27299

danmoseley mentioned this pull request Jan 22, 2020

Replace modulo with fastmodulo in HashSet and Hashtable dotnet/runtime#1989

Closed

2 tasks

adamsitnik mentioned this pull request Sep 2, 2020

[ARM64] Performance regression: DictionarySequentialKeys.ContainsKey dotnet/runtime#41743

Closed

MihaZupan mentioned this pull request Feb 11, 2021

Stash LoadBalancingPolicy on ClusterConfig microsoft/reverse-proxy#733

Closed

EgorBo mentioned this pull request Feb 23, 2022

Use Fast-Mod algorithm for native EEHashTableBase dotnet/runtime#65778

Closed

adamsitnik mentioned this pull request Mar 7, 2022

Add some additional micro benchmark coverage for immutable collections. dotnet/performance#2268

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid mod operator when fast alternative available #27299

Avoid mod operator when fast alternative available #27299

benaadams commented Oct 19, 2019 •

edited

Loading

danmoseley commented Oct 19, 2019

benaadams commented Oct 19, 2019

benaadams commented Oct 19, 2019 •

edited

Loading

danmoseley commented Oct 19, 2019

danmoseley commented Oct 19, 2019

benaadams commented Oct 19, 2019

benaadams commented Oct 19, 2019

jkotas commented Oct 19, 2019 •

edited

Loading

benaadams commented Oct 19, 2019

AntonLapounov commented Oct 20, 2019 •

edited

Loading

NinoFloris commented Oct 20, 2019

AntonLapounov commented Oct 20, 2019

benaadams commented Oct 23, 2019

danmoseley commented Oct 24, 2019

tannergooding commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

AntonLapounov left a comment

danmoseley commented Oct 26, 2019

jkotas left a comment

GrabYourPitchforks commented Oct 28, 2019

Avoid mod operator when fast alternative available #27299

Avoid mod operator when fast alternative available #27299

Conversation

benaadams commented Oct 19, 2019 • edited Loading

danmoseley commented Oct 19, 2019

benaadams commented Oct 19, 2019

benaadams commented Oct 19, 2019 • edited Loading

danmoseley commented Oct 19, 2019

danmoseley commented Oct 19, 2019

benaadams commented Oct 19, 2019

benaadams commented Oct 19, 2019

jkotas commented Oct 19, 2019 • edited Loading

benaadams commented Oct 19, 2019

AntonLapounov commented Oct 20, 2019 • edited Loading

NinoFloris commented Oct 20, 2019

AntonLapounov commented Oct 20, 2019

benaadams commented Oct 23, 2019

danmoseley commented Oct 24, 2019

tannergooding commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

benaadams commented Oct 24, 2019

AntonLapounov left a comment

Choose a reason for hiding this comment

danmoseley commented Oct 26, 2019

jkotas left a comment

Choose a reason for hiding this comment

GrabYourPitchforks commented Oct 28, 2019

benaadams commented Oct 19, 2019 •

edited

Loading

benaadams commented Oct 19, 2019 •

edited

Loading

jkotas commented Oct 19, 2019 •

edited

Loading

AntonLapounov commented Oct 20, 2019 •

edited

Loading