Refactor hashtable use linear probing #15

danielealbano · 2020-06-08T22:07:38Z

This PR contains the refactoring required to switch to a model with 13 slots per bucket, where each slot is identified by one half of the hash. When AVX2 or AVX is available, the search among the slots is accelerated and takes 2 SIMD instructions.
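To illustrate the slot search, here is a minimal scalar sketch, not the code from this PR. The names `half_hash_of` and `bucket_search_mask` are hypothetical, and using the upper 32 bits as the "half" of the hash is an assumption. The skip mask mirrors the commit that lets the search ignore some matches using a bitmask. A SIMD variant could compare eight 32-bit lanes at a time (for example with `_mm256_cmpeq_epi32`), which is one way 13 slots could be covered in two comparisons.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 13 /* slot count stated in the PR description */

/* Hypothetical helper: keep the upper half of a 64-bit hash as the slot tag.
 * Which half is actually used is an assumption. */
static inline uint32_t half_hash_of(uint64_t hash) {
    return (uint32_t)(hash >> 32);
}

/* Scalar stand-in for the SIMD search: returns a bitmask in which bit i is
 * set when slot i holds the searched half hash. Bits set in skip_mask are
 * cleared so the caller can ignore slots it has already examined. */
static inline uint32_t bucket_search_mask(
        const uint32_t half_hashes[SLOTS_PER_BUCKET],
        uint32_t needle,
        uint32_t skip_mask) {
    uint32_t mask = 0;
    for (unsigned i = 0; i < SLOTS_PER_BUCKET; i++) {
        mask |= (uint32_t)(half_hashes[i] == needle) << i;
    }
    return mask & ~skip_mask;
}
```

A half hash match is only a candidate: the caller still has to compare the full key, since two different keys can share the same half hash.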

Each bucket is cache-line aligned, hence the number of slots is limited. However, in the tests and benchmarks, even with a hashtable of 133821673 buckets, the highest number of slots used in a single bucket after inserting 120439505 keys (load factor 0.9) is 9, so the upper limit of 13 can be considered reasonably high.
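As a rough illustration of why the cache line caps the slot count, the sketch below fits 13 half hashes of 4 bytes each (52 bytes) plus 12 bytes of metadata into 64 bytes. The field names, the metadata fields and the 64-byte cache line size are assumptions, not the layout used by this PR. The `load_factor` helper reproduces the ratio of keys to buckets quoted above.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: typical x86_64 cache line size */
#define SLOTS_PER_BUCKET 13 /* slot count stated in the PR description */

/* Hypothetical bucket layout: 13 * 4 = 52 bytes of half hashes,
 * followed by 12 bytes of metadata, for a total of 64 bytes. */
typedef struct {
    alignas(CACHE_LINE_SIZE) uint32_t half_hashes[SLOTS_PER_BUCKET];
    uint32_t write_lock;      /* assumed per-bucket lock word */
    uint64_t keys_values_ref; /* assumed reference to external key/value data */
} bucket_t;

static_assert(sizeof(bucket_t) == CACHE_LINE_SIZE,
              "a bucket must fill exactly one cache line");

/* Load factor as used in the description: inserted keys over total buckets. */
static inline double load_factor(uint64_t keys, uint64_t buckets) {
    return (double)keys / (double)buckets;
}
```

With these sizes a 14th half hash would leave only 8 bytes for metadata, so the slot count and the metadata size trade off directly against each other.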

This refactoring includes a couple of feature flags to:

- embed the key/value of each slot in the bucket instead of storing it externally; this improves performance by 15% on average, but a lot of memory is wasted as a consequence of the pre-allocation
- disable the write lock for the bucket; the code then uses atomic operations to update the half hash in the bucket, with the instructions re-organised so that it works properly without locks, but it is 6 times slower and the code still needs to be reviewed.
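The lock-free option can be pictured with a small C11 atomics sketch. This is an assumption about the general technique, not the code in this PR: `lockfree_bucket_t`, `bucket_claim_slot` and the use of 0 as the "empty slot" marker are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SLOTS_PER_BUCKET 13 /* slot count stated in the PR description */
#define SLOT_EMPTY 0u       /* assumption: 0 marks an unused slot */

/* Hypothetical bucket holding only the atomically updated half hashes. */
typedef struct {
    _Atomic uint32_t half_hashes[SLOTS_PER_BUCKET];
} lockfree_bucket_t;

/* Try to claim the first empty slot for half_hash using compare-and-swap.
 * Returns the claimed slot index, or -1 when every slot is already taken.
 * half_hash must differ from SLOT_EMPTY. */
static int bucket_claim_slot(lockfree_bucket_t *bucket, uint32_t half_hash) {
    for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
        uint32_t expected = SLOT_EMPTY;
        if (atomic_compare_exchange_strong(
                &bucket->half_hashes[i], &expected, half_hash)) {
            return i;
        }
    }
    return -1;
}
```

The compare-and-swap guarantees that two threads never claim the same slot, but a complete implementation also has to order the key/value writes relative to the half hash update so that readers never see a partially written entry.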

This code is based on another branch containing the changes required to use chaining, but after an initial implementation and benchmark I decided to switch back to linear probing, with a slightly different implementation to improve the overall performance.

Results from an initial benchmark on an AMD EPYC 7502P with 128GB of memory (the additional stats need to be reviewed, they aren't calculated properly):

hashtable_op_set_new/11748391/2937097/2/iterations:1/real_time/threads:128               32672169 ns    199629794 ns          128 keys_to_insert=2.9371M load_factor=0.25 load_factor_buckets=0.153504 total_buckets=11.7484M used_avg_bucket_slots=1.08569 used_buckets=2.70513M used_max_bucket_slots=5
hashtable_op_set_new/11748391/2937097/2/iterations:1/real_time/threads:256               21074638 ns     88349183 ns          256 keys_to_insert=2.9371M load_factor=0.25 load_factor_buckets=0.15359 total_buckets=11.7484M used_avg_bucket_slots=1.08509 used_buckets=2.70665M used_max_bucket_slots=6
hashtable_op_set_new/11748391/3876969/2/iterations:1/real_time/threads:128                4449545 ns    142267550 ns          128 keys_to_insert=3.87697M load_factor=0.33 load_factor_buckets=0.197453 total_buckets=11.7484M used_avg_bucket_slots=1.11412 used_buckets=3.47963M used_max_bucket_slots=6
hashtable_op_set_new/11748391/3876969/2/iterations:1/real_time/threads:256                1403225 ns     68445306 ns          256 keys_to_insert=3.87697M load_factor=0.33 load_factor_buckets=0.197459 total_buckets=11.7484M used_avg_bucket_slots=1.11409 used_buckets=3.47973M used_max_bucket_slots=7
hashtable_op_set_new/11748391/5874195/2/iterations:1/real_time/threads:128                5670142 ns    159535779 ns          128 keys_to_insert=5.87419M load_factor=0.5 load_factor_buckets=0.28346 total_buckets=11.7484M used_avg_bucket_slots=1.17583 used_buckets=4.9953M used_max_bucket_slots=7
hashtable_op_set_new/11748391/5874195/2/iterations:1/real_time/threads:256                1833017 ns     84183284 ns          256 keys_to_insert=5.87419M load_factor=0.5 load_factor_buckets=0.283449 total_buckets=11.7484M used_avg_bucket_slots=1.17588 used_buckets=4.9951M used_max_bucket_slots=7
hashtable_op_set_new/11748391/8811293/2/iterations:1/real_time/threads:128                8308756 ns    204654679 ns          128 keys_to_insert=8.81129M load_factor=0.75 load_factor_buckets=0.393357 total_buckets=11.7484M used_avg_bucket_slots=1.27094 used_buckets=6.93196M used_max_bucket_slots=8
hashtable_op_set_new/11748391/8811293/2/iterations:1/real_time/threads:256                2271562 ns     90279197 ns          256 keys_to_insert=8.81129M load_factor=0.75 load_factor_buckets=0.393446 total_buckets=11.7484M used_avg_bucket_slots=1.27065 used_buckets=6.93352M used_max_bucket_slots=8
hashtable_op_set_new/11748391/10573551/2/iterations:1/real_time/threads:128               7298341 ns    191494659 ns          128 keys_to_insert=10.5736M load_factor=0.9 load_factor_buckets=0.451147 total_buckets=11.7484M used_avg_bucket_slots=1.32973 used_buckets=7.95035M used_max_bucket_slots=8
hashtable_op_set_new/11748391/10573551/2/iterations:1/real_time/threads:256               1997645 ns     93085817 ns          256 keys_to_insert=10.5736M load_factor=0.9 load_factor_buckets=0.451078 total_buckets=11.7484M used_avg_bucket_slots=1.32993 used_buckets=7.94915M used_max_bucket_slots=8
hashtable_op_set_new/133821673/33455418/2/iterations:1/real_time/threads:128            128461522 ns   2099067226 ns          128 keys_to_insert=33.4554M load_factor=0.25 load_factor_buckets=0.153445 total_buckets=133.822M used_avg_bucket_slots=1.0856 used_buckets=30.8014M used_max_bucket_slots=6
hashtable_op_set_new/133821673/33455418/2/iterations:1/real_time/threads:256             23609258 ns    716225210 ns          256 keys_to_insert=33.4554M load_factor=0.25 load_factor_buckets=0.153448 total_buckets=133.822M used_avg_bucket_slots=1.08558 used_buckets=30.802M used_max_bucket_slots=6
hashtable_op_set_new/133821673/44161152/2/iterations:1/real_time/threads:128             58457250 ns   2057175148 ns          128 keys_to_insert=44.1612M load_factor=0.33 load_factor_buckets=0.197363 total_buckets=133.822M used_avg_bucket_slots=1.11393 used_buckets=39.6173M used_max_bucket_slots=6
hashtable_op_set_new/133821673/44161152/2/iterations:1/real_time/threads:256             17624390 ns    767131974 ns          256 keys_to_insert=44.1612M load_factor=0.33 load_factor_buckets=0.197361 total_buckets=133.822M used_avg_bucket_slots=1.11394 used_buckets=39.6168M used_max_bucket_slots=7
hashtable_op_set_new/133821673/33455418/2/iterations:1/real_time/threads:256              9742951 ns    627039687 ns          256 keys_to_insert=33.4554M load_factor=0.25 load_factor_buckets=1 total_buckets=133.822M used_avg_bucket_slots=0.166579 used_buckets=200.733M used_max_bucket_slots=6
hashtable_op_set_new/133821673/66910836/2/iterations:1/real_time/threads:128            103718502 ns   2900329998 ns          128 keys_to_insert=66.9108M load_factor=0.5 load_factor_buckets=0.283221 total_buckets=133.822M used_avg_bucket_slots=1.17572 used_buckets=56.8516M used_max_bucket_slots=7
hashtable_op_set_new/133821673/66910836/2/iterations:1/real_time/threads:256             25181532 ns    959190065 ns          256 keys_to_insert=66.9108M load_factor=0.5 load_factor_buckets=0.283225 total_buckets=133.822M used_avg_bucket_slots=1.17572 used_buckets=56.8525M used_max_bucket_slots=7


danielealbano added 30 commits May 31, 2020 22:48

Initial implementation of the hash search algorithm in the array with…

bfc66a6

… 14 entries with AVX512, AVX2, SSE4 and a simple linear search support

Initial drop of the cachelines-based hashtable implementation

55d89e2

Style fix

03e881f

Initial benchmarking support for the hashtable_support_hash_search_{a…

20f7a9d

…vx512,avx2,sse4,loop} functions, not fully tested

Prefix cmake custom variables with the projectname

248be49

Drop more code related to the cacheline-based hashtable

6722f40

Rename HASHTABLE_INLINE_KEY_MAX_SIZE to HASHTABLE_KEY_INLINE_MAX_LENG…

a80afd5

…TH to match the naming convention

Add support to store a rpefix of the key into the KV structure to be …

d224862

…used during the search phase to speed up the lookup

Import portable snippet cpu library ( https://github.com/nemequ/porta…

520370b

…ble-snippets )

Drop gcc builtins to make the code more portable, drop the preferred …

3f4c92c

…instruction set variable to hint which instruction set should be selected

Drop the AVX512 and the SSE linear search algorithm implementation

0726c6b

Replace avx2 search algorithm implementation with the branchless prov…

30e13c2

…ided by @chtz ( https://stackoverflow.com/a/62123631/169278 )

Drop AVX512 and SEE benchs

b0a76cb

Drop unused header

494e0f1

Refactor the code to better test the two search implementation, in th…

1449d42

…e real world they don't get to perform exactly X searches on exactly the same data exactly sequentially, this approach regenerates the data each time and is more realistic

Split out the avx2/loop hash search algorithm implementation into the…

e300fcb

…ir own files, need to properly handle AVX2 compilation flags targetted per src file

Add the ad-hoc compilation flags for the avx2 hash serch algorithm im…

33bb3e9

…plementation

Improve benchmark structure

08341c0

Rely on the build system to decide if this the avx2 variant has to be…

b196a46

… compiled

Implement the AVX hash search algorithm variant

dfe2963

Add the AVX search algorithm to the auto-selection function

32103ac

Rework the build system

cd87975

Add the AVX version of the search algorithm to the benches and simpli…

86c8716

…fy the define containing the main bench

Expose the CMAKE_BUILD_TYPE variable

71a0928

Reorganise the cmake_config.h file and fix cpp support (when included…

aed75ea

… via benches or testing)

Split the buildstep invoked as custom dependency target in its own cm…

bf482bf

…ake script

Update the src cmakefile script to use the new cmake_config_c exposed…

6343ad6

… variables to include the re-generated file at build time

Refactor the hash search functions to be able to ignore some matches …

21f4898

…using a bitmask

Select the right hash search implementation when the hashtable is ini…

fb5ad77

…tialized, add a static variable to ensure the code is executed only once

danielealbano added 28 commits June 8, 2020 17:01

Fix how the avx2/avx search algorithm gets build, if it's on x86_64 t…

4cc7695

…he compiler will always able to emit avx2/avx instructions

Implement the search algorithms for the 8 slots version and use ifunc…

d2e10e1

… instead of function pointers to pick the best implementation option at runtime

Switch to use t1ha0, after a number of tests the load factor almost d…

5e6af66

…oesn't change

Drop pre-check on the write lock

6276d49

Lock always before checkinf the chain_first_ring is null to avoid edg…

2e4e6e8

…e cases that would case slowdowns

Fix the search or create key (hashtable_support_op_search_key_or_crea…

af18d87

…te_new) function

Improve testing

1623cd1

Fix hashtable structures

78ee645

Update the hashtable support hash search benchmark to use the new nam…

2a2ab70

…e convention for the methods and move the shared code to an external support c file

Improve how the load factor is calculated

55bf660

Drop the code related to the cachelines / loadfactor calculation

11d9a70

Use the set_thread_affinity function from bench-support.c

cd9e3fc

Fix headers

66d3c98

Drops keys pre-generation, better to generate random keys on every it…

2c4372e

…eration

Add the ability to collect hashtable stats (load factor, buckets usag…

ab5104b

…e, etc)

Check for set errors

c879f37

Fix index variable type and fix how the hashtable is prefilled with t…

561154f

…he keys for the update bench

Run the benches at least 10 times

5151349

Allow a maximum of 4 threads per core on the benches

59d0640

Rename the support function in the hashtable set bench and share the …

3cb9501

…collect hashtable stats and update state code

Enforce -O3 compilation flag when building for non-debug and fix style

5248f3a

Refactor the hashtable to use linear probing, add support to embed ke…

33c17c5

…y/values onto the bucket and add support disabling the locks (switch to atomic operations)

Fix mmap result check

dd05e1e

Drop useless fixtures

1fdbf4f

Update the tests to test the linear probing

0b2b1e4

Update the defaults (no locks, key/values embedded)

be352e0

Update the tests to support the new data structure

0f1ef1a

Re-enable locks, with atomic ops 6 times slower

a891287

danielealbano merged commit fa13abd into master Jun 8, 2020

danielealbano deleted the refactor_hashtable_use_linear_probing branch June 8, 2020 22:12
