Add faster HashMap implementation #5271

Open: wants to merge 1 commit into base: main

@lth lth commented Apr 30, 2019

This adds a new hash table implementation, HashMapRB, that is generally more memory-friendly and faster than HashMap or std::unordered_map. It replaces the global lock table as well as the tracked_keys data structure.

On a single-threaded workload where GetForUpdate + Put(assume_tracked) is called in batches of 100k keys:
std::unordered_map: 265691.8 / s
HashMapRB: 298957.6 / s

12.5% improvement

@facebook-github-bot facebook-github-bot left a comment

@lth has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@adamretter
Collaborator

@lth I also just saw this announcement of F14, a new open-source HashMap implementation from Facebook. Perhaps it is of interest: https://code.fb.com/developer-tools/f14/

@lth
Contributor Author

lth commented May 1, 2019

@adamretter I spoke to @siying about this, since my first thought was to use F14 as well, but it seems hard to include in RocksDB without pulling in a lot of folly dependencies.

@siying
Contributor

siying commented May 1, 2019

@adamretter we have discussed depending on folly many times, but so far it's still too complicated. Several factors come to mind:

  1. Platform support. The community has ported RocksDB to platforms like Power, FreeBSD, Solaris, etc., while folly has no long-term support for them.
  2. Ease of build. Right now RocksDB has no hard dependencies to build. If you have Linux, FreeBSD, etc., you can just grab the code and run make or cmake and it is done (if you need a specific compression library, you can optionally install it). If we rely on folly, either a user has to choose at build time whether to rely on it or not, or we treat folly as a hard dependency. Both ways make it harder for users to build and run RocksDB.
  3. Licensing. RocksDB is dual-licensed under GPLv2 and Apache, but folly is Apache-only. This will complicate users' consideration of adopting RocksDB or products built on RocksDB. Of course, we can work with our lawyers to try to re-license folly, so this is a relatively minor consideration.

So the decision so far is that we aren't going to depend on folly for now just because of this feature, and we may periodically revisit this decision.

@adamretter
Collaborator

@siying Totally understand... and all are very good reasons! Thanks for the explanation :-)

@facebook-github-bot
Contributor

@lth has updated the pull request. Re-import the pull request

@ltamasi ltamasi self-requested a review May 9, 2019 18:26

@ltamasi ltamasi left a comment

This is one cool hash table.

public:
using key_type = K;
using mapped_type = V;
using value_type = std::pair<K, V>;

We should probably match the standard associative containers here and use std::pair<const K, V>.
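To illustrate the suggestion, here is a minimal sketch that checks what the standard container exposes and mirrors it in a stand-in typedef block. `MapTypes` is a hypothetical name used only for this example; it is not the class from the PR.

```cpp
#include <string>
#include <type_traits>
#include <unordered_map>
#include <utility>

// The standard container exposes a pair whose key type is const.
using StdMap = std::unordered_map<int, std::string>;
static_assert(std::is_same<StdMap::value_type,
                           std::pair<const int, std::string>>::value,
              "std::unordered_map stores pair<const K, V>");

// Hypothetical stand-in for the typedefs under review, with the const added.
template <typename K, typename V>
struct MapTypes {
  using key_type = K;
  using mapped_type = V;
  using value_type = std::pair<const K, V>;
};

// With a const key, `it->first = ...` does not compile, so a caller cannot
// rewrite a key in place and leave the entry stored under the wrong hash.
static_assert(std::is_const<MapTypes<int, int>::value_type::first_type>::value,
              "key part of value_type is const");
```

The const key is what prevents callers from mutating a key through an iterator, which would silently break lookups.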

// the 'hole'.
//
template <typename K, typename V, class Hash = std::hash<K>>
class HashMapRB {

How about calling it HashMapRobinHood? RB immediately made me think of red-black trees.

// Robinhood hashing is used, where metadata about the distance between the
// current slot and the desired slot is kept. On collisions during inserts, if
// the occupying item's distance is smaller than the inserted item's distance,
// then the inserted item takes over the slot, and the occupying item is

This means that the rules for invalidating iterators are different from those for std::unordered_map. We should make sure none of the code we're switching over to the new implementation relies on std::unordered_map's behavior.
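A toy sketch of why this matters: with Robin Hood displacement, inserting one key can move an existing element to a different slot, whereas std::unordered_map never moves elements on insert (references stay valid, and iterators are only invalidated by a rehash). Everything below (`ToyRobinHood`, `index_of`, the identity hash) is hypothetical and much simpler than the PR's table: no resizing, no duplicate check, and the caller must keep the table from filling up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int kEmpty = -1;  // marks an unused slot; keys must be >= 0

// Toy Robin Hood set with linear probing and a power-of-two capacity.
struct ToyRobinHood {
  std::vector<int> slots;
  std::vector<uint8_t> dist;  // distance of each item from its desired slot
  size_t mask;

  explicit ToyRobinHood(size_t pow2_capacity)
      : slots(pow2_capacity, kEmpty),
        dist(pow2_capacity, 0),
        mask(pow2_capacity - 1) {}

  size_t desired(int key) const { return static_cast<size_t>(key) & mask; }

  void insert(int key) {
    size_t pos = desired(key);
    uint8_t d = 0;
    while (slots[pos] != kEmpty) {
      if (dist[pos] < d) {
        // The occupant is closer to its desired slot than we are: take its
        // slot and continue probing with the displaced occupant instead.
        std::swap(key, slots[pos]);
        std::swap(d, dist[pos]);
      }
      pos = (pos + 1) & mask;
      ++d;
    }
    slots[pos] = key;
    dist[pos] = d;
  }

  // Returns the slot index holding `key`, or slots.size() if it is absent.
  size_t index_of(int key) const {
    size_t pos = desired(key);
    for (uint8_t d = 0; slots[pos] != kEmpty && dist[pos] >= d; ++d) {
      if (slots[pos] == key) return pos;
      pos = (pos + 1) & mask;
    }
    return slots.size();
  }
};
```

In an 8-slot table, inserting 1 and 2 and then 9 (which also wants slot 1) pushes 2 out of slot 2 into slot 3, so an index-based iterator that pointed at 2 would now point at 9.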

return ((1 << 7) | (offset << 3) | hashbits);
}

static constexpr uint8_t inc_dist(uint8_t x) { return x + (1 << 3); }

We could add an assert here to make sure we don't overflow the 4-bit field (and similarly add an assertion for underflow in dec_dist below).
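A sketch of what those assertions might look like. The bit layout is an assumption inferred from the quoted return statement (bit 7 as an occupied flag, bits 3..6 as the distance, bits 0..2 as hash bits), and the `get_dist` body and the `kDistShift`/`kDistMax` names are guesses for illustration. The functions are written as plain inline functions rather than constexpr so that `assert` is usable under C++11.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: [7] occupied flag | [6..3] distance | [2..0] hash bits.
constexpr uint8_t kDistShift = 3;
constexpr uint8_t kDistMax = 0xF;  // largest value the 4-bit field can hold

inline uint8_t get_dist(uint8_t x) {
  return static_cast<uint8_t>((x >> kDistShift) & kDistMax);
}

inline uint8_t inc_dist(uint8_t x) {
  // Incrementing past 15 would carry into the flag bit.
  assert(get_dist(x) < kDistMax);
  return static_cast<uint8_t>(x + (1 << kDistShift));
}

inline uint8_t dec_dist(uint8_t x) {
  // Decrementing below 0 would borrow from the flag bit.
  assert(get_dist(x) > 0);
  return static_cast<uint8_t>(x - (1 << kDistShift));
}
```

Both checks compile away when NDEBUG is defined, so they should not affect release-build performance.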

typedef iterator_impl<const HashMapRB, true> const_iterator;

// -- Iterator Operations
iterator begin() { return iterator(this, 0); }

We could consider adding cbegin/cend and empty as well to mimic the standard unordered_map.

destroy();

memcpy(this, &other, sizeof(*this));
other.values_ = nullptr;

We should probably clear the other fields as well to bring the moved-from object to a valid empty state (same with the move ctor below). Or even call init(1 << 4) on it; that might be even better.
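One possible shape for this, sketched on a hypothetical miniature class (`TinyTable` and its fields are illustrative and not taken from the PR). It copies fields explicitly instead of using memcpy and resets every field of the source. Whether a zero-capacity state counts as valid for the real table depends on how it uses mask_, which is why re-running init on the source may be the safer variant there.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical miniature of a table that owns one heap array.
class TinyTable {
 public:
  explicit TinyTable(size_t capacity = 16) { init(capacity); }
  ~TinyTable() { destroy(); }

  TinyTable(TinyTable&& other) noexcept { steal(other); }

  TinyTable& operator=(TinyTable&& other) noexcept {
    if (this != &other) {
      destroy();
      steal(other);
    }
    return *this;
  }

  TinyTable(const TinyTable&) = delete;
  TinyTable& operator=(const TinyTable&) = delete;

  size_t size() const { return size_; }
  size_t capacity() const { return capacity_; }
  bool empty() const { return size_ == 0; }
  void set_size_for_demo(size_t n) { size_ = n; }  // test hook only

 private:
  void init(size_t capacity) {
    values_ = new int[capacity]();
    capacity_ = capacity;
    size_ = 0;
  }

  void destroy() {
    delete[] values_;
    values_ = nullptr;
  }

  void steal(TinyTable& other) {
    assert(this != &other);
    values_ = other.values_;
    capacity_ = other.capacity_;
    size_ = other.size_;
    // Reset every field, not just the pointer, so that size(), empty() and
    // the destructor all behave sensibly on the moved-from object.
    other.values_ = nullptr;
    other.capacity_ = 0;
    other.size_ = 0;
  }

  int* values_ = nullptr;
  size_t capacity_ = 0;
  size_t size_ = 0;
};
```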

// Rehash until we get a short distances. This could loop infinitely if we
// have a bad hash function.
while (true) {
pos = h & mask_;

Minor, but this line seems superfluous considering pos is reinitialized to h & mask_ in the for loop below.

}

ROCKSDB_FORCE_INLINE iterator find(const K& key) {
const_iterator it = const_cast<typename std::add_const<

I think one way around the constness problems here would be to move the actual find logic to a private helper that would return only an index, and then have two thin find wrappers around it (one const method that returns a const_iterator, and one non-const method that returns an iterator).
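A minimal sketch of that pattern on a hypothetical container (`IntBag` and `find_index` are illustrative names, and the linear scan merely stands in for the real probing logic). The lookup lives in one const helper that returns an index, and each `find` overload only converts that index into its own iterator type, so no const_cast is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class IntBag {
 public:
  using iterator = std::vector<int>::iterator;
  using const_iterator = std::vector<int>::const_iterator;

  void add(int v) { items_.push_back(v); }

  iterator end() { return items_.end(); }
  const_iterator end() const { return items_.end(); }

  // Thin wrappers: each turns the shared index into its own iterator type.
  iterator find(int key) {
    return items_.begin() + static_cast<std::ptrdiff_t>(find_index(key));
  }
  const_iterator find(int key) const {
    return items_.begin() + static_cast<std::ptrdiff_t>(find_index(key));
  }

 private:
  // All lookup logic lives here. Returns items_.size() when `key` is absent,
  // which the wrappers map to end().
  size_t find_index(int key) const {
    for (size_t i = 0; i < items_.size(); ++i) {
      if (items_[i] == key) return i;
    }
    return items_.size();
  }

  std::vector<int> items_;
};
```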

assert(((pos + get_dist(info_[it.index_])) & mask_) == it.index_);

auto find_it = find(it->first);
assert(find_it != end());

assert(find_it == it) ?

@maysamyabandeh maysamyabandeh removed their request for review January 3, 2020 21:33
@facebook-github-bot
Contributor

@lth has updated the pull request. Re-import the pull request

@mrambacher
Contributor

@siying @adamretter Is there a reason not to use Folly if it is available? Is there a reason not to introduce a compile-time flag that uses the Folly implementation if it is there and the RobinHood otherwise? Wouldn't this be similar to what is done with things like ROCKSDB_JEMALLOC and other flags?

I understand it would add another dimension to the ever-growing testing matrix and potentially complicate something like the Java distribution, but it seems like it might be nice to be able to take advantage of the Folly features where/when they are available.
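A rough sketch of what such a switch could look like. `ROCKSDB_USE_FOLLY_F14` is a hypothetical macro name invented for this example, not an existing RocksDB flag, and `FastMap` and `count_tracked` are illustrative. The fallback branch uses std::unordered_map purely as a stand-in so the snippet compiles on its own; in the PR the fallback would be the Robin Hood table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#if defined(ROCKSDB_USE_FOLLY_F14)  // hypothetical opt-in flag
#include <folly/container/F14Map.h>
template <typename K, typename V>
using FastMap = folly::F14FastMap<K, V>;
#else
#include <unordered_map>
// Stand-in fallback so this sketch is self-contained.
template <typename K, typename V>
using FastMap = std::unordered_map<K, V>;
#endif

// Callers name only the alias, so they do not change with the flag.
inline size_t count_tracked(const FastMap<std::string, int>& keys) {
  return keys.size();
}
```

This only works cleanly if both implementations expose the same subset of the map interface, which is another reason to keep the new table's API close to std::unordered_map.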

@adamretter
Collaborator

@mrambacher sounds reasonable to me, as long as Siying's concerns are met

@facebook-github-bot
Contributor

Hi @lth!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

6 participants