[FEA] Murmur3 that matches spark hashing for partitioning #6863

rwlee · 2020-11-30T23:32:58Z

Existing Murmur3 hashing and recently implemented serial Murmur3 functionality (#6781) don't quite match Spark's murmur3 hash that is used for partitioning. In the process of testing it, I found a difference in the Spark implementation's handling of input tails that effects the hash of any input that is not 4 byte aligned.

While most of the existing kernel could be copied, string types and other unfixed width types supported in the future would replace https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/include/cudf/detail/utilities/hash_functions.cuh#L517.

  for (int i = nblocks*4; i < len; i++) {
    uint32_t k1 = data[i]
    k1 *= c1;
    k1 = rotl32(k1, 15);
    k1 *= c2;
    h1 ^= k1;
    h1 = rotl32(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;
  }

Can this be included as yet another hash function? If yes, should the hash ID specifically reference Spark?

The text was updated successfully, but these errors were encountered:

harrism · 2020-11-30T23:39:20Z

@rwlee I discussed this with @sameerz and we think you should open a PR (against 0.18) that includes the new hash function(s) and sufficient testing so that we can see exactly what the impact would be.

should the hash ID specifically reference Spark?

I don't know what the hash ID is, but it makes sense to explicitly reference Spark if that is the only place this is used.

rwlee · 2020-11-30T23:53:14Z

It would largely be parallel to the existing functionality. Changing string hashing would require copying the functor, unlike #6781 which reused the existing kernel and fixed a few salient cases.

I'll open a draft PR with a proposed implementation.

Resolves #6863 Expands existing murmur3 hashing functionality to match Spark's murmur3 hashing algorithm by modifying tail processing for unaligned bytes and processing booleans as 32bit integers rather than singular bytes. Authors: - Ryan Lee <ryanlee@nvidia.com> - rwlee <rwlee@users.noreply.github.com> Approvers: - Jake Hemstad - null - Robert (Bobby) Evans - GALI PREM SAGAR URL: #7024

rwlee added feature request New feature or request Needs Triage Need team to review and classify labels Nov 30, 2020

rwlee added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Nov 30, 2020

harrism removed the Needs Triage Need team to review and classify label Nov 30, 2020

harrism assigned rwlee Dec 7, 2020

This was referenced Dec 16, 2020

[REVIEW] Spark Murmur3 hash functionality #7024

Merged

[FEA] have murmur3 hash function that matches exactly with spark NVIDIA/spark-rapids#937

Closed

rapids-bot bot closed this as completed in #7024 Jan 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Murmur3 that matches spark hashing for partitioning #6863

[FEA] Murmur3 that matches spark hashing for partitioning #6863

rwlee commented Nov 30, 2020

harrism commented Nov 30, 2020

rwlee commented Nov 30, 2020

[FEA] Murmur3 that matches spark hashing for partitioning #6863

[FEA] Murmur3 that matches spark hashing for partitioning #6863

Comments

rwlee commented Nov 30, 2020

harrism commented Nov 30, 2020

rwlee commented Nov 30, 2020