Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Murmur3 that matches spark hashing for partitioning #6863

Closed
rwlee opened this issue Nov 30, 2020 · 2 comments · Fixed by #7024
Closed

[FEA] Murmur3 that matches spark hashing for partitioning #6863

rwlee opened this issue Nov 30, 2020 · 2 comments · Fixed by #7024
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@rwlee
Copy link
Contributor

rwlee commented Nov 30, 2020

Existing Murmur3 hashing and recently implemented serial Murmur3 functionality (#6781) don't quite match Spark's murmur3 hash that is used for partitioning. In the process of testing it, I found a difference in the Spark implementation's handling of input tails that effects the hash of any input that is not 4 byte aligned.

While most of the existing kernel could be copied, string types and other unfixed width types supported in the future would replace https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/include/cudf/detail/utilities/hash_functions.cuh#L517.

  for (int i = nblocks*4; i < len; i++) {
    uint32_t k1 = data[i]
    k1 *= c1;
    k1 = rotl32(k1, 15);
    k1 *= c2;
    h1 ^= k1;
    h1 = rotl32(h1, 13);
    h1 = h1 * 5 + 0xe6546b64;
  }

Can this be included as yet another hash function? If yes, should the hash ID specifically reference Spark?

@rwlee rwlee added feature request New feature or request Needs Triage Need team to review and classify labels Nov 30, 2020
@rwlee rwlee added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Nov 30, 2020
@harrism
Copy link
Member

harrism commented Nov 30, 2020

@rwlee I discussed this with @sameerz and we think you should open a PR (against 0.18) that includes the new hash function(s) and sufficient testing so that we can see exactly what the impact would be.

should the hash ID specifically reference Spark?

I don't know what the hash ID is, but it makes sense to explicitly reference Spark if that is the only place this is used.

@harrism harrism removed the Needs Triage Need team to review and classify label Nov 30, 2020
@rwlee
Copy link
Contributor Author

rwlee commented Nov 30, 2020

It would largely be parallel to the existing functionality. Changing string hashing would require copying the functor, unlike #6781 which reused the existing kernel and fixed a few salient cases.

I'll open a draft PR with a proposed implementation.

@rapids-bot rapids-bot bot closed this as completed in #7024 Jan 4, 2021
rapids-bot bot pushed a commit that referenced this issue Jan 4, 2021
Resolves #6863

Expands existing murmur3 hashing functionality to match Spark's murmur3 hashing algorithm by modifying tail processing for unaligned bytes and processing booleans as 32bit integers rather than singular bytes.

Authors:
  - Ryan Lee <ryanlee@nvidia.com>
  - rwlee <rwlee@users.noreply.github.com>

Approvers:
  - Jake Hemstad
  - null
  - Robert (Bobby) Evans
  - GALI PREM SAGAR

URL: #7024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants