[FEA] Murmur3 that matches spark hashing for partitioning #6863
Labels: feature request, libcudf, Spark
The existing Murmur3 hashing and the recently implemented serial Murmur3 functionality (#6781) don't quite match the Murmur3 hash that Spark uses for partitioning. While testing it, I found a difference in how the Spark implementation handles input tails, which affects the hash of any input whose length is not a multiple of 4 bytes.
While most of the existing kernel could be copied, string types and any other non-fixed-width types supported in the future would need to replace the code at https://github.com/rapidsai/cudf/blob/branch-0.17/cpp/include/cudf/detail/utilities/hash_functions.cuh#L517.
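To make the tail difference concrete, here is a minimal host-side sketch, not the libcudf kernel. It assumes the reference MurmurHash3_x86_32 tail (leftover bytes packed into one word and XORed into the state) and my reading of Spark's `Murmur3_x86_32.hashUnsafeBytes` (each leftover byte sign-extended and run through a full mixing round). The `sketch` namespace and all function names are hypothetical and exist only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {  // hypothetical namespace, not part of libcudf

inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Scramble one 32-bit input word.
inline std::uint32_t mix_k1(std::uint32_t k1) {
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

// Fold a scrambled word into the running hash state.
inline std::uint32_t mix_h1(std::uint32_t h1, std::uint32_t k1) {
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

// Final avalanche, including the input length.
inline std::uint32_t fmix32(std::uint32_t h, std::uint32_t len) {
  h ^= len;
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Shared body: consume all complete 4-byte blocks (read as little-endian).
inline std::uint32_t hash_blocks(const unsigned char* d, std::size_t nblocks, std::uint32_t seed) {
  std::uint32_t h1 = seed;
  for (std::size_t i = 0; i < nblocks; ++i) {
    const unsigned char* p = d + 4 * i;
    std::uint32_t k1 = std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
                       (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
    h1 = mix_h1(h1, mix_k1(k1));
  }
  return h1;
}

// Reference-style tail: pack the 1-3 leftover bytes into one word, XOR it in once.
inline std::uint32_t murmur3_standard(const std::string& s, std::uint32_t seed) {
  const unsigned char* d = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t nblocks = s.size() / 4;
  const std::size_t rem = s.size() % 4;
  std::uint32_t h1 = hash_blocks(d, nblocks, seed);
  std::uint32_t k1 = 0;
  for (std::size_t i = 0; i < rem; ++i) k1 |= std::uint32_t(d[4 * nblocks + i]) << (8 * i);
  if (rem != 0) h1 ^= mix_k1(k1);
  return fmix32(h1, static_cast<std::uint32_t>(s.size()));
}

// Spark-style tail (assumption): each leftover byte is sign-extended to 32 bits
// and goes through a full mix_k1 + mix_h1 round of its own.
inline std::uint32_t murmur3_spark_style(const std::string& s, std::uint32_t seed) {
  const unsigned char* d = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t nblocks = s.size() / 4;
  std::uint32_t h1 = hash_blocks(d, nblocks, seed);
  for (std::size_t i = 4 * nblocks; i < s.size(); ++i) {
    const std::int32_t b = static_cast<std::int8_t>(d[i]);  // sign extension
    h1 = mix_h1(h1, mix_k1(static_cast<std::uint32_t>(b)));
  }
  return fmix32(h1, static_cast<std::uint32_t>(s.size()));
}

}  // namespace sketch
```

Under these assumptions the two functions agree whenever the input length is a multiple of 4 and diverge only through the tail, which is why fixed-width 4- and 8-byte types are unaffected while strings are.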
Can this be included as yet another hash function? If yes, should the hash ID specifically reference Spark?