Implement ExistenceJoin Iterator using an auxiliary left semijoin #4796

gerashegalov
Collaborator

@gerashegalov gerashegalov commented Feb 16, 2022

This PR implements an iterator for ExistenceJoin

  1. This PR computes ExistenceJoin by executing a left semijoin via cuDF. The resulting lhs GatherMap is used to scatter true into a Boolean column of lhs.numRows rows that are all initially false. The rhs data is not gathered.

  2. The PR also fixes regex matching against SparkPlan node strings. The simple string used previously matches ExistenceJoin only in the CPU plan, because the GPU plan does not print the ExistenceJoin type as part of the Join exec string.
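
The approach in item 1 can be modeled on the CPU with plain Scala collections. This is only a sketch of the semantics, not the PR's cuDF code: the object and method names below are hypothetical, the semijoin is simulated with a Set lookup, and null-key handling is ignored.

```scala
object ExistenceJoinSketch {
  // Stand-in for the left semijoin: returns the indices of lhs rows whose key
  // has at least one match in rhs. This plays the role of the lhs gather map.
  def leftSemiGatherMap[K](lhsKeys: Seq[K], rhsKeys: Seq[K]): Seq[Int] = {
    val rhsSet = rhsKeys.toSet
    lhsKeys.zipWithIndex.collect { case (k, i) if rhsSet.contains(k) => i }
  }

  // Start with an all-false Boolean column of numLhsRows rows, then scatter
  // true into every position named by the gather map.
  def existsColumn(numLhsRows: Int, gatherMap: Seq[Int]): Vector[Boolean] = {
    val exists = Array.fill(numLhsRows)(false)
    gatherMap.foreach(i => exists(i) = true)
    exists.toVector
  }

  // Every lhs row is kept and paired with its "exists" flag; rhs rows are
  // never gathered.
  def existenceJoin[K](lhsKeys: Seq[K], rhsKeys: Seq[K]): Seq[(K, Boolean)] =
    lhsKeys.zip(existsColumn(lhsKeys.length, leftSemiGatherMap(lhsKeys, rhsKeys)))
}
```

The output always has exactly lhs.numRows rows, regardless of how many rhs rows match a given key, which is what distinguishes this from a left outer join.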

Closes #589

Signed-off-by: Gera Shegalov <gera@apache.org>

@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov self-assigned this Feb 16, 2022
@gerashegalov gerashegalov added the task Work required that improves the product but is not user facing label Feb 16, 2022
@gerashegalov gerashegalov added this to the Feb 14 - Feb 25 milestone Feb 16, 2022
Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov mentioned this pull request Feb 16, 2022
2 tasks
@gerashegalov gerashegalov changed the title from "Implement existence join gatherer on top of left outer GpuHashJoin" to "Implement existence join gatherer on top of left semijoin" Feb 18, 2022
integration_tests/src/main/python/join_test.py (two outdated, resolved review threads)
// cuDF executes left semijoin, the gatherer is constructed with a new
// gather to gather every row from lhs
//
// we build a new rhs with a the "exists" Boolean column that has as many rows
Collaborator

This feels off: in with a the "exists", I think the a is a typo.

// semijoin lhs-GatherMap labeling rows that have at least one match in the original
// rhs
//
val rhsExistsCB = withResource(Scalar.fromBool(false)) { falseScalar =>
Collaborator

This is way too deeply nested for me. Could we try to break it up some? The falseScalar is only used to create falseCV. It might also be nice to create a method for Table.scatter that takes the columnView and a single Scalar as input and does all of the wrapping/unwrapping, to make this code that much more readable.
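
One way to read this suggestion is sketched below in plain Scala, without cuDF. Everything here is hypothetical: BoolColumn stands in for a device column, scatterScalar stands in for the proposed helper, and withResource is a minimal loan-pattern function rather than the project's own. The point is only that the helper owns the temporary all-false column, so the caller no longer nests a withResource block just to create it.

```scala
object ScatterHelperSketch {
  // Loan pattern: run the body, then always close the resource.
  def withResource[R <: AutoCloseable, T](r: R)(body: R => T): T =
    try body(r) finally r.close()

  // Hypothetical stand-in for a device column; records whether it was closed.
  final class BoolColumn(val values: Vector[Boolean]) extends AutoCloseable {
    var closed = false
    override def close(): Unit = { closed = true }
  }

  // Hypothetical helper: builds the temporary column filled with `fill`,
  // scatters `value` into the given rows, closes the temporary, and returns
  // a new column that the caller owns and must close.
  def scatterScalar(numRows: Int, fill: Boolean, value: Boolean, rows: Seq[Int]): BoolColumn =
    withResource(new BoolColumn(Vector.fill(numRows)(fill))) { base =>
      new BoolColumn(rows.foldLeft(base.values)((acc, i) => acc.updated(i, value)))
    }
}
```

With such a helper, the call site reduces to a single withResource around the returned column, for example withResource(scatterScalar(4, false, true, Seq(1, 3))) { exists => ... }.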

@gerashegalov gerashegalov changed the title from "[wip] Implement existence join gatherer on top of left semijoin" to "Implement existence join gatherer on top of left semijoin" Feb 25, 2022
@gerashegalov gerashegalov marked this pull request as ready for review February 25, 2022 17:16
@gerashegalov gerashegalov changed the title from "Implement existence join gatherer on top of left semijoin" to "Implement ExistenceJoin Iterator using an auxiliary left semijoin" Feb 25, 2022
@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov requested a review from jlowe March 2, 2022 06:31
@gerashegalov gerashegalov requested a review from jlowe March 4, 2022 06:40
@gerashegalov
Collaborator Author

build

@sameerz
Collaborator

sameerz commented Mar 4, 2022

build

@gerashegalov gerashegalov merged commit 98a731f into NVIDIA:branch-22.04 Mar 9, 2022
@gerashegalov gerashegalov deleted the gerashegalov/issue589-gathermap-as-an-existence-column branch March 9, 2022 02:02
Successfully merging this pull request may close these issues.

[FEA] Support ExistenceJoin
4 participants