Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sequence support [databricks] #4376

Merged
merged 6 commits into from
Jan 6, 2022
Merged

Conversation

wbo4958
Copy link
Collaborator

@wbo4958 wbo4958 commented Dec 16, 2021

This is to close #3512. and this PR depends on rapidsai/cudf#9839 and #4376

For now, the PR only supports sequence on IntegerType.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking really good. I assume all of this issues I am pointing out are just because this is still a work in progress.

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Dec 17, 2021

Overall looking really good. I assume all of this issues I am pointing out are just because this is still a work in progress.

@revans2, Thx for the review. Yeah, this PR is still WIP, but it can work for IntegerType. Will refine this and add more types for support. But I'd not like to add TimeStamp and DateType for support in this PR, since the size calculation may be quite different which may cause this PR pretty big.

Seq[ColumnVector] = {
withResource(stop.sub(start)) { difference =>
withResource(Scalar.fromInt(1)) { scalarOne =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be GpuScalar(1, dataType) or similar, so we can support various types not just integer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thx

Comment on lines 509 to 523
withResource(difference.floorDiv(step)) { quotient =>
withResource(Scalar.fromInt(1)) { scalarOne =>
withResource(quotient.add(scalarOne)) { sizeWithNegative =>
withResource(Scalar.fromInt(0)) { scalarZero =>
withResource(sizeWithNegative.greaterOrEqualTo(scalarZero)) { pred =>
withResource(pred.ifElse(sizeWithNegative, scalarZero)) { tmpSize =>
// when start==stop, step==0, size will be 0. but we should change size to 1
withResource(difference.equalTo(scalarZero)) { diffHasZero =>
step match {
case stepScalar: Scalar =>
withResource(ColumnVector.fromScalar(stepScalar, rows)) { stepV =>
withResource(stepV.equalTo(scalarZero)) { stepHasZero =>
withResource(diffHasZero.and(stepHasZero)) { predWithZero =>
predWithZero.ifElse(scalarOne, tmpSize)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to get rid of such nested withResource? Otherwise, this looks so cluster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, refined this.

Comment on lines 535 to 537
withResource(numberScalar(dt, 1)) { one =>
withResource(quotient.add(one)) { sizeWithNegative =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When withResource nests this deeply, that's usually an indication that we're holding onto one or more GPU results longer than necessary, adding undesired and avoidable memory pressure. For example, we compute quotient here and only need it to compute sizeWithNegative, yet we hold onto the GPU memory for the quotient result until after the entire calculation completes. The memory can be freed earlier with something like this:

  withResource(numberScalar(dt, 1)) { one =>
    val sizeWithNegative = withResource(difference.floorDiv(step)) { quotient =>
      quotient.add(one)
    }
    withResource(sizeWithNegative) { sizeWithNegative =>
      ....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @jlowe. Changed this accordingly.

@sameerz sameerz added the feature request New feature or request label Dec 29, 2021
@ttnghia
Copy link
Collaborator

ttnghia commented Jan 4, 2022

FYI: cudf PR has been merged.

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jan 5, 2022

Thx for the information

@wbo4958 wbo4958 marked this pull request as ready for review January 5, 2022 05:40
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jan 5, 2022

build

@wbo4958 wbo4958 changed the title [DRAFT] Add sequence support Add sequence support [databricks] Jan 5, 2022
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jan 5, 2022

build

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jan 6, 2022

build

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Jan 6, 2022

build

1 similar comment
@pxLi
Copy link
Collaborator

pxLi commented Jan 6, 2022

build

@wbo4958 wbo4958 merged commit 9283e84 into NVIDIA:branch-22.02 Jan 6, 2022
@wbo4958 wbo4958 deleted the sequence branch February 17, 2022 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support org.apache.spark.sql.catalyst.expressions.Sequence
6 participants