Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add support for LEAD/LAG window functions for fixed-width types [redux] #6277

Merged
merged 18 commits into from
Oct 7, 2020

Conversation

mythrocks
Copy link
Contributor

This is an alternative implementation for what #6261 attempts to solve, i.e. LEAD()/LAG() offset window functions for fixed-width data types.

Offset-based window functions include LEAD and LAG:

  1. LEAD(N): Returns the row from the input column, at the specified offset past the
    current row. If the offset crosses the grouping boundary or column boundary for
    a given row, a "default" value is returned if available. The "default" value is
    null, by default.
  2. LAG(N): returns the row from the input column at the specified offset preceding
    the current row. If the offset crosses the grouping boundary or column boundary for
    a given row, a "default" value is returned if available. The "default" value is
    null, by default.

E.g. Consider an input column with the following values and grouping:
[10, 11, 12, 13, 20, 21, 22, 23]
<------G1-----> <------G2------>

LEAD(input_col, 1) yields:
[11, 12, 13, null, 21, 22, 23, null]

LAG(input_col, 2) yields:
[null, null, 11, 12, null, null, 21, 22]

LEAD(input_col, 1, 99)(where 99 indicates the default) yields:
[11, 12, 13, 99, 21, 22, 23, 99]

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Initial implementation for LEAD/LAG window functions,
for fixed-width types.

This version is implemented within grouped_rolling_window().
@codecov
Copy link

codecov bot commented Sep 20, 2020

Codecov Report

Merging #6277 into branch-0.16 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.16    #6277   +/-   ##
============================================
  Coverage        83.00%   83.00%           
============================================
  Files               95       95           
  Lines            14901    14901           
============================================
  Hits             12368    12368           
  Misses            2533     2533           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84557ea...afe0ea7. Read the comment docs.

@mythrocks mythrocks changed the title [WIP] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types [redux] [REVIEW] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types [redux] Sep 21, 2020
@mythrocks mythrocks self-assigned this Sep 21, 2020
@mythrocks mythrocks added the 3 - Ready for Review Ready for review by team label Sep 21, 2020
@harrism harrism changed the title [REVIEW] [lead/lag] Add support for LEAD/LAG window functions for fixed-width types [redux] [REVIEW] Add support for LEAD/LAG window functions for fixed-width types [redux] Sep 23, 2020
@mythrocks
Copy link
Contributor Author

@jrhemstad, does this PR look like a more acceptable approach than #6261?

Also, @rgsl888prabhu, could I please bother you to have a look?

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review

cpp/include/cudf/detail/utilities/device_operators.cuh Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Show resolved Hide resolved
1. Fixed includes
2. Un-templatized DeviceLeadLag
3. Documentation fixes
cpp/tests/lead_lag/lead_lag_test.cpp Show resolved Hide resolved
cpp/tests/lead_lag/lead_lag_test.cpp Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a comment to handle minor stuff, other than this looks good.

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
1. Moved Lead/Lag specific error checks out of rolling_window()
2. Moved Lead/Lag specific short-circuit evaluation out of rolling_window()
3. Added tests for error-checking
@harrism harrism added the libcudf Affects libcudf (C++/CUDA) code. label Oct 6, 2020
@mythrocks
Copy link
Contributor Author

@jrhemstad, have the concerns raised here been addressed satisfactorily?

1. Remove redundant checks for LEAD/LAG in non-LEAD/LAG SFINAE path
2. Remove hardcode for stream id
@mythrocks
Copy link
Contributor Author

Thank you for the review, and the great catches, @jrhemstad.

@mythrocks mythrocks merged commit f293588 into rapidsai:branch-0.16 Oct 7, 2020
rapids-bot bot pushed a commit that referenced this pull request May 3, 2021
This commit extends the [LEAD()/LAG() offset window functions](https://learnsql.com/blog/lead-and-lag-functions-in-sql/) introduced in #6277 to work with more than just fixed-width data types. The functions are defined as follows:

1. `LEAD(N)`: Returns the row from the input column, at the specified offset past the
current row. If the offset crosses the grouping boundary or column boundary for
a given row, a "default" value is returned if available. The "default" value is
null, by default.
1. `LAG(N)`: returns the row from the input column at the specified offset preceding
the current row. If the offset crosses the grouping boundary or column boundary for
a given row, a "default" value is returned if available. The "default" value is
null, by default.

As an illustration, consider the following example input array input column, with two groups (`G1` and `G2`):
```c++
[ [1,1,1], [2], [3,3,3], [4,4,4,4],  [66,66], [], [88,88], [99] ]
  <--------------G1-------------->   <------------G2----------->
```

`LEAD(col, 1)` yields:
```c++
[ [2], [3,3,3], [4,4,4,4], ∅, [], [88,88], [99], ∅ ]
```

`LAG(input_col, 2)` yields:
```c++
[ ∅, ∅, [2], [3,3,3], ∅, ∅, [], [88,88] ]
```

If a `defaults` column is specified with contents:
```c++
[ [999], [999], [999], [999], [999], [999], [999], [999] ]
```
then, `LEAD(col, 1, defaults)` yields:
```c++
[ [2], [3,3,3], [4,4,4,4], [999], [], [88,88], [99], [999] ]
```
Note that in the cases where the offset (`1`) would cross the column/group boundary (i.e. indices `3` and `7`), the corresponding entry from the `defaults` column is returned instead of `∅` (i.e. `null`).

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #8062
@mythrocks mythrocks deleted the lead-lag-branch-0.16-redux branch November 23, 2021 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants