
Extend range window queries to non-timestamp order-by columns #7675

Closed (10 commits)

Conversation

mythrocks (Contributor):

This change introduces a grouped_range_rolling_window() function, to generalize the functionality available through grouped_time_range_rolling_window().

Prelude

Recall that grouped_time_range_rolling_window() applies an aggregation over a dynamic window of rows. The width of the window is determined by a timestamp column and relative preceding/following offsets specified in days. E.g., consider this dataset:

```
[ // user,  sales_amt,  YYYYMMDD (date)
  { "user1",   10,      20200101    },
  { "user2",   20,      20200101    },
  { "user1",   20,      20200102    },
  { "user1",   10,      20200103    },
  { "user2",   30,      20200101    },
  { "user2",   80,      20200102    },
  { "user1",   50,      20200107    },
  { "user1",   60,      20200107    },
  { "user2",   40,      20200104    }
]
```

Grouping by user_id, and ordering by date yields the following sales_amt vector (with 2 groups, one for each distinct user_id):

```
Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                   <-------user1-------->|<---------user2--------->
```

The SUM aggregation applied over a window of (1 DAY PRECEDING, 1 DAY FOLLOWING) with a minimum of 1 period yields the following result column:

```
Results:         [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]
```
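To make the semantics above concrete, here is a small reference sketch in plain Python (not a libcudf API; the function name and the O(n²) scan are purely illustrative) that reproduces the SUM result column from the example:

```python
def grouped_range_rolling_sum(groups, orderby, values, preceding, following):
    """Naive reference for a grouped range-rolling SUM.

    Rows are assumed pre-sorted by (group, orderby). For each row, the
    window spans all rows in the same group whose order-by value lies in
    [orderby[i] - preceding, orderby[i] + following].
    """
    results = []
    for g, o in zip(groups, orderby):
        lo, hi = o - preceding, o + following
        window = [v for gj, oj, v in zip(groups, orderby, values)
                  if gj == g and lo <= oj <= hi]
        results.append(sum(window))
    return results

# The example dataset, sorted by (user, day-of-month):
users = ["user1"] * 5 + ["user2"] * 4
days  = [1, 2, 3, 7, 7, 1, 1, 2, 4]
sales = [10, 20, 10, 50, 60, 20, 30, 80, 40]

print(grouped_range_rolling_sum(users, days, sales, 1, 1))
# [30, 40, 30, 110, 110, 130, 130, 130, 40]
```

This matches the Results row above; the real implementation computes the window bounds without rescanning the whole column per row.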

What's new in this commit

grouped_range_rolling_window() extends the erstwhile range-window functionality in two ways:

  1. The order-by column (date) is not restricted to timestamps. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.
    E.g.
    ```
    [ // user,  sales_amt,  seconds_since_snap
      { "user1",   10,         86451          },
      { "user2",   20,         86735          },
      { "user1",   20,         89162          },
      { "user1",   10,         92152          },
      ...]
    ```
    
  2. If the order-by column is indeed a timestamp, the preceding/following offsets may be duration scalars, whose type need not match the order-by column. The precision of the scalars cannot exceed that of the timestamp order-by column.
    E.g. If orderby_column.type() == TIMESTAMP_SECONDS, preceding.type() may be DURATION_SECONDS or DURATION_DAYS.

Analogous to window_bounds, a new range_window_bounds class has been introduced to specify bounds for range window functions:

  1. Supports scalar values for preceding/following
  2. Supports UNBOUNDED window widths
  3. Supports scaling the scalar bounds values to match the order-by column type.
    E.g. Scales DURATION_DAYS to DURATION_SECONDS, for use with TIMESTAMP_SECONDS order-by column.
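As a rough illustration of the scaling in item 3, the following is a hedged sketch (the names mirror libcudf's type enum, but the code is invented, not the library's implementation): a coarser duration offset is rescaled to the order-by column's resolution before any comparisons.

```python
# Seconds per unit for a couple of duration types (illustrative subset).
SECONDS_PER_UNIT = {"DURATION_SECONDS": 1, "DURATION_DAYS": 86400}

def scale_to_seconds(count, duration_type):
    """Express `count` units of `duration_type` in seconds."""
    return count * SECONDS_PER_UNIT[duration_type]

# "1 DAY PRECEDING" used against a TIMESTAMP_SECONDS order-by column:
print(scale_to_seconds(1, "DURATION_DAYS"))  # 86400
```

Because scaling only ever goes from coarser to finer resolution, it is an exact integer multiplication with no loss of information.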

range_window_bounds currently supports only integral and duration scalars.
Correspondingly, the orderby_column may only be an integral or timestamp column. Support for float and fixed_point types may be added at a later date.

The existing grouped_time_range_rolling_window() function now delegates to the same backend as grouped_range_rolling_window().

1. Enabled integral types. Some tests.
2. Fixed unsigned.
3. Tonnes of tests for nulls in orderby column
4. Purged references to timestamps in range-window code.
5. Tests for time scaling.
6. Added tests for mean. Fixed sum test.
7. Added collect tests.
8. Tests for ASC/DESC timestamps, all aggs.
9. Docs.
@mythrocks mythrocks self-assigned this Mar 23, 2021
@mythrocks mythrocks requested review from a team as code owners March 23, 2021 04:00
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Mar 23, 2021
@mythrocks mythrocks marked this pull request as draft March 23, 2021 04:00
@mythrocks mythrocks added 2 - In Progress Currently a work in progress 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request labels Mar 23, 2021
@github-actions github-actions bot added the conda label Mar 23, 2021
codecov bot commented Mar 23, 2021:

Codecov Report

Merging #7675 (ff5e271) into branch-0.19 (f1f1d0f) will increase coverage by 0.40%.
The diff coverage is n/a.


```
@@               Coverage Diff               @@
##           branch-0.19    #7675      +/-   ##
===============================================
+ Coverage        82.12%   82.52%   +0.40%
===============================================
  Files              101      101
  Lines            17088    17458     +370
===============================================
+ Hits             14033    14407     +374
+ Misses            3055     3051       -4

Impacted Files                             Coverage Δ
python/cudf/cudf/utils/gpu_utils.py        53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/core/column/lists.py      87.68% <0.00%> (-1.92%) ⬇️
python/cudf/cudf/core/abc.py               87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py            100.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py           89.88% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py          0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py    100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py      0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py    0.00% <0.00%> (ø)
... and 41 more
```

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1f1d0f...ff5e271.

jrhemstad (Contributor):

How does this relate to #7678?

Comment on lines 407 to 411:

```cpp
 * E.g. For `orderby` column of type `TIMESTAMP_SECONDS`, the intervals may be
 * `DURATION_SECONDS` or `DURATION_DAYS`. Higher resolution durations (e.g. `DURATION_NANOSECONDS`)
 * cannot be used with lower resolution timestamps.
 * 2. If the `orderby` column is an integral type (e.g. `INT32`), the `preceding`/`following`
 * should be the exact same type (`INT32`).
```
Contributor:

This seems inconsistent. Seems like the timestamp interval should also be the same type. There's no loss of information in converting a lower resolution timestamp value to a higher resolution, yes? E.g., if the orderby column is seconds, and the interval is [-1, +1] days, then that's no different than [-86400, +86400] seconds.

Contributor Author:

I haven't grasped your point, I think.

The order-by column is the timestamp at a certain resolution. The window bounds are durations (with matching or lower resolution). The bounds can't be timestamps because they are relative intervals. (E.g. 7 days preceding, 86400 seconds following.)

The old API took size_type values and interpreted them as days (i.e., the lowest resolution).

Contributor:

Right, I meant the resolution should be the same, not the exact type. Otherwise you need a double dispatch.

Contributor Author:

> Otherwise you need a double dispatch.

You're right; I'm doing the double-dispatch in this version of things. (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an if check. The real cost is in the casting.)

The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

(The same will likely apply for decimals.)

Contributor:

> (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an if check. The real cost is in the casting.)

It's not really expensive in runtime, but can be quite expensive in compile time/binary size depending on the amount of code in the leaves of the double dispatched code paths.

> The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

Yes, the way I see it, it doesn't seem any different from how Pandas/Spark allow joining on columns of different types but libcudf doesn't support this. The caller is expected to perform any casting beforehand.

Contributor Author:

Hmm... That'd certainly simplify certain other aspects of this feature.

Let me try pulling out the scaling part.

Contributor Author:

Hey, Jake. If the scaling feature is removed, I'd still need to check that the window duration and the orderby timestamps are of the same resolution.
I don't know how I'd do that without a double dispatch, short of a long if-else chain.

Would you be averse to the double dispatch in this case?

mythrocks (Contributor Author) commented Mar 24, 2021:

Ok, it took me a while to realize what should have been obvious to me: The scaling functionality is pre-existing. :/

The grouped_time_range...() function already permitted timestamp columns in any resolution, accepted the window widths in days, and scaled them to the timestamp's resolution. All that scale_to() does is move the scaling logic into range_window_bounds.

Removing range_window_bounds::scale_to() would break existing functionality in grouped_time_range...(), and callers thereof (including Spark) who depend on the window widths being in DAYS.

For what it's worth, there isn't a way to avoid the double dispatch anyway, even without scale_to(), because we'd still have to validate that the window range type (integral or duration) is compatible with the order-by column (integral or timestamp). This is currently checked as part of the scale_to call stack.
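The compatibility check being discussed might look like the following sketch (hypothetical names, plain Python rather than libcudf's type-dispatched C++): integral order-by columns require an exactly matching offset type, while timestamp columns accept duration offsets, which are then rescaled as needed.

```python
# Illustrative subsets of the type universe; names mirror libcudf's enum.
INTEGRAL_TYPES  = {"INT16", "INT32", "INT64", "UINT32", "UINT64"}
TIMESTAMP_TYPES = {"TIMESTAMP_DAYS", "TIMESTAMP_SECONDS", "TIMESTAMP_MILLISECONDS"}

def is_supported_range_type(orderby_type, range_type):
    """Return True if `range_type` may bound a range window over `orderby_type`."""
    if orderby_type in INTEGRAL_TYPES:
        return range_type == orderby_type            # exact match for integrals
    if orderby_type in TIMESTAMP_TYPES:
        return range_type.startswith("DURATION_")    # duration, scaled to match
    return False

print(is_supported_range_type("TIMESTAMP_SECONDS", "DURATION_DAYS"))  # True
print(is_supported_range_type("INT32", "INT64"))                      # False
```

In C++ the same check requires dispatching on both the order-by type and the range type, which is the double dispatch under discussion.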

Comment on lines +83 to +84:

```cpp
 * `scale_to()` scales the bounds scalar from its original granularity (e.g. DURATION_DAYS)
 * to the orderby column's granularity (DURATION_SECONDS), before comparisons are made.
```
Contributor:

I think this should be the user's responsibility. Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?) and with other libcudf APIs that require types already be matched.

Contributor Author:

> ... Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?)

My thought was that scaling would also apply to decimal types, when we add support for it.

Contributor:

We don't rescale decimal types in other APIs either. Decimal types with different scales are treated like different types.

jrhemstad (Contributor) left a review comment:

This API has my spidey sense of "making decisions" tingling. It feels too specialized, or at least that it's doing too many things. Furthermore, looking at rolling.hpp we're starting to have a proliferation of "rolling" APIs and I'm worried about the complexity and confusion of having so many.

Am I correct that all of this boils down to generating a per-element preceding/following window range that could then be passed into the generic rolling_window API?

If so, it feels like we should exploit that fact. Perhaps one way to simplify this and other specialized "rolling" functionality is to split up the logic of generating the per-element bounds and actually computing the windowed aggregation.

Then we should also consider if there is redundancy in the existing APIs. E.g., grouped_rolling_window has overloads that take both size_type and window_bounds for the preceding/following window sizes. Isn't window_bounds more generic and therefore could replace the size_type overload?
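The decomposition proposed above can be sketched in plain Python (hypothetical names, not a libcudf API): first derive per-row preceding/following row counts from the order-by column, then hand those counts to a generic rolling aggregation that knows nothing about ranges.

```python
def range_bounds_to_row_counts(orderby, preceding, following):
    """Per-row window extents for a single group, sorted ascending.

    prec[i] counts rows at or before i whose key is >= orderby[i] - preceding
    (so it includes the current row); foll[i] counts rows strictly after i
    whose key is <= orderby[i] + following.
    """
    prec, foll = [], []
    for i, o in enumerate(orderby):
        prec.append(sum(1 for oj in orderby[:i + 1] if oj >= o - preceding))
        foll.append(sum(1 for oj in orderby[i + 1:] if oj <= o + following))
    return prec, foll

def rolling_sum(values, prec, foll):
    """Generic rolling aggregation over precomputed per-row bounds."""
    return [sum(values[i - p + 1 : i + f + 1])
            for i, (p, f) in enumerate(zip(prec, foll))]

# user1's slice of the example from the PR description:
days  = [1, 2, 3, 7, 7]
sales = [10, 20, 10, 50, 60]
prec, foll = range_bounds_to_row_counts(days, 1, 1)
print(rolling_sum(sales, prec, foll))  # [30, 40, 30, 110, 110]
```

Splitting the two steps this way would let the range logic live in one place while every windowed aggregation reuses the same bounds-driven kernel.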

mythrocks (Contributor Author):

> How does this relate to #7678?

#7678 is likely the JNI part of the same functionality.

@mythrocks mythrocks added the non-breaking Non-breaking change label Mar 23, 2021
mythrocks (Contributor Author) commented Mar 23, 2021:

> we should also consider if there is redundancy in the existing APIs. E.g., grouped_rolling_window has overloads that take both size_type and window_bounds for the preceding/following window sizes. Isn't window_bounds more generic and therefore could replace the size_type overload?

I agree with your assessment. I hope to replace grouped_time_range_rolling_window() completely with grouped_range_rolling_window().

I envision the following changes in an upcoming release:

  1. Deprecate (and later, remove) both grouped_time_range_rolling_window() functions. These are an artifact of solving smaller parts of the more general problem. grouped_range_rolling_window() will subsume this feature.
  2. Deprecate (and later, remove) the size_type overloads of grouped_rolling_window(). These do not support UNBOUNDED ranges. We should use window_bounds for fixed width window functions.
  3. Deprecate (and later, remove) rolling_window() and grouped_rolling_window() overloads that do not take default_values. These exist currently because they were added before LEAD/LAG were supported.
  4. Possibly introduce window_bounds overloads for rolling_window(), if the Python/Pandas crew do not object, and deprecate the older functions. Without this, UNBOUNDED windows can't be supported with most rolling_window() functions.

This work hasn't been tackled yet because I'm still exploring what else we might need to support more non-trivial ranking functions. E.g. RANK()/DENSE_RANK() require access to columns that the current interfaces do not have access to.

My assessment should be firmed up around the 0.20 time-frame. I'll give thought to doing this via composable functions.

jrhemstad (Contributor):

Okay, glad to hear that you're already thinking about simplifying and consolidating these APIs. The sketch you laid out sounds reasonable.

wbo4958 (Contributor) commented Mar 24, 2021:

> How does this relate to #7678?

Hi, #7678 is the Java/JNI layer for the range window.

cpp/src/rolling/range_window_bounds.cpp (review thread resolved)
cpp/tests/CMakeLists.txt (review thread resolved)
1. Removed unnecessary std::move()
2. Undid mistaken change to tests/CMakeLists.txt
Also cleaned up references to other functions.
@mythrocks mythrocks marked this pull request as ready for review March 25, 2021 06:51
@mythrocks mythrocks requested a review from a team as a code owner March 25, 2021 06:51
Comment on lines +497 to +498:

```cpp
    range_window_bounds&& preceding,
    range_window_bounds&& following,
```
Contributor:

Why are these r-value refs? What if someone wants to reuse the same range_window_bounds for multiple calls to grouped_range_rolling_window?

Contributor Author:

This had to do with the ownership of the range scalar in range_window_bounds. I'll see if this can be removed if scale_to is removed.

@mythrocks mythrocks marked this pull request as draft March 25, 2021 18:57
mythrocks (Contributor Author):

Thank you for your attention to this PR.

On Monday, I will back this out of 0.19, rather than rush it for Wednesday's release.

mythrocks (Contributor Author):

Closing this. I'll address this in 0.20.

@mythrocks mythrocks closed this Mar 29, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 29, 2021
(Redux of #7675, which was for `branch-0.19`, and included support to scale duration scalars to the orderby column's resolution.)

This change introduces a `grouped_range_rolling_window()` function, to generalize the functionality available through `grouped_time_range_rolling_window()`.

### Prelude
Recall that `grouped_time_range_rolling_window()` applies an aggregation over a dynamic window of rows. The width of the window is determined based on a timestamp column, and the relative preceding/following offset specified in days. E.g. Consider this dataset:

```
[ // user,  sales_amt,  YYYYMMDD (date)
  { "user1",   10,      20200101    },
  { "user2",   20,      20200101    },
  { "user1",   20,      20200102    },
  { "user1",   10,      20200103    },
  { "user2",   30,      20200101    },
  { "user2",   80,      20200102    },
  { "user1",   50,      20200107    },
  { "user1",   60,      20200107    },
  { "user2",   40,      20200104    }
]
```
Grouping by `user_id`, and ordering by `date` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`):
```
Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                   <-------user1-------->|<---------user2--------->
```
The `SUM` aggregation applied over a window of `(1 DAY PRECEDING, 1 DAY FOLLOWING)` with a minimum of 1 period yields the following result column:
```
Results:         [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]
```
### What's new in this commit
`grouped_range_rolling_window()` extends the erstwhile range-window functionality in two ways:
  1. The order-by column (`date`) is **not restricted to timestamps**. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.
  E.g. 
     ```
     [ // user,  sales_amt,  seconds_since_snap
       { "user1",   10,         86451          },
       { "user2",   20,         86735          },
       { "user1",   20,         89162          },
       { "user1",   10,         92152          },
       ...]
     ```
  2. If the order-by column is indeed a timestamp, the **preceding/following offsets may be duration scalars**, whose precision must match that of the timestamp order-by column.
  E.g. If `orderby_column.type() == TIMESTAMP_SECONDS`, `preceding.type()` may only be `DURATION_SECONDS`.

Analogous to `window_bounds`, a new `range_window_bounds` class has been introduced to specify bounds for range window functions: 
1. Supports scalar values for preceding/following 
2. Supports `UNBOUNDED` window widths
   
`range_window_bounds` currently supports only integral and duration scalars.
Correspondingly, the `orderby_column` may only be an integral or timestamp column. Support for float and fixed_point types may be added at a later date.

The existing `grouped_time_range_rolling_window()` function now delegates to the same backend as `grouped_range_rolling_window()`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Mike Wendt (https://github.com/mike-wendt)
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: #7866
Labels: 2 - In Progress, 4 - Needs Review, CMake, conda, feature request, libcudf, non-breaking
Projects: None yet
6 participants