Extend range window queries to non-timestamp order-by columns #7675
Conversation
1. Enabled integral types. Some tests.
2. Fixed unsigned.
3. Tonnes of tests for nulls in orderby column.
4. Purged references to timestamps in range-window code.
5. Tests for time scaling.
6. Added tests for mean. Fixed sum test.
7. Added collect tests.
8. Tests for ASC/DESC timestamps, all aggs.
9. Docs.
Codecov Report

```
@@             Coverage Diff              @@
##           branch-0.19    #7675   +/-  ##
===========================================
+ Coverage        82.12%   82.52%   +0.40%
===========================================
  Files              101      101
  Lines            17088    17458     +370
===========================================
+ Hits             14033    14407     +374
+ Misses            3055     3051       -4
```
How does this relate to #7678?
cpp/include/cudf/rolling.hpp (outdated)

```cpp
 * E.g. For `orderby` column of type `TIMESTAMP_SECONDS`, the intervals may be
 * `DURATION_SECONDS` or `DURATION_DAYS`. Higher resolution durations (e.g. `DURATION_NANOSECONDS`)
 * cannot be used with lower resolution timestamps.
 * 2. If the `orderby` column is an integral type (e.g. `INT32`), the `preceding`/`following`
 * should be the exact same type (`INT32`).
```
This seems inconsistent. Seems like the timestamp interval should also be the same type. There's no loss of information in converting a lower resolution timestamp value to a higher resolution, yes? E.g., if the orderby column is seconds, and the interval is [-1, +1] days, then that's no different from [-86400, +86400] seconds.
I haven't grasped your point, I think.

The order-by column is the timestamp at a certain resolution. The window bounds are durations (with matching or lower resolution). The bounds can't be timestamps because they are relative intervals. (E.g. 7 days preceding, 86400 seconds following.)

The old API took `size_type` and interpreted it in days (i.e. the lowest resolution).
Right, I meant the resolution should be the same, not the exact type. Otherwise you need a double dispatch.
> Otherwise you need a double dispatch.

You're right; I'm doing the double-dispatch in this version of things. (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an `if` check. The real cost is in the casting.)

The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

(The same will likely apply for decimals.)
> (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an `if` check. The real cost is in the casting.)

It's not really expensive in runtime, but can be quite expensive in compile time/binary size depending on the amount of code in the leaves of the double-dispatched code paths.

> The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

Yes, the way I see it, it doesn't seem any different from how Pandas/Spark allow joining on columns of different types but libcudf doesn't support this. The caller is expected to perform any casting beforehand.
Hmm... That'd certainly simplify certain other aspects of this feature.
Let me try pulling out the scaling part.
Hey, Jake. If the scaling feature is removed, I'd still need to check that the window-duration and the orderby-timestamps are of the same resolution.
I don't know how I'd do that without doing a double dispatch, short of a long if-else.
Would you be averse to the double dispatch in this case?
Ok, it took me a while to realize what should have been obvious to me: the scaling functionality is pre-existing. :/

The `grouped_time_range...()` function already permitted timestamp columns in any resolution, accepted the window widths in `DAYS`, and scaled them to the timestamp's resolution. All that `scale_to()` does is move the scaling logic into `range_window_bounds`.

Removing `range_window_bounds::scale_to()` would break existing functionality in `grouped_time_range...()`, and callers thereof (including Spark) who depend on the window widths being in `DAYS`.

For what it's worth, there isn't a way to avoid the double-dispatch anyway, even without `scale_to()`, because we'd still have to validate that the window range type (`integral` or `duration`) is compatible with the order-by column (`integral` or `timestamp`). This is currently checked as part of the `scale_to` call stack.
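For illustration, the kind of compatibility check being described can be sketched in plain Python (this stands in for the C++ double dispatch; the type names mirror libcudf's, but the function itself is hypothetical, not the library's API):

```python
# Hypothetical sketch of the rule discussed above:
#   - integral order-by  -> the range bound must be the exact same integral type
#   - timestamp order-by -> the bound must be a duration of matching-or-coarser
#     resolution (coarser can be rescaled without loss; finer cannot)
INTEGRAL = {"INT8", "INT16", "INT32", "INT64", "UINT32", "UINT64"}
RESOLUTION = {"DAYS": 0, "SECONDS": 1, "MILLISECONDS": 2,
              "MICROSECONDS": 3, "NANOSECONDS": 4}

def is_supported_range_type(range_type: str, orderby_type: str) -> bool:
    if orderby_type in INTEGRAL:
        return range_type == orderby_type
    if orderby_type.startswith("TIMESTAMP_"):
        if not range_type.startswith("DURATION_"):
            return False
        # Coarser (or equal) duration resolution is fine; finer is not.
        return (RESOLUTION[range_type.split("_", 1)[1]]
                <= RESOLUTION[orderby_type.split("_", 1)[1]])
    return False

print(is_supported_range_type("DURATION_DAYS", "TIMESTAMP_SECONDS"))         # True
print(is_supported_range_type("DURATION_NANOSECONDS", "TIMESTAMP_SECONDS"))  # False
print(is_supported_range_type("INT32", "INT32"))                             # True
```

In C++ this pairing of a runtime range type with a runtime order-by type is what forces the double dispatch (or a long if-else chain) mentioned above.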
```cpp
 * `scale_to()` scales the bounds scalar from its original granularity (e.g. DURATION_DAYS)
 * to the orderby column's granularity (DURATION_SECONDS), before comparisons are made.
```
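The rescaling described here is something a caller could equally do up front. A minimal Python sketch of the conversion (the conversion factors are standard; the function is a stand-in, not the libcudf implementation):

```python
# Ticks per day at each supported granularity.
TICKS_PER_DAY = {
    "DAYS": 1,
    "SECONDS": 86_400,
    "MILLISECONDS": 86_400_000,
    "MICROSECONDS": 86_400_000_000,
    "NANOSECONDS": 86_400_000_000_000,
}

def scale_duration(value: int, from_unit: str, to_unit: str) -> int:
    """Rescale a duration bound from a coarser unit to a same-or-finer unit.

    Scaling to a finer unit is exact; scaling the other way would lose
    information, so it is rejected.
    """
    if TICKS_PER_DAY[to_unit] % TICKS_PER_DAY[from_unit] != 0:
        raise ValueError("cannot scale to a coarser unit without loss")
    return value * (TICKS_PER_DAY[to_unit] // TICKS_PER_DAY[from_unit])

# A 2-day preceding bound, rescaled for a TIMESTAMP_SECONDS order-by column:
print(scale_duration(2, "DAYS", "SECONDS"))  # 172800
```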
I think this should be the user's responsibility. Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?) and with other libcudf APIs that require types already be matched.
> ... Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?)

My thought was that scaling would also apply to decimal types, when we add support for it.
We don't rescale decimal types in other APIs either. Decimal types with different scales are treated like different types.
This API has my spidey sense of "making decisions" tingling. It feels too specialized, or at least that it's doing too many things. Furthermore, looking at `rolling.hpp`, we're starting to have a proliferation of "rolling" APIs and I'm worried about the complexity and confusion of having so many.

Am I correct that all of this boils down to generating a per-element `preceding`/`following` window range that could then be passed into the generic `rolling_window` API?

If so, it feels like we should exploit that fact. Perhaps one way to simplify this and other specialized "rolling" functionality is to split up the logic of generating the per-element bounds and actually computing the windowed aggregation.
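The split being proposed can be sketched in plain Python (hypothetical helper names; the real libcudf APIs operate on device columns): first derive per-element preceding/following row counts from the sorted order-by column, then feed them to a generic windowed aggregation.

```python
import bisect

def range_window_sizes(orderby, preceding, following):
    """Per-element (preceding, following) row counts for a sorted column.

    `preceding`/`following` are offsets in the order-by column's units.
    The preceding count includes the current row, matching window_bounds.
    """
    sizes = []
    for i, v in enumerate(orderby):
        start = bisect.bisect_left(orderby, v - preceding)   # first in-range row
        end = bisect.bisect_right(orderby, v + following)    # one past last row
        sizes.append((i - start + 1, end - i - 1))
    return sizes

def rolling_sum(values, sizes):
    """Generic rolling aggregation driven by per-element window sizes."""
    return [sum(values[i - p + 1 : i + f + 1]) for i, (p, f) in enumerate(sizes)]

orderby = [1, 2, 3, 7, 7]
values = [10, 20, 10, 50, 60]
print(rolling_sum(values, range_window_sizes(orderby, 1, 1)))
# [30, 40, 30, 110, 110]
```

Any aggregation (mean, collect, etc.) could be slotted into the second step without touching the bounds-generation logic, which is the separation being suggested.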
Then we should also consider if there is redundancy in the existing APIs. E.g., `grouped_rolling_window` has overloads that take both `size_type` and `window_bounds` for the preceding/following window sizes. Isn't `window_bounds` more generic, and therefore could replace the `size_type` overload?
I agree with your assessment. I envision the following changes in an upcoming release:

This work hasn't been tackled yet because I'm still exploring what else we might need to support more non-trivial ranking functions. E.g. My assessment should be firmed up around the
Okay, glad to hear that you're already thinking about simplifying and consolidating these APIs. The sketch you laid out sounds reasonable.
1. Removed unnecessary std::move()
2. Undid mistaken change to tests/CMakeLists.txt
Also cleaned up references to other functions.
```cpp
range_window_bounds&& preceding,
range_window_bounds&& following,
```
Why are these r-value refs? What if someone wants to reuse the same `range_window_bounds` for multiple calls to `grouped_range_rolling_window`?
This had to do with the ownership of the range scalar in `range_window_bounds`. I'll see if this can be removed if `scale_to` is removed.
Thank you for your attention to this PR. On Monday, I will back this out of 0.19, rather than rush it for Wednesday's release.
Closing this. I'll address this in
(Redux of #7675, which was for `branch-0.19`, and included support to scale duration scalars to the orderby column's resolution.)

This change introduces a `grouped_range_rolling_window()` function, to generalize the functionality available through `grouped_time_range_rolling_window()`.

### Prelude

Recall that `grouped_time_range_rolling_window()` applies an aggregation over a dynamic window of rows. The width of the window is determined based on a timestamp column, and the relative preceding/following offsets specified in days. E.g. Consider this dataset:

```
[ // user, sales_amt, YYYYMMDD (date)
  { "user1", 10, 20200101 },
  { "user2", 20, 20200101 },
  { "user1", 20, 20200102 },
  { "user1", 10, 20200103 },
  { "user2", 30, 20200101 },
  { "user2", 80, 20200102 },
  { "user1", 50, 20200107 },
  { "user1", 60, 20200107 },
  { "user2", 40, 20200104 }
]
```

Grouping by `user_id`, and ordering by `date` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`):

```
Date :(202001-) [ 01,  02,  03,  07,  07,   01,   01,   02,  04 ]
Input:          [ 10,  20,  10,  50,  60,   20,   30,   80,  40 ]
                 <-------user1-------->|<---------user2--------->
```

The `SUM` aggregation applied over a window of `(1 DAY PRECEDING, 1 DAY FOLLOWING)` with a minimum of 1 period yields the following result column:

```
Results:        [ 30,  40,  30, 110, 110,  130,  130,  130,  40 ]
```

### What's new in this commit

`grouped_range_rolling_window()` extends the erstwhile range-window functionality in two ways:

1. The order-by column (`date`) is **not restricted to timestamps**. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column. E.g.

   ```
   [ // user, sales_amt, seconds_since_snap
     { "user1", 10, 86451 },
     { "user2", 20, 86735 },
     { "user1", 20, 89162 },
     { "user1", 10, 92152 },
     ...]
   ```

2. If the order-by column is indeed a timestamp, the **preceding/following offsets may be duration scalars**, whose precision must match that of the timestamp order-by column. E.g. If `orderby_column.type() == TIMESTAMP_SECONDS`, `preceding.type()` may only be `DURATION_SECONDS`.

Analogous to `window_bounds`, a new `range_window_bounds` class has been introduced to specify bounds for range window functions:
1. Supports scalar values for preceding/following
2. Supports `UNBOUNDED` window widths

`range_window_bounds` currently supports only integral and duration scalars. Correspondingly, the `orderby_column` can only be integral and timestamp columns. Support for floats and fixed_point types may be added at a later date.

The existing `grouped_time_range_rolling_window()` function now delegates to the same backend as `grouped_range_rolling_window()`.

Authors:
- MithunR (https://github.com/mythrocks)

Approvers:
- Mike Wendt (https://github.com/mike-wendt)
- Robert Maynard (https://github.com/robertmaynard)
- https://github.com/nvdbaranec
- Christopher Harris (https://github.com/cwharris)
- Vukasin Milovanovic (https://github.com/vuule)
- Jordan Jacobelli (https://github.com/Ethyling)

URL: #7866
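The worked example above can be reproduced with a small reference implementation (plain Python, not the libcudf code; it applies the range window independently within each group, using a naive quadratic scan for clarity):

```python
def grouped_range_rolling_sum(groups, orderby, values, preceding, following):
    """Reference semantics: per-group range-window SUM over a sorted orderby."""
    out = []
    for i in range(len(values)):
        out.append(sum(
            values[j]
            for j in range(len(values))
            if groups[j] == groups[i]
            and orderby[i] - preceding <= orderby[j] <= orderby[i] + following
        ))
    return out

groups  = ["user1"] * 5 + ["user2"] * 4
orderby = [1, 2, 3, 7, 7, 1, 1, 2, 4]          # day-of-January from the example
values  = [10, 20, 10, 50, 60, 20, 30, 80, 40]
print(grouped_range_rolling_sum(groups, orderby, values, 1, 1))
# [30, 40, 30, 110, 110, 130, 130, 130, 40]
```

This matches the `Results` vector in the example; swapping the integer `orderby` for raw second counts (point 1 above) requires no change to the logic, only different `preceding`/`following` magnitudes.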