
Extend range window queries to non-timestamp order-by columns #7675

Closed (10 commits)

Conversation

mythrocks (Contributor):

This change introduces a grouped_range_rolling_window() function, to generalize the functionality available through grouped_time_range_rolling_window().

Prelude

Recall that grouped_time_range_rolling_window() applies an aggregation over a dynamic window of rows. The width of the window is determined by a timestamp column and relative preceding/following offsets specified in days. E.g., consider this dataset:

```
[ // user,  sales_amt,  YYYYMMDD (date)
  { "user1",   10,      20200101    },
  { "user2",   20,      20200101    },
  { "user1",   20,      20200102    },
  { "user1",   10,      20200103    },
  { "user2",   30,      20200101    },
  { "user2",   80,      20200102    },
  { "user1",   50,      20200107    },
  { "user1",   60,      20200107    },
  { "user2",   40,      20200104    }
]
```

Grouping by user_id, and ordering by date yields the following sales_amt vector (with 2 groups, one for each distinct user_id):

```
Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                   <-------user1-------->|<---------user2--------->
```

The SUM aggregation applied over a window of (1 DAY PRECEDING, 1 DAY FOLLOWING) with a minimum of 1 period yields the following result column:

```
Results:         [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]
```
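To make the semantics above concrete, here is a small reference sketch in plain Python (not a libcudf API; the function name and the O(n²) scan are purely illustrative) that reproduces the SUM result column from the example:

```python
def grouped_range_rolling_sum(groups, orderby, values, preceding, following):
    """Naive reference for a grouped range-rolling SUM.

    Rows are assumed pre-sorted by (group, orderby). For each row, the
    window spans all rows in the same group whose order-by value lies in
    [orderby[i] - preceding, orderby[i] + following].
    """
    results = []
    for g, o in zip(groups, orderby):
        lo, hi = o - preceding, o + following
        window = [v for gj, oj, v in zip(groups, orderby, values)
                  if gj == g and lo <= oj <= hi]
        results.append(sum(window))
    return results

# The example dataset, sorted by (user, day-of-month):
users = ["user1"] * 5 + ["user2"] * 4
days  = [1, 2, 3, 7, 7, 1, 1, 2, 4]
sales = [10, 20, 10, 50, 60, 20, 30, 80, 40]

print(grouped_range_rolling_sum(users, days, sales, 1, 1))
# [30, 40, 30, 110, 110, 130, 130, 130, 40]
```

This matches the Results row above; the real implementation computes the window bounds without rescanning the whole column per row.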

What's new in this commit

grouped_range_rolling_window() extends the erstwhile range-window functionality in two ways:

  1. The order-by column (date) is not restricted to timestamps. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.
    E.g.
    ```
    [ // user,  sales_amt,  seconds_since_snap
      { "user1",   10,         86451          },
      { "user2",   20,         86735          },
      { "user1",   20,         89162          },
      { "user1",   10,         92152          },
      ...]
    ```
    
  2. If the order-by column is indeed a timestamp, the preceding/following offsets may be duration scalars, whose type need not match the order-by column. The precision of the scalars cannot exceed that of the timestamp order-by column.
    E.g. If orderby_column.type() == TIMESTAMP_SECONDS, preceding.type() may be DURATION_SECONDS or DURATION_DAYS.

Analogous to window_bounds, a new range_window_bounds class has been introduced to specify bounds for range window functions:

  1. Supports scalar values for preceding/following
  2. Supports UNBOUNDED window widths
  3. Supports scaling the scalar bounds values to match the order-by column type.
    E.g. Scales DURATION_DAYS to DURATION_SECONDS, for use with TIMESTAMP_SECONDS order-by column.
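As a rough illustration of the scaling in item 3, the following is a hedged sketch (the names mirror libcudf's type enum, but the code is invented, not the library's implementation): a coarser duration offset is rescaled to the order-by column's resolution before any comparisons.

```python
# Seconds per unit for a couple of duration types (illustrative subset).
SECONDS_PER_UNIT = {"DURATION_SECONDS": 1, "DURATION_DAYS": 86400}

def scale_to_seconds(count, duration_type):
    """Express `count` units of `duration_type` in seconds."""
    return count * SECONDS_PER_UNIT[duration_type]

# "1 DAY PRECEDING" used against a TIMESTAMP_SECONDS order-by column:
print(scale_to_seconds(1, "DURATION_DAYS"))  # 86400
```

Because scaling only ever goes from coarser to finer resolution, it is an exact integer multiplication with no loss of information.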

range_window_bounds currently supports only integral and duration scalars.
Correspondingly, the orderby_column may only be an integral or timestamp column. Support for float and fixed_point types may be added at a later date.

The existing grouped_time_range_rolling_window() function now delegates to the same backend as grouped_range_rolling_window().

1. Enabled integral types. Some tests.
2. Fixed unsigned.
3. Tonnes of tests for nulls in orderby column
4. Purged references to timestamps in range-window code.
5. Tests for time scaling.
6. Added tests for mean. Fixed sum test.
7. Added collect tests.
8. Tests for ASC/DESC timestamps, all aggs.
9. Docs.
@mythrocks mythrocks self-assigned this Mar 23, 2021
@mythrocks mythrocks requested review from a team as code owners March 23, 2021 04:00
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Mar 23, 2021
@mythrocks mythrocks marked this pull request as draft March 23, 2021 04:00
@mythrocks mythrocks added 2 - In Progress Currently a work in progress 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request labels Mar 23, 2021
@github-actions github-actions bot added the conda label Mar 23, 2021
codecov bot commented Mar 23, 2021:

Codecov Report

Merging #7675 (ff5e271) into branch-0.19 (f1f1d0f) will increase coverage by 0.40%.
The diff coverage is n/a.


```
@@               Coverage Diff               @@
##           branch-0.19    #7675      +/-   ##
===============================================
+ Coverage        82.12%   82.52%   +0.40%
===============================================
  Files              101      101
  Lines            17088    17458     +370
===============================================
+ Hits             14033    14407     +374
+ Misses            3055     3051       -4

Impacted Files                             Coverage Δ
python/cudf/cudf/utils/gpu_utils.py        53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/core/column/lists.py      87.68% <0.00%> (-1.92%) ⬇️
python/cudf/cudf/core/abc.py               87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py            100.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py           89.88% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py          0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py    100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py      0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py    0.00% <0.00%> (ø)
... and 41 more
```

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1f1d0f...ff5e271.

jrhemstad (Contributor):

How does this relate to #7678?

Comment on lines 407 to 411:

```cpp
 * E.g. For `orderby` column of type `TIMESTAMP_SECONDS`, the intervals may be
 * `DURATION_SECONDS` or `DURATION_DAYS`. Higher resolution durations (e.g. `DURATION_NANOSECONDS`)
 * cannot be used with lower resolution timestamps.
 * 2. If the `orderby` column is an integral type (e.g. `INT32`), the `preceding`/`following`
 * should be the exact same type (`INT32`).
```
Contributor:

This seems inconsistent. Seems like the timestamp interval should also be the same type. There's no loss of information in converting a lower resolution timestamp value to a higher resolution, yes? E.g., if the orderby column is seconds, and the interval is [-1, +1] days, then that's no different than [-86400, +86400] seconds.

Contributor Author:

I haven't grasped your point, I think.

The order-by column is the timestamp at a certain resolution. The window bounds are durations (with matching or lower resolution). The bounds can't be timestamps because they are relative intervals. (E.g. 7 days preceding, 86400 seconds following.)

The old API took size_type values and interpreted them as days (i.e., the lowest resolution).

Contributor:

Right, I meant the resolution should be the same, not the exact type. Otherwise you need a double dispatch.

Contributor Author:

> Otherwise you need a double dispatch.

You're right; I'm doing the double-dispatch in this version of things. (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an if check. The real cost is in the casting.)

The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

(The same will likely apply for decimals.)

Contributor:

> (By the way, is host-side dispatch expensive? My understanding is that the dispatch itself isn't more expensive than an if check. The real cost is in the casting.)

It's not really expensive in runtime, but can be quite expensive in compile time/binary size depending on the amount of code in the leaves of the double dispatched code paths.

> The reason this is permitted is to support how Spark/SQL specifies intervals at different resolutions. Would you recommend having the caller cast appropriately instead?

Yes, the way I see it, it doesn't seem any different from how Pandas/Spark allow joining on columns of different types but libcudf doesn't support this. The caller is expected to perform any casting beforehand.

Contributor Author:

Hmm... That'd certainly simplify certain other aspects of this feature.

Let me try pulling out the scaling part.

Contributor Author:

Hey, Jake. If the scaling feature is removed, I'd still need to check that the window duration and the orderby timestamps are of the same resolution.
I don't know how I'd do that without a double dispatch, short of a long if-else chain.

Would you be averse to the double dispatch in this case?

mythrocks (Contributor Author) commented Mar 24, 2021:

Ok, it took me a while to realize what should have been obvious to me: The scaling functionality is pre-existing. :/

The grouped_time_range...() function already permitted timestamp columns in any resolution, accepted the window widths in days, and scaled them to the timestamp's resolution. All that scale_to() does is move the scaling logic into range_window_bounds.

Removing range_window_bounds::scale_to() would break existing functionality in grouped_time_range...(), and callers thereof (including Spark) who depend on the window widths being in DAYS.

For what it's worth, there isn't a way to avoid the double dispatch anyway, even without scale_to(), because we'd still have to validate that the window range type (integral or duration) is compatible with the order-by column (integral or timestamp). This is currently checked as part of the scale_to call stack.
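The compatibility check being discussed might look like the following sketch (hypothetical names, plain Python rather than libcudf's type-dispatched C++): integral order-by columns require an exactly matching offset type, while timestamp columns accept duration offsets, which are then rescaled as needed.

```python
# Illustrative subsets of the type universe; names mirror libcudf's enum.
INTEGRAL_TYPES  = {"INT16", "INT32", "INT64", "UINT32", "UINT64"}
TIMESTAMP_TYPES = {"TIMESTAMP_DAYS", "TIMESTAMP_SECONDS", "TIMESTAMP_MILLISECONDS"}

def is_supported_range_type(orderby_type, range_type):
    """Return True if `range_type` may bound a range window over `orderby_type`."""
    if orderby_type in INTEGRAL_TYPES:
        return range_type == orderby_type            # exact match for integrals
    if orderby_type in TIMESTAMP_TYPES:
        return range_type.startswith("DURATION_")    # duration, scaled to match
    return False

print(is_supported_range_type("TIMESTAMP_SECONDS", "DURATION_DAYS"))  # True
print(is_supported_range_type("INT32", "INT64"))                      # False
```

In C++ the same check requires dispatching on both the order-by type and the range type, which is the double dispatch under discussion.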

Comment on lines +83 to +84:

```cpp
 * `scale_to()` scales the bounds scalar from its original granularity (e.g. DURATION_DAYS)
 * to the orderby column's granularity (DURATION_SECONDS), before comparisons are made.
```
Contributor:

I think this should be the user's responsibility. Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?) and with other libcudf APIs that require types already be matched.

Contributor Author:

> ... Otherwise it's very inconsistent both with itself (you can't do the same thing with non-time types?)

My thought was that scaling would also apply to decimal types, when we add support for it.

Contributor:

We don't rescale decimal types in other APIs either. Decimal types with different scales are treated like different types.

jrhemstad (Contributor) left a review comment:

This API has my spidey sense of "making decisions" tingling. It feels too specialized, or at least that it's doing too many things. Furthermore, looking at rolling.hpp we're starting to have a proliferation of "rolling" APIs and I'm worried about the complexity and confusion of having so many.

Am I correct that all of this boils down to generating a per-element preceding/following window range that could then be passed into the generic rolling_window API?

If so, it feels like we should exploit that fact. Perhaps one way to simplify this and other specialized "rolling" functionality is to split up the logic of generating the per-element bounds and actually computing the windowed aggregation.

Then we should also consider if there is redundancy in the existing APIs. E.g., grouped_rolling_window has overloads that take both size_type and window_bounds for the preceding/following window sizes. Isn't window_bounds more generic and therefore could replace the size_type overload?
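The decomposition proposed above can be sketched in plain Python (hypothetical names, not a libcudf API): first derive per-row preceding/following row counts from the order-by column, then hand those counts to a generic rolling aggregation that knows nothing about ranges.

```python
def range_bounds_to_row_counts(orderby, preceding, following):
    """Per-row window extents for a single group, sorted ascending.

    prec[i] counts rows at or before i whose key is >= orderby[i] - preceding
    (so it includes the current row); foll[i] counts rows strictly after i
    whose key is <= orderby[i] + following.
    """
    prec, foll = [], []
    for i, o in enumerate(orderby):
        prec.append(sum(1 for oj in orderby[:i + 1] if oj >= o - preceding))
        foll.append(sum(1 for oj in orderby[i + 1:] if oj <= o + following))
    return prec, foll

def rolling_sum(values, prec, foll):
    """Generic rolling aggregation over precomputed per-row bounds."""
    return [sum(values[i - p + 1 : i + f + 1])
            for i, (p, f) in enumerate(zip(prec, foll))]

# user1's slice of the example from the PR description:
days  = [1, 2, 3, 7, 7]
sales = [10, 20, 10, 50, 60]
prec, foll = range_bounds_to_row_counts(days, 1, 1)
print(rolling_sum(sales, prec, foll))  # [30, 40, 30, 110, 110]
```

Splitting the two steps this way would let the range logic live in one place while every windowed aggregation reuses the same bounds-driven kernel.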

mythrocks (Contributor Author):

> How does this relate to #7678?

#7678 is likely the JNI part of the same functionality.

@mythrocks mythrocks added the non-breaking Non-breaking change label Mar 23, 2021
mythrocks (Contributor Author) commented Mar 23, 2021:

> we should also consider if there is redundancy in the existing APIs. E.g., grouped_rolling_window has overloads that take both size_type and window_bounds for the preceding/following window sizes. Isn't window_bounds more generic and therefore could replace the size_type overload?

I agree with your assessment. I hope to replace grouped_time_range_rolling_window() completely with grouped_range_rolling_window().

I envision the following changes in an upcoming release:

  1. Deprecate (and later, remove) both grouped_time_range_rolling_window() functions. These are an artifact of solving smaller parts of the more general problem. grouped_range_rolling_window() will subsume this feature.
  2. Deprecate (and later, remove) the size_type overloads of grouped_rolling_window(). These do not support UNBOUNDED ranges. We should use window_bounds for fixed width window functions.
  3. Deprecate (and later, remove) rolling_window() and grouped_rolling_window() overloads that do not take default_values. These exist currently because they were added before LEAD/LAG were supported.
  4. Possibly introduce window_bounds overloads for rolling_window(), if the Python/Pandas crew do not object, and deprecate the older functions. Without this, UNBOUNDED windows can't be supported with most rolling_window() functions.

This work hasn't been tackled yet because I'm still exploring what else we might need to support more non-trivial ranking functions. E.g. RANK()/DENSE_RANK() require access to columns that the current interfaces do not have access to.

My assessment should be firmed up around the 0.20 time-frame. I'll give thought to doing this via composable functions.

jrhemstad (Contributor):

Okay, glad to hear that you're already thinking about simplifying and consolidating these APIs. The sketch you laid out sounds reasonable.

wbo4958 (Contributor) commented Mar 24, 2021:

> How does this relate to #7678?

Hi, #7678 is the Java/JNI layer for the range window.

cpp/src/rolling/range_window_bounds.cpp (review thread resolved)
cpp/tests/CMakeLists.txt (review thread resolved)
1. Removed unnecessary std::move()
2. Undid mistaken change to tests/CMakeLists.txt
Also cleaned up references to other functions.
@mythrocks mythrocks marked this pull request as ready for review March 25, 2021 06:51
@mythrocks mythrocks requested a review from a team as a code owner March 25, 2021 06:51
Comment on lines +497 to +498:

```cpp
    range_window_bounds&& preceding,
    range_window_bounds&& following,
```
Contributor:

Why are these r-value refs? What if someone wants to reuse the same range_window_bounds for multiple calls to grouped_range_rolling_window?

Contributor Author:

This had to do with the ownership of the range scalar in range_window_bounds. I'll see if this can be removed if scale_to is removed.

@mythrocks mythrocks marked this pull request as draft March 25, 2021 18:57
mythrocks (Contributor Author):

Thank you for your attention to this PR.

On Monday, I will back this out of 0.19, rather than rush it for Wednesday's release.

mythrocks (Contributor Author):

Closing this. I'll address this in 0.20.

@mythrocks mythrocks closed this Mar 29, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 29, 2021
(Redux of #7675, which was for `branch-0.19`, and included support to scale duration scalars to the orderby column's resolution.)

This change introduces a `grouped_range_rolling_window()` function, to generalize the functionality available through `grouped_time_range_rolling_window()`.

### Prelude
Recall that `grouped_time_range_rolling_window()` applies an aggregation over a dynamic window of rows. The width of the window is determined based on a timestamp column, and the relative preceding/following offset specified in days. E.g. Consider this dataset:

```
[ // user,  sales_amt,  YYYYMMDD (date)
  { "user1",   10,      20200101    },
  { "user2",   20,      20200101    },
  { "user1",   20,      20200102    },
  { "user1",   10,      20200103    },
  { "user2",   30,      20200101    },
  { "user2",   80,      20200102    },
  { "user1",   50,      20200107    },
  { "user1",   60,      20200107    },
  { "user2",   40,      20200104    }
]
```
Grouping by `user_id`, and ordering by `date` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`):
```
Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                   <-------user1-------->|<---------user2--------->
```
The `SUM` aggregation applied over a window of `(1 DAY PRECEDING, 1 DAY FOLLOWING)` with a minimum of 1 period yields the following result column:
```
Results:         [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]
```
### What's new in this commit
`grouped_range_rolling_window()` extends the erstwhile range-window functionality in two ways:
  1. The order-by column (`date`) is **not restricted to timestamps**. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.
  E.g. 
     ```
     [ // user,  sales_amt,  seconds_since_snap
       { "user1",   10,         86451          },
       { "user2",   20,         86735          },
       { "user1",   20,         89162          },
       { "user1",   10,         92152          },
       ...]
     ```
  2. If the order-by column is indeed a timestamp, the **preceding/following offsets may be duration scalars**, whose precision must match that of the timestamp order-by column.
  E.g. If `orderby_column.type() == TIMESTAMP_SECONDS`, `preceding.type()` may only be `DURATION_SECONDS`.

Analogous to `window_bounds`, a new `range_window_bounds` class has been introduced to specify bounds for range window functions: 
1. Supports scalar values for preceding/following 
2. Supports `UNBOUNDED` window widths
   
`range_window_bounds` currently supports only integral and duration scalars.
Correspondingly, the `orderby_column` may only be an integral or timestamp column. Support for float and fixed_point types may be added at a later date.

The existing `grouped_time_range_rolling_window()` function now delegates to the same backend as `grouped_range_rolling_window()`.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Mike Wendt (https://github.com/mike-wendt)
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/nvdbaranec
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: #7866
Labels: 2 - In Progress, 4 - Needs Review, CMake, conda, feature request, libcudf, non-breaking
Projects: None yet
6 participants