-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend range window queries to non-timestamp order-by columns #7866
Conversation
1. Enabled integral types. Some tests. 2. Fixed unsigned. 3. Tonnes of tests for nulls in orderby column 4. Purged references to timestamps in range-window code. 5. Tests for time scaling. 6. Added tests for mean. Fixed sum test. 7. Added collect tests. 8. Tests for ASC/DESC timestamps, all aggs. 9. Docs. 10.Renamed the orderby null tests. 11.Added range_window_bounds.hpp to meta.yaml. 12.Review comments (std::move, CMakeLists) 13.Better example for numeric range window function. 14.Removed TODO, again. 15.Formatting 16.Added integral range tests. 17.Removed scaling. Streamlined the dispatch logic.
Repeating comments from #7675 for clarity:
I agree with @jrhemstad's assessment above. I hope to replace I envision the following changes in an upcoming release:
This work hasn't been tackled yet because I'm still exploring what else we might need to support more non-trivial ranking functions. E.g. |
Codecov Report
@@ Coverage Diff @@
## branch-0.20 #7866 +/- ##
===============================================
- Coverage 82.88% 82.87% -0.02%
===============================================
Files 103 103
Lines 17668 17835 +167
===============================================
+ Hits 14645 14781 +136
- Misses 3023 3054 +31
Continue to review full report at Codecov.
|
57816eb
to
69a3c31
Compare
FYI, The java bindings PR #7909 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved changes for ops CODEOWNERS
files; will rely on team to verify other changes
1. Fixed @copydoc. 2. Added missing docs for range_window_bounds. 3. Removed invariant. Inlined into constructor.
@vuule, thank you for the detailed review. GitHub is refusing to let me reply inline, so I'm going to have to quote and reply here, manually. Please bear with me:
I heartily agree with this advice, in general contexts. I'm afraid the term
That's a good catch. I'll change this.
Hmm... This was so that I didn't have to explicitly state the template arguments for |
1. Fixed punctuation. 2. Renamed add_and_check_overflow(), etc. 3. Changed range_window_bounds default to unbounded. 4. Removed test helper. Not reqd with C++17.
This function (and the other one) aims to prevent overflow/underflow in bounds calculation. I caught this when I enabled range queries on unsigned
I appreciate the suggestion to use structured bindings, but I'm afraid they don't suffice in this context: I would have to use those as captures for the lambdas that follow, and that is unfortunately disallowed.
Thanks for pointing that out. I keep forgetting to correct that. Fixed now.
You will hate me for this: this was shorthand to convert from |
1. Use streams for calculating null-bounds.
I wonder why you cannot reply inline, I can reply to myself. |
1. Explicit conversion from timestamp type to duration type.
Please note that
The problem was that we'd need to make a copy of the scalar to store in
Since these functions are both private to implementation, and used only to initialize the |
I guess it's fine since you already have 3 approvals :) Please add comments to explain that the ownership is transferred in the constructor, and that the allocated object will be freed via the shared pointer (it's a bit scary to see a |
1. Cleaned up factory implementation, to make shared_ptr more apparent. 2. Documentation
@vuule, I have changed the (private) factory implementation to return |
That's better IMO :) Any specific reason why you used shared_ptr instead of unique_ptr? |
The initial version of this PR (#7675) used Adopting a scalar clone via |
1. Corrected docs for resolution of durations/timestamps. 2. Fixed nitpick in motor-racing dataset. 3. Switched bounds factory impl to use unique_ptr. 4. assert_non_negative() now uses if constexpr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing the comments!
@gpucibot merge |
Looks like we're waiting for a |
Addendum to #7866, to switch from using `std::tie()` to structured bindings, as prescribed for C++17. This required workarounds for compiler restrictions on using aliases from structured bindings as captures in lambdas. Authors: - MithunR (https://github.com/mythrocks) Approvers: - Conor Hoekstra (https://github.com/codereport) - Vukasin Milovanovic (https://github.com/vuule) - Christopher Harris (https://github.com/cwharris) - Nghia Truong (https://github.com/ttnghia) URL: #8117
(Redux of #7675, which was for
branch-0.19
, and included support to scale duration scalars to the orderby column's resolution.)This change introduces a
grouped_range_rolling_window()
function, to generalize the functionality available throughgrouped_time_range_rolling_window()
.Prelude
Recall that
grouped_time_range_rolling_window()
applies an aggregation over a dynamic window of rows. The width of the window is determined based on a timestamp column, and the relative preceding/following offset specified in days. E.g. Consider this dataset:Grouping by
user_id
, and ordering bydate
yields the followingsales_amt
vector (with 2 groups, one for each distinctuser_id
):The
SUM
aggregation applied over a window of(1 DAY PRECEDING, 1 DAY FOLLOWING)
with a minimum of 1 period yields the following result column:What's new in this commit
grouped_range_rolling_window()
extends the erstwhile range-window functionality in two ways:date
) is not restricted to timestamps. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.E.g.
E.g. If
orderby_column.type() == TIMESTAMP_SECONDS
,preceding.type()
may only beDURATION_SECONDS
.Analogous to
window_bounds
, a newrange_window_bounds
class has been introduced to specify bounds for range window functions:UNBOUNDED
window widthsrange_window_bounds
currently supports only integral and duration scalars.Correspondingly, the
orderby_column
can only be integral and timestamp columns. Support for floats and fixed_point types may be added at a later date.The existing
grouped_time_range_rolling_window()
function now delegates to the same backend asgrouped_range_rolling_window()
.