Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend range window queries to non-timestamp order-by columns #7866

Merged
merged 15 commits into from
Apr 29, 2021

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Apr 5, 2021

(Redux of #7675, which was for branch-0.19, and included support to scale duration scalars to the orderby column's resolution.)

This change introduces a grouped_range_rolling_window() function, to generalize the functionality available through grouped_time_range_rolling_window().

Prelude

Recall that grouped_time_range_rolling_window() applies an aggregation over a dynamic window of rows. The width of the window is determined based on a timestamp column, and the relative preceding/following offset specified in days. E.g. Consider this dataset:

[ // user,  sales_amt,  YYYYMMDD (date)
  { "user1",   10,      20200101    },
  { "user2",   20,      20200101    },
  { "user1",   20,      20200102    },
  { "user1",   10,      20200103    },
  { "user2",   30,      20200101    },
  { "user2",   80,      20200102    },
  { "user1",   50,      20200107    },
  { "user1",   60,      20200107    },
  { "user2",   40,      20200104    }
]

Grouping by user_id, and ordering by date yields the following sales_amt vector (with 2 groups, one for each distinct user_id):

Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                   <-------user1-------->|<---------user2--------->

The SUM aggregation applied over a window of (1 DAY PRECEDING, 1 DAY FOLLOWING) with a minimum of 1 period yields the following result column:

Results:         [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]

What's new in this commit

grouped_range_rolling_window() extends the erstwhile range-window functionality in two ways:

  1. The order-by column (date) is not restricted to timestamps. This may also be a sorted integral column. The preceding/following offsets will then need to be specified as scalars of the same type as the order-by column.
    E.g.
    [ // user,  sales_amt,  seconds_since_snap
      { "user1",   10,         86451          },
      { "user2",   20,         86735          },
      { "user1",   20,         89162          },
      { "user1",   10,         92152          },
      ...]
    
  2. If the order-by column is indeed a timestamp, the preceding/following offsets may be duration scalars, whose precision must match that of the timestamp order-by column.
    E.g. If orderby_column.type() == TIMESTAMP_SECONDS, preceding.type() may only be DURATION_SECONDS.

Analogous to window_bounds, a new range_window_bounds class has been introduced to specify bounds for range window functions:

  1. Supports scalar values for preceding/following
  2. Supports UNBOUNDED window widths

range_window_bounds currently supports only integral and duration scalars.
Correspondingly, the orderby_column can only be integral and timestamp columns. Support for floats and fixed_point types may be added at a later date.

The existing grouped_time_range_rolling_window() function now delegates to the same backend as grouped_range_rolling_window().

1. Enabled integral types. Some tests.
2. Fixed unsigned.
3. Tonnes of tests for nulls in orderby column
4. Purged references to timestamps in range-window code.
5. Tests for time scaling.
6. Added tests for mean. Fixed sum test.
7. Added collect tests.
8. Tests for ASC/DESC timestamps, all aggs.
9. Docs.
10.Renamed the orderby null tests.
11.Added range_window_bounds.hpp to meta.yaml.
12.Review comments (std::move, CMakeLists)
13.Better example for numeric range window function.
14.Removed TODO, again.
15.Formatting
16.Added integral range tests.
17.Removed scaling. Streamlined the dispatch logic.
@mythrocks mythrocks self-assigned this Apr 5, 2021
@mythrocks mythrocks requested review from a team as code owners April 5, 2021 23:14
@github-actions github-actions bot added CMake CMake build issue conda libcudf Affects libcudf (C++/CUDA) code. labels Apr 5, 2021
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change labels Apr 5, 2021
@mythrocks mythrocks marked this pull request as draft April 5, 2021 23:18
@mythrocks
Copy link
Contributor Author

Repeating comments from #7675 for clarity:

we should also consider if there is redundancy in the existing APIs. E.g., grouped_rolling_window has overloads that take both size_type and window_bounds for the preceding/following window sizes. Isn't window_bounds more generic and therefore could replace the size_type overload?

I agree with @jrhemstad's assessment above. I hope to replace grouped_time_range_rolling_window() completely with grouped_range_rolling_window().

I envision the following changes in an upcoming release:

  1. Deprecate (and later, remove) both grouped_time_range_rolling_window() functions. These are an artifact of the solving smaller parts of the more generic problem. grouped_range_rolling_window() will subsume this feature.
  2. Deprecate (and later, remove) the size_type overloads of grouped_rolling_window(). These do not support UNBOUNDED ranges. We should use window_bounds for fixed width window functions.
  3. Deprecate (and later, remove) rolling_window() and grouped_rolling_window() overloads that do not take default_values. These exist currently because they were added before LEAD/LAG were supported.
  4. Possibly introduce window_bounds overloads for rolling_window(), if the Python/Pandas crew do not object, and deprecate the older functions. Without this, UNBOUNDED windows can't be supported with most rolling_window() functions.

This work hasn't been tackled yet because I'm still exploring what else we might need to support more non-trivial ranking functions. E.g. RANK()/DENSE_RANK() requires access to columns that the current interfaces do not have access to.

@codecov
Copy link

codecov bot commented Apr 6, 2021

Codecov Report

Merging #7866 (d39aaf4) into branch-0.20 (51336df) will decrease coverage by 0.01%.
The diff coverage is 88.20%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #7866      +/-   ##
===============================================
- Coverage        82.88%   82.87%   -0.02%     
===============================================
  Files              103      103              
  Lines            17668    17835     +167     
===============================================
+ Hits             14645    14781     +136     
- Misses            3023     3054      +31     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/io/orc.py 86.89% <ø> (ø)
python/cudf/cudf/utils/cudautils.py 57.75% <ø> (ø)
python/cudf/cudf/utils/dtypes.py 81.87% <ø> (-1.57%) ⬇️
python/cudf/cudf/utils/utils.py 90.00% <ø> (+0.49%) ⬆️
python/dask_cudf/dask_cudf/backends.py 89.51% <ø> (-0.08%) ⬇️
python/dask_cudf/dask_cudf/groupby.py 92.64% <ø> (+0.49%) ⬆️
python/dask_cudf/dask_cudf/io/orc.py 91.04% <ø> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <ø> (ø)
python/cudf/cudf/core/column/lists.py 86.98% <66.66%> (-0.43%) ⬇️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 290c6ef...d39aaf4. Read the comment docs.

@wbo4958
Copy link
Contributor

wbo4958 commented Apr 8, 2021

FYI, The java bindings PR #7909

@mythrocks mythrocks marked this pull request as ready for review April 8, 2021 18:49
Copy link
Contributor

@mike-wendt mike-wendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved changes for ops CODEOWNERS files; will rely on team to verify other changes

cpp/include/cudf/rolling.hpp Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Show resolved Hide resolved
cpp/include/cudf/rolling.hpp Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Show resolved Hide resolved
cpp/src/rolling/range_window_bounds.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/rolling/range_window_bounds.hpp Outdated Show resolved Hide resolved
1. Fixed @copydoc.
2. Added missing docs for range_window_bounds.
3. Removed invariant. Inlined into constructor.
@mythrocks
Copy link
Contributor Author

@vuule, thank you for the detailed review. GitHub is refusing to let me reply inline, so I'm going to have to quote and reply here, manually. Please bear with me:

It's usually better to avoid negatives in the names... !is_unbounded adds a bit of overhead when reading the code.

I heartily agree with this advice, in general contexts. I'm afraid the term unbounded comes from SQL syntax for OVER clauses in window functions, indicating that the bounds go all the way to the beginning/end of the group/column.
The problem with un-negating unbounded is that it amounts to double-negation for those familiar with window functions. :] Could I please request permission to leave this, err, unchanged?

Should the default be unbounded, given that the default range is nullptr?

That's a good catch. I'll change this.

why is the helper needed?

Hmm... This was so that I didn't have to explicitly state the template arguments for window_exec_impl. It was for the same reason that std::back_inserter() is a type-deducing helper for std::back_insert_iterator<>. But since we're now on C++17, I might be able to do without the helper. I'll check.

1. Fixed punctuation.
2. Renamed add_and_check_overflow(), etc.
3. Changed range_window_bounds default to unbounded.
4. Removed test helper. Not reqd with C++17.
@mythrocks
Copy link
Contributor Author

Maybe add_safe?

This function (and the other one) aims to prevent overflow/underflow in bounds calculation. I caught this when I enabled range queries on unsigned order_by columns, and got incorrect results. The aim is to cap the min-value at numeric_limits<>::min() and max-value at numeric_limits<>::max().
I have changed these to add_safe() and subtract_safe().

No need to use std::tie now :)

I appreciate the suggestion to use structured bindings, but I'm afraid they don't suffice in this context: I would have to use those as captures for the lambdas that follow, and that is unfortunately disallowed.

../src/rolling/grouped_rolling.cu(359): error: structured binding cannot be captured
          detected during:
            instantiation of "std::unique_ptr<cudf::column, std::default_delete<cudf::column>> cudf::<unnamed>::grouped_r
ange_rolling_window_impl<OrderByT>(const cudf::column_view &, const cudf::column_view &, const cudf::order &, const rmm::
device_uvector<cudf::size_type> &, const rmm::device_uvector<cudf::size_type> &, cudf::range_window_bounds, cudf::range_w
indow_bounds, cudf::size_type, const std::unique_ptr<cudf::aggregation, std::default_delete<cudf::aggregation>> &, rmm::c
uda_stream_view, rmm::mr::device_memory_resource *) [with OrderByT=int8_t]"

Not using the stream here

Thanks for pointing that out. I keep forgetting to correct that. Fixed now.

What does the 5 mean here?

You will hate me for this: this was shorthand to convert from TIMESTAMP_DAYS to DURATION_DAYS, TIMESTAMP_SECONDS to DURATION_SECONDS etc., without using a switch-case. :/ I'll make this more explicit.

1. Use streams for calculating null-bounds.
@vuule
Copy link
Contributor

vuule commented Apr 28, 2021

I wonder why you cannot reply inline, I can reply to myself.
Feel free to keep is_unbounded, I was not aware of the domain terminology.

1. Explicit conversion from timestamp type to duration type.
@mythrocks
Copy link
Contributor Author

It's preferred to have factory methods return unique_ptr...
range_window_bounds should take a unique_ptr
Does this need to be a shared_ptr?

Please note that range_window_bounds has a private constructor, and is supposed to be constructed by one of two factory methods:

  1. get(scalar const&)
  2. unbounded(data_type)

The problem was that we'd need to make a copy of the scalar to store in range_window_bounds, and to make range_window_bounds itself copyable. The shared_ptr allows this cleanly.

It's preferred to have factory methods return unique_ptr...
range_window_bounds should take a unique_ptr

Since these functions are both private to implementation, and used only to initialize the shared_ptr, could I please keep the scalar*?

@vuule
Copy link
Contributor

vuule commented Apr 28, 2021

Since these functions are both private to implementation, and used only to initialize the shared_ptr, could I please keep the scalar*?

I guess it's fine since you already have 3 approvals :) Please add comments to explain that the ownership is transferred in the constructor, and that the allocated object will be freed via the shared pointer (it's a bit scary to see a new call, especially without any delete calls).

1. Cleaned up factory implementation, to make shared_ptr more apparent.
2. Documentation
@mythrocks
Copy link
Contributor Author

(it's a bit scary to see a new call, especially without any delete calls).

@vuule, I have changed the (private) factory implementation to return shared_ptr<> instead of naked pointers, and tried to clarify this in documentation. I'm hoping this is more palatable.

@vuule
Copy link
Contributor

vuule commented Apr 28, 2021

(it's a bit scary to see a new call, especially without any delete calls).

@vuule, I have changed the (private) factory implementation to return shared_ptr<> instead of naked pointers, and tried to clarify this in documentation. I'm hoping this is more palatable.

That's better IMO :) Any specific reason why you used shared_ptr instead of unique_ptr?

@mythrocks
Copy link
Contributor Author

That's better IMO :) Any specific reason why you used shared_ptr instead of unique_ptr?

The initial version of this PR (#7675) used unique_ptr. The hard part was how the caller would specify the scalar (which is polymorphic), and the lifetime/ownership of said scalar. Using unique_ptr disables copyability of range_window_bounds, and precludes the use of the same range_window_bounds object for multiple window function calls.
(#7675 included an abortive attempt to use range_window_bounds&& in the API, and too many std::move() calls.)

Adopting a scalar clone via shared_ptr turns out to be cleaner, overall. (I doff my hat to @nvdbaranec for the suggestion.)

1. Corrected docs for resolution of durations/timestamps.
2. Fixed nitpick in motor-racing dataset.
3. Switched bounds factory impl to use unique_ptr.
4. assert_non_negative() now uses if constexpr.
@mythrocks mythrocks requested a review from vuule April 28, 2021 21:12
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the comments!

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Apr 29, 2021
@mythrocks
Copy link
Contributor Author

Looks like we're waiting for a +1 from rapidsai/ops-codeowners.

@rapids-bot rapids-bot bot merged commit 5a012e5 into rapidsai:branch-0.20 Apr 29, 2021
rapids-bot bot pushed a commit that referenced this pull request Apr 30, 2021
Addendum to #7866, to switch from using `std::tie()` to structured bindings, as prescribed for C++17.

This required workarounds for compiler restrictions on using aliases from structured bindings as captures in lambdas.

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Christopher Harris (https://github.com/cwharris)
  - Nghia Truong (https://github.com/ttnghia)

URL: #8117
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants