Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Support nullable timestamp columns in time range window functions #6557

Merged

Conversation

mythrocks
Copy link
Contributor

Closes #6524.

Time-range based rolling window functions currently do not consider the possibility that the timestamp column might have nulls in them, thus muffing up the time-window calculation.

Note that grouped_time_range_rolling_window() requires that the timestamp column is sorted (ASC or DESC), and possibly grouped by other keys. The NULL values in the timestamp column will be clustered at the beginning or end of the column (if ungrouped), or beginning/end of each group.
E.g. The following is an example of a table grouped by the first column, ordered by ASCENDING timestamp (within each group), with NULLs appearing first.

GBY key Timestamp key Aggregation Value
0 NULL 0
0 NULL 1
0 2020-01-02 2
0 2020-01-03 3
0 2020-01-04 4
1 NULL 0
1 NULL 1
1 NULL 2
1 2020-01-03 3
1 2020-01-04 4
1 2020-01-05 5

In this case, when doing a rolling-window COUNT on the third column with 1 DAY preceding and 1 DAY following, the rows with NULL timestamps should not participate with those that have a valid timestamp. They will, however, participate in a window of their own. The output for rolling-window COUNT should be:

2, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2

Note that the third row of the table must ignore the previous two rows, since their timestamps are NULL. The rolling-window for the null rows themselves must consider all rows in that group's null-range. This behaviour is congruent with ranged window operations in SQL systems (such as SparkSQL).

The commits included herewith provide support for nullable timestamp columns in time-range window functions.

1. Initial commit. Fix for non-grouped time-ranges with
   null timestamps, in ASC order.
2. Fixed non-grouped time-ranges with null timestampw, in DESC.
3. Fixed grouped time-ranges with null timestamp, in ASC.
4. Fixed grouped time-ranges with null timestamp, in DESC.
5. Fixed grouped time-ranges for ASC (following)
6. Tests for different combinations of timestamp null grouping,
   timestamp ordering, and key grouping.
7. Refactor: Common code gathered to utility function.
8. Code formatting
@mythrocks mythrocks requested a review from a team as a code owner October 19, 2020 22:35
@mythrocks mythrocks self-assigned this Oct 19, 2020
@mythrocks mythrocks added the 2 - In Progress Currently a work in progress label Oct 19, 2020
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks
Copy link
Contributor Author

This change stomps all over the work in #6512. I realize that I'd be mixing concerns here, but would anyone be opposed to my including that refactor in this PR?
The refactor really only does a couple of simple things:

  1. Move grouped_time_range*() and its helpers to a separate translation unit.
  2. Remove support for timestamp_D columns. A timestamp_D input will be cast to a timestamp_S column, thus eliminating the int32_t template instantiations.
    These changes should speed up the compile and reduce the object size.

@vuule
Copy link
Contributor

vuule commented Oct 19, 2020

This change stomps all over the work in #6512. I realize that I'd be mixing concerns here, but would anyone be opposed to my including that refactor in this PR?
The refactor really only does a couple of simple things:

  1. Move grouped_time_range*() and its helpers to a separate translation unit.
  2. Remove support for timestamp_D columns. A timestamp_D input will be cast to a timestamp_S column, thus eliminating the int32_t template instantiations.
    These changes should speed up the compile and reduce the object size.

Is #6512 ready for review?

@mythrocks
Copy link
Contributor Author

Is #6512 ready for review?

Hey, @vuule. It is, although I haven't benchmarked the runtime effect of casting timestamp_D -> timestamp_S.
It would be a little easier to redo those changes here (or after this PR) than to upmerge this PR after those changes have gone in. :]

@mythrocks mythrocks changed the title Fix time range window queries with null timestamps Support nullable timestamp columns in time range window functions Oct 19, 2020
@vuule
Copy link
Contributor

vuule commented Oct 19, 2020

Hey, @vuule. It is, although I haven't benchmarked the runtime effect of casting timestamp_D -> timestamp_S.
It would be a little easier to redo those changes here (or after this PR) than to upmerge this PR after those changes have gone in. :]

Now I understand the question:) I think the diff would not work well if the code has been moved between files, so these two PRs combined would be really hard to review. I would prefer to keep them separate, mostly for this reason.

@mythrocks
Copy link
Contributor Author

Thanks, @vuule. I'll make #6512 depend on #6557. I'll redo the #6512 change when this one goes in.

@mythrocks mythrocks added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 20, 2020
@mythrocks mythrocks changed the title Support nullable timestamp columns in time range window functions [REVIEW] Support nullable timestamp columns in time range window functions Oct 20, 2020
@codecov
Copy link

codecov bot commented Oct 20, 2020

Codecov Report

Merging #6557 into branch-0.17 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.17    #6557   +/-   ##
============================================
  Coverage        82.54%   82.54%           
============================================
  Files               90       90           
  Lines            14928    14928           
============================================
  Hits             12322    12322           
  Misses            2606     2606           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbd2726...5d82e7c. Read the comment docs.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
1. Non-groupby case: Use null-count to calculate null bounds
2. Groupby case: Use partition_point()
@mythrocks
Copy link
Contributor Author

Hmm. There are test failures that don't seem related to this change, at least on the face of it. From the test logs:

=========================== short test summary info ============================
FAILED custreamz/tests/test_dataframes.py::test_binary_stream_operators[dask]
FAILED custreamz/tests/test_dataframes.py::test_pair_arithmetic[dask] - Value...
FAILED custreamz/tests/test_dataframes.py::test_getitem[dask] - KeyError: 'gt...
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>0-<lambda>0-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>0-<lambda>0-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>0-<lambda>2-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>0-<lambda>2-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>1-<lambda>0-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>1-<lambda>0-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>1-<lambda>2-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>1-<lambda>2-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>2-<lambda>0-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>2-<lambda>0-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>2-<lambda>2-<lambda>0]
FAILED custreamz/tests/test_dataframes.py::test_groupby_aggregate[dask-<lambda>2-<lambda>2-<lambda>1]
FAILED custreamz/tests/test_dataframes.py::test_setitem[dask] - AttributeErro...
FAILED custreamz/tests/test_dataframes.py::test_setitem_overwrites[dask] - Va...
FAILED custreamz/tests/test_dataframes.py::test_integration_from_stream[dask]
= 18 failed, 582 passed, 4 skipped, 240 xfailed, 1 xpassed, 2 warnings in 275.40s (0:04:35) =
+++ EXITCODE=1
++ gpuci_logger 'Test notebooks'

@vuule
Copy link
Contributor

vuule commented Oct 22, 2020

Hmm. There are test failures that don't seem related to this change, at least on the face of it.

Yeah, this is a known CI issue. Started happening two days ago.

@mythrocks
Copy link
Contributor Author

Thanks, @vuule. Phew.

As an aside, I think I've incorporated your suggestions in the current version of the code. Please let me know if this is more palatable.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions, requesting changes primarily for of test coverage.

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
1. Short circuit eval for all-null/no-null cases.
2. Test cases for all-null/no-null.
3. Code formatting
@harrism
Copy link
Member

harrism commented Oct 26, 2020

rerun tests

@mythrocks
Copy link
Contributor Author

@vuule, thank you for your review and advice. Hey, @davidwendt. I was wondering if you might take a gander at this change. Thanks.

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
1. Fixed angle-brackets for header inclusion.
2. Switched to doxygen-style function documentation
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Got an optional suggestion.

cpp/src/rolling/rolling.cu Outdated Show resolved Hide resolved
1. Streamline null_bound calculation, with fewer comparisons.
2. Fix botched null_bound calculation cleanup.
3. Remove errant println in tests.
@mythrocks mythrocks merged commit 350cd16 into rapidsai:branch-0.17 Oct 27, 2020
@mythrocks
Copy link
Contributor Author

Thanks for the reviews, all. I have just merged this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] grouped_time_range_rolling_window does not look at validity when determining bounds useing the timestamp
5 participants