Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Optimize row_number/rank for memory usage #1859

Closed
revans2 opened this issue Mar 3, 2021 · 0 comments · Fixed by #2953
Closed

[FEA] Optimize row_number/rank for memory usage #1859

revans2 opened this issue Mar 3, 2021 · 0 comments · Fixed by #2953
Assignees
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Mar 3, 2021

This is a part of #1789

In GpuWindowExec there are two sets of keys that are passed into it. The order by key(s) and the partition by keys(s). So if we wanted to run a command like

ROW_NUMBER() OVER (PARTITION BY A ORDER BY B) AS ROWNUM

Then the code would hash partition/shuffle on the column A and then sort within each partition on A, B, so that all of the data for each unique A group is next to each other. like in the following table

A B
1 0
1 5
6 -4
6 0
6 5

Currently in GpuWindowExec this is all brought into a single batch and sent to cudf for processing. #1856 may change/improve that some what, but the best it can do is split the data on grouping boundaries. This is not enough if the groups are too large to fit in a single batch or even in GPU memory, like in the case of #1642. For that we need to adjust how our algorithms work.

For rank(not supported yet), dense_range(not supported yet) and row_number(currently supported) we don't need all of the data for a window on the GPU at a single point in time. We just need the last row of the previous batch. With that we can add any offsets needed to the new batch to pick up where we left off.

For example a row_number query would only need the partition by keys and the final row number from the previous batch. After the new batch finishes processing, it would look for all rows that had the same partition by keys in them (are a part of the same group) and then add the row_number - 1 from the last row of the previous batch to it.

For dense_rank we also need to look at the ordering key to see if they match. If they do then we add the previous dense_rank - 1 to all of the rows that are a part of the same group. If they don't then we add the previous dense_rank to all of them.

For rank we would have to calculate row_number along with rank, but throw it away for all but the last row. Just like with dense_rank we would look at the ordering key from the previous batch to see if they match. If they do then we add the previous rank - 1 to all of the ranks in the same group. If they don't then we add the previous row_number - 1 to all of them in the same group.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 3, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 6, 2021
@revans2 revans2 linked a pull request Jul 26, 2021 that will close this issue
@revans2 revans2 added this to the July 19 - July 30 milestone Jul 26, 2021
@revans2 revans2 self-assigned this Jul 26, 2021
@revans2 revans2 closed this as completed Jul 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants