
[FEA] Implement a heuristic to split a project's input based on output and add to hash aggregate #8382

Closed
revans2 opened this issue May 24, 2023 · 1 comment
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin


@revans2
Collaborator

revans2 commented May 24, 2023

Is your feature request related to a problem? Please describe.
This is related to some other GpuProjectExec work:

#7866
#7258

This is similar, but instead of waiting to run out of memory and possibly produce an output that is way too large, we should split the input by rows if we can predict that the output size is going to be much larger than the input size.

This is specifically intended to be follow on work for #8141

The idea is that if we think the output of a project is going to be so large that it goes above our target batch size, then we should split the input proactively to avoid that. We see this show up especially in the pre-processing step for aggregates: to do them in a distributed way, and to work around limitations in CUDF (especially with overflow detection), we need to add extra columns to the data being aggregated.

Describe the solution you'd like

I am kind of shooting from the hip here, so we should do some profiling and testing, but I think what we want to do is estimate the output size per row, with an error bound, based on an input batch and the project expressions.

For any fixed-width types we know with 0 error that the output size is the width of the data plus some amortized validity buffer. For variable-length types it is a bit more of a guess. If the output is just a reference to an input column (which is common) we know with 0 error what the average per-row size is. That is not perfect if there is per-row skew, but it is probably good enough. For other columns we could try to estimate the size by looking at the expressions and having them help us, but let's not do that now unless we see real-world use cases where it really is problematic. In those cases we would guess the error as 100% and just pull in the estimate that Spark uses.
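
A rough sketch of what those per-column estimates might look like. None of these names (`SizeEstimate` and its constructors) are existing plugin APIs; this is just one way to model "bytes per row plus a fractional error bound":

```scala
// Illustrative only: a bytes-per-row estimate plus a fractional error bound.
case class SizeEstimate(bytesPerRow: Double, errorFraction: Double) {
  def minBytes: Double = bytesPerRow
  def maxBytes: Double = bytesPerRow * (1.0 + errorFraction)
}

object SizeEstimate {
  // Fixed-width output: the width is exact (0 error), plus an amortized
  // validity buffer of roughly one bit per row.
  def fixedWidth(byteWidth: Int): SizeEstimate =
    SizeEstimate(byteWidth + 1.0 / 8.0, errorFraction = 0.0)

  // Output that is just a reference to an input column: the average
  // per-row size for this batch is exact (0 error), even with row skew.
  def passThrough(inputColumnBytes: Long, rowCount: Int): SizeEstimate =
    SizeEstimate(inputColumnBytes.toDouble / rowCount, errorFraction = 0.0)

  // Everything else: fall back to the static per-type estimate that Spark
  // would use and treat the error as 100%.
  def fallback(sparkDefaultBytesPerRow: Double): SizeEstimate =
    SizeEstimate(sparkDefaultBytesPerRow, errorFraction = 1.0)
}
```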

Now we can compute a range of possible sizes by summing up all of the estimates, and also the estimates plus their errors. With this we can calculate a minimum and maximum number of rows that could be in an output batch and still fit in the desired output size. This can be adjusted later, but for now I want to lean towards making the batch a little bigger than needed when we just don't know, instead of making it too small. This is to mitigate any performance impact on the common case (when the batch size grows, but not in a huge way).
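
Continuing the hypothetical `SizeEstimate` names above, turning a target batch size into a row-count range could be as simple as:

```scala
// Sketch: sum the per-column estimates and convert the target batch size
// into a (minRows, maxRows) range. A real version would guard against
// empty column lists and zero-byte estimates.
def rowRange(cols: Seq[SizeEstimate], targetBatchBytes: Long): (Long, Long) = {
  val bestCaseBytesPerRow = cols.map(_.minBytes).sum   // no error
  val worstCaseBytesPerRow = cols.map(_.maxBytes).sum  // full error
  val maxRows = (targetBatchBytes / bestCaseBytesPerRow).toLong   // optimistic
  val minRows = (targetBatchBytes / worstCaseBytesPerRow).toLong  // pessimistic
  (minRows, maxRows)
}
```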

So for now I would say that if the number of input rows is less than the maximum number of rows we think would work, then we do nothing. If it is larger than that, then we calculate how many batches we should split it into based on the minimum-rows estimate, and try to split them evenly.
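
That decision could look something like this (again just a sketch; the helper name is made up):

```scala
// Sketch of the split decision: a result of 1 means leave the batch alone.
def numSplits(inputRows: Long, minRows: Long, maxRows: Long): Int = {
  if (inputRows <= maxRows) {
    1 // fits under the optimistic estimate, so do nothing
  } else {
    // Size the pieces by the conservative estimate and split evenly.
    math.ceil(inputRows.toDouble / math.max(minRows, 1L)).toInt
  }
}
```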

We want to put this into GpuProjectExec because I think it would be the simplest place to implement it and would give us some test coverage, but we also want to put it into the preprocessing step of hash aggregate. The latter is likely to require a lot more rework, because hash aggregate will need to be refactored so that the output of the preprocessing is an iterator instead of a single batch. We also need to think about retry so we don't lose that functionality; it probably means the split will have to be separate from executing the project so that the retry can sit in between them.
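
A hypothetical shape for that last point, with the split kept separate from the project so the retry wrapper can sit between them (`T` stands in for a batch type; none of these names are real plugin APIs):

```scala
// Sketch: split first, then lazily project each piece under retry, so the
// preprocessing step hands downstream code an iterator of batches instead
// of one big batch.
def splitThenProject[T](
    input: T,
    pieces: Int,
    split: (T, Int) => Seq[T],
    projectWithRetry: T => T): Iterator[T] =
  split(input, pieces).iterator.map(projectWithRetry)
```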

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels May 24, 2023
@mattahrens mattahrens changed the title [FEA] Implement a heruistic to split a project's input based on output and add to hash aggregate [FEA] Implement a heuristic to split a project's input based on output and add to hash aggregate May 31, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 31, 2023
@revans2 revans2 self-assigned this Jun 5, 2023
@revans2
Collaborator Author

revans2 commented Jun 29, 2023

The majority of this was done as part of #8618 and I think we can close this. I didn't put it into GpuProjectExec. It was working, but it was not perfect, and there were significant performance issues if we split the input too small. As such, we made the split size much larger and only did it for hash aggregate, because there was risk in putting it everywhere. If we see cases where this would help in project, I can add it there without much work.

@revans2 revans2 closed this as completed Jun 29, 2023
@sameerz sameerz removed the feature request New feature or request label Aug 10, 2023