[FEA] Create benchmarks for multi-batch processing cases #11575

Open
revans2 opened this issue Oct 9, 2024 · 0 comments
Labels
? - Needs Triage (Need team to review and classify), performance (A performance related task/issue)

Comments

@revans2 (Collaborator) commented Oct 9, 2024

Is your feature request related to a problem? Please describe.
We have seen issues with memory management when a task ends up processing multiple batches of input/output. The issue is described in #11343

But we don't have a good set of benchmarks to measure our progress in fixing this issue.

We need to come up with some formal benchmarks, ideally using NDS data sets or our stress test suite. They should cover cases where a single task has multiple batches of input and/or output. The cases we worry about most are those where a task leaves intermediate spillable data on the GPU, but we also want to verify that cases with little to no intermediate data on the GPU continue to work well.

We should try to measure the impact of spilling and of different I/O characteristics on the runtime of these jobs. I believe this will eventually come down to deciding, in a few different situations, whether it is better to buffer more remote data at the expense of increased local I/O + compute (spilling), and whether we should do more GPU processing at the expense of increased local I/O + compute (spilling). So it might be nice to have a way to simulate different levels of local disk and remote disk throughput/IOPs. I am not 100% sure how to do that, but if all we get are the queries, that is a good start and we can try to simulate disk speeds later. One possible knob-turning approach is sketched below.
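As a rough, unvalidated sketch: one way to approximate different local I/O characteristics without special hardware might be to point Spark's local spill/shuffle directory at mounts with different speeds and vary the plugin's spill-related settings between runs. The mount paths and the `spark.rapids.*` config names/values below are assumptions that should be checked against the plugin docs for the release being benchmarked.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: vary these per run to compare fast vs. slow local I/O and
// different spill pressure. In YARN/K8s deployments the local dir may be
// controlled by the cluster manager instead of spark.local.dir.
val spark = SparkSession.builder()
  .appName("multi-batch-bench")
  .config("spark.local.dir", "/mnt/slow-disk/spark-tmp")             // assumed: a slower (or throttled) mount
  .config("spark.rapids.memory.host.spillStorageSize", "4g")         // assumed config: host-side spill store size
  .config("spark.rapids.sql.batchSizeBytes", (512L << 20).toString)  // assumed config: target GPU batch size
  .getOrCreate()
```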

The proposed queries I want to see are:

  1. A join that explodes so that, for most tasks, there are multiple output batches going into a file write or a shuffle (shapes 1 and 2 are sketched after this list).
  2. A large window operation where the cardinality of the partition-by keys makes it so that most tasks end up with multiple batches worth of data. The window operation itself should ideally be a running window, just to keep that processing simple.
  3. A large aggregation where there is a lot of input data, meaning multiple batches per task, but the output data is less than a single batch in size.
  4. A project where multiple batches are processed per task, but all of the batches are independent of each other.
  5. A parquet read + filter where the parquet read itself is likely to explode in size to multiple batches, and the filter does not filter much out.
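
To make the query shapes concrete, here is a minimal Scala sketch of shapes 1 and 2. It assumes NDS/TPC-DS-style `store_sales` and `item` tables; the input paths, the join-key coarsening, and the partition-by column are all assumptions that would need tuning to actually force multiple batches per task on a given data scale.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object MultiBatchBenchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-batch-bench-sketch").getOrCreate()

    val storeSales = spark.read.parquet("/data/nds/store_sales") // assumed path
    val item       = spark.read.parquet("/data/nds/item")        // assumed path

    // Shape 1: exploding join. Joining on a coarsened key makes each input row
    // match many rows, so most tasks emit multiple output batches into the write.
    storeSales
      .join(item, storeSales("ss_item_sk") % 100 === item("i_item_sk") % 100)
      .write.mode("overwrite").parquet("/tmp/bench/exploded_join")

    // Shape 2: running window. A partition-by key with low enough cardinality
    // that most tasks hold several batches; a running sum keeps the window simple.
    val w = Window
      .partitionBy("ss_store_sk")
      .orderBy("ss_sold_date_sk", "ss_ticket_number")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    storeSales
      .withColumn("running_sales", sum("ss_sales_price").over(w))
      .write.mode("overwrite").parquet("/tmp/bench/running_window")

    spark.stop()
  }
}
```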

These would be in addition to the regular NDS runs that we already do to make sure there are no happy-path performance regressions.

@revans2 added the ? - Needs Triage and feature request labels on Oct 9, 2024
@sameerz added the performance label and removed the feature request label on Oct 15, 2024