Use withRetry in GpuCoalesceBatches #7852

Merged
merged 11 commits into from
Mar 8, 2023

Collaborator

@abellina abellina commented Mar 6, 2023

Closes #7777
Closes #7855

This adds retry and retry+split semantics to the two GpuCoalesceBatches iterators. It does not handle HostToGpuCoalesceIterator, because the column builder code in cuDF needs some changes before it can be made retriable. See #7851.
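To illustrate the split half of "retry+split", here is a minimal, self-contained sketch. It is not the plugin's implementation: `SplitAndRetryOom`, `SplitRetrySketch.withSplitRetry`, and `SplitDemo` are hypothetical names made up for this example, and the plain (non-splitting) retry path is left out for brevity. The idea is that when a batch is too large to process, it is halved and each half is processed in order.

```scala
import scala.annotation.tailrec

// Hypothetical signal meaning "this input is too large, split it and try again".
final class SplitAndRetryOom extends RuntimeException("split and retry")

object SplitRetrySketch {
  // Applies `fn` to `input`. If `fn` throws SplitAndRetryOom, the input is cut
  // in half and each half is processed in order. An input of size 1 cannot be
  // split, so in that case the exception propagates to the caller.
  def withSplitRetry[A, B](input: Seq[A])(fn: Seq[A] => B): Vector[B] = {
    @tailrec
    def loop(pending: List[Seq[A]], done: Vector[B]): Vector[B] = pending match {
      case Nil => done
      case head :: tail =>
        val outcome: Either[List[Seq[A]], B] =
          try Right(fn(head))
          catch {
            case _: SplitAndRetryOom if head.size > 1 =>
              val (left, right) = head.splitAt(head.size / 2)
              Left(List(left, right))
          }
        outcome match {
          case Right(result) => loop(tail, done :+ result)
          case Left(halves) => loop(halves ::: tail, done)
        }
    }
    loop(List(input), Vector.empty)
  }
}

object SplitDemo {
  // Pretend that any batch with more than two rows does not fit in memory.
  def run(): Vector[Int] =
    SplitRetrySketch.withSplitRetry(Seq(1, 2, 3, 4, 5)) { batch =>
      if (batch.size > 2) throw new SplitAndRetryOom
      batch.sum
    }
}
```

In this sketch the work function must be safe to call again on a smaller input, which is the same idempotence requirement the retry block places on its body.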

This introduces a withRetryNoSplit that doesn't take any arguments; it is simply a retry block:

withRetryNoSplit { 
  // do some idempotent work
}
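As a rough illustration of what an argument-free retry block can look like, the sketch below re-runs a by-name body when a retryable failure is thrown. This is an assumption-laden sketch, not the code in this PR: `RetryableOom`, the `MaxAttempts` cap, and the `RetrySketch` and `RetryDemo` objects are hypothetical and exist only for this example.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for the exception an allocator would raise to ask
// the caller to try again; not the plugin's actual exception type.
final class RetryableOom(msg: String) extends RuntimeException(msg)

object RetrySketch {
  // Assumed cap so the sketch cannot loop forever.
  private val MaxAttempts = 5

  // Runs `body` until it completes without a RetryableOom. The body may be
  // evaluated several times, so it must be idempotent. After MaxAttempts
  // failures the last RetryableOom propagates to the caller.
  def withRetryNoSplit[T](body: => T): T = {
    @tailrec
    def attempt(n: Int): T = {
      val result: Option[T] =
        try Some(body)
        catch {
          case _: RetryableOom if n < MaxAttempts => None
        }
      result match {
        case Some(value) => value
        case None => attempt(n + 1)
      }
    }
    attempt(1)
  }
}

object RetryDemo {
  // Fails twice with a simulated OOM, then succeeds on the third call.
  def run(): (String, Int) = {
    var calls = 0
    val result = RetrySketch.withRetryNoSplit {
      calls += 1
      if (calls < 3) throw new RetryableOom("simulated OOM")
      "done"
    }
    (result, calls)
  }
}
```

Because the body is a by-name parameter, the call site keeps the `withRetryNoSplit { ... }` shape shown above.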

Additionally, this PR adjusts the parallelism for the UCX shuffle smoke test to 1, so we don't run the risk of OOMing.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Collaborator Author

abellina commented Mar 6, 2023

build

Collaborator Author

abellina commented Mar 6, 2023

build

Collaborator Author

abellina commented Mar 7, 2023

build

revans2 previously approved these changes Mar 7, 2023

Collaborator

@revans2 revans2 left a comment


Just a nit and a question now.

Collaborator Author

abellina commented Mar 7, 2023

build

revans2 previously approved these changes Mar 7, 2023

Collaborator Author

abellina commented Mar 7, 2023

I am looking into the build failure; it is in the smoke test for UCX.

Collaborator Author

abellina commented Mar 7, 2023

So what is happening in the smoke test is that UCX fails to come up because we fail to allocate bounce buffers. The likely cause is that a previous test is still running on the GPU and holding all of its memory. I don't have a solution yet...

Collaborator Author

abellina commented Mar 7, 2023

build

Collaborator Author

abellina commented Mar 8, 2023

build

Collaborator Author

abellina commented Mar 8, 2023

build

Collaborator Author

abellina commented Mar 8, 2023

@revans2 this should be ready to go in. The run is clean, and all the directories show no leaks or double closes. nvidia-smi shows that we ended up on a V100 with 32GB, so I think the theory that multiple apps sharing the GPU caused the UCX OOM holds.

@revans2 revans2 merged commit 7477bc4 into NVIDIA:branch-23.04 Mar 8, 2023
@abellina abellina deleted the retry/coalesce branch March 8, 2023 23:12
@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Mar 10, 2023