[FEA] Add host memory task metrics and explore a host memory allocation watch dog. #8880

Closed
6 tasks done
revans2 opened this issue Jul 31, 2023 · 0 comments · Fixed by #9509
Labels
reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin), task (Work required that improves the product but is not user facing)

Comments

revans2 commented Jul 31, 2023

Is your feature request related to a problem? Please describe.
This is a follow-on to #8879. We are probably not going to get all of the blocking code 100% right on the first attempt, especially if both GPU memory allocations and host memory allocations can block tasks. We should look at adding a watchdog of some kind that can detect when nothing has happened to a task for a long time and then try to break the potential deadlock with an exception. This part is experimental and might not work out, which is why we are going to explore it. As part of this work we should also add metrics.
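As a rough illustration of the watchdog idea, the sketch below tracks the last time each task made progress and reports tasks that have been idle longer than a timeout. Everything in it is an assumption made for illustration: `TaskWatchdog`, `recordProgress`, `taskDone`, and `findStalledTasks` are hypothetical names, not part of the plugin, and the clock is injected only so the logic can be exercised without sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical watchdog sketch; not the plugin's actual implementation. */
class TaskWatchdog {
    private final long timeoutNanos;
    // Injectable clock (assumption) so the sketch can be tested deterministically.
    private final LongSupplier clock;
    // Task id -> timestamp (in clock units) of the last observed progress.
    private final Map<Long, Long> lastProgress = new ConcurrentHashMap<>();

    TaskWatchdog(long timeoutNanos, LongSupplier clock) {
        this.timeoutNanos = timeoutNanos;
        this.clock = clock;
    }

    /** Call whenever a task allocates, frees, or otherwise makes progress. */
    void recordProgress(long taskId) {
        lastProgress.put(taskId, clock.getAsLong());
    }

    /** Call when a task finishes so it is no longer monitored. */
    void taskDone(long taskId) {
        lastProgress.remove(taskId);
    }

    /** Returns the ids of tasks with no recorded progress for longer than the timeout. */
    List<Long> findStalledTasks() {
        long now = clock.getAsLong();
        List<Long> stalled = new ArrayList<>();
        for (Map.Entry<Long, Long> e : lastProgress.entrySet()) {
            if (now - e.getValue() > timeoutNanos) {
                stalled.add(e.getKey());
            }
        }
        return stalled;
    }
}
```

A background thread could call `findStalledTasks()` periodically and wake one of the stalled tasks with an exception. How that exception is delivered to a blocked task, and which task to pick, are the open questions this issue proposes to explore.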

We will need metrics for the time taken by blocking and by spilling. I think we want five new task-level metrics, and we will remove two that we currently have. Right now we report the amount of time spent spilling, but that is measured at the GPU spill entry point. Because we are adding a new entry point where a host allocation can also cause a spill, we want to split it up. The new metrics would be: the time a task was blocked on host memory allocation; the time spent transferring data from the GPU to the host; the time the task spent spilling from host memory to disk; the time spent reading spilled data back into host memory; and the time spent reading spilled data back into GPU memory.
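The five metrics described above could be accumulated per task along the following lines. This is only a sketch under stated assumptions: `SpillMetric`, `TaskSpillMetrics`, and the individual metric names are hypothetical and do not reflect the names the plugin actually uses or how it reports values to Spark.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical names for the five proposed task-level metrics. */
enum SpillMetric {
    HOST_ALLOC_BLOCKED,  // time blocked waiting on a host memory allocation
    GPU_TO_HOST_SPILL,   // time transferring data from the GPU to the host
    HOST_TO_DISK_SPILL,  // time spilling from host memory to disk
    READ_SPILL_TO_HOST,  // time reading spilled data back into host memory
    READ_SPILL_TO_GPU    // time reading spilled data back into GPU memory
}

/** Hypothetical per-task accumulator; values are elapsed nanoseconds. */
class TaskSpillMetrics {
    private final Map<SpillMetric, AtomicLong> nanos = new EnumMap<>(SpillMetric.class);

    TaskSpillMetrics() {
        // Pre-populate every metric so later updates never modify the map itself.
        for (SpillMetric m : SpillMetric.values()) {
            nanos.put(m, new AtomicLong());
        }
    }

    /** Adds an already measured duration to the given metric. */
    void add(SpillMetric metric, long elapsedNanos) {
        nanos.get(metric).addAndGet(elapsedNanos);
    }

    /** Returns the total nanoseconds recorded so far for the given metric. */
    long get(SpillMetric metric) {
        return nanos.get(metric).get();
    }

    /** Runs the body and charges its wall-clock duration to the given metric. */
    void time(SpillMetric metric, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            add(metric, System.nanoTime() - start);
        }
    }
}
```

Keeping each direction of data movement in its own counter means a spill triggered from the host allocation entry point and one triggered from the GPU entry point are charged to the same metrics, which is the reason for splitting up the single spill-time value.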

Tasks

revans2 added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Jul 31, 2023
revans2 added the "task" (Work required that improves the product but is not user facing) and "reliability" (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels and removed the "feature request" label on Jul 31, 2023
mattahrens removed the "? - Needs Triage" label on Aug 8, 2023