job-exec module needs a rewrite and new protocol #3346
This work still needs to be done, but some of the requirements have changed.
Some other items to think about in the redesign:
I'm sure there are other things I'm missing at the moment.
I wanted to document a side discussion I had with @grondo: we will also want a way to configure which "exec" mechanism to use. There should be a:
the current mechanism in (Related to #3970 but not quite the same)
There are several (more than several) open issues against the job-exec module. This module really needs a complete re-implementation, perhaps first with a thoughtful RFC describing the protocol so we avoid corner cases.
Outstanding issues in the current design include:
The following use case should also be considered in the design:
I think moving the launch of job shells off of rank 0 to the first rank of the job will help a lot with job throughput, and it would also give the exec module a way to possibly restart on rank 0. We'll have to think about how the job-exec system rediscovers running jobs at restart.
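To make the "first rank of the job" idea concrete, here is a rough sketch in Python. It is not the job-exec implementation or any existing Flux API. The names `launch_rank` and `ranks_to_query`, and the use of the lowest rank as the "first rank", are assumptions made purely for illustration. It shows how a restarted module on rank 0 might work out which ranks to ask about running jobs, assuming it can still obtain each job's assigned ranks from somewhere.

```python
from typing import Dict, Iterable, List


def launch_rank(job_ranks: Iterable[int]) -> int:
    """Hypothetical helper: pick the rank that launches a job's shells.

    Assumes "first rank of the job" means the lowest broker rank
    assigned to the job, instead of always using rank 0.
    """
    ranks = sorted(set(job_ranks))
    if not ranks:
        raise ValueError("job has no ranks assigned")
    return ranks[0]


def ranks_to_query(jobs: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Hypothetical helper: group job ids by their launch rank.

    After a restart, rank 0 could contact each key in the result and
    ask about the jobs listed under it. How that query is actually
    carried out is left open here.
    """
    by_rank: Dict[int, List[int]] = {}
    for jobid, ranks in jobs.items():
        by_rank.setdefault(launch_rank(ranks), []).append(jobid)
    # Sort job ids so the result is deterministic.
    return {rank: sorted(ids) for rank, ids in by_rank.items()}


if __name__ == "__main__":
    # Made-up job ids and rank assignments, for illustration only.
    jobs = {100: [0, 1], 101: [2, 3], 102: [3, 2, 5]}
    print(ranks_to_query(jobs))
```

With launch spread across ranks like this, rank 0 only needs enough state to reconstruct the job-to-rank mapping. The open question of where that mapping is stored across a restart is not addressed by the sketch.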