job-exec module needs a rewrite and new protocol #3346

Open
Tracked by #4433
grondo opened this issue Nov 17, 2020 · 2 comments
Labels
design don't expect this to ever be closed...

Comments

@grondo
Contributor

grondo commented Nov 17, 2020

There are several (more than several) open issues against the job-exec module. This module really needs a complete re-implementation, perhaps first with a thoughtful RFC describing the protocol so we avoid corner cases.

Outstanding issues in the current design include:

The following use case should also be considered in the design:

I think moving the launch of job shells off of rank 0 to the first rank of the job will help a lot with job throughput, and will also give the exec module a way to possibly restart on rank 0. We'll have to think about how the job-exec system rediscovers running jobs at restart.
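
To illustrate the throughput argument above, here is a minimal sketch of how launch work spreads out when each job's shells are started from the job's first rank instead of rank 0. It assumes "first rank" means the lowest-numbered broker rank in the job's allocation; the function names and the `jobs` mapping are hypothetical and are not part of job-exec or any Flux API.

```python
from collections import Counter


def leader_rank(job_ranks):
    """Return the rank that would launch this job's shells.

    Assumption: the "first rank of the job" is the lowest-numbered
    broker rank in the job's allocation.
    """
    ranks = list(job_ranks)
    if not ranks:
        raise ValueError("a job must be allocated at least one rank")
    return min(ranks)


def launch_load(jobs):
    """Count how many jobs each rank is responsible for launching.

    `jobs` is a hypothetical mapping of job id -> iterable of broker ranks.
    """
    return Counter(leader_rank(ranks) for ranks in jobs.values())


if __name__ == "__main__":
    jobs = {101: [0, 1], 102: [2, 3], 103: [4, 5, 6]}
    # Launch responsibility is spread over several ranks rather than
    # concentrated on rank 0.
    print(dict(launch_load(jobs)))
```

With launches driven from rank 0 alone, every job in this example would count against a single rank. The sketch does not address how running jobs are rediscovered after a restart, which remains an open design question.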

@grondo
Contributor Author

grondo commented Jan 24, 2022

This work still needs to be done, but some of the requirements have changed.

  • Now that the strategy to support broker restart includes libsdprocess, moving execution of the job shell directly to the involved broker ranks is required, because the sdexec exec implementation can only work locally. (So this is no longer only about allowing restart of rank 0, or only about scalability.) It is an open question whether normal execution should also work this way.
  • Some of the issues referenced above have already been solved, but we should not regress them (e.g. reporting of early launch errors, handling job shell output).

Some other items to think about in the redesign:

I'm sure there are other things I'm missing at the moment.

@chu11
Member

chu11 commented Mar 25, 2022

Wanted to document a side discussion I had with @grondo: we will also want a way to configure which "exec" mechanism to use. The configuration should:

  • provide a default mechanism (subprocess, systemd, etc.)
  • allow the instance owner to use a non-default mechanism (e.g. testexec or something else)
  • not allow guests to choose a non-default mechanism

The current mechanism in job-exec for selecting which exec method to use makes the above somewhat tricky to do, so it has been deferred for now.
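
The three rules above could be sketched roughly as follows. This is only an illustration of the policy, not how job-exec selects an exec method today: `select_exec_method`, `ExecPolicyError`, the choice of `subprocess` as the default, and the set of known mechanisms are all assumptions made for the example.

```python
class ExecPolicyError(Exception):
    """Raised when a job asks for an exec mechanism it may not use."""


# Assumed for illustration; a real instance would take these from config.
DEFAULT_EXEC = "subprocess"
KNOWN_EXEC = frozenset({"subprocess", "systemd", "testexec"})


def select_exec_method(requested, job_userid, owner_userid,
                       default=DEFAULT_EXEC, known=KNOWN_EXEC):
    """Pick the exec mechanism for a job.

    - No request, or a request for the default: use the default.
    - A non-default request from the instance owner: allowed if known.
    - A non-default request from a guest: rejected.
    """
    if requested is None or requested == default:
        return default
    if requested not in known:
        raise ExecPolicyError(f"unknown exec mechanism: {requested!r}")
    if job_userid != owner_userid:
        raise ExecPolicyError(
            "guests may not choose a non-default exec mechanism"
        )
    return requested


if __name__ == "__main__":
    owner, guest = 1000, 1001
    print(select_exec_method(None, guest, owner))
    print(select_exec_method("testexec", owner, owner))
    try:
        select_exec_method("testexec", guest, owner)
    except ExecPolicyError as exc:
        print("rejected:", exc)
```

Keeping the decision in one function that sees both the requested mechanism and the submitting user is the point of the sketch: the guest restriction then becomes a single check rather than something each exec implementation has to enforce.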

(Related to #3970 but not quite the same)
