System Instance L2 Resiliency

Flux System Instance Resiliency Planning

Initial Flux system instance resiliency will hinge on two main principles. (See also [reference to broker resiliency docs here?])

- Subtree panics: Whenever a broker becomes unresponsive or goes down, affected RPCs receive an error response and its entire subtree is restarted. Recovering brokers then reconnect to the instance.
- Recoverable jobs: After a subtree panic, recovering brokers "rediscover" all running jobs, so that no running jobs are lost.

Successful implementation of both of these principles shall be termed Level 2 (L2) resiliency.

Exit criteria: A restart of the rank 0 broker with active jobs, both running and pending, results in a restart of all brokers participating in the system instance. After the restart, running and pending jobs are recovered.
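
To make the subtree panic principle concrete, the following minimal sketch computes which broker ranks would be restarted when a given rank fails. It assumes a k-ary tree overlay in which the children of rank `r` are ranks `k*r + 1` through `k*r + k`; the function name `subtree_ranks` and the default fanout of 2 are hypothetical and chosen only for illustration.

```python
from collections import deque


def subtree_ranks(failed_rank, size, fanout=2):
    """Return the ranks in the subtree rooted at failed_rank, in BFS order.

    Assumption: a k-ary tree where the children of rank r are
    fanout*r + 1 .. fanout*r + fanout, limited to ranks below size.
    """
    affected = []
    queue = deque([failed_rank])
    while queue:
        rank = queue.popleft()
        affected.append(rank)
        first_child = fanout * rank + 1
        last_child = min(first_child + fanout, size)  # exclusive bound
        queue.extend(range(first_child, last_child))
    return affected
```

Under these assumptions, `subtree_ranks(1, size=7)` returns `[1, 3, 4]`, while `subtree_ranks(0, size=7)` returns every rank, which matches the exit criteria: a restart of rank 0 restarts all brokers in the instance.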

Task Breakdown

Recoverable jobs
- Replace fork/waitpid with a mechanism which allows "rediscovery" of broker subprocesses.
  - Investigate use of cgroups and/or systemd for this purpose
  - Investigate whether a libsubprocess flag or a job-exec-specific facility would be the most effective approach
- Redesign/rewrite job-exec to support recoverable jobs and to address the issues described in flux-core #3346
- Design and implement job shell "detached" mode
  - The job shell will need to operate temporarily in a mode where it has lost its connection to the local broker
  - Reconnection will be driven by job rediscovery
- Preserve guest KVS namespaces across rank 0 restart
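
The reason fork/waitpid has to be replaced is that `waitpid` only works on children of the calling process, and after a restart the surviving job processes are no longer children of the new broker. The sketch below illustrates the "rediscovery" idea with a deliberately simple stand-in: a state directory of pid records plus a signal 0 liveness probe. This is not the cgroups or systemd approach under investigation, and the names `record_job`, `pid_alive`, and `rediscover_jobs` are hypothetical.

```python
import json
import os


def record_job(state_dir, jobid, pid):
    """Persist a job's pid so a restarted supervisor can find it later."""
    path = os.path.join(state_dir, f"{jobid}.json")
    with open(path, "w") as f:
        json.dump({"jobid": jobid, "pid": pid}, f)


def pid_alive(pid):
    """Probe a process that need not be our child (waitpid cannot do this)."""
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def rediscover_jobs(state_dir):
    """Return {jobid: pid} for recorded jobs whose process still exists.

    Caveat: pids can be reused by unrelated processes, so a real
    mechanism would need a stronger identity than the pid alone.
    """
    running = {}
    for name in sorted(os.listdir(state_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(state_dir, name)) as f:
            entry = json.load(f)
        if pid_alive(entry["pid"]):
            running[entry["jobid"]] = entry["pid"]
    return running
```

A restarted supervisor would call `rediscover_jobs(state_dir)` once at startup and resume tracking whatever it returns. The pid reuse caveat noted in the code is one general weakness of this kind of scheme, which a cgroup- or systemd-based mechanism may be better placed to avoid.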

Broker resiliency model
- Document resiliency mode/protocol in broker docs #3804
- Brokers detect upstream peer reboot and take themselves down #3608
- Implement RPC "health check" to help diagnose stuck services #2797
- broker: track RPC state and send error responses for lost peers #3800
- Administratively declare live broker peers dead to cause RPCs to fail fast and force a subtree restart. #3805
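
The RPC state tracking and fail-fast items above can be sketched as a small bookkeeping structure: remember which requests are outstanding to each peer, and when a peer is lost (or administratively declared dead), complete every outstanding request with an error instead of letting it hang. This is an illustrative sketch only; the class `RpcTracker`, its method names, and the use of `EHOSTUNREACH` as the error code are assumptions, not the broker's actual design.

```python
import errno


class RpcTracker:
    """Track outstanding RPCs per peer so they can fail fast on peer loss."""

    def __init__(self):
        # peer rank -> {request tag: callback(errnum, payload)}
        self._pending = {}

    def request_sent(self, peer, tag, on_response):
        """Record an outstanding request to peer, identified by tag."""
        self._pending.setdefault(peer, {})[tag] = on_response

    def response_received(self, peer, tag, payload):
        """Deliver a normal response and stop tracking the request."""
        callback = self._pending.get(peer, {}).pop(tag, None)
        if callback is not None:
            callback(0, payload)

    def peer_lost(self, peer):
        """Fail all requests outstanding to peer; return how many failed."""
        lost = self._pending.pop(peer, {})
        for callback in lost.values():
            # Hypothetical choice of error code for illustration
            callback(errno.EHOSTUNREACH, None)
        return len(lost)
```

In this sketch, an administrative "declare dead" action would simply call `peer_lost()` for a peer that is still nominally connected, so callers see an error immediately rather than waiting on a stuck service.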

Planning

Aug 2021:

- Design or prototype the fork/exec/waitpid replacement scheme
- Design or prototype RPC state tracking in the broker/handle
- Refine the work breakdown based on results

Oct 2021:

- Job shell offline mode implementation

Dec 2021:

- Demonstrate restart of a size=1 instance

Feb 2022:

- Demonstrate L2 resiliency on Fluke: restart rank 0 with running and pending jobs, with no loss of jobs
- Scale testing on larger systems as determined by need and system availability
