
System Instance L2 Resiliency

Jim Garlick edited this page Jul 29, 2021 · 4 revisions

Flux System Instance Resiliency Planning

Initial Flux system instance resiliency will hinge on two main principles. (See also [reference to broker resiliency docs here?])

  1. Subtree panics: Whenever a broker becomes unresponsive or goes down, RPCs will get an error and the entire subtree will be restarted. Recovering brokers will reconnect to the instance.

  2. Recoverable jobs: After a subtree panic, recovering brokers will "rediscover" all running jobs so that no running jobs are lost.

Successful implementation of both of these principles shall be termed Level 2 or L2 Resilience.
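As a rough illustration of the subtree panic principle, the sketch below computes which broker ranks would restart when a given rank is lost. It assumes a k-ary tree laid out in rank order (children of rank r are k*r+1 through k*r+k) and assumes the restarted subtree is rooted at the lost broker. The function names and the topology are illustrative assumptions, not something this page specifies.

```python
def children(rank, size, k=2):
    """Child ranks of `rank` in a k-ary tree laid out in rank order.

    The layout is an assumption made for illustration only.
    """
    first = k * rank + 1
    return [r for r in range(first, first + k) if r < size]


def subtree(rank, size, k=2):
    """All ranks at or below `rank`: the set restarted if `rank` is lost."""
    ranks = [rank]
    for child in children(rank, size, k):
        ranks.extend(subtree(child, size, k))
    return ranks


if __name__ == "__main__":
    # Losing rank 1 in a 7-broker binary tree takes down ranks 1, 3 and 4.
    print(sorted(subtree(1, size=7)))  # [1, 3, 4]
    # Losing rank 0 takes down every broker in the instance.
    print(sorted(subtree(0, size=7)))  # [0, 1, 2, 3, 4, 5, 6]
```

Under these assumptions, losing rank 0 restarts every broker, which matches the exit criteria for the system instance.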

Exit criteria: A restart of the rank 0 broker with active jobs, both running and pending, results in a restart of all brokers participating in the system instance. After restart, running and pending jobs are recovered.
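Recovering running jobs after a restart means a broker must find processes it did not fork itself, and `waitpid` only reports on a process's own children. The sketch below shows the general shape of "rediscovery" using a persisted pid table and a signal 0 probe. This is not the cgroups or systemd approach the plan proposes to investigate; the state file format and all function names are hypothetical and chosen only to illustrate the problem.

```python
import json
import os


def load_jobs(state_path):
    """Read the persisted {jobid: pid} table, or an empty table if none exists."""
    try:
        with open(state_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def record_job(state_path, jobid, pid):
    """Persist the job-to-pid mapping so it survives a broker restart."""
    jobs = load_jobs(state_path)
    jobs[str(jobid)] = pid
    with open(state_path, "w") as f:
        json.dump(jobs, f)


def is_alive(pid):
    """Probe a pid with signal 0, which works for non-children, unlike waitpid."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def rediscover(state_path):
    """Return {jobid: pid} for recorded jobs whose process still exists."""
    return {j: p for j, p in load_jobs(state_path).items() if is_alive(p)}
```

A bare pid probe cannot return an exit status and is vulnerable to pid reuse, so it should be read as a statement of the problem rather than a candidate design.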

Task Breakdown

  • Recoverable jobs

    • Replace fork/waitpid with a mechanism which allows "rediscovery" of broker subprocesses.
      • Investigate use of cgroups and/or systemd for this purpose
      • Investigate whether a libsubprocess flag or a job-exec-specific facility would be more effective
    • Redesign/rewrite job-exec to support recoverable jobs and to address the issues described in flux-core #3346
    • Design and implement job shell "detached" mode
      • job shell will need to be able to operate temporarily in a mode where it has lost connection to the local broker.
      • reconnect will be driven by job rediscovery
    • Preserve guest KVS namespaces across rank 0 restart
  • Broker resiliency model:

    • Document resiliency mode/protocol in broker docs #3804
    • Brokers detect upstream peer reboot and take themselves down #3608
    • Implement RPC "health check" to help diagnose stuck services: #2797
    • broker: track RPC state and send error responses for lost peers #3800
    • Administratively declare live broker peers dead to cause RPCs to fail fast and force a subtree restart. #3805
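The RPC state tracking and "declare a peer dead" items above can be pictured with a small bookkeeping structure: remember which requests are outstanding to each peer, and when a peer is lost (detected or administratively declared), send an error response for each one so callers fail fast. This is a minimal sketch and not the broker's implementation. The class, its method names, the use of a per-request tag, and the choice of `EHOSTUNREACH` are all illustrative assumptions.

```python
import errno
from collections import defaultdict


class RpcTracker:
    """Track requests outstanding to each peer (hypothetical helper)."""

    def __init__(self):
        # peer rank -> {request tag: callback that delivers a response}
        self.pending = defaultdict(dict)

    def request_sent(self, peer, tag, respond):
        self.pending[peer][tag] = respond

    def response_received(self, peer, tag):
        # A normal response retires the request.
        self.pending[peer].pop(tag, None)

    def peer_lost(self, peer, errnum):
        """Fail every request still outstanding to `peer`; return the count."""
        lost = self.pending.pop(peer, {})
        for tag, respond in lost.items():
            respond(tag, errnum)
        return len(lost)


if __name__ == "__main__":
    tracker = RpcTracker()
    failed = []
    tracker.request_sent(3, 1, lambda tag, err: failed.append((tag, err)))
    tracker.request_sent(3, 2, lambda tag, err: failed.append((tag, err)))
    tracker.response_received(3, 1)          # request 1 completed normally
    tracker.peer_lost(3, errno.EHOSTUNREACH)  # request 2 gets an error response
    print(failed == [(2, errno.EHOSTUNREACH)])  # True
```

The same `peer_lost` entry point would serve both automatic detection and an administrative override, which is why the sketch keeps detection out of the tracker.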

Planning

Aug 2021:

  • design or prototype for fork/exec/waitpid replacement scheme
  • design or prototype for RPC state tracking in broker/handle
  • refine work breakdown based on results

Oct 2021:

  • job shell "detached" (offline) mode implementation

Dec 2021:

  • Demonstrate restart of a size=1 instance

Feb 2022:

  • Demonstrate L2 resiliency on Fluke: restart rank 0 with running and pending jobs with no loss of jobs
  • Scale testing on larger systems as determined by need and system availability.