system instance use cases for TOSS4 deliverable

Jim Garlick edited this page Jan 9, 2020 · 1 revision

Assumptions:

  • Compute nodes do fail unexpectedly
  • Service nodes do NOT fail unexpectedly

Use cases:

  1. Brokers start in any order via systemd

    • While its configured TBON parent is down, a broker blocks and keeps trying to connect
    • Broker rejects TBON connections until it has fully bootstrapped
    • Broker rejects local:// connections until it has fully bootstrapped
    • No brokers are ready until the leader (TBON root) broker is ready
    • Follower broker restores state (if any) from TBON parent during bootstrap
    • Leader broker restores state (if any) from disk
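
The start-in-any-order behavior above can be sketched as a toy model. This is only an illustration, not Flux code: the `Broker` class, `try_bootstrap`, and `bootstrap_with_retry` are hypothetical names, no real networking is involved, and the retry loop is bounded here so the example terminates, whereas the use case calls for retrying indefinitely.

```python
import time


class Broker:
    """Toy model of one broker in the TBON (hypothetical, not the Flux API)."""

    def __init__(self, rank, parent=None):
        self.rank = rank
        self.parent = parent    # None for the leader (TBON root)
        self.running = False    # process has been started, e.g. by systemd
        self.ready = False      # fully bootstrapped
        self.state = None

    def accept(self):
        """Accept TBON or local connections only once fully bootstrapped."""
        return self.running and self.ready

    def try_bootstrap(self, disk_state=None):
        """Make one bootstrap attempt; return True once the broker is ready."""
        self.running = True
        if self.parent is None:
            self.state = disk_state         # leader restores state from disk
        elif self.parent.accept():
            self.state = self.parent.state  # follower restores from its parent
        else:
            return False                    # parent down or not ready yet
        self.ready = True
        return True


def bootstrap_with_retry(broker, attempts, delay=0.0, disk_state=None):
    """Retry bootstrap up to `attempts` times (assumption: bounded for the demo)."""
    for _ in range(attempts):
        if broker.try_bootstrap(disk_state):
            return True
        time.sleep(delay)
    return False


# Start the brokers leaf-first to show that order does not matter.
leader = Broker(0)
child = Broker(1, parent=leader)
leaf = Broker(2, parent=child)
bootstrap_with_retry(leaf, attempts=2)    # fails: parent not started
bootstrap_with_retry(child, attempts=2)   # fails: leader not ready
bootstrap_with_retry(leader, attempts=1, disk_state={"kvs": "checkpoint"})
bootstrap_with_retry(child, attempts=1)   # now succeeds
bootstrap_with_retry(leaf, attempts=1)    # now succeeds
```

In this model no broker reports ready before the leader does, because readiness only propagates downward from a parent that already accepts connections.
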
  2. Administrative drain/undrain (compute node)

    • Tell scheduler not to schedule new work on compute node
    • Admin may log a reason
    • Failed epilog can be configured to automatically drain a node
    • A drained node stays drained across reboot/instance restart
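
A minimal sketch of the drain bookkeeping described above, under stated assumptions: `DrainTable` and its methods are hypothetical names, and JSON text stands in for whatever persistent store the instance actually uses to carry drain state across a restart.

```python
import json


class DrainTable:
    """Tracks drained compute nodes and the reason, if any (hypothetical)."""

    def __init__(self):
        self.drained = {}   # hostname -> reason ("" when none was logged)

    def drain(self, node, reason=""):
        self.drained[node] = reason

    def undrain(self, node):
        self.drained.pop(node, None)

    def schedulable(self, nodes):
        """Nodes the scheduler may still place new work on."""
        return [n for n in nodes if n not in self.drained]

    def on_epilog_exit(self, node, exit_code, auto_drain=True):
        """Optionally drain a node automatically when its epilog fails."""
        if auto_drain and exit_code != 0:
            self.drain(node, f"epilog failed with exit code {exit_code}")

    def dumps(self):
        """Serialize so the table can survive a reboot or instance restart."""
        return json.dumps(self.drained, sort_keys=True)

    @classmethod
    def loads(cls, text):
        table = cls()
        table.drained = dict(json.loads(text))
        return table


table = DrainTable()
table.drain("node3", "bad DIMM")
table.on_epilog_exit("node5", exit_code=1)
saved = table.dumps()                 # written out before restart
restored = DrainTable.loads(saved)    # read back after restart
print(restored.schedulable(["node3", "node4", "node5"]))
```

Draining only removes a node from consideration for new work; jobs already running on it are untouched in this sketch.
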
  3. Administrative shutdown (compute node)

    • Regular systemd unit shutdown
    • Broker notifies TBON parent of impending disconnect
    • Job-exec raises an exception for any jobs allocated to this node
    • Broker stops
  4. Unplanned outage (compute node)

    • TBON parent detects missing child via missed keepalives
    • Job-exec raises an exception for any jobs allocated to this node
    • (configurable) automatic administrative drain
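
The detection step above might look roughly like the following. This is a sketch only: `KeepaliveMonitor`, `jobs_to_fail`, the timeout value, and the use of caller-supplied timestamps are all assumptions made for illustration, not the broker's actual mechanism.

```python
class KeepaliveMonitor:
    """Parent-side tracking of the last keepalive seen from each child."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a child is lost
        self.last_seen = {}      # child hostname -> time of last keepalive

    def keepalive(self, child, now):
        self.last_seen[child] = now

    def lost_children(self, now):
        """Children whose last keepalive is older than the timeout."""
        return sorted(
            child for child, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def jobs_to_fail(allocations, lost):
    """Job ids with at least one lost node; job-exec would raise on these."""
    lost = set(lost)
    return sorted(job for job, nodes in allocations.items() if nodes & lost)


monitor = KeepaliveMonitor(timeout=30.0)
monitor.keepalive("node1", now=0.0)
monitor.keepalive("node2", now=0.0)
monitor.keepalive("node1", now=25.0)       # node2 has gone silent

lost = monitor.lost_children(now=40.0)     # only node2 exceeds the timeout
allocations = {101: {"node1"}, 102: {"node1", "node2"}, 103: {"node2"}}
failed = jobs_to_fail(allocations, lost)   # jobs 102 and 103
```

The list of lost children could also feed the configurable automatic drain, so that a node that vanished is not handed new work when it returns.
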
  5. Administrative shutdown (full system)

    • Admin shuts down all compute nodes (see use case 3)
    • Once an interior TBON node has no children, it saves state to its parent and exits
    • Leader saves state to disk and exits
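
One ordering that satisfies the full-system shutdown rules above is a post-order walk of the tree, so that every broker exits only after all of its children have. The `shutdown_order` helper and the example tree shape are hypothetical and serve only to illustrate the constraint.

```python
def shutdown_order(children, root=0):
    """Return ranks ordered so each broker exits after all of its children.

    `children` maps a rank to the list of its child ranks in the TBON.
    """
    order = []

    def visit(rank):
        for child in children.get(rank, []):
            visit(child)
        order.append(rank)   # safe to save state upward and exit now

    visit(root)
    return order


# Assumed example: leader 0, interior nodes 1 and 2, compute nodes 3 to 6.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(shutdown_order(tree))   # prints [3, 4, 1, 5, 6, 2, 0]
```

Other orders are equally valid (for instance, all compute nodes first), as long as no broker exits while it still has connected children and the leader goes last.
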
  6. Administrative shutdown (service nodes only), i.e. without killing running jobs

    • Admin shuts down as in the full-system case above, except that compute nodes only checkpoint state to their parent and tell the parent it is OK to shut down
    • (for now) all service nodes should be shut down together
    • Compute node immediately starts trying to reconnect
    • Any attempt to access instance services from compute node hangs until connection is restored
    • (idea) log a non-fatal exception for all running jobs?
  7. Administrative bringup (service nodes only)

    • Service nodes may start in any order (see use case 1)
    • Compute nodes establish connection with TBON parent
  8. Rolling OS update

    • Use admin drain so that each compute node can be rebooted once its running jobs complete
    • Use admin shutdown (service nodes only) to stop Flux on service nodes without impacting the workload
    • While service nodes are down, do whatever phased reboots are appropriate