system instance use cases for TOSS4 deliverable

Assumptions:

Compute nodes do fail unexpectedly
Service nodes do NOT fail unexpectedly

Use cases:

Brokers start in any order via systemd
- While configured TBON parent is down, block and keep trying
- Broker rejects TBON connections until it has fully bootstrapped
- Broker rejects local:// connections until it has fully bootstrapped
- No brokers are ready until the leader (TBON root) broker is ready
- Follower broker restores state (if any) from TBON parent during bootstrap
- Leader broker restores state (if any) from disk
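
The start-up rules above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Broker` class, `try_bootstrap`, and `start_in_order` are hypothetical names, not Flux APIs, and "retry" is modeled as repeated passes over the brokers that are still waiting.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Broker:
    """Hypothetical model of one broker in the TBON (not a Flux API)."""
    rank: int
    parent: Optional["Broker"] = None
    ready: bool = False
    state: dict = field(default_factory=dict)

    def try_bootstrap(self, disk_state: dict) -> bool:
        """Attempt one bootstrap step; return True once the broker is ready."""
        if self.ready:
            return True
        if self.parent is None:
            # Leader (TBON root): restore state, if any, from disk.
            self.state = dict(disk_state)
            self.ready = True
        elif self.parent.ready:
            # Follower: restore state, if any, from the TBON parent.
            self.state = dict(self.parent.state)
            self.ready = True
        # Otherwise the parent is down or still bootstrapping:
        # stay blocked and let the caller retry later.
        return self.ready


def start_in_order(brokers, disk_state):
    """Start brokers in arbitrary order; return the order they became ready."""
    ready_order = []
    pending = list(brokers)
    while pending:
        still_pending = []
        for b in pending:
            if b.try_bootstrap(disk_state):
                ready_order.append(b.rank)
            else:
                still_pending.append(b)
        if len(still_pending) == len(pending):
            # No broker made progress, e.g. the leader was never started.
            raise RuntimeError("no broker can bootstrap")
        pending = still_pending
    return ready_order


leader = Broker(rank=0)
mid = Broker(rank=1, parent=leader)
leaf = Broker(rank=2, parent=mid)

# Start the leaf first and the leader last: readiness still flows root-down.
order = start_in_order([leaf, mid, leader], disk_state={"drained": [2]})
print(order)
```

Whatever order the brokers are started in, none becomes ready before the leader, and each follower ends up with the state its parent restored.
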
Administrative drain/undrain (compute node)
- Tell scheduler not to schedule new work on compute node
- Admin may log a reason
- Failed epilog can be configured to automatically drain a node
- A drained node stays drained across reboot/instance restart
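
A minimal sketch of drain state that survives a restart, assuming the drain table is saved to a file whenever it changes. The `DrainRegistry` class and its JSON file are hypothetical and are not the format Flux uses; they only illustrate the "stays drained across restart" requirement.

```python
import json
import os
import tempfile


class DrainRegistry:
    """Hypothetical drain table persisted as JSON (not Flux's on-disk format)."""

    def __init__(self, path):
        self.path = path
        self.drained = {}  # node name -> reason logged by the admin
        if os.path.exists(path):
            # Reload drain state left behind by a previous instance.
            with open(path) as f:
                self.drained = json.load(f)

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.drained, f)

    def drain(self, node, reason=""):
        """Mark a node as drained, optionally recording a reason."""
        self.drained[node] = reason
        self._save()

    def undrain(self, node):
        """Return a node to service."""
        self.drained.pop(node, None)
        self._save()

    def schedulable(self, nodes):
        """Nodes the scheduler may still place new work on."""
        return [n for n in nodes if n not in self.drained]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "drain.json")
    reg = DrainRegistry(path)
    reg.drain("node7", reason="failed epilog")

    # Simulate an instance restart by building a fresh registry from disk.
    restarted = DrainRegistry(path)
    after_restart = restarted.schedulable(["node6", "node7"])
    reason = restarted.drained["node7"]

print(after_restart, reason)
```
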
Administrative shutdown (compute node)
- Regular systemd unit shutdown
- Broker notifies TBON parent of impending disconnect
- Job-exec raises exception for any jobs allocated to this node
- Broker stops
Unplanned outage (compute node)
- TBON parent detects missing child via missed keepalives
- Job-exec raises exception for any jobs allocated to this node
- (configurable) automatic administrative drain
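
The unplanned-outage steps can be sketched as follows. This assumes a child is declared lost once no keepalive has arrived within a fixed timeout; the `KeepaliveMonitor` class, its method names, and the timeout value are all hypothetical. Timestamps are passed in explicitly so the example is deterministic.

```python
class KeepaliveMonitor:
    """Hypothetical TBON parent tracking keepalives from its children."""

    def __init__(self, timeout, auto_drain=False):
        self.timeout = timeout        # seconds of silence before a child is lost
        self.auto_drain = auto_drain  # the "(configurable)" automatic drain
        self.last_seen = {}           # child -> time of last keepalive
        self.jobs = {}                # child -> set of job ids allocated to it
        self.drained = set()

    def keepalive(self, child, now):
        self.last_seen[child] = now

    def allocate(self, job_id, child):
        self.jobs.setdefault(child, set()).add(job_id)

    def check(self, now):
        """Return (job_id, message) exceptions for jobs on lost children."""
        exceptions = []
        for child, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                del self.last_seen[child]
                for job_id in sorted(self.jobs.pop(child, ())):
                    exceptions.append((job_id, f"lost node {child}"))
                if self.auto_drain:
                    self.drained.add(child)
        return exceptions


monitor = KeepaliveMonitor(timeout=10, auto_drain=True)
monitor.keepalive("a", now=0)
monitor.keepalive("b", now=0)
monitor.allocate(101, "a")
monitor.keepalive("b", now=8)   # "b" keeps reporting, "a" goes silent

raised = monitor.check(now=12)
print(raised, monitor.drained)
```
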
Administrative shutdown (full system)
- Admin shuts down all compute nodes (see above)
- Once interior TBON node has no children, save state to parent and exit
- Leader saves state to disk and exits
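
The shutdown ordering above amounts to a post-order walk of the TBON: children stop before their parent, and the leader stops last. The sketch below assumes the tree is given as a mapping from rank to child ranks and that leaf ranks are the compute nodes; `shutdown_plan` and the example tree are hypothetical.

```python
def shutdown_plan(children, root=0):
    """Return [(rank, action)] in the order brokers stop.

    `children` maps a rank to the list of its child ranks in the TBON.
    """
    plan = []

    def visit(rank):
        kids = children.get(rank, [])
        for child in kids:
            visit(child)  # every child stops before its parent
        if rank == root:
            action = "save state to disk"
        elif kids:
            action = "save state to parent"
        else:
            action = "stop (compute node)"
        plan.append((rank, action))

    visit(root)
    return plan


# Rank 0 is the leader, ranks 1 and 2 are interior nodes, the rest are leaves.
tbon = {0: [1, 2], 1: [3, 4], 2: [5]}
plan = shutdown_plan(tbon)
for rank, action in plan:
    print(rank, action)
```
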
Administrative shutdown (service nodes only), i.e. don't kill running jobs
- Admin shutdown as with full system above, except compute nodes just checkpoint state to parent and tell parent it is OK to shut down
- (for now) all service nodes should be shut down together
- Compute node immediately starts trying to reconnect
- Any attempt to access instance services from compute node hangs until connection is restored
- (idea) log a non-fatal exception for all running jobs?
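
A rough sketch of the compute-node side of a service-node outage. It models "hangs until connection is restored" as requests that are held back and delivered only after the parent returns; a real broker would block the caller instead. The `ComputeNodeLink` class and its method names are hypothetical.

```python
class ComputeNodeLink:
    """Hypothetical link from a compute node broker to its TBON parent."""

    def __init__(self):
        self.connected = True
        self.pending = []    # requests waiting for the parent to come back
        self.delivered = []  # requests that reached instance services

    def request(self, msg):
        if self.connected:
            self.delivered.append(msg)
        else:
            # Stand-in for a hung request: hold it until reconnect.
            self.pending.append(msg)

    def parent_down(self):
        self.connected = False

    def parent_up(self):
        """Reconnect succeeded: release held requests in arrival order."""
        self.connected = True
        self.delivered.extend(self.pending)
        self.pending.clear()


link = ComputeNodeLink()
link.request("kvs lookup 1")
link.parent_down()                 # service nodes shut down
link.request("kvs lookup 2")       # held, not lost
held = list(link.pending)
link.parent_up()                   # service nodes brought back up
print(held, link.delivered)
```
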
Administrative bringup (service nodes only)
- Service nodes may start in any order (see above)
- Compute nodes establish connection with TBON parent
Rolling OS update
- Use admin drain to reboot each compute node after its jobs complete
- Use admin shutdown (service nodes only) to stop flux on service nodes without impacting workload
- While service nodes are down, do whatever phased reboots are appropriate
