contact: update modified time for better crash detection #5511

Open
oliver-sanders opened this issue May 3, 2023 · 1 comment
Labels
question Flag this as a question for the next Cylc project meeting.
Milestone

Comments

@oliver-sanders
Member

Consider using a main loop plugin to update the modification time of the contact file at a scheduled interval for easier crash detection.

Problem

  • When workflows crash, we are sometimes unable to remove the old contact file.
  • The contact file acts as a marker for other Cylc commands (e.g. scan) that a workflow is running.
  • To determine whether a workflow is running or has crashed, you need to ping it.
  • Pinging a workflow requires a network request, which is slow.

Proposed Solution

  • Avoid the need to ping workflows by using the contact file modification time as an indicator of activity.
  • os.utime can be used to set the modification time.
  • cylc scan would not need to ping a workflow until a stale modification time suggests it has crashed.
  • This would only be local, not remote.
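
As a rough illustration of the idea above, here is a minimal sketch of the two halves: refreshing the modification time with os.utime, and checking its age before deciding whether a ping is needed. The function names, the `STALE_AFTER_SECONDS` threshold and the file path are hypothetical and are not part of Cylc; wiring the first function into a main loop plugin is left out.

```python
import os
import time
from pathlib import Path

# Assumption: an illustrative threshold, not an existing Cylc setting.
STALE_AFTER_SECONDS = 600.0


def touch_contact_file(contact_file: Path) -> None:
    """Set the file's access and modification times to the current time.

    Hypothetical helper: a scheduler would call this at a regular interval.
    """
    os.utime(contact_file, None)


def looks_stale(contact_file: Path,
                max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Return True if the file was last modified more than max_age seconds ago.

    Hypothetical helper: a stale file only suggests a crash, so the caller
    would fall back to pinging the workflow to confirm.
    """
    age = time.time() - contact_file.stat().st_mtime
    return age > max_age
```

With this approach a scan would read only file metadata for healthy workflows, and the threshold would need to be comfortably larger than the update interval to avoid false positives.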

See also #4858, #5405 (review)

@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label May 3, 2023
@oliver-sanders oliver-sanders added this to the some-day milestone May 3, 2023
@hjoliver
Member

hjoliver commented May 3, 2023

Good idea, IMO.
