contact: detect crashed workflows #4858

Open
oliver-sanders opened this issue May 5, 2022 · 2 comments

@oliver-sanders
Member

Idea of @dpmatthews

At the moment, if a client connection (ZMQ/TCP) fails, we SSH to the scheduler server where the workflow was running and perform a process listing. If the process is found not to be running, we delete the contact file. This permits the workflow to be rerun, whereas before users would have had to hunt these files down manually.
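For readers unfamiliar with this check, the behaviour described above could be sketched roughly as follows. This is not the actual Cylc implementation: the `KEY=VALUE` contact file layout, the key names `CYLC_WORKFLOW_HOST` and `CYLC_WORKFLOW_PID`, and the use of `ps -p` over SSH are all assumptions made for illustration. The sketch also ignores PID reuse.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict


def parse_contact(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines (assumed contact file layout)."""
    data = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data


def pid_alive_via_ssh(host: str, pid: str) -> bool:
    """Ask the remote host whether the PID exists.

    Assumes `ps -p` exits 0 if the process is listed and 1 if not
    (procps behaviour), and that ssh itself exits 255 on failure.
    """
    proc = subprocess.run(
        ["ssh", host, "ps", "-p", pid],
        capture_output=True,
        check=False,
    )
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    # Could not reach the host: do not guess, leave the file alone.
    raise RuntimeError(f"process listing on {host} failed")


def clean_if_crashed(
    contact: Path,
    is_alive: Callable[[str, str], bool] = pid_alive_via_ssh,
) -> bool:
    """Delete the contact file if the scheduler is gone.

    Returns True if the file was removed.
    """
    info = parse_contact(contact.read_text())
    # Key names below are assumed, not taken from this issue.
    if is_alive(info["CYLC_WORKFLOW_HOST"], info["CYLC_WORKFLOW_PID"]):
        return False
    contact.unlink()
    return True
```

The `is_alive` parameter is injectable so the decision logic can be exercised without a real SSH connection.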

Instead of just deleting these contact files, we could provide a command to list crashed workflows, e.g.:

  • cylc scan --state=crashed
  • cylc play $(cylc scan --state=crashed)

The UIS could use this information and alert users to crashes. Sysadmins could potentially scan for crashed workflows.

Needs a little thought. For example, if we don't remove the contact file, then any client connections (e.g. cylc message commands from orphaned jobs) will continue to attempt to connect to the workflow, which could cause additional load. Perhaps we would want to mv contact contact.crashed or something like that.
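A minimal sketch of the rename idea, under stated assumptions: the contact file is taken to live at `<run-dir>/.service/contact`, and `contact.crashed` is simply the name floated above. The helper names `mark_crashed` and `list_crashed` are hypothetical and exist only for this illustration.

```python
from pathlib import Path
from typing import List

SERVICE_DIR = ".service"     # assumed location of the contact file
CONTACT = "contact"
CRASHED = "contact.crashed"  # hypothetical marker name


def mark_crashed(run_dir: Path) -> Path:
    """Rename contact -> contact.crashed, keeping the crash evidence."""
    src = run_dir / SERVICE_DIR / CONTACT
    dst = run_dir / SERVICE_DIR / CRASHED
    # Path.replace overwrites any older marker left by a previous crash.
    src.replace(dst)
    return dst


def list_crashed(root: Path) -> List[str]:
    """Return run directories (relative to root) holding a crash marker."""
    return sorted(
        marker.parent.parent.relative_to(root).as_posix()
        for marker in root.rglob(CRASHED)
        if marker.parent.name == SERVICE_DIR
    )
```

Because the file no longer has the name clients look for, orphaned jobs would stop finding connection details, while a scan could still report the workflow as crashed.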

Probably a fairly straightforward feature to implement.

Pull requests welcome!

@oliver-sanders oliver-sanders added this to the cylc-8.x milestone May 5, 2022
@hjoliver
Member

hjoliver commented May 6, 2022

Good idea.

Removing the contact file prevents connection attempts and allows a restart, but it gets rid of the crash evidence. Leaving it as-is keeps the evidence, but results in useless connection attempts. Maybe there's a middle ground: add a line to the contact file to indicate that the server is down? (After which deliberate removal of the file would be required.)
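One way this middle ground might look, as a rough sketch only: it assumes the contact file is a list of `KEY=VALUE` lines, and the key `SERVER_DOWN` is a hypothetical name invented here, not an existing Cylc field.

```python
from pathlib import Path

DOWN_KEY = "SERVER_DOWN"  # hypothetical key, not an existing field


def is_marked_down(text: str) -> bool:
    """True if any line of the contact file sets the marker key."""
    return any(
        line.partition("=")[0].strip() == DOWN_KEY
        for line in text.splitlines()
    )


def mark_server_down(contact: Path) -> None:
    """Append the marker line once, leaving existing lines untouched."""
    text = contact.read_text()
    if is_marked_down(text):
        return
    if text and not text.endswith("\n"):
        text += "\n"
    contact.write_text(text + f"{DOWN_KEY}=true\n")
```

A client could call `is_marked_down` before attempting a connection and give up early, while the remaining lines stay available as a record of the crashed run.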

@oliver-sanders
Member Author

See also cylc/cylc-uiserver#257
