At the moment, if a client connection (ZMQ/TCP) fails, we SSH to the scheduler server where the workflow was running and perform a process listing. If the process is found not to be running, we delete the contact file. This permits the workflow to be rerun, whereas before users would have had to hunt these files down manually.
Instead of just deleting these contact files, we could provide a command to list crashed workflows, e.g.:
cylc scan --state=crashed
cylc play $(cylc scan --state=crashed)
The UIS could use this information and alert users to crashes. Sysadmins could potentially scan for crashed workflows.
This needs a little thought. For example, if we don't remove the contact file, any client connections (e.g. cylc message commands from orphaned jobs) will continue to attempt to connect to the workflow, which could cause additional load. Perhaps we would want to mv contact contact.crashed or something like that.
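As a rough illustration of the rename idea, here is a minimal Python sketch. It is not Cylc's implementation: the function names `mark_crashed` and `list_crashed` are hypothetical, the `.crashed` suffix is just the name floated above, and the `<run dir>/.service/contact` layout is assumed for the example.

```python
from pathlib import Path

# Hypothetical marker suffix, taken from the "contact.crashed" suggestion.
CRASHED_SUFFIX = ".crashed"


def mark_crashed(contact: Path) -> Path:
    """Rename a stale contact file so clients no longer find it."""
    crashed = contact.with_name(contact.name + CRASHED_SUFFIX)
    # A rename keeps the file contents as crash evidence.
    contact.replace(crashed)
    return crashed


def list_crashed(run_root: Path) -> list[str]:
    """Return workflow IDs (paths relative to run_root) with a crashed marker.

    Assumes each run directory holds its contact file at .service/contact.
    """
    pattern = "**/.service/contact" + CRASHED_SUFFIX
    return sorted(
        marker.parent.parent.relative_to(run_root).as_posix()
        for marker in run_root.glob(pattern)
    )
```

Because the original contact path no longer exists after the rename, orphaned jobs would fail fast instead of repeatedly trying to connect, while a scan for crashed workflows only needs to look for the renamed file.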
Probably a fairly straightforward feature to implement.
Pull requests welcome!
Removing the contact file (to prevent connection attempts and allow restart) gets rid of the crash evidence, while leaving it as-is will result in useless connection attempts. Maybe there's a middle ground: add a line to the contact file to indicate that the server is down? (After which deliberate removal of the file would be required.)
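A minimal sketch of that middle ground, in Python. Everything here is an assumption for illustration: the contact file is treated as simple KEY=VALUE lines, `SERVER_DOWN` is a made-up key rather than an existing contact-file field, and the function names are hypothetical.

```python
from pathlib import Path

# Hypothetical key; not an existing contact-file field.
DOWN_KEY = "SERVER_DOWN"


def parse_contact(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines (assumed format) into a dict."""
    data = {}
    for line in path.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data


def flag_server_down(path: Path) -> None:
    """Append the marker line once, leaving the existing lines untouched."""
    if DOWN_KEY in parse_contact(path):
        return
    text = path.read_text()
    # Avoid gluing the marker onto an unterminated last line.
    prefix = "" if not text or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}{DOWN_KEY}=true\n")


def should_connect(path: Path) -> bool:
    """Clients would check this before attempting a connection."""
    return path.exists() and DOWN_KEY not in parse_contact(path)
```

With this approach the host and port recorded in the file survive as crash evidence, and a restart would have to be preceded by a deliberate removal of the flagged file.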