
workflow events: fix an issue where "timeout" events would not fire #5959

Merged (3 commits, Mar 25, 2024)

Conversation

@oliver-sanders (Member) commented Feb 7, 2024

The result of attempting to document workflow events here: cylc/cylc-doc#685

I wrote some integration tests and discovered that a couple of events weren't being fired in some situations.

[edit] It turns out that the way the scheduler terminates the subproc pool on shutdown means that, whilst this PR fixes the "event not being fired" problem, the event handler still won't actually run. This part of the issue has been bumped to #5997

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Feb 7, 2024
@oliver-sanders oliver-sanders added this to the cylc-8.2.5 milestone Feb 7, 2024
@oliver-sanders oliver-sanders self-assigned this Feb 7, 2024
cylc/flow/scheduler.py (outdated; resolved)
```python
abort_conf = f"abort on {event}"
if self._get_events_conf(abort_conf):
    # "cylc play" needs to exit with error status here.
    raise SchedulerError(f'"{abort_conf}" is set')
if self._get_events_conf(f"{event} handlers") is not None:
```
oliver-sanders (Member, Author):
This was the source of the bug.

In my case I had `mail events = <event>` set, but not `<event> handlers`; this was preventing mail events from running.

I think the workflow events manager already has logic for determining when events should be run, so this check was superfluous.
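A minimal, self-contained sketch of the before/after gating described here. All names (`fire_event_buggy`, `fire_event_fixed`, the dict-based config) are illustrative stand-ins, not Cylc's real API:

```python
# Hypothetical sketch of the bug: gating every workflow event on the
# presence of "<event> handlers" drops events that only have
# "mail events" configured. Names are illustrative, not Cylc's API.

def fire_event_buggy(conf, event):
    """Only fires when explicit handlers are configured (the bug)."""
    if conf.get(f"{event} handlers") is not None:
        return f"fired {event}"
    return None  # mail-only configurations never get here

def fire_event_fixed(conf, event):
    """Always hands the event on; the events manager decides what runs."""
    return f"fired {event}"

conf = {"mail events": ["inactivity timeout"]}  # mail set, no handlers
print(fire_event_buggy(conf, "inactivity timeout"))  # None: event dropped
print(fire_event_fixed(conf, "inactivity timeout"))
```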

Comment on lines +66 to +64
```python
'simulation': {
    'default run length': 'PT0S',
}
```
oliver-sanders (Member, Author):

Note to self: remove this when merging into master.

Contributor:

Should I let you merge, @oliver-sanders?

* Fix a small bug where "workflow timeout" and "inactivity timeout"
  events might not fire depending on the global config.
* Add integration tests to lock down the behaviour of workflow events.
```python
schd._main_loop = killer

# start the scheduler and wait for it to hit the exception
with pytest.raises(Exception):
```
Member:

flake8-bugbear points out this is "evil" (their words) - I suggest making sure we know it's the correct exception

Suggested change:

```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception, match=r':\('):
```

oliver-sanders (Member, Author) commented Feb 13, 2024:

I'm not sure how that match narrows it down?

I didn't want to test the exception that comes out, as that tests more internal implementation than needed (making the test more fragile), but I could try for a more specific exception if it's important.

Because the same workflow is also run in other tests (sans exception), the presence of any exception should be good enough here?

Member:

I agree it's not too bad considering the full context of the test, but it shouldn't be fragile to narrow it down to the particular exception that's supposed to be raised?

oliver-sanders (Member, Author) commented Feb 13, 2024:

Note, Cylc catches exceptions and re-raises them as something else; this logic is beyond the scope of the test. The test is trying to ensure that if an unexpected error occurs, the handlers are fired. It doesn't matter to the test what exception is raised; all that matters is that the scheduler bails. As such, checking that it raises Exception is sufficient to ensure it's failing, which is all we're interested in here.
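To illustrate the point in plain Python (the class name below is a stand-in, not the actual exception Cylc re-raises): whatever error the scheduler surfaces derives from `Exception`, so a broad catch answers "did it bail?" without pinning down the internals.

```python
# Whatever the scheduler re-raises internally derives from Exception,
# so catching Exception (as pytest.raises(Exception) does) asserts only
# that the scheduler bailed, not how. "SchedulerBail" is a stand-in name.

class SchedulerBail(Exception):
    """Stand-in for whatever internal error the scheduler re-raises."""

def broken_main_loop():
    raise SchedulerBail(":(")

try:
    broken_main_loop()
except Exception as exc:  # same breadth as pytest.raises(Exception)
    caught = exc
else:
    caught = None

print(type(caught).__name__)  # SchedulerBail
```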

Member:

Can we at least turn off the red squiggly line?

Suggested change:

```diff
-with pytest.raises(Exception):
+with pytest.raises(Exception):  # noqa: B017
```

oliver-sanders (Member, Author):

Could do, but we're not really in the habit of making changes outside of our working practices to support people's text editor setups (pyright flags many [mostly invalid] warnings, but we let 'em be).

Member:

(This is the repo-configured flake8; when flake8 is run per-file it ignores the exclude setting in tox.ini.)

oliver-sanders (Member, Author) commented Feb 21, 2024:

(We do not enforce flake8 in the tests, there are many other errors besides this one)

Member:

(9 times out of 10 it would be a no-brainer to follow this rule, so there are benefits to flake8-ing the tests)

@MetRonnie (Member):
```
[scheduler]
    [[events]]
        abort on workflow timeout = True
        workflow timeout = PT1S
        abort handlers = echo "ALPHA"
        workflow timeout handlers = echo "TANGO"
```

The workflow timeout handler is not running when `abort on workflow timeout` is set.

@oliver-sanders (Member, Author) commented Feb 23, 2024:

So, it looks like it actually is firing off the event (and the test picked up on this).

BUT, on workflow abort, we terminate all processes in the subprocpool, which kills the handler command, likely before it has had a chance to run:

```python
async def _shutdown(self, reason: BaseException) -> None:
    """Shutdown the workflow."""
    self._log_shutdown_reason(reason)
    if hasattr(self, 'proc_pool'):
        try:
            self.proc_pool.close()
            if self.proc_pool.is_not_done():
                # e.g. KeyboardInterrupt
                self.proc_pool.terminate()
            self.proc_pool.process()
        except Exception as exc:
            LOG.exception(exc)
```

We could implement a mini-main loop, just to run the subprocpool until it empties naturally, but it could contain all sorts of stuff, like job submission commands which we probably don't want to run.

Dammit. All I wanted to do was to clarify one line in the docs!
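A toy model of the "mini-main loop" idea, with hypothetical names (`ToyProcPool` is not Cylc's SubProcPool): drain the pool until it empties naturally at shutdown. It also demonstrates the drawback mentioned above, since queued job-submission commands drain (and run) too.

```python
# Toy model (hypothetical names, not Cylc's SubProcPool) of draining the
# pool until it empties naturally at shutdown. Note the side effect:
# the queued job-submission command also gets run.
from collections import deque

class ToyProcPool:
    def __init__(self, commands):
        self.queue = deque(commands)
        self.done = []

    def is_not_done(self):
        return bool(self.queue)

    def process(self):
        # run one queued command to completion
        self.done.append(self.queue.popleft())

pool = ToyProcPool(["workflow-event-handler-00", "jobs-submit"])
while pool.is_not_done():  # the proposed shutdown drain loop
    pool.process()

print(pool.done)  # ['workflow-event-handler-00', 'jobs-submit']
```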

@oliver-sanders (Member, Author) commented Feb 23, 2024:

> We could implement a mini-main loop, just to run the subprocpool until it empties naturally

I've just tried this, but it gets messy.

patch:

```diff
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 432ef11ff..687ebbb64 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -1415,6 +1415,7 @@ class Scheduler:
 
         Run workflow events in simulation and dummy mode ONLY if enabled.
         """
+        LOG.warning(f'EVENT {event} - {reason}')
         conf = self.config
         with suppress(KeyError):
             if (
@@ -1925,7 +1926,8 @@ class Scheduler:
                 if self.proc_pool.is_not_done():
                     # e.g. KeyboardInterrupt
                     self.proc_pool.terminate()
-                self.proc_pool.process()
+                while self.proc_pool.is_not_done():
+                    self.proc_pool.process()
             except Exception as exc:
                 LOG.exception(exc)
 
diff --git a/cylc/flow/subprocpool.py b/cylc/flow/subprocpool.py
index c380bdaa3..a539aa8c5 100644
--- a/cylc/flow/subprocpool.py
+++ b/cylc/flow/subprocpool.py
@@ -332,16 +332,38 @@ class SubProcPool:
     def terminate(self):
         """Drain queue, and kill and process remaining child processes."""
         self.close()
+        keep = []
         # Drain queue
         while self.queuings:
-            ctx = self.queuings.popleft()[0]
-            ctx.err = self.ERR_WORKFLOW_STOPPING
-            ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
-            self._run_command_exit(ctx)
+            item = self.queuings.popleft()
+            ctx = item[0]
+            # if ctx.cmd_key.startswith(WORKFLOW_EVENT_HANDLER)
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if cmd_key.startswith('workflow-event-handler'):
+                keep.append(item)
+            else:
+                ctx.err = self.ERR_WORKFLOW_STOPPING
+                ctx.ret_code = self.RET_CODE_WORKFLOW_STOPPING
+                self._run_command_exit(ctx)
+
+        # Re-add event handlers back into the queue
+        for item in keep:
+            self.queuings.append(item)
+
         # Kill remaining processes
         for value in self.runnings:
             proc = value[0]
-            if proc:
+            ctx = value[1]
+            cmd_key = ctx.cmd_key
+            if isinstance(cmd_key, tuple):
+                cmd_key = cmd_key[0]
+            if not proc:
+                # no process to kill
+                continue
+            if not cmd_key.startswith('workflow-event-handler'):
+                # don't kill event handlers
                 _killpg(proc, SIGKILL)
         # Wait for child processes
         self.process()
```

Any job submission commands get killed; the fallout causes the task(s) to be put in the submit-failed state, which in turn causes them to be marked as incomplete.

I'm not sure this has ever worked!

@MetRonnie (Member) left a comment:

Opened new issue for that: #5997

@hjoliver (Member) left a comment:

Looks good, one question though.

Comment on lines -653 to +654
```diff
-        await self.main_loop()
+        while True:  # MAIN LOOP
+            await self._main_loop()
```
Member:

On changing to this, it might make sense (for clarity) to move the sleep code from the end of the _main_loop method into the while block?

oliver-sanders (Member, Author):

Reasonable. I think something (besides the sleep) is using `tinit`, which might be why I left it.

Member:

True. In that case, fine to leave as-is.
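A minimal sketch of the restructure under discussion (all names, the iteration cap, and the interval are illustrative, not Cylc's actual scheduler): an outer `while True` drives `_main_loop`, with the inter-iteration sleep hoisted into the `while` block as suggested above.

```python
# Illustrative sketch (not Cylc's actual scheduler) of the outer
# "while True" main loop, with the inter-iteration sleep hoisted out of
# _main_loop into the while block.
import asyncio

async def _main_loop(state):
    state["iterations"] += 1  # one pass of scheduler work would go here

async def run(state, max_iterations=3, interval=0.01):
    while True:  # MAIN LOOP
        await _main_loop(state)
        if state["iterations"] >= max_iterations:
            break  # stand-in for the real shutdown condition
        await asyncio.sleep(interval)  # the hoisted sleep

state = {"iterations": 0}
asyncio.run(run(state))
print(state["iterations"])  # 3
```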

@oliver-sanders (Member, Author):

Merging with three approvals.

@oliver-sanders oliver-sanders merged commit ce25a81 into cylc:8.2.x Mar 25, 2024
25 checks passed
@oliver-sanders oliver-sanders deleted the test-workflow-events branch March 25, 2024 11:52
4 participants