Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix workflow shutdown on no platforms error #5395

Merged

Conversation

wxtim
Copy link
Member

@wxtim wxtim commented Mar 3, 2023

closes #5392

Note Put up as a draft to allow @oliver-sanders to sanity check.

NoPlatformsError comes from get_platform and will only occur if get platform is given an argument (i.e. it isn't just getting the localhost platform). There are 6 places where this occurs in the code:

  • scripts/cat_log - I think this one is safe, or at least won't cause unexpected scheduler shutdowns.
  • scripts/check_versions - Raising this error would be this script working as expected.
  • task_jobs_mgr:1170 - I think this is where the bug reported cam from.
  • task_pool.TaskPool.load_db_task_pool_for_restart - Called on restart - should not cause a problem because the database does not store platform groups, only platforms.
  • task_events_mgr.TaskEventsManager._process_job_log_retrieval - ctx.platfrom used here contains the actual platform the job was submitted to, and not the platform group, so is safe.
  • subprocpool.SubProcPool._run_command_exit - Similarly uses a context object which will have a selected platform rather than a raw platform group.

@oliver-sanders any idea how to test this in an automated way? Can I create an SSH config for the test shell?

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Some tests written, and no existing tests failed. I'm not completely convinced I've covered all the possible ways this change could go wrong, so reviewers should be creativite trying to break this code.
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

My Manual test Recipe

# .ssh/config
Host bar
    HostName=realhost

Host foo
    HostName=realhost
# global.cylc

    [[foo]]
        hosts = foo
        install target = localhost
    [[bar]]
        hosts = bar
        install target = localhost

[platform groups]
    [[foobar]]
        platforms = foo, bar
# flow.cylc
[scheduling]
    initial cycle point = 2257
    [[graph]]
        R1 = foo

[runtime]
    [[foo]]
        script = echo "Hello World!"
        platform = foobar

Add a breakpoint at task_job_mgr.TaskJobMgr at line 1165 (above `get_platform), then run:

cylc vip myworkflow --no-detach --host localhost

... wait for your breakpoint...

then

self.bad_hosts = {'foo', 'bar'}

Press continue --- and your workflow should shutdown with a NoPlatformsError, on my branch the workflow stalls, rather than shutting down.

@wxtim wxtim marked this pull request as draft March 3, 2023 12:07
@wxtim wxtim changed the base branch from master to 8.1.x March 3, 2023 12:08
@MetRonnie MetRonnie self-requested a review March 3, 2023 12:24
@MetRonnie
Copy link
Member

MetRonnie commented Mar 3, 2023

NoPlatformsError comes from get_platform and will only occur if get platform is given an argument (i.e. it isn't just getting the localhost platform).

I'm not yet convinced of this.


But while investigating I've noticed a slight flaw where this variable is basically unused:

platform_group = None

The only reference to it:

if platform_group:
platform_data['group'] = platform_group

I'm sure this is unrelated to the bug, but worth sorting out while we're at it.


Also, shouldn't there be a break at L201? Or a break combined with removing the reversed() call at L194?

for platform_name_re in reversed(list(platform_groups)):
# Platform is member of a group.
if re.fullmatch(platform_name_re, platform_name):
platform_name = get_platform_from_group(
platform_groups[platform_name_re], group_name=platform_name,
bad_hosts=bad_hosts
)

@wxtim
Copy link
Member Author

wxtim commented Mar 6, 2023

I'm not yet convinced of this.

My reasoning for this in detail:

  • raise NoPlatformsError occurs once at platforms.py:290 in function get_platform_from_group.
  • get_platform_from_group is called once, in platform_from_name.
  • platform_from_name occurs
    • 4 times in get_platform
    • Once in get_platform_from_group, which is then used by get_platform - this sounds circular, except that in get_platform_from_group we ask for a platform_from_name for a platform within a group, so we know that get_platform shouldn't send us back to get_platform_from_group. If you try to have
[platforms]
    [[foo]]
        ...

[platform groups]
    [[bar]]
        platforms = foo
    [[baz]]
        platform = bar, foo

you'll just end up looping more, but end up using foo anyway. Circular references between platform groups don't work either.

  • x
    • Once in get_install_target_to_platforms_map. This is being fed a list of platform names by get_install_target_to_platforms_map, which gets them from workflow_files.remote_clean, which is called by init_clean. That call gets platform names for tasks from the database - I checked - platform groups aren't recorded in the database, which is probably OK. In an ideal world I guess we might want to restart on the same group but not care about the platform.

@wxtim
Copy link
Member Author

wxtim commented Mar 7, 2023

But while investigating I've noticed a slight flaw where this variable is basically unused:

Yes, I've had a check, and I think you're right. I'll remove it.

@wxtim
Copy link
Member Author

wxtim commented Mar 7, 2023

Also, shouldn't there be a break at L201?

Yes, I think there should be.

@wxtim
Copy link
Member Author

wxtim commented Mar 7, 2023

ME

What's the difference between NoPlatformsError and PlatformLookupError

Answer: No Platforms Error is when the platform group definition exists, but the hosts aren't reachable for any platform in group. PlatformLookupError indicates a problem with the global.cylc or flow.cylc - i.e. user has asked for an undefined platform.

@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch from b9340b8 to ce1f91b Compare March 7, 2023 09:22
@wxtim wxtim marked this pull request as ready for review March 7, 2023 09:22
@wxtim wxtim self-assigned this Mar 7, 2023
@wxtim wxtim added the bug Something is wrong :( label Mar 7, 2023
@wxtim wxtim added this to the cylc-8.1.x milestone Mar 7, 2023
@oliver-sanders
Copy link
Member

@oliver-sanders any idea how to test this in an automated way? Can I create an SSH config for the test shell?

def test_whatever(monkeypatch):
    def mock_c():
        raise NoPlatformError(...)
    monkeypatch.setattr(
         'a.b.c',
         mock_c,
    )
    c()

Mangling SSH settings on the fly nasty, I wouldn't functional test this.

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_platforms interface can raise three possible errors:

  • PlatformLookupError
  • NoPlatformsError
  • NoHostsError

For safety I think every call of get_platform which could be called by the scheduler (cat-log etc, don't matter) should catch all three.

Note that we can't combine them at the moment because NoHostsError may require special handling.

cylc/flow/platforms.py Outdated Show resolved Hide resolved
cylc/flow/task_job_mgr.py Outdated Show resolved Hide resolved
@oliver-sanders
Copy link
Member

I've taken a look into more radical change, however, this is going to be a large refactor so not suitable for this release.

@wxtim wxtim marked this pull request as draft March 8, 2023 10:39
@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch 2 times, most recently from b9b8600 to c8615ee Compare March 14, 2023 10:29
@@ -487,7 +488,14 @@ def _run_callback(callback, args_=None):
# that you can use that platform's ssh command.
platform = None
if isinstance(ctx.cmd_key, TaskJobLogsRetrieveContext):
platform = get_platform(ctx.cmd_key.platform_name)
try:
Copy link
Member Author

@wxtim wxtim Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've kept this one reasonably simple. I've struggled to recreate this error by changing externals and in the end was only able to manually test it by adding

               raise PLATFORM_LOOKUP_ERRORS[2]('foobar')

This isn't really very satisfactory, but I don't think that this likely to cause a real world bug, and that this case is just covering ourselves.

I think that we want to say that if for whatever reason we cannot retrieve remote logs we want to tell the user and move on with the workflow.

cylc/flow/task_events_mgr.py Outdated Show resolved Hide resolved
@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch 2 times, most recently from f00de14 to 2d3b053 Compare March 16, 2023 14:59
cylc/flow/task_pool.py Outdated Show resolved Hide resolved
@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch 3 times, most recently from c137d08 to ca6a48a Compare March 17, 2023 07:53
@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch from 2700b75 to f4323d6 Compare March 28, 2023 14:00
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, error handling makes sense, just need to test.

@wxtim
Copy link
Member Author

wxtim commented Apr 11, 2023

Code LGTM, error handling makes sense, just need to test.

Is this "You (Tim) need to write more tests" or "I (Oliver) want more time to try and break it?

cylc/flow/platforms.py Outdated Show resolved Hide resolved
for row_idx, row in enumerate(self.connect().execute(stmt)):
callback(row_idx, list(row))
platform_error = callback(row_idx, list(row))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are dependent on this implementation detail of the callback, I think it would make sense to refactor away the callback to avoid the potential for breakage. Don't let it hold up this PR, I'll have a quick go to see if it can be done without wasting much time

tests/functional/platforms/10-platform-gone-on-restart.t Outdated Show resolved Hide resolved
@wxtim wxtim requested a review from MetRonnie April 11, 2023 15:23
tests/integration/test_task_events_mgr.py Outdated Show resolved Hide resolved
tests/integration/test_task_job_mgr.py Outdated Show resolved Hide resolved
Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have followed your manual test recipe

@hjoliver
Copy link
Member

Flake8 reporting a few of these in the CI fast tests:

C419 Unnecessary list comprehension passed to any() prevents short-circuiting - rewrite as a generator.

fix test

Update cylc/flow/platforms.py

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>

Update tests/functional/platforms/10-platform-gone-on-restart.t

made docstrings more legible.
@wxtim wxtim force-pushed the fix_workflow_shutdown_on_NoPlatformsError branch from 7668df4 to 8fe1f30 Compare April 17, 2023 08:02
Don't use any([i for i in iterable])
use
any(i for i in iterable).
It's more efficient because we don't have to expand the entire thing.
@wxtim
Copy link
Member Author

wxtim commented Apr 17, 2023

Flake8 reporting a few of these in the CI fast tests:

C419 Unnecessary list comprehension passed to any() prevents short-circuiting - rewrite as a generator.

Yes - this needs a quick rebase/cherry pick of the fix already merged to target.

@MetRonnie
Copy link
Member

Actually all that was needed was closing + reopening. This causes GH Actions to create a new merge commit that it uses at checkout

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick pass-through this looks good to me. I'll leave @oliver-sanders for the merge though, since he has reviewed it pretty extensively.

cylc/flow/platforms.py Outdated Show resolved Hide resolved
Co-authored-by: Hilary James Oliver <hilary.j.oliver@gmail.com>
Comment on lines +14 to +44
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from cylc.flow.task_events_mgr import TaskJobLogsRetrieveContext
from cylc.flow.scheduler import Scheduler

from typing import Any as Fixture


async def test_process_job_logs_retrieval_warns_no_platform(
one_conf: Fixture, flow: Fixture, scheduler: Fixture, run: Fixture,
db_select: Fixture, caplog: Fixture
):
"""Job log retrieval handles `NoHostsError`"""

ctx = TaskJobLogsRetrieveContext(
ctx_type='raa',
platform_name='skarloey',
max_size=256,
key='skarloey'
)
reg: str = flow(one_conf)
schd: 'Scheduler' = scheduler(reg, paused_start=True)
# Run
async with run(schd):
schd.task_events_mgr._process_job_logs_retrieval(
schd, ctx, 'foo'
)
warning = caplog.records[-1]
assert warning.levelname == 'WARNING'
assert 'Unable to retrieve' in warning.msg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Get the log from the run fixture (reduces the scoping of the logging to just the workflow run).
  • Use log_filter to filter by level/message (does not require the warning to be the last log message).
  • Remove unused db_select.
  • s/reg/id_.
  • s/NoHostsError/NoPlatformsErrror/
Suggested change
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
from cylc.flow.task_events_mgr import TaskJobLogsRetrieveContext
from cylc.flow.scheduler import Scheduler
from typing import Any as Fixture
async def test_process_job_logs_retrieval_warns_no_platform(
one_conf: Fixture, flow: Fixture, scheduler: Fixture, run: Fixture,
db_select: Fixture, caplog: Fixture
):
"""Job log retrieval handles `NoHostsError`"""
ctx = TaskJobLogsRetrieveContext(
ctx_type='raa',
platform_name='skarloey',
max_size=256,
key='skarloey'
)
reg: str = flow(one_conf)
schd: 'Scheduler' = scheduler(reg, paused_start=True)
# Run
async with run(schd):
schd.task_events_mgr._process_job_logs_retrieval(
schd, ctx, 'foo'
)
warning = caplog.records[-1]
assert warning.levelname == 'WARNING'
assert 'Unable to retrieve' in warning.msg
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
import logging
from typing import Any as Fixture
from cylc.flow.task_events_mgr import TaskJobLogsRetrieveContext
from cylc.flow.scheduler import Scheduler
async def test_process_job_logs_retrieval_warns_no_platform(
one_conf: Fixture, flow: Fixture, scheduler: Fixture, run: Fixture,
log_filter,
):
"""Job log retrieval handles `NoPlatformsError`"""
ctx = TaskJobLogsRetrieveContext(
ctx_type='raa',
platform_name='skarloey',
max_size=256,
key='skarloey'
)
id_: str = flow(one_conf)
schd: 'Scheduler' = scheduler(id_, paused_start=True)
async with run(schd) as log:
# cause a NoPlatformError to be raised in job log retrieval
schd.task_events_mgr._process_job_logs_retrieval(
schd, ctx, 'foo'
)
assert log_filter(
log,
level=logging.WARNING,
contains='Unable to retrieve',
)

@@ -104,3 +107,24 @@ async def test_run_job_cmd_no_hosts_error(
log,
contains='No available hosts for no-host-platform',
)


async def test__run_job_cmd_logs_platform_lookup_fail(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same treatment can be applied here.

@oliver-sanders oliver-sanders merged commit fddd828 into cylc:8.1.x Apr 19, 2023
@wxtim wxtim deleted the fix_workflow_shutdown_on_NoPlatformsError branch April 19, 2023 11:01
@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.x, cylc-8.1.3 Apr 20, 2023
@oliver-sanders oliver-sanders mentioned this pull request Apr 20, 2023
wxtim added a commit to wxtim/cylc that referenced this pull request Apr 21, 2023
…too_soon

* upstream/8.1.x:
  Fix.platform is regex remote tidy fail (cylc#5445)
  parsec: better error messages for section/setting mixups
  cat-log: change interface for scheduler logs to include the dirname
  cat-log: support other workflow files
  Fix workflow shutdown on no platforms error (cylc#5395)
  actions: add build test
@MetRonnie MetRonnie mentioned this pull request May 2, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NoPlatformsError
4 participants