Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply prerequisite changes to spawned tasks, on reload & restart #5334

Merged
merged 10 commits into from
Apr 26, 2023

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Jan 30, 2023

On restart we assume that all prerequisites of a spawned task are recorded in the DB. This will not be the case on restarting after adding new dependencies to that task.

Example:

[scheduling]
    [[graph]]
        R1 = """
            a => z
            b => z
            # c => z
        """
[runtime]
    [[a]]
         script = "sleep 20"
    [[b, c, z]]

To reproduce the bug:

  • install and play
  • do cylc stop --now when a is still running and z has been spawned by b:succeeded
  • uncomment # c => z to add a new prerequisite (c) task for z
  • reinstall and play: scheduler will crash

The fix: if a prerequisite does not exist in the DB, assume that it is unsatisfied.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver hjoliver added the bug Something is wrong :( label Jan 30, 2023
@hjoliver hjoliver self-assigned this Jan 30, 2023
@hjoliver hjoliver force-pushed the fix-prereq-bug branch 2 times, most recently from a45e438 to 5a8f332 Compare January 30, 2023 22:24
@hjoliver
Copy link
Member Author

hjoliver commented Jan 30, 2023

(Rebased to pick up #5333)

@wxtim wxtim requested review from oliver-sanders and removed request for oliver-sanders January 31, 2023 11:53
@oliver-sanders oliver-sanders added this to the cylc-8.1.2 milestone Jan 31, 2023
@hjoliver hjoliver changed the base branch from master to 8.1.x February 8, 2023 02:57
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this pull request Feb 8, 2023
@oliver-sanders
Copy link
Member

This fix looks good to me, I've added an integration test for this (restarts are possible, they just require a little fiddling):

hjoliver#30

I haven't looked into removing prereqs or reloads which I think we should also be testing? Or are there other tests for this?

@hjoliver
Copy link
Member Author

hjoliver commented Feb 8, 2023

Thanks for the integration test @oliver-sanders

@hjoliver hjoliver marked this pull request as ready for review February 8, 2023 21:57
@hjoliver
Copy link
Member Author

hjoliver commented Feb 8, 2023

I haven't looked into removing prereqs ...

This works already, for restart. Checking for tests ... nope. I've added a tweaked copy of your new integration test to cover this.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 8, 2023

I expect the equivalent reload will be fine. Already-spawned task proxies do have prerequisites copied on reload, and it was just the DB check on restart that caused the problem... NOPE, prerequisites are not copied over intelligently 😬

@hjoliver
Copy link
Member Author

hjoliver commented Feb 9, 2023

(Reload fixed, still working on the integration test for that ...)

@hjoliver hjoliver force-pushed the fix-prereq-bug branch 2 times, most recently from 637cee2 to 4273862 Compare February 10, 2023 01:28
@hjoliver hjoliver changed the title Fix restart after adding new deps to spawned task. Apply prerequisite changes to spawned tasks, on reload & restart Feb 10, 2023
@hjoliver
Copy link
Member Author

After considerable pfaffing about, this should be good to go now.

@hjoliver
Copy link
Member Author

@dwsutherland - could you review at this end? It touches the datastore a bit.

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 16, 2023

I tried testing with this example:

[scheduler]                    
    allow implicit tasks = True    
                               
[scheduling]                   
    initial cycle point = 1    
    cycling mode = integer    
    [[graph]]               
        P1 = """            
            a => b          
            a => c          
        """

Then added this dependency:

            a[-P1] => a    

And reloaded.

Cylc didn't crash, but the workflow became stuck (no stall event detected):

Screenshot from 2023-02-16 15-42-43

Note 34/a had succeeded and left the pool before the reload happened. I think the missing DB prereq entry results in the dep being marked as unsatisfied incorrectly.

Rather than defaulting to False should we run it through the DB or re-spawn the dependency to run it through the DB somehow?

@oliver-sanders
Copy link
Member

The issue I mentioned above isn't great, but that said, it is a lot better than crashing. If it's difficult we could bump the DB stuff to a later PR as the pressure on release is quite high due to the GUI bug.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.2, cylc-8.1.3 Feb 20, 2023
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this pull request Mar 1, 2023
* Accompanies cylc#5334
  which fixes traceback for missing prereqs on reload/restart
  by marking them as unsatisfied.
@oliver-sanders
Copy link
Member

Gotta fix here: hjoliver#31

However, we still need to work out how to square this with the data store which still defaults to False after my change.

@hjoliver
Copy link
Member Author

(Thanks, just back to this now, I'll take a look at the problem you found, and your fix ...)

@hjoliver
Copy link
Member Author

hjoliver commented Mar 10, 2023

OK what's going on is, a is initially parent-less so it gets spawned to the runahead limit. Then after the reload it has a parent (it's own previous instance) and we update the already-spawned instances to give them the new prerequisite. The new prerequisite should be satisfied for the first instance, but it isn't - because under SoD, prerequisite status only gets updated on demand, at the instant the corresponding upstream output gets completed.

@hjoliver
Copy link
Member Author

Agreed with your fix @oliver-sanders ; merged it to this branch.

One small issue, I don't think we can provide the prerequsite satisfied message accurately i.e. we can't tell the difference between succeeded naturally and whatever the alternative is. However, this is minor.

Yeah, minor. I think "satisfied naturally" is the only one used at the moment in Cylc 8 anyhow. Although that will change with https://cylc.github.io/cylc-admin/proposal-cylc-set.html

@hjoliver
Copy link
Member Author

From the side-PR:

I'm not sure how to square this with the data store without duplicating the database calls.

Presumably I just need to call delta_task_prerequisite(task) in the datastore manager.

This PR marks all satisfied prerequisites as satisfied naturally. I think the information is lost by this point, is there a more suitable value e.g satisfied from DB?

(See above) This isn't used for anything at the moment. For now I'll just make something up to distinguish the two options.

TODO

  • update datastore
  • new test for this case
  • "satisfied by DB" (or similar)

@oliver-sanders
Copy link
Member

We had someone hit this. Strangely when they reverted the diff and ran vr again, the workflow started up with an empty task pool???

Not sure why that happened, doesn't sound right. The good news is it was recoverable with a bit of triggering. This did remind me of the importance of these two issues:

Both of which hit us whilst trying to recover the workflow.

@oliver-sanders
Copy link
Member

oliver-sanders commented Apr 24, 2023

Gotta solution to the last three bullet points here - hjoliver#32

  • Sets the prereq message to "satisfied from database".
  • Updates the data store when prereqs change on reload (the restart case was already handled).
  • Shims this into the existing tests.

One conflict to shift, then I think this is good to go.

@hjoliver
Copy link
Member Author

Gotta solution to the last three bullet points here - hjoliver#32

Great, thanks.

@hjoliver
Copy link
Member Author

@oliver-sanders - applied your tweaks and rebased.

_remote_background functional tests failing unrelated infrastructure reasons.

@@ -68,6 +68,15 @@ Rose options (`-O`, `-S` & `-D`) with `cylc view`.
[#5363](https://github.com/cylc/cylc-flow/pull/5363) Improvements and bugfixes
for `cylc lint`.

-------------------------------------------------------------------------------
## __cylc-8.1.2 (<span actions:bind='release-date'>Upcoming</span>)__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be 8.1.3.

'graph_1, graph_2, '
'expected_1, expected_2, expected_3, expected_4',
[
param( # Restart after adding a prerequisite to task z
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO these should have an id= kwarg -

Imagine trying to run

pytest 'tests/i/test_task_pool.py::test_restart_prereqs[a => z\n               b => z-a => z\n               b => z\n               c => z-expected_10-expected_20-expected_30-expected_40]'

b => z
c => z''',
[
('1', 'a', 'running'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that everything is at CP = 1, could we just test output[1:] = expect and not include the '1'?

It's marginal, but I think it cuts down on the amount written. Don't care that much.

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving - It looks good - I've made a couple of stylistic points, but

  • Reproduced the bug, failed to reproduce it on branch.
  • Read Code

@wxtim wxtim merged commit c8f9c23 into cylc:8.1.x Apr 26, 2023
@hjoliver hjoliver deleted the fix-prereq-bug branch April 26, 2023 11:17
wxtim added a commit to wxtim/cylc that referenced this pull request May 4, 2023
…_too_soon

* upstream/master: (70 commits)
  Tweak scan CLI help. (cylc#5405)
  Add support for python 3.11 (cylc#5497)
  tests/u: add parameter ids
  build(deps): bump pypa/gh-action-pypi-publish from 1.8.5 to 1.8.6
  prerequisite: remove unused interface
  Fix change log dup section.
  Fix flaky test
  Update CONTRIBUTING.md
  Tidy platforms
  modify CONTRIBUTING.md
  Bump dev version (cylc#5501)
  Prepare release 8.1.3
  Ignore off-sequence parents, for datastore. (cylc#5495)
  Avoid duplicate prerequisites from multiple recurrences. (cylc#5466)
  Update cylc/flow/prerequisite.py
  prerequsite: remove target_point_strings attribute
  tests/f: fix events/11
  swarm: convert cylc-dev Docker image to run on ubuntu:latest
  fix changelog
  Apply prerequisite changes to spawned tasks, on reload & restart (cylc#5334)
  ...
wxtim added a commit to wxtim/cylc that referenced this pull request May 4, 2023
…too_soon

* upstream/8.1.x: (46 commits)
  Bump dev version (cylc#5518)
  Prepare release 8.1.4
  cat-log: fix issues with file listing
  Update changelog
  Fix `eval_host()` localhost bug & plug testing gap
  Bump dev version (cylc#5501)
  Prepare release 8.1.3
  Ignore off-sequence parents, for datastore. (cylc#5495)
  Avoid duplicate prerequisites from multiple recurrences. (cylc#5466)
  tests/f: fix events/11
  swarm: convert cylc-dev Docker image to run on ubuntu:latest
  fix changelog
  Apply prerequisite changes to spawned tasks, on reload & restart (cylc#5334)
  cat-log: don't list directories which don't exist (cylc#5488)
  task_job_mgr: move compute out of open call
  manylinux tests: update to ubuntu-20.04
  Update changelog
  Address review
  Fix.platform is regex remote tidy fail (cylc#5445)
  parsec: better error messages for section/setting mixups
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants