-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Traverser: add valid return in mod_plan for partial cancel when allocation exists #1292
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed it makes the error messages I was seeing go away!
@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion? It would be good to know if the partial cancel behaves as expected when encountering this issue. |
Yeah it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch so I don't think there's as much hurry to merge it. I'm also working on a test case in flux-coral2 environment but it's not working quite right yet and the tests are slow so it's taking a while. I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs. |
After more thought, I think we need to add a testsuite check for this issue. |
Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the |
Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: flux-framework/flux-coral2#208, but it would be good to have a test in sched, too. |
Dismissing the first review to make sure this doesn't get merged accidentally. I'll re-request once we understand the behavior better.
Whatever the defaults are I think, there's no configuration done.
Thanks for the pointer, I should be able to look into this tomorrow! |
Problem: issue flux-framework#1284 identified a scenario where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. Add a goto statement to return 0 under this circumstance.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1292 +/- ##
========================================
- Coverage 75.2% 75.2% -0.1%
========================================
Files 110 110
Lines 15230 15231 +1
========================================
- Hits 11467 11457 -10
- Misses 3763 3774 +11
|
Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the
mod_plan
function when an allocation is found andmod_data.mod_type == job_modify_t::PARTIAL_CANCEL
. This PR adds agoto
statement to return0
under this circumstance.