-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Fix negative ALIVE actors metric #45718
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this easy to test and validate the fix works?
src/ray/core_worker/core_worker.h
Outdated
} | ||
ray::stats::STATS_actors.Record(-(running + in_get + in_wait), | ||
ray::stats::STATS_actors.Record(alive, | ||
{{"State", "ALIVE"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this actually be "IDLE" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should but it's a backward incompatible change I think.
Not a huge priority for the customer but just checking for any update on this one |
@jjyao any update on this one? I see it was blocked by some pre-merge tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to call it out somewhere in a doc or change log such that this definition now changes?
@@ -59,7 +59,8 @@ DEFINE_stats( | |||
DEFINE_stats(actors, | |||
"Current number of actors currently in a particular state.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the long run, I wonder if it makes more sense to separate this metric cleanly rather than relying on the executor as a label. It seems tricky for users to know what the implications of source is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel if we document it correctly it should be fine? Although currently source
is not documented at all.
) | ||
], | ||
), | ||
Panel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just replace the Scheduler Actor State graph completely? Or is it still useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think it's still useful because it has states that's not available in the other graph.
Why are these changes needed?
Currently GCS emits the total number of ALIVE actors and each core worker emits
-1
if its actor is ALIVE but not idle. The sum is total number of idle actors (we call it state ALIVE while it really means IDLE). Since GCS and core worker emits metrics at different time, the sum might be negative.This PR fixes this by letting gcs and core worker emits different actor states:
Metrics dashboard is updated accordingly.
With this change, state ALIVE actually means alive instead of the old (wrong) idle semantic and we have a new state IDLE to indicate whether the actor is idle or not.
Related issue number
Closes #46232
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.