Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix negative ALIVE actors metric #45718

Merged
merged 11 commits into from
Jun 26, 2024
Merged

Conversation

jjyao
Copy link
Contributor

@jjyao jjyao commented Jun 4, 2024

Why are these changes needed?

Currently GCS emits the total number of ALIVE actors and each core worker emits -1 if its actor is ALIVE but not idle. The sum is total number of idle actors (we call it state ALIVE while it really means IDLE). Since GCS and core worker emits metrics at different time, the sum might be negative.

image

This PR fixes this by letting gcs and core worker emits different actor states:

  • GCS will emit DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, RESTARTING and DEAD
  • If the actor is ALIVE, core worker will emit IDLE, RUNNING_TASK, RUNNING_IN_RAY_GET, RUNNING_IN_RAY_WAIT. Those can be seen as sub-states of ALIVE.

Metrics dashboard is updated accordingly.

With this change, state ALIVE actually means alive instead of the old (wrong) idle semantic and we have a new state IDLE to indicate whether the actor is idle or not.

Related issue number

Closes #46232

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from a team as a code owner June 4, 2024 18:26
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jun 4, 2024
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this easy to test and validate the fix works?

}
ray::stats::STATS_actors.Record(-(running + in_get + in_wait),
ray::stats::STATS_actors.Record(alive,
{{"State", "ALIVE"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this actually be "IDLE" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should but it's a backward incompatible change I think.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@alexr-anyscale
Copy link

Not a huge priority for the customer but just checking for any update on this one

@alexr-anyscale
Copy link

@jjyao any update on this one? I see it was blocked by some pre-merge tests.

@jjyao jjyao changed the title [Core] Fix negative ACTIVE actors metric [Core] Fix negative ALIVE actors metric Jun 26, 2024
@jjyao jjyao assigned alanwguo and unassigned rkooo567 Jun 26, 2024
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to call it out somewhere in a doc or change log such that this definition now changes?

@@ -59,7 +59,8 @@ DEFINE_stats(
DEFINE_stats(actors,
"Current number of actors currently in a particular state.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the long run, I wonder if it makes more sense to separate this metric cleanly rather than relying on the executor as a label. It seems tricky for users to know what the implications of source is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel if we document it correctly it should be fine? Although currently source is not documented at all.

)
],
),
Panel(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just replace the Scheduler Actor State graph completely? Or is it still useful?

Copy link
Contributor Author

@jjyao jjyao Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think it's still useful because it has states that's not available in the other graph.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao merged commit 80594fa into ray-project:master Jun 26, 2024
6 checks passed
@jjyao jjyao deleted the jjyao/metri branch June 26, 2024 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] NodeId and NodeIp tags don't match the expected instance
5 participants