
Ensure healthy cluster after robustness failpoint #15596

Closed
serathius opened this issue Mar 30, 2023 · 2 comments · Fixed by #15604
@serathius
Member

What would you like to be added?

A recent change that waits for all watch events caused nightly robustness test runs to wait indefinitely: https://github.com/etcd-io/etcd/actions/runs/4562802681

A bug in raft (#15595) causes etcd members to crash. With only 1 out of 3 members, the cluster is not able to proceed.

In this case, instead of waiting indefinitely, we should detect that the cluster is unhealthy and abort collecting watch events.
Bonus points for implementing a timeout for watching events (rough sketch of both ideas below).
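
A minimal sketch, assuming a hypothetical `clusterHealthy` probe and a simplified `collectWatchEvents` loop (neither is the real robustness-test code), of how watch collection could be bounded by a timeout and aborted early when the cluster goes down:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectWatchEvents stands in for the real event collection loop; here it
// simply blocks until the context is cancelled or times out.
func collectWatchEvents(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// clusterHealthy is a hypothetical health probe; a real test would query
// each member's endpoint.
func clusterHealthy(ctx context.Context) bool { return false }

func main() {
	// Upper bound on how long we are willing to wait for watch events.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Periodically probe cluster health and abort collection early if the
	// cluster is down, instead of waiting for the full timeout.
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if !clusterHealthy(ctx) {
					cancel()
					return
				}
			}
		}
	}()

	if err := collectWatchEvents(ctx); err != nil {
		fmt.Println("stopped collecting watch events:", err)
	}
}
```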

Why is this needed?

Robustness tests should not time out.

@jmhbnz
Member

jmhbnz commented Mar 31, 2023

Hey @serathius - I had a quick look at this to try and understand how these tests work, and there is a lot to unpick! Am I on the right track that this function is responsible for the infinite wait? https://github.com/etcd-io/etcd/blob/main/tests/robustness/watch.go#L34

If I'm on the right track, feel free to assign this one to me and I will have a go at raising the fix 😅

@serathius
Member Author

serathius commented Mar 31, 2023

Yes, this is the correct function. It waits indefinitely due to the recent change in #15575.

However, the function by itself is not the problem. I think we should have an external mechanism that checks whether the cluster is totally down and cancels the context passed to collectClusterWatchEvents and watchMember.

My first guess would be that the triggerFailpoints function should validate that the cluster is healthy between and after injecting failpoints. If it's not, it should propagate the signal up as an error. With that error, the runScenario function can cancel the context passed to collectClusterWatchEvents (rough sketch below).
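
A rough sketch of that mechanism, using hypothetical stand-ins (`cluster`, `injectFailpoint`, `memberHealthy`) rather than the actual etcd test harness code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type cluster struct{ members []string }

// injectFailpoint stands in for the real failpoint injection.
func injectFailpoint(ctx context.Context, c *cluster) error { return nil }

// memberHealthy stands in for an endpoint health probe of a single member.
func memberHealthy(ctx context.Context, endpoint string) bool { return true }

// triggerFailpoints injects failpoints and verifies that every member is
// healthy after each injection, propagating unhealthiness as an error.
func triggerFailpoints(ctx context.Context, c *cluster) error {
	for i := 0; i < 3; i++ {
		if err := injectFailpoint(ctx, c); err != nil {
			return fmt.Errorf("injecting failpoint: %w", err)
		}
		for _, m := range c.members {
			if !memberHealthy(ctx, m) {
				return fmt.Errorf("member %q unhealthy after failpoint", m)
			}
		}
	}
	return nil
}

// collectClusterWatchEvents stands in for watch collection; it stops as soon
// as the context is cancelled.
func collectClusterWatchEvents(ctx context.Context, c *cluster) {
	<-ctx.Done()
}

// runScenario wires the two together: an error from triggerFailpoints cancels
// the context so watch collection cannot hang forever.
func runScenario(c *cluster) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		collectClusterWatchEvents(ctx, c)
	}()

	err := triggerFailpoints(ctx, c)
	// In this sketch we cancel whether or not an error occurred so the
	// program terminates; the important case is the error path, where the
	// cluster is unhealthy and waiting for watch events would never finish.
	cancel()
	wg.Wait()
	return err
}

func main() {
	c := &cluster{members: []string{"m0", "m1", "m2"}}
	if err := runScenario(c); err != nil {
		fmt.Println("scenario failed:", err)
	}
}
```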
