
Ensure healthy cluster after robustness failpoint #15596

Closed
serathius opened this issue Mar 30, 2023 · 2 comments · Fixed by #15604
@serathius
Member

What would you like to be added?

A recent change that waits for all watch events caused nightly robustness test runs to wait indefinitely: https://github.com/etcd-io/etcd/actions/runs/4562802681

A bug in raft (#15595) causes etcd members to crash. With only 1 out of 3 members, the cluster is not able to proceed.

In this case, instead of waiting indefinitely, we should detect that the cluster is unhealthy and abort collecting watch events.
Bonus points for implementing a timeout for watching events (rough sketch of both ideas below).
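
A minimal sketch, assuming a hypothetical `clusterHealthy` probe and a simplified `collectWatchEvents` loop (neither is the real robustness-test code), of how watch collection could be bounded by a timeout and aborted early when the cluster goes down:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectWatchEvents stands in for the real event collection loop; here it
// simply blocks until the context is cancelled or times out.
func collectWatchEvents(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// clusterHealthy is a hypothetical health probe; a real test would query
// each member's endpoint.
func clusterHealthy(ctx context.Context) bool { return false }

func main() {
	// Upper bound on how long we are willing to wait for watch events.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Periodically probe cluster health and abort collection early if the
	// cluster is down, instead of waiting for the full timeout.
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if !clusterHealthy(ctx) {
					cancel()
					return
				}
			}
		}
	}()

	if err := collectWatchEvents(ctx); err != nil {
		fmt.Println("stopped collecting watch events:", err)
	}
}
```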

Why is this needed?

Robustness tests should not time out.

@jmhbnz
Member

jmhbnz commented Mar 31, 2023

Hey @serathius - I had a quick look at this to try and understand how these tests work, and there is a lot to unpick! Am I on the right track that this function is responsible for the infinite wait? https://github.com/etcd-io/etcd/blob/main/tests/robustness/watch.go#L34

If I'm on the right track, feel free to assign this one to me and I will have a go at raising the fix 😅

@serathius
Member Author

serathius commented Mar 31, 2023

Yes, this is the correct function. It waits indefinitely due to the recent change in #15575.

However, the function by itself is not the problem. I think we should have an external mechanism that checks whether the cluster is totally down and cancels the context passed to collectClusterWatchEvents and watchMember.

My first guess would be that the triggerFailpoints function should validate that the cluster is healthy between and after injecting failpoints. If it's not, it should propagate the signal up as an error. With that error, the runScenario function can cancel the context passed to collectClusterWatchEvents (rough sketch below).
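
A rough sketch of that mechanism, using hypothetical stand-ins (`cluster`, `injectFailpoint`, `memberHealthy`) rather than the actual etcd test harness code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type cluster struct{ members []string }

// injectFailpoint stands in for the real failpoint injection.
func injectFailpoint(ctx context.Context, c *cluster) error { return nil }

// memberHealthy stands in for an endpoint health probe of a single member.
func memberHealthy(ctx context.Context, endpoint string) bool { return true }

// triggerFailpoints injects failpoints and verifies that every member is
// healthy after each injection, propagating unhealthiness as an error.
func triggerFailpoints(ctx context.Context, c *cluster) error {
	for i := 0; i < 3; i++ {
		if err := injectFailpoint(ctx, c); err != nil {
			return fmt.Errorf("injecting failpoint: %w", err)
		}
		for _, m := range c.members {
			if !memberHealthy(ctx, m) {
				return fmt.Errorf("member %q unhealthy after failpoint", m)
			}
		}
	}
	return nil
}

// collectClusterWatchEvents stands in for watch collection; it stops as soon
// as the context is cancelled.
func collectClusterWatchEvents(ctx context.Context, c *cluster) {
	<-ctx.Done()
}

// runScenario wires the two together: an error from triggerFailpoints cancels
// the context so watch collection cannot hang forever.
func runScenario(c *cluster) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		collectClusterWatchEvents(ctx, c)
	}()

	err := triggerFailpoints(ctx, c)
	// In this sketch we cancel whether or not an error occurred so the
	// program terminates; the important case is the error path, where the
	// cluster is unhealthy and waiting for watch events would never finish.
	cancel()
	wg.Wait()
	return err
}

func main() {
	c := &cluster{members: []string{"m0", "m1", "m2"}}
	if err := runScenario(c); err != nil {
		fmt.Println("scenario failed:", err)
	}
}
```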
