
Make robustness qps requirements less fragile to CI performance #17775

Closed
serathius opened this issue Apr 11, 2024 · 3 comments · Fixed by #17825

Comments

@serathius (Member) commented Apr 11, 2024

What would you like to be added?

One of the guiding principles of the robustness tests is strictness: if we expect something to happen, we should validate that it really happens. If we require high qps to reproduce a bug, we should validate that we actually achieve high qps. Running on lower-powered machines like arm doesn't mean we can simply drop the requirement, as that undermines the quality of the robustness tests. We need to go back to the original motivation for high qps and look for other ways to improve the reliability of the robustness tests.

The qps requirements for robustness tests were driven by #13766, a data inconsistency bug that required high qps to even have a chance of reproduction. Reaching such qps required pretty advanced tricks to linearize the requests in finite time, but it was finally delivered by #14682 (comment).

Maintaining such high qps throughout the whole test is pretty hard; that's why, even though we need 1000 qps for #13766, we allow it to fluctuate between 200 and 1000 from test to test. This is because the main part of the test is failure injection, i.e. failures that can cause etcd to become unavailable. This implicitly creates a dependency on etcd's ability to recover from failure. As shown in #17455, this can vary from version to version and from configuration to configuration. If etcd takes too long to recover, the average qps suffers and the test fails.

So, how can we make the test feasible on lower-powered machines if we cannot reduce the qps requirements? Simple: by asking whether we really need high qps throughout the whole test, or only during part of it. If we think about reproducing #13766, we only care about qps just before we send SIGKILL, not so much while etcd is down, nor after it recovers. It would be good to check that some requests still happen after etcd is killed, but no high qps is required then.
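
As a rough illustration, here is a minimal Go sketch of that idea, assuming a hypothetical Request record with start/end timestamps (not the actual robustness framework types): qps is computed only over requests that finished before the SIGKILL cutoff.

```go
package report

import "time"

// Request is a hypothetical record of a single client request.
type Request struct {
	Start time.Time
	End   time.Time
}

// QPSBefore computes queries per second using only requests that completed
// before the given cutoff, e.g. the moment SIGKILL was sent to the member.
func QPSBefore(requests []Request, testStart, cutoff time.Time) float64 {
	count := 0
	for _, r := range requests {
		if r.End.Before(cutoff) {
			count++
		}
	}
	window := cutoff.Sub(testStart).Seconds()
	if window <= 0 {
		return 0
	}
	return float64(count) / window
}
```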

Proposal:

  • Track the start and end time, the member affected, and the type of failpoint injected. Add this information to the client report. In the future we could even add it to the visualization.
  • Calculate QPS only for the time and requests finished before the failpoint injection.
  • Add validation for the number of requests that happened during and after the failpoint injection. Expect that the minimal number will need to be adjusted based on cluster configuration (3 vs 1 node), as some failpoints can cause full downtime for a 1-node cluster. A rough sketch follows the list.
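
A minimal sketch of what the last two points could look like, with hypothetical type and function names (FailpointInjection, ValidateRequestCounts) rather than the real robustness report structures:

```go
package report

import "time"

// FailpointInjection is a hypothetical record describing a single failpoint
// injection: which failpoint was triggered, on which member, and when.
type FailpointInjection struct {
	Name   string    // type of failpoint injected
	Member string    // member the failpoint was injected into
	Start  time.Time // when the injection started
	End    time.Time // when the injection finished and recovery could begin
}

// ValidateRequestCounts checks that a minimal number of requests finished
// during and after the injection window. requestEnds holds the completion
// times of all client requests; the minimums would be tuned per cluster
// configuration, since a 1-node cluster can be fully down during injection.
func ValidateRequestCounts(requestEnds []time.Time, fp FailpointInjection, minDuring, minAfter int) bool {
	during, after := 0, 0
	for _, end := range requestEnds {
		switch {
		case end.After(fp.End):
			after++
		case end.After(fp.Start):
			during++
		}
	}
	return during >= minDuring && after >= minAfter
}
```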

cc @MadhavJivrajani @siyuanfoundation @ahrtr @jmhbnz

Why is this needed?

Remove the dependency of robustness tests on etcd MTTR (mean time to recovery), which as a result should allow us to:

  • Reduce flakiness on low-powered machines
  • Increase the minimal qps targets
@serathius (Member, Author)

cc @jamshidi799
Maybe that will interest you.

@jamshidi799

Sure, I will pick this. Thank you.

@serathius (Member, Author)

Noticed a recent increase in robustness test failures due to qps. I want to make sure this is prioritized.

/assign
