
Make robustness qps requirements less fragile to CI performance #17775

Closed
serathius opened this issue Apr 11, 2024 · 3 comments · Fixed by #17825

Comments

@serathius (Member) commented Apr 11, 2024

What would you like to be added?

One of the guiding principles of the robustness tests is strictness: if we expect something to happen, we should validate that it really happens. If we require high qps to reproduce a bug, we should validate that we actually achieve high qps. Running on lower-powered machines like arm doesn't mean we can simply drop the requirement, as that undermines the quality of the robustness tests. We need to go back to the original motivation for high qps and look for other ways to improve the reliability of the robustness tests.

The qps requirements for robustness tests were driven by #13766, a data inconsistency bug that required high qps to even have a chance of reproduction. Reaching such qps required pretty advanced tricks to linearize the requests in finite time, but it was finally delivered by #14682 (comment).

Maintaining such high qps throughout the whole test is pretty hard; that's why, even though we need 1000 qps for #13766, we allow it to fluctuate between 200 and 1000 from test to test. This is because the main part of the test is failure injection, i.e. failures that can cause etcd to become unavailable. This implicitly creates a dependency on etcd's ability to recover from failure. As shown in #17455, this can vary from version to version and from configuration to configuration. If etcd takes too long to recover, the average qps suffers and the test fails.

So, how can we make the test feasible on lower-powered machines if we cannot reduce the qps requirements? Simple: by asking whether we really need high qps throughout the whole test, or only during part of it. If we think about reproducing #13766, we only care about qps just before we send SIGKILL, not so much while etcd is down, nor after it recovers. It would be good to check that some requests still happen after etcd is killed, but no high qps is required then.
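
As a rough illustration, here is a minimal Go sketch of that idea, assuming a hypothetical Request record with start/end timestamps (not the actual robustness framework types): qps is computed only over requests that finished before the SIGKILL cutoff.

```go
package report

import "time"

// Request is a hypothetical record of a single client request.
type Request struct {
	Start time.Time
	End   time.Time
}

// QPSBefore computes queries per second using only requests that completed
// before the given cutoff, e.g. the moment SIGKILL was sent to the member.
func QPSBefore(requests []Request, testStart, cutoff time.Time) float64 {
	count := 0
	for _, r := range requests {
		if r.End.Before(cutoff) {
			count++
		}
	}
	window := cutoff.Sub(testStart).Seconds()
	if window <= 0 {
		return 0
	}
	return float64(count) / window
}
```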

Proposal:

  • Track the start and end time, the member affected, and the type of failpoint injected. Add this information to the client report. In the future we could even add it to the visualization.
  • Calculate QPS only for the time and requests finished before the failpoint injection.
  • Add validation for the number of requests that happened during and after the failpoint injection. Expect that the minimal number will need to be adjusted based on cluster configuration (3 vs 1 node), as some failpoints can cause full downtime for a 1-node cluster. A rough sketch follows the list.
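
A minimal sketch of what the last two points could look like, with hypothetical type and function names (FailpointInjection, ValidateRequestCounts) rather than the real robustness report structures:

```go
package report

import "time"

// FailpointInjection is a hypothetical record describing a single failpoint
// injection: which failpoint was triggered, on which member, and when.
type FailpointInjection struct {
	Name   string    // type of failpoint injected
	Member string    // member the failpoint was injected into
	Start  time.Time // when the injection started
	End    time.Time // when the injection finished and recovery could begin
}

// ValidateRequestCounts checks that a minimal number of requests finished
// during and after the injection window. requestEnds holds the completion
// times of all client requests; the minimums would be tuned per cluster
// configuration, since a 1-node cluster can be fully down during injection.
func ValidateRequestCounts(requestEnds []time.Time, fp FailpointInjection, minDuring, minAfter int) bool {
	during, after := 0, 0
	for _, end := range requestEnds {
		switch {
		case end.After(fp.End):
			after++
		case end.After(fp.Start):
			during++
		}
	}
	return during >= minDuring && after >= minAfter
}
```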

cc @MadhavJivrajani @siyuanfoundation @ahrtr @jmhbnz

Why is this needed?

Remove the dependency of robustness tests on etcd MTTR (mean time to recovery), which as a result should allow us to:

  • Reduce flakiness on low-powered machines
  • Increase the minimal qps targets
@serathius (Member, Author)

cc @jamshidi799
Maybe that will interest you.

@jamshidi799

Sure, I will pick this. Thank you.

@serathius (Member, Author)

Noticed a recent increase in robustness test failures due to qps. I want to make sure this is prioritized.

/assign
