
Scalability load test extended to exercise Deployments, DaemonSets, StatefulSets, Jobs, PersistentVolumes, Secrets, ConfigMaps, NetworkPolicies #704

Open
12 of 16 tasks
mm4tt opened this issue Jul 29, 2019 · 8 comments
Assignees
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@mm4tt
Contributor

mm4tt commented Jul 29, 2019

For each resource below, two tasks are tracked: Implemented, and Enabled in CI/CD.

  • Deployments
  • DaemonSets
  • StatefulSets
  • Jobs
  • PersistentVolumes
  • Secrets
  • ConfigMaps
  • NetworkPolicies
@mm4tt
Contributor Author

mm4tt commented Jul 29, 2019

/assign

@mm4tt
Contributor Author

mm4tt commented Jul 31, 2019

I ran a 5K-node test yesterday, using the extended load scenario with Secrets, ConfigMaps, StatefulSets and PVs enabled.
The test passed, but the new Prometheus-based api-call latency measurement failed for a few tuples:

W0731 02:58:04.461] I0731 02:58:04.459614   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:persistentvolumes Subresource: Verb:POST Scope:cluster Latency:perc50: 266.666666ms, perc90: 1.375s, perc99: 2.443099999s Count:553}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459626   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:perc50: 32.018464ms, perc90: 143.243359ms, perc99: 1.663881526s Count:17274382}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459643   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:perc50: 1.215999999s, perc90: 1.472s, perc99: 1.4972s Count:8251}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459655   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:namespaces Subresource: Verb:GET Scope:cluster Latency:perc50: 26.875234ms, perc90: 48.375422ms, perc99: 1.14s Count:5228}; threshold: 1s
W0731 02:58:04.462] I0731 02:58:04.459668   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:configmaps Subresource: Verb:POST Scope:namespace Latency:perc50: 28.571428ms, perc90: 88.1ms, perc99: 1.02472s Count:18627}; threshold: 1s

I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:
[Prometheus latency graph: POST persistentvolumes]

[Prometheus latency graph: DELETE services]

This is a known problem with the Prometheus api-call latency measurement (it's actually the reason it's currently disabled). @oxddr and @krzysied are working on this and hopefully we'll have a solution soon.

So to summarize, the extended load looks very promising. Once we have a solution to the spike problem in the Prometheus api-call latency measurement, we should be good (or really close) to enable it in CI/CD.
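For context on what this measurement checks, here is a minimal sketch of the kind of per-(resource, subresource, verb, scope) percentile query a Prometheus-based API responsiveness check boils down to. This is an illustration only, not the actual code in api_responsiveness_prometheus.go; the metric name, label filter and window are assumptions.

```go
// Sketch only: the real measurement lives in api_responsiveness_prometheus.go
// and may use different metric names, filters and windows.
package main

import "fmt"

// latencyQuery builds a PromQL query returning the given latency quantile per
// (resource, subresource, verb, scope) tuple, which is then compared against
// the 1s threshold seen in the warnings above.
func latencyQuery(quantile float64, window string) string {
	return fmt.Sprintf(
		`histogram_quantile(%g, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[%s])) by (resource, subresource, verb, scope, le))`,
		quantile, window)
}

func main() {
	fmt.Println(latencyQuery(0.99, "1h"))
}
```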

@mm4tt
Contributor Author

mm4tt commented Jul 31, 2019

After discussing with the team, we agreed that we should be able to enable Secrets and ConfigMaps in the CI/CD tests.

On the other hand, it might be tricky, as we currently have a separate experimental config for the extended load.
I think it will be easier to first implement everything there, then move it out of the experimental directory and make it the default load config, and then gradually enable the new objects via overrides.
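To make the "gradually enable new objects via overrides" idea concrete, here is a minimal sketch assuming the load config exposes boolean template parameters that default to false and can be flipped per job by a small YAML overrides file (e.g. passed to ClusterLoader2 with a flag such as --testoverrides). The parameter names below are illustrative assumptions, not verified names from the config.

```go
// Minimal sketch, assuming per-job overrides are plain key/value YAML that is
// merged over the load config's defaults. Flag names are hypothetical.
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

func main() {
	// Defaults baked into the load config.
	params := map[string]interface{}{
		"ENABLE_CONFIGMAPS": false,
		"ENABLE_SECRETS":    false,
	}

	// Contents of a hypothetical overrides file for one CI/CD job.
	overrides := []byte("ENABLE_CONFIGMAPS: true\nENABLE_SECRETS: true\n")

	parsed := map[string]interface{}{}
	if err := yaml.Unmarshal(overrides, &parsed); err != nil {
		panic(err)
	}
	for k, v := range parsed {
		params[k] = v
	}
	fmt.Println(params) // map[ENABLE_CONFIGMAPS:true ENABLE_SECRETS:true]
}
```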

@oxddr
Contributor

oxddr commented Jul 31, 2019

I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:

(...)

This is a known problem with the Prometheus api-call latency measurement (it's actually the reason it's currently disabled). janluk and Krzysztof Siedlecki are working on this and hopefully we'll have a solution soon.

For the record, the fact that the Prometheus-based API call latency SLO was violated may actually be valid. But I understand that SLO violations caused by logrotate are orthogonal to the changes you made.

The Prometheus-based measurement is close to the SLO definition and thus is stricter and more prone to violations caused by spikes.
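One plausible way a short burst of slow requests can trip a percentile check is when the quantile is effectively evaluated over short windows rather than the whole run: a spike that is negligible globally can dominate one window. The toy numbers below are made up and this is only an illustration of that spike sensitivity, not a description of how the measurement actually aggregates.

```go
// Toy numbers (made up) showing how a short spike that barely moves a
// whole-run p99 can dominate a per-window p99.
package main

import (
	"fmt"
	"sort"
)

func p99(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return s[int(0.99*float64(len(s)-1))]
}

func main() {
	window := func(spiky bool) []float64 {
		w := make([]float64, 1000)
		for i := range w {
			w[i] = 0.05 // 50ms baseline
			if spiky && i < 30 {
				w[i] = 2.5 // a brief burst of slow requests, e.g. around logrotate
			}
		}
		return w
	}

	var wholeRun []float64
	worstWindow := 0.0
	for i := 0; i < 11; i++ {
		w := window(i == 5) // one spiky window out of eleven
		wholeRun = append(wholeRun, w...)
		if q := p99(w); q > worstWindow {
			worstWindow = q
		}
	}
	fmt.Printf("p99 over the whole run: %.2fs\n", p99(wholeRun)) // ~0.05s
	fmt.Printf("worst per-window p99:   %.2fs\n", worstWindow)   // 2.50s
}
```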

mm4tt added a commit to mm4tt/perf-tests that referenced this issue Jul 31, 2019
This is most likely a no-op until we turn on some Network Policy Provider that will start enforcing these network policies.

It should be pretty straightforward to turn on Calico both in GKE and in GCE.
This should be done separately to isolate any potential performance
impact of turning it on.

Ref. kubernetes#704
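For reference, this is the kind of object involved: a hedged client-go sketch that creates a simple NetworkPolicy. Without an enforcing provider such as Calico it is purely an API/etcd object with no dataplane effect, which is why the commit above is expected to be a no-op. The namespace, name and selectors are made up; this is not the ClusterLoader2 code.

```go
// Hypothetical example: create a NetworkPolicy allowing ingress from pods in
// the same namespace. Names and selectors are illustrative only.
package main

import (
	"context"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-from-same-namespace"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{},
				}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}
	if _, err := client.NetworkingV1().NetworkPolicies("test-namespace").
		Create(context.TODO(), policy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```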
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 2, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 2, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 2, 2019
This also merges experimental load with the real load test.

The "knob" has been enabled in presubmits in kubernetes/test-infra#14166.

Ref. kubernetes#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 3, 2019
This also merges experimental load with the real load test.

The "knob" has been enabled in presubmits in kubernetes/test-infra#14166.

Ref. kubernetes#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 12, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 16, 2019
Will be keeping an eye on the next runs and roll back / disable for some
jobs if needed.

Ref. kubernetes/perf-tests#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 16, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 16, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 17, 2019
Similarly to the other new resources, I will be keeping an eye on the next runs and roll back / disable for some jobs if needed.

kubernetes/perf-tests#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
The only tricky part here is deleting PVs that are created via
StatefulSets. These PVs are not automatically deleted when StatefulSets
are deleted. Because of that, I extended the ClusterLoader Phase API to
allow deleting objects that weren't created directly via CL2. The way it
works is that once we detect a new object, if a certain option is set,
we issue a List request to find the number of replicas.

Ref. kubernetes#704
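A hedged sketch of the leftover-PV problem described above: PVCs created from a StatefulSet's volumeClaimTemplates (and, given a Delete reclaim policy, the PVs bound to them) survive deletion of the StatefulSet and have to be found and deleted explicitly. This is an illustration only, not the ClusterLoader2 Phase API change; the label selector and names are assumptions.

```go
// Hypothetical cleanup: list and delete PVCs left behind by a deleted
// StatefulSet so that their bound PVs get released and deleted as well.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cleanupStatefulSetPVCs(client kubernetes.Interface, namespace, appLabel string) error {
	pvcs, err := client.CoreV1().PersistentVolumeClaims(namespace).List(
		context.TODO(), metav1.ListOptions{LabelSelector: "app=" + appLabel})
	if err != nil {
		return err
	}
	for _, pvc := range pvcs.Items {
		fmt.Printf("deleting leftover PVC %s/%s\n", namespace, pvc.Name)
		// With a Delete reclaim policy, removing the PVC also deletes the bound PV.
		if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(
			context.TODO(), pvc.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := cleanupStatefulSetPVCs(kubernetes.NewForConfigOrDie(config), "test-namespace", "example"); err != nil {
		panic(err)
	}
}
```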
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 18, 2019
jkaniuk pushed a commit to jkaniuk/perf-tests that referenced this issue Oct 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2019
@mm4tt
Contributor Author

mm4tt commented Dec 27, 2019

/remove-lifecycle stale

We hope to get back to this in Q1 2020.
In general, this is done except for NetworkPolicies. They are implemented as well, but we need to resolve some issues in Calico before enabling them.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2020
@oxddr
Contributor

oxddr commented Mar 26, 2020 via email

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 26, 2020