
Scalability load test extended to exercise Deployments, DaemonSets, StatefulSets, Jobs, PersistentVolumes, Secrets, ConfigMaps, NetworkPolicies #704

Open
12 of 16 tasks
mm4tt opened this issue Jul 29, 2019 · 8 comments
Assignees
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@mm4tt
Contributor

mm4tt commented Jul 29, 2019

For each resource below, two tasks are tracked: Implemented, and Enabled in CI/CD.

  • Deployments
  • DaemonSets
  • StatefulSets
  • Jobs
  • PersistentVolumes
  • Secrets
  • ConfigMaps
  • NetworkPolicies
@mm4tt
Contributor Author

mm4tt commented Jul 29, 2019

/assign

@mm4tt
Contributor Author

mm4tt commented Jul 31, 2019

I ran a 5K-node test yesterday, using the extended load scenario with Secrets, ConfigMaps, StatefulSets and PVs enabled.
The test passed, but the new Prometheus-based api-call latency measurement failed for a few tuples:

W0731 02:58:04.461] I0731 02:58:04.459614   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:persistentvolumes Subresource: Verb:POST Scope:cluster Latency:perc50: 266.666666ms, perc90: 1.375s, perc99: 2.443099999s Count:553}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459626   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:perc50: 32.018464ms, perc90: 143.243359ms, perc99: 1.663881526s Count:17274382}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459643   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:perc50: 1.215999999s, perc90: 1.472s, perc99: 1.4972s Count:8251}; threshold: 1s
W0731 02:58:04.461] I0731 02:58:04.459655   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:namespaces Subresource: Verb:GET Scope:cluster Latency:perc50: 26.875234ms, perc90: 48.375422ms, perc99: 1.14s Count:5228}; threshold: 1s
W0731 02:58:04.462] I0731 02:58:04.459668   10763 api_responsiveness_prometheus.go:90] APIResponsiveness: WARNING Top latency metric: {Resource:configmaps Subresource: Verb:POST Scope:namespace Latency:perc50: 28.571428ms, perc90: 88.1ms, perc99: 1.02472s Count:18627}; threshold: 1s

I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:
[Prometheus latency graph: POST persistentvolumes]

[Prometheus latency graph: DELETE services]

This is a known problem with the Prometheus api-call latency measurement (it's actually the reason it's currently disabled). @oxddr and @krzysied are working on this and hopefully we'll have a solution soon.

So to summarize, the extended load looks very promising. Once we have a solution to the spike problem in the Prometheus api-call latency measurement, we should be good (or really close) to enable it in CI/CD.
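For context on what this measurement checks, here is a minimal sketch of the kind of per-(resource, subresource, verb, scope) percentile query a Prometheus-based API responsiveness check boils down to. This is an illustration only, not the actual code in api_responsiveness_prometheus.go; the metric name, label filter and window are assumptions.

```go
// Sketch only: the real measurement lives in api_responsiveness_prometheus.go
// and may use different metric names, filters and windows.
package main

import "fmt"

// latencyQuery builds a PromQL query returning the given latency quantile per
// (resource, subresource, verb, scope) tuple, which is then compared against
// the 1s threshold seen in the warnings above.
func latencyQuery(quantile float64, window string) string {
	return fmt.Sprintf(
		`histogram_quantile(%g, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[%s])) by (resource, subresource, verb, scope, le))`,
		quantile, window)
}

func main() {
	fmt.Println(latencyQuery(0.99, "1h"))
}
```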

@mm4tt
Contributor Author

mm4tt commented Jul 31, 2019

After discussing with the team, we agreed that we should be able to enable Secrets and ConfigMaps in the CI/CD tests.

On the other hand, it might be tricky, as we currently have a separate experimental config for the extended load.
I think it will be easier to first implement everything there, then move it out of the experimental directory and make it the default load config, and then gradually enable the new objects via overrides.
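To make the "gradually enable new objects via overrides" idea concrete, here is a minimal sketch assuming the load config exposes boolean template parameters that default to false and can be flipped per job by a small YAML overrides file (e.g. passed to ClusterLoader2 with a flag such as --testoverrides). The parameter names below are illustrative assumptions, not verified names from the config.

```go
// Minimal sketch, assuming per-job overrides are plain key/value YAML that is
// merged over the load config's defaults. Flag names are hypothetical.
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

func main() {
	// Defaults baked into the load config.
	params := map[string]interface{}{
		"ENABLE_CONFIGMAPS": false,
		"ENABLE_SECRETS":    false,
	}

	// Contents of a hypothetical overrides file for one CI/CD job.
	overrides := []byte("ENABLE_CONFIGMAPS: true\nENABLE_SECRETS: true\n")

	parsed := map[string]interface{}{}
	if err := yaml.Unmarshal(overrides, &parsed); err != nil {
		panic(err)
	}
	for k, v := range parsed {
		params[k] = v
	}
	fmt.Println(params) // map[ENABLE_CONFIGMAPS:true ENABLE_SECRETS:true]
}
```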

@oxddr
Contributor

oxddr commented Jul 31, 2019

I checked the Prometheus graphs for that run, and it looks like the Prometheus-based api-call latency measurement was broken by single spikes (happening around log rotation) in all cases:

(...)

This is a known problem with the Prometheus api-call latency measurement (it's actually the reason it's currently disabled). janluk and Krzysztof Siedlecki are working on this and hopefully we'll have a solution soon.

For the record, the fact that the Prometheus-based API call latency SLO was violated may actually be valid. But I understand that SLO violations caused by logrotate are orthogonal to the changes you made.

The Prometheus-based measurement is close to the SLO definition and thus is stricter and more prone to violations caused by spikes.
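One plausible way a short burst of slow requests can trip a percentile check is when the quantile is effectively evaluated over short windows rather than the whole run: a spike that is negligible globally can dominate one window. The toy numbers below are made up and this is only an illustration of that spike sensitivity, not a description of how the measurement actually aggregates.

```go
// Toy numbers (made up) showing how a short spike that barely moves a
// whole-run p99 can dominate a per-window p99.
package main

import (
	"fmt"
	"sort"
)

func p99(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return s[int(0.99*float64(len(s)-1))]
}

func main() {
	window := func(spiky bool) []float64 {
		w := make([]float64, 1000)
		for i := range w {
			w[i] = 0.05 // 50ms baseline
			if spiky && i < 30 {
				w[i] = 2.5 // a brief burst of slow requests, e.g. around logrotate
			}
		}
		return w
	}

	var wholeRun []float64
	worstWindow := 0.0
	for i := 0; i < 11; i++ {
		w := window(i == 5) // one spiky window out of eleven
		wholeRun = append(wholeRun, w...)
		if q := p99(w); q > worstWindow {
			worstWindow = q
		}
	}
	fmt.Printf("p99 over the whole run: %.2fs\n", p99(wholeRun)) // ~0.05s
	fmt.Printf("worst per-window p99:   %.2fs\n", worstWindow)   // 2.50s
}
```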

mm4tt added a commit to mm4tt/perf-tests that referenced this issue Jul 31, 2019
This is most likely a no-op until we turn on some Network Policy Provider that will start enforcing these network policies.

It should be pretty straightforward to turn on Calico both in GKE and in GCE.
This should be done separately to isolate any potential performance
impact of turning it on.

Ref. kubernetes#704
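For reference, this is the kind of object involved: a hedged client-go sketch that creates a simple NetworkPolicy. Without an enforcing provider such as Calico it is purely an API/etcd object with no dataplane effect, which is why the commit above is expected to be a no-op. The namespace, name and selectors are made up; this is not the ClusterLoader2 code.

```go
// Hypothetical example: create a NetworkPolicy allowing ingress from pods in
// the same namespace. Names and selectors are illustrative only.
package main

import (
	"context"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "allow-from-same-namespace"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{},
				}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}
	if _, err := client.NetworkingV1().NetworkPolicies("test-namespace").
		Create(context.TODO(), policy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```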
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 2, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 2, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 2, 2019
This also merges experimental load with the real load test.

The "knob" has been enabled in presubmits in kubernetes/test-infra#14166.

Ref. kubernetes#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 3, 2019
This also merges experimental load with the real load test.

The "knob" has been enabled in presubmits in kubernetes/test-infra#14166.

Ref. kubernetes#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 3, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 4, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 12, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 16, 2019
Will be keeping an eye on the next runs and roll back / disable for some
jobs if needed.

Ref. kubernetes/perf-tests#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 16, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 16, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Sep 17, 2019
Similarly to the other new resources, I will be keeping an eye on the next runs and roll back / disable for some jobs if needed.

kubernetes/perf-tests#704
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
The only tricky part here is deleting PVs that are created via
StatefulSets. These PVs are not automatically deleted when StatefulSets
are deleted. Because of that, I extended the ClusterLoader Phase API to
allow deleting objects that weren't created directly via CL2. The way it
works is that once we detect a new object, if a certain option is set,
we issue a List request to find the number of replicas.

Ref. kubernetes#704
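A hedged sketch of the leftover-PV problem described above: PVCs created from a StatefulSet's volumeClaimTemplates (and, given a Delete reclaim policy, the PVs bound to them) survive deletion of the StatefulSet and have to be found and deleted explicitly. This is an illustration only, not the ClusterLoader2 Phase API change; the label selector and names are assumptions.

```go
// Hypothetical cleanup: list and delete PVCs left behind by a deleted
// StatefulSet so that their bound PVs get released and deleted as well.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cleanupStatefulSetPVCs(client kubernetes.Interface, namespace, appLabel string) error {
	pvcs, err := client.CoreV1().PersistentVolumeClaims(namespace).List(
		context.TODO(), metav1.ListOptions{LabelSelector: "app=" + appLabel})
	if err != nil {
		return err
	}
	for _, pvc := range pvcs.Items {
		fmt.Printf("deleting leftover PVC %s/%s\n", namespace, pvc.Name)
		// With a Delete reclaim policy, removing the PVC also deletes the bound PV.
		if err := client.CoreV1().PersistentVolumeClaims(namespace).Delete(
			context.TODO(), pvc.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := cleanupStatefulSetPVCs(kubernetes.NewForConfigOrDie(config), "test-namespace", "example"); err != nil {
		panic(err)
	}
}
```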
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 17, 2019
mm4tt added a commit to mm4tt/perf-tests that referenced this issue Sep 18, 2019
jkaniuk pushed a commit to jkaniuk/perf-tests that referenced this issue Oct 3, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2019
@mm4tt
Contributor Author

mm4tt commented Dec 27, 2019

/remove-lifecycle stale

We hope to get back to this in Q1 2020.
In general, this is done except for NetworkPolicies. They are implemented as well, but we need to resolve some issues in Calico before enabling them.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2020
@oxddr
Contributor

oxddr commented Mar 26, 2020 via email

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 26, 2020