-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync -release-blocking dashboards with kubernetes/release/lib/releaselib.sh #3028
Sync -release-blocking dashboards with kubernetes/release/lib/releaselib.sh #3028
Conversation
f0f1d16
to
d6ac51e
Compare
testgrid/config/config.yaml
Outdated
@@ -1990,8 +1990,40 @@ dashboards: | |||
- name: periodic-kubeadm-gce-1.6 | |||
test_group_name: periodic-kubernetes-e2e-kubeadm-gce-1-6 | |||
|
|||
# These are the release *blocking* tests. These provide a valid binary signal | |||
# to gate releases (alpha, beta, official). | |||
# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit pessimistic. We do sometimes get a go
signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you mean pessimistic? Sometimes we get a 'go' signal from what? The current go/no go is using the list in releaselib.sh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was being too cute. There is a typo that says the two signals are no
or nogo
.
testgrid/config/config.yaml
Outdated
# Matches the list on https://github.com/kubernetes/test-infra/issues/2029 | ||
- name: release-1.6-blocking | ||
- name: release-1.6-blocking-hard-soft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be just -soft
?
d6ac51e
to
9e8b119
Compare
This sounds like a step in the right direction to me. One concern I have is that after In the past, I've accomplished that by temporarily commenting out a line in |
@enisoc That is addressed in kubernetes/release#341. |
ada9e22
to
b37f1c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kubernetes/sig-release-pr-reviews
testgrid/config/config.yaml
Outdated
# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo | ||
# signal for releases. | ||
# The first test/suite in the list is the primary suite for release purposes | ||
- name: release-1.6-blocking-hard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe leave the name of the "blocking-hard" groups as-is, so we don't break bookmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Not stuck on the names. Suggestions welcome. I can maybe remove the -soft and call the -hard something else. Maybe -actual ?
testgrid/config/config.yaml
Outdated
# Matches the list on https://github.com/kubernetes/test-infra/issues/2029 | ||
- name: release-1.6-blocking | ||
- name: release-1.6-blocking-soft |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between the "soft" blocking tests and the rest of the tests that are not blocking in any way? why do we need to differentiate them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we don't? The "release blocking" set came about as a way for @kubernetes/kubernetes-release-managers to keep an eye on "release blocking" tests, however, that list did not take into account the existing actual set of "release blocking" tests defined in releaselib.sh.
Ideally we'd have ONE set of release-blocking tests (per release) that were 1) not flaky 2) can produce a true go/nogo gate for a release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should just run with that and redefine blocking to be exactly what it should be. The other tests that folks want to eyeball for whatever reason can be eyeballed elsewhere. @kubernetes/kubernetes-release-managers can I get an ack on that so that this doesn't get reverted as soon as it goes through? Let's decide now.
b37f1c5
to
d7bdd7b
Compare
The split idea doesn't seem to have much traction. Let's see how everyone likes just getting the -release-blocking dashboards in sync with what have actually been the release blocking tests for a year and a half now. |
5c1ae1b
to
419b534
Compare
@kubernetes/kubernetes-release-managers I know this is something you're currently working around. It would be good to get this in sooner than later. Please have a look and lgtm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchen1107 could you also take a look at the list?
testgrid/config/config.yaml
Outdated
- name: release-1.7-blocking | ||
dashboard_tab: | ||
- name: build-1.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the build test? Actually i think the build test should be the primary test. cc @krzyzacy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the replies, @caesarxuchao! That does seem a reasonable primary test suite. The current list of suites hasn't been updated in some months and many new suites have been created, so a full audit is due. If noone objects, we'll make the build job primary.
test_group_name: ci-kubernetes-federation-build-1.7 | ||
- name: gce-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gce-release-1-7 | ||
- name: gci-gce-release-1-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchen1107 do we need all the gci-, gke-, gci-gke- version of all the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure? This is up to all of you @kubernetes/kubernetes-release-managers. Remember the goal here when coming up with a set of HARD gating tests, is that you can get a passing set with some regularity as the point is to block the release. Don't add anything here that's just "been failing forever". And the suites may differ between branches (master vs. release-*).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we need all of them. But in long term, I wish we can remove gke related or treat them differently in gke release cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of those? Can you be more specific? There are LOTS of suites that fit the gci variant. Are all of those passing enough to provide a good signal at this point? Should whatever that list ends up being be the same for 1.7 and master (and 1.6 for that matter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean all gci-,gke-,gci-gke- tests that were in the 1.7-blocking should remain there.
testgrid/config/config.yaml
Outdated
- name: gke-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gke-release-1-7 | ||
- name: gci-gke-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gci-gke-release-1-7 | ||
- name: aws-release-1-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep one aws suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, for master and all "blocking" sets?
testgrid/config/config.yaml
Outdated
test_group_name: ci-kubernetes-e2e-gke-serial-release-1-7 | ||
- name: gci-gke-serial-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gci-gke-serial-release-1-7 | ||
- name: gce-federation-release-1-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep a federation test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, for master and all "blocking" sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only keep federation for master and 1.7+. For 1.6 and lower, federation test is permanently broken and WNF.
testgrid/config/config.yaml
Outdated
test_group_name: ci-kubernetes-e2e-gci-gke-serial-release-1-7 | ||
- name: gce-federation-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gce-federation-release-1-7 | ||
- name: gce-alpha-features-release-1-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep the alpha features, thought it's failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what you probably don't want then in a blocking set? You can always check other tabs/dashboards for other "suites of interest", but something that's just always red doesn't belong here.
testgrid/config/config.yaml
Outdated
test_group_name: ci-kubernetes-e2e-gce-scalability-release-1-7 | ||
- name: gci-gce-scalability-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gci-gce-scalability-release-1-7 | ||
- name: gci-gce-ingress-release-1-7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchen1107 do we need the ingress test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
testgrid/config/config.yaml
Outdated
test_group_name: ci-kubernetes-e2e-gke-reboot-release-1-7 | ||
- name: gci-gke-reboot-release-1-7 | ||
test_group_name: ci-kubernetes-e2e-gci-gke-reboot-release-1-7 | ||
- name: soak-gce-1-7-deploy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the soak test? They are all failing.
testgrid/config/config.yaml
Outdated
test_group_name: ci-kubernetes-soak-gci-gce-1-7-test | ||
- name: gce-gpu-1-7 | ||
test_group_name: ci-kubernetes-e2e-gce-gpu-1-7 | ||
- name: gce-kubeadm-1.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the 1.6 parts, as long as the noted tests are moved to 1.6-all so they don't disappear.
testgrid/config/config.yaml
Outdated
# Matches the list on https://github.com/kubernetes/test-infra/issues/2029 | ||
# These are the release *blocking* tests. These provide a valid binary signal | ||
# to gate releases (alpha, beta, official). | ||
# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be go/nogo
.
test_group_name: ci-kubernetes-verify-release-1.6 | ||
- name: test-go-1.6 | ||
test_group_name: ci-kubernetes-test-go-release-1.6 | ||
- name: federation-build-1.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to release-1.6-all? It's not there yet and I don't want it to fall out entirely.
test_group_name: ci-kubernetes-e2e-gke-release-1-6 | ||
- name: gci-gke-1.6 | ||
test_group_name: ci-kubernetes-e2e-gci-gke-release-1-6 | ||
- name: kubeadm-gce-1.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to release-1.6-all? It's not there yet and I don't want it to fall out entirely.
e16f00c
to
12fb43a
Compare
Thanks @caesarxuchao and @enisoc. All comments addressed. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for 1.6 parts.
# to gate releases (alpha, beta, official). | ||
# This list is used by kubernetes/release/lib/releaselib.sh to provide go/nogo | ||
# signal for releases. | ||
# The first test/suite in the list is the primary suite for release purposes | ||
- name: release-1.7-blocking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add three more:
- kubelet-serial-gce-e2e-1-7
- gke-gci-gpu-1-7
- gce-gci-auth-1-7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These looks like brand new variants, not just something to move into -blocking. That seems to be outside the scope of this PR. Post-this, anyone can add those new suites and put them into blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not difficult, let's just add them in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added these. As these are brand new test suites and not just moving things around, I have little confidence I did it correctly. Please check carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, none of these tests exists:
https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/kubelet-serial-gce-e2e-1-7
https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-auth-1-7
https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-gpu-1-7
Let's leave them in the TODO. @dchen1107 who should create these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@david-mcmahon let's add these three into the panel. @dchen1107 says there are PRs creating these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't add them until the PR goes in that includes them. Let's just get this in now. We can iterate.
@dchen1107 @caesarxuchao How about we get this in and iterate as needed? I think this is close enough and it sounds like things are pretty fluid still anyway. |
I am slightly weirded out that we're using a config meant for testgrid as the canonical source for release tools. What am I missing? |
@david-mcmahon i'm using the 1.7-blocking panel to monitor the branch health every day so i hope we can get it right in one iteration. Please ping me if any comment is unclear. |
78fce18
to
71ecf88
Compare
@spiffxp it does seem a little odd but on the other hand a) it is what teams use to monitor and b) we cannot add/remove jobs without adding/removing them from testgrid. So it is probably the best place for this to live for now. |
@spiffxp @fejta This PR answers an immediate need based on where the data is currently stored. Could these lists be stored elsewhere? Sure. If you have thoughts/suggestions on how you'd prefer to do this, open a new issue and let's move it. For now this at least syncs up very important lists that provide a key measurement of release health. @kubernetes/kubernetes-release-managers if this looks good, please lgtm so we can move this forward. |
@david-mcmahon it looks like the net change for 1.7-blocking is removed the gpu test? Can we add it back? |
I think it is a good start how it is written now. We have all the unit tests in place -- we can't stop running jobs without also deleting them from testgrid. We can't add new jobs without adding them to testgrid somewhere. If we do add a list somewhere we need to make a unit test to ensure it matches the jobs in testgrid since these are what people monitor on a daily basis. |
@david-mcmahon could you add back the gpu test to 1.7 blocking? Then lgtm. Thanks for bearing with me. |
71ecf88
to
4538584
Compare
@krzyzacy We're probably colliding here. Please check these changes. |
testgrid/config/config.yaml
Outdated
@@ -2053,168 +2055,73 @@ dashboards: | |||
test_group_name: periodic-kubernetes-bazel-build-1-6 | |||
- name: periodic-kubeadm-gce-1.6 | |||
test_group_name: periodic-kubernetes-e2e-kubeadm-gce-1-6 | |||
- name: -kubeadm-gce-1.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe ci-kubeadm-gce-1.6
, or it's actually postsubmit-kubeadm-gce-1.6
@david-mcmahon agree we can iterate, we can merge this first and I'll make follow-up PRs for other 1.7 blocking jobs. |
4538584
to
b96e2ba
Compare
/lgtm |
This is the first step in centralizing test suite definitions.
This splits the current release blocking lists into hard (binary signal) and soft (human ack required) and provides a place for releaselib.sh to consume the list of suites rather than hardcoding them leading to them inevitably drifting out of sync.The "-hard" lists in this PR are a best guess based on a union of the "-soft" lists and what's currently in releaselib.sh.I do not expect this is the final list. Release managers, please chime in here. This is to help you with a simple 'gate' to provide release signal. Ideally we'd have everything in one list vs hard/soft. We'll get there, but in the meantime you'll use both lists to get a reasonable 'hard' signal and a human ack by checking the soft list in conjunction.This simply syncs up the "blocking" suites with what is currently in releaselib.sh.
Fixes kubernetes/release#340
cc @kubernetes/kubernetes-release-managers @dclaar @javier-b-perez