Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync -release-blocking dashboards with kubernetes/release/lib/releaselib.sh #3028

Merged
merged 1 commit into from
Jun 19, 2017

Conversation

david-mcmahon
Copy link
Contributor

@david-mcmahon david-mcmahon commented Jun 12, 2017

This is the first step in centralizing test suite definitions.

This splits the current release blocking lists into hard (binary signal) and soft (human ack required) and provides a place for releaselib.sh to consume the list of suites rather than hardcoding them leading to them inevitably drifting out of sync.

The "-hard" lists in this PR are a best guess based on a union of the "-soft" lists and what's currently in releaselib.sh.

I do not expect this is the final list. Release managers, please chime in here. This is to help you with a simple 'gate' to provide release signal. Ideally we'd have everything in one list vs hard/soft. We'll get there, but in the meantime you'll use both lists to get a reasonable 'hard' signal and a human ack by checking the soft list in conjunction.

This simply syncs up the "blocking" suites with what is currently in releaselib.sh.

Fixes kubernetes/release#340

cc @kubernetes/kubernetes-release-managers @dclaar @javier-b-perez

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2017
@@ -1990,8 +1990,40 @@ dashboards:
- name: periodic-kubeadm-gce-1.6
test_group_name: periodic-kubernetes-e2e-kubeadm-gce-1-6

# These are the release *blocking* tests. These provide a valid binary signal
# to gate releases (alpha, beta, official).
# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit pessimistic. We do sometimes get a go signal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you mean pessimistic? Sometimes we get a 'go' signal from what? The current go/no go is using the list in releaselib.sh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was being too cute. There is a typo that says the two signals are no or nogo.

# Matches the list on https://github.com/kubernetes/test-infra/issues/2029
- name: release-1.6-blocking
- name: release-1.6-blocking-hard-soft
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be just -soft?

@enisoc
Copy link
Member

enisoc commented Jun 12, 2017

This sounds like a step in the right direction to me. One concern I have is that after anago fetches the list from somewhere centralized, we probably need a flag to bypass certain broken hard-blocker tests on a case-by-case basis. We do this often for alphas, less often for betas, and... as for official releases I will plead the fifth.

In the past, I've accomplished that by temporarily commenting out a line in releaselib.sh, but that won't work anymore if it downloads the list.

@david-mcmahon
Copy link
Contributor Author

@enisoc That is addressed in kubernetes/release#341.

@david-mcmahon david-mcmahon force-pushed the release-blocking branch 2 times, most recently from ada9e22 to b37f1c5 Compare June 14, 2017 01:17
Copy link
Member

@ixdy ixdy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @kubernetes/sig-release-pr-reviews

# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo
# signal for releases.
# The first test/suite in the list is the primary suite for release purposes
- name: release-1.6-blocking-hard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe leave the name of the "blocking-hard" groups as-is, so we don't break bookmarks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Not stuck on the names. Suggestions welcome. I can maybe remove the -soft and call the -hard something else. Maybe -actual ?

# Matches the list on https://github.com/kubernetes/test-infra/issues/2029
- name: release-1.6-blocking
- name: release-1.6-blocking-soft
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between the "soft" blocking tests and the rest of the tests that are not blocking in any way? why do we need to differentiate them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we don't? The "release blocking" set came about as a way for @kubernetes/kubernetes-release-managers to keep an eye on "release blocking" tests, however, that list did not take into account the existing actual set of "release blocking" tests defined in releaselib.sh.

Ideally we'd have ONE set of release-blocking tests (per release) that were 1) not flaky 2) can produce a true go/nogo gate for a release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should just run with that and redefine blocking to be exactly what it should be. The other tests that folks want to eyeball for whatever reason can be eyeballed elsewhere. @kubernetes/kubernetes-release-managers can I get an ack on that so that this doesn't get reverted as soon as it goes through? Let's decide now.

@david-mcmahon david-mcmahon changed the title Split up "release-blocking" sets into hard and soft. Sync -release-blocking dashboards with kubernetes/release/lib/releaselib.sh Jun 14, 2017
@david-mcmahon
Copy link
Contributor Author

The split idea doesn't seem to have much traction. Let's see how everyone likes just getting the -release-blocking dashboards in sync with what have actually been the release blocking tests for a year and a half now.

@david-mcmahon david-mcmahon force-pushed the release-blocking branch 4 times, most recently from 5c1ae1b to 419b534 Compare June 14, 2017 23:00
@david-mcmahon
Copy link
Contributor Author

@kubernetes/kubernetes-release-managers I know this is something you're currently working around. It would be good to get this in sooner than later. Please have a look and lgtm.

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchen1107 could you also take a look at the list?

- name: release-1.7-blocking
dashboard_tab:
- name: build-1.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the build test? Actually i think the build test should be the primary test. cc @krzyzacy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the replies, @caesarxuchao! That does seem a reasonable primary test suite. The current list of suites hasn't been updated in some months and many new suites have been created, so a full audit is due. If noone objects, we'll make the build job primary.

test_group_name: ci-kubernetes-federation-build-1.7
- name: gce-release-1-7
test_group_name: ci-kubernetes-e2e-gce-release-1-7
- name: gci-gce-release-1-7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchen1107 do we need all the gci-, gke-, gci-gke- version of all the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure? This is up to all of you @kubernetes/kubernetes-release-managers. Remember the goal here when coming up with a set of HARD gating tests, is that you can get a passing set with some regularity as the point is to block the release. Don't add anything here that's just "been failing forever". And the suites may differ between branches (master vs. release-*).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we need all of them. But in long term, I wish we can remove gke related or treat them differently in gke release cycle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of those? Can you be more specific? There are LOTS of suites that fit the gci variant. Are all of those passing enough to provide a good signal at this point? Should whatever that list ends up being be the same for 1.7 and master (and 1.6 for that matter)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean all gci-,gke-,gci-gke- tests that were in the 1.7-blocking should remain there.

- name: gke-release-1-7
test_group_name: ci-kubernetes-e2e-gke-release-1-7
- name: gci-gke-release-1-7
test_group_name: ci-kubernetes-e2e-gci-gke-release-1-7
- name: aws-release-1-7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep one aws suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for master and all "blocking" sets?

test_group_name: ci-kubernetes-e2e-gke-serial-release-1-7
- name: gci-gke-serial-release-1-7
test_group_name: ci-kubernetes-e2e-gci-gke-serial-release-1-7
- name: gce-federation-release-1-7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep a federation test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for master and all "blocking" sets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only keep federation for master and 1.7+. For 1.6 and lower, federation test is permanently broken and WNF.

test_group_name: ci-kubernetes-e2e-gci-gke-serial-release-1-7
- name: gce-federation-release-1-7
test_group_name: ci-kubernetes-e2e-gce-federation-release-1-7
- name: gce-alpha-features-release-1-7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the alpha features, thought it's failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what you probably don't want then in a blocking set? You can always check other tabs/dashboards for other "suites of interest", but something that's just always red doesn't belong here.

test_group_name: ci-kubernetes-e2e-gce-scalability-release-1-7
- name: gci-gce-scalability-release-1-7
test_group_name: ci-kubernetes-e2e-gci-gce-scalability-release-1-7
- name: gci-gce-ingress-release-1-7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchen1107 do we need the ingress test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

test_group_name: ci-kubernetes-e2e-gke-reboot-release-1-7
- name: gci-gke-reboot-release-1-7
test_group_name: ci-kubernetes-e2e-gci-gke-reboot-release-1-7
- name: soak-gce-1-7-deploy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the soak test? They are all failing.

test_group_name: ci-kubernetes-soak-gci-gce-1-7-test
- name: gce-gpu-1-7
test_group_name: ci-kubernetes-e2e-gce-gpu-1-7
- name: gce-kubeadm-1.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the 1.6 parts, as long as the noted tests are moved to 1.6-all so they don't disappear.

# Matches the list on https://github.com/kubernetes/test-infra/issues/2029
# These are the release *blocking* tests. These provide a valid binary signal
# to gate releases (alpha, beta, official).
# This list is used by kubernetes/release/lib/releaselib.sh to provide no/nogo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be go/nogo.

test_group_name: ci-kubernetes-verify-release-1.6
- name: test-go-1.6
test_group_name: ci-kubernetes-test-go-release-1.6
- name: federation-build-1.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to release-1.6-all? It's not there yet and I don't want it to fall out entirely.

test_group_name: ci-kubernetes-e2e-gke-release-1-6
- name: gci-gke-1.6
test_group_name: ci-kubernetes-e2e-gci-gke-release-1-6
- name: kubeadm-gce-1.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to release-1.6-all? It's not there yet and I don't want it to fall out entirely.

@david-mcmahon david-mcmahon force-pushed the release-blocking branch 2 times, most recently from e16f00c to 12fb43a Compare June 15, 2017 20:01
@david-mcmahon
Copy link
Contributor Author

Thanks @caesarxuchao and @enisoc. All comments addressed. PTAL.

Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for 1.6 parts.

# to gate releases (alpha, beta, official).
# This list is used by kubernetes/release/lib/releaselib.sh to provide go/nogo
# signal for releases.
# The first test/suite in the list is the primary suite for release purposes
- name: release-1.7-blocking
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add three more:

  1. kubelet-serial-gce-e2e-1-7
  2. gke-gci-gpu-1-7
  3. gce-gci-auth-1-7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These looks like brand new variants, not just something to move into -blocking. That seems to be outside the scope of this PR. Post-this, anyone can add those new suites and put them into blocking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not difficult, let's just add them in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these. As these are brand new test suites and not just moving things around, I have little confidence I did it correctly. Please check carefully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-mcmahon let's add these three into the panel. @dchen1107 says there are PRs creating these tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't add them until the PR goes in that includes them. Let's just get this in now. We can iterate.

@david-mcmahon
Copy link
Contributor Author

@dchen1107 @caesarxuchao How about we get this in and iterate as needed? I think this is close enough and it sounds like things are pretty fluid still anyway.

@spiffxp
Copy link
Member

spiffxp commented Jun 15, 2017

I am slightly weirded out that we're using a config meant for testgrid as the canonical source for release tools. What am I missing?

@caesarxuchao
Copy link
Member

@david-mcmahon i'm using the 1.7-blocking panel to monitor the branch health every day so i hope we can get it right in one iteration. Please ping me if any comment is unclear.

@david-mcmahon david-mcmahon force-pushed the release-blocking branch 2 times, most recently from 78fce18 to 71ecf88 Compare June 15, 2017 23:51
@fejta
Copy link
Contributor

fejta commented Jun 16, 2017

@spiffxp it does seem a little odd but on the other hand a) it is what teams use to monitor and b) we cannot add/remove jobs without adding/removing them from testgrid. So it is probably the best place for this to live for now.

@david-mcmahon
Copy link
Contributor Author

@spiffxp @fejta This PR answers an immediate need based on where the data is currently stored. Could these lists be stored elsewhere? Sure. If you have thoughts/suggestions on how you'd prefer to do this, open a new issue and let's move it.

For now this at least syncs up very important lists that provide a key measurement of release health. @kubernetes/kubernetes-release-managers if this looks good, please lgtm so we can move this forward.

@caesarxuchao
Copy link
Member

@david-mcmahon it looks like the net change for 1.7-blocking is removed the gpu test? Can we add it back?

@fejta
Copy link
Contributor

fejta commented Jun 17, 2017

I think it is a good start how it is written now. We have all the unit tests in place -- we can't stop running jobs without also deleting them from testgrid. We can't add new jobs without adding them to testgrid somewhere.

If we do add a list somewhere we need to make a unit test to ensure it matches the jobs in testgrid since these are what people monitor on a daily basis.

@caesarxuchao
Copy link
Member

@david-mcmahon could you add back the gpu test to 1.7 blocking? Then lgtm. Thanks for bearing with me.

@david-mcmahon
Copy link
Contributor Author

@krzyzacy We're probably colliding here. Please check these changes.

@@ -2053,168 +2055,73 @@ dashboards:
test_group_name: periodic-kubernetes-bazel-build-1-6
- name: periodic-kubeadm-gce-1.6
test_group_name: periodic-kubernetes-e2e-kubeadm-gce-1-6
- name: -kubeadm-gce-1.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe ci-kubeadm-gce-1.6, or it's actually postsubmit-kubeadm-gce-1.6

@krzyzacy
Copy link
Member

@david-mcmahon agree we can iterate, we can merge this first and I'll make follow-up PRs for other 1.7 blocking jobs.

@david-mcmahon david-mcmahon merged commit 3fb31fb into kubernetes:master Jun 19, 2017
@krzyzacy
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants