Do not allow EC2 instance ID NotFound to succeed tagging #674

Merged: k8s-ci-robot merged 1 commit into kubernetes:master on Oct 14, 2023

Conversation

@ndbaker1 ndbaker1 commented Oct 13, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Removes the graceful handling of the InvalidInstanceID.NotFound error when attempting to tag an EC2 instance that has not fully come up. With that handling in place, we've seen the tagging controller misleadingly exit successfully without actually tagging the instance, and without re-queueing the work item to (ideally) execute again once the instance becomes visible.

example log feed:

tags.go:326] Couldn't find resource when trying to tag it hence skipping it, InvalidInstanceID.NotFound: The instance ID 'i-***' does not exist status code: 400, request id: ***
tagging_controller.go:299] Successfully tagged i-*** with map[aws:eks:cluster-name:***]. Labeling the nodes with tagging controller labels now.
tagging_controller.go:305] Successfully labeled node ip-***.compute.internal with map[k8s.io/cloud-provider-aws:***].

The graceful handling is still correct for the untag action: removing a tag from a non-existent instance is a no-op, so no changes are needed there.
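To make the asymmetry between the two paths concrete, here is a minimal sketch, not the actual code in tags.go. The apiError type, tagResource, and untagResource are hypothetical stand-ins (the real code works with AWS SDK errors); only the InvalidInstanceID.NotFound error code and the intent of isAWSErrorInstanceNotFound come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// apiError is a hypothetical stand-in for the AWS SDK error type;
// only the error code matters for this sketch.
type apiError struct{ code, msg string }

func (e *apiError) Error() string { return e.code + ": " + e.msg }

// isInstanceNotFound mirrors the intent of isAWSErrorInstanceNotFound.
func isInstanceNotFound(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.code == "InvalidInstanceID.NotFound"
}

// tagResource propagates every error, including NotFound, so the
// caller can re-queue the work item and try again later.
func tagResource(create func() error) error {
	if err := create(); err != nil {
		return fmt.Errorf("tagging failed: %w", err)
	}
	return nil
}

// untagResource treats NotFound as success: there is nothing left to untag.
func untagResource(del func() error) error {
	err := del()
	if err == nil || isInstanceNotFound(err) {
		return nil
	}
	return fmt.Errorf("untagging failed: %w", err)
}

func main() {
	notFound := func() error {
		return &apiError{"InvalidInstanceID.NotFound", "The instance ID 'i-***' does not exist"}
	}
	fmt.Println("tag:", tagResource(notFound))
	fmt.Println("untag:", untagResource(notFound))
}
```

Wrapping with %w keeps the original error inspectable, so a caller could still distinguish NotFound from other failures if it wanted to log or count them separately.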

It's worth mentioning that the initial PR to gracefully handle this error (#448) aimed to fix all cases discussed in issue #444, where the untracked InvalidInstanceID.NotFound errors were valid failure modes in the context of instance termination.

Which issue(s) this PR fixes: N/A

Special notes for your reviewer: N/A

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 13, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: ndbaker1 / name: Nick Baker (805b07f)

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 13, 2023
@k8s-ci-robot
Contributor

Welcome @ndbaker1!

It looks like this is your first PR to kubernetes/cloud-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 13, 2023
@k8s-ci-robot
Contributor

Hi @ndbaker1. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 13, 2023
if isAWSErrorInstanceNotFound(err) {
	klog.Infof("Couldn't find resource when trying to tag it hence skipping it, %v", err)
	return nil
}
Contributor

So if we return this error instead of silencing it, the workItem will be re-queued and presumably we'll successfully tag the instance after the API becomes consistent?

How long did we observe that to take in this scenario?

Contributor Author

yep, that's the intention 👍

from the logs we have for this event, CCM executes the tagging work item almost exactly when the instance launches. given that when experimenting through cli

ID=$(aws ec2 run-instances --image-id ami-07d07d65c47e5aa90 --instance-type t2.micro --query 'Instances[0].InstanceId' --output text)
aws ec2 create-tags --resources "$ID" --tags Key=test,Value=value
aws ec2 describe-tags --filters "Name=resource-id,Values=$ID"

you pretty much can't encounter the issue, i think any amount of retry in place would fix the issue
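
As a rough illustration of why returning the error is enough, the sketch below simulates a queue that re-queues failed items up to a retry limit. It is a simplified stand-in, not the controller's real work queue: workItem, process, maxRetries, and the simulated tag function are all hypothetical, and a real controller would also apply backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound simulates the InvalidInstanceID.NotFound response.
var errNotFound = errors.New("InvalidInstanceID.NotFound")

// workItem is a hypothetical, simplified stand-in for the controller's queued item.
type workItem struct {
	instanceID string
	attempts   int
}

// process drains the queue. Failed items are re-queued until they have
// failed maxRetries times, after which they are dropped.
func process(queue []workItem, maxRetries int, tag func(id string) error) (tagged, dropped []string) {
	for len(queue) > 0 {
		item := queue[0]
		queue = queue[1:]
		if err := tag(item.instanceID); err != nil {
			item.attempts++
			if item.attempts >= maxRetries {
				dropped = append(dropped, item.instanceID)
				continue
			}
			// Re-queue; a real controller would also back off here.
			queue = append(queue, item)
			continue
		}
		tagged = append(tagged, item.instanceID)
	}
	return tagged, dropped
}

func main() {
	calls := 0
	// Simulated EC2 API: the instance becomes visible on the third call.
	tag := func(id string) error {
		calls++
		if calls < 3 {
			return errNotFound
		}
		return nil
	}
	tagged, dropped := process([]workItem{{instanceID: "i-example"}}, 5, tag)
	fmt.Println("tagged:", tagged, "dropped:", dropped, "calls:", calls)
}
```

The retry limit also bounds the cost of an errant Node whose instance never appears: it is attempted a fixed number of times and then dropped.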

@cartermckinnon
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 14, 2023
@cartermckinnon
Contributor

cartermckinnon commented Oct 14, 2023

Change looks fine to me. IIUC, silencing this error in #448 was just an optimization; if there is an errant Node in the API, we'll try to tag it n times, but there's no correctness issue per se?

This will show up in our error metrics, but I think that's appropriate.

@cartermckinnon
Contributor

/retest

@hakman
Member

hakman commented Oct 14, 2023

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 14, 2023
Member

@hakman hakman left a comment

Looks like a good change for better reliability. Thanks @ndbaker1!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 14, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0ef307f into kubernetes:master Oct 14, 2023
11 checks passed
@ndbaker1 ndbaker1 deleted the ec2-tagging branch October 14, 2023 05:46
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2023
…pstream-release-1.28

[release-1.28] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2023
…pstream-release-1.25

[release-1.25] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2023
…pstream-release-1.24

[release-1.24] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2023
…pstream-release-1.27

[release-1.27] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2023
…pstream-release-1.26

[release-1.26] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
k8s-ci-robot added a commit that referenced this pull request Nov 1, 2023
…pstream-release-1.23

[release-1.23] Automated cherry pick of #674: do not allow ec2 instance ID not found in tagging path
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.