Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry creating kubeclients in FluentD when error #855

Merged
merged 4 commits into from
Sep 3, 2020

Conversation

samjsong
Copy link
Contributor

@samjsong samjsong commented Aug 26, 2020

Edit: New proposal based on Dominik's comments is to retry for ~4 minutes (with exponential backoff) in the configure step. This should hopefully give enough leeway for transient issue, and throw an exception otherwise, causing the Fluentd worker process to die and restart (after having logged the exception). This will make it way easier to understand the issue than the previous Got exception undefined method.

Based on the issues that @frankreno mentioned some customers were facing with

Got exception undefined method _____ for #<String:0x00007ff19eaf4e60> ______

I found that this is because Ruby does a weird thing where if a method returns an exception, Ruby treats that exception as a String type. In this case, inside connect_kubernetes we call create_client which can return an exception. If for some reason we fail to create this client, we store the exception instead of the client - which means that when we try to call client.getEntity, we get the above undefined method for String error.

By creating the client in the configure step, this means that when the fluentd worker exits and restarts, we never re-initialize the client. By moving connect_kubernetes to start, we will at least try to re-initialize the client.

Unfortunately I spent some time trying to figure out why the initial error happens but haven't been able to figure that out, but hopefully this change at least makes it so that fluentd can fix itself if a transient issue ends.

@samjsong samjsong requested review from frankreno, a team, perk-sumo and pmalek-sumo and removed request for a team August 26, 2020 15:51
@sumo-drosiek
Copy link
Contributor

Maybe we could put querying into begin rescue block and reconnect in case of connection is invalid

@samjsong
Copy link
Contributor Author

I think we could also add new begin rescue blocks where we call kubeclient methods to re-init the clients in that same case

Copy link
Contributor

@frankreno frankreno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - as discussed let's get this lined up for a 1.2.1 release.

@frankreno frankreno added this to the v1.2 milestone Sep 1, 2020
@samjsong samjsong changed the title initialize kube api clients at start, not config initialize kubeclients at start; check and reinit when there is an error Sep 2, 2020
@samjsong
Copy link
Contributor Author

samjsong commented Sep 2, 2020

I've made a few changes that I hope will help further

  • return nil in create_clients so that @clients[api_version] will return nil rather than a String of the exception thrown
  • in the events plugin and enriched metadata plugin service monitor, add a nil check on relevant clients. If nil, attempt to recreate the client before proceeding. There's no guarantee this attempt will succeed (especially if there's a legitimate issue), but as these are infinite loops we'll at least try again soon.
  • in fetch_resource, fix the check for client not created. Again, attempt to recreate if nil.

I thought about doing some sort of retry mechanism on client creation, but I wasn't sure how long we'd want to retry - infinitely? If the team thinks adding some sort of retry into create_clients itself, I'd be happy to do that as well

@samjsong
Copy link
Contributor Author

samjsong commented Sep 2, 2020

Hmm, Travis failed even though build passed locally, I’ll check tomorrow

@perk
Copy link
Contributor

perk commented Sep 3, 2020

Travis vs kubeclient error has been fixed upstream by #870 - please rebase :)

@samjsong samjsong force-pushed the ssong-init-client branch 2 times, most recently from fbfc20c to e0c3bf4 Compare September 3, 2020 06:51
Copy link
Contributor

@sumo-drosiek sumo-drosiek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@samjsong samjsong changed the title initialize kubeclients at start; check and reinit when there is an error retry creating kubeclients when error Sep 3, 2020
Copy link
Contributor

@perk-sumo perk-sumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@perk-sumo perk-sumo merged commit 6648b63 into master Sep 3, 2020
@perk-sumo perk-sumo deleted the ssong-init-client branch September 3, 2020 11:21
@perk-sumo perk-sumo changed the title retry creating kubeclients when error Retry creating kubeclients in FluentD when error Sep 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants