-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure taskErr channel is buffered #85
Ensure taskErr channel is buffered #85
Conversation
The Problem: Specifically under ClusterProvider.CreateCluster, there are 4 tasks that need to complete. The first two tasks run in parallel. The next 2 run serially, but are not gated against success of the first two. This means that if more than one task fails, the cluster creation process can get stuck attempting to write to the unbuffered taskErr channel. The operational fix: Buffer the channel with 5 slots for errors, ensuring there are no surprises. eksctl is not some high-performance app, so having a buffer is useful. The systematic fix: When writing to the channel, provide a default escape clause that allows a failed write to be noticed, and warned of. This will prevent silent hangings.
Hmm. Perhaps |
Can't return []error because the polling for stack-completion happens in a goroutine. I think having another go-routine to read errors off the channel as they happen is the other viable option. Let me know if that's what you want. I can make the changes. |
I don't follow. Mind you, the whole error handling is a little weird - why do task functions take a |
Oh I see what you're saying. So then... you want to wire out the output array of runCreateTask by either returning from it (in case of error). I can do that too. About the If you'd like I'm happy to send all errors out through the channel for consistency. |
I believe this code can be simplified a lot. @archisgore so this will fix silence that we are seeing e.g. in #75? If so, I'd be happy to merge this with an outlook that we will simplify this code later on. |
Yes, this change will fix that too. And I agree, I'll clean up the code. I've got a lot of PRs coming out. I'm using this as a production scripting tool, so I'll be around for any technical debt payback for sure. |
@rade what do you think? |
Already began work for a revamp: #87 |
I am not all happy with a) the hard-coded '5', since that requires knowledge of what happens under the covers, and b) the non-blocking sends since these will produce log messages for what is essentially a bug in the code. The quickest way to address these, which doesn't require a revamping/cleaning-up of the whole error handling, would be to read from |
Instead of buffering channel, the channel is instead read from a separate go-routine for errors. In order to achieve this: 1. CreateCluster is launched in a go-routine. 2. The main go routine now blocks on reading errors, unless, 3. The channel is closed, at which point it'll get a nil error back and break out of the error-listening loop.
You got it. :-) No more buffering, and separate go-routine to read the error channel. Ready for review. Let me know what you think. |
Checking for blocked channels is no longer necessary, since the reader is in a different go-routine.
pkg/eks/api.go
Outdated
@@ -137,10 +137,12 @@ func (c *ClusterProvider) runCreateTask(tasks map[string]func(chan error) error, | |||
logger.Debug("task %q started", tn) | |||
errs := make(chan error) | |||
if err := task(errs); err != nil { | |||
logger.Debug("task %q failed to start due to error %#v", tn, err) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
cmd/eksctl/create.go
Outdated
// read any errors (it only gets non-nil errors) | ||
for err := range taskErr { | ||
if err == nil { | ||
break | ||
} |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
All changes you asked for implemented. I only need your preference on what to do with multiple errors scenario, and I'll take care of it. |
We could use https://github.com/hashicorp/go-multierror, perhaps. Also, as I mentioned in my review, this code does need some comment as to what is going on, and why. |
Yes absolutely. I'll add comments and everything. I just needed to know what you wanted, so that I can document it correctly. I'll use go-multierror as you requested. |
suggested ;) I haven't used it myself before, so there may be gotchas. Btw, please run some before/after manual tests for the 'multiple errors' case and post the output here - i.e. it should block in 'before' (i.e. on master) and print out multiple errors in 'after' (i.e. on this branch). |
btw, ...
Given that the channel is entirely unbuffered, won't the cluster creation get stuck even when just a single task fails? |
I'm going to let someone else fix this. Sorry about all the churn. I don't have the time on my schedule to see this to completion. |
@archisgore looks like |
(cherry-picked from #85)
(cherry-picked from #85)
Refactor `GetDevice` method.
The Problem:
Specifically under ClusterProvider.CreateCluster, there are 4
tasks that need to complete.
The first two tasks run in parallel. The next 2 run serially,
but are not gated against success of the first two.
This means that if more than one task fails, the cluster creation
process can get stuck attempting to write to the unbuffered
taskErr channel.
The operational fix:
Buffer the channel with 5 slots for errors, ensuring there
are no surprises.
eksctl is not some high-performance app, so having a buffer
is useful.
The systematic fix:
When writing to the channel, provide a default escape clause
that allows a failed write to be noticed, and warned of. This
will prevent silent hangings.