
Ensure taskErr channel is buffered #85

Closed
wants to merge 5 commits into from
Conversation

archisgore
Copy link
Contributor

The Problem:

Specifically under ClusterProvider.CreateCluster, there are four
tasks that need to complete.

The first two tasks run in parallel. The next two run serially,
but are not gated on the success of the first two.

This means that if more than one task fails, the cluster creation
process can get stuck attempting to write to the unbuffered
taskErr channel.

The operational fix:

Buffer the channel with 5 slots for errors, ensuring there
are no surprises.

eksctl is not a high-performance app, so having a buffer
is fine.

The systematic fix:

When writing to the channel, provide a default escape clause
that allows a failed write to be noticed and warned about. This
prevents silent hangs.

@rade
Copy link
Contributor

rade commented Jun 27, 2018

Hmm. Perhaps CreateCluster should return []error. That would make for a cleaner API and avoid the channel sizing problems, though it does mean errors cannot be reported incrementally. The code currently doesn't do the latter; in fact another way to address the problem would be to implement that functionality - reading errors from errChan from a separate go-routine and reporting them would avoid the blocking.

@archisgore
Copy link
Contributor Author

Can't return []error because the polling for stack-completion happens in a goroutine.

I think having another go-routine to read errors off the channel as they happen is the other viable option. Let me know if that's what you want. I can make the changes.

@rade
Copy link
Contributor

rade commented Jun 27, 2018

Can't return []error because the polling for stack-completion happens in a goroutine.

I don't follow. runCreateTask could return an error slice. Yes, it would use a channel internally, but it knows the bound, so can ensure it is not blocking.

Mind you, the whole error handling is a little weird - why do task functions take a chan error and return error?

@archisgore
Copy link
Contributor Author

Oh, I see what you're saying. So you want to surface the output array of runCreateTask by returning it (in the error case). I can do that too.

About the chan error and returning error, I am very new to this codebase. Just started playing with eksctl 8 hours ago. I'm assuming mostly it's to capture both sync and async errors. I haven't seen the commit history yet, but I would bet the first implementation was purely sync, and then the channel was used to fork off go-routines.

If you'd like I'm happy to send all errors out through the channel for consistency.

@errordeveloper
Copy link
Contributor

I believe this code can be simplified a lot.

@archisgore so this will fix the silence that we are seeing e.g. in #75? If so, I'd be happy to merge this with the expectation that we will simplify this code later on.

@archisgore
Copy link
Contributor Author

Yes, this change will fix that too. And I agree, I'll clean up the code. I've got a lot of PRs coming out. I'm using this as a production scripting tool, so I'll be around for any technical debt payback for sure.

@errordeveloper
Copy link
Contributor

@rade what do you think?

@archisgore archisgore mentioned this pull request Jun 27, 2018
@archisgore
Copy link
Contributor Author

Already began work for a revamp: #87

@rade
Copy link
Contributor

rade commented Jun 28, 2018

@rade what do you think?

I am not at all happy with a) the hard-coded '5', since that requires knowledge of what happens under the covers, and b) the non-blocking sends, since these will produce log messages for what is essentially a bug in the code.

The quickest way to address these, which doesn't require a revamping/cleaning-up of the whole error handling, would be to read from errChan in a separate go-routine, thus avoiding the blocking.

Instead of buffering the channel, it is read
from a separate go-routine for errors.

In order to achieve this:
1. CreateCluster is launched in a go-routine.

2. The main go routine now blocks on reading errors, unless,

3. The channel is closed, at which point it'll get a nil error back
   and break out of the error-listening loop.
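The three steps in the commit message above amount to the following pattern (illustrative stand-ins, not eksctl's actual `CreateCluster`): with a dedicated reader always ready, the error channel no longer needs a buffer at all.

```go
package main

import "fmt"

// createCluster stands in for ClusterProvider.CreateCluster: it reports
// failures on taskErr and closes the channel when it is done.
func createCluster(taskErr chan error) {
	defer close(taskErr)
	for i := 0; i < 3; i++ {
		taskErr <- fmt.Errorf("task %d failed", i)
	}
}

func main() {
	taskErr := make(chan error) // unbuffered, but a reader is always ready
	go createCluster(taskErr)   // step 1: launch in a go-routine

	// Steps 2-3: block reading errors until the channel is closed.
	// (A plain receive on a closed channel yields a nil error; range
	// simply exits the loop.)
	count := 0
	for err := range taskErr {
		fmt.Println("error:", err)
		count++
	}
	fmt.Println(count) // 3
}
```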
@archisgore
Copy link
Contributor Author

archisgore commented Jun 28, 2018

You got it. :-) No more buffering, and separate go-routine to read the error channel. Ready for review. Let me know what you think.

Checking for blocked channels is no longer necessary, since
the reader is in a different go-routine.
pkg/eks/api.go Outdated
@@ -137,10 +137,12 @@ func (c *ClusterProvider) runCreateTask(tasks map[string]func(chan error) error,
logger.Debug("task %q started", tn)
errs := make(chan error)
if err := task(errs); err != nil {
logger.Debug("task %q failed to start due to error %#v", tn, err)


// read any errors (it only gets non-nil errors)
for err := range taskErr {
if err == nil {
break
}


@archisgore
Copy link
Contributor Author

All changes you asked for are implemented. I only need your preference on what to do with the multiple-errors scenario, and I'll take care of it.

@rade
Copy link
Contributor

rade commented Jun 28, 2018

I only need your preference on what to do with the multiple-errors scenario, and I'll take care of it.

We could use https://github.com/hashicorp/go-multierror, perhaps.

Also, as I mentioned in my review, this code does need some comment as to what is going on, and why.

@archisgore
Copy link
Contributor Author

Yes absolutely. I'll add comments and everything. I just needed to know what you wanted, so that I can document it correctly. I'll use go-multierror as you requested.

@rade
Copy link
Contributor

rade commented Jun 28, 2018

I'll use go-multierror as you requested.

suggested ;) I haven't used it myself before, so there may be gotchas.

Btw, please run some before/after manual tests for the 'multiple errors' case and post the output here - i.e. it should block in 'before' (i.e. on master) and print out multiple errors in 'after' (i.e. on this branch).

@rade
Copy link
Contributor

rade commented Jul 1, 2018

btw, ...

This means that if more than one task fails, the cluster creation process can get stuck attempting to write to the unbuffered taskErr channel.

Given that the channel is entirely unbuffered, won't the cluster creation get stuck even when just a single task fails?

@archisgore
Copy link
Contributor Author

I'm going to let someone else fix this. Sorry about all the churn. I don't have the time on my schedule to see this to completion.

@archisgore archisgore closed this Jul 1, 2018
@errordeveloper
Copy link
Contributor

@archisgore looks like go ctl.CreateCluster(taskErr) is indeed the simplest fix for this. I am probably misunderstanding something about channels, but what I do understand very well is that this code is unnecessarily complicated anyway, which is my fault! I'll apply this simple fix along with a few other changes, then cut a release and consider simplifying the code after that (we should be able to use a single nested CloudFormation stack, and thereby reduce the need for most of the Go code we have around the CloudFormation and EKS APIs). Thanks for your contributions so far; despite not getting anything merged, you helped a lot!

errordeveloper pushed a commit that referenced this pull request Jul 4, 2018
errordeveloper pushed a commit that referenced this pull request Jul 4, 2018
jstrachan pushed a commit that referenced this pull request Jul 25, 2018
torredil pushed a commit to torredil/eksctl that referenced this pull request May 20, 2022