
Ensure taskErr channel is buffered #85

Closed
wants to merge 5 commits into from
Conversation

archisgore
Copy link
Contributor

The Problem:

Specifically under ClusterProvider.CreateCluster, there are four
tasks that need to complete.

The first two tasks run in parallel. The next two run serially,
but are not gated on the success of the first two.

This means that if more than one task fails, the cluster creation
process can get stuck attempting to write to the unbuffered
taskErr channel.

The operational fix:

Buffer the channel with 5 slots for errors, ensuring there
are no surprises.

eksctl is not a high-performance app, so having a buffer
is fine.

The systematic fix:

When writing to the channel, provide a default escape clause
that allows a failed write to be noticed and warned about. This
prevents silent hangs.

@rade
Copy link
Contributor

rade commented Jun 27, 2018

Hmm. Perhaps CreateCluster should return []error. That would make for a cleaner API and avoid the channel sizing problems, though it does mean errors cannot be reported incrementally. The code currently doesn't do the latter; in fact another way to address the problem would be to implement that functionality - reading errors from errChan from a separate go-routine and reporting them would avoid the blocking.

@archisgore
Copy link
Contributor Author

Can't return []error because the polling for stack-completion happens in a goroutine.

I think having another go-routine to read errors off the channel as they happen is the other viable option. Let me know if that's what you want. I can make the changes.

@rade
Copy link
Contributor

rade commented Jun 27, 2018

Can't return []error because the polling for stack-completion happens in a goroutine.

I don't follow. runCreateTask could return an error slice. Yes, it would use a channel internally, but it knows the bound, so can ensure it is not blocking.

Mind you, the whole error handling is a little weird - why do task functions take a chan error and return error?

@archisgore
Copy link
Contributor Author

Oh, I see what you're saying. So you want to surface the output array of runCreateTask by returning it (in the error case). I can do that too.

About the chan error and returning error, I am very new to this codebase. Just started playing with eksctl 8 hours ago. I'm assuming mostly it's to capture both sync and async errors. I haven't seen the commit history yet, but I would bet the first implementation was purely sync, and then the channel was used to fork off go-routines.

If you'd like I'm happy to send all errors out through the channel for consistency.

@errordeveloper
Copy link
Contributor

I believe this code can be simplified a lot.

@archisgore so this will fix the silence that we are seeing e.g. in #75? If so, I'd be happy to merge this with the expectation that we will simplify this code later on.

@archisgore
Copy link
Contributor Author

Yes, this change will fix that too. And I agree, I'll clean up the code. I've got a lot of PRs coming out. I'm using this as a production scripting tool, so I'll be around for any technical debt payback for sure.

@errordeveloper
Copy link
Contributor

@rade what do you think?

@archisgore archisgore mentioned this pull request Jun 27, 2018
@archisgore
Copy link
Contributor Author

Already began work for a revamp: #87

@rade
Copy link
Contributor

rade commented Jun 28, 2018

@rade what do you think?

I am not at all happy with a) the hard-coded '5', since that requires knowledge of what happens under the covers, and b) the non-blocking sends, since these will produce log messages for what is essentially a bug in the code.

The quickest way to address these, which doesn't require a revamping/cleaning-up of the whole error handling, would be to read from errChan in a separate go-routine, thus avoiding the blocking.

Instead of buffering the channel, it is read
from a separate go-routine for errors.

In order to achieve this:
1. CreateCluster is launched in a go-routine.

2. The main go routine now blocks on reading errors, unless,

3. The channel is closed, at which point it'll get a nil error back
   and break out of the error-listening loop.
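The three steps in the commit message above amount to the following pattern (illustrative stand-ins, not eksctl's actual `CreateCluster`): with a dedicated reader always ready, the error channel no longer needs a buffer at all.

```go
package main

import "fmt"

// createCluster stands in for ClusterProvider.CreateCluster: it reports
// failures on taskErr and closes the channel when it is done.
func createCluster(taskErr chan error) {
	defer close(taskErr)
	for i := 0; i < 3; i++ {
		taskErr <- fmt.Errorf("task %d failed", i)
	}
}

func main() {
	taskErr := make(chan error) // unbuffered, but a reader is always ready
	go createCluster(taskErr)   // step 1: launch in a go-routine

	// Steps 2-3: block reading errors until the channel is closed.
	// (A plain receive on a closed channel yields a nil error; range
	// simply exits the loop.)
	count := 0
	for err := range taskErr {
		fmt.Println("error:", err)
		count++
	}
	fmt.Println(count) // 3
}
```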
@archisgore
Copy link
Contributor Author

archisgore commented Jun 28, 2018

You got it. :-) No more buffering, and separate go-routine to read the error channel. Ready for review. Let me know what you think.

Checking for blocked channels is no longer necessary, since
the reader is in a different go-routine.
pkg/eks/api.go Outdated
@@ -137,10 +137,12 @@ func (c *ClusterProvider) runCreateTask(tasks map[string]func(chan error) error,
logger.Debug("task %q started", tn)
errs := make(chan error)
if err := task(errs); err != nil {
logger.Debug("task %q failed to start due to error %#v", tn, err)


// read any errors (it only gets non-nil errors)
for err := range taskErr {
if err == nil {
break
}


@archisgore
Copy link
Contributor Author

All changes you asked for are implemented. I only need your preference on what to do with the multiple-errors scenario, and I'll take care of it.

@rade
Copy link
Contributor

rade commented Jun 28, 2018

I only need your preference on what to do with the multiple-errors scenario, and I'll take care of it.

We could use https://github.com/hashicorp/go-multierror, perhaps.

Also, as I mentioned in my review, this code does need some comment as to what is going on, and why.

@archisgore
Copy link
Contributor Author

Yes absolutely. I'll add comments and everything. I just needed to know what you wanted, so that I can document it correctly. I'll use go-multierror as you requested.

@rade
Copy link
Contributor

rade commented Jun 28, 2018

I'll use go-multierror as you requested.

suggested ;) I haven't used it myself before, so there may be gotchas.

Btw, please run some before/after manual tests for the 'multiple errors' case and post the output here - i.e. it should block in 'before' (i.e. on master) and print out multiple errors in 'after' (i.e. on this branch).

@rade
Copy link
Contributor

rade commented Jul 1, 2018

btw, ...

This means that if more than one task fails, the cluster creation process can get stuck attempting to write to the unbuffered taskErr channel.

Given that the channel is entirely unbuffered, won't the cluster creation get stuck even when just a single task fails?

@archisgore
Copy link
Contributor Author

I'm going to let someone else fix this. Sorry about all the churn. I don't have the time on my schedule to see this to completion.

@archisgore archisgore closed this Jul 1, 2018
@errordeveloper
Copy link
Contributor

@archisgore looks like go ctl.CreateCluster(taskErr) is indeed the simplest fix for this. I am probably misunderstanding something about channels, but what I do understand very well is that this code is unnecessarily complicated anyway, which is my fault! I'll apply this simple fix along with a few other changes, then cut a release and consider simplifying the code after that (we should be able to use a single nested CloudFormation stack, and thereby reduce the need for most of the Go code we have around the CloudFormation and EKS APIs). Thanks for your contributions so far; despite not getting anything merged, you helped a lot!

errordeveloper pushed a commit that referenced this pull request Jul 4, 2018
errordeveloper pushed a commit that referenced this pull request Jul 4, 2018
jstrachan pushed a commit that referenced this pull request Jul 25, 2018
torredil pushed a commit to torredil/eksctl that referenced this pull request May 20, 2022