
[full ci] Improve exec error message for a CVM shutdown mid operation #6593

Merged: 14 commits, Nov 3, 2017

Conversation

matthewavery (Contributor):
This adds some error types and propagates them back from the portlayer to the personality. TaskInspect now returns a ConflictError if the supplied ID does not exist and the state of the cVM is powered off. This error case may be too broad, but in TaskInspect we are not sure whether the supplied ID was meant to be an exec ID or a session task ID. The solution is therefore a best guess: if the ID is not found and the container is powered off, we assume the operation has been interrupted. I thought about doing the state check again on the personality side, but the unknown-task-ID issue is already triggered after making it past the original state check in CreateExecTask, so I deemed that the task inspect check might need to be closer to the SoT (feel free to correct me on this as well).
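The best-guess check described above can be sketched roughly like this (a minimal standalone sketch; `ConflictError`, the state constants, and `inspectTask` are illustrative stand-ins, not the actual portlayer API):

```go
// Sketch: treat "unknown task ID + powered-off cVM" as an interrupted
// operation rather than a bad ID. All names here are hypothetical.
package main

import "fmt"

// ConflictError is a stand-in for the error type propagated back to the
// personality so it can map the failure to a 409.
type ConflictError struct{ Msg string }

func (e *ConflictError) Error() string { return e.Msg }

type containerState int

const (
	statePoweredOn containerState = iota
	statePoweredOff
)

// inspectTask mimics the best-guess logic: if the ID is not found and the
// container is powered off, assume the shutdown interrupted the exec.
func inspectTask(tasks map[string]bool, state containerState, id string) error {
	if tasks[id] {
		return nil // task found: normal inspect path
	}
	if state == statePoweredOff {
		return &ConflictError{Msg: fmt.Sprintf("container powered off during execution of task %s", id)}
	}
	return fmt.Errorf("unknown task ID %s", id)
}

func main() {
	tasks := map[string]bool{"exec-1": true}
	fmt.Println(inspectTask(tasks, statePoweredOn, "exec-1"))  // <nil>
	fmt.Println(inspectTask(tasks, statePoweredOff, "exec-2")) // conflict error
}
```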

Additionally, the guest config reload function has been wrapped in a retry that retries on transient errors. We may want to narrow down the potential retry cases; in a first pass I used the waiter.go intermittent-error decider function, since it was designed to retry tasks against vSphere that are inherently transient. Based on discussion in this PR we may change that.

I have also added some logging with some trace.Operation usage; this is pending potential removal per @cgtexmex's review. I originally added it because it made it easier to track what was getting called while I was learning the Exec path of operations. It was also needed to track concurrent calls against the same container.

Note: please keep in mind this was my first exposure to the Exec portion of the code base; if I have done something silly, please let me know. I spent a decent while poking around, trying to understand end to end exactly what was going on.

Fixes #6370

matthewavery (Contributor, Author) commented Oct 19, 2017:

TIL that the docker client calls multiple personality endpoints. So there will still be a concurrent modification error when we report the shutdown on the race-condition edge. The error message now looks like this:

Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Conflict error from portlayer: Cannot complete the operation, container 1f23ba97c6741466d04c2d3e95b3636711f9cf19c12e32737fa3cd7fd47ab411 has been powered off during execution

Makefile Outdated
@@ -235,13 +235,11 @@ misspell: $(MISSPELL)

govet:
@echo checking go vet...
@$(GO) tool vet -all -lostcancel -tests $$(find . -mindepth 1 -maxdepth 1 -type d -not -name vendor)
Contributor: Why are we removing the vet and gas checks?

Contributor Author: We are not :( I had that removed locally since mine misbehaves, and it snuck in... this happens to me occasionally as I get add-happy with git.

caglar10ur (Contributor), Oct 24, 2017: OK, then keep it in your local directory. Why did you do a git add and push it as part of this PR?

@@ -257,7 +258,8 @@ func (c *Container) TaskWaitToStart(cid, cname, eid string) error {

// ContainerExecCreate sets up an exec in a running container.
func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (string, error) {
defer trace.End(trace.Begin(name))
op := trace.NewOperation(context.TODO(), "")
defer trace.End(trace.Begin(fmt.Sprintf("opid=(%s) name=(%s)", op.ID(), name)))
Contributor: Should we add a Stringer to the op so that %s ends up with ID()? (if it is missing)

matthewavery (Contributor, Author), Oct 24, 2017: Like an op.Sprintf? :) Yes, I would love to do that! I think I had a PR a while back that never got merged that had one.

@@ -271,15 +273,19 @@ func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (
if err != nil {
return "", InternalServerError(err.Error())
}

// This does not appear to be working... we may need to do this check further down the stack. it is going to be very race filled.
Contributor: Maybe add a FIXME here. Also, not sure what this comment is trying to say :P

Contributor Author: Ah, that was just an investigative comment :) I thought it was removed, sorry :P I was thinking that it was an early check on state that was far from the source of truth. But I figured it is more of a soft check anyway, in case the user tried to exec an already-off container.

Contributor Author: I am going to remove this completely. It is a good check to have, and we can always handle things downstream as well.

handle, err := c.Handle(id, name)
if err != nil {
log.Error(err.Error())
Contributor: op.Errorf?

}

err := retry.Do(retryFunc, isIntermittentFailure)

Contributor: extra space

caglar10ur (Contributor) left a comment: The op.Sprintf -> fmt.Sprintf one is the blocker, as I don't think it compiles with it.

@@ -222,6 +222,7 @@ func (c *Container) TaskInspect(cid, cname, eid string) (*models.TaskInspectResp
if err != nil {
return nil, err
}

return resp.Payload, nil

Contributor: [minor] extra line :P

@@ -257,7 +258,8 @@ func (c *Container) TaskWaitToStart(cid, cname, eid string) error {

// ContainerExecCreate sets up an exec in a running container.
func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (string, error) {
defer trace.End(trace.Begin(name))
op := trace.NewOperation(context.TODO(), "")
defer trace.End(trace.Begin(op.Sprintf("name=(%s)", name)))
Contributor: op.Sprintf -> fmt.Sprintf

matthewavery (Contributor, Author), Oct 26, 2017: I added the op.Sprintf since I thought that was the Stringer you wanted :)

Contributor Author: Ah, I just saw the below.

Contributor Author: Hmmm 👍 I can make that change. I see what you meant by Stringer now. I still like the Sprintf, though, unless there is an egregious reason not to have it.

Contributor: Personally, I don't think having a Sprintf on op makes sense. fmt already provides that, and that's the standard way of doing it.

Others can chime in if they think this way or that way :)

@@ -381,7 +407,8 @@ func (c *Container) ContainerExecResize(eid string, height, width int) error {
// ContainerExecStart starts a previously set up exec instance. The
// std streams are set up.
func (c *Container) ContainerExecStart(ctx context.Context, eid string, stdin io.ReadCloser, stdout io.Writer, stderr io.Writer) error {
defer trace.End(trace.Begin(eid))
op := trace.NewOperation(ctx, "")
defer trace.End(trace.Begin(fmt.Sprintf("opID=(%s) eid=(%s)", op.ID(), eid)))
Contributor: You can add a Stringer to op and use op here instead of op.ID():

func (o *Operation) String() string {
	return o.id
}

Contributor Author: Ah, I see. So that it has the same behavior as error.

Contributor Author: I am also changing this to use op.Sprintf, since it will tack the header on by default. I wanted to promote the use of op.Sprintf for providing strings to the trace.Begin and trace.End calls, since I still love those and it offers a much cleaner way of adding the Operation header to that logging.

@matthewavery matthewavery changed the title Improve exec error message for a CVM shutdown mid operation [full ci] Improve exec error message for a CVM shutdown mid operation Oct 27, 2017

cgtexmex (Contributor): From my perspective, I wouldn't add pkg/trace/Operation functionality in this PR. #2739 will result in changes to the package, and I'll be adding operations in "strategic" places... so to minimize rework / wasted effort, I'd vote for you to use fmt for now...

matthewavery (Contributor, Author): You got it, @cgtexmex

cgtexmex (Contributor): You have added the creation of an Operation at the beginning of most (if not all) of the exec funcs -- please use op.Debugf consistently. Specifically, thinking about the type switches as part of the error handling -- use op.Debugf and not log.Debugf.

cgtexmex (Contributor) commented Oct 30, 2017:

LGTM

Approved with PullApprove

mhagen-vmware (Contributor) left a comment: Looks good, just a few minor suggestions.

@@ -16,7 +16,6 @@
Documentation Test 1-38 - Docker Exec
Resource ../../resources/Util.robot
Suite Setup Install VIC Appliance To Test Server
Suite Teardown Cleanup VIC Appliance On Test Server
Contributor: Put that back in.

@@ -58,7 +57,7 @@ Exec Echo -t
${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -d ${busybox} /bin/top -d 600
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -td ${busybox} /bin/top -d 600
Contributor: Don't add -t, as this isn't run with a TTY.

${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -d ${busybox} /bin/top -d 600
Contributor: I don't think you need -d 600 here... no reason to create a race, just let it run.

@matthewavery matthewavery force-pushed the concurrent-issues branch 2 times, most recently from 9b0c8a6 to 653f412 Compare November 3, 2017 16:59
This is specifically adding a retry from our retry package which will retry the config reload when a concurrent access fault occurs. We also refresh the container config during the retry in the event that we need the new ChangeVersion. This is just the initial test; at first glance it does not look like it is enough to resolve the issue. I will also be adding a test which should reliably reproduce the error scenario.
@matthewavery matthewavery merged commit cd24fcd into vmware:master Nov 3, 2017
AngieCris pushed a commit to AngieCris/vic that referenced this pull request Nov 20, 2017
This will add a retry for intermittent failures for a guest config reload. As well as propagate a sane and friendly error back to the user if their exec operation was interrupted by the container shutting down. Some more work will need to be done for fixing the concurrent modification error that is returned via stdout when we attach the streams.
Successfully merging this pull request may close these issues.

Docker exec following a docker restart of a node container results in unknown task ID error