
[full ci] Improve exec error message for a CVM shutdown mid operation #6593

Merged: 14 commits, Nov 3, 2017

Conversation

matthewavery (Contributor):
This adds some error types and propagates them back from the portlayer to the personality. TaskInspect now returns a ConflictError if the supplied ID does not exist and the state of the cVM is powered off. This error case may be too broad, but in TaskInspect we are not sure whether the supplied ID was meant to be an exec ID or a session task ID. The solution is therefore a best guess: if the ID is not found and the container is powered off, we assume the operation has been interrupted. I thought about doing the state check again on the personality side, but the unknown-task-ID issue is already triggered after making it past the original state check in CreateExecTask, so I deemed that the task inspect check might need to be closer to the SoT (feel free to correct me on this as well).
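The best-guess check described above can be sketched roughly like this (a minimal standalone sketch; `ConflictError`, the state constants, and `inspectTask` are illustrative stand-ins, not the actual portlayer API):

```go
// Sketch: treat "unknown task ID + powered-off cVM" as an interrupted
// operation rather than a bad ID. All names here are hypothetical.
package main

import "fmt"

// ConflictError is a stand-in for the error type propagated back to the
// personality so it can map the failure to a 409.
type ConflictError struct{ Msg string }

func (e *ConflictError) Error() string { return e.Msg }

type containerState int

const (
	statePoweredOn containerState = iota
	statePoweredOff
)

// inspectTask mimics the best-guess logic: if the ID is not found and the
// container is powered off, assume the shutdown interrupted the exec.
func inspectTask(tasks map[string]bool, state containerState, id string) error {
	if tasks[id] {
		return nil // task found: normal inspect path
	}
	if state == statePoweredOff {
		return &ConflictError{Msg: fmt.Sprintf("container powered off during execution of task %s", id)}
	}
	return fmt.Errorf("unknown task ID %s", id)
}

func main() {
	tasks := map[string]bool{"exec-1": true}
	fmt.Println(inspectTask(tasks, statePoweredOn, "exec-1"))  // <nil>
	fmt.Println(inspectTask(tasks, statePoweredOff, "exec-2")) // conflict error
}
```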

Additionally, the guest config reload function has been wrapped in a retry that retries on transient errors. We may want to narrow down the potential retry cases; in a first pass I used the waiter.go intermittent-error decider function, since it was designed to retry tasks against vSphere that are inherently transient. Based on discussion in this PR we may change that.

I have also added some logging with some trace.Operation usage; this is pending potential removal per @cgtexmex's review. I originally added it because it made it easier to track what was getting called while I was learning the Exec path of operations. It was also needed to track concurrent calls against the same container.

Note: please keep in mind this was my first exposure to the Exec portion of the code base; if I have done something silly, please let me know. I spent a decent while poking around, trying to understand end to end exactly what was going on.

Fixes #6370

matthewavery (Contributor, Author) commented Oct 19, 2017:

TIL that the docker client calls multiple personality endpoints. So there will still be a concurrent modification error when we report the shutdown on the race-condition edge. The error message now looks like this:

Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Conflict error from portlayer: Cannot complete the operation, container 1f23ba97c6741466d04c2d3e95b3636711f9cf19c12e32737fa3cd7fd47ab411 has been powered off during execution

Makefile Outdated
@@ -235,13 +235,11 @@ misspell: $(MISSPELL)

govet:
@echo checking go vet...
@$(GO) tool vet -all -lostcancel -tests $$(find . -mindepth 1 -maxdepth 1 -type d -not -name vendor)
Contributor: Why are we removing the vet and gas checks?

Contributor Author: We are not :( I had that removed locally since mine misbehaves, and it snuck in... this happens to me occasionally as I get add-happy with git.

caglar10ur (Contributor), Oct 24, 2017: OK, then keep it in your local directory. Why did you do a git add and push it as part of this PR?

@@ -257,7 +258,8 @@ func (c *Container) TaskWaitToStart(cid, cname, eid string) error {

// ContainerExecCreate sets up an exec in a running container.
func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (string, error) {
defer trace.End(trace.Begin(name))
op := trace.NewOperation(context.TODO(), "")
defer trace.End(trace.Begin(fmt.Sprintf("opid=(%s) name=(%s)", op.ID(), name)))
Contributor: Should we add a Stringer to the op so that %s ends up with ID()? (if it is missing)

matthewavery (Contributor, Author), Oct 24, 2017: Like an op.Sprintf? :) Yes, I would love to do that! I think I had a PR a while back that never got merged that had one.

@@ -271,15 +273,19 @@ func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (
if err != nil {
return "", InternalServerError(err.Error())
}

// This does not appear to be working... we may need to do this check further down the stack. it is going to be very race filled.
Contributor: Maybe add a FIXME here. Also, not sure what this comment is trying to say :P

Contributor Author: Ah, that was just an investigative comment :) I thought it was removed, sorry :P I was thinking that it was an early check on state that was far from the source of truth. But I figured it is more of a soft check anyway, in case the user tried to exec an already-off container.

Contributor Author: I am going to remove this completely. It is a good check to have, and we can always handle things downstream as well.

handle, err := c.Handle(id, name)
if err != nil {
log.Error(err.Error())
Contributor: op.Errorf?

}

err := retry.Do(retryFunc, isIntermittentFailure)

Contributor: extra space

caglar10ur (Contributor) left a comment: The op.Sprintf -> fmt.Sprintf one is the blocker, as I don't think it compiles with it.

@@ -222,6 +222,7 @@ func (c *Container) TaskInspect(cid, cname, eid string) (*models.TaskInspectResp
if err != nil {
return nil, err
}

return resp.Payload, nil

Contributor: [minor] extra line :P

@@ -257,7 +258,8 @@ func (c *Container) TaskWaitToStart(cid, cname, eid string) error {

// ContainerExecCreate sets up an exec in a running container.
func (c *Container) ContainerExecCreate(name string, config *types.ExecConfig) (string, error) {
defer trace.End(trace.Begin(name))
op := trace.NewOperation(context.TODO(), "")
defer trace.End(trace.Begin(op.Sprintf("name=(%s)", name)))
Contributor: op.Sprintf -> fmt.Sprintf

matthewavery (Contributor, Author), Oct 26, 2017: I added the op.Sprintf since I thought that was the Stringer you wanted :)

Contributor Author: Ah, I just saw the below.

Contributor Author: Hmmm 👍 I can make that change. I see what you meant by Stringer now. I still like the Sprintf, though, unless there is an egregious reason not to have it.

Contributor: Personally, I don't think having a Sprintf on op makes sense. fmt already provides that, and that's the standard way of doing it.

Others can chime in if they think this way or that way :)

@@ -381,7 +407,8 @@ func (c *Container) ContainerExecResize(eid string, height, width int) error {
// ContainerExecStart starts a previously set up exec instance. The
// std streams are set up.
func (c *Container) ContainerExecStart(ctx context.Context, eid string, stdin io.ReadCloser, stdout io.Writer, stderr io.Writer) error {
defer trace.End(trace.Begin(eid))
op := trace.NewOperation(ctx, "")
defer trace.End(trace.Begin(fmt.Sprintf("opID=(%s) eid=(%s)", op.ID(), eid)))
Contributor: You can add a Stringer to op and use op here instead of op.ID():

func (o *Operation) String() string {
	return o.id
}

Contributor Author: Ah, I see. So that it has the same behavior as error.

Contributor Author: I am also changing this to use op.Sprintf, since it will tack the header on by default. I wanted to promote the use of op.Sprintf for providing strings to the trace.Begin and trace.End calls, since I still love those and it offers a much cleaner way of adding the Operation header to that logging.

@matthewavery matthewavery changed the title Improve exec error message for a CVM shutdown mid operation [full ci] Improve exec error message for a CVM shutdown mid operation Oct 27, 2017

cgtexmex (Contributor): From my perspective, I wouldn't add pkg/trace/Operation functionality in this PR. #2739 will result in changes to the package, and I'll be adding operations in "strategic" places... so to minimize rework / wasted effort, I'd vote for you to use fmt for now...

matthewavery (Contributor, Author): You got it, @cgtexmex

cgtexmex (Contributor): You have added the creation of an Operation at the beginning of most (if not all) of the exec funcs -- please use op.Debugf consistently. Specifically, thinking about the type switches as part of the error handling -- use op.Debugf and not log.Debugf.

cgtexmex (Contributor) commented Oct 30, 2017:

LGTM

Approved with PullApprove

mhagen-vmware (Contributor) left a comment: Looks good, just a few minor suggestions.

@@ -16,7 +16,6 @@
Documentation Test 1-38 - Docker Exec
Resource ../../resources/Util.robot
Suite Setup Install VIC Appliance To Test Server
Suite Teardown Cleanup VIC Appliance On Test Server
Contributor: Put that back in.

@@ -58,7 +57,7 @@ Exec Echo -t
${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -d ${busybox} /bin/top -d 600
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -td ${busybox} /bin/top -d 600
Contributor: Don't add -t, as this isn't run with a TTY.

${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} pull ${busybox}
Should Be Equal As Integers ${rc} 0
Should Not Contain ${output} Error
${rc} ${id}= Run And Return Rc And Output docker %{VCH-PARAMS} run -d ${busybox} /bin/top -d 600
Contributor: I don't think you need -d 600 here... no reason to create a race, just let it run.

@matthewavery matthewavery force-pushed the concurrent-issues branch 2 times, most recently from 9b0c8a6 to 653f412 Compare November 3, 2017 16:59
This is specifically adding a retry from our retry package which will retry the config reload when a concurrent access fault occurs. We also refresh the container config during the retry in the event that we need the new ChangeVersion. This is just the initial test; at first glance it does not look like it is enough to resolve the issue. I will also be adding a test which should reliably reproduce the error scenario.
@matthewavery matthewavery merged commit cd24fcd into vmware:master Nov 3, 2017
AngieCris pushed a commit to AngieCris/vic that referenced this pull request Nov 20, 2017
This will add a retry for intermittent failures for a guest config reload. As well as propagate a sane and friendly error back to the user if their exec operation was interrupted by the container shutting down. Some more work will need to be done for fixing the concurrent modification error that is returned via stdout when we attach the streams.
Successfully merging this pull request may close these issues.

Docker exec following a docker restart of a node container results in unknown task ID error