
1-38-Docker-Exec failure in Exec During PowerOff test #6744

Closed · anchal-agrawal opened this issue Nov 10, 2017 · 11 comments
Labels: kind/defect, priority/p0, team/foundation

@anchal-agrawal

Seen in build https://ci.vcna.io/vmware/vic/14606 in the 1-38-Docker-Exec suite:

------------------------------------------------------------------------------
Exec During PowerOff                                                  | FAIL |
' Error response from daemon: Server error from portlayer: Cannot find task d313fb6d4b5b3021b8692ce78c44395d183e7e365bad4725889a9e96fc7ac74a
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Server error from portlayer: The operation is not allowed in the current state.
Error response from daemon: Server error from portlayer: Cannot find task 441203c3924ede3a6da11582b16b873f3bb80d508bc8af27ffcb46d6fbc0d348
Error response from daemon: Server error from portlayer: Cannot find task 20146b554a2e17500ecfe5cf0d697274030bc08eb5e5843b71a671aa47c8953b
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
Error response from daemon: Server error from portlayer: Cannot find task 6834083aabf4f65385d4d6a719a7c25f3c84a7bca459973bf0cb6e31a9657961
Error response from daemon: Server error from portlayer: The operation is not allowed in the current state.
Error response from daemon: Conflict error from portlayer: [PUT /containers/{handle}][409] commitConflict  &{Code:0 Message:Cannot complete operation due to concurrent modification by another operation.}
' does not contain 'Cannot complete the operation, container 2982075f4b15a8509b418f26e64ce204e7d5cba2da752e7f3883b5d1a694b043 has been powered off during execution'
------------------------------------------------------------------------------

Logs:
Test-Cases.Group1-Docker-Commands.1-38-Docker-Exec-VCH-14606-7800-container-logs.zip

@anchal-agrawal added the kind/defect, priority/p0, and team/foundation labels on Nov 10, 2017
@anchal-agrawal

Assigned to team/foundation for further triage.

@cgtexmex

@matthewavery I'm looking to you to run with this...

@matthewavery

@cgtexmex Alright, I will triage and investigate today.

@matthewavery

Note: these are some preliminary notes from my initial triage of this failure.

# Analysis of Exec during poweroff failure

Container of Interest: d2c46697a1b36066acec660a3676cd69be9534bfa592f17713ea1846edb7ae78

## Portlayer investigation

It looks like in the portlayer we are hitting the sub-case below our poweroff check. Because of that we are still not looking in etasks when perhaps we should be: either the runtime was nil, or the VM changed state right after we checked (which shouldn't really happen, to my knowledge). As a result we never switch to checking whether the etasks list has our ID, since we never change task to point to etasks.
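
For context, here is a minimal sketch of the lookup pattern described above. The function name, the `poweredOn` flag, and the control flow are assumptions for illustration only, not the actual portlayer source:

```go
// Hypothetical illustration of the described flow, not the real portlayer code.
func lookupTask(handle *exec.Handle, poweredOn bool, id string) (*executor.SessionConfig, error) {
    // The lookup starts out pointing at the session (primary process) tasks.
    tasks := handle.ExecConfig.Sessions

    // Only this branch repoints the lookup at the exec tasks.
    if poweredOn {
        tasks = handle.ExecConfig.Execs
    }

    // In the sub-case below the poweroff check (nil runtime, or a state change
    // right after the check), the reassignment above is skipped, so an ID that
    // only exists in handle.ExecConfig.Execs is reported as missing.
    if task, ok := tasks[id]; ok {
        return task, nil
    }
    return nil, fmt.Errorf("Cannot find task %s", id)
}
```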

@matthewavery

related to #6370

@matthewavery commented Nov 13, 2017

To resolve this we will need to add more structured error returns to the exec paths. Here are some of the things that are needed, based on the investigation so far:

  • Many of the personality exec-path calls need to be abstracted to the container proxy.
  • A structured power-failure error needs to be added to the swagger calls, which should then be translated into a sane error message returned by all of the mid-flight exec paths (a rough sketch follows this list).
  • More structured error messages must be propagated back from the portlayer using the new power-failure errors added in the bullet point above.
  • More power-state logic will likely need to be added to the portlayer exec path.
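
As a rough sketch of what such a structured error could look like (the type name and the translation helper below are illustrative assumptions, not the error that was ultimately added to the swagger definitions):

```go
// Hypothetical typed error for "container powered off while an exec was in flight".
type PoweredOffDuringExecError struct {
    ContainerID string
}

func (e PoweredOffDuringExecError) Error() string {
    return fmt.Sprintf("Cannot complete the operation, container %s has been powered off during execution", e.ContainerID)
}

// In the personality, a type switch would let the mid-flight exec paths return
// the sane message instead of wrapping the raw portlayer error in a 500 string.
func userFacingExecError(err error) string {
    switch e := err.(type) {
    case PoweredOffDuringExecError:
        return e.Error()
    default:
        return fmt.Sprintf("Server error from portlayer: %s", err)
    }
}
```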

@matthewavery commented Nov 14, 2017

Per a long analysis with @hickeng:

THINGS TO DO:

1. Create a `TaskInspect` call on the container proxy and update it to use the same handle as everything else in Container Exec Start. Task inspection should go in the retry with everything else, and the entire operation should be retried instead of just part of it. We need to fetch a new handle and perform the config additions again.

2. Make sure everything in start is moved to be inside the retry. If we get a concurrent modification error then we must try everything from scratch with a new handle (see the sketch after this list).

3. We need to change TaskInspect from the portlayer perspective to behave like this: 

```go
func Inspect(op *trace.Operation, h interface{}, id string) (*executor.SessionConfig, error) {
    defer trace.End(trace.Begin(id))

    handle, ok := h.(*exec.Handle)
    if !ok {
        return nil, fmt.Errorf("Type assertion failed for %#+v", h)
    }

    stasks := handle.ExecConfig.Sessions
    etasks := handle.ExecConfig.Execs

    op.Debugf("target task ID: %s", id)
    op.Debugf("session tasks during inspect: %s", stasks)
    op.Debugf("exec tasks during inspect: %s", etasks)

    if _, ok := stasks[id]; ok {
        return stasks[id], nil
    }

    if _, ok := etasks[id]; ok {
        return etasks[id], nil
    }

    // this error should actually be typed, so that we can detect the 404 situation in the handlers
    return nil, fmt.Errorf("Cannot find task %s", id)
}
```

4. IMPORTANT: if `TaskInspect` gives us a 404, we must properly interpret this in the personality. ExecTaskInspect --> 404 --> likely the container was turned off. That should be the order of interpretation.

5. We need to handle concurrent modification errors properly in the exec path; we cannot just throw the entire packaged error into a `500` error string.

6. More to come, from the notes on my desk...
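
As a rough sketch of how items 1, 2, 4, and 5 could fit together in the personality (every helper name below — `fetchHandle`, `addExecToHandle`, `inspectTask`, `commitHandle`, `isNotFound`, `isConflict` — is hypothetical and only illustrates the intended shape of the retry, not the eventual implementation):

```go
// Hypothetical shape of the exec-start path: the whole operation is retried
// from a fresh handle on concurrent modification, and a TaskInspect 404 is
// interpreted as the container having been powered off mid-flight.
func startExecWithRetry(containerID, eid string) error {
    const maxAttempts = 3

    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        // Items 1 and 2: every attempt starts from scratch with a new handle
        // and re-applies the config additions.
        var handle string
        if handle, err = fetchHandle(containerID); err != nil {
            return err
        }
        if handle, err = addExecToHandle(handle, eid); err != nil {
            return err
        }

        if _, err = inspectTask(handle, eid); err != nil {
            if isNotFound(err) {
                // Item 4: TaskInspect --> 404 --> most likely the container was
                // powered off; report that rather than "Cannot find task".
                return fmt.Errorf("Cannot complete the operation, container %s has been powered off during execution", containerID)
            }
            return err
        }

        if err = commitHandle(handle); err != nil && isConflict(err) {
            // Item 5: a concurrent modification restarts the whole operation
            // instead of being wrapped into a generic 500 error string.
            continue
        }
        return err
    }
    return err
}
```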

@mlh78750 commented Dec 6, 2017

What happened with this issue? I see the PR but I'm not sure why this went back to ToDo. Is this still expected in 22?

@matthewavery

@mlh78750 It was determined that this was not included in 1.3. Since I have been working on other high-priority issues and that PR is not fully tested (and involves a lot of code overhaul), it has been tabled for the moment in order to get 1.3 squared away.

@cgtexmex removed this from the Sprint 23 Foundation milestone on Dec 6, 2017
@mdubya66 commented Dec 6, 2017

This problem has existed since 1.1 and is not a regression. The PR that is out against it is too risky at this point. Removing it from the release. Try the exec again and you will find the container is off.

@anchal-agrawal added this to the Sprint 27 Foundation milestone on Feb 28, 2018
@matthewavery commented Feb 28, 2018

#7410 has been created to follow some of the issues that we are still seeing.

This ticket will be closed by PR #6787, as it has fixed many issues with poweroff exec and concurrent execs.

This issue ended up encompassing many more issues than it first appeared to: each unique error involved several different problems being fixed on the exec path, and some of the errors came from different paths but yielded the same message. If you are interested in seeing what some of these were, please check out the above PR (warning: it is very large). As such, this ticket will be closed, and #7410 has been created to track what is left after #6787 is merged.
