Docker exec following a docker restart of a node container results in unknown task ID error #6370
Marking this high for now, as it is completely non-functional. I need to get into the container in order to debug why it is failing, and I cannot do that because of the portlayer errors.
@mhagen-vmware Looking at it, but does the container output have anything of use in determining why the application didn't work? The prefix is timestamp encoding.
The containerVM is shutting down due to main process exit just as you run the exec request. We hit this line in the portlayer - https://github.com/vmware/vic/blob/master/lib/portlayer/exec/commit.go#L175 - and I remember wondering what to do there:

```go
if err != nil {
	// NOTE: not sure how to handle this error - the change is already applied,
	// it's just not picked up in the container
}
```

This fix should check for the cause of failure to deliver the reload and:
@cgtexmex the latter requirement above is another instance of the conflict case you're looking at, but this one is legitimately handled on the portlayer side.
@hickeng Should we be retrying a concurrent access error here in commit, or downstream from here? The personality has retried concurrent access errors in the past (I believe). For the second scenario, I am wondering which error we should retry. If we do collide with these errors, we would expect to see something like a ConcurrentAccess-style fault. But if we actively collide, are we going to get some sort of InvalidState transition fault? I might need some insight on this, as I have no idea what to look for. We need a way to reproduce a collision and then check the log for the fault type generated by that collision. Thoughts?
I was looking into the InvalidState fault. It is described as being thrown when "the operation cannot be performed in the current state of the virtual machine. For example, because the virtual machine's configuration is not available." For RelocateVM_Task, the InvalidState fault is "Thrown if the operation cannot be performed because of the host or virtual machine's current state. For example, if the host is in maintenance mode, or if the virtual machine's configuration information is not available." (reference) Based on those definitions, it looks like we should be retrying on this fault as well. But my additional question is: should we be retrying mid-commit for InvalidState, but not for ConcurrentAccess?
InvalidState should be retried directly, in the same manner that TaskInProgress is, for states that we identify as fundamentally transient - this should cover invalid states such as:
My current understanding is that VM_MIGRATION will actually cause an update to changeVersion when complete, which means we'll get a ConcurrentModificationError (or whatever it presents as). Regardless, we can try adding this state to the immediate retry above and then fall out of the retry when it turns into the concurrent modification error. ConcurrentModification should not be retried directly and should be returned to the personality, as is currently the case.
Dropping this here as it is important: https://code.vmware.com/web/dp/doc/preview?id=1503#/doc/vim.ResourcePool.html%23importVApp @hickeng I took a look into this, and the only fault type I potentially see for this direct call that might be intermittent would be the
Looks like CreateVM_Task has the same RuntimeFault stipulation (I am sure this is common), along with the InvalidState possibility. So for this ticket I will also look into possible faults that could be thrown as a RuntimeFault that we might be able to get around with a retry for now.
Here is an initial list of faults that I think we might also want to retry against as potential intermittent failures. @hickeng @caglar10ur @cgtexmex @dougm @lcastellano
I also noted that we have the potential case of
Note: This list was generated by looking through CreateVM_Task and ImportVApp_Task. There are likely more, but this was a first pass based on some provided docs. It might be worthwhile to compile a list of all the tasks we initiate against VC so far, and then update that list as we go. Then we would have a good way to crosscheck the types of faults we would expect to see, and we could make judgement calls against that list. Just a thought. :)
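The crosscheck list proposed above could start life as a simple table in code. The task names below are real vSphere API tasks mentioned in this thread, but the retry classifications are illustrative placeholders, not settled decisions.

```go
package main

import "fmt"

// retryable maps each vSphere task we initiate to the faults we have
// judged transient enough to retry. Classifications are placeholders
// pending the judgement calls discussed above.
var retryable = map[string]map[string]bool{
	"CreateVM_Task":   {"TaskInProgress": true, "InvalidState": true},
	"ImportVApp":      {"TaskInProgress": true, "InvalidState": true},
	"RelocateVM_Task": {"TaskInProgress": true, "InvalidState": true},
}

// shouldRetry defaults unknown task/fault pairs to false, so a newly
// observed fault fails loudly until someone classifies it explicitly.
func shouldRetry(task, fault string) bool {
	return retryable[task][fault]
}

func main() {
	fmt.Println(shouldRetry("CreateVM_Task", "InvalidState"))
	fmt.Println(shouldRetry("CreateVM_Task", "HostCommunication"))
}
```

Defaulting to non-retryable keeps the table honest: every retry is an explicit entry someone can review.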
IMHO there is no reason to retry a HostCommunication error, as I don't think the host will come back to life magically. By the way, this happened literally 4 min. ago. :)
Good to know :) this was the kind of thing I was looking for. From what I read, it looks like host communication errors take manual intervention? Or take a longer time to resolve? I will remove this from my changes.
Putting into 1.3 and marking high pri, per @pdaigle.
Reopening; further logic and error message propagation need to be added for the
Per a deep analysis with @hickeng:
@cgtexmex has been dealing with
@hickeng @mhagen-vmware @matthewavery I have no idea what to write in the release notes about this issue. Can you please provide a concise writeup of the user-visible symptoms? Thanks!
container-logs.zip