
Docker exec following a docker restart of a node container results in unknown task ID error #6370

Closed
mhagen-vmware opened this issue Sep 15, 2017 · 17 comments · Fixed by #6593
Labels
kind/defect Behavior that is inconsistent with what's intended priority/p0 source/customer Reported by a customer, directly or via an intermediary team/foundation

Comments

@mhagen-vmware
Contributor

anon@anon-Latitude-E7450:~/vic$ docker -H 192.168.254.178:2376 --tls restart node1
node1
anon@anon-Latitude-E7450:~/vic$ docker -H 192.168.254.178:2376 --tls exec -it node1 /bin/bash
Server error from portlayer: Unknown task ID: e76d137a9d9d136cffd1353f6c902d571a14b2abdaeb8220868fe0408d607739
Error response from daemon: Server error from portlayer: [GET /tasks][500] inspectInternalServerError  &{Code:0 Message:unknown task ID: e76d137a9d9d136cffd1353f6c902d571a14b2abdaeb8220868fe0408d607739}

container-logs.zip

@mhagen-vmware mhagen-vmware added kind/defect Behavior that is inconsistent with what's intended priority/p0 team/container labels Sep 15, 2017
@mhagen-vmware
Contributor Author

mhagen-vmware commented Sep 15, 2017

marking this high for now, as it is completely non-functional. I need to get into the container in order to debug why it is failing and I cannot do that because of the portlayer errors.

@hickeng
Member

hickeng commented Sep 16, 2017

@mhagen-vmware Looking at it, but does the container output have anything of use in determining why the application didn't work? The prefix is timestamp encoding - docker logs should have happily returned the formatted version of this:

n/d25TKh5BRgAg==npm info it worked if it ends with ok
HLB45TKh5BSQAQ==npm info using npm@5.3.0
PsV45TKh5BSwAQ==npm info using node@v8.5.0
uE9aFTOh5BTgAQ==npm info linkStuff !invalid#1
kb76FTOh5BQAAw==npm info lifecycle undefined~install: undefined
0VewFjOh5BRAAw==npm info lifecycle undefined~postinstall: undefined
MVJgFzOh5BQwAw==npm info lifecycle undefined~prepublish: undefined
ueIFGDOh5BQAAw==npm info lifecycle undefined~prepare: undefined
Ty+HGDOh5BRwBQ==npm WARN saveError ENOENT: no such file or directory, open '/usr/src/app/package.json'
x3dRGTOh5BRgAw==npm info lifecycle undefined~preshrinkwrap: undefined
LjnoGTOh5BQwAw==npm info lifecycle undefined~shrinkwrap: undefined
t189GzOh5BQQBQ==npm notice created a lockfile as package-lock.json. You should commit this file.
O1hYHDOh5BRwAw==npm info lifecycle undefined~postshrinkwrap: undefined
VZzcHDOh5BRABQ==npm WARN enoent ENOENT: no such file or directory, open '/usr/src/app/package.json'
7cvEHTOh5BTAAQ==npm WARN app No description
Sd/aHTOh5BQgAg==npm WARN app No repository field.
x9yZHjOh5BTAAQ==npm WARN app No README data
y4KbHjOh5BTwAQ==npm WARN app No license field.
X5ebHjOh5BQQAA==
WTdgHzOh5BRQAQ==up to date in 0.222s
s9PPHzOh5BTQAA==npm info ok 
0SWMLjOh5BRgAg==npm info it worked if it ends with ok
AsjTLjOh5BSQAQ==npm info using npm@5.3.0
uuXTLjOh5BSwAQ==npm info using node@v8.5.0
IteBMTOh5BSAAg==npm ERR! path /usr/src/app/package.json
UPmCMTOh5BRQAQ==npm ERR! code ENOENT
Qm5DMjOh5BQgAQ==npm ERR! errno -2
CM+6MjOh5BRgAQ==npm ERR! syscall open
KyF3MzOh5BRABQ==npm ERR! enoent ENOENT: no such file or directory, open '/usr/src/app/package.json'
dFcvNDOh5BRgBA==npm ERR! enoent This is related to npm not being able to find a file.
3GAwNDOh5BQQAQ==npm ERR! enoent 
pbs3NTOh5BQQAA==
6yQ6NTOh5BRQAw==npm ERR! A complete log of this run can be found in:
hSpBNTOh5BQQBA==npm ERR!     /root/.npm/_logs/2017-09-15T19_56_56_751Z-debug.log
Fr8YZkyh5BRgAg==npm info it worked if it ends with ok
hTIbZkyh5BSQAQ==npm info using npm@5.3.0
qjsbZkyh5BSwAQ==npm info using node@v8.5.0
+ydLp0yh5BTgAQ==npm info linkStuff !invalid#1
IiuFp0yh5BQAAw==npm info lifecycle undefined~install: undefined
VchpqEyh5BRAAw==npm info lifecycle undefined~postinstall: undefined
K2C2qUyh5BQwAw==npm info lifecycle undefined~prepublish: undefined
4CG4qUyh5BQAAw==npm info lifecycle undefined~prepare: undefined
fTu4qUyh5BRwBQ==npm WARN saveError ENOENT: no such file or directory, open '/usr/src/app/package.json'
eHP8qUyh5BRgAw==npm info lifecycle undefined~preshrinkwrap: undefined
ZgkWr0yh5BQwAw==npm info lifecycle undefined~shrinkwrap: undefined
ZfI7r0yh5BRwAw==npm info lifecycle undefined~postshrinkwrap: undefined
4WCzskyh5BRABQ==npm WARN enoent ENOENT: no such file or directory, open '/usr/src/app/package.json'
9eqfuEyh5BRQAQ==up to date in 0.368s
m6QCvEyh5BTAAQ==npm WARN app No description
StkjvEyh5BQgAg==npm WARN app No repository field.
7K6xvUyh5BTAAQ==npm WARN app No README data
jFr2vUyh5BTwAQ==npm WARN app No license field.
Hp5rwEyh5BQQAA==
TNvgw0yh5BTQAA==npm info ok 
ml8r8kyh5BRgAg==npm info it worked if it ends with ok
kiNU8kyh5BSQAQ==npm info using npm@5.3.0
uOJw8kyh5BSwAQ==npm info using node@v8.5.0
aWROU02h5BSAAg==npm ERR! path /usr/src/app/package.json
NDeHVU2h5BRQAQ==npm ERR! code ENOENT
qwCoWk2h5BQgAQ==npm ERR! errno -2
Y1bSWk2h5BRgAQ==npm ERR! syscall open
1o8NW02h5BRABQ==npm ERR! enoent ENOENT: no such file or directory, open '/usr/src/app/package.json'
B4H5Y02h5BRgBA==npm ERR! enoent This is related to npm not being able to find a file.
UZ/tak2h5BQQAQ==npm ERR! enoent 
767WtE2h5BQQAA==
vvrutE2h5BRQAw==npm ERR! A complete log of this run can be found in:
aNs+tU2h5BQQBA==npm ERR!     /root/.npm/_logs/2017-09-15T19_58_48_855Z-debug.log

@hickeng
Member

hickeng commented Sep 16, 2017

The containerVM is shutting down due to main process exit just as you run the exec request.
The exec is configured in the cVM guestinfo; however, it never launches because the "reload" guest program is never run.

We hit this line in the portlayer - https://github.com/vmware/vic/blob/master/lib/portlayer/exec/commit.go#L175 - and I remember wondering what to do there:

if err != nil {
	// NOTE: not sure how to handle this error - the change is already applied, it's just not picked up in the container
}

The fix should check for the cause of the failure to deliver the reload and:

  • in the case of power off return an error such as "container exited during operation"
  • if it's a transient state error (such as a collision with Reconfigure, vMotion, et al) then retry - this is what the retry.Do code is intended for.

@cgtexmex the latter requirement above is another instance of the conflict case you're looking at, but this one is legitimately handled on the portlayer side.
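
Roughly, the shape of that behaviour (just a sketch - in practice retry.Do would drive the loop, and reloadConfig, poweredOff, and isTransient here are illustrative names rather than existing portlayer functions):

import (
	"errors"
	"time"
)

// deliverReload attempts to deliver the reload to the guest, distinguishing the
// two failure modes described above: the containerVM powering off mid-operation
// versus a transient collision (Reconfigure, vMotion, et al) that is worth retrying.
func deliverReload(reload func() error, poweredOff func() bool, isTransient func(error) bool) error {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := reload()
		if err == nil {
			return nil
		}
		if poweredOff() {
			// case 1: the main process exited underneath us
			return errors.New("container exited during operation")
		}
		if isTransient(err) {
			// case 2: transient state collision - back off and try again
			time.Sleep(time.Duration(attempt) * 250 * time.Millisecond)
			continue
		}
		// anything else is propagated unchanged
		return err
	}
	return errors.New("failed to deliver reload after repeated transient errors")
}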

@hickeng hickeng added the source/customer Reported by a customer, directly or via an intermediary label Sep 18, 2017
@matthewavery
Contributor

matthewavery commented Sep 22, 2017

@hickeng Should we be retrying a concurrent access error here in commit, or downstream from here? The personality has retried concurrent access errors in the past (I believe). For the second scenario, I am wondering which error we should retry. If we collide with these operations we would expect to see something like a ConcurrentAccess-style fault, but if we actively collide are we going to get some sort of InvalidState transition fault instead? I might need some insight here, as I am not sure what to look for. We need a way to reproduce a collision and then check the log for the fault type generated by that collision. Thoughts?

@matthewavery
Contributor

matthewavery commented Sep 22, 2017

I was looking into the InvalidState fault, since it is described as being thrown when "the operation cannot be performed in the current state of the virtual machine. For example, because the virtual machine's configuration is not available."

For RelocateVM_Task, the InvalidState fault is "Thrown if the operation cannot be performed because of the host or virtual machine's current state. For example, if the host is in maintenance mode, or if the virtual machine's configuration information is not available." (reference)

Based on those definitions it looks like we should be retrying on this fault as well. But my additional question is: should we be retrying mid-commit for InvalidState, but not for ConcurrentAccess?

@matthewavery matthewavery self-assigned this Sep 25, 2017
@cgtexmex cgtexmex added this to the Sprint 17 Foundation milestone Sep 25, 2017
@hickeng
Member

hickeng commented Sep 25, 2017

InvalidState should be retried directly, in the same manner that TaskInProgress is, for states that we identify as fundamentally transient - this should cover invalid states such as:

  • Guest operation -> Reconfigure
  • Reconfigure -> Reconfigure

My current understanding is that VM_MIGRATION will actually cause an update to changeVersion when complete, which means we'll get a ConcurrentModificationError (or whatever it presents as). Regardless, we can try adding this state to the immediate retry above and then fall out of the retry when it turns into the concurrent modification error. ConcurrentModification should not be retried directly and should be returned to the personality, as is currently the case.
This overlaps with #5647, which is also about handling transient InvalidState (unless my initial guess is incorrect).
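
As a rough illustration (not the implementation), the immediate-retry decision could look like this, using the govmomi vim25/types fault types:

import "github.com/vmware/govmomi/vim25/types"

// retryableFault reports whether a fault represents a fundamentally transient
// state that is safe to retry directly. ConcurrentAccess is deliberately not
// listed - it should be returned to the personality as is currently the case.
func retryableFault(fault types.BaseMethodFault) bool {
	switch fault.(type) {
	case *types.TaskInProgress, *types.InvalidState:
		return true
	default:
		return false
	}
}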

@matthewavery
Contributor

Thanks @hickeng, I was also looking into #5647 when looking at this, since they definitely seem to involve related work.

@matthewavery
Contributor

matthewavery commented Sep 28, 2017

dropping this here as it is important

https://code.vmware.com/web/dp/doc/preview?id=1503#/doc/vim.ResourcePool.html%23importVApp

@hickeng I took a look into this and the only fault type I see for this direct call that might be intermittent is RuntimeFault (which is a rather broad fault type for us to retry on). I will dig into that fault some more; I also noticed this call can trigger a CreateVM_Task.

It looks like CreateVM_Task has the same RuntimeFault stipulation (I am sure this is common), along with the InvalidState possibility. So for this ticket I will also look into faults that could be thrown as a RuntimeFault that we might be able to get around with a retry for now.

@matthewavery
Contributor

Here is an initial list of faults that I think we might also want to retry against as potential intermittent failures. @hickeng @caglar10ur @cgtexmex @dougm @lcastellano

  • HostCommunication: This one was iffy for me, but it is possible that the host communication issue could clean itself up during the retry.
  • FailToLockFaultToleranceVMs: Thrown when trying to state-lock a Fault Tolerance VM while the other VM in the same Fault Tolerance pair is already locked. This sounds intermittent, and I am not certain we would see it; I would need someone who knows more to comment on this fault.
  • VAppOperationInProgress: This sounds like the vapp version of a TaskInProgress and considering the standard deployment of VIC is currently to a vApp I think this should also be retried.

I also noted that we have the potential case of:

  • UnexpectedFault: this is described as being thrown when a newer version of the server reports an error that cannot be converted to a fault that a client using an older version of the API would expect, and it can come in the form of a RuntimeFault. I am not sure whether we should attempt to retry this at all, but I am looking for some opinions on it.

Note: This list was generated by looking through CreateVM_Task and ImportVApp_Task. There are likely more, but this was a first pass based on the docs provided. It might be worthwhile to compile a list of all the tasks we initiate against VC so far and update it as we go; then we would have a great way to cross-check the fault types we expect to see and could make judgement calls against that list. Just a thought. :)
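
For illustration only: if those candidates are accepted, they would just extend the same transient-fault check (assuming govmomi generates vim25/types equivalents for the vSphere faults named above; HostCommunication and UnexpectedFault are deliberately left out pending more input):

import "github.com/vmware/govmomi/vim25/types"

// retryableCandidate is the earlier decider plus the two candidates proposed above.
func retryableCandidate(fault types.BaseMethodFault) bool {
	switch fault.(type) {
	case *types.TaskInProgress, *types.InvalidState,
		*types.VAppOperationInProgress, *types.FailToLockFaultToleranceVMs:
		return true
	default:
		return false
	}
}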

@caglar10ur
Contributor

IMHO there is no reason to retry a HostCommunication error, as I don't think the host will come back to life magically. By the way, this happened literally 4 minutes ago :)

[vagrant@devbox:/opt/go/src/github.com/vmware/vic(5621)] vic-delete
Oct  4 2017 17:25:19.134Z INFO  ### Removing VCH ####
Oct  4 2017 17:25:20.338Z INFO  Validating target
Oct  4 2017 17:25:22.073Z INFO
Oct  4 2017 17:25:22.073Z INFO  VCH ID: VirtualMachine:vm-58326
Oct  4 2017 17:25:22.197Z INFO  Removing VMs
Oct  4 2017 17:25:23.396Z ERROR op=1376.6: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.666Z ERROR op=1376.10: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.719Z ERROR op=1376.13: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.719Z ERROR op=1376.11: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.804Z ERROR op=1376.14: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.880Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60149"
Oct  4 2017 17:25:23.915Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60150"
Oct  4 2017 17:25:23.943Z ERROR op=1376.16: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:23.997Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60146"
Oct  4 2017 17:25:24.064Z ERROR op=1376.18: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.071Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60148"
Oct  4 2017 17:25:24.071Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60147"
Oct  4 2017 17:25:24.122Z ERROR op=1376.19: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.137Z ERROR op=1376.17: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.175Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60144"
Oct  4 2017 17:25:24.201Z ERROR op=1376.21: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.214Z ERROR op=1376.20: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.268Z ERROR op=1376.23: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.301Z ERROR op=1376.22: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.483Z ERROR op=1376.25: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.483Z ERROR op=1376.26: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.483Z ERROR op=1376.24: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.530Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60139"
Oct  4 2017 17:25:24.538Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60143"
Oct  4 2017 17:25:24.550Z ERROR op=1376.28: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.586Z ERROR op=1376.27: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.612Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60142"
Oct  4 2017 17:25:24.619Z ERROR op=1376.29: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.641Z ERROR op=1376.30: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.709Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60134"
Oct  4 2017 17:25:24.755Z ERROR op=1376.31: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.867Z ERROR op=1376.34: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.879Z ERROR op=1376.32: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.881Z ERROR op=1376.33: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:24.933Z ERROR op=1376.35: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.006Z ERROR op=1376.36: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.058Z ERROR op=1376.37: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.059Z ERROR op=1376.41: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.067Z ERROR op=1376.38: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.115Z ERROR op=1376.39: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.125Z ERROR op=1376.40: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.339Z ERROR op=1376.43: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.546Z ERROR op=1376.44: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Oct  4 2017 17:25:25.833Z INFO  Unregistered VM to cleanup after failed destroy: "VirtualMachine:vm-60091"
Oct  4 2017 17:25:33.176Z INFO  Specify --force to force delete
....

@matthewavery
Contributor

Good to know :) This was the kind of thing I was looking for. From what I read, it looks like host communication errors take manual intervention, or a longer time to resolve? I will remove this from my changes.

@hickeng
Member

hickeng commented Oct 5, 2017

I think we may have suffered some scope creep from the original issue. If this issue is going to be fixed via #5647, can we move that to in-progress, move this back to to-do, and then confirm that this is fixed once #5647 is closed?

@mdubya66
Contributor

mdubya66 commented Oct 6, 2017

Putting this into 1.3 and marking it high priority, per @pdaigle.

@matthewavery
Contributor

matthewavery commented Nov 13, 2017

Reopening. Further logic and error message propagation needs to be added for the wait, bind, and potentially other exec paths as well, and some refactoring is badly needed for structured error message propagation. See the list of items below (grabbed from #6744):

  • Many of the personality exec path calls need to be abstracted to the container proxy.
  • A structured power failure error needs to be added to the swagger calls, which should then be translated into a sane error message returned by all of the mid-flight exec paths (a rough sketch of such an error follows below).
  • More structured error messages must be propagated back from the portlayer using the new power failure errors added in the above bullet point.
  • More power state logic will likely need to be added to the portlayer exec path.
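
As a starting point, the structured power-state error from the second and third bullets could be shaped roughly like this (the name and fields are placeholders, not the eventual swagger model):

import "fmt"

// ContainerPoweredOffError is a placeholder for the structured error the
// portlayer would hand back when an exec-path operation fails because the
// container powered off mid-flight.
type ContainerPoweredOffError struct {
	ContainerID string
	Op          string // e.g. "exec start", "task wait"
}

func (e *ContainerPoweredOffError) Error() string {
	return fmt.Sprintf("container (%s) is not running: cannot complete %s", e.ContainerID, e.Op)
}

// IsPoweredOffError gives the personality a way to translate the failure into a
// sane user-facing message instead of a generic 500.
func IsPoweredOffError(err error) bool {
	_, ok := err.(*ContainerPoweredOffError)
	return ok
}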

@matthewavery matthewavery reopened this Nov 13, 2017
@matthewavery
Contributor

Per a deep analysis with @hickeng:


THINGS TO DO:

1. Move TaskInspect to the container proxy and update the call to use the same handle as everything else in Container Exec Start. Task inspection should go in the retry with everything else, and the handle should be retrieved above that.

2. Make sure everything in start is moved to be inside the retry. If we get a concurrent modification error then we must retry everything from scratch with a new handle.

3. We need to change TaskInspect from the portlayer perspective to behave like this: 

func Inspect(op *trace.Operation, h interface{}, id string) (*executor.SessionConfig, error) {
    defer trace.End(trace.Begin(id))

    handle, ok := h.(*exec.Handle)
    if !ok {
        return nil, fmt.Errorf("Type assertion failed for %#+v", h)
    }

    stasks := handle.ExecConfig.Sessions
    etasks := handle.ExecConfig.Execs

    op.Debugf("target task ID: %s", id)
    op.Debugf("session tasks during inspect: %s", stasks)
    op.Debugf("exec tasks during inspect: %s", etasks)

    if _, ok := stasks[id]; ok {
        return stasks[id], nil
    }

    if _, ok := etasks[id]; ok {
        return etasks[id], nil
    }

    return nil, fmt.Errorf("Cannot find task %s", id)
}

4. IMPORTANT: if TaskInspect gives us a 404 we must properly interpret it in the personality. ExecTaskInspect --> 404 --> the container was most likely turned off. That should be the order of interpretation (see the sketch after this list).

5. We need to handle concurrent modification errors properly in the exec path... We cannot just throw the entire packaged error into a `500` error string. 

6. more to come... from notes on my desk... 
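
For item 4, a sketch of the interpretation order in the personality (notFound stands in for whatever predicate we end up with over the generated swagger error types):

import "fmt"

// interpretInspectError maps a 404 from TaskInspect to the most likely cause:
// the container powered off and its task config was removed with it.
func interpretInspectError(containerID string, err error, notFound func(error) bool) error {
	if err == nil {
		return nil
	}
	if notFound(err) {
		return fmt.Errorf("container (%s) is not running: the exec task no longer exists, the container was most likely powered off", containerID)
	}
	// anything else is still surfaced, but should not be wrapped into a bare 500 string
	return err
}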

@cgtexmex cgtexmex removed this from the Sprint 20 Foundation milestone Dec 6, 2017
@hickeng
Member

hickeng commented Jan 4, 2018

@cgtexmex has been dealing with HostNotConnected issues recently with the image store reload work. I think that a retry is viable for most operations so long as there are other hosts online - my question is why VC continues to route operations to a host that's not connected after the first time.
We could do with some details, either from VC code or experimentation, as to whether a retry could be effective or whether we will just keep getting tasks dispatched to the offline host.
I've sent an email to Derek to see if we can get some clarity on designed vs actual VC behaviour in this scenario.

@mdubya66 mdubya66 added the impact/doc/note Requires creation of or changes to an official release note label Jan 8, 2018
@stuclem
Contributor

stuclem commented Jan 9, 2018

@hickeng @mhagen-vmware @matthewavery I have no idea what to write in the release notes about this issue. Can you please provide a concise writeup of the user-visible symptoms? Thanks!

@matthewavery matthewavery self-assigned this Jan 31, 2018
@stuclem stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Feb 6, 2018