nightly 5-4-High-Availability failure: ESX host is temporarily disconnected #7005

Closed
AngieCris opened this issue Dec 28, 2017 · 4 comments
Labels
component/test Tests not covered by a more specific component label priority/p0 source/scenario Found via a scenario failure team/foundation

Comments

@AngieCris
Contributor


From portlayer log:

Dec 27 2017 22:37:52.768Z ERROR op=290.26: unexpected fault on task retry: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}
Dec 27 2017 22:37:52.792Z DEBUG op=290.26: Unhandled fault while attempting to destroy vm fd5e455f572cbd2c0f1a07198b94ab1d96d713431a9f2858efccf15c7a02f357: &types.HostNotConnected{HostCommunication:types.HostCommunication{RuntimeFault:types.RuntimeFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}}

5-4-High-Availability.zip

@AngieCris AngieCris added component/test Tests not covered by a more specific component label source/scenario Found via a scenario failure priority/p0 team/foundation labels Dec 28, 2017
@hickeng
Member

hickeng commented Jan 3, 2018

Possible candidate for adding a task retry - deleting a container should be viable regardless of whether a single host is down.

@hickeng
Member

hickeng commented Jan 4, 2018

#6370 discusses which task errors should be retried at a low level in the tasks package (e.g. TaskInProgress, HostNotConnected) and which should propagate up to the higher-level logic for re-dispatch (ConcurrentModificationError).

@hickeng
Member

hickeng commented Feb 12, 2018

Talking to dbeard, I'm told that we could well be seeing a delay in updating the host list for routing operations. However, I don't see why we should settle for this, given that the HA remediation has already occurred.
@mhagen-vmware derek has suggested we grab a support bundle when we see this again and open an issue. I've opened bug2057397 with the details we have and the vpxd log fragment gathered as part of the VCH log bundle. Adding this to the log collection epic as it's another case where we'd like to be able to trigger specific log collection on a given symptom.
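
As a minimal sketch of what symptom-triggered collection could key off, assuming the symptom is identified by the fault type dumped in the portlayer log: `faultPattern`, `findFaults`, and `shouldCollect` are hypothetical names, and the sample line is an abbreviated copy of the log entry quoted at the top of this issue. The actual bundle collection step is not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// faultPattern matches the start of a Go-syntax fault dump such as
// "&types.HostNotConnected{" and captures the fault type name.
var faultPattern = regexp.MustCompile(`&types\.(\w+)\{`)

// findFaults returns the outermost fault type named on each log line,
// in order of appearance. Lines without a fault dump are skipped.
func findFaults(log string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := faultPattern.FindStringSubmatch(sc.Text()); m != nil {
			found = append(found, m[1])
		}
	}
	return found
}

// shouldCollect reports whether the given symptom (a fault type name)
// appears anywhere in the log.
func shouldCollect(log, symptom string) bool {
	for _, f := range findFaults(log) {
		if f == symptom {
			return true
		}
	}
	return false
}

func main() {
	// Abbreviated from the portlayer log entry in this issue.
	sample := "Dec 27 2017 22:37:52.768Z ERROR op=290.26: unexpected fault on task retry: " +
		"&types.HostNotConnected{HostCommunication:types.HostCommunication{}}\n"
	if shouldCollect(sample, "HostNotConnected") {
		fmt.Println("symptom seen: a support bundle would be gathered here")
	}
}
```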

@hickeng
Member

hickeng commented Feb 14, 2018

Duplicate of #6667. Updated that issue with the visible symptom observed here.

@hickeng hickeng closed this as completed Feb 14, 2018
2 participants