
Error while powering on containers - "Server error from portlayer: could not find IP range" #6851

Closed
andrewtchin opened this issue Nov 27, 2017 · 58 comments
Labels: kind/defect, priority/p0, source/system-test, team/container

Comments


@lgayatri commented on Tue Nov 21 2017

@andrewtchin @cgtexmex @mhagen-vmware

[screenshots attached]

Service https://10.197.37.143:2376/v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error 500 for POST. id 284135304 message Server error from portlayer: could not find IP range; Reason: {"message":"Server error from portlayer: could not find IP range"}
docker-personality.log says:

Nov 22 2017 05:09:54.751Z INFO  volumes found:
Nov 22 2017 05:09:55.326Z WARN  error getting config from source: config not modified
Nov 22 2017 05:14:47.145Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc420a15240), statusCode:500}
Nov 22 2017 05:14:47.145Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.231Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc4202c9ad0), statusCode:500}
Nov 22 2017 05:14:48.231Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.379Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc420573f30), statusCode:500}
Nov 22 2017 05:14:48.379Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.661Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc42048b360), statusCode:500}
Nov 22 2017 05:14:48.661Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range

Logs attached.


@lgayatri commented on Tue Nov 21 2017

container-logs (2).zip


@lgayatri commented on Wed Nov 22 2017

Hitting this issue on other VCHs too.

[screenshot attached]


@mdubya66 commented on Wed Nov 22 2017

@sergiosagu could this be related to the ip range work we did?


@sergiosagu commented on Thu Nov 23 2017

@mdubya66 - hmm... I don't think so, but we should investigate it.

@lgayatri - could you attach the application template that you used for the deployment? Or perhaps create one, if you ran the container directly.


@lgayatri commented on Thu Nov 23 2017

@sergiosagu , I have live setup. Please take a look

https://10.197.37.132:8282/#/home/containers - administrator@vsphere.local/Admin!23 and root/Admin!23


@sergiosagu commented on Fri Nov 24 2017

@mdubya66 - It looks like it could be something with the VCH host itself.

I've found these:

Probably someone from the engine team should take a look.


@andrewtchin commented on Mon Nov 27 2017

@lgayatri Were these VCHs restarted? What are the steps to reproduce this?

@lgayatri

@andrewtchin , VCHs were not restarted.
Steps:

  1. Create 50 VCHs with about 50 containers in each.
  2. The containers stopped after their process exited and hence were powered off.
  3. From the Admiral UI, I powered on a few containers and hit this issue.

@mdubya66

is this a regression? @mhagen-vmware

@mhagen-vmware

Looks like it. This is the same test that she has been running; it is now not passing.

@lgayatri

@mdubya66 , Regression - Yes

@mdubya66

@sflxn this is a potential stopper. can we root cause?


sflxn commented Nov 29, 2017

@lgayatri I pinged you on slack. Can you provide me with the scripts you used to reproduce this?

@lgayatri

@sflxn , sent in email. BTW, we are in different time zones.

@mdubya66

@mlh78750 potential stopper, adding to the list until we get root cause


sflxn commented Dec 1, 2017

I've broken down the portlayer logs the best I can. The logs don't provide the operation that starts a reconfigure, so it's hard to determine what is what, but through digging I've been able to surmise most of the operations that occurred for container c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49 in the logs.

What I suspect is happening is that we are not properly accounting for the removal of container IPs in our internal bookkeeping. The error shown in the original post originates from VIC's internal accounting, not from vSphere.

Not sure yet whether this is a regression, or why the accounting is off. Marked-up portlayer logs attached.
port-layer.log
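
To make the internal-accounting suspicion above concrete, here is a minimal sketch (not the actual portlayer code) of how an IP pool that misses a release ends up rejecting a later reservation, even though vSphere itself has no objection. All names below are illustrative.

Sketch: missed release breaks IP accounting
-------------------------------------------

package main

import (
    "errors"
    "fmt"
    "net"
)

// pool tracks which addresses are handed out; illustrative only.
type pool struct {
    inUse map[string]bool
}

func newPool() *pool { return &pool{inUse: map[string]bool{}} }

// Reserve fails if the address is still marked as in use.
func (p *pool) Reserve(ip net.IP) error {
    if p.inUse[ip.String()] {
        return errors.New("could not find IP range")
    }
    p.inUse[ip.String()] = true
    return nil
}

// Release must be recorded on every unbind; if it is skipped, the
// accounting drifts from reality.
func (p *pool) Release(ip net.IP) { delete(p.inUse, ip.String()) }

func main() {
    p := newPool()
    ip := net.ParseIP("172.16.0.4")
    _ = p.Reserve(ip)
    // The container stops, but Release is never recorded...
    // ...so the restart's reservation fails:
    fmt.Println(p.Reserve(ip)) // could not find IP range
}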


sflxn commented Dec 1, 2017

If there is a regression, it is likely in PR #6525. Still reviewing changes.


sflxn commented Dec 5, 2017

I've reviewed the PR above, and I do not believe the regression is there. I have studied the IP release paths in the portlayer and persona and cannot find a scenario that would cause it to fail to remove the IP from the portlayer's internal accounting.

I've looked at PRs since 1.2 in the portlayer's networking folder, in the portlayer's REST handler, and in the persona, and have not come across any change that would cause this.

On manual start (by @lgayatri), the tether logs show the container VM with a zeroed IP and static set to false (DHCP). This is expected.

I've discussed this with @hickeng and @hmahmood and haven't come up with any other ideas as to what is causing this.

The only path left that I can consider is the following lines in the portlayer logs:

Dropping duplicate power on operation?
--------------------------------------

Nov 21 2017 07:42:39.685Z INFO  op=292.4740: Dropping duplicate power on operation for c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49
Nov 21 2017 07:42:51.571Z ERROR op=292.4763: unexpected fault on task retry: &types.ConcurrentAccess{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}


sflxn commented Dec 5, 2017

@lgayatri As I mentioned on the skype call, can you run these two tests?

  1. vic-machine create with --debug=2. Lower the sleep time to 45mins or so.
  2. vic-machine create with --debug=2. Change the script's command from docker run to docker create instead. Then go and power on the containers.


sflxn commented Dec 5, 2017

Some updated info on this issue:

@lgayatri found this issue while testing on VC 6.5.

  1. Created the VCHs with vic-machine CLI
  2. Added VCHs to Admiral
  3. Ran docker commands from command line against VCH's API server
  4. Once containers stopped, she came back and tried to batch start many of the stopped containers from Admiral

From the logs:

Nearly all (if not all) of the containers she tried to start returned a conflict error from the portlayer multiple times (from retries). Error:

ERROR Will stop trying again. Operation failed with errors.apiError{error:(*errors.errorString)(0xc4205bb1a0), statusCode:409}

Only one failed with the "could not find IP range" error towards the end of the log.


lgayatri commented Dec 5, 2017

@sflxn , first step is in progress:
vic-machine create with --debug=2. Lower the sleep time to 45mins or so.


lgayatri commented Dec 5, 2017

@sflxn please feel free to access my testbed. There are 2500+ containers with sleep 45m executed and now in STOPPED state.


sflxn commented Dec 5, 2017

@lgayatri I went on your server and tried to start 100 containers and they all started. I then tried with 200+ containers, and it worked. Perhaps debug has altered some timing. I started the containers in the container view of the project_lg. They were spread over many VCHs. I assume you did the same thing.

I noticed that there were a lot of containers that should have been in the stopped state but were still running 10-11hrs later. Did you see this?


sflxn commented Dec 5, 2017

I was just able to reproduce it with the debug VCHs that @lgayatri deployed. The debug logs are attached. Container e505021d6222730b7fa2591f549aa454190aefdebdec3523cd0e2eb5e0791815.

container-logs-debug.zip

observations

  1. I noticed that Admiral requests the container start 4 times without any exponential backoff. Each request comes milliseconds after the last one failed (a backoff sketch follows below).
  2. I am able to restart the container from the command line using the docker CLI.
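
For reference, a minimal sketch of the kind of client-side exponential backoff that would avoid the millisecond-apart retries noted in observation 1. startContainer is a hypothetical stand-in for Admiral's start request, not its real API.

Sketch: retry with exponential backoff
--------------------------------------

package main

import (
    "errors"
    "fmt"
    "time"
)

// startContainer is a placeholder for the POST /containers/{id}/start call.
func startContainer(id string) error {
    return errors.New("409 conflict")
}

func startWithBackoff(id string, attempts int) error {
    delay := 500 * time.Millisecond
    for i := 0; i < attempts; i++ {
        err := startContainer(id)
        if err == nil {
            return nil
        }
        fmt.Printf("attempt %d failed: %v; retrying in %v\n", i+1, err, delay)
        time.Sleep(delay)
        delay *= 2 // back off instead of retrying immediately
    }
    return fmt.Errorf("giving up on %s after %d attempts", id, attempts)
}

func main() {
    _ = startWithBackoff("e505021d6222", 4)
}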


mlh78750 commented Dec 6, 2017

@sflxn We should open an issue with Admiral on the backoff. Also, do you have a good idea of the root cause at this point?


lgayatri commented Dec 6, 2017

@sflxn All containers should be in the stopped state as they had sleep 45m. I still see 37 containers in RUNNING status:
[screenshot attached]

This is an Admiral UI refresh issue. Opened vmware-archive/admiral#231.


lgayatri commented Dec 6, 2017

@sflxn do you want me to run the second test ?
....
vic-machine create with --debug=2. Change the script's command from docker run to docker create instead. Then go and power on the containers.
....


sflxn commented Dec 6, 2017

@mlh78750 I believe it's trying to reserve an IP that is already in use. The debug logs are not actually helpful.

Scopes BindContainer for e505021d6222730b7fa2591f549aa454190aefdebdec3523cd0e2eb5e0791815
--------------------

Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/apiservers/portlayer/restapi/handlers.(*ScopesHandlersImpl).ScopesBindContainer:273] handle(b3427a92bdb994a4aa90f924214b5904)
Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/portlayer/network.(*Context).BindContainer:658]
Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/portlayer/network.(*Context).container:848]
Dec  5 2017 21:36:15.902Z DEBUG [ END ]  [vic/lib/portlayer/network.(*Context).container:848] [15.363µs] 
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from loving_wright
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from festive_yonath
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from musing_newton
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from sharp_meninsky
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from frosty_torvalds
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from sharp_golick
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from happy_fermi
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from competent_lumiere
Dec  5 2017 21:36:15.903Z DEBUG adding alias bridge:3d64843977c3 -> focused_davinci
Dec  5 2017 21:36:15.903Z DEBUG adding alias bridge:focused_davinci -> focused_davinci
Dec  5 2017 21:36:15.903Z DEBUG [ END ]  [vic/lib/portlayer/network.(*Context).BindContainer:658] [177.919µs] 
Dec  5 2017 21:36:15.903Z DEBUG [ END ]  [vic/lib/apiservers/portlayer/restapi/handlers.(*ScopesHandlersImpl).ScopesBindContainer:273] [201.507µs] handle(b3427a92bdb994a4aa90f924214b5904)

I'm adding more info to the logs so that we can actually see which IP it is trying to reserve. Also, the error message is ambiguous: the IP could already be in use, or the requested IP could be outside the pool's range. My update should make that clearer. I need to verify that this hypothesis is correct, that it's requesting an IP that's already in use.

How the IP came to be already in use is the next investigation. I don't see this path while perusing the code. There is something in the logs that Lakshmi originally posted: a lot of the containers she started on that VCH failed after multiple retries, with a concurrent access error. The code that exists in the persona should unbind the container in the event of failure. I'm trying to find a case where it doesn't really unbind.
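
As an illustration of the error-message split mentioned above (an address already in use vs. one outside the pool's range), here is a sketch with purely illustrative types; it is not the portlayer's actual reservation code.

Sketch: disambiguating the reservation error
--------------------------------------------

package main

import (
    "fmt"
    "net"
)

type ipPool struct {
    subnet *net.IPNet
    inUse  map[string]bool
}

// ReserveIP distinguishes "outside the pool's range" from "already in use".
func (p *ipPool) ReserveIP(ip net.IP) error {
    if !p.subnet.Contains(ip) {
        return fmt.Errorf("cannot reserve %s: outside pool range %s", ip, p.subnet)
    }
    if p.inUse[ip.String()] {
        return fmt.Errorf("cannot reserve %s: already in use", ip)
    }
    p.inUse[ip.String()] = true
    return nil
}

func main() {
    _, subnet, _ := net.ParseCIDR("172.16.0.0/16")
    p := &ipPool{subnet: subnet, inUse: map[string]bool{"172.16.0.10": true}}

    fmt.Println(p.ReserveIP(net.ParseIP("10.0.0.1")))    // outside pool range
    fmt.Println(p.ReserveIP(net.ParseIP("172.16.0.10"))) // already in use
    fmt.Println(p.ReserveIP(net.ParseIP("172.16.0.11"))) // <nil>: reserved
}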

@lgayatri

@sflxn , test bed is ready, please feel free to use it. The VCHs have been created with debug level 2.


sflxn commented Dec 20, 2017

I might be wrong about the summary and fix from earlier.

From digging in further, I wonder whether the network event handler is at fault. Here is a new summarized portlayer log. I've included the handle change version from the logs.

  1. Container A restart is using the old handle
  2. Container A's network event handler gets triggered and creates a new handle and the IP is released in this changeset.
  3. Container B starts and reserves the IP
  4. Container A restart finally tries to reserve the IP, but has old information (ne.Assigned.IP) since it's using the old change version

Questions

  1. Why does the network event handler create a new handle? Should it do that?
  2. What mechanism does the portlayer have to sync up concurrent access to a handle?
  3. Should we disallow access to this container until the unbind finishes? Are we not detecting this scenario as a concurrent access problem? I think this is ultimately the solution: detect this as concurrent access and return that error for step 1 above (container A restart). When the persona detects this error, it will retry. Eventually there will be no concurrent access error and the persona will get the latest handle (sketched below).

Portlayer-Summarized.log
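
A minimal sketch of the concurrent-access detection proposed in question 3: reject a bind arriving with a stale handle (old change version) so the persona refetches and retries. The handle and change-version fields here are simplified stand-ins, not the portlayer's real types.

Sketch: rejecting a bind on a stale handle
------------------------------------------

package main

import (
    "errors"
    "fmt"
)

type handle struct {
    containerID   string
    changeVersion int
}

type containerState struct {
    changeVersion int
}

var errConcurrentAccess = errors.New("concurrent access: handle is stale, refetch and retry")

// bindContainer refuses to act on a handle that predates the latest change,
// e.g. one created before the network event handler released the IP.
func bindContainer(h handle, current *containerState) error {
    if h.changeVersion != current.changeVersion {
        return errConcurrentAccess
    }
    // ...reserve an IP using the up-to-date state...
    return nil
}

func main() {
    state := &containerState{changeVersion: 2} // IP already released by the event handler
    stale := handle{containerID: "c8b910a14b38", changeVersion: 1}

    if err := bindContainer(stale, state); errors.Is(err, errConcurrentAccess) {
        fmt.Println("persona would refetch the handle and retry:", err)
    }
}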


lgayatri commented Dec 20, 2017

@sflxn , here is the repro:

[screenshot attached]
container-logs-10.197.37.131.zip
SSH has been enabled for 10.197.37.131, I sent you the credentials.

@mdubya66

I was able to restart the container 172.16.0.17, which was one of the two that failed to start. 172.16.0.21 would not restart from either the command line or Admiral; it kept failing with the IP-address-in-use message. I then searched for 172.16.0.21 in Admiral and found 54 containers with that IP address. If there are 50 VCHs, that seems odd. I used docker inspect on each container and did not see a single stopped container with that IP address. So I then started clicking on each container in Admiral, and one by one each container would be removed from the 172.16.0.21 search window. After all 54 were no longer in the search window, I found the affectionate_torvalds container and clicked start. It then ran, where it would not run before, from either Admiral or the docker command line.

Could admiral holding onto the IP address have some effect?
screen shot 2017-12-20 at 12 11 41 pm

@mdubya66

Changes to the system overnight showed a new container with the same error. I searched on its IP and there are 338 containers sharing the same IP.

screen shot 2017-12-20 at 12 19 26 pm


sflxn commented Dec 20, 2017

I briefly analyzed the logs from @lgayatri's latest run with debug logging. The error is similar to the one I found yesterday, but with a slight difference. The root cause is that when a container unbinds from the network, it needs to release the IP, which it does. Concurrently, a request comes in to restart the container and tries to grab the same IP because it's using an old handle (where the IP wasn't released yet). It needs to get the new handle (where the IP was released) so that it can request a new IP.

I have a WIP PR that goes toward resolving this. I just need to investigate why your logs show it repeatedly trying to restart and bind using the old handle. This shouldn't happen. A retry in the persona should get the new handle... eventually. This may still be true; I just need to finish the analysis.

#6989

UPDATE: I've verified that eventually the container I was looking at gets the IP. I just now tried to start the container and it started. However, there was still an assigned IP address when we tried to bind. This is a bug: it never fully released the IP address earlier. Hours later, I expected a restart would get the updated handle and see that the IP was released. It wasn't. I'll need to find this bug and add it to the PR I linked above. This bug needs to be fixed because even with returning ConcurrentResourceError from the network layer, we can still hit this issue if the system is thrashing and the container never fully zeroes out the assigned IP.

Dec 20 2017 06:32:18.692Z DEBUG op=300.51: BindContainer found container ff6fcce8c9b6d20c6ea1f21c2c24377ac5d73899b53fb835716e9c07d5c3127a with dynamic endpoint and unassigned IP
Dec 20 2017 06:32:18.692Z INFO  Reserving IP range [172.16.0.4, 172.16.0.4]

portlayer-summarized.log
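
One way to picture the bug described in the UPDATE above: if the unbind path bails out before clearing the endpoint's Assigned address, every later bind keeps insisting on that same IP. The endpoint type and clearOnFailure flag below are illustrative only, not a proposed VIC patch.

Sketch: Assigned IP surviving a failed unbind
---------------------------------------------

package main

import (
    "errors"
    "fmt"
    "net"
)

type endpoint struct {
    Assigned net.IP // address the endpoint believes it still owns
    Static   bool   // user-assigned (true) vs dynamically allocated (false)
}

// reconfigureVM stands in for the vSphere reconfigure that failed with a
// ConcurrentAccess fault in the logs above.
func reconfigureVM() error { return errors.New("ConcurrentAccess fault") }

// unbind releases the endpoint's address. If the reconfigure fails and we
// return early, Assigned stays populated and the next BindContainer will
// try (and fail) to re-reserve it.
func unbind(ep *endpoint, clearOnFailure bool) error {
    err := reconfigureVM()
    if err != nil && !clearOnFailure {
        return err // Assigned survives the failed unbind
    }
    if !ep.Static {
        ep.Assigned = nil // dynamic endpoint: request a fresh IP next time
    }
    return err
}

func main() {
    ep := &endpoint{Assigned: net.ParseIP("172.16.0.4")}
    fmt.Println(unbind(ep, false), "-> assigned afterwards:", ep.Assigned)
    fmt.Println(unbind(ep, true), "-> assigned afterwards:", ep.Assigned)
}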

hickeng pushed a commit to hickeng/vic that referenced this issue Dec 21, 2017
Increasing the event page size on the portlayer
to help with missing events.
Helps with vmware#6384, vmware#6851

sflxn commented Dec 21, 2017

Still tracking down why unbind failed, which eventually causes this issue.

300.637, Unbind failed due to concurrent modification - 79752927304b297507870e2cedd21b0b110b33f6a4900031b26503dea56dfe60
----------
Dec 20 2017 07:32:07.191Z ERROR op=300.637: unexpected fault on task retry: &types.ConcurrentAccess{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}

Once the above error occurs, all requests to bind this container have the IP in question assigned because it was never able to properly unbind (per the error above). On a system that is not loaded, you can eventually restart the container. I was able to do it earlier today, but when I looked at the logs, it still had the IP assigned going into BindContainer. This was many, many hours later. If the VC were loaded, I may not have been able to restart the container, because it insists on getting that specific IP since it was never able to cleanly unbind.

I do see a potential problem with Admiral. Lakshmi's system is very loaded. Admiral never waits for a successful start: it waits a maximum of 2 minutes and then sends VIC another request. In this system, one start request took nearly 6 minutes to complete, and it completed successfully, but Admiral didn't wait around. It had sent 2 more requests during this time, each waiting a maximum of 2 minutes. After the one successful start, Admiral continuously resends the same start for the same container. I think this is a pretty serious bug.
persona-summarized.log


lgayatri commented Dec 21, 2017

My setup is moderately loaded; this is what we support.
@karthik-narayan

@mdubya66

Potential workaround but not pretty.

(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b18afb2d2d79
Error response from daemon: Server error from portlayer: Cannot reserve IP range 172.16.0.10 - 172.16.0.10.  Already in use
Error: failed to start containers: b18afb2d2d79

Using docker network inspect, I found the container that was in conflict, which was b8f692ac54f, and stopped it. Then I restarted both.

(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls stop b8f692ac54f
b8f692ac54f
(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b18afb2d2d79
b18afb2d2d79
(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b8f692ac54f
b8f692ac54f
(mwilliamson) ib-aus-office-124:~/Downloads $

@mdubya66 added the impact/doc/note label Dec 29, 2017

hickeng commented Jan 3, 2018

I think this issue after a failed unbind has been around since we added the VCH restart support in #2579. We should not be re-requesting the same address if the cVM is not currently powered on.
Any power state check should be made against the RuntimeState in the Handle, not against the VM itself - that keeps the prereq logic protected by the changeVersion.
The bridge network is not considered a dynamic scope, so we will try to reuse the specific IP in the Assigned field.
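
A sketch of the check described above: decide whether to re-request the previously Assigned bridge address from the RuntimeState carried in the Handle (and therefore protected by the changeVersion), not from the live VM. The types here are simplified stand-ins for illustration.

Sketch: power-state check against the Handle's RuntimeState
------------------------------------------------------------

package main

import (
    "fmt"
    "net"
)

type powerState string

const (
    poweredOn  powerState = "poweredOn"
    poweredOff powerState = "poweredOff"
)

type handle struct {
    runtimePowerState powerState // snapshot guarded by the handle's changeVersion
    assigned          net.IP     // bridge address recorded in the config
}

// requestedIP reuses the Assigned address only when the handle says the cVM
// is currently powered on; otherwise it asks for a fresh dynamic address (nil).
func requestedIP(h handle) net.IP {
    if h.runtimePowerState == poweredOn {
        return h.assigned
    }
    return nil
}

func main() {
    off := handle{runtimePowerState: poweredOff, assigned: net.ParseIP("172.16.0.21")}
    on := handle{runtimePowerState: poweredOn, assigned: net.ParseIP("172.16.0.21")}
    fmt.Println("powered off -> request:", requestedIP(off)) // <nil>: new IP
    fmt.Println("powered on  -> request:", requestedIP(on))  // reuse 172.16.0.21
}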


stuclem commented Jan 9, 2018

@lgayatri @sflxn @hickeng I have no idea how to condense this issue into a couple of lines for the release notes. Does this even need to be release noted?

@lgayatri

@stuclem, IMHO let's not document this for now. BTW, this issue will be seen in 1.3.1 as well.
[screenshot attached]


stuclem commented Jan 30, 2018

Thanks @lgayatri. Removing kind/note.

@stuclem removed the impact/doc/note label Jan 30, 2018

hickeng commented Jan 31, 2018

@sflxn can you summarize how far you got with this?

As a follow-up note to my prior comments: the bridge is not “dynamic”; however, we use that flag set to true in the config to differentiate between a bridge address assigned by the user vs. a bridge address set via the network component.

@mdubya66

@lgayatri can you confirm we are still running into this issue?

@mdubya66 added the source/system-test label Mar 12, 2018
@sgairo added this to the Sprint 28 Container Ops milestone Mar 14, 2018
@lgayatri

Not seen this issue on the 1.4 OVA.
