
Error while powering on containers - "Server error from portlayer: could not find IP range" #6851

Closed
andrewtchin opened this issue Nov 27, 2017 · 58 comments
Labels: kind/defect, priority/p0, source/system-test, team/container

Comments


@lgayatri commented on Tue Nov 21 2017

@andrewtchin @cgtexmex @mhagen-vmware

[screenshots attached]

Service https://10.197.37.143:2376/v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error 500 for POST. id 284135304 message Server error from portlayer: could not find IP range; Reason: {"message":"Server error from portlayer: could not find IP range"}
docker-personality.log says:

Nov 22 2017 05:09:54.751Z INFO  volumes found:
Nov 22 2017 05:09:55.326Z WARN  error getting config from source: config not modified
Nov 22 2017 05:14:47.145Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc420a15240), statusCode:500}
Nov 22 2017 05:14:47.145Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.231Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc4202c9ad0), statusCode:500}
Nov 22 2017 05:14:48.231Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.379Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc420573f30), statusCode:500}
Nov 22 2017 05:14:48.379Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range
Nov 22 2017 05:14:48.661Z ERROR Operation failed with errors.apiError{error:(*errors.errorString)(0xc42048b360), statusCode:500}
Nov 22 2017 05:14:48.661Z ERROR Handler for POST /v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error: Server error from portlayer: could not find IP range

Logs attached.


@lgayatri commented on Tue Nov 21 2017

container-logs (2).zip


@lgayatri commented on Wed Nov 22 2017

Hitting this issue on other VCHs too.

[screenshot attached]


@mdubya66 commented on Wed Nov 22 2017

@sergiosagu could this be related to the ip range work we did?


@sergiosagu commented on Thu Nov 23 2017

@mdubya66 - hmm... I don't think so, but we should investigate it.

@lgayatri - could you attach the application template that you used for the deployment? Or perhaps create one, if you ran the container directly.


@lgayatri commented on Thu Nov 23 2017

@sergiosagu , I have live setup. Please take a look

https://10.197.37.132:8282/#/home/containers - administrator@vsphere.local/Admin!23 and root/Admin!23


@sergiosagu commented on Fri Nov 24 2017

@mdubya66 - It looks like it could be something with the VCH host itself.

I've found these:

Probably someone from the engine team should take a look.


@andrewtchin commented on Mon Nov 27 2017

@lgayatri Were these VCHs restarted? What are the steps to reproduce this?

@lgayatri

@andrewtchin , VCHs were not restarted.
Steps:

  1. Create 50 VCHs with about 50 containers in each.
  2. The containers stopped after their process exited and hence were powered off.
  3. From the Admiral UI, I powered on a few containers and hit this issue.

@mdubya66

is this a regression? @mhagen-vmware

@mhagen-vmware

Looks like it. This is the same test that she has been running; it is now not passing.

@lgayatri

@mdubya66 , Regression - Yes

@mdubya66

@sflxn this is a potential stopper. can we root cause?


sflxn commented Nov 29, 2017

@lgayatri I pinged you on slack. Can you provide me with the scripts you used to reproduce this?

@lgayatri

@sflxn , sent in email. BTW, we are in different time zones.

@mdubya66

@mlh78750 potential stopper, adding to the list until we get root cause


sflxn commented Dec 1, 2017

I've broken down the portlayer logs the best I can. The logs don't provide the operation that starts a reconfigure, so it's hard to determine what is what, but through digging I've been able to surmise most of the operations that occurred for container c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49 in the logs.

What I suspect is happening is that we are not properly accounting for the removal of container IPs in our internal bookkeeping. The error shown in the original post originates from VIC's internal accounting, not from vSphere.

Not sure yet whether this is a regression, or why the accounting is off. Marked-up portlayer logs attached.
port-layer.log
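
To make the internal-accounting suspicion above concrete, here is a minimal sketch (not the actual portlayer code) of how an IP pool that misses a release ends up rejecting a later reservation, even though vSphere itself has no objection. All names below are illustrative.

Sketch: missed release breaks IP accounting
-------------------------------------------

package main

import (
    "errors"
    "fmt"
    "net"
)

// pool tracks which addresses are handed out; illustrative only.
type pool struct {
    inUse map[string]bool
}

func newPool() *pool { return &pool{inUse: map[string]bool{}} }

// Reserve fails if the address is still marked as in use.
func (p *pool) Reserve(ip net.IP) error {
    if p.inUse[ip.String()] {
        return errors.New("could not find IP range")
    }
    p.inUse[ip.String()] = true
    return nil
}

// Release must be recorded on every unbind; if it is skipped, the
// accounting drifts from reality.
func (p *pool) Release(ip net.IP) { delete(p.inUse, ip.String()) }

func main() {
    p := newPool()
    ip := net.ParseIP("172.16.0.4")
    _ = p.Reserve(ip)
    // The container stops, but Release is never recorded...
    // ...so the restart's reservation fails:
    fmt.Println(p.Reserve(ip)) // could not find IP range
}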


sflxn commented Dec 1, 2017

If there is a regression, it is likely in PR #6525. Still reviewing changes.


sflxn commented Dec 5, 2017

I've reviewed the PR above, and I do not believe the regression is there. I have studied the IP release paths in the portlayer and persona and cannot find a scenario that would cause it to fail to remove the IP from the portlayer's internal accounting.

I've looked at PRs since 1.2 in the portlayer's networking folder, in the portlayer's REST handler, and in the persona, and have not come across any change that would cause this.

On manual start (by @lgayatri), the tether logs show the container VM with a zeroed IP and static set to false (DHCP). This is expected.

I've discussed this with @hickeng and @hmahmood and haven't come up with any other ideas as to what is causing this.

The only path left that I can consider is the following lines in the portlayer logs:

Dropping duplicate power on operation?
--------------------------------------

Nov 21 2017 07:42:39.685Z INFO  op=292.4740: Dropping duplicate power on operation for c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49
Nov 21 2017 07:42:51.571Z ERROR op=292.4763: unexpected fault on task retry: &types.ConcurrentAccess{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}


sflxn commented Dec 5, 2017

@lgayatri As I mentioned on the skype call, can you run these two tests?

  1. vic-machine create with --debug=2. Lower the sleep time to 45mins or so.
  2. vic-machine create with --debug=2. Change the script's command from docker run to docker create instead. Then go and power on the containers.


sflxn commented Dec 5, 2017

Some updated info on this issue:

@lgayatri found this issue while testing on VC 6.5.

  1. Created the VCHs with vic-machine CLI
  2. Added VCHs to Admiral
  3. Ran docker commands from command line against VCH's API server
  4. Once containers stopped, she came back and tried to batch start many of the stopped containers from Admiral

From the logs:

Nearly all (if not all) of the containers she tried to start returned a conflict error from the portlayer multiple times (from retries). Error:

ERROR Will stop trying again. Operation failed with errors.apiError{error:(*errors.errorString)(0xc4205bb1a0), statusCode:409}

Only one failed with the "could not find IP range" error towards the end of the log.


lgayatri commented Dec 5, 2017

@sflxn , first step is in progress:
vic-machine create with --debug=2. Lower the sleep time to 45mins or so.


lgayatri commented Dec 5, 2017

@sflxn please feel free to access my testbed. There are 2500+ containers with sleep 45m executed and now in STOPPED state.


sflxn commented Dec 5, 2017

@lgayatri I went on your server and tried to start 100 containers and they all started. I then tried with 200+ containers, and it worked. Perhaps debug has altered some timing. I started the containers in the container view of the project_lg. They were spread over many VCHs. I assume you did the same thing.

I noticed that there were a lot of containers that should have been in the stopped state but were still running 10-11hrs later. Did you see this?


sflxn commented Dec 5, 2017

I was just able to reproduce it with the debug VCHs that @lgayatri deployed. The debug logs are attached. Container e505021d6222730b7fa2591f549aa454190aefdebdec3523cd0e2eb5e0791815.

container-logs-debug.zip

observations

  1. I noticed that Admiral requests the container start 4 times without any exponential backoff. Each request comes milliseconds after the last one failed (a backoff sketch follows below).
  2. I am able to restart the container from the command line using the docker CLI.
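
For reference, a minimal sketch of the kind of client-side exponential backoff that would avoid the millisecond-apart retries noted in observation 1. startContainer is a hypothetical stand-in for Admiral's start request, not its real API.

Sketch: retry with exponential backoff
--------------------------------------

package main

import (
    "errors"
    "fmt"
    "time"
)

// startContainer is a placeholder for the POST /containers/{id}/start call.
func startContainer(id string) error {
    return errors.New("409 conflict")
}

func startWithBackoff(id string, attempts int) error {
    delay := 500 * time.Millisecond
    for i := 0; i < attempts; i++ {
        err := startContainer(id)
        if err == nil {
            return nil
        }
        fmt.Printf("attempt %d failed: %v; retrying in %v\n", i+1, err, delay)
        time.Sleep(delay)
        delay *= 2 // back off instead of retrying immediately
    }
    return fmt.Errorf("giving up on %s after %d attempts", id, attempts)
}

func main() {
    _ = startWithBackoff("e505021d6222", 4)
}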


mlh78750 commented Dec 6, 2017

@sflxn We should open an issue with Admiral on the backoff. Also, do you have a good idea of the root cause at this point?


lgayatri commented Dec 6, 2017

@sflxn All containers should be in the stopped state as they had sleep 45m. I still see 37 containers in RUNNING status:
[screenshot attached]

This is an Admiral UI refresh issue. Opened vmware-archive/admiral#231.


lgayatri commented Dec 6, 2017

@sflxn do you want me to run the second test ?
....
vic-machine create with --debug=2. Change the script's command from docker run to docker create instead. Then go and power on the containers.
....


sflxn commented Dec 6, 2017

@mlh78750 I believe it's trying to reserve an IP that is already in use. The debug logs are not actually helpful.

Scopes BindContainer for e505021d6222730b7fa2591f549aa454190aefdebdec3523cd0e2eb5e0791815
--------------------

Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/apiservers/portlayer/restapi/handlers.(*ScopesHandlersImpl).ScopesBindContainer:273] handle(b3427a92bdb994a4aa90f924214b5904)
Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/portlayer/network.(*Context).BindContainer:658]
Dec  5 2017 21:36:15.902Z DEBUG [BEGIN]  [vic/lib/portlayer/network.(*Context).container:848]
Dec  5 2017 21:36:15.902Z DEBUG [ END ]  [vic/lib/portlayer/network.(*Context).container:848] [15.363µs] 
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from loving_wright
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from festive_yonath
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from musing_newton
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from sharp_meninsky
Dec  5 2017 21:36:15.902Z DEBUG getting aliases for focused_davinci from frosty_torvalds
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from sharp_golick
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from happy_fermi
Dec  5 2017 21:36:15.903Z DEBUG getting aliases for focused_davinci from competent_lumiere
Dec  5 2017 21:36:15.903Z DEBUG adding alias bridge:3d64843977c3 -> focused_davinci
Dec  5 2017 21:36:15.903Z DEBUG adding alias bridge:focused_davinci -> focused_davinci
Dec  5 2017 21:36:15.903Z DEBUG [ END ]  [vic/lib/portlayer/network.(*Context).BindContainer:658] [177.919µs] 
Dec  5 2017 21:36:15.903Z DEBUG [ END ]  [vic/lib/apiservers/portlayer/restapi/handlers.(*ScopesHandlersImpl).ScopesBindContainer:273] [201.507µs] handle(b3427a92bdb994a4aa90f924214b5904)

I'm adding more info to the logs so that we can actually see which IP it is trying to reserve. Also, the error message is ambiguous: the IP could already be in use, or the requested IP could be outside the pool's range. My update should make that clearer. I need to verify that this hypothesis is correct, that it's requesting an IP that's already in use.

How the IP came to be already in use is the next investigation. I don't see this path while perusing the code. There is something in the logs that Lakshmi originally posted: a lot of the containers she started on that VCH failed after multiple retries, with a concurrent access error. The code that exists in the persona should unbind the container in the event of failure. I'm trying to find a case where it doesn't really unbind.
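
As an illustration of the error-message split mentioned above (an address already in use vs. one outside the pool's range), here is a sketch with purely illustrative types; it is not the portlayer's actual reservation code.

Sketch: disambiguating the reservation error
--------------------------------------------

package main

import (
    "fmt"
    "net"
)

type ipPool struct {
    subnet *net.IPNet
    inUse  map[string]bool
}

// ReserveIP distinguishes "outside the pool's range" from "already in use".
func (p *ipPool) ReserveIP(ip net.IP) error {
    if !p.subnet.Contains(ip) {
        return fmt.Errorf("cannot reserve %s: outside pool range %s", ip, p.subnet)
    }
    if p.inUse[ip.String()] {
        return fmt.Errorf("cannot reserve %s: already in use", ip)
    }
    p.inUse[ip.String()] = true
    return nil
}

func main() {
    _, subnet, _ := net.ParseCIDR("172.16.0.0/16")
    p := &ipPool{subnet: subnet, inUse: map[string]bool{"172.16.0.10": true}}

    fmt.Println(p.ReserveIP(net.ParseIP("10.0.0.1")))    // outside pool range
    fmt.Println(p.ReserveIP(net.ParseIP("172.16.0.10"))) // already in use
    fmt.Println(p.ReserveIP(net.ParseIP("172.16.0.11"))) // <nil>: reserved
}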

@lgayatri

@sflxn , test bed is ready, please feel free to use it. The VCHs have been created with debug level 2.


sflxn commented Dec 20, 2017

I might be wrong about the summary and fix from earlier.

From digging in further, I wonder whether the network event handler is at fault. Here is a new summarized portlayer log. I've included the handle change version from the logs.

  1. Container A restart is using the old handle
  2. Container A's network event handler gets triggered and creates a new handle and the IP is released in this changeset.
  3. Container B starts and reserves the IP
  4. Container A restart finally tries to reserve the IP, but has old information (ne.Assigned.IP) since it's using the old change version

Questions

  1. Why does the network event handler create a new handle? Should it do that?
  2. What mechanism does the portlayer have to sync up concurrent access to a handle?
  3. Should we disallow access to this container until the unbind finishes? Are we not detecting this scenario as a concurrent access problem? I think this is ultimately the solution: detect this as concurrent access and return that error for step 1 above (container A restart). When the persona detects this error, it will retry. Eventually there will be no concurrent access error and the persona will get the latest handle (sketched below).

Portlayer-Summarized.log
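
A minimal sketch of the concurrent-access detection proposed in question 3: reject a bind arriving with a stale handle (old change version) so the persona refetches and retries. The handle and change-version fields here are simplified stand-ins, not the portlayer's real types.

Sketch: rejecting a bind on a stale handle
------------------------------------------

package main

import (
    "errors"
    "fmt"
)

type handle struct {
    containerID   string
    changeVersion int
}

type containerState struct {
    changeVersion int
}

var errConcurrentAccess = errors.New("concurrent access: handle is stale, refetch and retry")

// bindContainer refuses to act on a handle that predates the latest change,
// e.g. one created before the network event handler released the IP.
func bindContainer(h handle, current *containerState) error {
    if h.changeVersion != current.changeVersion {
        return errConcurrentAccess
    }
    // ...reserve an IP using the up-to-date state...
    return nil
}

func main() {
    state := &containerState{changeVersion: 2} // IP already released by the event handler
    stale := handle{containerID: "c8b910a14b38", changeVersion: 1}

    if err := bindContainer(stale, state); errors.Is(err, errConcurrentAccess) {
        fmt.Println("persona would refetch the handle and retry:", err)
    }
}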


lgayatri commented Dec 20, 2017

@sflxn , here is the repro:

[screenshot attached]
container-logs-10.197.37.131.zip
SSH has been enabled for 10.197.37.131, I sent you the credentials.

@mdubya66

I was able to restart the container 172.16.0.17, which was one of the two that failed to start. 172.16.0.21 would not restart from either the command line or Admiral; it kept failing with the IP-address-in-use message. I then searched for 172.16.0.21 in Admiral and found 54 containers with that IP address. If there are 50 VCHs, that seems odd. I used docker inspect on each container and did not see a single stopped container with that IP address. So I then started clicking on each container in Admiral, and one by one each container would be removed from the 172.16.0.21 search window. After all 54 were no longer in the search window, I found the affectionate_torvalds container and clicked start. It then ran, where it would not run before, from either Admiral or the docker command line.

Could admiral holding onto the IP address have some effect?
screen shot 2017-12-20 at 12 11 41 pm

@mdubya66

Changes to the system overnight showed a new container with the same error. I searched on its IP and there are 338 containers sharing the same IP.

screen shot 2017-12-20 at 12 19 26 pm


sflxn commented Dec 20, 2017

I briefly analyzed the logs from @lgayatri's latest run with debug logging. The error is similar to the one I found yesterday, but with a slight difference. The root cause is that when a container unbinds from the network, it needs to release the IP, which it does. Concurrently, a request comes in to restart the container and tries to grab the same IP because it's using an old handle (where the IP wasn't released yet). It needs to get the new handle (where the IP was released) so that it can request a new IP.

I have a WIP PR that goes toward resolving this. I just need to investigate why your logs show it repeatedly trying to restart and bind using the old handle. This shouldn't happen. A retry in the persona should get the new handle... eventually. This may still be true; I just need to finish the analysis.

#6989

UPDATE: I've verified that eventually the container I was looking at gets the IP. I just now tried to start the container and it started. However, there was still an assigned IP address when we tried to bind. This is a bug: it never fully released the IP address earlier. Hours later, I expected a restart would get the updated handle and see that the IP was released. It wasn't. I'll need to find this bug and add it to the PR I linked above. This bug needs to be fixed because even with returning ConcurrentResourceError from the network layer, we can still hit this issue if the system is thrashing and the container never fully zeroes out the assigned IP.

Dec 20 2017 06:32:18.692Z DEBUG op=300.51: BindContainer found container ff6fcce8c9b6d20c6ea1f21c2c24377ac5d73899b53fb835716e9c07d5c3127a with dynamic endpoint and unassigned IP
Dec 20 2017 06:32:18.692Z INFO  Reserving IP range [172.16.0.4, 172.16.0.4]

portlayer-summarized.log
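
One way to picture the bug described in the UPDATE above: if the unbind path bails out before clearing the endpoint's Assigned address, every later bind keeps insisting on that same IP. The endpoint type and clearOnFailure flag below are illustrative only, not a proposed VIC patch.

Sketch: Assigned IP surviving a failed unbind
---------------------------------------------

package main

import (
    "errors"
    "fmt"
    "net"
)

type endpoint struct {
    Assigned net.IP // address the endpoint believes it still owns
    Static   bool   // user-assigned (true) vs dynamically allocated (false)
}

// reconfigureVM stands in for the vSphere reconfigure that failed with a
// ConcurrentAccess fault in the logs above.
func reconfigureVM() error { return errors.New("ConcurrentAccess fault") }

// unbind releases the endpoint's address. If the reconfigure fails and we
// return early, Assigned stays populated and the next BindContainer will
// try (and fail) to re-reserve it.
func unbind(ep *endpoint, clearOnFailure bool) error {
    err := reconfigureVM()
    if err != nil && !clearOnFailure {
        return err // Assigned survives the failed unbind
    }
    if !ep.Static {
        ep.Assigned = nil // dynamic endpoint: request a fresh IP next time
    }
    return err
}

func main() {
    ep := &endpoint{Assigned: net.ParseIP("172.16.0.4")}
    fmt.Println(unbind(ep, false), "-> assigned afterwards:", ep.Assigned)
    fmt.Println(unbind(ep, true), "-> assigned afterwards:", ep.Assigned)
}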

hickeng pushed a commit to hickeng/vic that referenced this issue Dec 21, 2017
Increasing the event page size on the portlayer
to help with missing events.
Helps with vmware#6384, vmware#6851

sflxn commented Dec 21, 2017

Still tracking down why unbind failed, which eventually causes this issue.

300.637, Unbind failed due to concurrent modification - 79752927304b297507870e2cedd21b0b110b33f6a4900031b26503dea56dfe60
----------
Dec 20 2017 07:32:07.191Z ERROR op=300.637: unexpected fault on task retry: &types.ConcurrentAccess{VimFault:types.VimFault{MethodFault:types.MethodFault{FaultCause:(*types.LocalizedMethodFault)(nil), FaultMessage:[]types.LocalizableMessage(nil)}}}

Once the above error occurs, all requests to bind this container have the IP in question assigned because it was never able to properly unbind (per the error above). On a system that is not loaded, you can eventually restart the container. I was able to do it earlier today, but when I looked at the logs, it still had the IP assigned going into BindContainer. This was many, many hours later. If the VC were loaded, I may not have been able to restart the container, because it insists on getting that specific IP since it was never able to cleanly unbind.

I do see a potential problem with Admiral. Lakshmi's system is very loaded. Admiral never waits for a successful start: it waits a maximum of 2 minutes and then sends VIC another request. In this system, one start request took nearly 6 minutes to complete, and it completed successfully, but Admiral didn't wait around. It had sent 2 more requests during this time, each waiting a maximum of 2 minutes. After the one successful start, Admiral continuously resends the same start for the same container. I think this is a pretty serious bug.
persona-summarized.log


lgayatri commented Dec 21, 2017

My setup is moderately loaded; this is what we support.
@karthik-narayan

@mdubya66

Potential workaround but not pretty.

(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b18afb2d2d79
Error response from daemon: Server error from portlayer: Cannot reserve IP range 172.16.0.10 - 172.16.0.10.  Already in use
Error: failed to start containers: b18afb2d2d79

Using docker network inspect, I found the container that was in conflict, which was b8f692ac54f, and stopped it. Then I restarted both.

(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls stop b8f692ac54f
b8f692ac54f
(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b18afb2d2d79
b18afb2d2d79
(mwilliamson) ib-aus-office-124:~/Downloads $ docker -H 10.197.37.144:2376 --tls start b8f692ac54f
b8f692ac54f
(mwilliamson) ib-aus-office-124:~/Downloads $

@mdubya66 added the impact/doc/note label Dec 29, 2017

hickeng commented Jan 3, 2018

I think this issue after a failed unbind has been around since we added the VCH restart support in #2579. We should not be re-requesting the same address if the cVM is not currently powered on.
Any power state check should be made against the RuntimeState in the Handle, not against the VM itself - that keeps the prereq logic protected by the changeVersion.
The bridge network is not considered a dynamic scope, so we will try to reuse the specific IP in the Assigned field.
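
A sketch of the check described above: decide whether to re-request the previously Assigned bridge address from the RuntimeState carried in the Handle (and therefore protected by the changeVersion), not from the live VM. The types here are simplified stand-ins for illustration.

Sketch: power-state check against the Handle's RuntimeState
------------------------------------------------------------

package main

import (
    "fmt"
    "net"
)

type powerState string

const (
    poweredOn  powerState = "poweredOn"
    poweredOff powerState = "poweredOff"
)

type handle struct {
    runtimePowerState powerState // snapshot guarded by the handle's changeVersion
    assigned          net.IP     // bridge address recorded in the config
}

// requestedIP reuses the Assigned address only when the handle says the cVM
// is currently powered on; otherwise it asks for a fresh dynamic address (nil).
func requestedIP(h handle) net.IP {
    if h.runtimePowerState == poweredOn {
        return h.assigned
    }
    return nil
}

func main() {
    off := handle{runtimePowerState: poweredOff, assigned: net.ParseIP("172.16.0.21")}
    on := handle{runtimePowerState: poweredOn, assigned: net.ParseIP("172.16.0.21")}
    fmt.Println("powered off -> request:", requestedIP(off)) // <nil>: new IP
    fmt.Println("powered on  -> request:", requestedIP(on))  // reuse 172.16.0.21
}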


stuclem commented Jan 9, 2018

@lgayatri @sflxn @hickeng I have no idea how to condense this issue into a couple of lines for the release notes. Does this even need to be release noted?

@lgayatri

@stuclem, IMHO let's not document this for now. BTW, this issue will be seen in 1.3.1 as well.
[screenshot attached]


stuclem commented Jan 30, 2018

Thanks @lgayatri. Removing kind/note.

@stuclem removed the impact/doc/note label Jan 30, 2018

hickeng commented Jan 31, 2018

@sflxn can you summarize how far you got with this?

As a follow-up note to my prior comments: the bridge is not “dynamic”; however, we use that flag set to true in the config to differentiate between a bridge address assigned by the user vs. a bridge address set via the network component.

@mdubya66

@lgayatri can you confirm we are still running into this issue?

@mdubya66 added the source/system-test label Mar 12, 2018
@sgairo added this to the Sprint 28 Container Ops milestone Mar 14, 2018
@lgayatri

Not seen this issue on the 1.4 OVA.
