Error while powering on containers - "Server error from portlayer: could not find IP range" #6851
@andrewtchin, VCHs were not restarted.
Is this a regression? @mhagen-vmware
Looks like it; this is the same test that she has been running, and it is now not passing.
@mdubya66, Regression - Yes
@sflxn this is a potential stopper. Can we root cause?
@lgayatri I pinged you on Slack. Can you provide me with the scripts you used to reproduce this?
@sflxn, sent in email. BTW, we are in different time zones.
@mlh78750 potential stopper; adding to the list until we get root cause.
I've broken down the portlayer logs as best I can. The logs don't record the operation that starts a reconfigure, so it's hard to determine what is what, but by digging I've been able to surmise most of the operations that occurred for container c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49 in the logs. What I suspect is happening is that we are not properly accounting for the removal of container IPs in our internal bookkeeping. The error shown in the original post originates from VIC's internal accounting, not from vSphere. Not sure yet whether this is a regression, nor why the accounting is off. Marked-up portlayer logs attached.
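The internal accounting suspected here behaves like an IP pool that must see a matching release for every reservation. Below is a minimal, hypothetical sketch (the `IPPool` class and its method names are invented for illustration, not VIC's actual code) showing how a skipped release leaves an address permanently "in use" and produces the same opaque failure:

```python
import ipaddress

class IPPool:
    """Toy sketch of per-scope IP accounting (illustrative, not VIC's code)."""

    def __init__(self, cidr):
        net = ipaddress.ip_network(cidr)
        self.available = set(net.hosts())
        self.reserved = set()

    def reserve(self, ip=None):
        ip = ipaddress.ip_address(ip) if ip else next(iter(self.available), None)
        if ip is None or ip not in self.available:
            # Mirrors the vague failure mode in the issue: the address is
            # either outside the pool or already marked as reserved.
            raise ValueError("could not find IP range")
        self.available.discard(ip)
        self.reserved.add(ip)
        return ip

    def release(self, ip):
        ip = ipaddress.ip_address(ip)
        self.reserved.discard(ip)
        self.available.add(ip)

pool = IPPool("172.16.0.0/28")
ip = pool.reserve("172.16.0.2")
# If release() is never called (the suspected accounting bug), re-reserving
# the same address fails even though the container VM is long gone.
try:
    pool.reserve("172.16.0.2")
except ValueError as e:
    print(e)  # could not find IP range
pool.release(ip)
pool.reserve("172.16.0.2")  # succeeds after a proper release
```

The point of the sketch is that the error surfaces at the *next* reservation, far from the unbind that failed to release, which is why the logs above are hard to correlate.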
If there is a regression, it is likely in PR #6525. Still reviewing changes.
I've reviewed the PR above, and I do not believe the regression is there. I have studied the IP release paths in the portlayer and persona and cannot find a scenario that would cause it to fail to remove the IP from the portlayer's internal accounting. I've looked at PRs since 1.2 in the portlayer's networking folder, in the portlayer's REST handler, and in the persona, and have not come across any change that would cause this. On manual start (by @lgayatri), the tether logs show the container VM has a zeroed IP and static is false (DHCP). This is expected. I've discussed this with @hickeng and @hmahmood and we haven't come up with any other ideas as to what is causing this. The only path left that I can consider is the following lines in the portlayer logs:
@lgayatri As I mentioned on the Skype call, can you run these two tests?
Some updated info on this issue: @lgayatri found this issue while testing on VC 6.5.
From the logs: nearly all (if not all) the containers she tried to start returned a conflict error from the portlayer multiple times (from retries). Error:
Only one failed with the "could not find IP range" error towards the end of the log.
@sflxn, first step is in progress:
@sflxn please feel free to access my testbed. There are 2500+ containers that executed sleep 45m and are now in STOPPED state.
@lgayatri I went on your server and tried to start 100 containers, and they all started. I then tried with 200+ containers, and it worked. Perhaps debug has altered some timing. I started the containers in the container view of project_lg; they were spread over many VCHs. I assume you did the same thing. I noticed that a lot of containers that should have been in the stopped state were still running 10-11 hrs later. Did you see this?
I was just able to reproduce it with the debug VCHs that @lgayatri deployed. The debug logs are attached. Container e505021d6222730b7fa2591f549aa454190aefdebdec3523cd0e2eb5e0791815. Observations:
@sflxn We should open an issue with Admiral on the backoff. Also, do you have a good idea of the root cause at this point?
@sflxn All containers should be in stopped state, as they had sleep 45m. I still see 37 containers in RUNNING status; this is an Admiral UI refresh issue. Opened vmware-archive/admiral#231.
@sflxn do you want me to run the second test?
@mlh78750 I believe it's trying to reserve an IP that is already in use. The debug logs are not actually helpful.
I'm adding more info to the logs so that we can actually see which IP it is trying to reserve. Also, the error message is ambiguous: the IP could already be in use, or the requested IP could be outside the pool's range. My update should make that clearer. I need to verify this hypothesis, that it's requesting an IP that's already in use. How the IP came to be in use is the next investigation; I don't see this path while perusing the code. There is something in the logs that Lakshmi originally posted: a lot of the containers she started on that VCH failed after multiple retries with a concurrent access error. The code in the persona should unbind the container in the event of failure. I'm trying to find a case where it doesn't really unbind.
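The logging improvement described here amounts to splitting one ambiguous failure into two distinct errors. A hedged sketch of that classification (the `check_reservation` helper and its messages are hypothetical, not the actual PR):

```python
import ipaddress

def check_reservation(cidr, reserved, requested):
    """Classify why an IP reservation would fail (illustrative only).

    Separates the two causes that the original error message conflated:
    the address being outside the pool, or already held by a container."""
    net = ipaddress.ip_network(cidr)
    ip = ipaddress.ip_address(requested)
    if ip not in net:
        return f"ip {ip} is outside pool range {net}"
    if ip in reserved:
        return f"ip {ip} is already in use"
    return "ok"

reserved = {ipaddress.ip_address("172.16.0.21")}
print(check_reservation("172.16.0.0/24", reserved, "172.16.0.21"))
print(check_reservation("172.16.0.0/24", reserved, "10.0.0.5"))
```

With distinct messages, a log line immediately tells you whether to investigate pool configuration or a leaked reservation.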
@sflxn, test bed is ready; please feel free to use it. The VCHs have been created with debug level 2.
I might be wrong about the earlier summary and fix. From digging in further, I suspect the network event handler may be at fault. Here is a new summarized portlayer log. I've included the handle change version from the logs.
Questions
@sflxn, here is the repro:
I was able to restart the container 172.16.0.17, which was one of the two that failed to start. 172.16.0.21 would not restart from either the command line or Admiral; it kept failing with the IP-address-in-use message. I then searched for 172.16.0.21 in Admiral and found 54 containers with that IP address. If there are 50 VCHs, that seems odd. I used docker inspect on each container and did not see a single stopped container with that IP address. So I then started clicking on each container in Admiral, and one by one each container would be removed from the 172.16.0.21 search window. After all 54 were no longer in the search window, I found the affectionate_torvalds container and clicked start. It then ran, where it would not run before from either Admiral or the docker command line.
I briefly analyzed the logs from @lgayatri's latest run with debug logging. The error is similar to the one I found yesterday, with a slight difference. The root cause is that when a container unbinds from the network, it needs to release the IP, which it does. Concurrently, a request comes in to restart the container and tries to grab the same IP because it's using an old handle (where the IP wasn't released yet). It needs to get the new handle (where the IP was released) so that it can request a new IP. I have a WIP PR that goes toward resolving this. I just need to investigate why the logs show it repeatedly trying to restart and bind using the old handle. This shouldn't happen; a retry in the persona should get the new handle... eventually. This may still be true; I just need to finish the analysis. UPDATE: I've verified that the container I was looking at eventually gets the IP. I just now tried to start the container and it started. However, there was still an assigned IP address when we tried to bind. This is a bug: it never fully released the IP address earlier. Hours later, I expected a restart to get the updated handle and see that the IP was released. It wasn't. I'll need to find this bug and add it to the PR I linked above. This bug needs to be fixed, because even with returning ConcurrentResourceError from the network layer, we can still hit this issue if the system is thrashing and the container never fully zeroes out the assigned IP.
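The stale-handle race described above can be sketched as optimistic concurrency on a container handle: a bind built from an old snapshot still carries the released IP, and only a freshly fetched handle sees the release. `Store`, `get_handle`, and `commit` below are made-up names for illustration, not the portlayer's actual handle API:

```python
import copy

class Store:
    """Toy versioned container-config store (optimistic concurrency sketch)."""

    def __init__(self, state):
        self.version = 1
        self.state = dict(state)

    def get_handle(self):
        # A handle is a snapshot of config plus the version it was read at.
        return {"version": self.version, "state": copy.deepcopy(self.state)}

    def commit(self, handle):
        # Reject commits built from a snapshot that is no longer current.
        if handle["version"] != self.version:
            raise RuntimeError("stale handle: concurrent modification")
        self.state = copy.deepcopy(handle["state"])
        self.version += 1

store = Store({"assigned_ip": "172.16.0.21"})

old = store.get_handle()  # taken before the unbind completed

# Unbind releases the IP and commits, bumping the version.
unbind = store.get_handle()
unbind["state"]["assigned_ip"] = None
store.commit(unbind)

# A bind using the old handle would re-request 172.16.0.21; it is rejected.
try:
    store.commit(old)
except RuntimeError as e:
    print(e)  # stale handle: concurrent modification

# The retry must fetch a fresh handle, which sees the IP as released.
fresh = store.get_handle()
print(fresh["state"]["assigned_ip"])  # None -> request a new IP
```

The bug in the thread is the case where the retry keeps reusing the old snapshot instead of re-fetching, so it insists on the released address forever.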
Increasing the event page size on the portlayer to help with missing events. Helps with vmware#6384, vmware#6851
Still tracking down why unbind failed, which eventually causes this issue.
Once the above error occurs, all requests to bind this container have the IP in question assigned, because it was never able to properly unbind (per the error above). On a system that is not loaded, you can eventually restart the container. I was able to do it earlier today, but when I looked at the logs, it still had the IP assigned going into BindContainer. This was many, many hours later. If the VC were loaded, I may not have been able to restart the container, because it insists on getting that specific IP since it was never able to cleanly unbind. I do see a potential problem with Admiral. Lakshmi's system is very loaded. Admiral never waits for a successful start; it waits a maximum of 2 minutes and then sends VIC another request. In this system, one start request took nearly 6 minutes to complete, and it completed successfully, but Admiral didn't wait around. It had sent 2 more requests during this time, each waiting a maximum of 2 minutes. After the one successful start, Admiral continuously resends the same start for the same container. I think this is a pretty serious bug.
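The Admiral behavior described here, a fixed 2-minute wait followed by a blind resend, piles duplicate in-flight starts onto the same container. One defensive pattern on the serving side is to deduplicate starts that are still in flight; the `StartDeduper` class below is a hypothetical sketch, not Admiral's or VIC's actual code:

```python
import threading

class StartDeduper:
    """Drop duplicate start requests while one is still in flight (sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_begin(self, container_id):
        with self._lock:
            if container_id in self._in_flight:
                return False  # a start is already running; don't pile on
            self._in_flight.add(container_id)
            return True

    def finish(self, container_id):
        with self._lock:
            self._in_flight.discard(container_id)

d = StartDeduper()
print(d.try_begin("c8b910a1"))  # True: first request proceeds
print(d.try_begin("c8b910a1"))  # False: retry arrives while still in flight
d.finish("c8b910a1")
print(d.try_begin("c8b910a1"))  # True: safe to start again once finished
```

This only masks the client bug; the real fix is for the client to wait on (or poll) the original operation rather than resending on a fixed timeout.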
My setup is moderately loaded; this is what we support.
Potential workaround, but not pretty.
I think this issue after a failed unbind has been around since we added the VCH restart support in #2579. We should not be re-requesting the same address if the cVM is not currently powered on. |
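The fix suggested above, not re-requesting a previously assigned address when the cVM is not powered on, comes down to a small conditional at bind time. A hedged sketch with hypothetical names (this is not the actual portlayer code):

```python
def ip_request_for_bind(powered_on, previous_ip):
    """Decide which IP to request at bind time (illustrative only).

    Only a currently powered-on container VM can legitimately still hold
    its address; otherwise ask the pool for any free IP rather than
    insisting on the stale one left behind by a failed unbind."""
    if powered_on and previous_ip:
        return previous_ip  # keep the live container's address
    return None             # None means "allocate any free address"

print(ip_request_for_bind(True, "172.16.0.21"))   # 172.16.0.21
print(ip_request_for_bind(False, "172.16.0.21"))  # None
```

Under this rule, a powered-off container that never cleanly unbound simply gets a fresh address on restart instead of failing on the leaked one.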
@stuclem, IMHO let's not document this for now. BTW, this issue will be seen in 1.3.1 as well.
Thanks @lgayatri. Removing kind/note. |
@sflxn can you summarize how far you got with this? As a follow-up note to my prior comments: the bridge is not “dynamic”; however, we set that to true in the config to differentiate between a bridge address assigned by the user and a bridge address set via the network component.
@lgayatri can you confirm we are still running into this issue? |
I have not seen this issue on the 1.4 OVA.
@lgayatri commented on Tue Nov 21 2017
@andrewtchin @cgtexmex @mhagen-vmware
Service https://10.197.37.143:2376/v1.24/containers/c8b910a14b383e2d473ac18befbc220d0e072829e9b44424cf8554714f21ac49/start returned error 500 for POST. id 284135304 message Server error from portlayer: could not find IP range; Reason: {"message":"Server error from portlayer: could not find IP range"}
docker-personality.log says:
Logs attached.
@lgayatri commented on Tue Nov 21 2017
container-logs (2).zip
@lgayatri commented on Wed Nov 22 2017
Hitting this issue on other VCHs too.
@mdubya66 commented on Wed Nov 22 2017
@sergiosagu could this be related to the ip range work we did?
@sergiosagu commented on Thu Nov 23 2017
@mdubya66 - hmm... I don't think so, but we should investigate it.
@lgayatri - could you attach the application template that you used for the deployment? Or perhaps create one if you ran the container directly.
@lgayatri commented on Thu Nov 23 2017
@sergiosagu, I have a live setup. Please take a look:
https://10.197.37.132:8282/#/home/containers - administrator@vsphere.local/Admin!23 and root/Admin!23
@sergiosagu commented on Fri Nov 24 2017
@mdubya66 - It looks like it could be something with the VCH host itself.
I've found these:
Probably someone from the engine team should take a look.
@andrewtchin commented on Mon Nov 27 2017
@lgayatri Were these VCHs restarted? What are the steps to reproduce this?