controlapi: Allow removal of network when only dead tasks reference it #2018

aaronlehmann · 2017-03-08T19:53:48Z

RemoveNetwork currently won't allow a network to be removed if any task in the store references it. This is a change relative to Docker 1.12, where only services were checked.

This might be a little heavy handed. Some tasks in the store may be long dead. Resource attachment tasks, in particular, must be detached by the API user, and this might be neglected after a failure.

Change RemoveNetwork to only fail the network deletion if a task has actual and desired state of "running" or below. This means that neither a failed resource attachment task (whose desired state won't be adjusted) or a down node (where actual state won't reflect desired state) can block the network deletion.

There may be some corner cases where deleting the network while a task is in the process of shutting down prevents the allocator from deallocating that task, but on balance this seems better than allowing many situations to block network deletion indefinitely.

cc @alexmavr @dongluochen @aboch

RemoveNetwork currently won't allow a network to be removed if any task in the store references it. This is a change relative to Docker 1.12, where only services were checked. This might be a little heavy handed. Some tasks in the store may be long dead. Resource attachment tasks, in particular, must be detached by the API user, and this might be neglected after a failure. Change RemoveNetwork to only fail the network deletion if a task has actual and desired state of "running" or below. This means that neither a failed resource attachment task (whose desired state won't be adjusted) or a down node (where actual state won't reflect desired state) can block the network deletion. There may be some corner cases where deleting the network while a task is in the process of shutting down prevents the allocator from deallocating that task, but on balance this seems better than allowing many situations to block network deletion indefinitely. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

codecov · 2017-03-08T20:04:51Z

Codecov Report

Merging #2018 into master will increase coverage by 0.07%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2018      +/-   ##
==========================================
+ Coverage   53.66%   53.73%   +0.07%     
==========================================
  Files         109      109              
  Lines       18991    18978      -13     
==========================================
+ Hits        10191    10198       +7     
+ Misses       7578     7549      -29     
- Partials     1222     1231       +9

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4762bc...b00598c. Read the comment docs.

aboch · 2017-03-08T21:56:36Z

Looks good to me

aboch · 2017-03-08T22:17:54Z

Just to make sure,
if t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
the agent has reported the state therefore there is no way the corresponding container is still running. Is that correct ?
Just want to make sure we do not allow a swarm level network remove while the effective network remove in libnetwork would fail.

aaronlehmann · 2017-03-08T22:42:10Z

Just to make sure,
if t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
the agent has reported the state therefore there is no way the corresponding container is still running. Is that correct ?

This logic allows the network removal if either:

The container has stopped running (t.Status.State), OR
The manager wants the container to stop running (t.DesiredState)

I did it this way to lean towards being permissive about deletions. If I only relied on t.Status.State, a disconnected or unresponsive node might block the network from being removed, since we would not have received confirmation that the container has actually shut down. If we only relied on t.DesiredState, network attachment tasks that have stopped running would block networks from being removed, since their desired state is never updated (there is no orchestrator for them).

If the possibility of removing a network that's being used by a container which is about to shut down is a problem, we can revisit this. My thinking was that allowing some corner cases like this would be better than the alternative of cases where it's impossible to remove the network, say because it's blocked by a task on an unresponsive node.

aboch · 2017-03-08T22:50:09Z

My thinking was that allowing some corner cases like this would be better than the alternative of cases where it's impossible to remove the network

Agree, seems reasonable. Also, considering that proving the above theory about a corner case may be harder than actually work to a fix on a different part of the stack, should the issue presents.

aboch · 2017-03-08T22:50:21Z

Still looks good to me

aluzzardi · 2017-03-09T19:19:56Z

And we assume a race where the a task is being rescheduled is not possible since first we're checking for services, correct?

aaronlehmann · 2017-03-09T19:28:13Z

Yeah.

aluzzardi · 2017-03-09T21:05:44Z

LGTM

dongluochen · 2017-03-10T21:39:33Z

I agree with this change. However I think we need to treat "node down" as a common case, not a corner case like this. We should have a way to cleanly evict tasks from a node and handle state properly in manager. Other components don't need to make assumption on it.

aaronlehmann added the process/cherry-pick label Mar 8, 2017

aaronlehmann added this to the 17.03.1 milestone Mar 8, 2017

aluzzardi merged commit c51357d into moby:master Mar 9, 2017

aaronlehmann added process/cherry-picked and removed process/cherry-pick labels Mar 10, 2017

aaronlehmann mentioned this pull request Mar 10, 2017

[17.03.x] Vendor swarmkit f93948c moby/moby#31742

Merged

aaronlehmann deleted the remove-network-failed-task branch March 10, 2017 21:47

aaronlehmann mentioned this pull request Mar 11, 2017

bump to 17.03.1-rc1 moby/moby#31754

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

controlapi: Allow removal of network when only dead tasks reference it #2018

controlapi: Allow removal of network when only dead tasks reference it #2018

aaronlehmann commented Mar 8, 2017

codecov bot commented Mar 8, 2017

aboch commented Mar 8, 2017

aboch commented Mar 8, 2017

aaronlehmann commented Mar 8, 2017

aboch commented Mar 8, 2017

aboch commented Mar 8, 2017

aluzzardi commented Mar 9, 2017

aaronlehmann commented Mar 9, 2017

aluzzardi commented Mar 9, 2017

dongluochen commented Mar 10, 2017

controlapi: Allow removal of network when only dead tasks reference it #2018

controlapi: Allow removal of network when only dead tasks reference it #2018

Conversation

aaronlehmann commented Mar 8, 2017

codecov bot commented Mar 8, 2017

Codecov Report

aboch commented Mar 8, 2017

aboch commented Mar 8, 2017

aaronlehmann commented Mar 8, 2017

aboch commented Mar 8, 2017

aboch commented Mar 8, 2017

aluzzardi commented Mar 9, 2017

aaronlehmann commented Mar 9, 2017

aluzzardi commented Mar 9, 2017

dongluochen commented Mar 10, 2017