Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controlapi: Allow removal of network when only dead tasks reference it #2018

Merged
merged 1 commit into from
Mar 9, 2017

Conversation

aaronlehmann
Copy link
Collaborator

RemoveNetwork currently won't allow a network to be removed if any task in the store references it. This is a change relative to Docker 1.12, where only services were checked.

This might be a little heavy handed. Some tasks in the store may be long dead. Resource attachment tasks, in particular, must be detached by the API user, and this might be neglected after a failure.

Change RemoveNetwork to only fail the network deletion if a task has actual and desired state of "running" or below. This means that neither a failed resource attachment task (whose desired state won't be adjusted) or a down node (where actual state won't reflect desired state) can block the network deletion.

There may be some corner cases where deleting the network while a task is in the process of shutting down prevents the allocator from deallocating that task, but on balance this seems better than allowing many situations to block network deletion indefinitely.

cc @alexmavr @dongluochen @aboch

RemoveNetwork currently won't allow a network to be removed if any
task in the store references it. This is a change relative to Docker
1.12, where only services were checked.

This might be a little heavy handed. Some tasks in the store may be long
dead. Resource attachment tasks, in particular, must be detached by the
API user, and this might be neglected after a failure.

Change RemoveNetwork to only fail the network deletion if a task has
actual and desired state of "running" or below. This means that neither
a failed resource attachment task (whose desired state won't be
adjusted) or a down node (where actual state won't reflect desired
state) can block the network deletion.

There may be some corner cases where deleting the network while a task
is in the process of shutting down prevents the allocator from
deallocating that task, but on balance this seems better than allowing
many situations to block network deletion indefinitely.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@codecov
Copy link

codecov bot commented Mar 8, 2017

Codecov Report

Merging #2018 into master will increase coverage by 0.07%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #2018      +/-   ##
==========================================
+ Coverage   53.66%   53.73%   +0.07%     
==========================================
  Files         109      109              
  Lines       18991    18978      -13     
==========================================
+ Hits        10191    10198       +7     
+ Misses       7578     7549      -29     
- Partials     1222     1231       +9

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e4762bc...b00598c. Read the comment docs.

@aboch
Copy link

aboch commented Mar 8, 2017

Looks good to me

@aboch
Copy link

aboch commented Mar 8, 2017

Just to make sure,
if t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
the agent has reported the state therefore there is no way the corresponding container is still running. Is that correct ?
Just want to make sure we do not allow a swarm level network remove while the effective network remove in libnetwork would fail.

@aaronlehmann
Copy link
Collaborator Author

Just to make sure,
if t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
the agent has reported the state therefore there is no way the corresponding container is still running. Is that correct ?

This logic allows the network removal if either:

  • The container has stopped running (t.Status.State), OR
  • The manager wants the container to stop running (t.DesiredState)

I did it this way to lean towards being permissive about deletions. If I only relied on t.Status.State, a disconnected or unresponsive node might block the network from being removed, since we would not have received confirmation that the container has actually shut down. If we only relied on t.DesiredState, network attachment tasks that have stopped running would block networks from being removed, since their desired state is never updated (there is no orchestrator for them).

If the possibility of removing a network that's being used by a container which is about to shut down is a problem, we can revisit this. My thinking was that allowing some corner cases like this would be better than the alternative of cases where it's impossible to remove the network, say because it's blocked by a task on an unresponsive node.

@aboch
Copy link

aboch commented Mar 8, 2017

My thinking was that allowing some corner cases like this would be better than the alternative of cases where it's impossible to remove the network

Agree, seems reasonable. Also, considering that proving the above theory about a corner case may be harder than actually work to a fix on a different part of the stack, should the issue presents.

@aboch
Copy link

aboch commented Mar 8, 2017

Still looks good to me

@aluzzardi
Copy link
Member

And we assume a race where the a task is being rescheduled is not possible since first we're checking for services, correct?

@aaronlehmann
Copy link
Collaborator Author

Yeah.

@aluzzardi
Copy link
Member

LGTM

@dongluochen
Copy link
Contributor

I agree with this change. However I think we need to treat "node down" as a common case, not a corner case like this. We should have a way to cleanly evict tasks from a node and handle state properly in manager. Other components don't need to make assumption on it.

@aaronlehmann aaronlehmann deleted the remove-network-failed-task branch March 10, 2017 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants