[Proposal] Security Contexts #3910
# Security Contexts

## Abstract

A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)):

1. Ensure a clear isolation between the container and the underlying host it runs on
2. Limit the ability of the container to negatively impact the infrastructure or other containers

## Background

The problem of securing containers in Kubernetes has come up [before](https://github.com/GoogleCloudPlatform/kubernetes/issues/398), and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features such as [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface.

## Motivation

### Container isolation

In order to improve container isolation from the host and from other containers running on the host, containers should be granted only the access they need to perform their work. To this end it should be possible to take advantage of Docker features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration) to the container process.

Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers.

### External integration with shared storage

In order to support external integration with shared storage, processes running in a Kubernetes cluster should be uniquely identifiable by their Unix UID, such that a chain of ownership can be established. Processes in pods will need consistent UID/GID/SELinux category labels in order to access shared disks.
> **Review comment:** Does this mean the in-namespace UID or the root-namespace UID?
>
> **Reply:** For disk, the root-namespace UID. However, the user to run as inside the namespace may be something a user wants to change. If user namespaces are not present, a mismatch between the two should perhaps cause the pod to be rejected.
## Constraints and Assumptions

* It is out of the scope of this document to prescribe a specific set of constraints to isolate containers from their host. Different use cases need different settings.
* The concept of a security context should not be tied to a particular security mechanism or platform (e.g. SELinux, AppArmor).
* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for [service accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297).

## Use Cases

In order of increasing complexity, the following are example use cases that would be addressed with security contexts:

1. Kubernetes is used to run a single cloud application. In order to protect the nodes from containers:
   * All containers run as a single non-root user
   * Privileged containers are disabled
   * All containers run with a particular MCS label
   * Kernel capabilities like `CHOWN` and `MKNOD` are removed from containers

2. Just like case #1, except that more than one application runs on the Kubernetes cluster.
   * Each application is run in its own namespace to avoid name collisions
   * For each application a different uid and MCS label is used

3. Kubernetes is used as the base for a PaaS with multiple projects, each project represented by a namespace.
   * Each namespace is associated with a range of uids/gids on the node that are mapped to uids/gids in containers using Linux user namespaces.
   * Certain pods in each namespace have special privileges to perform system actions such as talking back to the server for deployment, running docker builds, etc.
   * External NFS storage is assigned to each namespace and permissions are set using the range of uids/gids assigned to that namespace.
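Use case 3's per-namespace uid-range allocation can be sketched as follows. This is only an illustration: the helper name, base uid, and block size are invented here, not part of the proposal.

```go
package main

import "fmt"

// hostUIDRange returns the host uid range assigned to the n-th namespace,
// assuming each namespace receives a fixed-size contiguous block starting at
// base (a convention similar to /etc/subuid). Permissions on that namespace's
// NFS export would then be set from this range.
func hostUIDRange(n, base, size int) (lo, hi int) {
	lo = base + n*size
	return lo, lo + size - 1
}

func main() {
	lo, hi := hostUIDRange(2, 100000, 65536)
	fmt.Printf("namespace 2 owns host uids %d-%d\n", lo, hi)
}
```

Because the blocks are disjoint, a file's owning uid on shared storage identifies exactly one namespace, which is the "chain of ownership" the motivation section asks for.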
## Proposed Design

### Overview

A *security context* consists of a set of constraints that determine how a container is secured before getting created and run. It has a 1:1 correspondence to a [service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). A *security context provider* is passed to the Kubelet so it can have a chance to mutate Docker API calls in order to apply the security context.

It is recommended that this design be implemented in two phases:

1. Implement the security context provider extension point in the Kubelet so that a default security context can be applied on container run and creation.
2. Implement a security context structure that is part of a service account. The default context provider can then be used to apply a security context based on the service account associated with the pod.
> **Review comment:** If pods on different nodes are accessing shared storage, their UIDs need to be unique across nodes. So their uids need to be allocated either statically to nodes at node join time, or dynamically to pods at bind time, by some cluster-level component. Thoughts?
>
> **Reply:** Correct. A reasonable default would be that each security context on the master (a service account that is the "default" for the namespace?) gets a UID allocated to it that no other security context gets. An administrator could later assign complementary UIDs across namespaces if needed. In the future, there could be additional security contexts that grant access to shared resources.
>
> **Reply:** Okay. So this could be done manually, by a namespace-creation helper client, or perhaps by a control loop. SGTM.
### Security Context Provider

The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container:
```go
type SecurityContextProvider interface {
    // ModifyContainerConfig is called before the Docker createContainer call.
    // The security context provider can make changes to the Config with which
    // the container is created.
    // An error is returned if it is not possible to secure the container as
    // requested with a security context.
    ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error

    // ModifyHostConfig is called before the Docker runContainer call.
    // The security context provider can make changes to the HostConfig, affecting
    // security options, whether the container is privileged, volume binds, etc.
    // An error is returned if it is not possible to secure the container as
    // requested with a security context.
    ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig) error
}
```

> **Review comment:** Would it work to have the `SecurityContextProvider` just modify the `api.BoundPod`, and not take a `docker.Config` as an argument?
>
> **Reply:** The options passed to Docker may be complex: setting up user namespaces, labels, and default behavior. However, a two-step abstraction, where the Kubelet's docker interface supports the additional options on `BoundPod`s and those options are then added, seems reasonable. Some security context work might be a finalizer at the master level; applying final defaults on the Kubelet and cluster-level isolation on the master is similar to other finalizer-style patterns.
>
> **Reply:** One consideration: you may need to know, from the image, what user the image is going to run as, and like `ENTRYPOINT` it is frustrating for an end user to have to specify that up front in the pod definition. Some level of "map user X inside the container to user Y outside" happening by default seemed potentially valuable; the two-step process (set up the security context, then pass it to the docker interface) could also handle that.
>
> **Reply:** There exist both "scratch" container images (single process, single uid, not sensitive to the choice of uid) and "traditional" images (which have many entries in their password file). Should the default security context support the rich base-image style? If so, how? You need a range of UIDs, and you don't know how many until you examine the contents of the image. On the other hand, should we make it easy to run the scratch style, and encourage it?
>
> **Reply:** If we want to express only intent in the pod definition, then we cannot just mutate it to apply the security context; at some point the intent needs to become implementation. The security context provider, which knows how to implement the pod's intent, needs to make specific changes to the actual docker calls.
>
> **Reply:** If the idea is to support other container formats in the future, we would need a security context provider associated with the underlying container runtime; but for container runtime == docker, you ultimately need a `docker.Config`.
>
> **Reply:** Good point; I withdraw my comment about `ModifyContainerConfig`. The code can be changed easily later to support runtimes other than docker. The important thing is to get the API right.
If the value of the `SecurityContextProvider` field on the Kubelet is nil, the Kubelet will create and run the container as it does today.
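A minimal sketch of a provider satisfying this interface is shown below. The `Config`/`HostConfig` structs here are stand-ins carrying only the fields the sketch touches (the real types live in the docker client library), and the uid policy is invented for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// Stand-ins for the docker client types used by the real interface.
type Config struct{ User string }
type HostConfig struct {
	Privileged bool
	CapDrop    []string
}

// simpleProvider is a hypothetical provider that enforces a fixed non-root
// uid and strips a configured list of kernel capabilities.
type simpleProvider struct {
	uid     int
	capDrop []string
}

func (p simpleProvider) ModifyContainerConfig(config *Config) error {
	if p.uid == 0 {
		return fmt.Errorf("refusing to run container as root")
	}
	config.User = strconv.Itoa(p.uid) // run the container process as the assigned uid
	return nil
}

func (p simpleProvider) ModifyHostConfig(hostConfig *HostConfig) error {
	hostConfig.Privileged = false // never allow privileged mode by default
	hostConfig.CapDrop = append(hostConfig.CapDrop, p.capDrop...)
	return nil
}

func main() {
	p := simpleProvider{uid: 10001, capDrop: []string{"CHOWN", "MKNOD"}}
	cfg, host := &Config{}, &HostConfig{Privileged: true}
	if err := p.ModifyContainerConfig(cfg); err != nil {
		panic(err)
	}
	if err := p.ModifyHostConfig(host); err != nil {
		panic(err)
	}
	fmt.Println(cfg.User, host.Privileged, host.CapDrop)
}
```

Note how the provider can reject a container outright (the root-uid check) rather than silently weakening the context, which is the error path the interface comments describe.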
### Security Context

A security context has a 1:1 correspondence to a service account, and it can be included as part of the service account resource. The following is an example of an initial implementation:
```go
// SecurityContext specifies the security constraints associated with a service account
type SecurityContext struct {
    // User is the uid to use when running the container
    User int

    // AllowPrivileged indicates whether this context allows privileged mode containers
    AllowPrivileged bool

    // AllowedVolumeTypes lists the types of volumes that a container can bind
    AllowedVolumeTypes []string

    // AddCapabilities is the list of Linux kernel capabilities to add
    AddCapabilities []string

    // RemoveCapabilities is the list of Linux kernel capabilities to remove
    RemoveCapabilities []string

    // Isolation specifies the type of isolation required for containers
    // in this security context
    Isolation ContainerIsolationSpec
}

// ContainerIsolationSpec indicates intent for container isolation
type ContainerIsolationSpec struct {
    // Type is the container isolation type (None, Private)
    Type ContainerIsolationType

    // FUTURE: IDMapping specifies how users and groups from the host will be mapped
    IDMapping *IDMapping
}

// ContainerIsolationType is the type of container isolation for a security context
type ContainerIsolationType string

const (
    // ContainerIsolationNone means that no additional constraints are added to
    // containers to isolate them from their host
    ContainerIsolationNone ContainerIsolationType = "None"

    // ContainerIsolationPrivate means that containers are isolated in process
    // and storage from their host and other containers
    ContainerIsolationPrivate ContainerIsolationType = "Private"
)

// IDMapping specifies the requested user and group mappings for containers
// associated with a specific security context
type IDMapping struct {
    // SharedUsers is the set of user ranges that must be unique to the entire cluster
    SharedUsers []IDMappingRange

    // SharedGroups is the set of group ranges that must be unique to the entire cluster
    SharedGroups []IDMappingRange

    // PrivateUsers are mapped to users on the host node, but are not necessarily
    // unique to the entire cluster
    PrivateUsers []IDMappingRange

    // PrivateGroups are mapped to groups on the host node, but are not necessarily
    // unique to the entire cluster
    PrivateGroups []IDMappingRange
}

// IDMappingRange specifies a mapping between container IDs and node IDs
type IDMappingRange struct {
    // ContainerID is the starting container ID
    ContainerID int

    // HostID is the starting host ID
    HostID int

    // Length is the length of the ID range
    Length int
}
```

> **Review comment (on `User`):** Containers can have multiple processes running as multiple uids. This may not be the recommended style of container, but there are many such images out there. One option is to declare that we don't support this style of container, but that excludes a lot of existing images. Another is to set only the lead process to this uid, while allowing other processes to have other virtual uids that map back to useless physical uids. Another, which may not work at all, is to map multiple virtual uids to one physical uid ("virtual" meaning in-namespace, "physical" meaning in the root Linux namespace). Another is a per-volume strategy for virtual-to-physical mapping. The higher-level question: if there are two uids in a container, should their filesystem writes, in the canonical view of the filesystem, appear as one uid/gid or two? "One" is simpler for users but harder for implementers.
>
> **Reply:** This could also be framed as old-school images (pre-pods) versus new-school images (one process/user per container); we can probably get pretty far on the latter for Kubernetes users. Mapping multiple virtual uids to one physical uid probably does not work. In Docker upstream there was a long discussion about this: we think ranges can be allocated and made to work, but the ranges have to be large. If people predeclared how many uids they need, a set could perhaps be allocated per namespace (10k per container was mooted before). Another idea is two ranges, shared and unshared: shared is allocated by the master and is cluster-wide; unshared is node-scoped, and each started container gets a set; both ranges can then be passed into the container. In practice, perhaps 60-80% of the containers people should run will be single-uid, so we can make single uid work well and have multi-uid be not quite as nice.
>
> **Reply:** Agreed on "make single uid work well, and have multi-uid be not quite as nice". Hopefully "not quite as nice" still means existing dockerized workloads can move onto Kubernetes with minimal pod spec writing; that is the first priority, with locking down the cluster second.

> **Review comment:** We would want to use all the mechanisms mentioned above (capabilities, MCS labels, AppArmor profiles) if available, and the initial implementation should use them, since Red Hat has so much expertise with them. At the same time, this is very much tied to a specific implementation, which makes the detail harder for users to digest and makes it harder to drop in alternative implementations later (for example, a company already using grsecurity and PaX, or a hosting provider with its own implementation of similar effect). Can this be divided into two layers: a core API object that expresses intent, and another layer that implements the intent? For example, if the intent is "this container should not have the same identity as any other container, both for node-local resources and for shared (storage) resources", the system could automatically derive the User, SELinux.Level, SELinux.Type, and AppArmor.Profile settings. The open question is whether most other intents can be expressed at a similarly abstract level.
>
> **Reply:** At a minimum, anything that is not common to all Linuxes should be an extension plugin (or a default extension); no disagreement there. Volumes are very low level ("give me exactly this thing"), and security context as modeled here is similar: a finalizer turns a generic intent (perhaps expressed by the admin or the namespace) into a specific context on the pod ("this pod should run as this UID"). The suggestion above is the opposite: the user states "I want this kind of security context", and something else then finalizes and specializes it.
>
> **Reply:** Developers, project admins, and cluster admins have distinct responsibilities. Are those reasonable separations of responsibility for a Kubernetes cluster, and does the current design allow those three groups to work independently of each other and to focus on the information they need?
>
> **Reply:** Yes; those distinctions should be spelled out in this proposal and in the service accounts proposal. One added scenario we think about: things running with higher trust (e.g. builds that can push images to a docker repository) must not be open to abuse by ordinary developers. This proposal assumes the higher-level pieces (service accounts, secrets) exist without describing them: service accounts are a concept for the developer end of the spectrum, while security contexts are much more about the admin end. Cluster and project admins must grant developers capabilities, developers must understand how they use those capabilities, and higher-level developer concepts get boiled down into security contexts and execution details. The outcome of these proposals and prototypes should at minimum include a document describing how the pieces provide that whole spectrum.
>
> **Reply:** We'll try for Monday, and then we can collaboratively edit. Agreed that the overarching story is a gap: we are designing the bits without articulating how they flow from a central point.
>
> **Reply:** Having read https://github.com/openshift/origin/blob/master/docs/proposals/policy.md, it looks well thought out; it shows how cluster admin and project admin responsibilities are separated using the master namespace versus other namespaces.
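To make the `IDMappingRange` semantics concrete, here is a sketch of resolving a container uid to a host uid through a list of ranges, in the same spirit as the kernel's `/proc/<pid>/uid_map` lookup. The helper name is invented; only the struct mirrors the proposal.

```go
package main

import "fmt"

// IDMappingRange mirrors the struct in the proposal.
type IDMappingRange struct {
	ContainerID int // starting container ID
	HostID      int // starting host ID
	Length      int // length of the ID range
}

// resolveHostID finds the host uid backing a container uid, or reports that
// the uid is unmapped (the kernel would surface unmapped ids as the overflow
// uid inside the namespace).
func resolveHostID(ranges []IDMappingRange, containerID int) (int, bool) {
	for _, r := range ranges {
		if containerID >= r.ContainerID && containerID < r.ContainerID+r.Length {
			return r.HostID + (containerID - r.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	private := []IDMappingRange{{ContainerID: 0, HostID: 100000, Length: 65536}}
	host, ok := resolveHostID(private, 33)
	fmt.Println(host, ok)
}
```

The `Shared` versus `Private` split in `IDMapping` then only changes who allocates the `HostID` blocks (the cluster master versus the node), not how lookups work.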
#### Security Context Lifecycle

The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assigning a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use.
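The allocate-then-mark-ready pattern above can be sketched as a toy control flow. All names here are invented, and the struct is a stand-in for whatever status the real service account resource would carry.

```go
package main

import "fmt"

// ContextStatus is a toy stand-in for the status a security context /
// service account would carry: it is not Ready until resources are reserved.
type ContextStatus struct {
	UIDBase, UIDLen int
	Ready           bool
}

// rangeAllocator hands out disjoint blocks of host uids, playing the role of
// the cluster-level finalizer in the lifecycle described above.
type rangeAllocator struct{ next, size int }

func (a *rangeAllocator) finalize(sc *ContextStatus) {
	sc.UIDBase, sc.UIDLen = a.next, a.size // reserve a fresh block
	a.next += a.size
	sc.Ready = true // only now may pods use this service account
}

func main() {
	alloc := &rangeAllocator{next: 100000, size: 65536}
	scA, scB := &ContextStatus{}, &ContextStatus{}
	alloc.finalize(scA) // default context for the first namespace
	alloc.finalize(scB) // second namespace gets a disjoint block
	fmt.Println(scA.UIDBase, scB.UIDBase, scB.Ready)
}
```

Pods bound to a not-yet-Ready context would be held back, which is what "declaring the namespace ready for use" amounts to.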
> **Review comment:** What types of volumes would the MCS labels be used with? Presumably there aren't files in an emptyDir that are sensitive to the container process. Is this for files in a hostDir, or some other type of volume?
>
> **Reply:** Everything: the container would be relabeled, the process would have those labels, and any volumes would either be labeled or potentially left as is (in a few cases maybe that is reasonable). The common case, though, is "you get these labels". We have all but the volume support upstream, and we carry the relabeling support in RHEL docker.
>
> **Reply:** Okay, reading further I see you are talking about NFS and the like.
>
> **Reply:** Yeah. Actually, arbitrary relabeling is bad (I shouldn't be able to relabel existing content because I tricked the master); it would be better to relabel only new content.
>
> **Reply:** We should also define a Kubelet default security context, i.e. if nothing is specified, this is the context. The Kubelet can auto-assign uids locally for user namespaces and do similarly for labels. At least some defense in depth.