diff --git a/enhancements/network/ovn-observability-api.md b/enhancements/network/ovn-observability-api.md new file mode 100644 index 0000000000..0c14feebe9 --- /dev/null +++ b/enhancements/network/ovn-observability-api.md @@ -0,0 +1,262 @@ +--- +title: OVN Observability API +authors: + - npinaeva +reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect" + - dceara + - msherif1234 + - jotak +approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval. + - trozet +api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" + - TBD +creation-date: 2024-10-02 +last-updated: 2024-10-02 +tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement + - https://issues.redhat.com/browse/SDN-5226 +--- + +# OVN Observability API + +## Summary + +OVN Observability is a feature that provides visibility for ovn-kubernetes-managed network by using sampling mechanism. +That means, that network packets generate samples with user-friendly description to give more information about what and why is happening in the network. +In openshift, this feature is integrated with the Network Observability Operator. + +## Motivation + +The purpose of this enhancement is to define an API to configure OVN observability in an openshift cluster. + +### User Stories + +- Observability: As a cluster admin, I want to enable OVN Observability integration with netobserv so that I can use network policy correlation feature. +- Debuggability: As a cluster admin, I want to debug a networking issue using OVN Observability feature. + +### Goals + +- Design a CRD to configure OVN Observability for given user stories + +### Non-Goals + +- Discuss OVN Observability feature in general + +## Proposal + +There is a number of k8s features that can be used to configure OVN Observability right now, and some more that we expect +to be added in the future. +Currently supported features are: +- NetworkPolicy +- (Baseline)AdminNetworkPolicy +- EgressFirewall +- UDN Isolation +- Multicast ACLs + +In the future more observability features, like egressIP and services, but also networking-specific features +like logical port ingress/egress will be added. + +Every feature may set a sampling probability, that defines how often a packet will be sampled, in percent. 100% means every packet +will be sampled, 0% means no packets will be sampled. + +Currently only on/off mode is supported, where on mode turns all available features with 100% probability. + +### Observability use case + +For observability use case, a cluster admin can set a sampling probability for each feature across the whole cluster. +As sampling configuration may affect cluster performance, not every admin may want all features enabled with 100% probability. +We expect this type of configuration to change rarely. + +This configuration should be handled by ovn-kubernetes to configure sampling, and by netobserv ebpf agent to "subscribe" to +the samples enabled by this configuration. + +### Debuggability use case + +Debuggability is expected to be turned on during debugging sessions, and turned off after the session is finished. +It will most likely have 100% probability for required features, and also may enable more networking-related features, like +logical port ingress/egress for cluster admins. Debugging configuration is expected to have more granular filters to +debug specific problems with minimal overhead. For example, per-node or per-namespace filtering may be required. + +Debuggability sampling may be used with or without Network Observability Operator being installed: +- without netobserv, samples may be printed or forwarded to a file by ovnkube-observ binary, which is a part of ovnkube-node pod on every node. +- with neobserv, samples may be displayed in a similar to observability use case way, but may need another interface to ensure +cluster users won't be distracted by samples generated for debugging. + +This configuration should be handled by ovn-kubernetes to configure sampling, and by netobserv ebpf agent OR ovnkube-observ script +to "subscribe" to the samples enabled by this configuration. + +### Workflow Description + +1. Cluster admin wants to enable OVN Observability for admin-managed features. +2. Cluster admin created a CR with 100% probabilities for (Baseline)AdminNetworkPolicy and EgressFirewall. +3. Messages like "Allowed by admin network policy test, direction Egress" appear on netobserv UI. +4. Cluster admin sees CPU usage by ovnkube pods grows up 10%, and reduces the sampling probability to 50%. +5. Still most of the flows have correct enrichment with lower overhead. +6. Admin is happy. Until... +7. A user from project "bigbank" reports that they can't access the internet from pod A. +8. Cluster admin enables debugging for project "bigbank" with 100% probability for NetworkPolicies. +9. netobserv UI or ovnkube-oberv script shows a sample with <`source pod A`, `dst_IP`> saying "Denied by network policy in namespace bigbank, direction Egress" +10. Admin now has to convince a user that they are missing an allow egress rule to `dst_IP` for pod A. +11. Network policy is fixed, debugging can be turned off. + + +### API Extensions + +The CRD for this feature will be a part of ovn-kubernetes to allow using it upstream. + +```go + +// ObservabilityConfig described OVN Observability configuration. +// +// +genclient +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// +kubebuilder:resource:path=observabilityconfig,scope=cluster +// +kubebuilder:singular=observabilityconfig +// +kubebuilder:object:root=true +// +kubebuilder:subresource:status +type ObservabilityConfig struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + // +kubebuilder:validation:Required + // +required + Spec ObservabilitySpec `json:"spec"` + // +optional + Status ObservabilityStatus `json:"status,omitempty"` +} + +type ObservabilitySpec struct { + // CollectorID is the ID of the collector that should be unique across the cluster, and will be used to receive configured samples. + // +kubebuilder:validation:Required + // +required + // +kubebuilder:validation:Minimum=1 + CollectorID int32 `json:"collectorID"` + // Features is a list of Observability features that can generate samples and their probabilities for a given collector. + // +kubebuilder:validation:Required + // +required + Features []FeatureConfig `json:"features"` +} + +type FeatureConfig struct { + // Probability is the probability of the feature being sampled in percent. + // +kubebuilder:validation:Required + // +required + // +kubebuilder:validation:Minimum=0 + // +kubebuilder:validation:Maximum=100 + Probability int32 + // Feature is the Observability feature that should be sampled. + // +kubebuilder:validation:Required + // +required + Feature ObservabilityFeature +} + +// ObservabilityStatus contains the observed status of the ObservabilityConfig. +type ObservabilityStatus struct { + // +listType=map + // +listMapKey=type + Conditions []metav1.Condition `json:"conditions,omitempty"` +} + +// +kubebuilder:validation:Enum=NetworkPolicy;AdminNetworkPolicy;EgressFirewall;UDNIsolation;MulticastIsolation +type ObservabilityFeature string + +const ( + NetworkPolicy ObservabilityFeature = "NetworkPolicy" + AdminNetworkPolicy ObservabilityFeature = "AdminNetworkPolicy" + EgressFirewall ObservabilityFeature = "EgressFirewall" + UDNIsolation ObservabilityFeature = "UDNIsolation" + MulticastIsolation ObservabilityFeature = "MulticastIsolation" +) +``` + +For now, we only have use cases for 2 different `ObservabilityConfig`s in a cluster, and ovn-kubernetes will likely have a +limit for the number of CRs that will be handled. + +For the integration with netobserv, we could use a label on `ObservabilityConfig` to signal that configured samples should be +reflected via netobserv. For example, `network.openshift.io/observability: "true"`. + +#### Observability filtering for debugging + +Work In Progress, will be included to the proposed CRD after consensus is reached. + +For debugging, we may need more granular filtering, like per-node or per-namespace filtering. + +- Per-node filtering may be implemented by adding a node selector to `ObservabilityConfig`, and is easy to implement in ovn-k +as every node has its own ovn-k instance with OVN-IC. +- Per-namespace filtering can only be applied to the namespaced Observability Features, like NetworkPolicy and EgressFirewall. +Cluster-scoped features, like (Baseline)AdminNetworkPolicy, can only be additionally filtered based on name. Label selectors +can't be used, as observability module doesn't watch any objects. + + +### Topology Considerations + +#### Hypershift / Hosted Control Planes + +As long as netobserv can be installed on the management or hosted cluster, integration should work fine. +If we need to make it work for shared-cluster netobserv deployment, more work.testing will be needed. +Netobserv team confirmed that separate installations for management and hosted clusters are supported as a result of https://issues.redhat.com/browse/NETOBSERV-1196. + +For debuggability mode without netobserv integration, everything should work separately on every cluster with its own ovn-kubernetes instance. + +#### Standalone Clusters +#### Single-node Deployments or MicroShift + +SNO should be supported with netobserv, Microshift is not. +Microshift most likely will not be supported even without netobserv, as observability requires new features to be enabled +and adds overhead to the cluster. + +### Implementation Details/Notes/Constraints + +Filtering for debugging use case may be tricky as explained in "Observability filtering for debugging" section. +But it is important to implement, as OVN implementation works in a way, where performance optimizations can only be +applied when there is only 1 `ObservabilityConfig` for a given feature. That means, adding a second `ObservabilityConfig` +for a feature will have a bigger performance impact than creating a new one. + +### Risks and Mitigations + +The biggest risk of enabling observability is performance overhead. +This will be mitigated by scale-testing and sampling percentage configuration. + +### Drawbacks + +- e2e testing upstream is not possible until github runners allow using kernel 6.11 + +## Test Plan + +- e2e downstream +- perf/scale testing for dataplane and control plane as a GA plan +- netobserv testing for correct GUI representation + +## Graduation Criteria + +### Dev Preview -> Tech Preview + +None + +### Tech Preview -> GA + +- More testing (upgrade, downgrade) +- Available by default, but no overhead until observability config CR is created +- Run perf/scale tests and document related metrics to track perf overhead, write down any limitations/recommendations +based on the workload and cluster size +- User facing documentation created in openshift-docs and netobserv docs + +### Removing a deprecated feature + +## Upgrade / Downgrade Strategy + +No upgrade impact is expected, as this feature will only be enabled when a new CR is created. + +## Version Skew Strategy + +All component were designed to be backwards-compatible +- old kernel + new OVS: TDB +- old OVS + new OVN: TDB +- old OVN + new ovn-k: rolled out at the same time, no version skew is expected +- old kernel/OVS/ovn-kubernetes + new netobserv: TDB + +## Operational Aspects of API Extensions + +Perf/scale considerations will be documented as a part of GA plan. + +## Support Procedures + +## Alternatives