Add diagram, workflow and controller section details

celery · Aug 18, 2020 · cc1b9a5
1 parent a819280
commit cc1b9a5
Showing 1 changed file with 48 additions and 14 deletions.
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -1,29 +1,44 @@
-## Celery Kubernetes Operator - Architecture Document
+## Celery Kubernetes Operator - High Level Architecture
 
 ### Overview
 
 [Celery](https://docs.celeryproject.org/en/stable/) is a popular distributed task-queue system written in Python. Running Celery in production on Kubernetes involves several manual steps, such as:
 - Writing deployment spec for workers
 - Setting up monitoring using [Flower](https://flower.readthedocs.io/en/latest/)
-- Setting up autoscaling configuration
-
-Apart from that, there's no consistent way to setup multiple clusters, everyone configures there own way which could create problems for infrastructure teams to manage and audit later.
+- Setting up autoscaling
 
+Apart from that, there's no consistent way to set up multiple clusters. Everyone configures them their own way, which can create problems for infrastructure teams that have to manage and audit them later.
 This project attempts to solve (or automate) these issues. It aims to bridge the gap between application engineers and the infrastructure operators who manually manage Celery clusters.
 
+Moreover, since Celery is written in Python, we plan to use the open source [KOPF](https://github.com/zalando-incubator/kopf) (Kubernetes Operator Pythonic Framework) to write the custom controller implementation.
+
 ### Scope
 
 1. Provide a Custom Resource Definition (CRD) to spec out a Celery and Flower deployment, covering all the configuration options that they support.
 2. A custom controller implementation that registers and manages the self-healing capabilities of the custom Celery resource for these operations:
-- CREATE - Creates the worker and flower deployments along with exposing a native Service object for Flower
-- UPDATE - Reads the CRD modifications and updates the running deployments using specified strategy
-- DELETE - Deletes the custom resource and all the child deployments
+    + CREATE - Creates the worker and flower deployments along with exposing a native Service object for Flower
+    + UPDATE - Reads the CRD modifications and updates the running deployments using specified strategy
+    + DELETE - Deletes the custom resource and all the child deployments
 3. Support automatic scaling of workers, up and down, based on resource constraints (CPU, memory) and task queue length.
 
 Discussions involving other things that this operator should do based on your production use-case are welcome.
 
 ### Diagram
 
+![CKO Arch Diagram](https://i.imgur.com/dTBuG58.png)
+
+### Workflow
+
+The end user starts by writing a YAML spec for the desired Celery cluster and creating it on the Kubernetes cluster. The creation event is picked up by the Creation Handler (KOPF based), which creates the worker and Flower deployments and a Service object to expose the Flower UI to external users.
+
+Assuming a broker is already in place, any user-facing application can now start pushing messages to the broker, and the Celery workers will start processing them.
+
+The user can update the custom resource. When that happens, the updation handler listening for the event patches the relevant deployments with the change. The rollout strategy can be left at its default or specified by the user in the spec.
+
+Both the creation and updation handlers return their statuses, which are stored in the parent resource's status field. The status field contains the latest status of the cluster's children at all times.
+
+The user can choose to set up autoscaling of workers based on resource constraints (CPU, memory) or broker queue length. The operator will automatically take care of creating an HPA or using KEDA based autoscaling (see the [Autoscaling](#Autoscaling) section below) to make that happen.
+
 ### Components
 
 #### Worker Deployment
@@ -51,6 +66,7 @@ We plan to have following objects in place with their high level description -
 - `workerSpec` - worker deployment specific parameters
     + `numOfWorkers` - Number of workers to launch initially
     + `args` - array of arguments (all Celery supported options) to pass to the worker process in the container (TODO: Entrypoint vs args vs individual params)
+    + `rolloutStrategy` - Rollout strategy to spawn new worker pods
     + `resources` - optional argument to specify cpu, mem constraints for worker deployment
 - `flowerSpec` - flower deployment and service specific parameters
     + `replicas` - Number of replicas for flower deployment
@@ -77,18 +93,36 @@ Custom Resource Object for a Celery application. Multiple clusters will have mul
 [Custom controller](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#custom-controllers) implementation to manage Celery applications (CRs). Contains the code for the creation, updation, deletion and scaling handlers of the cluster.
 
 
-### Controller Handlers(Controller Implementation Details)
+### Async KOPF Handlers (Controller Implementation)
+This section gives a brief overview of the creation and updation handlers, which react to creation and updates of the Celery resource respectively and return their status to be stored back as the resource's status.
 
 #### Creation Handler
+Generates the deployment specs for the worker and Flower deployments dynamically, based on the parameters specified in the custom Celery resource. Also creates the Flower service to expose the Flower UI. The status of each child is sent back to be stored under the parent resource's status field.
+
+Additionally, it might also handle creation of the HPA object if scaling is to be done on native metrics (CPU and memory utilization).
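
Reviewer sketch (not part of this commit): the manifest generation described for the creation handler could look roughly like the plain-Python helpers below. This is a minimal sketch under assumptions: the helper names, the `image` parameter, the label scheme, and the container name `celery-worker` are all hypothetical, and only `numOfWorkers`, `args`, and `resources` come from the `workerSpec` fields listed in this document. Port 5555 is Flower's default.

```python
from typing import Any, Dict


def build_worker_deployment(name: str, image: str, worker_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Build an apps/v1 Deployment manifest for the Celery workers as a plain dict.

    `name` and `image` are hypothetical inputs; `worker_spec` mirrors the
    `workerSpec` block of the custom resource.
    """
    labels = {"app": name, "component": "worker"}
    container: Dict[str, Any] = {
        "name": "celery-worker",  # assumed container name
        "image": image,
        "args": list(worker_spec.get("args", [])),
    }
    # `resources` is optional in the spec, so only copy it when present.
    if "resources" in worker_spec:
        container["resources"] = worker_spec["resources"]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-worker", "labels": labels},
        "spec": {
            "replicas": worker_spec.get("numOfWorkers", 1),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def build_flower_service(name: str, port: int = 5555) -> Dict[str, Any]:
    """Build a v1 Service manifest that exposes the Flower UI."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{name}-flower"},
        "spec": {
            # Must match the labels put on the Flower pods (assumed scheme).
            "selector": {"app": name, "component": "flower"},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    spec = {"numOfWorkers": 3, "args": ["-A", "proj", "worker"]}
    print(build_worker_deployment("demo", "example/app:latest", spec)["metadata"]["name"])
    print(build_flower_service("demo")["spec"]["ports"])
```

In a KOPF based operator, a function registered with `@kopf.on.create` would presumably call helpers like these, submit the manifests through a Kubernetes client, and return a small dict describing the children; KOPF stores a handler's return value under the resource's status.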
 
 #### Updation Handler
+Updates the deployment specs for the worker and Flower deployments (and the service and HPA) dynamically and patches them. The status of each child is sent back to be stored under the parent resource's status field.
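
Reviewer sketch (not part of this commit): one way the updation handler might turn a changed `workerSpec` into a patch body for the worker Deployment. The function name and the container name `celery-worker` are hypothetical, and the sketch assumes `rolloutStrategy` holds a Deployment strategy type string (`RollingUpdate` or `Recreate`), which the document does not specify.

```python
from typing import Any, Dict


def build_worker_patch(worker_spec: Dict[str, Any], container_name: str = "celery-worker") -> Dict[str, Any]:
    """Translate workerSpec fields into a partial Deployment body for patching.

    Only fields present in `worker_spec` end up in the patch, so untouched
    settings on the running Deployment are left alone.
    """
    patch_spec: Dict[str, Any] = {}

    if "numOfWorkers" in worker_spec:
        patch_spec["replicas"] = worker_spec["numOfWorkers"]

    # Assumption: rolloutStrategy is a plain strategy type string.
    if "rolloutStrategy" in worker_spec:
        patch_spec["strategy"] = {"type": worker_spec["rolloutStrategy"]}

    # Containers are matched by name in a strategic merge patch,
    # so the name must always be included.
    container: Dict[str, Any] = {"name": container_name}
    if "args" in worker_spec:
        container["args"] = list(worker_spec["args"])
    if "resources" in worker_spec:
        container["resources"] = worker_spec["resources"]
    if len(container) > 1:
        patch_spec["template"] = {"spec": {"containers": [container]}}

    return {"spec": patch_spec} if patch_spec else {}


if __name__ == "__main__":
    print(build_worker_patch({"numOfWorkers": 5, "rolloutStrategy": "RollingUpdate"}))
```

Returning an empty dict when nothing relevant changed lets the caller skip the API call entirely.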
 
-#### Scaling Handlers
+### Autoscaling
+This section covers how the operator is going to handle autoscaling. We plan to support scaling based on the following two metrics.
 
-### Workflow
+#### Native Metrics (CPU, Memory Utilization)
+If workers need to be scaled only on CPU/memory constraints, we can simply create an HPA object in the creation/updation handlers, and it will take care of scaling the relevant worker deployment automatically. HPA supports these two metrics out of the box. For custom metrics, we need to do additional work.
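
Reviewer sketch (not part of this commit): the HPA object mentioned above could be generated as a plain dict like this. The function name and parameters are hypothetical. The sketch uses the `autoscaling/v2` API, which is the stable version on current clusters; clusters from the time of this commit would have used `autoscaling/v2beta2` with the same field layout.

```python
from typing import Any, Dict


def build_cpu_hpa(deployment_name: str, min_replicas: int, max_replicas: int,
                  cpu_percent: int) -> Dict[str, Any]:
    """Build a HorizontalPodAutoscaler manifest that scales a Deployment on CPU utilization."""
    if min_replicas < 1 or max_replicas < min_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment_name}-hpa"},  # assumed naming scheme
        "spec": {
            # Points the HPA at the worker Deployment it should scale.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": cpu_percent},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(build_cpu_hpa("demo-worker", 2, 10, 75)["spec"]["metrics"])
```

A memory based variant would add a second entry to `metrics` with `"name": "memory"`.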
+
+#### Broker Queue Length (KEDA based autoscaling)
+Queue length based scaling needs a custom metrics server for an HPA to work. [KEDA](https://keda.sh/docs/1.5/concepts/) is a good fit because it is built for exactly this purpose. It provides [scalers](https://keda.sh/docs/1.5/scalers/) for all the popular brokers (RabbitMQ, Redis, Amazon SQS) supported in Celery.
+
+KEDA can be deployed on a Kubernetes cluster in multiple ways: Helm, Operator Hub and YAML. The Celery operator can package KEDA along with itself for distribution.
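
Reviewer sketch (not part of this commit): to make queue length based scaling concrete, the arithmetic an HPA applies to an average-value target is roughly `ceil(queue_length / target_per_worker)`, clamped to the configured bounds. The function below only illustrates that calculation; its name, parameters, and default bounds are hypothetical, and it is not KEDA's implementation.

```python
import math


def desired_workers(queue_length: int, target_per_worker: int,
                    min_workers: int = 1, max_workers: int = 10) -> int:
    """Return the worker count needed so each worker has at most
    `target_per_worker` queued messages, clamped to [min_workers, max_workers]."""
    if queue_length < 0:
        raise ValueError("queue_length must be non-negative")
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_length / target_per_worker)
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    # 95 queued messages at a target of 20 per worker needs 5 workers.
    print(desired_workers(95, 20))
```

With `min_workers=0`, an empty queue would scale the workers down to zero, which is a case plain CPU/memory based HPA does not cover.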
+
+### Deployment Strategy
+
+Probably the best way would be to distribute a Helm chart that packages the CRD, controller and KEDA together (more to be explored here). We'll also support YAML apply based deployments.
+
+Additionally, the Helm approach is extensible in the sense that we can also package additional components, such as a preferred broker (Redis, RMQ, SQS), to get started with Celery on Kubernetes out of the box without much effort.
 
 ### Want to Help?
-If you're running celery on a Kubernetes cluster, your inputs to how you manage applications will be valuable. You could contribute to the discussion on this issue - (TODO -- ISSUE)
+If you're running Celery on a Kubernetes cluster, your input on how you manage applications will be valuable. You could contribute to the discussion by creating a new issue on the repo.
 
 ### Motivation
 
@@ -98,8 +132,8 @@ Moreover, we wish to build this operator with Python. Kubernetes is written in g
 
 ### TODOs for Exploration
 - [ ] Helm chart to install the operator along with a broker of choice
-- [ ] Role based access control section for the operator
+- [ ] Add role based access control section for the operator
 - [ ] Ingress Resource
-- [ ] KEDA Autoscaling
+- [ ] KEDA Autoscaling Implementation
 - [ ] Create new issue thread to discuss Celery use-cases
-- [ ] What is not in scope of operator
+- [ ] What should not be in scope of the Celery operator?