diff --git a/docs/user-guide/aws-examples/eks.rst b/docs/user-guide/aws-examples/eks.rst
new file mode 100644
index 00000000..21b3ce49
--- /dev/null
+++ b/docs/user-guide/aws-examples/eks.rst
@@ -0,0 +1,229 @@

======================================
Running NeMo Curator on AWS EKS
======================================

--------------------------------------
Background
--------------------------------------
AWS EKS is a fully managed service that makes it easier to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane.

Running NeMo Curator on AWS EKS offers streamlined Kubernetes management integrated with AWS services such as CloudWatch for enhanced monitoring and logging, as well as native auto-scaling capabilities.

For more details, refer to the `EKS documentation `__.

This guide covers the essential prerequisites. It includes an example that demonstrates how to create an EFS storage class, provides step-by-step instructions for setting up an EFS Persistent Volume Claim that dynamically provisions a Kubernetes Persistent Volume, outlines the steps required to deploy a Dask cluster, and shows how to use the Kubernetes Python client library to submit NeMo Curator tasks to the Dask scheduler.


Prerequisites
----------------

* EKS Cluster:

  * `Dask Operator `__
  * If a self-managed node group is created with Ubuntu worker nodes, install the GPU Operator. The GPU Operator is highly recommended because it simplifies the deployment and management of NVIDIA GPU resources within Kubernetes clusters: it automates the installation of NVIDIA drivers, integrates with container runtimes such as containerd through the NVIDIA Container Toolkit, manages device plugins, and provides monitoring capabilities.
    `GPU Operator `__
  * If an EKS managed node group is created with Amazon Linux 2 worker nodes, install the NVIDIA device plugin instead. This approach has a limitation: the pre-installed NVIDIA GPU driver and NVIDIA container runtime versions lag NVIDIA's release schedule, and you assume the responsibility of upgrading the NVIDIA device plugin version yourself.
    `NVIDIA Device Plugin installation `__
    For more details, refer to `NVIDIA GPU Operator with Amazon EKS `__.

* Storage:

  * `EFS for EKS `__ (set up by the Kubernetes cluster admin)

Create a Storage Class for AWS EFS
----------------------------------

For example, a storage class for dynamic provisioning with the EFS CSI driver can be created as follows; the storage class name is a placeholder, and ``fileSystemId`` must be replaced with the ID of the EFS file system set up for the cluster.

.. code-block:: yaml

    cat <<EOF | kubectl apply -f -
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: efs-sc
    provisioner: efs.csi.aws.com
    parameters:
      provisioningMode: efs-ap
      fileSystemId: fs-xxxxxxxxxxxxxxxxx   # replace with your EFS file system ID
      directoryPerms: "700"
    EOF

Once the storage class, an EFS-backed Persistent Volume Claim, and a Dask cluster are in place (sketches of such manifests are shown at the end of this guide), NeMo Curator modules can be executed against the Dask scheduler in two ways.

1) To execute NeMo Curator modules in the scheduler pod from outside the EKS cluster, use the kubeconfig for the cluster. To check which kubeconfig file is currently in use:

.. code-block:: bash

    kubectl get pods -v=6 2>&1 | awk '/Config loaded from file:/{print $NF}'

``v6`` sets the verbose level to see the kubeconfig file in use.


2) To execute existing NeMo Curator modules in the scheduler pod from another pod within the EKS cluster, add the necessary permissions, such as ``pods/exec``, and spin up a client pod.

This approach allows the execution of NeMo Curator modules within the scheduler pod from a separate client pod. The separation ensures that the client pod can be provisioned with permissions tailored specifically for executing commands and accessing resources within the Kubernetes environment.

Moreover, deploying this client pod can be orchestrated by another service such as AWS Batch, facilitating scalable and efficient management of computational tasks within Kubernetes clusters.

The manifest below is a minimal sketch of such a client pod and its RBAC permissions; the service account name, role name, pod name, and image are placeholders.

.. code-block:: yaml

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: nemo-curator-client
      namespace: default
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: nemo-curator-client
    rules:
      # listing pods across all namespaces is needed to locate the Dask scheduler pod
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["get", "list"]
      # pods/exec is needed to run commands inside the scheduler pod
      - apiGroups: [""]
        resources: ["pods/exec"]
        verbs: ["create"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: nemo-curator-client
    subjects:
      - kind: ServiceAccount
        name: nemo-curator-client
        namespace: default
    roleRef:
      kind: ClusterRole
      name: nemo-curator-client
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: nemo-curator-client
      namespace: default
    spec:
      serviceAccountName: nemo-curator-client
      containers:
        - name: client
          image: python:3.10          # any image with the kubernetes Python client installed
          command: ["sleep", "infinity"]
    EOF
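The guide above also refers to creating an EFS-backed Persistent Volume Claim against this storage class. A minimal sketch of such a claim is shown below, assuming the ``efs-sc`` storage class from earlier; the claim name and requested size are illustrative (EFS is elastic, but the ``storage`` request is still required by the Kubernetes API).

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nemo-workspace            # illustrative name
    spec:
      accessModes:
        - ReadWriteMany               # EFS volumes can be mounted by many pods at once
      storageClassName: efs-sc
      resources:
        requests:
          storage: 150Gi              # required by the API; EFS does not enforce a fixed size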
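Deploying the Dask cluster with the Dask Operator then typically means applying a ``DaskCluster`` resource. The sketch below assumes the Dask Operator from the prerequisites is installed; the cluster name and images are placeholders, and a real deployment would use an image that bundles NeMo Curator and its dependencies. The operator labels the scheduler pod with ``dask.org/component=scheduler``, which is the label selector used by the ``kubeclient.py`` example further down.

.. code-block:: yaml

    apiVersion: kubernetes.dask.org/v1
    kind: DaskCluster
    metadata:
      name: nemo-curator-dask                   # placeholder cluster name
    spec:
      worker:
        replicas: 2
        spec:
          containers:
            - name: worker
              image: ghcr.io/dask/dask:latest   # replace with an image containing NeMo Curator
              args: ["dask-worker", "--name", "$(DASK_WORKER_NAME)"]
              resources:
                limits:
                  nvidia.com/gpu: 1             # GPU workers for the GPU-accelerated modules
              volumeMounts:
                - name: workspace
                  mountPath: /nemo-workspace
          volumes:
            - name: workspace
              persistentVolumeClaim:
                claimName: nemo-workspace
      scheduler:
        spec:
          containers:
            - name: scheduler
              image: ghcr.io/dask/dask:latest   # scheduler image should match the workers
              args: ["dask-scheduler"]
              ports:
                - name: tcp-comm
                  containerPort: 8786
                - name: http-dashboard
                  containerPort: 8787
        service:
          type: ClusterIP
          selector:
            dask.org/cluster-name: nemo-curator-dask
            dask.org/component: scheduler
          ports:
            - name: tcp-comm
              port: 8786
              targetPort: "tcp-comm"
            - name: http-dashboard
              port: 8787
              targetPort: "http-dashboard"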
Now that you've curated your data, let's discuss where to go next in the NeMo Framework to put it to good use.

`NeMo Curator on AWS `__
   Demonstration of how to run the existing NeMo Curator modules on AWS services.

`Tutorials `__
   To get started, you can explore the NeMo Curator GitHub repository and follow the available tutorials and notebooks. These resources cover various aspects of data curation, including training from scratch and Parameter-Efficient Fine-Tuning (PEFT).

diff --git a/examples/k8s/kubeclient.py b/examples/k8s/kubeclient.py
new file mode 100644
index 00000000..c7e6e26a
--- /dev/null
+++ b/examples/k8s/kubeclient.py
@@ -0,0 +1,51 @@

import argparse

from kubernetes import client, config
from kubernetes.stream import stream


def execute_command_in_scheduler_pod(api_instance, pod_name, namespace, command):
    # Build the shell command to execute inside the scheduler pod
    exec_command = ["/bin/sh", "-c", command]

    # Execute the command in the pod and capture its output
    resp = stream(
        api_instance.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=exec_command,
        stderr=True,
        stdin=False,
        stdout=True,
        tty=False,
    )
    print("Response: " + resp)


def get_scheduler_pod(api_instance, label_selector):
    scheduler_pods = api_instance.list_pod_for_all_namespaces(
        watch=False, label_selector=label_selector
    )
    # Return the name of the first scheduler pod matching the label selector
    return scheduler_pods.items[0].metadata.name


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--command", type=str, required=True)
    parser.add_argument("--kubeconfig", type=str)
    args = parser.parse_args()

    # Load kube config from the provided file, or fall back to the in-cluster service account
    if args.kubeconfig:
        config.load_kube_config(args.kubeconfig)
    else:
        config.load_incluster_config()

    # Create a Kubernetes CoreV1 API client
    api_instance = client.CoreV1Api()

    # Find the Dask scheduler pod and execute the requested command inside it
    pod_name = get_scheduler_pod(api_instance, "dask.org/component=scheduler")
    namespace = "default"
    execute_command_in_scheduler_pod(api_instance, pod_name, namespace, args.command)
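As a usage sketch for the script above: run it from a machine that has a kubeconfig for the cluster, or from the client pod, where it falls back to the in-cluster service account. The ``--command`` value and script path here are illustrative placeholders, not taken from the guide.

.. code-block:: bash

    # From outside the cluster, with an explicit kubeconfig
    python examples/k8s/kubeclient.py \
        --kubeconfig ~/.kube/config \
        --command "python /nemo-workspace/scripts/run_curation.py"

    # From the client pod inside the cluster, omit --kubeconfig so the
    # in-cluster service account credentials are used instead
    python examples/k8s/kubeclient.py \
        --command "python /nemo-workspace/scripts/run_curation.py"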