Adding an example for executing NeMo modules using kubernetes Python … #148
base: main
Thanks for adding this! The main thing is that the documentation could be improved a bit by explaining why each step is necessary and why your method overall would be preferred to the other existing options. Code looks good, short and sweet!
30a5e68 to d489d5e
Most of my comments stem from my lack of familiarity with AWS EKS. I do think it's relevant to explain some of the "basics" of EKS since many of our users will be like me with little background in these systems. Thanks again for making this guide!
docs/user-guide/aws-examples/eks.rst
Outdated
Running NeMo Curator on AWS EKS
======================================

Prerequisuites:
Typo. Should be "Prerequisites"
docs/user-guide/aws-examples/eks.rst
Outdated
Prerequisuites:

* EKS Cluster:
* `Dask Operator <https://kubernetes.dask.org/en/latest/operator_installation.html>`__
"Dask Operator" is included twice, and the link does not go anywhere either. Please remove one of the mentions and fix the link.
docs/user-guide/aws-examples/eks.rst
Outdated
Running NeMo Curator on AWS EKS
======================================

Prerequisuites:
Could you please add a high-level "Background" section at the top of this document? It should answer the following questions:
- What is AWS EKS at a high level?
- What is this guide going to show me how to do on AWS EKS with NeMo Curator, and what are the fundamental assumptions that it makes?
  - Will it cover how to run any prebuilt CLI script in NeMo Curator?
  - Will it cover how to run a Python script that I have made?
  - Will it cover how to upload my data to EKS?
- Is there any reason to prefer running NeMo Curator on AWS EKS?
  - The answer to this might be "only if you are familiar with it or if it's the only compute option you have", which is fine. Just wondering if it gives any benefits.
docs/user-guide/aws-examples/eks.rst
Outdated
* EKS Cluster:
* `Dask Operator <https://kubernetes.dask.org/en/latest/operator_installation.html>`__
* If self managed node group is created with ubuntu worker nodes then install GPU operator.
"If self managed node group is created with ubuntu worker nodes"
"If EKS managed node group is created with Amazon Linux 2 worker nodes"
What determines whether either of these conditions is true? Is it up to the user? If so, which one should we recommend?
docs/user-guide/aws-examples/eks.rst
Outdated
- Binds the pod-exec ClusterRole to a specific ServiceAccount (default in the default namespace).
- This means that any pods using the default ServiceAccount in the default namespace will have the permissions specified in the pod-exec ClusterRole.

Now, we can spin up a client pod.
Delete the space at the beginning of this line.
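For readers who, like me, are new to Kubernetes RBAC, the binding described in the excerpt above could be sketched roughly as follows. This is only an illustration, not the manifest from the PR: the binding name `pod-exec-binding` is a hypothetical placeholder, while the `pod-exec` ClusterRole and the `default` ServiceAccount in the `default` namespace come from the quoted text.

```python
import json


def pod_exec_binding(service_account: str = "default", namespace: str = "default") -> dict:
    """Build a ClusterRoleBinding manifest that grants the pod-exec ClusterRole
    to one ServiceAccount. The metadata name is a hypothetical placeholder."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": "pod-exec-binding"},  # hypothetical name, not from the PR
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "pod-exec",
        },
    }


# Print the manifest as JSON, a format kubectl accepts alongside YAML.
print(json.dumps(pod_exec_binding(), indent=2))
```

Any pod that runs under the listed ServiceAccount would then inherit whatever the `pod-exec` ClusterRole allows, which is why the client pod needs no extra credentials.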
docs/user-guide/aws-examples/eks.rst
Outdated
command: ["sh", "-c", "pip install kubernetes && sleep infinity"]
EOF

Here, we are using a light-weight public python docker image and installing kubernetes Python client package so that we can run kubeclient.py from this client pod and connect to the scheduler pod to run existing NeMo modules.
Nit: "NeMo" -> "NeMo Curator"
docs/user-guide/aws-examples/eks.rst
Outdated
kubectl exec client-pod -- python3 kubeclient.py --command "add_id --scheduler-address localhost:8786 --input-data-dir=/nemo-workspace/arxiv --output-data-dir=/nemo-workspace/arxiv-addid/"

This approach is allows the execution of NeMo modules within the scheduler pod from a separate client pod. This separation ensures that the client pod can be provisioned with specific permissions tailored for executing commands and accessing resources within the Kubernetes environment.
Nit: "NeMo" -> "NeMo Curator"
docs/user-guide/aws-examples/eks.rst
Outdated
This approach is allows the execution of NeMo modules within the scheduler pod from a separate client pod. This separation ensures that the client pod can be provisioned with specific permissions tailored for executing commands and accessing resources within the Kubernetes environment.

Moreover, deploying this client pod can be orchestrated by another service such as AWS Batch, facilitating scalable and efficient management of computational tasks within Kubernetes clusters.
Thank you for this explanation! This makes a lot of sense. Do you mind moving it to before you spin up the client pod so the user has background into why they would be doing this?
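To make the client-pod idea concrete, here is a minimal sketch of how a script like kubeclient.py might run a command inside the scheduler pod with the Kubernetes Python client. The PR's actual script is not shown in this thread, so treat this as an assumption: the function names, the `pod_name` argument, and the `default` namespace are hypothetical, and only `load_incluster_config`, `CoreV1Api.connect_get_namespaced_pod_exec`, and `kubernetes.stream.stream` are real client calls.

```python
import shlex


def build_exec_command(command: str) -> list[str]:
    """Split a shell-style command string into the argv list for the exec call."""
    return shlex.split(command)


def run_in_scheduler_pod(command: str, pod_name: str, namespace: str = "default") -> str:
    """Run `command` inside `pod_name` and return its combined output.

    pod_name and namespace are hypothetical parameters for this sketch.
    """
    # Imported lazily so build_exec_command works without the client installed.
    from kubernetes import client, config
    from kubernetes.stream import stream

    # Inside the client pod, authenticate with the pod's ServiceAccount token.
    config.load_incluster_config()
    api = client.CoreV1Api()
    return stream(
        api.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=build_exec_command(command),
        stderr=True,
        stdin=False,
        stdout=True,
        tty=False,
    )
```

The exec call only succeeds if the client pod's ServiceAccount is allowed to create `pods/exec`, which is presumably what the pod-exec ClusterRole is for.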
storageClassName: efs-sc
resources:
  requests:
    storage: 150Gi
Not too familiar with EKS, but should this be 150GiB instead?
I followed this example: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/examples/kubernetes/dynamic_provisioning/specs/pod.yaml and it works.
Great, thanks for checking.
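For context on the `Gi` versus `GiB` question: Kubernetes resource quantities use the binary suffixes `Ki`, `Mi`, `Gi`, `Ti`, `Pi`, and `Ei` without a trailing `B`, so `150Gi` means 150 × 2^30 bytes. The helper below is a small illustrative sketch of that convention (the function name is hypothetical and it handles only whole-number binary quantities, not the full Kubernetes quantity grammar).

```python
# Binary (power-of-two) suffixes used by Kubernetes resource quantities.
BINARY_SUFFIXES = {
    "Ki": 2**10,
    "Mi": 2**20,
    "Gi": 2**30,
    "Ti": 2**40,
    "Pi": 2**50,
    "Ei": 2**60,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a whole-number binary quantity such as '150Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported or missing suffix in {quantity!r}")


print(quantity_to_bytes("150Gi"))  # 161061273600
```

In this sketch a value like `150GiB` raises `ValueError`, mirroring the fact that `GiB` is not one of the suffixes listed above.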
…client Signed-off-by: dpadmanabhan <dpadmanabhan@nvidia.com>
120ca1b to 88c44a0
I left a few more comments on some wording choices. Also it seems like most of my feedback was not addressed, so please address it.
@@ -36,6 +36,9 @@
:ref:`Next Steps <data-curator-next-steps>`
Now that you've curated your data, let's discuss where to go next in the NeMo Framework to put it to good use.

`NeMo Curator on AWS <https://github.com/NVIDIA/NeMo-Curator/tree/main/docs/user-guide/aws-examples/>`__
Can you follow the pattern of other elements in the list and do something like:
:ref:`NeMo Curator on AWS <data-curator-aws-examples>`
Then create an aws-examples/index.rst with something like this:

.. _data-curator-aws-examples:

==================
AWS
==================

.. toctree::
   :maxdepth: 4
   :titlesonly:

   eks.rst
For more details, refer to `EKS documentation <https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html>`__

This guide covers all essential prerequisites. It includes an example demonstrating how to create an EFS storage class and offers step-by-step instructions for setting up an EFS Persistent Volume Claim to dynamically provisioning Kubernetes Persistent Volume. Furthermore, it outlines the required steps to deploy a Dask cluster and delves into utilizing the Kubernetes Python client library to assign NeMo-Curator tasks to the Dask scheduler.
Nit: "assign NeMo-Curator tasks to the Dask scheduler" -> "run NeMo Curator scripts"
Dask cluster creation:
----------------------

Please refer index.rst for instructions on creating a Docker secret to utilize the NeMo image and upload data to the PVC created in the previous step.
Can you fix this link and also take into account my other feedback about calling out the relevant sections?
…client
Description
This PR adds an example demonstrating the usage of the Kubernetes Python client to execute NeMo modules on the scheduler pod.