Adding an example for executing NeMo modules using kubernetes Python … #148

Open
wants to merge 3 commits into
base: main

dpadmanabhan03

…client

Description

This PR adds an example demonstrating the usage of the Kubernetes Python client to execute NeMo modules on the scheduler pod.

Usage

Updated documentation.
# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Collaborator

@ryantwolf left a comment

Thanks for adding this! The main thing is that the documentation could be improved a bit by explaining why each step is necessary and why your method overall would be preferred to the other existing options. Code looks good, short and sweet!

docs/user-guide/kubernetescurator.rst: 10 outdated review threads, resolved (collapsed)

Collaborator

@ryantwolf left a comment

Most of my comments stem from my lack of familiarity with AWS EKS. I do think it's relevant to explain some of the "basics" of EKS since many of our users will be like me with little background in these systems. Thanks again for making this guide!

docs/user-guide/aws-examples/eks.rst (resolved)
Running NeMo Curator on AWS EKS
======================================

Prerequisuites:

Collaborator

Typo. Should be "Prerequisites"

Prerequisuites:

* EKS Cluster:
* `Dask Operator <https://kubernetes.dask.org/en/latest/operator_installation.html>`__

Collaborator

"Dask Operator" is included twice, and the link does not go anywhere either. Please remove one of the mentions and fix the link.

Running NeMo Curator on AWS EKS
======================================

Prerequisuites:

Collaborator

Could you please add a high-level "Background" section at the top of this document? It should answer the following questions:

  • What is AWS EKS at a high level?
  • What is this guide going to show me how to do on AWS EKS with NeMo Curator, and what are the fundamental assumptions that it makes?
    • Will it cover how to run any prebuilt CLI script in NeMo Curator?
    • Will it cover how to run a Python script that I have made?
    • Will it cover how to upload my data to EKS?
  • Is there any reason to prefer running NeMo Curator on AWS EKS?
    • The answer to this might be "only if you are familiar with it or if it's the only compute option you have", which is fine. Just wondering if it gives any benefits.


* EKS Cluster:
* `Dask Operator <https://kubernetes.dask.org/en/latest/operator_installation.html>`__
* If self managed node group is created with ubuntu worker nodes then install GPU operator.

Collaborator

If self managed node group is created with ubuntu worker nodes

If EKS managed node group is created with Amazon Linux 2 worker nodes

What determines whether either of these conditions is true? Is it up to the user? If so, which one should we recommend?

- Binds the pod-exec ClusterRole to a specific ServiceAccount (default in the default namespace).
- This means that any pods using the default ServiceAccount in the default namespace will have the permissions specified in the pod-exec ClusterRole.

Now, we can spin up a client pod.

Collaborator

Delete the space at the beginning of this line.
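
For readers new to Kubernetes RBAC, the binding described in the two bullets above can be sketched as a manifest built in Python. This is an illustrative sketch only, not the manifest from the PR: the binding name `pod-exec-binding` and the helper name are hypothetical, while the `pod-exec` ClusterRole and the `default` ServiceAccount in the `default` namespace come from the quoted text.

```python
import json


def build_cluster_role_binding(
    binding_name="pod-exec-binding",  # hypothetical name, not taken from the PR
    role_name="pod-exec",
    service_account="default",
    namespace="default",
):
    """Return a ClusterRoleBinding manifest as a plain dict."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": binding_name},
        # The ClusterRole whose permissions are being granted.
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        # Every pod running as this ServiceAccount receives those permissions.
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


manifest_json = json.dumps(build_cluster_role_binding(), indent=2)
print(manifest_json)
```

Because `kubectl` accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.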

command: ["sh", "-c", "pip install kubernetes && sleep infinity"]
EOF

Here, we are using a light-weight public python docker image and installing kubernetes Python client package so that we can run kubeclient.py from this client pod and connect to the scheduler pod to run existing NeMo modules.

Collaborator

Nit: "NeMo" -> "NeMo Curator"

kubectl exec client-pod -- python3 kubeclient.py --command "add_id --scheduler-address localhost:8786 --input-data-dir=/nemo-workspace/arxiv --output-data-dir=/nemo-workspace/arxiv-addid/"


This approach is allows the execution of NeMo modules within the scheduler pod from a separate client pod. This separation ensures that the client pod can be provisioned with specific permissions tailored for executing commands and accessing resources within the Kubernetes environment.

Collaborator

Nit: "NeMo" -> "NeMo Curator"


This approach is allows the execution of NeMo modules within the scheduler pod from a separate client pod. This separation ensures that the client pod can be provisioned with specific permissions tailored for executing commands and accessing resources within the Kubernetes environment.

Moreover, deploying this client pod can be orchestrated by another service such as AWS Batch, facilitating scalable and efficient management of computational tasks within Kubernetes clusters.

Collaborator

Thank you for this explanation! This makes a lot of sense. Do you mind moving it to before you spin up the client pod so the user has background into why they would be doing this?
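
To make the client-pod approach concrete, here is a minimal sketch of how a script like `kubeclient.py` might run a command inside the scheduler pod with the Kubernetes Python client. It is not the script from this PR: the function names and the `pod_name` parameter are hypothetical, and it assumes the client pod's ServiceAccount is allowed to exec into pods.

```python
import shlex


def build_exec_command(command):
    """Split a shell-style command string into an argv list for pod exec."""
    return shlex.split(command)


def run_in_scheduler_pod(pod_name, command, namespace="default"):
    """Run `command` inside `pod_name` and return its output as a string."""
    # Imported lazily so build_exec_command works without the package installed.
    from kubernetes import client, config
    from kubernetes.stream import stream

    # Inside a pod, authenticate with the mounted ServiceAccount token.
    config.load_incluster_config()
    api = client.CoreV1Api()
    return stream(
        api.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=build_exec_command(command),
        stderr=True,
        stdin=False,
        stdout=True,
        tty=False,
    )
```

From the client pod, a call such as `run_in_scheduler_pod("<scheduler-pod-name>", "add_id --scheduler-address localhost:8786 ...")` would then execute the module on the scheduler pod; the pod name is a placeholder that depends on the deployed Dask cluster.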

storageClassName: efs-sc
resources:
  requests:
    storage: 150Gi

Collaborator

@ryantwolf Jul 19, 2024

Not too familiar with EKS, but should this be 150GiB instead?

Author

Collaborator

Great, thanks for checking.
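
For context on the unit question: Kubernetes resource quantities use binary suffixes written without a trailing `B`, so `150Gi` means 150 × 2^30 bytes, and `150GiB` is not a valid quantity. A minimal Python sketch of that conversion follows; the helper name is hypothetical and only integer values with binary suffixes are handled.

```python
BINARY_SUFFIXES = {
    "Ki": 2**10,
    "Mi": 2**20,
    "Gi": 2**30,
    "Ti": 2**40,
    "Pi": 2**50,
    "Ei": 2**60,
}


def binary_quantity_to_bytes(quantity):
    """Convert a quantity such as '150Gi' to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


print(binary_quantity_to_bytes("150Gi"))  # 161061273600
```

Passing `"150GiB"` to this helper raises `ValueError`, mirroring the fact that the trailing `B` is not part of the Kubernetes suffix.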

…client

Signed-off-by: dpadmanabhan <dpadmanabhan@nvidia.com>

Collaborator

@ryantwolf left a comment

I left a few more comments on some wording choices. Also it seems like most of my feedback was not addressed, so please address it.

@@ -36,6 +36,9 @@
:ref:`Next Steps <data-curator-next-steps>`
Now that you've curated your data, let's discuss where to go next in the NeMo Framework to put it to good use.

`NeMo Curator on AWS <https://github.com/NVIDIA/NeMo-Curator/tree/main/docs/user-guide/aws-examples/>`__

Collaborator

Can you follow the pattern of other elements in the list and do something like:

:ref:`NeMo Curator on AWS <data-curator-aws-examples>`

Then create an aws-examples/index.rst with something like this:

.. _data-curator-aws-examples:

==================
AWS
==================

.. toctree::
   :maxdepth: 4
   :titlesonly:

   eks.rst


For more details, refer to `EKS documentation <https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html>`__

This guide covers all essential prerequisites. It includes an example demonstrating how to create an EFS storage class and offers step-by-step instructions for setting up an EFS Persistent Volume Claim to dynamically provisioning Kubernetes Persistent Volume. Furthermore, it outlines the required steps to deploy a Dask cluster and delves into utilizing the Kubernetes Python client library to assign NeMo-Curator tasks to the Dask scheduler.

Collaborator

Nit: "assign NeMo-Curator tasks to the Dask scheduler" -> "run NeMo Curator scripts"

storageClassName: efs-sc
resources:
  requests:
    storage: 150Gi

Collaborator

Great, thanks for checking.

Dask cluster creation:
----------------------

Please refer index.rst for instructions on creating a Docker secret to utilize the NeMo image and upload data to the PVC created in the previous step.

Collaborator

Can you fix this link and also take into account my other feedback about calling out the relevant sections?

@sarahyurick added the documentation label Oct 7, 2024