
Add support for NeMo SDK #131

Merged
ryantwolf merged 13 commits into main from rywolf/nemo-sdk on Jul 9, 2024

Conversation

@ryantwolf (Collaborator) commented Jun 27, 2024

Description

NeMo SDK is a library designed to make it easier to run different parts of the NeMo Framework across computing platforms. It serves as an enhanced version of the NeMo Framework Launcher. This PR adds an example and a simple config shortcut for running NeMo Curator scripts on Slurm clusters using NeMo SDK.

Usage

See examples/nemo_sdk/slurm.py for example usage.
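For orientation, here is a hedged sketch of what launching a Curator script through NeMo SDK on Slurm can look like, assembled from the snippets reviewed later in this thread. The executor/experiment class names, the SlurmJobConfig field names, and the to_script() call are assumptions inferred from this PR, not a verbatim copy of the example file.

# Hedged sketch only: class, field, and method names are assumptions
# inferred from the review snippets below, not copied from the example.
import nemo_sdk as sdk

from nemo_curator.nemo_sdk import SlurmJobConfig

# Where and how the job runs; the paths and image tag mirror the executor
# arguments quoted later in this conversation.
executor = sdk.SlurmExecutor(
    nodes=2,  # assumed keyword
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
)

# The NeMo Curator work to run once the Dask cluster is up; the field values
# here are placeholders and the field names are assumptions.
curator_job = SlurmJobConfig(
    job_dir="/path/in/container/jobs",
    container_entrypoint="/path/in/container/container-entrypoint.sh",
    script_command="add_id --input-data-dir=/path/in/container/data",
)

with sdk.Experiment("nemo-curator-example") as exp:
    exp.add(curator_job.to_script(), executor=executor)
    exp.run()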

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@ryantwolf marked this pull request as ready for review June 28, 2024 17:54
@ryantwolf requested a review from ayushdg July 1, 2024 22:31
@sarahyurick (Collaborator) left a comment:

LGTM! I really like the user guide and how straightforward it is. Just left a couple comments about helping the user know what parameters they should use.

docs/user-guide/nemosdk.rst (outdated; resolved)
Comment on lines +29 to +36
interface: str = "eth0"
protocol: str = "tcp"
cpu_worker_memory_limit: str = "0"
rapids_no_initialize: str = "1"
cudf_spill: str = "1"
rmm_scheduler_pool_size: str = "1GB"
rmm_worker_pool_size: str = "72GiB"
libcudf_cufile_policy: str = "OFF"
Collaborator:

Similar to my other comment, I often have trouble knowing what to set for these types of parameters. Is there anywhere the user can refer to for recommendations on how to set them for their specific cluster?

@ryantwolf (Collaborator, Author):

I agree we should release a bigger guide on our recommendations for each parameter. For now I've included a docstring that should provide a bit more context. Let me know if you want me to change anything else to make it clearer.
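For readers with the same question, a hedged gloss of those fields follows. The comments reflect the usual meaning of the corresponding Dask/RAPIDS settings and environment variables; they are an annotation added here, not the docstring from this PR, and the enclosing dataclass name is assumed to be the SlurmJobConfig shown later in this thread.

from dataclasses import dataclass

@dataclass
class SlurmJobConfig:  # annotated copy of the field subset quoted above
    interface: str = "eth0"               # network interface Dask communicates over
    protocol: str = "tcp"                 # Dask comm protocol; "ucx" is the common GPU-cluster alternative
    cpu_worker_memory_limit: str = "0"    # memory limit per CPU worker; "0" means no limit
    rapids_no_initialize: str = "1"       # RAPIDS_NO_INITIALIZE: defer CUDA context creation until needed
    cudf_spill: str = "1"                 # CUDF_SPILL: allow cuDF to spill device memory to host
    rmm_scheduler_pool_size: str = "1GB"  # RMM memory pool reserved on the scheduler
    rmm_worker_pool_size: str = "72GiB"   # RMM memory pool reserved on each GPU worker
    libcudf_cufile_policy: str = "OFF"    # LIBCUDF_CUFILE_POLICY: GPUDirect Storage (cuFile) use in libcudf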

@VibhuJawa (Collaborator) left a comment:

Mostly looks good; I'll take another look once you link nemo_sdk.

docs/user-guide/nemosdk.rst (resolved)
examples/nemo_sdk/launch_slurm.py (resolved)
Comment on lines 20 to 21
mkdir -p $LOGDIR
mkdir -p $PROFILESDIR
Collaborator:

This is run inside the container?

@ryantwolf (Collaborator, Author):

Yes, you're correct. It's a good callout, and it makes me think we maybe should have done it this way from the beginning: the contents of $LOGDIR and $PROFILESDIR get written inside the container, so they ought to be initialized in it too. Let me know if you disagree.

Collaborator:

Since the logs are accessed from outside the container, we should make it clear what the path is from outside the container. To that end, I think we should echo $LOGDIR and $PROFILESDIR to help whoever is debugging this.

Collaborator:

Agreed. Most of my setups mount these in a location that's accessible from outside the compute nodes, so it's worth keeping in mind that the end goal for these logs/profiles is somewhere within the mounted dirs.

@ryantwolf (Collaborator, Author):

Added echo and comments.

examples/slurm/start-slurm.sh (resolved)
nemo_curator/nemo_sdk/slurm.py (resolved)
@VibhuJawa (Collaborator) left a comment:

Mostly looks good to me; I've added non-blocking comments around LOGDIR (which is mostly unrelated to this PR).

@sarahyurick (Collaborator) left a comment:

Thanks for the updates, LGTM!

exclusive=True,                                             # request exclusive use of the allocated nodes
time="04:00:00",                                            # Slurm walltime limit
container_image="nvcr.io/nvidia/nemo:dev",
container_mounts=["/path/on/machine:/path/in/container"],   # host_path:container_path
Collaborator:

This is maybe a question for nemo_sdk (apologies for my lack of familiarity): can users pass in additional args here for other Slurm options?

@ryantwolf (Collaborator, Author):

Haha, no need to apologize for your lack of familiarity. Yes, the user can pass in additional args.
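As an illustration of that, here is a hypothetical executor call with a couple of extra Slurm options added; the added keyword names (account, partition) are assumptions about the NeMo SDK executor and are not confirmed by this PR.

import nemo_sdk as sdk  # assumed import, as in the sketch earlier

# Hypothetical: the added keywords below are assumptions, not confirmed here.
executor = sdk.SlurmExecutor(
    account="my-account",   # extra Slurm accounting option (assumed keyword)
    partition="gpu",        # extra Slurm partition option (assumed keyword)
    exclusive=True,
    time="04:00:00",
    container_image="nvcr.io/nvidia/nemo:dev",
    container_mounts=["/path/on/machine:/path/in/container"],
)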


@dataclass
class SlurmJobConfig:
@ayushdg (Collaborator):

Also tagging @jacobtomlinson who's done a lot of work on the dask/dask-cuda clusters with Slurm (among other things).

For now this mimics the command line setup to start clusters, but feel free to share any opinions you might have since this overlaps a lot with the dask-runners/dask-jobqueue api.

@jacobtomlinson (Member):

Thanks @ayushdg. This seems to follow the common pattern that a lot of Slurm implementations use, so I don't have any particular comments. I'm always keen to see how we can reuse code, though, so maybe we could work towards a common base in dask-jobqueue that projects like this can use instead of reinventing it each time.
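For readers unfamiliar with the API being compared against, here is ordinary dask-jobqueue usage (standard SLURMCluster API, not part of this PR): it submits Slurm jobs that each start a Dask worker, rather than wrapping a cluster-start script the way the Slurm config in this PR does.

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Slurm job submitted by this cluster runs a Dask worker.
cluster = SLURMCluster(
    queue="gpu",            # Slurm partition (placeholder)
    account="my-account",   # Slurm account (placeholder)
    cores=16,
    memory="64GB",
    walltime="04:00:00",
)
cluster.scale(jobs=2)       # submit two worker jobs
client = Client(cluster)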

@ryantwolf merged commit 9a3bbbd into main on Jul 9, 2024
3 checks passed
@ryantwolf deleted the rywolf/nemo-sdk branch July 9, 2024 16:50
sarahyurick pushed a commit to sarahyurick/NeMo-Curator that referenced this pull request on Jul 23, 2024:

* Begin docs
* Add slurm sdk example
* Use safe import
* Fix bugs in sdk
* Update docs and tweak scripts
* Add interface helper function
* Update docs
* Fix formatting
* Add config docstring
* Address comments

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>