add cmd for running notebook
Signed-off-by: Yang Yu <yayu@yayu-mlt.client.nvidia.com>
Yang Yu committed Oct 2, 2024
1 parent a2725ed commit 1261e75
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion tutorials/pretraining-data-curation/README.md
@@ -6,4 +6,6 @@ This tutorial demonstrates the usage of NeMo Curator to curate the RedPajama-Dat
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. In this tutorial, we will perform data curation on two snapshots for demonstration purposes.

## Getting Started
-This tutorial is designed for a multi-node environment and uses Slurm for scheduling and allocating resources. To start the tutorial, run the `start-distributed-notebook.sh` script in this directory, which will start the Jupyter notebook that demonstrates the step-by-step walkthrough of the end-to-end curation pipeline. The notebook will run on port 8000 of the scheduler node. To work with the notebook locally, you can set up an SSH connection to the scheduler node.
+This tutorial is designed for a multi-node environment and uses Slurm for scheduling and allocating resources. To start the tutorial, run the `start-distributed-notebook.sh` script in this directory, which will start the Jupyter notebook that demonstrates the step-by-step walkthrough of the end-to-end curation pipeline. The notebook will run on port 8000 of the scheduler node. To work with the notebook locally, you can set up an SSH connection to the scheduler node:
+
+`ssh -L <local_port>:localhost:8888 <user>@<scheduler_address>`
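
For illustration, the port-forwarding command can be filled in as in the sketch below. The user name, host name, and local port are hypothetical placeholders, not values from the repository. The remote port 8888 is copied from the command in the diff and should match whichever port the notebook actually listens on. The `-N` flag is a standard OpenSSH option that skips running a remote command; it is an optional addition, not part of the README.

```shell
# Hypothetical values for illustration only; substitute your own.
local_port=8888
user=alice
scheduler_address=scheduler.example.com

# Fill in the placeholders of the README command.
# -N (standard OpenSSH flag) opens the tunnel without starting a remote shell.
cmd="ssh -N -L ${local_port}:localhost:8888 ${user}@${scheduler_address}"

# Print the command rather than executing it, so the sketch is safe to run anywhere.
echo "$cmd"
echo "Then open http://localhost:${local_port} in a local browser."
```

Once the tunnel is open, requests to the chosen local port are forwarded over SSH to the notebook server on the scheduler node.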
