AWS SageMaker (#1421)
* API Overhaul

First draft of the API overhaul changes. Adds most core functionality, including
defining workflow graphs with a ColumnGroup class, the workflow and dataset changes,
and converting most operators to the new API.
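The graph-definition style described here can be sketched with a minimal mock; `ColumnGroup`, the `>>` chaining, and the `Rename` op below are illustrative stand-ins, not the actual NVTabular API:

```python
class Operator:
    """Base class for ops; transform receives the column list and data."""
    def transform(self, columns, data):
        raise NotImplementedError

class Rename(Operator):
    """Toy op: rename each column by appending a suffix."""
    def __init__(self, suffix):
        self.suffix = suffix

    def transform(self, columns, data):
        return {c + self.suffix: data[c] for c in columns}

class ColumnGroup:
    """A node in the workflow graph: a set of columns plus a chain of ops."""
    def __init__(self, columns, ops=None):
        self.columns = list(columns)
        self.ops = ops or []

    def __rshift__(self, op):
        # `cols >> op` appends an operator, building the graph lazily
        return ColumnGroup(self.columns, self.ops + [op])

    def apply(self, data):
        # Walk the op chain, feeding each op's output to the next
        for op in self.ops:
            data = op.transform(self.columns, data)
            self.columns = list(data)
        return data

graph = ColumnGroup(["userId"]) >> Rename("_renamed")
print(graph.apply({"userId": [1, 2, 3]}))
# {'userId_renamed': [1, 2, 3]}
```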

* remove debug print statement

* Fix test_io unittest

Also partially fix some tests inside test_workflow

* Handle multi-column joint/combo categorify
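For context, "joint" and "combo" multi-column categorify can be illustrated with a simplified sketch: joint encoding shares one vocabulary across all listed columns, while combo encoding treats the tuple of values as a single key. The functions below are a toy version, not the NVTabular implementation:

```python
def categorify_joint(rows, cols):
    """Joint: all listed columns share one vocabulary, so the same
    raw value maps to the same id in every column."""
    vocab, out = {}, []
    for row in rows:
        enc = {}
        for c in cols:
            v = row[c]
            if v not in vocab:
                vocab[v] = len(vocab)
            enc[c] = vocab[v]
        out.append(enc)
    return out

def categorify_combo(rows, cols):
    """Combo: the tuple of column values acts as a single key with
    its own vocabulary, producing one encoded column."""
    vocab, out = {}, []
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key not in vocab:
            vocab[key] = len(vocab)
        out.append({"_".join(cols): vocab[key]})
    return out
```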

* Update JoinGroupby

* Fix differencelag

* add dependencies method (#498)

* Convert TargetEncoding op
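Target encoding replaces a category with a smoothed mean of the target for that category. A minimal sketch of the standard recipe follows; the smoothing constant and function name are illustrative, not NVTabular's exact op:

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=20):
    """Per-category mean of the target, shrunk toward the global mean.
    Larger smoothing pulls rare categories harder toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # (sum + k * global_mean) / (n + k): a weighted blend of the
    # per-category mean and the global mean
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
```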

* Update nvtabular/workflow.py

Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>

* Update nvtabular/workflow.py

Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>

* Remove workflow code from dataloaders

We should now be doing online transforms like
`KerasSequenceLoader(workflow.transform(dataset), ...)` instead of
`KerasSequenceLoader(dataset, workflows=[workflow], ...)`
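A minimal mock of this pattern (all class names below are stand-ins for illustration, not the real NVTabular or dataloader API): the workflow wraps the dataset lazily, and rows are transformed only as the loader iterates.

```python
class Workflow:
    """Holds fitted statistics; transform wraps a dataset lazily."""
    def __init__(self, stats):
        self.stats = stats

    def transform(self, dataset):
        return TransformedDataset(dataset, self.stats)

class TransformedDataset:
    """A lazy view: rows are transformed at iteration time."""
    def __init__(self, dataset, stats):
        self.dataset, self.stats = dataset, stats

    def __iter__(self):
        mean = self.stats["mean"]
        for row in self.dataset:
            yield row - mean  # online transform at load time

class KerasSequenceLoaderMock:
    """Stand-in for a dataloader: consumes any iterable dataset."""
    def __init__(self, dataset, batch_size=2):
        self.dataset, self.batch_size = dataset, batch_size

    def batches(self):
        batch = []
        for item in self.dataset:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

loader = KerasSequenceLoaderMock(Workflow({"mean": 2}).transform([1, 2, 3]))
print(list(loader.batches()))
# [[-1, 0], [1]]
```

The design point is that the dataloader no longer needs to know about workflows at all; it only sees an iterable dataset.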

* Unittest ops + bugfix in Bucketize (#496)
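For context, a bucketize op maps continuous values to bucket indices given a sorted list of boundaries; a minimal sketch (not the NVTabular implementation):

```python
import bisect

def bucketize(values, boundaries):
    """Assign each value the index of its bucket: len(boundaries)
    boundaries define len(boundaries) + 1 buckets. Values equal to a
    boundary fall into the upper bucket."""
    return [bisect.bisect_right(boundaries, v) for v in values]

print(bucketize([1, 5, 10], [3, 7]))
# [0, 1, 2]
```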

* test_minmix

* updates test

* unittest ops

* First draft get_embedding_sizes support

Re-add get_embedding_sizes. Note that this changes how we support multi-hot columns here
(sizes are returned the same as for single-hot, and we no longer use this method to
distinguish between multi-hot and single-hot columns)
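A toy version of such a method, using a common cardinality-to-width heuristic (the `1.6 * n**0.56` formula and the cap are illustrative assumptions, not necessarily what NVTabular uses):

```python
def get_embedding_sizes(cardinalities, max_size=512):
    """Map each categorical column's cardinality to an
    (input_dim, embedding_dim) pair. Multi-hot columns are treated
    the same as single-hot, as described above."""
    sizes = {}
    for col, card in cardinalities.items():
        # Rule-of-thumb width, capped at max_size and floored at 1
        dim = min(max_size, int(round(1.6 * card ** 0.56)))
        sizes[col] = (card, max(dim, 1))
    return sizes

print(get_embedding_sizes({"userId": 1000, "movieId": 50}))
```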

* isort

* Remove groupbystatistics

* implement serialization of statistics

add save_stats/load_stats/clear_stats methods to the workflow, with each StatOperator getting
called as appropriate
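The save/load pattern can be sketched with JSON; the method names mirror the ones above, but the classes and internals below are hypothetical:

```python
import json
import os
import tempfile

class MinMaxStats:
    """Hypothetical stat operator owning its fitted state."""
    def __init__(self, name):
        self.name, self.stats = name, {}

    def fit(self, values):
        self.stats = {"min": min(values), "max": max(values)}

class Workflow:
    def __init__(self, ops):
        self.ops = ops

    def save_stats(self, path):
        # Each op contributes its stats under its own key
        with open(path, "w") as f:
            json.dump({op.name: op.stats for op in self.ops}, f)

    def load_stats(self, path):
        with open(path) as f:
            blob = json.load(f)
        for op in self.ops:
            op.stats = blob.get(op.name, {})

    def clear_stats(self):
        for op in self.ops:
            op.stats = {}

op = MinMaxStats("age")
op.fit([10, 20, 30])
wf = Workflow([op])
path = os.path.join(tempfile.mkdtemp(), "stats.json")
wf.save_stats(path)
wf.clear_stats()
wf.load_stats(path)
print(op.stats)
# {'min': 10, 'max': 30}
```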

* Fix TF dataloader unittests

* test_torch_dataloader fixes

* doc strings

* aws sagemaker

* Update cloud_integration.md

Co-authored-by: Ben Frederickson <github@benfrederickson.com>
Co-authored-by: rnyak <ronayak@hotmail.com>
Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
Co-authored-by: root <root@dgx06.aselab.nvidia.com>
Co-authored-by: Karl Higley <kmhigley@gmail.com>
6 people authored Mar 23, 2022
1 parent 9d87ffa commit 4a26637
Showing 2 changed files with 78 additions and 0 deletions.
22 changes: 22 additions & 0 deletions conda/environments/nvtabular_aws_sagemaker.yml
@@ -0,0 +1,22 @@
# Based on https://github.com/NVIDIA-Merlin/NVTabular/blob/main/conda/environments/nvtabular_dev_cuda11.0.yml
name: nvtabular
channels:
- rapidsai
- nvidia
- conda-forge
- defaults
dependencies:
- nvtabular
- python>=3.7
- cudatoolkit=11.0
- cudf>=21.08.*
- dask-cuda>=21.08.*
- dask-cudf>=21.08.*
- rmm>=21.08.*
- dask==2021.7.1
- distributed>=2021.7.1
- nvtx>=0.2.1
- numba>=0.53.0
- dlpack
- scikit-learn
- asvdb
56 changes: 56 additions & 0 deletions docs/source/resources/cloud_integration.md
@@ -144,3 +144,59 @@ To run NVTabular on Databricks, do the following:

9. Select a GPU node for the Worker and Driver.
Once the Databricks cluster is up, NVTabular will be running inside of it.

## AWS SageMaker ##

[AWS SageMaker](https://aws.amazon.com/sagemaker/) is an AWS service to "build, train and deploy machine learning" models, automating and managing the MLOps workflow. It supports Jupyter notebook instances, enabling users to work directly in Jupyter Notebook/JupyterLab without any additional configuration. In this section, we explain how to run NVIDIA Merlin (NVTabular) on AWS SageMaker notebook instances. We adapted the work of [Eugene Yan](https://twitter.com/eugeneyan/) from his [Twitter post](https://twitter.com/eugeneyan/status/1470916049604268035). We tested the workflow on February 1st, 2022, but it is not integrated into our CI workflows, so future releases of Merlin or its dependencies can cause issues.

To run the [movielens example](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/getting-started-movielens) on AWS SageMaker, do the following:

1. Log in to your AWS console and select AWS SageMaker.

2. Select `Notebook` -> `Notebook instances` -> `Create notebook instance`. Give the instance a name and select a notebook instance type with GPUs; for example, we selected `ml.p3.2xlarge`. Please review the costs associated with each instance type. As the platform identifier, select `notebook-al2-v1`. The previous platform identifier runs with TensorFlow 2.1.x, and we had more issues updating it to TensorFlow 2.6.x. The `volume size` can be increased in the `Additional configuration` section.

3. After the instance is running, connect to JupyterLab.

4. Start a terminal to access the command line.

5. The image contains many conda environments, which together require ~60GB of disk space. You can remove some of them from `/home/ec2-user/anaconda3/envs/` to free disk space.

6. Clone the NVTabular repository and install the conda environment.

```
cd /home/ec2-user/SageMaker/
git clone https://github.com/NVIDIA-Merlin/NVTabular.git
conda env create -f=NVTabular/conda/environments/nvtabular_aws_sagemaker.yml
```

7. Activate the conda environment.

```
source /home/ec2-user/anaconda3/etc/profile.d/conda.sh
conda activate nvtabular
```

8. Install additional packages, such as TensorFlow or PyTorch.

```
pip install tensorflow-gpu
pip install torch
pip install graphviz
```

9. Install Transformers4Rec, torchmetrics, and ipykernel.

```
conda install -y -c nvidia -c rapidsai -c numba -c conda-forge transformers4rec
conda install -y torchmetrics ipykernel
```

10. Add the conda environment as an ipykernel.

```
python -m ipykernel install --user --name=nvtabular
```

11. In JupyterLab, switch the kernel to `nvtabular` and run the [movielens example](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/getting-started-movielens).

This workflow enables NVTabular ETL and training with TensorFlow or PyTorch. Deployment with Triton Inference Server will follow soon.
