AWS SageMaker (#1421)
* API Overhaul

First draft of the API overhaul changes. Adds most core functionality, including
defining workflow graphs with a ColumnGroup class, the workflow and dataset changes,
and converting most operators to the new API.
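The graph-definition style described here can be sketched with a minimal mock; `ColumnGroup`, the `>>` chaining, and the `Rename` op below are illustrative stand-ins, not the actual NVTabular API:

```python
class Operator:
    """Base class for ops; transform receives the column list and data."""
    def transform(self, columns, data):
        raise NotImplementedError

class Rename(Operator):
    """Toy op: rename each column by appending a suffix."""
    def __init__(self, suffix):
        self.suffix = suffix

    def transform(self, columns, data):
        return {c + self.suffix: data[c] for c in columns}

class ColumnGroup:
    """A node in the workflow graph: a set of columns plus a chain of ops."""
    def __init__(self, columns, ops=None):
        self.columns = list(columns)
        self.ops = ops or []

    def __rshift__(self, op):
        # `cols >> op` appends an operator, building the graph lazily
        return ColumnGroup(self.columns, self.ops + [op])

    def apply(self, data):
        # Walk the op chain, feeding each op's output to the next
        for op in self.ops:
            data = op.transform(self.columns, data)
            self.columns = list(data)
        return data

graph = ColumnGroup(["userId"]) >> Rename("_renamed")
print(graph.apply({"userId": [1, 2, 3]}))
# {'userId_renamed': [1, 2, 3]}
```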

* remove debug print statement

* Fix test_io unittest

Also partially fix some tests inside test_workflow

* Handle multi-column joint/combo categorify
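For context, "joint" and "combo" multi-column categorify can be illustrated with a simplified sketch: joint encoding shares one vocabulary across all listed columns, while combo encoding treats the tuple of values as a single key. The functions below are a toy version, not the NVTabular implementation:

```python
def categorify_joint(rows, cols):
    """Joint: all listed columns share one vocabulary, so the same
    raw value maps to the same id in every column."""
    vocab, out = {}, []
    for row in rows:
        enc = {}
        for c in cols:
            v = row[c]
            if v not in vocab:
                vocab[v] = len(vocab)
            enc[c] = vocab[v]
        out.append(enc)
    return out

def categorify_combo(rows, cols):
    """Combo: the tuple of column values acts as a single key with
    its own vocabulary, producing one encoded column."""
    vocab, out = {}, []
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key not in vocab:
            vocab[key] = len(vocab)
        out.append({"_".join(cols): vocab[key]})
    return out
```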

* Update JoinGroupby

* Fix differencelag

* add dependencies method (#498)

* Convert TargetEncoding op
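Target encoding replaces a category with a smoothed mean of the target for that category. A minimal sketch of the standard recipe follows; the smoothing constant and function name are illustrative, not NVTabular's exact op:

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=20):
    """Per-category mean of the target, shrunk toward the global mean.
    Larger smoothing pulls rare categories harder toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # (sum + k * global_mean) / (n + k): a weighted blend of the
    # per-category mean and the global mean
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
```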

* Update nvtabular/workflow.py

Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>

* Update nvtabular/workflow.py

Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>

* Remove workflow code from dataloaders

We should now be doing online transforms like
`KerasSequenceLoader(workflow.transform(dataset), ...)` instead of
`KerasSequenceLoader(dataset, workflows=[workflow], ...)`
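A minimal mock of this pattern (all class names below are stand-ins for illustration, not the real NVTabular or dataloader API): the workflow wraps the dataset lazily, and rows are transformed only as the loader iterates.

```python
class Workflow:
    """Holds fitted statistics; transform wraps a dataset lazily."""
    def __init__(self, stats):
        self.stats = stats

    def transform(self, dataset):
        return TransformedDataset(dataset, self.stats)

class TransformedDataset:
    """A lazy view: rows are transformed at iteration time."""
    def __init__(self, dataset, stats):
        self.dataset, self.stats = dataset, stats

    def __iter__(self):
        mean = self.stats["mean"]
        for row in self.dataset:
            yield row - mean  # online transform at load time

class KerasSequenceLoaderMock:
    """Stand-in for a dataloader: consumes any iterable dataset."""
    def __init__(self, dataset, batch_size=2):
        self.dataset, self.batch_size = dataset, batch_size

    def batches(self):
        batch = []
        for item in self.dataset:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

loader = KerasSequenceLoaderMock(Workflow({"mean": 2}).transform([1, 2, 3]))
print(list(loader.batches()))
# [[-1, 0], [1]]
```

The design point is that the dataloader no longer needs to know about workflows at all; it only sees an iterable dataset.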

* Unittest ops + bugfix in Bucketize (#496)
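For context, a bucketize op maps continuous values to bucket indices given a sorted list of boundaries; a minimal sketch (not the NVTabular implementation):

```python
import bisect

def bucketize(values, boundaries):
    """Assign each value the index of its bucket: len(boundaries)
    boundaries define len(boundaries) + 1 buckets. Values equal to a
    boundary fall into the upper bucket."""
    return [bisect.bisect_right(boundaries, v) for v in values]

print(bucketize([1, 5, 10], [3, 7]))
# [0, 1, 2]
```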

* test_minmix

* updates test

* unittest ops

* First draft get_embedding_sizes support

Re-add get_embedding_sizes. Note that this changes how we support multi-hot columns here
(sizes are returned the same as for single-hot, and we no longer use this method to
distinguish between multi-hot and single-hot columns)
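A toy version of such a method, using a common cardinality-to-width heuristic (the `1.6 * n**0.56` formula and the cap are illustrative assumptions, not necessarily what NVTabular uses):

```python
def get_embedding_sizes(cardinalities, max_size=512):
    """Map each categorical column's cardinality to an
    (input_dim, embedding_dim) pair. Multi-hot columns are treated
    the same as single-hot, as described above."""
    sizes = {}
    for col, card in cardinalities.items():
        # Rule-of-thumb width, capped at max_size and floored at 1
        dim = min(max_size, int(round(1.6 * card ** 0.56)))
        sizes[col] = (card, max(dim, 1))
    return sizes

print(get_embedding_sizes({"userId": 1000, "movieId": 50}))
```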

* isort

* Remove groupbystatistics

* implement serialization of statistics

add save_stats/load_stats/clear_stats methods to the workflow, with each StatOperator getting
called as appropriate
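The save/load pattern can be sketched with JSON; the method names mirror the ones above, but the classes and internals below are hypothetical:

```python
import json
import os
import tempfile

class MinMaxStats:
    """Hypothetical stat operator owning its fitted state."""
    def __init__(self, name):
        self.name, self.stats = name, {}

    def fit(self, values):
        self.stats = {"min": min(values), "max": max(values)}

class Workflow:
    def __init__(self, ops):
        self.ops = ops

    def save_stats(self, path):
        # Each op contributes its stats under its own key
        with open(path, "w") as f:
            json.dump({op.name: op.stats for op in self.ops}, f)

    def load_stats(self, path):
        with open(path) as f:
            blob = json.load(f)
        for op in self.ops:
            op.stats = blob.get(op.name, {})

    def clear_stats(self):
        for op in self.ops:
            op.stats = {}

op = MinMaxStats("age")
op.fit([10, 20, 30])
wf = Workflow([op])
path = os.path.join(tempfile.mkdtemp(), "stats.json")
wf.save_stats(path)
wf.clear_stats()
wf.load_stats(path)
print(op.stats)
# {'min': 10, 'max': 30}
```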

* Fix TF dataloader unittests

* test_torch_dataloader fixes

* doc strings

* aws sagemaker

* Update cloud_integration.md

Co-authored-by: Ben Frederickson <github@benfrederickson.com>
Co-authored-by: rnyak <ronayak@hotmail.com>
Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
Co-authored-by: root <root@dgx06.aselab.nvidia.com>
Co-authored-by: Karl Higley <kmhigley@gmail.com>
6 people authored Mar 23, 2022
1 parent 9d87ffa commit 4a26637
Showing 2 changed files with 78 additions and 0 deletions.
22 changes: 22 additions & 0 deletions conda/environments/nvtabular_aws_sagemaker.yml
@@ -0,0 +1,22 @@
# Based on https://github.com/NVIDIA-Merlin/NVTabular/blob/main/conda/environments/nvtabular_dev_cuda11.0.yml
name: nvtabular
channels:
- rapidsai
- nvidia
- conda-forge
- defaults
dependencies:
- nvtabular
- python>=3.7
- cudatoolkit=11.0
- cudf>=21.08.*
- dask-cuda>=21.08.*
- dask-cudf>=21.08.*
- rmm>=21.08.*
- dask==2021.7.1
- distributed>=2021.7.1
- nvtx>=0.2.1
- numba>=0.53.0
- dlpack
- scikit-learn
- asvdb
56 changes: 56 additions & 0 deletions docs/source/resources/cloud_integration.md
@@ -144,3 +144,59 @@ To run NVTabular on Databricks, do the following:

9. Select a GPU node for the Worker and Driver.
Once the Databricks cluster is up, NVTabular will be running inside of it.

## AWS SageMaker ##

[AWS SageMaker](https://aws.amazon.com/sagemaker/) is an AWS service to "build, train and deploy machine learning" models, automating and managing the MLOps workflow. It supports Jupyter notebook instances, enabling users to work directly in Jupyter Notebook/JupyterLab without any additional configuration. In this section, we explain how to run NVIDIA Merlin (NVTabular) on AWS SageMaker notebook instances. We adapted the work of [Eugene Yan](https://twitter.com/eugeneyan/) from his [Twitter post](https://twitter.com/eugeneyan/status/1470916049604268035). We tested the workflow on February 1st, 2022, but it is not integrated into our CI workflows, so future releases of Merlin or its dependencies can cause issues.

To run the [movielens example](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/getting-started-movielens) on AWS SageMaker, do the following:

1. Log in to your AWS console and select AWS SageMaker.

2. Select `Notebook` -> `Notebook instances` -> `Create notebook instance`. Give the instance a name and select a notebook instance type with GPUs; for example, we selected `ml.p3.2xlarge`. Please review the costs associated with each instance type. As the platform identifier, select `notebook-al2-v1`. The previous platform identifier runs with TensorFlow 2.1.x, and we had more issues updating it to TensorFlow 2.6.x. The `volume size` can be increased in the `Additional configuration` section.

3. After the instance is running, connect to JupyterLab.

4. Start a terminal to access the command line.

5. The image contains many conda environments, which together require ~60GB of disk space. You can remove some of them from `/home/ec2-user/anaconda3/envs/` to free disk space.

6. Clone the NVTabular repository and install the conda environment.

```
cd /home/ec2-user/SageMaker/
git clone https://github.com/NVIDIA-Merlin/NVTabular.git
conda env create -f=NVTabular/conda/environments/nvtabular_aws_sagemaker.yml
```

7. Activate the conda environment.

```
source /home/ec2-user/anaconda3/etc/profile.d/conda.sh
conda activate nvtabular
```

8. Install additional packages, such as TensorFlow or PyTorch.

```
pip install tensorflow-gpu
pip install torch
pip install graphviz
```

9. Install Transformers4Rec, torchmetrics, and ipykernel.

```
conda install -y -c nvidia -c rapidsai -c numba -c conda-forge transformers4rec
conda install -y torchmetrics ipykernel
```

10. Add the conda environment as an ipykernel.

```
python -m ipykernel install --user --name=nvtabular
```

11. In JupyterLab, switch the kernel to `nvtabular` and run the [movielens example](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/getting-started-movielens).

This workflow enables NVTabular ETL and training with TensorFlow or PyTorch. Deployment with Triton Inference Server will follow soon.
