This guide provides instructions for creating and managing a SageMaker HyperPod cluster and for training the AnimateAnyone algorithm on it.

aws-samples/video-generation-guidance-on-sagemaker-hyperpod

SageMaker HyperPod Cluster Creation and AnimateAnyone Training Guide

This guide provides instructions for creating and managing a SageMaker HyperPod cluster and for training the AnimateAnyone model on it. It is based on the SageMaker HyperPod workshop studio guidance and the Moore-AnimateAnyone repository.

Table of Contents

  1. Cluster Creation
  2. Cluster Access
  3. Train an AnimateAnyone model
  4. Inference
  5. Additional Resources

Cluster Creation

Lifecycle Scripts

Lifecycle scripts allow customization of your cluster during creation. They can be used to:

  • Install software packages
  • Set up configurations
  • Configure Slurm
  • Create users
  • Install Conda or Docker

To set up lifecycle scripts:

  1. Clone the repository and upload scripts to S3:
    git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
    cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/
    aws s3 cp --recursive base-config/ s3://${BUCKET}/src

Cluster Configuration

  1. Prepare cluster-config.json and provisioning_parameters.json files.
  2. Upload the configuration to S3:
    aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  3. Create the cluster:
    aws sagemaker create-cluster --cli-input-json file://cluster-config.json --region $AWS_REGION

Examples of cluster-config.json and provisioning_parameters.json can be found in ClusterConfig.
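
For orientation, the sketch below writes a minimal pair of files. The field names follow the `create-cluster` API and the provisioning-parameters format read by the HyperPod lifecycle scripts. Every value (cluster name, bucket, role ARN, instance types and counts, partition name, FSx identifiers) is a placeholder assumption, not a copy of what ClusterConfig contains. Optional settings such as `VpcConfig` are left out.

```shell
set -eu

# Placeholder values (assumptions): replace with your own.
BUCKET="${BUCKET:-my-hyperpod-bucket}"
ROLE_ARN="${ROLE_ARN:-arn:aws:iam::111122223333:role/my-hyperpod-role}"

# cluster-config.json: input for `aws sagemaker create-cluster --cli-input-json`.
cat > cluster-config.json <<EOF
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE_ARN}"
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.2xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE_ARN}"
    }
  ]
}
EOF

# provisioning_parameters.json: read by the lifecycle scripts to configure Slurm.
# The group names must match the InstanceGroupName values above.
cat > provisioning_parameters.json <<EOF
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "worker_groups": [
    { "instance_group_name": "worker-group-1", "partition_name": "dev" }
  ],
  "fsx_dns_name": "fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com",
  "fsx_mountname": "abcdefgh"
}
EOF
```

The `SourceS3Uri` points at the same `s3://${BUCKET}/src` prefix the lifecycle scripts were uploaded to, which is why `provisioning_parameters.json` is uploaded there as well.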

Scaling the Cluster

To increase worker instances:

  1. Set the new instance count in update-cluster-config.json.
  2. Run the update, with CLUSTER_NAME set to the name of your cluster:
    aws sagemaker update-cluster \
     --cluster-name "${CLUSTER_NAME}" \
     --instance-groups file://update-cluster-config.json \
     --region $AWS_REGION

An example of update-cluster-config.json can be found in ClusterConfig.
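
Because `--instance-groups` takes a JSON list of instance groups, the update file is shaped differently from `cluster-config.json`. The sketch below generates such a list for a given worker count. The helper name `write_update_config`, the group names, instance types, bucket, and role ARN are all hypothetical. Repeating the unchanged controller group in the list is also an assumption about what the update expects.

```shell
set -eu

# Placeholder values (assumptions): replace with your own.
BUCKET="${BUCKET:-my-hyperpod-bucket}"
ROLE_ARN="${ROLE_ARN:-arn:aws:iam::111122223333:role/my-hyperpod-role}"

# write_update_config COUNT: write update-cluster-config.json with COUNT workers.
write_update_config() {
  case "$1" in
    ''|*[!0-9]*)
      echo "instance count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat > update-cluster-config.json <<EOF
[
  {
    "InstanceGroupName": "controller-machine",
    "InstanceType": "ml.m5.12xlarge",
    "InstanceCount": 1,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://${BUCKET}/src",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "${ROLE_ARN}"
  },
  {
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.g5.2xlarge",
    "InstanceCount": $1,
    "LifeCycleConfig": {
      "SourceS3Uri": "s3://${BUCKET}/src",
      "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "${ROLE_ARN}"
  }
]
EOF
}

# Example: scale the worker group to 4 instances.
write_update_config 4
```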

Shutting Down the Cluster

aws sagemaker delete-cluster --cluster-name "${CLUSTER_NAME}" --region $AWS_REGION
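
Create, update, and delete requests return before the cluster reaches its final state. As a hedged sketch, the helper below polls `aws sagemaker describe-cluster` until `ClusterStatus` matches a target such as `InService`. The function name `wait_for_cluster_status` and its defaults are hypothetical, and it assumes the AWS CLI already has a region configured.

```shell
# wait_for_cluster_status NAME TARGET [INTERVAL_SECONDS] [MAX_TRIES]
# Polls describe-cluster until ClusterStatus equals TARGET.
# Returns non-zero on "Failed", on a CLI error, or when MAX_TRIES is used up.
wait_for_cluster_status() {
  local name target interval tries i status
  name="$1"
  target="$2"
  interval="${3:-30}"
  tries="${4:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status="$(aws sagemaker describe-cluster \
      --cluster-name "$name" \
      --query 'ClusterStatus' --output text)" || return 1
    echo "cluster $name: $status"
    if [ "$status" = "$target" ]; then return 0; fi
    if [ "$status" = "Failed" ]; then return 1; fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Usage after create-cluster or update-cluster:
#   wait_for_cluster_status ml-cluster InService
```

After a delete request the `describe-cluster` call eventually fails because the cluster no longer exists, so the helper returns non-zero in that case rather than reporting a status.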

Notes

  • SageMaker HyperPod supports Amazon FSx for Lustre integration, enabling full bi-directional synchronization with Amazon S3.
  • Ensure proper AWS CLI permissions and configurations.
  • Review and test configurations before production deployment.
  • Monitor cluster usage for cost and performance optimization.
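
One way to act on the CLI and configuration notes above is a small preflight check before running any of the commands in this section. The sketch below only verifies that the variables used in this guide are set. The helper name `require_vars` is hypothetical.

```shell
# require_vars NAME...: fail if any named environment variable is unset or empty.
require_vars() {
  local missing name value
  missing=0
  for name in "$@"; do
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before uploading scripts or creating the cluster:
#   require_vars BUCKET AWS_REGION && aws sts get-caller-identity
```

`aws sts get-caller-identity` in the usage line confirms which credentials the CLI is using. It does not check that those credentials carry the SageMaker and S3 permissions the commands need.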

Cluster Access

Follow the guidance on Accessing SageMaker HyperPod cluster nodes.

SSH into Controller Node

./easy-ssh.sh -c controller-machine ml-cluster
sudo su - ubuntu

For VS Code connection, follow this guide to set up an SSH Proxy via SSM.

SSH into Worker Node

On your first login to the controller node, generate an SSH key pair and authorize it so that you can SSH into worker nodes:

cd ~/.ssh
ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
cat id_rsa.pub >> authorized_keys

Allocate and access a worker node:

salloc -N 1
ssh $(srun hostname)

Install Miniconda (on worker node)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p ~/miniconda3
source ~/miniconda3/bin/activate
conda create -n videogen python=3.10
conda activate videogen

Useful Slurm Commands

  • List partitions and nodes: sinfo
  • List queued/running jobs: squeue
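
As a small illustration of scripting around `squeue`, the sketch below wraps two common queries. The function names `job_state` and `my_jobs` are hypothetical. The `squeue` options used (`-h`, `-j`, `-u`, `-o`) are standard Slurm flags.

```shell
# job_state JOBID: print the Slurm state (for example PENDING or RUNNING) of one job.
# squeue lists only queued and running jobs, so finished jobs are not reported.
job_state() {
  squeue -h -j "$1" -o '%T'
}

# my_jobs: list the current user's jobs with id, name, state, elapsed time, and node/reason.
my_jobs() {
  squeue -u "${USER}" -o '%.10i %.24j %.10T %.10M %R'
}

# Usage:
#   job_state 42
#   my_jobs
```

A job that is no longer needed can be cancelled with `scancel JOBID`.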

Running AnimateAnyone

Based on the Moore-AnimateAnyone repository.

Setup

  1. Activate the conda environment:

    source ~/miniconda3/bin/activate
    conda activate videogen
  2. Install required packages:

    pip install -r requirements.txt
  3. Download pre-trained weights:

    python tools/download_weights.py
  4. Test the training script:

    accelerate launch train_stage_1.py --config configs/train/stage1.yaml
    accelerate launch train_stage_2.py --config configs/train/stage2.yaml

Running Experiments

Single Node Job

Detailed instructions can be found here.

sbatch submit-animateanyone-algo.sh
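
The contents of submit-animateanyone-algo.sh are not reproduced in this guide. Purely as an illustration, a single-node Slurm batch script for stage 1 might look like the sketch below. The file name `example-single-node.sh`, the job name, the log path, and the assumption that the job is submitted from the repository checkout are all hypothetical.

```shell
set -eu

# Write a hypothetical batch script. It is NOT the repository's
# submit-animateanyone-algo.sh, only a sketch of the general shape.
cat > example-single-node.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=animateanyone-stage1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=logs/%x_%j.out

set -eo pipefail

# Activate the conda environment created during setup.
source ~/miniconda3/bin/activate
conda activate videogen

# Run stage 1 from the directory the job was submitted from.
cd "${SLURM_SUBMIT_DIR}"
accelerate launch train_stage_1.py --config configs/train/stage1.yaml
EOF

echo "wrote example-single-node.sh"
```

Slurm does not create the `logs/` directory, so run `mkdir -p logs` before `sbatch example-single-node.sh`. In the `--output` pattern, `%x` expands to the job name and `%j` to the job ID.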

Note: On smaller GPU instances (e.g., g5.2xlarge), set train_bs: 2, train_width: 256, and train_height: 256 to avoid out-of-memory errors. See an example configuration in AlgoSlurm.
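
A minimal sketch of applying those three settings without editing the original file: copy the config through `sed` and point the training command at the copy. The helper name `patch_small_gpu` and the demo file are hypothetical. The demo stands in for the real config, which has many more keys and whose default values are assumed here.

```shell
set -eu

# patch_small_gpu SRC DST: copy a training config, lowering batch size and resolution.
patch_small_gpu() {
  sed -E \
    -e 's/^([[:space:]]*train_bs:).*/\1 2/' \
    -e 's/^([[:space:]]*train_width:).*/\1 256/' \
    -e 's/^([[:space:]]*train_height:).*/\1 256/' \
    "$1" > "$2"
}

# Demo on a stand-in file with assumed starting values.
cat > demo-stage1.yaml <<'EOF'
data:
  train_bs: 4
  train_width: 768
  train_height: 768
EOF

patch_small_gpu demo-stage1.yaml demo-stage1-small.yaml
cat demo-stage1-small.yaml
```

Applied to the real file, this would be `patch_small_gpu configs/train/stage1.yaml configs/train/stage1-small.yaml`, followed by passing the new path to `--config`. Only lines whose key is exactly `train_bs`, `train_width`, or `train_height` are rewritten; indentation and all other lines are kept.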

Hyperparameter Testing

sbatch submit-hyperparameter-testing.sh
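
How submit-hyperparameter-testing.sh defines its search is not described here. One common pattern, shown as a hedged sketch, is to submit one Slurm job per parameter combination and pass the values through `sbatch --export`. The batch script name `my-sweep-job.sh`, the variable names `LEARNING_RATE` and `TRAIN_BS`, and the grid values are all hypothetical. The batch script would have to read those variables itself.

```shell
set -eu

# submit_sweep: one Slurm job per (learning rate, batch size) pair.
# With DRY_RUN=1 (the default here) the sbatch commands are printed, not run.
submit_sweep() {
  for lr in 1e-5 5e-5; do
    for bs in 1 2; do
      set -- sbatch --job-name="aa-lr${lr}-bs${bs}" \
        --export="ALL,LEARNING_RATE=${lr},TRAIN_BS=${bs}" \
        my-sweep-job.sh
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
      else
        "$@"
      fi
    done
  done
}

submit_sweep
```

Setting `DRY_RUN=0` submits the jobs for real. Encoding the parameters in the job name makes each run easy to tell apart in `squeue`.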

Multi-Node Job with DeepSpeed

Detailed instructions can be found here.

The folder contains the single-node multi-GPU setup, as well as the multi-node multi-GPU Slurm launch file.
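
For readers who want a picture of what a multi-node launch involves before opening that folder, the sketch below writes a hypothetical two-node batch script. It is not the repository's launch file. The file name, node count, GPUs per node, and port are assumptions. The `accelerate launch` flags are standard Accelerate options, with `--deepspeed_multinode_launcher standard` chosen because `srun` starts one launcher on each node.

```shell
set -eu

# Write a hypothetical multi-node batch script (a sketch, not the repository's file).
cat > example-multi-node.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=animateanyone-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --output=logs/%x_%j.out

source ~/miniconda3/bin/activate
conda activate videogen

# Assumption: one GPU per node (for example g5.2xlarge).
export GPUS_PER_NODE=1
# The first node in the allocation acts as the rendezvous host.
export HEAD_NODE="$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)"

# srun starts one launcher per node; SLURM_NODEID gives each node its rank.
srun bash -c 'accelerate launch \
  --use_deepspeed \
  --deepspeed_multinode_launcher standard \
  --num_machines "${SLURM_NNODES}" \
  --num_processes "$((SLURM_NNODES * GPUS_PER_NODE))" \
  --machine_rank "${SLURM_NODEID}" \
  --main_process_ip "${HEAD_NODE}" \
  --main_process_port 29500 \
  train_stage_1.py --config configs/train/stage1.yaml'
EOF

echo "wrote example-multi-node.sh"
```

`--num_processes` is the total across all nodes, which is why it is computed as nodes times GPUs per node. The command is single-quoted so that `SLURM_NODEID` is expanded on each node rather than once on the submitting node.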

Monitoring Experiments

Use MLflow for visualization:

mlflow ui --backend-store-uri ./mlruns/

Inference

You can either try quick inference on a g5.2xlarge SageMaker notebook instance by walking through the code in inference, or deploy an inference endpoint on SageMaker. Refer to the Inference README for more details.

Additional Resources

Recent advancements in video generation have rapidly overcome limitations of earlier models like Animate Anyone. Two notable research papers showcase significant progress in this domain:

  • Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance enhances shape alignment and motion guidance. It demonstrates superior ability in generating high-quality human animations that accurately capture both pose and shape variations, with improved generalization on in-the-wild datasets.
  • UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation enables the generation of longer videos, up to one minute, compared to earlier models' limited frame outputs. It introduces a unified noise input supporting both random noised input and first frame conditioned input, enhancing long-term video generation capabilities.

As research in this field rapidly progresses, SageMaker HyperPod proves valuable for AI research and experimentation. It provides the computational resources and flexibility needed to quickly implement and test new ideas. Its scalable infrastructure allows researchers to efficiently train and fine-tune large models, reducing the time from concept to implementation, and its integrated development environment streamlines the workflow, enabling faster iterations and more comprehensive experiments. With such cloud computing resources, researchers can push the boundaries of what is possible in video generation, potentially leading to breakthroughs in areas like virtual reality, film production, and interactive digital media.

Notes

  • Ensure proper GPU resources and CUDA setup before running experiments.
  • Adjust batch files and configurations as needed for your environment.
  • Regularly check the original repository for updates or changes.
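
As a hedged sketch of the GPU check in the first note, the helper below reports the GPUs visible on the current node, or fails if the NVIDIA driver tools are absent. The function name `check_gpus` is hypothetical. The `nvidia-smi` query flags are standard.

```shell
# check_gpus: print name and total memory of each visible GPU,
# or fail if nvidia-smi is not available on this node.
check_gpus() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: is this a GPU worker node?" >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
}

# Usage on a worker node:
#   check_gpus
```

Inside the `videogen` environment, `python -c "import torch; print(torch.cuda.is_available())"` additionally confirms that PyTorch can see the GPU.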

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
