GitHub - aws-samples/video-generation-guidance-on-sagemaker-hyperpod: This guide provides instructions for creating and managing a SageMaker Hyperpod cluster, and training the AnimateAnyone algorithm on SageMaker Hyperpod

SageMaker Hyperpod Cluster Creation and AnimateAnyone training Guide

This guide provides instructions for creating and managing a SageMaker Hyperpod cluster, as well as implementing the AnimateAnyone algorithm. It is based on the SageMaker Hyperpod workshop studio guidance and the Moore-AnimateAnyone repository.

Cluster Creation

Lifecycle Scripts

Lifecycle scripts allow customization of your cluster during creation. They can be used to:

Install software packages
Set up configurations
Configure Slurm
Create users
Install Conda or Docker

To set up lifecycle scripts:

Clone the repository and upload scripts to S3:

git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/
aws s3 cp --recursive base-config/ s3://${BUCKET}/src
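The upload command above depends on the BUCKET environment variable. If it is unset, the destination expands to s3:///src, which is not a valid target. A minimal sketch of a guard follows; the helper names require_env and upload_lifecycle_scripts are hypothetical and not part of the repository.

```shell
# Hypothetical helper: fail early if any named environment variable
# is unset or empty, and report each missing name on stderr.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: environment variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical wrapper: run the upload from this guide only when
# BUCKET is defined. Run it from the LifecycleScripts directory.
upload_lifecycle_scripts() {
  require_env BUCKET || return 1
  aws s3 cp --recursive base-config/ "s3://${BUCKET}/src"
}
```

Calling upload_lifecycle_scripts with BUCKET unset prints an error and returns a non-zero status without invoking the AWS CLI.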

Cluster Configuration

Prepare cluster-config.json and provisioning_parameters.json files.

Upload the configuration to S3:

aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Create the cluster:

aws sagemaker create-cluster --cli-input-json file://cluster-config.json --region $AWS_REGION

Examples of cluster-config.json and provisioning_parameters.json can be found in ClusterConfig.
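For orientation, the sketch below writes a skeleton cluster-config.json using the field names of the SageMaker CreateCluster request. It is not the file shipped in ClusterConfig. The function name write_cluster_config, the environment variables ROLE, SECURITY_GROUP, and SUBNET_ID, the worker group name, and both instance types are placeholder assumptions; the cluster name ml-cluster and the controller-machine group name are taken from the SSH command later in this guide. provisioning_parameters.json is not shown here.

```shell
# Hypothetical helper: write a skeleton cluster-config.json into the
# directory given as $1. Every value is a placeholder to be replaced.
write_cluster_config() {
  out_dir=$1
  cat > "${out_dir}/cluster-config.json" <<EOF
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE}",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.2xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://${BUCKET}/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "${ROLE}",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["${SECURITY_GROUP}"],
    "Subnets": ["${SUBNET_ID}"]
  }
}
EOF
}
```

The SourceS3Uri value points at the same s3://${BUCKET}/src prefix the lifecycle scripts were uploaded to, so the cluster can find on_create.sh at creation time.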

Scaling the Cluster

To increase worker instances:

Update update-cluster-config.json with the new instance count.

Run the following, with CLUSTER_NAME set to the name of your cluster:

aws sagemaker update-cluster \
 --cluster-name "${CLUSTER_NAME}" \
 --instance-groups file://update-cluster-config.json \
 --region $AWS_REGION

An example of update-cluster-config.json can be found in ClusterConfig.
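Creating or updating a cluster is asynchronous: the CLI call returns before the cluster is ready. The sketch below polls aws sagemaker describe-cluster until the ClusterStatus field reads InService. The function names, the retry count, and the delay are illustrative assumptions, and AWS_REGION is expected to be set as in the commands above.

```shell
# Print the current ClusterStatus of the cluster named in $1.
cluster_status() {
  aws sagemaker describe-cluster \
    --cluster-name "$1" \
    --region "${AWS_REGION}" \
    --query 'ClusterStatus' \
    --output text
}

# Hypothetical helper: poll until the cluster is InService.
# $1 = cluster name, $2 = max attempts (default 60),
# $3 = seconds between attempts (default 30).
wait_for_cluster() {
  name=$1
  tries=${2:-60}
  delay=${3:-30}
  status=""
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(cluster_status "$name") || return 1
    case "$status" in
      InService) echo "$status"; return 0 ;;
      Failed) echo "cluster status: $status" >&2; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out; last status: $status" >&2
  return 1
}
```

For example, wait_for_cluster ml-cluster blocks until the cluster is usable, and returns a non-zero status if the cluster reports Failed or the attempts run out.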

Shutting Down the Cluster

Delete the cluster, with CLUSTER_NAME set to the name of your cluster:

aws sagemaker delete-cluster --cluster-name "${CLUSTER_NAME}" --region $AWS_REGION

Notes

SageMaker HyperPod supports Amazon FSx for Lustre integration, enabling full bi-directional synchronization with Amazon S3.
Ensure proper AWS CLI permissions and configurations.
Review and test configurations before production deployment.
Monitor cluster usage for cost and performance optimization.

Cluster Access

Follow the guidance on Accessing SageMaker HyperPod cluster nodes.

SSH into Controller Node

./easy-ssh.sh -c controller-machine ml-cluster
sudo su - ubuntu

For VS Code connection, follow this guide to set up an SSH Proxy via SSM.

SSH into Worker Node

On first login to the controller node, generate an SSH key pair and authorize it:

cd ~/.ssh
ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
cat id_rsa.pub >> authorized_keys

Allocate and access a worker node:

salloc -N 1
ssh $(srun hostname)

Install Miniconda (on worker node)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p ~/miniconda3
source ~/miniconda3/bin/activate
conda create -n videogen python=3.10
conda activate videogen

Useful Slurm Commands

List partitions and nodes: sinfo
List queued/running jobs: squeue
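Before requesting an allocation with salloc, it can help to know how many nodes are free. The sketch below parses sinfo output for that purpose; the function name count_idle_nodes is hypothetical. It assumes each node belongs to a single partition, since node-oriented sinfo output lists a node once per partition.

```shell
# Hypothetical helper: print the number of nodes whose Slurm state is
# exactly "idle". -N lists one node per line, -h drops the header,
# and -o '%N %T' prints the node name and its long-form state.
count_idle_nodes() {
  sinfo -N -h -o '%N %T' | awk '$2 == "idle" { n++ } END { print n + 0 }'
}

# Related commands:
#   squeue -u "$USER"   list only your own jobs
#   scancel JOB_ID      cancel a queued or running job
```

A result of 0 means an salloc request will wait in the queue until a node frees up.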

Running AnimateAnyone

Based on the Moore-AnimateAnyone repository.

Setup

Activate the conda environment:

source ~/miniconda3/bin/activate
conda activate videogen

Install required packages:
```
pip install -r requirements.txt
```
Download pre-trained weights:
```
python tools/download_weights.py
```

Test the training script:

accelerate launch train_stage_1.py --config configs/train/stage1.yaml
accelerate launch train_stage_2.py --config configs/train/stage2.yaml

Running Experiments

Single Node Job

Detailed instructions can be found here.

sbatch submit-animateanyone-algo.sh
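For readers new to Slurm, the sketch below shows the general shape of a single-node batch script. It is an illustration only and is not the content of submit-animateanyone-algo.sh. The function name write_example_sbatch, the job name, and the #SBATCH choices are assumptions; the environment activation and the training command are the ones used earlier in this guide.

```shell
# Hypothetical helper: write an example single-node batch script to
# the path given as $1. The quoted 'EOF' keeps the body literal.
write_example_sbatch() {
  cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=animateanyone-stage1
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=%x_%j.out

# Activate the conda environment created earlier in this guide.
source ~/miniconda3/bin/activate
conda activate videogen

# Launch stage 1 training on the allocated node.
srun accelerate launch train_stage_1.py --config configs/train/stage1.yaml
EOF
}
```

In the --output pattern, %x expands to the job name and %j to the job ID, so each run writes its own log file in the submission directory.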

Note: For smaller GPU instances (e.g., G5 2xlarge), set train_bs: 2, train_width: 256, and train_height: 256 in the training configuration to avoid out-of-memory errors. See a configuration example in AlgoSlurm.
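The adjustment in the note can be applied to a copy of the training configuration rather than edited by hand. The sketch below assumes the three keys each appear on their own line in the YAML file, as in "train_bs: 4"; the function name make_small_gpu_config and the output file name are hypothetical.

```shell
# Hypothetical helper: copy the YAML config at $1 to $2, rewriting the
# three keys from the note above and preserving their indentation.
make_small_gpu_config() {
  src=$1
  dst=$2
  sed -E \
    -e 's/^([[:space:]]*train_bs:).*/\1 2/' \
    -e 's/^([[:space:]]*train_width:).*/\1 256/' \
    -e 's/^([[:space:]]*train_height:).*/\1 256/' \
    "$src" > "$dst"
}
```

For example, make_small_gpu_config configs/train/stage1.yaml configs/train/stage1-small.yaml leaves the original file untouched, and the new file can be passed to accelerate launch through --config.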

Hyperparameter Testing

sbatch submit-hyperparameter-testing.sh

Multi-Node Job with DeepSpeed

Detailed instructions can be found here.

The folder contains the single-node multi-GPU setup, as well as the multi-node multi-GPU Slurm launch file.

Monitoring Experiments

Use MLflow for visualization:

mlflow ui --backend-store-uri ./mlruns/

Inference

You can either try a quick inference on a SageMaker notebook instance (g5.2xlarge) by walking through the code in inference, or deploy an inference endpoint on SageMaker. Refer to the Inference README for more details.

Additional Resources

Recent advancements in video generation have rapidly overcome limitations of earlier models like Animate Anyone. Two notable research papers showcase significant progress in this domain:

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance enhances shape alignment and motion guidance. It demonstrates superior ability in generating high-quality human animations that accurately capture both pose and shape variations, with improved generalization on in-the-wild datasets.
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation enables the generation of longer videos, up to one minute, compared to earlier models' limited frame outputs. It introduces a unified noise input supporting both random noised input and first frame conditioned input, enhancing long-term video generation capabilities.

As research in this field rapidly progresses, SageMaker Hyperpod proves valuable for AI research and experimentation. It provides the computational resources and flexibility to quickly implement and test new ideas. Its scalable infrastructure allows researchers to efficiently train and fine-tune large models, reducing the time from concept to implementation, and its integrated development environment streamlines the workflow, enabling faster iterations and more comprehensive experiments. With such cloud computing resources, researchers can push the boundaries of video generation, potentially leading to breakthroughs in areas like virtual reality, film production, and interactive digital media.

Notes

Ensure proper GPU resources and CUDA setup before running experiments.
Adjust batch files and configurations as needed for your environment.
Regularly check the original repository for updates or changes.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
