SageMaker HyperPod
Managed GPU clusters with resilient training and automatic recovery
Overview
Amazon SageMaker HyperPod provides purpose-built infrastructure for distributed model training with built-in resilience, automatic node replacement, and integrated Slurm or EKS scheduling.
Architecture
The CloudFormation template deploys:
- HyperPod cluster with accelerated instances (GPU-based p5 and p4d, or AWS Trainium-based trn1)
- Integrated FSx for Lustre file system
- VPC with EFA-enabled networking
- Slurm or EKS orchestration layer
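Before deploying, it can help to see which parameters the template actually exposes. The sketch below wraps the standard `aws cloudformation validate-template` call; the helper name `template_params` is hypothetical, and the template is assumed to be a local file small enough to pass with `--template-body` (larger templates have to be uploaded to S3 and referenced with `--template-url`).

```shell
# Hypothetical helper: list the parameters a CloudFormation template declares,
# together with their default values and descriptions.
# Usage: template_params <path-to-template>
template_params() {
  aws cloudformation validate-template \
    --template-body "file://$1" \
    --query 'Parameters[].[ParameterKey,DefaultValue,Description]' \
    --output table
}
```

For example, `template_params hyperpod-cluster.yaml` would check the template's syntax and print its parameters, assuming AWS credentials and a default region are configured.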
Quick Deploy
aws cloudformation deploy \
    --template-file hyperpod-cluster.yaml \
    --stack-name my-hyperpod-cluster \
    --parameter-overrides InstanceType=ml.p5.48xlarge ClusterSize=8
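Once the deploy command returns, the result can be checked from the same shell. This is a minimal sketch, not part of the template itself: the helper names `stack_status` and `list_hyperpod_clusters` are hypothetical, and the `aws sagemaker list-clusters` call assumes an AWS CLI version recent enough to include the HyperPod cluster commands.

```shell
# Hypothetical helper: print the current status of a CloudFormation stack,
# for example CREATE_COMPLETE or ROLLBACK_COMPLETE.
# Usage: stack_status <stack-name>
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Hypothetical helper: list HyperPod clusters in the current region
# with their names and statuses.
list_hyperpod_clusters() {
  aws sagemaker list-clusters \
    --query 'ClusterSummaries[].[ClusterName,ClusterStatus]' \
    --output table
}
```

Running `stack_status my-hyperpod-cluster` followed by `list_hyperpod_clusters` shows whether the stack finished creating and whether the cluster it defines is visible to SageMaker.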