SageMaker HyperPod

Managed GPU clusters with resilient training and automatic recovery


Overview

Amazon SageMaker HyperPod provides purpose-built infrastructure for distributed model training with built-in resilience, automatic node replacement, and integrated Slurm or EKS scheduling.

Architecture

The CloudFormation template deploys:

  • HyperPod cluster with accelerated instances (GPU-based p5 and p4d, or Trainium-based trn1)
  • Integrated FSx for Lustre file system
  • VPC with EFA-enabled networking
  • Slurm or EKS orchestration layer

Quick Deploy

aws cloudformation deploy \
  --template-file hyperpod-cluster.yaml \
  --stack-name my-hyperpod-cluster \
  --parameter-overrides InstanceType=ml.p5.48xlarge ClusterSize=8
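
Once the deploy command returns, it can help to confirm that the stack and cluster exist. The following is a minimal sketch, not part of the template: it assumes the AWS CLI is installed and configured with credentials, and that the stack name matches the one used above. The DRY_RUN switch and the run helper are hypothetical conveniences added for illustration; with DRY_RUN=1 (the default here) the script only prints the commands it would execute.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stack name taken from the deploy command above; override via the environment.
STACK_NAME="${STACK_NAME:-my-hyperpod-cluster}"
# Hypothetical switch: 1 prints each command, 0 actually calls the AWS CLI.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

check_deployment() {
  # Report the CloudFormation stack status (for example CREATE_COMPLETE).
  run aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --query 'Stacks[0].StackStatus' \
    --output text

  # List the HyperPod clusters visible in the current account and region.
  run aws sagemaker list-clusters
}

check_deployment
```

Run it with DRY_RUN=0 to query AWS. Whether the new cluster appears in the list-clusters output, and under which name, depends on how the template names the cluster.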