Reference Architectures
CloudFormation templates and deployment guides for distributed training infrastructure.
🖥️
Compute
SageMaker HyperPod
Managed GPU clusters with resilient training and automatic recovery
AWS ParallelCluster
HPC cluster management with the Slurm scheduler for distributed training
Amazon EKS
Kubernetes-based orchestration for distributed training jobs
AWS Batch
Fully managed batch computing for distributed training workloads
AWS Parallel Computing Service
Managed Slurm service for HPC and ML workloads
🌐
Networking
💾