AWSome Distributed AI
Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.
Training Frameworks
Production-ready examples grouped by framework. Each example includes Dockerfiles, Slurm scripts, and Kubernetes manifests.
PyTorch
Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.
Megatron
NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.
JAX
Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.
AWS Neuron / Trainium
NeuronX Distributed for training on AWS Trainium and Inferentia chips with optimized compilers.
Physical AI & Robotics
Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA2, and vision-language-action models.
Inference & Serving
High-performance inference with vLLM and distributed serving for production deployments.
Reference Architectures
CloudFormation templates and deployment guides for every AWS compute platform.
Get Started in Minutes
Three steps to launch your first distributed training job.
Deploy Infrastructure
Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.
Build Container
Use our Dockerfiles to build a training container with your framework of choice.
Launch Training
Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
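The three steps above can be sketched as a short Python helper that assembles one command per step without running anything. This is an illustrative sketch, not the repository's own tooling: the stack name, template path, image tag, and launch file below are hypothetical placeholders, and the exact parameters each template or script expects will differ per example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchPlan:
    """Hypothetical names and paths; replace them with those of the example you choose."""
    stack_name: str = "my-training-cluster"   # assumed CloudFormation stack name
    template_file: str = "cluster.yaml"       # assumed path to a CloudFormation template
    image_tag: str = "my-training:latest"     # assumed container image tag
    dockerfile: str = "Dockerfile"            # assumed path to a Dockerfile
    scheduler: str = "slurm"                  # "slurm" or "kubernetes"
    launch_file: str = "train.sbatch"         # assumed Slurm script or Kubernetes manifest


def build_commands(plan: LaunchPlan) -> List[List[str]]:
    """Return one command (as an argument list) per step: deploy, build, launch."""
    deploy = [
        "aws", "cloudformation", "deploy",
        "--template-file", plan.template_file,
        "--stack-name", plan.stack_name,
        "--capabilities", "CAPABILITY_IAM",
    ]
    build = ["docker", "build", "-t", plan.image_tag, "-f", plan.dockerfile, "."]

    if plan.scheduler == "slurm":
        # Typically run on the cluster's login or head node.
        launch = ["sbatch", plan.launch_file]
    elif plan.scheduler == "kubernetes":
        launch = ["kubectl", "apply", "-f", plan.launch_file]
    else:
        raise ValueError(f"unsupported scheduler: {plan.scheduler!r}")

    return [deploy, build, launch]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in build_commands(LaunchPlan()):
        print(" ".join(command))
```

Building the commands as argument lists keeps the sketch side-effect free, so the plan can be reviewed before any resources are created. For a Kubernetes cluster, passing `scheduler="kubernetes"` together with a manifest path swaps the last step to `kubectl apply`.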