Open Source • MIT-0 License

AWSome Distributed AI

Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.

30+ Test Cases • 10 Architectures • 4 Frameworks • 1.5K Commits

Training Frameworks

Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.

Get Started in Minutes

Three steps to launch your first distributed training job.

1. Deploy Infrastructure

Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.

2. Build Container

Use our Dockerfiles to build a training container with your framework of choice.

3. Launch Training

Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
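
As a rough illustration of how the three steps fit together on a Slurm cluster, the sketch below assembles one command per step and prints them without running anything. It assumes the standard `aws cloudformation deploy`, `docker build`, and `sbatch` commands; the stack name, template file, image tag, Dockerfile name, and launch script are hypothetical placeholders, and the actual templates may require extra parameters that are not shown here.

```python
import shlex


def plan_first_job(stack_name, template_file, image_tag, dockerfile, launch_script):
    """Return the three steps as argv lists, in order. Nothing is executed."""
    return [
        # Step 1: deploy the cluster from a CloudFormation template.
        [
            "aws", "cloudformation", "deploy",
            "--stack-name", stack_name,
            "--template-file", template_file,
            "--capabilities", "CAPABILITY_NAMED_IAM",
        ],
        # Step 2: build the training container from a Dockerfile.
        ["docker", "build", "-t", image_tag, "-f", dockerfile, "."],
        # Step 3: submit the training job to Slurm.
        ["sbatch", launch_script],
    ]


# All values below are hypothetical placeholders, not names from the repository.
commands = plan_first_job(
    stack_name="my-training-cluster",
    template_file="cluster.yaml",
    image_tag="my-training-image:latest",
    dockerfile="Dockerfile",
    launch_script="train.sbatch",
)

for step, argv in enumerate(commands, start=1):
    print(f"Step {step}: {shlex.join(argv)}")
```

On Kubernetes, the third step would typically be `kubectl apply -f` with a job manifest instead of `sbatch`.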