Open Source • MIT-0 License

AWSome
Distributed AI

Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.

Explore Frameworks Getting Started

30+

Test Cases

Architectures

Frameworks

1.5K

Commits

Training Frameworks

Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.

🔥

PyTorch

Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.

FSDPDDPDeepSpeedTorchTitanPicotronvLLMTRLOpenRLHF

⚡

Megatron

NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.

Megatron-LMNeMoNeMo RLBioNeMo

🧬

JAX

Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.

PaxMLXLATPU/GPU

🧠

AWS Neuron / Trainium

NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.

NeuronXOptimum NeuronTrainium

🤖

Physical AI & Robotics

Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA2, and vision-language-action models.

Isaac LabOpenVLAV-JEPA 2nanoVLM

🎯

Reinforcement Learning

RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.

TRLvERLSLIMEPPODPO

🧪

Model Customisation

Knowledge distillation, compression, and model adaptation techniques for production.

DistillationCompressionTransfer Learning

Reference Architectures

CloudFormation templates and deployment guides for every AWS compute platform.

Kubernetes orchestration

Parallel Computing Service

Get Started in Minutes

Three steps to launch your first distributed training job.

Deploy Infrastructure

Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.

Build Container

Use our Dockerfiles to build a training container with your framework of choice.

Launch Training

Submit your job with Slurm or Kubernetes using our ready-made launch scripts.

AWSomeDistributed AI