Training Frameworks

Browse test cases and training examples organized by framework and library.

🔥

PyTorch

12 test cases

🔥

PyTorch FSDP

Fully Sharded Data Parallel training for large language models

FSDP, Sharding, Large Models, Multi-GPU

🔥

PyTorch DDP

Distributed Data Parallel training - the foundation for multi-GPU PyTorch

DDP, Data Parallel, Multi-GPU, Baseline

🔥

DeepSpeed

Microsoft DeepSpeed ZeRO optimizer for memory-efficient distributed training

DeepSpeed, ZeRO, Memory Efficient, Large Models

🔥

TorchTitan

PyTorch-native distributed training framework for production LLM pre-training

TorchTitan, Pre-training, 4D Parallelism, Production

🔥

Picotron

Lightweight distributed training library for educational and research use

Picotron, Lightweight, Educational, Research

🚀

vLLM

High-throughput LLM inference and serving engine

vLLM, Inference, Serving, PagedAttention

🔥

OpenRLHF

Open-source RLHF framework for training reward models and policy optimization

RLHF, PPO, DPO, Alignment

🛡️

NVRx

NVIDIA's resilient training toolkit for fault-tolerant distributed workloads

NVRx, Resilience, Fault Tolerance, Checkpointing

🤖

NVIDIA Isaac Lab

Sim-to-real robot learning with NVIDIA Isaac Lab on GPU clusters

Isaac Lab, Robotics, Sim2Real, Physical AI, Reinforcement Learning

🤖

OpenVLA OFT

Open Vision-Language-Action models with fine-tuning for robotic manipulation

OpenVLA, VLA, Robotics, Fine-tuning, Physical AI

🤖

nanoVLM

Lightweight vision-language model training for embodied AI

nanoVLM, VLM, Multimodal, Physical AI, Vision-Language

🤖

V-JEPA 2

Video Joint Embedding Predictive Architecture for physical world understanding

V-JEPA 2, Video, Self-supervised, Physical AI, World Models

⚡

Megatron / NeMo

4 test cases

🧬

JAX

1 test case

🧠

AWS Neuron

1 test case