Training Frameworks
Browse test cases and training examples organized by framework and library.
PyTorch
12 test cases
PyTorch FSDP
Fully Sharded Data Parallel training for large language models
PyTorch DDP
Distributed Data Parallel training, the foundation for multi-GPU PyTorch training
DeepSpeed
Microsoft DeepSpeed ZeRO optimizer for memory-efficient distributed training
TorchTitan
PyTorch native distributed training framework for production LLM pre-training
Picotron
Lightweight distributed training library for educational and research use
vLLM
High-throughput LLM inference and serving engine
OpenRLHF
Open-source RLHF framework for training reward models and policy optimization
NVRx
NVIDIA's resilient training toolkit for fault-tolerant distributed workloads
NVIDIA Isaac Lab
Sim-to-real robot learning with NVIDIA Isaac Lab on GPU clusters
OpenVLA OFT
Open Vision-Language-Action models with fine-tuning for robotic manipulation
nanoVLM
Lightweight vision-language model training for embodied AI
V-JEPA 2
Video Joint Embedding Predictive Architecture for physical world understanding
Megatron / NeMo
4 test cases
Megatron-LM
NVIDIA's framework for training multi-billion parameter transformer models
NVIDIA NeMo
End-to-end framework for building, training, and deploying AI models
NeMo RL
Reinforcement learning from human feedback with NeMo
BioNeMo
NVIDIA's framework for biomolecular AI model training