NVRx
NVIDIA's resilient training toolkit for fault-tolerant distributed workloads
Overview
NVRx is NVIDIA’s resilient training toolkit that provides fault tolerance, automatic recovery, and checkpoint management for large-scale distributed training workloads.
Key Features
- Automatic fault detection and recovery
- Efficient checkpointing strategies
- Node failure handling without job restart
- Integration with Slurm and HyperPod health monitoring
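
To make the checkpoint-and-recover idea behind these features concrete, here is a minimal sketch in plain Python. It does not use or represent NVRx's API: every name in it (`save_checkpoint`, `load_checkpoint`, `train`, `run_with_recovery`, the `fail_at` parameter) is hypothetical, and the "training step" is a stand-in that sums integers. It assumes a single process and a JSON file as the checkpoint, which is far simpler than a real distributed job.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the file in one step, so a crash never leaves a
    # half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path, num_steps, checkpoint_every=2, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    `fail_at` (hypothetical) simulates a fault by raising before that step runs.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for one training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


def run_with_recovery(path, num_steps, fail_at):
    """Run training; on a fault, retry once from the latest checkpoint."""
    try:
        return train(path, num_steps, fail_at=fail_at)
    except RuntimeError:
        # Recovery: the retry reloads the checkpoint instead of starting over,
        # so only the work done since the last save is repeated.
        return train(path, num_steps)
```

Calling `run_with_recovery(path, 6, fail_at=3)` fails once at step 3, reloads the checkpoint written at step 2, and finishes with the same result as an uninterrupted run. The checkpoint interval sets the trade-off: saving more often costs more I/O but loses less work per failure.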