NVRx

NVIDIA's resilient training toolkit for fault-tolerant distributed workloads

View on GitHub

Overview

NVRx is NVIDIA’s resilient training toolkit that provides fault tolerance, automatic recovery, and checkpoint management for large-scale distributed training workloads.

Key Features

  • Automatic fault detection and recovery
  • Efficient checkpointing strategies
  • Node failure handling without job restart
  • Integration with Slurm and HyperPod health monitoring