NeuronX Distributed

Distributed training on AWS Trainium with NeuronX

Overview

NeuronX Distributed is a library for distributed training on AWS Trainium chips. It builds on the AWS Neuron SDK and supports both tensor parallelism and pipeline parallelism.

Key Features

  • Tensor parallelism for Trainium
  • Pipeline parallelism
  • Gradient accumulation
  • Mixed precision with BF16
  • Integration with Hugging Face Optimum
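
To give a rough idea of what tensor parallelism means, here is a minimal conceptual sketch in plain Python. It does not use the NeuronX Distributed API; the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical and exist only for illustration. The sketch assumes a column-wise split of a linear layer's weight matrix, where each shard stands in for one device and the concatenation step stands in for an all-gather.

```python
# Conceptual sketch only: plain Python, NOT the NeuronX Distributed API.
# All function names below are hypothetical and used for illustration.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split w column-wise into equal shards, one per hypothetical device."""
    cols = len(w[0])
    assert cols % num_shards == 0, "column count must divide evenly across shards"
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Each shard computes a slice of the output columns independently.

    Concatenating the slices row by row plays the role of an all-gather.
    """
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]                      # batch of 2 inputs, 2 features
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # 2 x 4 weight matrix

full = matmul(x, w)                                   # single-device result
sharded = column_parallel_matmul(x, w, num_shards=2)  # two-shard result
print(full == sharded)
```

The sharded result matches the single-device result because each output column depends on only one column of the weight matrix, so the columns can be computed on separate devices without exchanging intermediate values.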