Prometheus & Grafana

Monitoring stack for GPU clusters with custom dashboards

View on GitHub

Overview

Deploy Prometheus and Grafana for monitoring GPU utilization, EFA traffic, training throughput, and cluster health.