Prometheus & Grafana
Monitoring stack for GPU clusters with custom dashboards
Overview
Deploy Prometheus and Grafana for monitoring GPU utilization, EFA traffic, training throughput, and cluster health.
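Once the stack is running, the metrics that feed the dashboards can also be read directly from Prometheus. The following is a minimal sketch, not part of the original project: it builds an instant-query URL for Prometheus's HTTP API and parses the JSON response. The address `http://localhost:9090`, the helper names, and the metric `DCGM_FI_DEV_GPU_UTIL` (exposed by NVIDIA's dcgm-exporter) are assumptions; substitute whatever your deployment actually uses.

```python
import json
from urllib.parse import urlencode

# Assumption: Prometheus is reachable at this address (9090 is its default port).
PROMETHEUS_URL = "http://localhost:9090"

# Assumption: GPU metrics come from NVIDIA's dcgm-exporter, which reports
# per-GPU utilization (percent) as DCGM_FI_DEV_GPU_UTIL.
GPU_UTIL_QUERY = "avg by (instance) (DCGM_FI_DEV_GPU_UTIL)"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for Prometheus's HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict[str, float]:
    """Map each series' `instance` label to its current sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    # Each result entry carries its labels in "metric" and a
    # [timestamp, "value-as-string"] pair in "value".
    return {
        series["metric"].get("instance", "unknown"): float(series["value"][1])
        for series in payload["data"]["result"]
    }
```

To run it against a live server, fetch `build_query_url(PROMETHEUS_URL, GPU_UTIL_QUERY)` with `urllib.request.urlopen` and pass the decoded body to `parse_instant_vector`. The same PromQL expression can be pasted into a Grafana panel.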