Google, AWS, and Microsoft provide advanced tools that help optimize performance and resource usage for AI workloads. Google offers High Level Operation (HLO) Execution Time Distribution Metrics, which give detailed timing breakdowns of compiled operations, while the HLO Queue Size tracks execution pipeline congestion. AWS provides Amazon CloudWatch, which offers end-to-end observability for training workloads running on Trainium and Inferentia, including critical metrics such as accelerator utilization, latency, throughput, and resource availability.
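As a minimal sketch of how such accelerator metrics might be queried programmatically, the helper below builds a CloudWatch `GetMetricData` query. The namespace and metric names used here (`NeuronMonitor`, `neuroncore_utilization`, `execution_latency`) are illustrative assumptions; actual names depend on how metric publishing is configured for your Trainium/Inferentia jobs.

```python
from datetime import datetime, timedelta, timezone

def build_metric_query(metric_name, namespace, period_seconds=60, stat="Average"):
    """Build one entry for CloudWatch's GetMetricData MetricDataQueries list.

    NOTE: namespace and metric names are assumptions for illustration;
    check your own metric publishing configuration for the real names.
    """
    return {
        "Id": metric_name.lower(),  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": metric_name},
            "Period": period_seconds,
            "Stat": stat,
        },
    }

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Hypothetical accelerator metrics to fetch for the last hour.
queries = [
    build_metric_query(name, "NeuronMonitor")
    for name in ("neuroncore_utilization", "execution_latency")
]

# With AWS credentials configured, these could be submitted via:
#   boto3.client("cloudwatch").get_metric_data(
#       MetricDataQueries=queries, StartTime=start, EndTime=end)
```

This only constructs the request payload; issuing it requires valid AWS credentials and a region where the metrics are actually published.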
Collection