Optimize Your AI Cloud Infrastructure: A Hardware Perspective - Liang Yan, CoreWeave
The GPU cloud has become a ubiquitous component of contemporary AI infrastructure, especially for distributed machine learning. Conversations about AI infrastructure optimization typically revolve around the application layer, such as machine learning tasks and distributed job schedulers, but looking under the hood of the GPU cloud is just as essential. Numerous factors, including the pod scheduler, device plugin, GPU/NUMA topology, and RoCE/NCCL stack, can significantly impact performance.
This session explores the tuning of various machine learning models (CNN/RNN/Transformer) from MLPerf, using an H100 cluster as a reference. We will analyze the correlation between model performance and the device operator configuration on nodes, presenting first-hand experimental results to unveil the hidden potential within an AI cloud.
The Linux Foundation