Exploring Distributed Caching for Faster GPU Training with NVMe, GDS, and RDMA - Hope Wang & Bin Fan, Alluxio
As GPUs become increasingly powerful, the separation between compute and storage often leaves GPUs underutilized while they wait for data. Meanwhile, modern high-performance hardware such as NVMe storage and RDMA networks (InfiniBand or specialized NICs) is becoming more widespread. To fully leverage these resources, it’s crucial to build a balanced architecture that avoids GPU underutilization.
In this talk, we will explore various strategies to address this challenge by effectively utilizing these advanced hardware components. Specifically, we will present experimental results from building a Kubernetes-native distributed caching layer, utilizing NVMe storage and high-speed RDMA networks to optimize data access for PyTorch training.
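As a rough illustration of the data-access pattern described above (a minimal sketch, not code from the talk), the snippet below shows a PyTorch data pipeline reading training samples from a POSIX mount backed by a distributed cache layer; the mount path /mnt/alluxio-fuse/..., the .pt file layout, and the loader settings are illustrative assumptions.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical mount point where the distributed cache (e.g., a FUSE mount
# backed by local NVMe) exposes the training dataset as local files.
CACHE_MOUNT = "/mnt/alluxio-fuse/datasets/train"

class CachedTensorDataset(Dataset):
    """Reads pre-serialized (sample, label) tensors from the cache mount."""

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # A cache hit is served from local NVMe; a miss is fetched from the
        # origin store over the high-speed (RDMA-capable) network.
        sample, label = torch.load(self.paths[idx])
        return sample, label

if __name__ == "__main__":
    loader = DataLoader(
        CachedTensorDataset(CACHE_MOUNT),
        batch_size=256,
        num_workers=8,       # parallel readers help saturate NVMe/network bandwidth
        pin_memory=True,     # pinned host buffers speed up host-to-GPU copies
        prefetch_factor=4,   # keep batches queued so the GPU is not waiting on I/O
    )
    for samples, labels in loader:
        samples = samples.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        # ... forward/backward pass ...
```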