Deploying scalable and reliable AI inference on Google Cloud
Learn how to deploy scalable and reliable AI inference workloads on Google Cloud for millions of users. This video outlines a comprehensive architecture focused on multi-region deployments, treating services as disposable, and building in robust observability. Discover how to identify and overcome performance bottlenecks, leverage frameworks like vLLM for efficiency, and use Google Cloud storage solutions such as Cloud Storage FUSE with Anywhere Cache and Managed Lustre. We also explore the GKE Inference Reference Architecture and the model-aware GKE Inference Gateway for intelligent routing.
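For a concrete flavor of the vLLM-plus-Cloud Storage FUSE pattern covered in the video, here is a minimal Python sketch. The mount path and model directory below are hypothetical; it assumes the weights bucket has already been mounted via Cloud Storage FUSE (for example, with the GKE Cloud Storage FUSE CSI driver) and that vLLM is installed:

from vllm import LLM, SamplingParams

# Hypothetical path: a GCS bucket mounted via Cloud Storage FUSE,
# so vLLM can read the weights as ordinary local files.
MODEL_PATH = "/gcs/models/llama-3-8b"

llm = LLM(model=MODEL_PATH)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain multi-region AI inference in one sentence."], params
)
for out in outputs:
    print(out.outputs[0].text)

Because FUSE exposes the bucket as a filesystem, no download step is needed at startup; pairing the mount with Anywhere Cache keeps repeat reads of the weights close to the accelerators.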
Chapters:
0:00 - Introduction to AI inference challenges
0:16 - Building reliable AI deployments
1:13 - Optimizing AI inference performance
2:23 - Strategies for scalable AI storage
3:18 - Introducing the GKE Inference Reference Architecture
3:35 - GKE Inference Gateway capabilities
4:00 - Deploying AI workloads with confidence
Resources:
High-performance parallel file system → https://goo.gle/ra-managed-lustre
Optimize AI and ML workloads with Cloud Storage FUSE → https://goo.gle/ra-gcs-fuse
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #GCSFUSE #CloudStorage #Lustre
Speaker: Don McCasland
Products Mentioned: AI Infrastructure, Cloud Storage