
Serving AI models at scale with vLLM
Unlock the full potential of your AI models by serving them at scale with vLLM. This video addresses common challenges such as memory inefficiency, high latency under load, and large model sizes, showing how vLLM maximizes throughput from your existing hardware. Discover vLLM's innovative features, including PagedAttention, Prefix Caching, multi-host serving, and disaggregated serving, and learn how it integrates with Google Cloud GPUs and TPUs for flexible, high-performance AI inference.
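As a concrete starting point, here is a minimal sketch of serving a model with vLLM's offline Python API, with prefix caching enabled. The model name and parameter values are illustrative assumptions, not taken from the video.

# Minimal vLLM sketch (model name and values are illustrative assumptions)
from vllm import LLM, SamplingParams

# enable_prefix_caching reuses KV-cache blocks across shared prompt prefixes;
# gpu_memory_utilization is one of the tunable parameters the video mentions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical example model
    enable_prefix_caching=True,
    gpu_memory_utilization=0.90,
)

prompts = ["Summarize why PagedAttention improves serving throughput."]
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() returns one RequestOutput per prompt
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)

For an online, OpenAI-compatible endpoint, vLLM also ships a server entry point (vllm serve <model>), which exposes the same engine over HTTP.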
Chapters:
0:00 - Introduction: The Challenge of Scaling AI
0:25 - 3 Common Issues
1:01 - Solution: vLLM for Performant Serving
1:13 - vLLM Feature: PagedAttention
1:30 - vLLM Feature: Prefix Caching
1:46 - vLLM Feature: Multi-Host and Disaggregated Serving
2:07 - vLLM Support on Google Cloud (GPUs & TPUs)
2:29 - vLLM Tunable Parameters
2:46 - Conclusion
Resources:
Welcome to vLLM → https://goo.gle/49zlRZN
TPU Inference GitHub → https://goo.gle/3JUkBpn
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #vLLM #AIInfrastructure
Speakers: Don McCasland
Products Mentioned: AI Infrastructure, Tensor Processing Units, Cloud GPUs