
Serving AI models at scale with vLLM
Unlock the full potential of your AI models by serving them at scale with vLLM. This video addresses common challenges such as memory inefficiency, high latency under load, and large model sizes, showing how vLLM maximizes throughput from your existing hardware. Discover vLLM's innovative features, including PagedAttention, Prefix Caching, multi-host serving, and disaggregated serving, and learn how it integrates with Google Cloud GPUs and TPUs for flexible, high-performance AI inference.
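As a concrete starting point, here is a minimal sketch of serving a model with vLLM's offline Python API, with prefix caching enabled. The model name and parameter values are illustrative assumptions, not taken from the video.

# Minimal vLLM sketch (model name and values are illustrative assumptions)
from vllm import LLM, SamplingParams

# enable_prefix_caching reuses KV-cache blocks across shared prompt prefixes;
# gpu_memory_utilization is one of the tunable parameters the video mentions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical example model
    enable_prefix_caching=True,
    gpu_memory_utilization=0.90,
)

prompts = ["Summarize why PagedAttention improves serving throughput."]
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate() returns one RequestOutput per prompt
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)

For an online, OpenAI-compatible endpoint, vLLM also ships a server entry point (vllm serve <model>), which exposes the same engine over HTTP.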
Chapters:
0:00 - Introduction: The Challenge of Scaling AI
0:25 - 3 Common Issues
1:01 - Solution: vLLM for Performant Serving
1:13 - vLLM Feature: PagedAttention
1:30 - vLLM Feature: Prefix Caching
1:46 - vLLM Feature: Multi-Host and Disaggregated Serving
2:07 - vLLM Support on Google Cloud (GPUs & TPUs)
2:29 - vLLM Tunable Parameters
2:46 - Conclusion
Resources:
Welcome to vLLM → https://goo.gle/49zlRZN
TPU Inference GitHub → https://goo.gle/3JUkBpn
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #vLLM #AIInfrastructure
Speakers: Don McCasland
Products Mentioned: AI Infrastructure, Tensor Processing Units, Cloud GPUs