Self-host Gemma 4: Deploy LLMs on Cloud Run GPUs
GCP credit → https://goo.gle/handson-ep7-lab1
Lab → https://goo.gle/guardians
In this episode, we deploy Google's Gemma 4 model to Cloud Run in two completely different ways, each with real trade-offs you need to understand before choosing one for production.
• Ollama — model baked into the container image. Instant cold starts. Rebuild to update.
⚡ vLLM — model mounted from Cloud Storage via FUSE. Slower first boot, but swap models without redeploying.
Both use Cloud Run GPUs, scale to zero, and ship through automated CI/CD with Cloud Build.
We build both. You decide which fits.
• CI/CD with Cloud Build
• GPU-accelerated serverless inference
• Baked-in vs. decoupled model architecture
• Scale to zero
⚖️ Cold start speed vs. production agility
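The two approaches above differ mainly in how the model weights reach the container. A rough sketch of what each deploy might look like — service names, the image paths, and the bucket name are placeholders, and exact flags can vary by gcloud version and region:

```shell
# Approach 1 (Ollama): weights are baked into the container image at build
# time, so cold starts are fast, but every model update means rebuilding
# and redeploying the image (which is what the Cloud Build pipeline automates).
gcloud run deploy gemma-ollama \
  --image us-docker.pkg.dev/MY_PROJECT/repo/ollama-gemma:latest \
  --gpu 1 --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --region us-central1

# Approach 2 (vLLM): weights live in a Cloud Storage bucket mounted into the
# container via Cloud Storage FUSE, so you can swap models by changing the
# bucket contents — no image rebuild, at the cost of a slower first boot.
gcloud run deploy gemma-vllm \
  --image vllm/vllm-openai:latest \
  --gpu 1 --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --add-volume name=models,type=cloud-storage,bucket=MY_MODEL_BUCKET \
  --add-volume-mount volume=models,mount-path=/models \
  --region us-central1
```

Either way the service can scale to zero between requests, so you only pay for GPU time while the model is actually serving.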
Chapters:
0:00 - Intro
6:08 - Getting started with Agentverse lab
7:57 - Laying the foundations of the citadel
16:07 - Forging the power core: Self-hosted LLMs
28:02 - Forging the citadel's central core: Deploy vLLM
43:59 - Summary
More resources:
Cloud Run GPU documentation → https://goo.gle/4sEbTvG
Ollama documentation → https://goo.gle/3Qdi64w
vLLM documentation → https://goo.gle/4cvvxE9
Cloud Storage FUSE → https://goo.gle/4cQAb0V
Watch more Hands on AI → https://www.youtube.com/watch?v=qCBreTfjFHQ&list=PLIivdWyY5sqKnJOvP89yF8t9mWuzMTcbM
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#Gemma4 #CloudRun
Speakers: Ayo Adedeji, Annie Wang
Products Mentioned: Agent Development Kit, Gemini API, Cloud Run
Google Cloud Tech
Helping you build what's next with secure infrastructure, developer tools, APIs, data analytics and machine learning…