Self-host Gemma 4: Deploy LLMs on Cloud Run GPUs
GCP credit → https://goo.gle/handson-ep7-lab1
Lab → https://goo.gle/guardians
In this episode, we deploy Google's Gemma 4 model to Cloud Run in two completely different ways, each with real trade-offs you need to understand before choosing one for production.
• Ollama — model baked into the container image. Instant cold starts. Rebuild to update.
⚡ vLLM — model mounted from Cloud Storage via FUSE. Slower first boot, but swap models without redeploying.
Both use Cloud Run GPUs, scale to zero, and ship through automated CI/CD with Cloud Build.
We build both. You decide which fits.
• CI/CD with Cloud Build
• GPU-accelerated serverless inference
• Baked-in vs. decoupled model architecture
• Scale to zero
⚖️ Cold start speed vs. production agility
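The two approaches above differ mainly in how the model weights reach the container. A rough sketch of what each deploy might look like — service names, the image paths, and the bucket name are placeholders, and exact flags can vary by gcloud version and region:

```shell
# Approach 1 (Ollama): weights are baked into the container image at build
# time, so cold starts are fast, but every model update means rebuilding
# and redeploying the image (which is what the Cloud Build pipeline automates).
gcloud run deploy gemma-ollama \
  --image us-docker.pkg.dev/MY_PROJECT/repo/ollama-gemma:latest \
  --gpu 1 --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --region us-central1

# Approach 2 (vLLM): weights live in a Cloud Storage bucket mounted into the
# container via Cloud Storage FUSE, so you can swap models by changing the
# bucket contents — no image rebuild, at the cost of a slower first boot.
gcloud run deploy gemma-vllm \
  --image vllm/vllm-openai:latest \
  --gpu 1 --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --add-volume name=models,type=cloud-storage,bucket=MY_MODEL_BUCKET \
  --add-volume-mount volume=models,mount-path=/models \
  --region us-central1
```

Either way the service can scale to zero between requests, so you only pay for GPU time while the model is actually serving.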
Chapters:
0:00 - Intro
6:08 - Getting started with Agentverse lab
7:57 - Laying the foundations of the citadel
16:07 - Forging the power core: Self-hosted LLMs
28:02 - Forging the citadel's central core: Deploy vLLM
43:59 - Summary
More resources:
Cloud Run GPU documentation → https://goo.gle/4sEbTvG
Ollama documentation → https://goo.gle/3Qdi64w
vLLM documentation → https://goo.gle/4cvvxE9
Cloud Storage FUSE → https://goo.gle/4cQAb0V
Watch more Hands on AI → https://www.youtube.com/watch?v=qCBreTfjFHQ&list=PLIivdWyY5sqKnJOvP89yF8t9mWuzMTcbM
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#Gemma4 #CloudRun
Speakers: Ayo Adedeji, Annie Wang
Products Mentioned: Agent Development Kit, Gemini API, Cloud Run
Google Cloud Tech
Helping you build what's next with secure infrastructure, developer tools, APIs, data analytics and machine learning…