How to autoscale a TGI deployment on GKE
Tutorial: Configure autoscaling for TGI on GKE → https://goo.gle/3Z9a7WK
Learn more about observability on GKE → https://goo.gle/4951bWY
Hugging Face TGI (Text Generation Inference) → https://goo.gle/4hXScLk
Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs. TGI is production-ready, with built-in support for observability and metrics. Watch along as Googlers Wietse Venema and Abdel Sghiouar demonstrate how to autoscale TGI workloads on Google Kubernetes Engine (GKE) using TGI queue size as the scaling signal.
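A minimal sketch of the approach shown in the video: a HorizontalPodAutoscaler that scales the TGI Deployment on TGI's queue-size metric. The Deployment name and target value below are illustrative assumptions; the tutorial linked above covers exporting TGI's Prometheus metrics to GKE so the HPA can consume them.

```yaml
# Illustrative HorizontalPodAutoscaler: scale a TGI Deployment on queue size.
# Assumes TGI's tgi_queue_size Prometheus gauge is available through a
# custom-metrics adapter; names and targets here are hypothetical examples.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tgi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-server          # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size  # TGI's queue-size gauge, via the custom-metrics API
      target:
        type: AverageValue
        averageValue: "10"    # example target: ~10 queued requests per replica
```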
More resources:
Learn more about the TGI architecture → https://goo.gle/3Oo8mzY
A deep dive into autoscaling LLM workloads on GKE → https://goo.gle/4fKpD2t
Watch more Google Cloud: Building with Hugging Face → https://goo.gle/BuildWithHuggingFace
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #HuggingFace
Speakers: Wietse Venema, Abdel Sghiouar
Products Mentioned: Google Kubernetes Engine, Gemma
Google Cloud Tech
Helping you build what's next with secure infrastructure, developer tools, APIs, data analytics and machine learning....