How to autoscale a TGI deployment on GKE
Tutorial: Configure autoscaling for TGI on GKE → https://goo.gle/3Z9a7WK
Learn more about observability on GKE → https://goo.gle/4951bWY
Hugging Face TGI (Text Generation Inference) → https://goo.gle/4hXScLk
Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs. TGI is production-ready, with built-in support for observability and metrics. Watch along as Googlers Wietse Venema and Abdel Sghiouar demonstrate how to autoscale TGI workloads on Google Kubernetes Engine (GKE) using TGI queue size as the scaling signal.
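A minimal sketch of the approach shown in the video: a HorizontalPodAutoscaler that scales the TGI Deployment on TGI's queue-size metric. The Deployment name and target value below are illustrative assumptions; the tutorial linked above covers exporting TGI's Prometheus metrics to GKE so the HPA can consume them.

```yaml
# Illustrative HorizontalPodAutoscaler: scale a TGI Deployment on queue size.
# Assumes TGI's tgi_queue_size Prometheus gauge is available through a
# custom-metrics adapter; names and targets here are hypothetical examples.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tgi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tgi-server          # hypothetical Deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: tgi_queue_size  # TGI's queue-size gauge, via the custom-metrics API
      target:
        type: AverageValue
        averageValue: "10"    # example target: ~10 queued requests per replica
```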
More resources:
Learn more about the TGI architecture → https://goo.gle/3Oo8mzY
A deep dive into autoscaling LLM workloads on GKE → https://goo.gle/4fKpD2t
Watch more Google Cloud: Building with Hugging Face → https://goo.gle/BuildWithHuggingFace
Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech
#GoogleCloud #HuggingFace
Speakers: Wietse Venema, Abdel Sghiouar
Products Mentioned: Google Kubernetes Engine, Gemma
Google Cloud Tech
Helping you build what's next with secure infrastructure, developer tools, APIs, data analytics and machine learning....