Unlocking Local LLMs with Quantization - Marc Sun, Hugging Face
This talk will share the story of quantization, its rise in popularity, and its current status in the open-source community. We'll begin by reviewing key quantization papers, such as QLoRA by Tim Dettmers and GPTQ by Elias Frantar. Next, we'll demonstrate how quantization can be applied at various stages of model development, including pre-training, fine-tuning, and inference. Specifically, we'll share our experience in pre-training a 1.58-bit model, show how fine-tuning is achievable using PEFT + QLoRA, and discuss optimizing inference performance with torch.compile or custom kernels. Finally, we'll highlight efforts within the community to make quantized models more accessible, including how transformers incorporate state-of-the-art quantization schemes and how to run GGUF models from llama.cpp.
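To make the fine-tuning part of the abstract concrete, below is a minimal sketch of a QLoRA-style setup with transformers + PEFT, assuming bitsandbytes and a CUDA GPU are available; the model id, LoRA rank, and target modules are illustrative choices, not values given in the talk.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical base model for illustration

# 4-bit NF4 quantization with double quantization, as introduced in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized base weights and attach trainable low-rank adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

For the GGUF part of the abstract, recent transformers releases can also load llama.cpp GGUF checkpoints by passing a gguf_file argument to from_pretrained; exact support depends on the installed version.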