Videos tagged with "multi query attention" (1 video)
Why Modern LLMs Use GQA | Multi Query and Grouped Query Attention Visually Explained
Why do modern LLMs like Llama, Qwen, Gemma and Gemini use Grouped-Query Attention (GQA) instead of standard Multi-Head Attention (MHA)? In this video we build a complete intuition for Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), two important attention optimizations used in modern large language models. We look at how KV cache memory and memory bandwidth become major bottlenecks during autoregressive decoding and LLM inference, and why transformer architectures moved toward MQA and GQA for faster inference and a smaller KV cache. We visually explain Multi-Query Attention, Grouped-Query Attention and the spectrum between MHA, MQA and GQA, including how shared key and value projections work across attention heads, and we compare MQA vs GQA performance, KV cache memory consumption, decoding latency and inference efficiency.

The second half of the video focuses on implementation in PyTorch: we start from a baseline Multi-Head Attention implementation and modify it step by step into Multi-Query Attention and then Grouped-Query Attention. At the end of the video we go through the uptraining techniques proposed for converting existing multi-head attention checkpoints into MQA/GQA models.

⏱️ Timestamps:
00:00 Intro - KV Cache Memory & Bandwidth Bottleneck
02:46 Multi-Query Attention (MQA) Explained
03:57 Intuition for what MQA attention heads learn
06:18 Spectrum of MHA, MQA and GQA
07:48 Grouped-Query Attention (GQA) Explained
09:54 KV Cache Size Comparisons
11:04 MQA vs GQA vs MHA Performance Comparisons
11:51 Baseline Multi-Head Attention (MHA) Implementation
13:40 MQA Implementation in PyTorch
15:51 GQA Implementation in PyTorch
17:54 Uptraining MQA/GQA Models

📖 Resources:
MQA Paper - https://arxiv.org/pdf/1911.02150
GQA Paper - https://arxiv.org/pdf/2305.13245

🔔 Subscribe: https://tinyurl.com/exai-channel-link
Email - explainingai.official@gmail.com
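For readers skimming before watching, here is a minimal PyTorch sketch of the Grouped-Query Attention idea the implementation section walks through. This is not the video's own code; the class name GroupedQueryAttention and the num_kv_heads parameter are illustrative assumptions. Setting num_kv_heads equal to num_heads recovers standard MHA, and setting it to 1 recovers MQA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Sketch of GQA: queries keep num_heads heads, keys/values share num_kv_heads heads."""
    def __init__(self, d_model, num_heads, num_kv_heads):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        # K and V are projected to only num_kv_heads heads, which is what shrinks the KV cache
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # During inference, k and v would be cached here, before broadcasting to query heads.
        # Each group of (num_heads // num_kv_heads) query heads shares one K/V head.
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)

if __name__ == "__main__":
    # 8 query heads sharing 2 K/V heads (group size 4)
    attn = GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=2)
    y = attn(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

Because only num_kv_heads key/value heads need to be cached per layer during decoding, a configuration with 32 query heads and 8 K/V heads stores a 4x smaller KV cache than its MHA counterpart; this is the memory and bandwidth saving the video's KV cache size comparison quantifies.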