Thank you for this insightful and visually engaging guide on quantization, Maarten—it's a fantastic resource!
I noticed a small typo and wanted to let you know! Thank you for the great article—I really appreciate it.
Original:
In practice, we do not need to map the entire FP32 range [-3.4e38, 3.4e38] into INT8. We merely need to find a way to map the range of our data (the model’s parameters) into IN8.
Correction:
In practice, we do not need to map the entire FP32 range [-3.4e38, 3.4e38] into INT8. We merely need to find a way to map the range of our data (the model’s parameters) into INT8.
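For anyone who wants to see what that mapping looks like in practice, here is a minimal absmax-quantization sketch with made-up toy values (not taken from the article):

```python
import numpy as np

# Toy "weights" standing in for a model's parameters (made-up values).
weights = np.array([-1.8, -0.4, 0.0, 0.7, 2.3], dtype=np.float32)

# We only need to cover the range of our data, not the full FP32 range.
# Absmax quantization: scale by the largest absolute value so everything
# lands in the signed INT8 range [-127, 127].
scale = 127 / np.max(np.abs(weights))
quantized = np.round(weights * scale).astype(np.int8)

# Dequantize to see the (lossy) reconstruction.
dequantized = quantized.astype(np.float32) / scale

print(quantized)    # [-99 -22   0  39 127]
print(dequantized)  # close to, but not exactly, the original values
```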
Thanks again for sharing this insightful piece! 😊
Wow! What a beautiful and insightful article.
Thank you very much, Maarten Grootendorst!
Your explanation is very good 🤗!
Cool! How do you create these visuals and how much time did it take you? They look very nice!
It was easy to understand yet quite insightful. I will definitely check out the other articles. Thanks for writing such good content!
AutoRound Qwen3-VL & vLLM AWQ compatibility?
I’ve successfully quantized Qwen3-VL-8B using AutoRound (W4A16). The base model runs perfectly in vLLM (v0.13) with default settings:
✅ Working: https://huggingface.co/Vishva007/Qwen3-VL-8B-Instruct-W4A16-AutoRound
However, when I try to run the AWQ-exported variant or force `--quantization awq` in vLLM, it crashes with a `KeyError: 'merger.linear_fc1.weight'`. It seems vLLM's optimized loader expects the unquantized vision-merger layers at `merger.*`, but AutoRound saves them as `model.visual.merger.*`.
❌ Failing: https://huggingface.co/Vishva007/Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ
Code used: https://github.com/vishvaRam/AutoRound-Quantaization/blob/main/auto_round_Qwen_3_VL_8B.ipynb
Has anyone bridged this naming mismatch for VLMs without manually patching the safetensors? I'm also struggling to reload the model via `AutoModel` without getting "Missing Keys" for all quantized weights.
Stack: Qwen3-VL, AutoRound, vLLM 0.13 (dev).
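For reference, the manual patch I'm trying to avoid would look roughly like this. It's a sketch only, untested end-to-end with vLLM, the prefix mapping is just my reading of the `KeyError` above, and a sharded checkpoint's `model.safetensors.index.json` would need the same renaming:

```python
from pathlib import Path
from safetensors.torch import load_file, save_file

src = Path("Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ")
dst = Path("Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ-patched")
dst.mkdir(exist_ok=True)

PREFIX = "model.visual.merger."  # what AutoRound appears to write
TARGET = "merger."               # what vLLM's loader seems to look up

for shard in src.glob("*.safetensors"):
    tensors = load_file(shard)
    renamed = {}
    for key, value in tensors.items():
        if key.startswith(PREFIX):
            key = TARGET + key[len(PREFIX):]
        renamed[key] = value
    save_file(renamed, dst / shard.name, metadata={"format": "pt"})
```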
Thanks a lot for such a lucid quantisation 101
++ Good post. Also, start here: 500+ LLM, AI Agents, RAG, ML System Design case studies, 300+ implemented projects, and research papers in detail:
https://open.substack.com/pub/naina0405/p/very-important-llm-system-design?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
Great blog. Under a caption you mention "Depending on the hardware, integer-based calculations might be faster than floating-point calculations but this isn’t always the case." Can you provide a link or paper substantiating this? I'd be curious to learn more about a use case where dedicated hardware on the same chip silicon is faster for floating point than integer. Thanks!
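In the meantime, a crude way to probe this on one's own CPU (a rough timing sketch, not a substitute for a proper reference, and it says nothing about GPUs or dedicated INT8 units):

```python
import timeit
import numpy as np

# Compare elementwise multiply throughput for int32 vs float32 on this CPU.
# Results depend heavily on SIMD support and library code paths.
n = 10_000_000
a_int = np.random.randint(-128, 128, size=n, dtype=np.int32)
b_int = np.random.randint(-128, 128, size=n, dtype=np.int32)
a_fp, b_fp = a_int.astype(np.float32), b_int.astype(np.float32)

print("int32:  ", timeit.timeit(lambda: a_int * b_int, number=50))
print("float32:", timeit.timeit(lambda: a_fp * b_fp, number=50))
```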
Thank you for a great explanation on quantization! I just have a question. You mention GGUF in a way that sounds like it is a quantization technique, but when I read about it elsewhere, it is usually described as a file format. I'm a little confused, is GGUF a quantization method, a file format, or both? Or are you referring to the quantization types stored in GGUF, like k-quants and i-quants?
In the third figure of the GPTQ section, how does 0.5 get quantized to 0.33? If we are quantizing to INT4, shouldn't the output be an integer?
Thanks for the amazing article btw! <3
Very informative about quantization. I have bookmarked it. Thank you for your time and effort!
Wow, this is a really extensive article. Thanks!
Hello. Thank you for the excellent material.
In the `Common Data Types` section of Part 2, the mantissa part in the `BF16` illustration is shown as `1001000`. I'm curious as to why it isn't `1001001`.
You are correct. The BF16 mantissa should be 1001001. Otherwise the encoded value is 3.125.
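To make the arithmetic explicit, here is a quick decode of both bit patterns (assuming the figure's sign bit is 0 and its exponent field is 10000000):

```python
# BF16 layout: 1 sign bit, 8 exponent bits (bias 127), 7 mantissa bits.
def bf16_value(sign: str, exponent: str, mantissa: str) -> float:
    e = int(exponent, 2) - 127        # remove the exponent bias
    m = 1 + int(mantissa, 2) / 2**7   # implicit leading 1 plus the fraction
    return (-1) ** int(sign, 2) * m * 2**e

print(bf16_value("0", "10000000", "1001000"))  # 3.125     (mantissa as illustrated)
print(bf16_value("0", "10000000", "1001001"))  # 3.140625  (the corrected mantissa)
```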
I wish papers published on arXiv were as accessible as this -- even for a subscription. The two-column PDF is too rigid to read on any screen.