Thanks for writing.

I didn't see an explanation as to HOW the predictions from the smaller, faster model are incorporated into the predictions of the larger model, though, in the case of speculative decoding.

You make N predictions from the smaller model in serial: x_1, …, x_N. You then run those N predictions through the bigger model simultaneously as a batch. You keep the predictions up to the first position where the bigger model disagrees; at that position you take the bigger model's token instead and throw the rest away.
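A minimal sketch of that accept/reject loop, assuming greedy decoding. The `draft` and `target` callables are hypothetical stand-ins for the small and big model (each maps a token list to its single most likely next token), and the list comprehension stands in for what would be one batched forward pass in a real implementation. Sampling-based variants replace the exact-match check with a probabilistic acceptance rule, which is not shown here.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, n: int) -> List[Token]:
    """Run one round of greedy speculative decoding and return the new tokens."""
    # 1. Draft n tokens serially with the small model.
    drafted: List[Token] = []
    for _ in range(n):
        drafted.append(draft(prefix + drafted))

    # 2. Ask the big model for its next token at every drafted position
    #    (plus one past the end). A real implementation gets all of these
    #    from a single batched forward pass; the loop here only imitates that.
    verified = [target(prefix + drafted[:i]) for i in range(n + 1)]

    # 3. Keep drafted tokens up to the first disagreement.
    accepted: List[Token] = []
    for i in range(n):
        if drafted[i] != verified[i]:
            break
        accepted.append(drafted[i])

    # 4. The big model's own token at the stopping position is already
    #    computed, so it is appended for free.
    accepted.append(verified[len(accepted)])
    return accepted


# Toy models (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees with it except right after a 3.
def toy_target(tokens: List[Token]) -> Token:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[Token]) -> Token:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, 4))  # [2, 3, 4]
```

In the toy run the draft proposes 2, 3, 0, 1; the target agrees on 2 and 3, disagrees at the third position, and its own token 4 replaces the rejected draft. The output is the same as decoding with the big model alone; the saving comes from needing fewer big-model passes when the draft is usually right.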

Thanks! :)

> Their results were mixed; their experiments found it optimal to use 4-bit precision, but the difference wasn’t huge [...] It’s hard to reconcile this with the GPTQ paper, which reports almost tradeoff free results from quantizing.

That figure from the k-bit paper *also* shows ~0 penalty from quantization.

Consider the hypothetical where the penalty for 4-bit quantization is literally zero. In other words, if my model needs B bits at 16-bit precision, and achieves accuracy A at 16-bit precision, I can also achieve accuracy A with B/4 bits, by using the same model in 4-bit precision.

If that were true, the yellow 4-bit line on the plot would look like a copy of the orange 16-bit line, shifted to the left by a factor of 4 (i.e., by about 60% of the distance between each major tick on the x-axis, since log10(4) ≈ 0.6, assuming the major ticks are a factor of 10 apart).

And if you squint at the plot, it basically *does* look like that! This is especially clear if you look at the endpoints of each line, which correspond to the smallest and biggest models at different precisions. The smallest model always has ~46% accuracy, whether it's at 16-bit or 4-bit; likewise, the largest model always has ~73% accuracy.

The reason the plot feels sort of underwhelming -- despite showing almost the best possible outcome for quantization -- is that the model sizes range over 3+ orders of magnitude, while the highest precision is not even a single OOM above the lowest precision.
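The arithmetic behind that comparison can be checked in a few lines. This is only a sketch under two assumptions: the x-axis is logarithmic with major ticks a factor of 10 apart, and the model-size range is taken as exactly 1000x (the lower bound of "3+ orders of magnitude").

```python
import math


def log10_span(ratio: float) -> float:
    """Horizontal distance, in decades, that a given ratio covers on a log10 axis."""
    return math.log10(ratio)


# Going from 16-bit to 4-bit divides total bits by 4.
quant_shift = log10_span(16 / 4)

# Assumed lower bound for the range of model sizes on the plot.
size_range = log10_span(1000)

print(f"quantization shift: {quant_shift:.2f} decades")
print(f"model size range:   {size_range:.2f} decades")
print(f"shift as a fraction of the range: {quant_shift / size_range:.0%}")
```

So even a penalty-free 4x saving moves a curve by roughly a fifth of the plot's width or less, which is why the lines look nearly stacked on top of each other.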

By contrast, the GPT-2 pane of Figure 7 makes quantization look more impressive at a glance, because the largest GPT-2 is only about 10x bigger than the smallest. But we're seeing the same story, just with differently scaled axes.

In other words, quantization *really does* let you run a 4x bigger model than you would otherwise be able to fit in VRAM. But running a 4x bigger model isn't actually as exciting as it sounds. The effects of LLM scaling only get really exciting and noticeable across ratios considerably larger than this.
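As a rough illustration of the "4x bigger model" claim, here is a back-of-the-envelope calculation. It counts weights only and ignores activations, the KV cache, and any per-block quantization metadata, so real numbers will be somewhat less favorable. The 24 GiB budget and the 7B parameter count are hypothetical examples, not figures from the paper.

```python
GIB = 2**30  # bytes per GiB


def weight_gib(n_params: int, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def max_params(vram_gib: int, bits_per_param: int) -> int:
    """Largest parameter count whose weights fit in the given budget."""
    return vram_gib * GIB * 8 // bits_per_param


budget = 24  # hypothetical VRAM budget in GiB
print(f"7B params at 16-bit: {weight_gib(7_000_000_000, 16):.1f} GiB")
print(f"7B params at 4-bit:  {weight_gib(7_000_000_000, 4):.1f} GiB")
print(f"max params at 16-bit: {max_params(budget, 16) / 1e9:.1f}B")
print(f"max params at 4-bit:  {max_params(budget, 4) / 1e9:.1f}B")
```

Under these assumptions the 4-bit limit is exactly 4x the 16-bit limit, whatever the budget.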

Hmmm, you've convinced me. I'll update the article. Thanks for this!