5 Comments

Thanks for writing.

In the case of speculative decoding, though, I didn't see an explanation of HOW the predictions from the smaller, faster model are incorporated into the predictions of the larger model.
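In case it helps other readers: the usual answer (from the speculative decoding papers, not from this post) is that the draft model proposes a few tokens, the big model scores all of them in a single forward pass, and each drafted token is then either accepted or resampled so that the output distribution still matches the big model exactly. A minimal numpy sketch of that accept/resample step (names and structure are mine):

```python
import numpy as np

def accept_or_resample(p_target, p_draft, draft_token, rng):
    """One step of the standard speculative-decoding acceptance test.

    p_target, p_draft: next-token distributions (1-D arrays over the vocab)
    from the big target model and the small draft model at the same position.
    """
    # Keep the drafted token with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return draft_token, True
    # Otherwise resample from the residual max(0, p_target - p_draft), renormalized.
    # This correction makes the output exactly follow the big model's distribution.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False
```

If every drafted token is accepted, you also get one extra token for free by sampling from the big model's distribution at the position after the last draft.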


> Their results were mixed; their experiments found it optimal to use 4-bit precision, but the difference wasn’t huge [...] It’s hard to reconcile this with the GPTQ paper, which reports almost tradeoff free results from quantizing.

That figure from the k-bit paper *also* shows ~0 penalty from quantization.

Consider the hypothetical where the penalty for 4-bit quantization is literally zero. In other words, if my model needs B bits at 16-bit precision, and achieves accuracy A at 16-bit precision, I can also achieve accuracy A with B/4 bits, by using the same model in 4-bit precision.

If that were true, the yellow 4-bit line on the plot would look like a copy of the orange 16-bit line, shifted to the left by a factor of 4 (i.e., assuming the major ticks on the log-scale x-axis mark powers of ten, by log10(4) ≈ 60% of the distance between them).
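To put numbers on that shift, a quick sketch (the 7B parameter count below is a made-up example, not a model from the paper):

```python
import math

params = 7e9                  # hypothetical parameter count, purely illustrative
bits_fp16 = params * 16       # total model bits at 16-bit precision
bits_int4 = params * 4        # total model bits at 4-bit precision

print(bits_fp16 / bits_int4)              # 4.0  -- the horizontal shift factor
print(math.log10(bits_fp16 / bits_int4))  # ~0.6 -- the shift in decades on a log x-axis
```

The shift is the same for every model size, which is why the whole 4-bit curve just slides left.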

And if you squint at the plot, it basically *does* look like that! This is especially clear if you look at the endpoints of each line, which correspond to the smallest and biggest models at different precisions. The smallest model always has ~46% accuracy, whether it's at 16-bit or 4-bit; likewise, the largest model always has ~73% accuracy.

The reason the plot feels sort of underwhelming -- despite showing almost the best possible outcome for quantization -- is that the model sizes range over 3+ orders of magnitude, while the highest precision is not even a single OOM above the lowest precision.

By contrast, the GPT-2 pane of Figure 7 makes quantization look more impressive at a glance, because the largest GPT-2 is only about 10x bigger than the smallest. But we're seeing the same story, just with differently scaled axes.

In other words, quantization *really does* let you run a 4x bigger model than you would otherwise be able to fit in VRAM. But running a 4x bigger model isn't as exciting as it sounds: the effects of LLM scaling only become really noticeable across size ratios considerably larger than that.
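For concreteness, the weights-only arithmetic behind that 4x claim (the 24 GB budget is a hypothetical example, and this ignores the KV cache, activations, and framework overhead):

```python
# Largest parameter count whose weights fit in a given VRAM budget.
def max_params(vram_bytes, bits_per_param):
    return vram_bytes / (bits_per_param / 8)

VRAM_BYTES = 24e9                         # hypothetical 24 GB GPU
print(max_params(VRAM_BYTES, 16) / 1e9)   # ~12 billion params at 16-bit
print(max_params(VRAM_BYTES, 4) / 1e9)    # ~48 billion params at 4-bit, i.e. 4x
```

Going from ~12B to ~48B is real headroom, but on the scaling curves above it only moves you about 0.6 of a decade along the x-axis.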
