Discussion about this post

Lukas Nel:

Quantization might work because LLMs are fairly noise resistant and can thus cope well with the Gaussian-esque noise introduced by quantization.

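For readers who want to see the noise Lukas is describing, here is a minimal NumPy sketch (not from the post) of symmetric round-to-nearest int8 quantization. The per-weight error it introduces is small and zero-mean relative to the weights, which is the property that lets LLMs shrug it off; strictly speaking the per-weight rounding error is closer to uniform than Gaussian, but in aggregate it behaves like the small random noise the comment describes.

```python
# Minimal sketch of symmetric per-tensor int8 quantization and the
# noise it adds to the weights. All values are made up for
# illustration; nothing here is taken from the post.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

# Scale so the largest-magnitude weight maps to 127, then round
# each weight to the nearest representable int8 value.
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

# The quantization error is small, zero-mean noise on the weights.
error = dequantized - weights
print(f"weight std: {weights.std():.6f}")
print(f"error mean: {error.mean():.2e}")
print(f"error std:  {error.std():.6f}")
```
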
Sam:

Thanks for writing.

I didn't see an explanation, though, of HOW the predictions from the smaller, faster model are incorporated into the larger model's predictions in the case of speculative decoding.

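Since the answer isn't quoted here, the standard mechanism from the speculative decoding papers (Leviathan et al., 2023; Chen et al., 2023) is: the large model never blindly accepts the draft's tokens. It scores all the drafted tokens in one forward pass, keeps each token with probability min(1, p/q), where p and q are the large and small models' probabilities for that token, and on the first rejection resamples from the renormalized residual max(0, p - q). This rejection-sampling scheme makes the output distribution exactly match the large model's. A toy NumPy sketch of the accept/resample step, with made-up distributions standing in for the two models:

```python
# Toy sketch of the speculative-decoding acceptance rule
# (Leviathan et al., 2023). The vocabulary and the distributions
# are invented stand-ins for the large (p) and small (q) models.
import numpy as np

rng = np.random.default_rng(0)

def accept_or_resample(p, q, drafted_token):
    """Decide whether to keep one token proposed by the draft model.

    p, q: target/draft next-token probability vectors.
    Returns (token, accepted).
    """
    # Keep the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted_token] / q[drafted_token]):
        return drafted_token, True
    # Otherwise resample from the residual distribution
    # max(0, p - q), renormalized; this corrects for tokens the
    # draft over-proposed, so samples exactly follow p overall.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Five-token toy vocabulary: q is close to p but not identical.
p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
q = np.array([0.50, 0.20, 0.15, 0.10, 0.05])

drafted = rng.choice(len(q), p=q)  # draft model proposes a token
token, accepted = accept_or_resample(p, q, drafted)
print(f"drafted={drafted}, kept={token}, accepted={accepted}")
```
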
