Discussion about this post

Lukas Nel

Quantization might work because LLMs are fairly noise-resistant and can thus cope fairly well with the Gaussian-esque noise introduced by quantization.
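
That intuition is easy to check numerically: round-tripping weights through int8 quantization adds a small, roughly uniform rounding error per weight, and sums of many such errors look Gaussian. A minimal NumPy sketch (the shapes and scales here are illustrative, not from the post):

```python
import numpy as np

# Round-trip a toy weight vector through symmetric int8 quantization
# and inspect the error it introduces. Everything here is illustrative.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

scale = np.abs(w).max() / 127.0                      # per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale                 # dequantized weights

err = w - w_hat
print(f"max |error|: {np.abs(err).max():.2e}")       # at most scale / 2
print(f"error std:   {err.std():.2e}")
```

Each weight's error is bounded by half the quantization step, and the accumulated perturbation across the many weights feeding one activation is approximately Gaussian by the central limit theorem, which is the kind of noise the comment suggests LLMs absorb gracefully.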

Sam

Thanks for writing.

I didn't see an explanation of HOW the predictions from the smaller, faster model are incorporated into the larger model's predictions in the case of speculative decoding, though.
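
For reference, in the standard scheme (Leviathan et al., 2023; Chen et al., 2023) the small model's predictions aren't blended into the large model's at all: the large model scores the drafted tokens in a single forward pass, each draft token is kept with probability min(1, p_target/p_draft), and on the first rejection a replacement token is sampled from the renormalized residual max(0, p_target − p_draft), so the output distribution is exactly the large model's. A schematic sketch, assuming `draft_probs` and `target_probs` hold the two models' per-position distributions (all names here are illustrative):

```python
import torch

def accept_or_reject(draft_tokens, draft_probs, target_probs):
    """One speculative-decoding verification step (rejection sampling).

    draft_tokens: list[int] of tokens proposed by the small model.
    draft_probs / target_probs: [len(draft_tokens), vocab] distributions
    from the small and large models at each drafted position.
    """
    out = []
    for t, x in enumerate(draft_tokens):
        p = target_probs[t, x]   # large model's probability of the draft token
        q = draft_probs[t, x]    # small model's probability of the same token
        # Keep the token with probability min(1, p/q); this is what makes
        # the final samples distributed exactly as the large model's.
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(x)
        else:
            # On rejection, resample from the residual distribution
            # max(0, target - draft), renormalized, and stop drafting.
            residual = torch.clamp(target_probs[t] - draft_probs[t], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    return out
```

Every accepted draft token saves a full sequential decoding step of the large model, which is where the speedup comes from.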
