3 Comments
Sam

Thanks for writing.

I didn't see an explanation as to HOW the predictions from the smaller, faster model are incorporated into the predictions of the larger model, though, in the case of speculative decoding.

Finbarr Timbers

You make N predictions from the smaller model in serial: x_1, …, x_N. You then run all N predictions through the bigger model simultaneously as a batch. You keep the predictions up to the first one the bigger model disagrees with, and throw that one and everything after it away.
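
To make that concrete, here is a minimal sketch of one round, assuming greedy decoding (each model always picks its single most likely token). The names `speculative_step`, `draft_next`, and `target_next` are made up for illustration, and the two "models" are toy functions rather than real networks. Versions that sample from the models use a probabilistic accept/reject rule instead of the exact-match check shown here.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, n: int) -> List[int]:
    """Run one round of greedy speculative decoding; return the tokens to append."""
    # 1. Draft n tokens serially with the small model.
    draft: List[int] = []
    for _ in range(n):
        draft.append(draft_next(prefix + draft))

    # 2. Ask the big model for its own token at every draft position.
    #    A real implementation gets all of these from one batched forward
    #    pass; this list comprehension only simulates that.
    verified = [target_next(prefix + draft[:i]) for i in range(n + 1)]

    # 3. Keep draft tokens up to the first disagreement.
    accepted: List[int] = []
    for i in range(n):
        if draft[i] != verified[i]:
            break
        accepted.append(draft[i])

    # 4. The big model's token at the first disagreement (or after the last
    #    accepted draft token) was already computed in step 2, so append it.
    accepted.append(verified[len(accepted)])
    return accepted


# Toy models (assumptions for the demo): the target counts upward mod 10,
# and the draft agrees with it except right after a 3.
def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[int]) -> int:
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, n=4))  # [1, 2, 3, 4]
```

In this toy run the first three draft tokens match, the fourth does not, and the bigger model's own token replaces it. The appended tokens are the same ones the bigger model would have produced by itself, only obtained from one verification pass instead of four serial ones.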

Sam

Thanks! :)
