Thanks for writing.
I didn't see an explanation as to HOW the predictions from the smaller, faster model are incorporated into the predictions of the larger model, though, in the case of speculative decoding.
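You make N predictions from the smaller model in serial: x_1, …, x_N. You then run all N through the bigger model simultaneously as a batch, which gives you the bigger model's own prediction at each position. With greedy decoding, you keep the drafted tokens up to the first one the bigger model disagrees with and throw away everything after it; at the mismatch you take the bigger model's token instead.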
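In case code helps, here is a rough sketch of one such draft-and-verify round. It assumes greedy decoding only (the sampling variant uses a probabilistic accept/reject rule that isn't shown), and everything named here (`Model`, `speculative_step`, `toy_target`, `toy_draft`) is made up for illustration, not taken from any library.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(draft: Model, target: Model, prefix: List[int], n: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append to prefix."""
    # 1. Draft n tokens serially with the small model.
    drafted: List[int] = []
    for _ in range(n):
        drafted.append(draft(prefix + drafted))

    # 2. Get the large model's own next token at each of the n + 1 positions.
    #    A real implementation reads these off a single batched forward pass;
    #    this list comprehension only stands in for that.
    verified = [target(prefix + drafted[:i]) for i in range(n + 1)]

    # 3. Keep drafted tokens for as long as they match the large model.
    accepted: List[int] = []
    for i in range(n):
        if drafted[i] != verified[i]:
            break
        accepted.append(drafted[i])

    # The large model's token at the first mismatch (or the extra token after
    # n matches) is always kept, so every round makes progress.
    accepted.append(verified[len(accepted)])
    return accepted


# Toy stand-ins: the "large" model counts upward mod 10,
# and the "small" model agrees except right after a 3.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10
```

With these toy models, `speculative_step(toy_draft, toy_target, [0], 4)` returns `[1, 2, 3, 4]`: the draft proposes 1, 2, 3, 0, the first three are accepted, and the 0 is replaced by the large model's 4. The output is the same as what the large model would have produced on its own; the saving is that it was only invoked once for four tokens.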
Thanks! :)