Discussion about this post

Sander Land:

"Google LLMs use SentencePiece, which does basically the same thing with a slight tweak" - SentencePiece is a library with different tokenizers (including BPE variants), not an algorithm itself. I was confused about this as well until I read the ambiguity in https://arxiv.org/abs/2112.10508

Leonardo Perelli:

Very nice overview and paper choice! Regarding MoD, I was somewhat surprised to see that the routing mechanism acts on tokens independently. For an MoE I can understand this, since you route based on the token's semantics. But when deciding whether a token needs more computation, I expected that decision to require some additional context. The fact that it works with just a projection layer (so basically an inner product, i.e. a similarity) might be possible because the layers specialise in different abstractions: at lower levels you focus on grammar and sentence structure, and as you go up you focus more on the "abstract" tokens. Also, the fact that you have to apply this to alternating layers (one layer yes, the next one no) seems a bit weird.
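
To make that routing step concrete, here is a minimal PyTorch sketch of what "just a projection layer" means here; the class name, sigmoid gate, and capacity handling are my own simplifications, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    """Illustrative Mixture-of-Depths-style routing (not the paper's code).

    A single linear projection scores each token independently; only the
    top-`capacity` tokens per sequence pass through the block, the rest
    skip it via the residual stream.
    """

    def __init__(self, d_model: int, capacity: int):
        super().__init__()
        # The router is just a projection: score = <w, x>, i.e. a
        # similarity between the token and one learned direction.
        self.proj = nn.Linear(d_model, 1, bias=False)
        self.capacity = capacity

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.proj(x).squeeze(-1)                  # (batch, seq_len)
        topk = scores.topk(self.capacity, dim=-1).indices  # routed positions
        out = x.clone()
        for b in range(x.size(0)):  # per-sequence loop, for clarity
            idx = topk[b]
            # Gate the block output with the router score so the routing
            # decision receives gradients. (The paper's exact gating
            # differs; the sigmoid here is an assumption.)
            gate = torch.sigmoid(scores[b, idx]).unsqueeze(-1)
            out[b, idx] = x[b, idx] + gate * block(x[b, idx])
        return out

# Hypothetical usage: 4 of 16 tokens per sequence get the extra compute.
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
router = MoDRouter(d_model=64, capacity=4)
y = router(torch.randn(2, 16, 64), block)
```

Because the gate multiplies the block output, the router still gets gradients even though the top-k selection itself is non-differentiable.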
