"Google LLMs use SentencePiece, which does basically the same thing with a slight tweak" - SentencePiece is a library with different tokenizers (including BPE variants), not an algorithm itself. I was confused about this as well until I read the ambiguity in https://arxiv.org/abs/2112.10508
Very nice overview and paper choice! Regarding the MoD, i was somewhat surprised to see that the routing mechanism acts on tokens independently. For a MoE, i can understand this, as you act based on the token's semantics. But when choosing if you need to do more computation on a token or not, i was expecting this decision to somehow require additional information. The fact that this is achieved just through a projection layer (so basically an inner product, ie similarity) could be possible due to the layers specialising in different abstractions: at lower levels you can focus on the grammar and sentence structure, and while you go up you focus more on the "abstract" tokens. And also the fact that you have to implement this for a layer yes and the one after no, seems a bit weird.
It does seem weird. I’m curious if anyone else has used it. It seemed to not have been picked up by any other labs, and I haven’t seen any OS models use it.
"Google LLMs use SentencePiece, which does basically the same thing with a slight tweak" - SentencePiece is a library with different tokenizers (including BPE variants), not an algorithm itself. I was confused about this as well until I read the ambiguity in https://arxiv.org/abs/2112.10508
ahhhh ty!
Very nice overview and paper choice! Regarding MoD, I was somewhat surprised to see that the routing mechanism acts on tokens independently. For an MoE this makes sense, since you route based on the token's semantics. But when deciding whether a token needs more computation, I expected that decision to require additional information. The fact that it is achieved with just a projection layer (so basically an inner product, i.e. a similarity score) may work because the layers specialise in different abstractions: lower layers can focus on grammar and sentence structure, while higher layers focus more on the "abstract" tokens. Also, the fact that you apply this in alternating layers (one layer yes, the next no) seems a bit odd.
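For reference, here is a rough sketch of what that per-token routing looks like, assuming a generic transformer block; the class and variable names are illustrative, and the sigmoid gating is one plausible way to keep the router differentiable rather than the paper's exact code.

```python
# Minimal sketch of MoD-style routing: a learned projection gives each token a
# scalar score, only the top-k tokens per sequence go through the block, and the
# rest pass through unchanged on the residual path.
import torch
import torch.nn as nn

class MoDRouter(nn.Module):
    def __init__(self, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)   # inner product with a learned vector
        self.capacity = capacity            # fraction of tokens that get computed

    def forward(self, x: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # x: (batch, seq_len, d_model); block maps (batch, k, d_model) -> same shape
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))
        scores = self.proj(x).squeeze(-1)            # (b, t): one scalar per token
        topk = scores.topk(k, dim=-1).indices        # tokens selected for compute
        idx = topk.unsqueeze(-1).expand(-1, -1, d)   # (b, k, d) gather indices
        selected = x.gather(1, idx)
        # Gate the block output by the router score so the router receives gradient.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, idx, selected + gate * block(selected))
        return out
```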
It does seem weird. I'm curious whether anyone else has used it; it doesn't seem to have been picked up by other labs, and I haven't seen any open-source models use it.