Why do LLMs use greedy sampling?
"Greedy sampling is the worst form of sampling, except all those other forms that have been tried from time to time." - Winston Churchill, if he worked in NLP.
This is a speculative article, so I’d greatly appreciate any feedback, particularly if you disagree.
When I began working on LLMs, I found it pretty surprising that the SOTA in generative text was to greedily sample from the bare outputs of the neural networks. In other words, with GPT-style models, the typical approach to generate a sequence of text is something like:
1. Run your prompt through your model and generate probabilities over your vocabulary.
2. Choose the most likely token (perhaps with some randomization, maybe with some preprocessing, like top-k or nucleus sampling).
3. If the chosen token is `<|endoftext|>`, you're done; otherwise, concatenate the new token to your prompt and go back to 1.
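That loop fits in a few lines of Python. Here's a minimal sketch, where `toy_model` is a stand-in for a real forward pass (a real model would return logits from a transformer; all the names and the tiny vocabulary here are illustrative):

```python
# Sketch of the greedy autoregressive decoding loop described above.
END = "<|endoftext|>"
VOCAB = ["hello", "world", END]

def toy_model(tokens):
    # Hypothetical model: deterministic next-token probabilities,
    # putting all mass on END once three tokens have accumulated.
    if len(tokens) >= 3:
        return {t: (1.0 if t == END else 0.0) for t in VOCAB}
    nxt = "world" if tokens[-1] == "hello" else "hello"
    return {t: (1.0 if t == nxt else 0.0) for t in VOCAB}

def greedy_decode(prompt, model, max_len=16):
    tokens = list(prompt)
    while len(tokens) < max_len:
        probs = model(tokens)              # step 1: forward pass
        token = max(probs, key=probs.get)  # step 2: pick the argmax token
        if token == END:                   # step 3: stop, or loop again
            break
        tokens.append(token)
    return tokens

print(greedy_decode(["hello"], toy_model))  # → ['hello', 'world', 'hello']
```

Top-k or nucleus sampling would only change step 2 (filter `probs` before sampling from it); the overall loop is the same.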
In RL research on games, it is common to instead run a much more involved calculation to choose the next step in your sequence: AlphaZero, for instance, uses a fairly involved algorithm called Monte Carlo Tree Search (MCTS). Here, I explore some reasons why LLMs don't use a fancier decoding algorithm.
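A full MCTS implementation is out of scope for this post, but the selection rule at its core is compact. Here's a sketch of the classic UCB1 score used to pick which child node to explore next (AlphaZero actually uses a PUCT variant that adds a policy prior; this simplified form is my illustration, not their algorithm):

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=1.4):
    # UCB1: exploitation (mean value so far) plus an exploration
    # bonus that shrinks as the child gets visited more often.
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```

MCTS repeatedly descends the tree by taking the max over this score, expands a leaf, evaluates it (with a rollout or a value network), and backs the result up along the visited path. The punchline: every extra simulation is extra inference-time compute spent improving the move choice, which is exactly the kind of spending greedy LLM decoding doesn't do.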
But first, a caveat: there's a lot of literature proposing various ways to do this that, for the sake of time, I'm not going to engage with. I have a list of references at the end, which I'd encourage you to look at if you want a more detailed treatment.
The current paradigm of language modelling, with GPT-style decoder models, uses greedy autoregressive sampling to generate a sequence of tokens. This is a somewhat surprising choice: if you look at the history of NLP research, particularly the neural machine translation literature, beam search is often needed to reach SOTA performance (e.g. 1703.03906). Similarly, in games research, search is typically many times stronger than any pure neural network approach, and search strictly dominates wherever it's feasible (the exceptions are games like Stratego, where the game tree has much too high a branching factor to be searched to any non-trivial depth). In games like Go, Chess, Poker, or Scotland Yard, search methods dominate.
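To make the beam-search contrast concrete, here's a minimal sketch over a hypothetical next-token distribution (the probability table and all names are made up for illustration). Instead of committing to one token per step, it keeps the `beam_width` highest-probability partial sequences:

```python
import math

END = "<|endoftext|>"

def toy_model(tokens):
    # Hypothetical next-token distribution over a tiny vocabulary.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"a": 0.1, "b": 0.5, END: 0.4},
        ("b",): {"a": 0.9, "b": 0.05, END: 0.05},
    }
    return table.get(tuple(tokens), {END: 1.0})

def beam_search(model, beam_width=2, max_len=4):
    # Each beam entry is (log-probability, token list, finished?).
    beams = [(0.0, [], False)]
    for _ in range(max_len):
        candidates = []
        for logp, toks, done in beams:
            if done:
                candidates.append((logp, toks, True))
                continue
            for tok, p in model(toks).items():
                if p <= 0.0:
                    continue
                if tok == END:
                    candidates.append((logp + math.log(p), toks, True))
                else:
                    candidates.append((logp + math.log(p), toks + [tok], False))
        # Keep only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]

print(beam_search(toy_model))  # → ['b', 'a']
```

On this toy distribution, greedy decoding would commit to "a" first (p = 0.6) and end up with the sequence "a b" (total probability 0.30), while the beam recovers "b a" (0.36): exactly the kind of locally-suboptimal-but-globally-better choice that made beam search matter in NMT.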
By search, I mean algorithmic search, which I define as any method that uses additional compute at inference time to improve the answer. This has nothing to do with Google-style search (which I'd call "information retrieval").
So why don't GPTs use search? Well, there are a few answers to this. The first one is a total copout:
GPTs don't use search as far as we know. OpenAI raised my eyebrows recently when they hired Noam Brown, an expert on search in games, to work on "multi-step reasoning, self-play, and multi-agent AI." That sounds an awful lot like search to me (and, specifically, a lot like Alpha/MuZero). We also know that Demis Hassabis has talked about Gemini incorporating techniques from AlphaGo, which, again, makes me think of search (not to mention self-play).
So it's entirely possible that search is the secret sauce behind GPT-4's performance, and that the lack of it is why the open-source world has been unable to match it. I'm suspicious of this (for reasons I'll get into below), but if I were actively working on LLM research, I'd be focused on trying to use search.
Let’s assume, then, that the key players (GPT-4, Claude, etc.) aren’t using search. Why?