A disclaimer: nothing I say here represents any organization other than Artificial Fintelligence. These are my views, and mine alone, although I hope that you share them after reading.
Frontier labs are spending, in aggregate, hundreds of millions of dollars annually on data acquisition, leading to a number of startups selling data to them (Mercor, Scale, Surge, etc.). This novel data, combined with reinforcement learning (RL) techniques, represents the clearest avenue to improvement, and to AGI. I am firmly convinced that scaling up RL techniques will lead to excellent products and, eventually, AGI. A primary source of improvement over the last decade has been scale, as the industry has discovered one method after another that allows us to convert money into intelligence. First, bigger models. Then, more data (thereby making Alexandr Wang very rich). And now, RL.
RL is the subfield of machine learning that studies algorithms which discover new knowledge. Reinforcement learning agents take actions in environments to systematically discover the optimal strategy (called a policy). An example environment is Atari: you have an environment (the Atari game) where the agent can take actions (moving in different directions, pressing the “fire” button) and the agent receives a scalar reward signal that it wants to maximize (the score). Without providing any data on how to play Atari games, RL algorithms are able to discover policies which get optimal scores in most Atari games.
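To make that loop concrete, here is a minimal sketch of the agent-environment interaction using the Gymnasium API, with CartPole standing in for an Atari game (Atari environments need extra ROM setup) and a random policy as a placeholder for a learned one:

```python
# Minimal RL interaction loop (Gymnasium API). CartPole stands in for Atari
# to keep the example dependency-free; the structure is identical.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder policy: act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the scalar reward signal the agent maximizes
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```

An actual RL algorithm replaces the random `env.action_space.sample()` call with a learned policy and uses the observed rewards to update it.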
The key problem in RL is the exploration/exploitation tradeoff. At each point that the agent is asked to choose an action, it has to decide between choosing the action it currently thinks is best (“exploiting”) or trying a new action that might be better (“exploring”). This is an extremely difficult decision to get right. Consider a complicated game like Starcraft, or Dota. For any individual situation the agent is in, how can we know what the optimal action is? It’s only after making an entire game’s worth of decisions that we can know whether our strategy is sound, and only after playing many games that we can conclude how good we are in comparison to other players.
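As a toy illustration of the tradeoff (a textbook multi-armed bandit, not anything specific to Starcraft or Dota), here is epsilon-greedy action selection: with a small probability the agent explores a random arm, otherwise it exploits its current value estimates.

```python
# Epsilon-greedy on a made-up three-armed bandit: the simplest concrete form
# of the exploration/exploitation tradeoff.
import random

true_means = [0.2, 0.5, 0.8]          # hidden payoff of each arm, unknown to the agent
estimates = [0.0] * len(true_means)   # the agent's running value estimates
counts = [0] * len(true_means)
epsilon = 0.1                         # how often to explore

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))                        # explore
    else:
        arm = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]          # incremental mean

print([round(e, 2) for e in estimates])  # estimates approach [0.2, 0.5, 0.8]
```

In a bandit the feedback is immediate; in Starcraft or Dota the equivalent decision has to be made thousands of times before a single win/loss signal arrives, which is what makes the tradeoff so much harder there.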
Large language models help significantly here, as they are much, much more sample efficient because they have incredibly strong priors. By encoding a significant fraction of human knowledge, the models are able to behave well in a variety of environments before they’ve actually received any training data.
When it comes to language modelling, most use of RL to date has been for RLHF, which is mostly used for behaviour modification. As there is (typically) no live data involved, RLHF isn’t “real” RL and does not face the exploration/exploitation tradeoff, nor does it allow for the discovery of new knowledge.
Knowledge discovery is the main unsolved problem in modern machine learning. While we’ve become proficient at supervised learning, we haven’t yet cracked the code on how to systematically discover new knowledge, especially superhuman knowledge. The AlphaStar team, for instance, spent a lot of compute discovering new policies, as it is an extraordinarily hard problem to discover good strategies in Starcraft without prior knowledge.
Therein lies the rub: RL is simultaneously the most promising and the most challenging approach we have. DeepMind invested billions of dollars in RL research with little commercial success to show for it (the Nobel prize, for instance, was for AlphaFold, which didn’t use RL). While RL is often the only solution for certain hard problems, it is notoriously difficult to implement effectively. Consider a game with discrete turns, like Chess or Go. In Go, you have on average 250 different choices at each turn, and the game lasts roughly 150 moves. Consequently, the game tree has approximately 250^150 nodes, or ~10^360. If searching randomly (which is how many RL algorithms explore), it is exceedingly difficult to find a reasonable trajectory through the game, which is why AlphaZero-style self-play is needed, or an AlphaGo-style supervised learning phase. When we consider the LLM setting, in which typical vocabulary sizes are in the tens to hundreds of thousands of tokens and sequence lengths can be in the tens to hundreds of thousands, the problem is made much worse. The result is a situation where RL is both necessary and yet should be considered a last resort.
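The Go figure above is easy to sanity-check with a couple of lines:

```python
# Back-of-the-envelope check: branching factor ~250, game length ~150 moves.
import math

log10_nodes = 150 * math.log10(250)
print(f"250^150 is roughly 10^{log10_nodes:.0f}")  # roughly 10^360
```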
Put differently, one way to think of deep learning is that it’s all about learning a good, generalizable function approximation. In deep RL, we are approximating a value function, i.e. a function that tells us exactly how good or how bad a given state of the world is. To improve the accuracy of the value function, we need to receive data with non-trivial answers. If all we ever receive is the same reward (and it’s really bad), we can’t do anything. Consider a coding assistant, like Cursor’s newly released background agent. One way to train the agent would be to give it a reward of 1 if it produces code that gets merged via a pull request, and 0 otherwise. If you took a randomly initialized network, it would output gibberish, and would thus always receive a signal of 0. Once you have a model that is actually good enough to sometimes be useful to users, you can start getting meaningful signal and rapidly improve.
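Here is a toy simulation of that sparse-reward problem, with a hypothetical success probability p standing in for how often the agent produces mergeable code; at p near zero, essentially no episode carries any learning signal.

```python
# With a binary 0/1 reward, the fraction of episodes that carry any learning
# signal is just the success probability p, so a randomly initialized policy
# (p ~ 0) gets nothing to learn from.
import random

def fraction_with_signal(p_success: float, n_episodes: int = 100_000) -> float:
    rewarded = sum(1 for _ in range(n_episodes) if random.random() < p_success)
    return rewarded / n_episodes

for p in (0.0, 1e-4, 0.05, 0.3):  # from random init to "sometimes useful to users"
    print(f"p={p}: fraction of episodes with non-zero reward = {fraction_with_signal(p):.4f}")
```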
As an illustrative example, I have a friend who does RL research for games at a large video game publisher (think: EA, Sony, Microsoft, etc.). He consults with teams at the publisher’s studios that want to use RL. Despite more than two decades of experience as an RL practitioner, his first response is usually to ask whether they’ve tried everything else, because it is so difficult to get RL to work in practical settings.
The great question with reinforcement learning and language models is whether we’ll see results transfer to other domains, as we have with next-token prediction. The great boon of autoregressive language models has been that they generalize well: you can train a model to predict the next token and it learns to generate text that is useful in a number of other situations. It is absolutely not clear whether that will be the case with models trained largely with RL, as RL policies tend to be overly specialized to the exact problem they were trained on. AlphaZero notoriously had problems with catastrophic forgetting; a paper that I wrote while at DeepMind showed that simple exploits existed which could consistently beat AlphaZero, a result that has since been replicated in a number of other papers. To get around this, many RL algorithms require repeatedly revisiting the training data via replay buffers, which is awkward and unwieldy.
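For readers who haven’t seen one, an experience replay buffer is just a bounded store of past transitions that training repeatedly re-samples so the network keeps seeing older data; a minimal sketch:

```python
# A minimal experience replay buffer: old transitions are kept around and
# re-sampled so the network keeps seeing earlier data during training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniformly re-sample past transitions for a training step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```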
With LLMs, this is a major problem. Setting aside RL, in the open research space we see many VLMs that are trained separately from their LLM equivalents. DeepSeek-VL2 is a separate family of models from V3, which is text-only, despite all the major closed-source models accepting multimodal inputs. The main reason for the separation is that, in the published literature, adding multimodal capabilities to LLMs sacrifices pure-text performance. When we add RL on top, we should expect the problem to become much worse, and more research to be dedicated to managing the inherent tradeoffs here.
In my experience as a practitioner, RL lives or dies by the quality of the reward signal. One of the most able RL practitioners I know, Adam White, begins all of his RL projects by first learning to predict the reward signal, and only then tries to optimize it (first predict, then control). Systems that optimize complex, overfit reward models will struggle. Systems like the Allen Institute’s Tulu 3, which used verifiable rewards to do RL, seem like the answer, and provide motivation for the hundreds of millions of dollars that the frontier labs are spending on acquiring data.
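A rough sketch of the “first predict, then control” idea, with synthetic data and a simple least-squares model standing in for whatever reward model you would actually use: check that the reward signal is predictable from your features before you ever optimize a policy against it.

```python
# First predict: fit a reward predictor and measure held-out error. Only once
# that error is acceptably low does it make sense to optimize (control)
# against the predicted reward. The data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                 # stand-in features of (prompt, response) pairs
true_w = rng.normal(size=16)
y = X @ true_w + 0.1 * rng.normal(size=1000)    # observed reward signal

w, *_ = np.linalg.lstsq(X[:800], y[:800], rcond=None)  # fit on the first 800 examples
mse = float(np.mean((X[800:] @ w - y[800:]) ** 2))     # evaluate on the held-out 200
print(f"held-out reward-prediction MSE: {mse:.3f}")
```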
The development of AlphaGo illustrates this paradox perfectly:
- RL was essentially the only viable approach for achieving superhuman performance in Go
- The project succeeded, but required enormous resources and effort
- The solution existed in a “narrow passageway”: there were likely very few variations of the AlphaGo approach that would have worked, as can be seen from the struggle that others have had replicating AlphaGo’s success in other domains.1
We're now facing a similar situation with language models:
- We’ve largely exhausted the easily accessible training data
- We need to discover new knowledge to progress further
- For superhuman knowledge in particular, we can’t rely on human supervision by definition
- RL appears to be the only framework general enough to handle this challenge
In short, this is a call for research labs to start investing in fundamental RL research again, and in particular, on finally making progress on the exploration problem.