I'm really interested in the user interface design of these human feedback systems. Reading 64 outputs to choose the best? How do people do that?
You know, that's a great question. I assume it can't be done side by side? Presumably they're doing multiple 1v1 comparisons and ranking those... but that seems different from what they say?
I'm really curious how they do it...
I wonder if the model could suffer from initial errors in the training data, with the feedback loop then making it become "super-wrong". I also wonder how things such as dataset poisoning and malicious attacks might play out, whether in the data or in the RL stage, by people.
Yes, this is absolutely the major problem with recursive self-improvement. It could easily become recursive worsening. I'm going to try to run some experiments with this and see what happens.
Great post, this is in line with a lot of my thinking about how models can be made more capable now that we're pushing up against the ceiling of easily scrapable internet data.
I like your partitioning of LLM issues into core ones (exploration, self-improvement, data efficiency) and others that I agree are likely to be solved in the near future; from what I have heard, context length, model efficiency, and factual accuracy at least are on course to be greatly advanced this year.
On data availability: my impression is that there are likely at least 10T tokens of high-quality data on the internet; the issue is that much of it requires increasingly more effort to scrape, and only a small fraction of it has permissive licenses if you are concerned about copyright lawsuits. There is likely significant data available from converting other domains as well. "OpenAI is using Whisper on all of YouTube to scrape data" is a bit of a meme, but it's actually a plausible way to get more decent-quality tokens at a cost of roughly a few million dollars.
I agree, though, that in the long term, models which explore and generate their own data are the way to go. This seems easiest in highly structured domains like code, where you can also generate unit tests (https://arxiv.org/abs/2106.05784). Over the past few months there has also been progress in evaluation dataset generation, with Anthropic's work moving towards LLM benchmark generation along many axes (albeit targeted more at RLHF models) (https://arxiv.org/abs/2212.09251), and over the next few months I am hoping to work on an open-source pipeline to generate questions/queries of greater complexity for arbitrary evaluation axes and to augment current datasets.
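To make the generate-and-filter idea concrete, here's a minimal sketch for the code domain: sample candidate solutions and candidate unit tests, and keep only solutions whose tests pass. `llm_sample` is a hypothetical placeholder for whatever sampling API you use, not a real library call:

```python
import subprocess
import tempfile

def llm_sample(prompt: str) -> str:
    """Hypothetical: return one sampled completion from your LLM."""
    raise NotImplementedError

def passes_tests(solution: str, tests: str) -> bool:
    """Run a generated solution against generated tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=10)
    return result.returncode == 0

def self_generate(task_prompt: str, n: int = 32) -> list[str]:
    """Sample n (solution, tests) pairs; keep solutions whose tests pass."""
    kept = []
    for _ in range(n):
        solution = llm_sample(task_prompt)
        tests = llm_sample(f"Write plain-assert unit tests for:\n{solution}")
        try:
            if passes_tests(solution, tests):
                kept.append(solution)  # candidate finetuning datum
        except subprocess.TimeoutExpired:
            pass  # treat hangs as failures
    return kept
```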
One of the main issues with doing this naively is that your generations will often lack sufficient diversity to really explore the domain well, particularly with RLHF-tuned models, which *currently* suffer from collapse to lower-entropy distributions. A consistent complaint I've seen in papers on data generation is some mention of trying to maximise generated-data diversity, or that their method suffers from a lack of diversity under naive LM sampling techniques.
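For what it's worth, a cheap way to quantify that diversity complaint is distinct-n, the fraction of unique n-grams across a batch of generations (higher means more diverse); a rough sketch:

```python
from collections import Counter

def distinct_n(generations: list[str], n: int = 2) -> float:
    """Fraction of n-grams in the batch that are unique."""
    ngrams, total = Counter(), 0
    for text in generations:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i : i + n])] += 1
            total += 1
    return len(ngrams) / max(total, 1)
```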
As a result, my preferred direction is sampling from LLM outputs in ways that incentivise increased diversity (an exploration prior is another way of viewing it). In contrast to a naive generate -> finetune -> repeat loop (https://arxiv.org/abs/2207.14502), we can achieve potentially greater diversity and exploration of the search space by combining evolutionary algorithms with LLMs (https://arxiv.org/abs/2206.08896, https://arxiv.org/abs/2302.12170), and there is potential to combine this with RL (https://arxiv.org/abs/2302.06692) (and RLHF) in interesting ways to get around some of its weaknesses. I currently work with CarperAI, helping lead the open-endedness team on projects in this area: https://github.com/CarperAI/OpenELM
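As a rough illustration of the evolutionary approach, here's a minimal MAP-Elites-style loop loosely in the spirit of ELM (the first paper above); `llm_mutate`, `fitness`, and the behaviour descriptor are all hypothetical placeholders:

```python
import random

def llm_mutate(program: str) -> str:
    """Hypothetical: prompt the LLM to propose a modified version of `program`."""
    raise NotImplementedError

def fitness(program: str) -> float:
    """Hypothetical domain score, e.g. fraction of unit tests passed."""
    raise NotImplementedError

def behaviour(program: str) -> tuple:
    """Hypothetical behaviour descriptor defining niches."""
    return (len(program) // 100, program.count("def"))

def map_elites(seeds: list[str], iterations: int = 1000) -> dict:
    # The archive maps each behaviour niche to its best-so-far program.
    archive = {behaviour(s): s for s in seeds}
    for _ in range(iterations):
        parent = random.choice(list(archive.values()))
        child = llm_mutate(parent)
        niche = behaviour(child)
        if niche not in archive or fitness(child) > fitness(archive[niche]):
            archive[niche] = child
    return archive
```

The key design choice is the niche archive: selection pressure is applied within each behavioural niche rather than globally, which is what keeps the population diverse.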
In general, I think there's a huge amount of potential in intelligently-guided open-ended search here. The Internet Explorer paper is a good direction, but I think you can do well in many domains (e.g. code) without the internet, particularly by exploiting the ability to co-evolve the agent and the evaluation environment, as in https://arxiv.org/abs/1901.01753, such that the difficulty of the environment scales with the agent's explore -> finetune loop.
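A very rough sketch of that co-evolution loop, with a POET-style minimal criterion that admitted environments be neither trivial nor impossible for the current agent (all functions are hypothetical placeholders):

```python
import random

def mutate_env(env):
    """Hypothetical: return a perturbed copy of an environment/task spec."""
    raise NotImplementedError

def agent_score(agent, env) -> float:
    """Hypothetical: the agent's success rate on this environment, in [0, 1]."""
    raise NotImplementedError

def finetune(agent, envs):
    """Hypothetical: one explore -> finetune step on the current environments."""
    raise NotImplementedError

def coevolve(agent, envs, rounds=100, lo=0.2, hi=0.8):
    for _ in range(rounds):
        candidate = mutate_env(random.choice(envs))
        # Minimal criterion: keep environments of learnable difficulty,
        # so the curriculum scales with the agent's capability.
        if lo < agent_score(agent, candidate) < hi:
            envs.append(candidate)
        agent = finetune(agent, envs)
    return agent, envs
```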
We might also explore and generate data in the space of programs, as in this recent work: https://arxiv.org/abs/2302.14838. With language models as an intelligent variation operator, the search procedure can "stay on the manifold of functionality" and guide the search in novel and high-reward directions. In full generality, this is one option for a feedback loop to improve the capability of our most capable AI systems. See Minqi Jiang's work (https://arxiv.org/abs/2211.07819 and https://blog.minch.co/2022/11/15/software-squared.html) for some thoughts about that.
Regarding your final paragraph about searching the token tree, a recent promising option is speculative sampling (https://arxiv.org/abs/2302.01318, https://arxiv.org/abs/2211.17192; very funny that the DeepMind team was scooped by the Brain team here), which uses a small model to generate a draft sequence and then queries the larger model's logprobs for each token to decide whether to accept or reject it. This provides significant speedups and could be generalised towards the direction you suggest with a tree-search kind of setup.
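Since the mechanism is compact, here is a minimal numpy sketch of the accept/reject core described in those papers; obtaining the two [K, vocab] distributions from the models is assumed to happen elsewhere:

```python
import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens, rng):
    """target_probs, draft_probs: [K, vocab] next-token distributions from
    the large and small models at the K draft positions; draft_tokens: the
    K token ids the draft model proposed."""
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            continue  # accept draft token i and move on
        # Rejected: resample token i from the normalised residual max(0, p - q).
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        residual /= residual.sum()
        return list(draft_tokens[:i]) + [int(rng.choice(len(residual), p=residual))]
    return list(draft_tokens)  # all K drafts accepted
```

(The full algorithm also samples one bonus token from the target distribution when all K drafts are accepted, so each large-model call yields between 1 and K+1 tokens.)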
P.S. I enjoyed this post a lot and look forward to seeing more of your thoughts on this blog!
Thanks for the detailed response and the kind words! I think your comment would be well worth expanding into an article on your own Substack, this is great! :)
You've given me a lot of great papers to read, thanks for that! I'm looking forward to reading through them all.
Your point about data is interesting: your estimate is higher than a lot of the public estimates I've seen, but consistent with some of the private conversations I've had (e.g. I chatted with a founder of one of the big AI labs, and they were totally unconcerned about the availability of high-quality data).
I'm slightly skeptical about the quality of the Whisper-transcribed YouTube data, but I haven't played around with Whisper that much.
> One of the main issues with doing this naively is that your generations will often lack sufficient diversity to really explore the domain well
This makes a lot of sense to me. It comes up a lot with beam search, as you're probably aware; I think that's the benefit of best-of-n. I haven't been able to run any benchmarks, though. I'm planning to do some work exploring whether conditional routing networks can help with this, e.g. learning separate routed heads to prevent mode collapse.
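For reference, the best-of-n baseline I have in mind is just this (`llm_sample` and `score` are hypothetical placeholders; `score` might be a reward model or a unit-test pass rate):

```python
def llm_sample(prompt: str) -> str:
    """Hypothetical: one completion sampled at a nonzero temperature."""
    raise NotImplementedError

def score(completion: str) -> float:
    """Hypothetical scorer, e.g. a reward model."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 16) -> str:
    # Independent samples avoid beam search's tendency to return n
    # near-duplicates of the single most likely continuation.
    candidates = [llm_sample(prompt) for _ in range(n)]
    return max(candidates, key=score)
```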
> we can achieve potentially greater diversity and exploration of the search space by combining evolutionary algorithms with LLMs
I was just talking to Nathan Cooper about this and he said I should speak to you. I'll reach out :)
> The Internet Explorer paper is a good direction, but I think you can do well in many domains (e.g. code) without the internet
I think you're right. IE is cool but I don't think the hard part is the internet scraping.
> a recent promising option is Speculative Sampling
I love the speculative sampling work. I think something like that would work very well in the MuZero inner loop.